PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams
- When dataset freshness is critical, annotating high-speed unlabelled data streams becomes essential, yet it remains an open problem.
- We propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, like Twitter tweets or online product reviews.
Related Python packages are summarized in
- Flink v1.13
- Python 3.7
- Java 8
- Quick access to the datasets: https://course.fast.ai/datasets
- 1.6 million labeled Tweets:
- 280,000 training and 19,000 test samples in each polarity
- Source: Yelp Review Polarity
- 1,800,000 training and 200,000 testing samples in each polarity
- Source: Amazon Product Review Polarity
Quick try: run PLStream on the Yelp review dataset
cd PLStream
wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
tar zxvf yelp_review_polarity_csv.tgz
mv yelp_review_polarity_csv/train.csv train.csv
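Before starting the pipeline, it can help to check that `train.csv` was extracted correctly. The following is a minimal sketch, not part of PLStream itself. It assumes the file has no header row and that the first column holds the class index (`1` or `2`) with the review text in the second column; `count_polarities` is a hypothetical helper name.

```python
import csv
from collections import Counter
from pathlib import Path


def count_polarities(rows):
    """Count samples per class index in an iterable of CSV lines.

    Assumption: the first column of every row is the class index.
    """
    counts = Counter()
    for row in csv.reader(rows):
        if not row:
            continue  # ignore blank lines
        counts[row[0]] += 1
    return counts


# Only inspect the file if it is present in the working directory.
path = Path("train.csv")
if path.exists():
    with path.open(newline="", encoding="utf-8") as f:
        print(count_polarities(f))
```

If the download and extraction succeeded, the two class counts should be equal, matching the per-polarity sample counts listed above.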
1. Install the required environment for PLStream
- Make sure the environment requirements listed above are met.
pip install -r requirements.txt
2. Start Redis-Server in a terminal
3. Run PLStream
- Each output record has the form “original text” + “label” + “@@@@”.
- Splitting the output on the “@@@@” delimiter, for example with split("@@@@"), recovers the individual labelled records so the dataset can be reorganized.
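The splitting step described above can be sketched as follows. This is an illustration under stated assumptions, not PLStream's own post-processing code: the function name `parse_labelled_output` is hypothetical, and the sketch assumes the label is the last whitespace-separated token of each record, which may differ from the actual output layout.

```python
def parse_labelled_output(raw, delimiter="@@@@"):
    """Split concatenated output into (text, label) pairs.

    Assumption: within each record, the label is the final
    whitespace-separated token and everything before it is the text.
    """
    records = []
    for chunk in raw.split(delimiter):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after a trailing delimiter
        parts = chunk.rsplit(maxsplit=1)
        if len(parts) == 2:
            text, label = parts
        else:
            text, label = "", parts[0]  # record with a single token
        records.append((text, label))
    return records


# Made-up sample in the assumed layout, for illustration only.
sample = "great food and service 1@@@@terrible experience 0@@@@"
for text, label in parse_labelled_output(sample):
    print(label, text)
```

If the real output separates text and label differently, only the `rsplit` step needs to change; splitting on the delimiter stays the same.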
To see the labelling accuracy, simply run: