Tweets-Classification-with-BERT

Main Objectives and Related Research Questions

Tweet text classification with BERT, XGBoost, and Random Forest.
Text categories: Hate, Offensive, Profanity, or None.

Research Questions:

  1. How do attention-based models perform when fine-tuned and tested on different dataset combinations from different time periods?
  2. Is the models’ performance independent of the language used?
  3. Is BERT a better solution than traditional machine learning models for text classification? (A baseline sketch of the traditional side follows below.)
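
As a concrete reading of question 3, here is a minimal sketch of a traditional baseline: a TF-IDF + Random Forest pipeline (swap in `XGBClassifier` for the XGBoost variant). The file name and the `text`/`label` columns are assumptions for illustration, not the repository's actual schema.

```python
# Hedged baseline sketch for research question 3: TF-IDF features + Random Forest.
# "hasoc_2019.csv" and the "text"/"label" columns are assumed names.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("hasoc_2019.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(n_estimators=300, random_state=42)),
])
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))
```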

Datasets

HASOC 2019
HASOC 2020
Link to data sources: HASOC DATASETS MAIN SOURCE

BERT MODELS CONSIDERED

  1. bert-base-multilingual-uncased
  2. bert-base-multilingual-cased
  3. bert-base-german-cased
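
A minimal sketch of pulling these three checkpoints from the Hugging Face Hub for sequence classification; `num_labels=4` matches the Hate / Offensive / Profanity / None label set, and the loop itself is illustrative rather than taken from the repository.

```python
# Load each candidate checkpoint with a fresh 4-way classification head.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = [
    "bert-base-multilingual-uncased",
    "bert-base-multilingual-cased",
    "bert-base-german-cased",
]

for name in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=4)
    # The classification head is randomly initialized; fine-tuning on HASOC follows.
    print(f"{name}: {model.num_parameters():,} parameters")
```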

Classification Pipeline

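As a code-level view of the pipeline, here is a hedged sketch of one fine-tuning run with the Hugging Face Trainer API; the file name, column names, and hyperparameters are assumptions, not the repository's exact configuration.

```python
# Sketch of one fine-tuning run: load data -> tokenize -> split -> train -> save.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=4)

# "hasoc_2019_train.csv" with "text"/"label" columns is an assumed layout.
ds = Dataset.from_pandas(pd.read_csv("hasoc_2019_train.csv")[["text", "label"]])

def tokenize(batch):
    # Tweets are short, so max_length=128 is plenty.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True).train_test_split(test_size=0.1, seed=42)

args = TrainingArguments(output_dir="bert-hasoc", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
trainer.save_model("bert-hasoc")        # save fine-tuned weights for evaluation
tokenizer.save_pretrained("bert-hasoc")
```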

RESULTS – DE

All scores in the two tables below are percentages.

| Training | Testing | bert-base-german-cased | bert-base-multilingual-cased | bert-base-multilingual-uncased |
| --- | --- | --- | --- | --- |
| HASOC 2019 | HASOC 2019 | 83.7 | 84 | 84 |
| HASOC 2020 | HASOC 2020 | 79 | 80.16 | 76.53 |
| HASOC 2019 | HASOC 2020 | 73.63 | 71.52 | 71.52 |
| HASOC 2020 | HASOC 2019 | 83.05 | 84 | 81.17 |
| HASOC 2019 + HASOC 2019 | HASOC 2019 | 96.63 | 84.85 | 86.67 |
| HASOC 2019 + HASOC 2020 | HASOC 2019 | 80.44 | 78.77 | 79.6 |
| HASOC 2019 + HASOC 2020 | HASOC 2019 + HASOC 2020 | 91 | 82 | 84.31 |

RESULTS – EN

| Training | Testing | bert-base-multilingual-uncased |
| --- | --- | --- |
| HASOC 2019 | HASOC 2019 | 73.8 |
| HASOC 2020 | HASOC 2020 | 81.53 |
| HASOC 2019 | HASOC 2020 | 76.39 |
| HASOC 2020 | HASOC 2019 | 75.54 |
| HASOC 2019 + HASOC 2019 | HASOC 2019 | 79 |
| HASOC 2019 + HASOC 2020 | HASOC 2019 | 81.1 |
| HASOC 2019 + HASOC 2020 | HASOC 2019 + HASOC 2020 | 79.85 |
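
The cross-year rows above (e.g. train on HASOC 2019, test on HASOC 2020) correspond to evaluating a fine-tuned model on the other year's test set. A hedged sketch of that evaluation step, with the checkpoint directory, file name, and column names as assumptions carried over from the fine-tuning sketch:

```python
# Score a fine-tuned checkpoint on the other year's test set.
import pandas as pd
import torch
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "bert-hasoc" is the directory saved by the fine-tuning sketch above.
model = AutoModelForSequenceClassification.from_pretrained("bert-hasoc")
tokenizer = AutoTokenizer.from_pretrained("bert-hasoc")
model.eval()

test = pd.read_csv("hasoc_2020_test.csv")  # assumed file name
enc = tokenizer(list(test["text"]), truncation=True, padding=True,
                max_length=128, return_tensors="pt")
with torch.no_grad():
    preds = model(**enc).logits.argmax(dim=-1).numpy()

print("accuracy:", accuracy_score(test["label"], preds))
print("macro F1:", f1_score(test["label"], preds, average="macro"))
```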

GitHub

https://github.com/Sayed-Code/Tweets-Classification-with-BERT