DistilBERT-Text-mining-authorship-attribution

Dataset used: https://www.kaggle.com/azimulh/tweets-data-for-authorship-attribution-modelling/version/2 DistilBERT: https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation

dataset – Contains useful functions relating to the datasets.

feature_extraction_selection – Plots all models using best dataset and parameters. Used to compare feature extraction methods. (Code inspired by: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb)

dataset_selection – Prints out the average accuracy for every dataset.

ds_exploration – Used to print the shapes of each dataset and plot the class distributions.

baseline_process – Run gridsearch cv on all ml models (excluding BERT variants)

bert – Run gridsearch cv on all ml models (BERT variants) (Code inspired by: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb)

ml – Contains useful functions relating to the machine learning.

Project done for the TDDE16 – Text mining course at Linköpings university.

GitHub

View Github