NYCU-110-2-Natural-Language-Processing

This project identifies emotions from a text corpus and was developed for a Kaggle competition.

The models in this project were built with Hugging Face Transformers and PyTorch.

In addition, please refer to the report link for a detailed report and description of the experimental results.

Reproducing Submission

Follow these steps to reproduce the submission without retraining.

  1. Requirement
  2. Repository Structure
  3. Ensemble

Hardware

In this project, the following machines were used to train the emotion classification models.

| | Operating System | CPU | GPU |
| --- | --- | --- | --- |
| Machine 1 | Ubuntu 20.04.3 LTS | Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz | NVIDIA GeForce GTX TITAN X |
| Machine 2 | Ubuntu 18.04.5 LTS | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | NVIDIA GeForce GTX 1080 |
| Machine 3 | Ubuntu 20.04.3 LTS | AMD Ryzen 5 5600X 6-Core Processor | NVIDIA GeForce RTX 2080 Ti |

Requirement

In this project, conda and pip were used to build the environment.

The following two options are provided for building the environment.

First Option

conda env create -f environment.yml

Second Option

conda create --name nlp python=3.8
conda activate nlp
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch
conda install matplotlib pandas scikit-learn -y
pip install tqdm
pip install transformers
pip install sentencepiece
pip install emoji
pip install nltk

Repository Structure

The ERNIE, RoBERTa and XLNet directories can be downloaded from the following link. Place each of them under the experiment directory, as shown in the tree below. https://drive.google.com/drive/folders/1RL8fe4Q6cFrMA9M2vysXNAHUE2C1wIm_?usp=sharing

├─ data
│  ├─ best_data
│     ├─ fixed_group_test.csv
│     ├─ fixed_group_train.csv
│     └─ fixed_group_valid.csv
│  ├─ 1(utterance+prompt)
│     ├─ fixed_group_test.csv
│     ├─ fixed_group_train.csv
│     └─ fixed_group_valid.csv
│  ├─ ...
│  ├─ ...
│  ├─ ...
│  └─ 1+2+3+4+5+6+7+8+9(utterance+prompt)
│     ├─ fixed_group_test.csv
│     ├─ fixed_group_train.csv
│     └─ fixed_group_valid.csv
├─ train.py
├─ test1.py
├─ test2.py
├─ ensemble1.py
├─ ensemble2.py
├─ parameters.yaml
├─ nlpdatasets.py
├─ total_model.py
├─ fixed_test.csv
├─ fixed_train.csv
├─ fixed_valid.csv
├─ ensemble2
│  └─ RoBERTa+ERNIE+XLNet
│     └─ submission.csv
├─ experiment
│  ├─ ERNIE
│     ├─ figure
│        ├─ accuracy.png
│        └─ loss.png 
│     ├─ best_model_state.bin
│     ├─ parameters.yaml
│     ├─ record.csv
│     ├─ submission1.csv
│     └─ submission2.csv
│  ├─ RoBERTa
│     ├─ figure
│        ├─ accuracy.png
│        └─ loss.png 
│     ├─ best_model_state.bin
│     ├─ parameters.yaml
│     ├─ record.csv
│     ├─ submission1.csv
│     └─ submission2.csv
│  └─ XLNet
│     ├─ figure
│        ├─ accuracy.png
│        └─ loss.png 
│     ├─ best_model_state.bin
│     ├─ parameters.yaml
│     ├─ record.csv
│     ├─ submission1.csv
│     └─ submission2.csv
└─ README.md

Procedure

Data preprocess

In this project, we tried the following data preprocessing methods and found that the fifth method works best. After preprocessing, the sentences are split into tokens.

| NO. | Method | Before | After |
| --- | --- | --- | --- |
| 1 | replace _comma_ with , | There was a lot of people_comma_ but it only felt like us in the world. | There was a lot of people, but it only felt like us in the world. |
| 2 | replace / with or | /Did you get him a teacher? | or Did you get him a teacher? |
| 3 | replace & with and | I believe it’s because you miss family & friends | I believe it’s because you miss family and friends |
| 4 | remove emoji | I just got new neighbors and they are so loud.,I know there probably isnt much you can do. :/ | I just got new neighbors and they are so loud.,I know there probably isnt much you can do. |
| 5 | restore he’s to he is | Yeah_comma_ fortunately he’s very small so he doesn’t have as many joint problems as the bigger dogs I thnik at least. | Yeah_comma_ fortunately he is very small so he does not have as many joint problems as the bigger dogs I thnik at least. |
| 6 | remove punctuation | The love towards my wife is become more and it tends to uncountable now! | The love towards my wife is become more and it tends to uncountable now |
| 7 | replace integer with #number | It’s really sleek and fun to drive,I got the new 2018 Honda Accord LX. | It’s really sleek and fun to drive,I got the new #number Honda Accord LX. |
| 8 | remove stopword | Was invited to a friends house after work. | invited to friends house after work. |
| 9 | lemmatize | football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. | football be a family of team sport that involve, to vary degree, kick a ball to score a goal. |
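
A few of the string-level methods in the table can be sketched in plain Python. This is only a minimal illustration, not the project's actual preprocessing code: the function names and the small `CONTRACTIONS` map are hypothetical, and a real pipeline would need a much fuller contraction list.

```python
import re

# Hypothetical, deliberately tiny contraction map (assumption for illustration).
CONTRACTIONS = {
    "he's": "he is",
    "doesn't": "does not",
    "isn't": "is not",
    "it's": "it is",
}

# Match whole words only, so "she's" is not treated as "he's".
_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def replace_comma_token(text: str) -> str:
    """Method 1: turn the dataset's `_comma_` placeholder into a real comma."""
    return text.replace("_comma_", ",")


def replace_ampersand(text: str) -> str:
    """Method 3: spell out `&` as `and`."""
    return re.sub(r"\s*&\s*", " and ", text)


def expand_contractions(text: str) -> str:
    """Method 5: restore lowercase contractions such as he's -> he is."""
    text = text.replace("’", "'")  # normalise curly apostrophes first
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def replace_integers(text: str) -> str:
    """Method 7: replace standalone integers with `#number`."""
    return re.sub(r"\b\d+\b", "#number", text)


comma_fixed = replace_comma_token("There was a lot of people_comma_ but it only felt like us in the world.")
ampersand_fixed = replace_ampersand("you miss family & friends")
expanded = expand_contractions("fortunately he’s very small so he doesn’t have")
numbered = replace_integers("I got the new 2018 Honda Accord LX.")
```

Each function handles one row of the table, so the methods can be combined in any order when building the numbered data directories.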

Model Architecture

In this project, we used three pretrained models for transfer learning: RoBERTa-Large[1], ERNIE 2.0-Large[2] and XLNet[3].

The framework of the three classification models is shown below (for demonstration only): a linear layer is added on top of each pretrained model for transfer learning.

RoBERTa-Large

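import torch.nn as nn
from transformers import RobertaModel

class RoBERTa(nn.Module):
    """RoBERTa-Large encoder followed by a linear classification head."""

    def __init__(self, n_classes):
        super().__init__()
        # The dropout settings are passed through to the pretrained model's config.
        self.model = RobertaModel.from_pretrained("roberta-large", hidden_dropout_prob=0.2, attention_probs_dropout_prob=0.2)
        # Linear head that maps the pooled representation to class logits.
        self.out = nn.Linear(self.model.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # With return_dict=False the model returns (last_hidden_state, pooled_output).
        last_hidden_state, pooled_output = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False
        )
        return self.out(pooled_output)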

ERNIE 2.0-Large

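import torch.nn as nn
from transformers import AutoModel

class ERNIE(nn.Module):
    """ERNIE 2.0-Large encoder followed by a linear classification head."""

    def __init__(self, n_classes):
        super().__init__()
        # The dropout settings are passed through to the pretrained model's config.
        self.model = AutoModel.from_pretrained("nghuyong/ernie-2.0-large-en", hidden_dropout_prob=0.2, attention_probs_dropout_prob=0.2)
        # Linear head that maps the pooled representation to class logits.
        self.out = nn.Linear(self.model.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # With return_dict=False the model returns (last_hidden_state, pooled_output).
        last_hidden_state, pooled_output = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False
        )
        return self.out(pooled_output)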

XLNet

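import torch.nn as nn
from transformers import XLNetModel

class XLNet(nn.Module):
    """XLNet-Large encoder followed by a linear classification head."""

    def __init__(self, n_classes):
        super().__init__()
        self.model = XLNetModel.from_pretrained("xlnet-large-cased")
        # Dropout layer (not applied in this demonstration forward pass).
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.model.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # XLNetModel has no pooler; outputs[0] is the last hidden state.
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False
        )
        # Classify from the hidden state of the last token in each sequence.
        return self.out(outputs[0][:, -1, :])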

Training

To train a model, follow these steps.

1. Set the arguments in the parameters.yaml file.

You can set your preferred training arguments in the parameters.yaml file.

model: roberta-large # bert-base, ernie-base, roberta-base, xlnet-base, bert-large, ernie-large, roberta-large, xlnet-large, YOSO
USING_DATA: best_data # 1(utterance+prompt), 2(utterance+prompt)...
EPOCHS: 20 
BATCH_SIZE: 8 # 1 2 4 8(large model) 32(recommend) 64(recommend) 
LR: 2e-6 # 2e-3 2e-5(recommend) 2e-6(large model)
MAX_LEN: 160 # 100 128 160 256 512
FREEZE: [] #[], [embeddings], [encoder], [pooler], [embeddings, encoder], [encoder, pooler], [embeddings, encoder, pooler]...
DROPOUT_RATE: None #None or values
HIDDEN_DROPOUT_PROB: 0.2 #None or values
ATTENTION_PROBS_DROPOUT_PROB: 0.2 #None or values
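
Depending on the YAML loader, some of these values may not arrive with the types one might expect. PyYAML, for example, reads `2e-6` as a string because the mantissa has no decimal point, and `None` is not a YAML null, so it is read as the string `"None"`. How train.py handles this is not shown here; the sketch below is only one possible way to normalize an already loaded dictionary, and the helper names `normalize_parameters` and `_optional_float` are hypothetical.

```python
from typing import Any, Dict, Optional


def _optional_float(value: Any) -> Optional[float]:
    """Map None or the literal text 'None' to None, otherwise convert to float."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in {"none", "null", ""}:
        return None
    return float(value)


def normalize_parameters(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the loaded parameters with explicit Python types."""
    params = dict(raw)
    params["EPOCHS"] = int(raw["EPOCHS"])
    params["BATCH_SIZE"] = int(raw["BATCH_SIZE"])
    params["MAX_LEN"] = int(raw["MAX_LEN"])
    params["LR"] = float(raw["LR"])  # "2e-6" may arrive as a string
    for key in ("DROPOUT_RATE", "HIDDEN_DROPOUT_PROB", "ATTENTION_PROBS_DROPOUT_PROB"):
        params[key] = _optional_float(raw.get(key))
    params["FREEZE"] = list(raw.get("FREEZE") or [])
    return params


# A dictionary shaped like what a loader might return for the file above.
raw = {
    "model": "roberta-large",
    "USING_DATA": "best_data",
    "EPOCHS": 20,
    "BATCH_SIZE": 8,
    "LR": "2e-6",
    "MAX_LEN": 160,
    "FREEZE": [],
    "DROPOUT_RATE": "None",
    "HIDDEN_DROPOUT_PROB": 0.2,
    "ATTENTION_PROBS_DROPOUT_PROB": 0.2,
}
params = normalize_parameters(raw)
```

Normalizing once, right after loading, keeps the rest of the training code free of string checks.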

2. Run the command

train.py does not take any command-line arguments.

python train.py

3. Location of the experiment results

├─ data
├─ ...
├─ ...
├─ experiment
│  ├─ exp1                <- The experiment results are saved here. (The directory is named exp1, exp2, exp3, etc.; the newest run has the highest number.)
│     ├─ figure
│        ├─ accuracy.png
│        └─ loss.png 
│     ├─ best_model_state.bin
│     ├─ parameters.yaml
│     ├─ confusion_matrix.csv
│     └─ record.csv
│  ├─ ERNIE
│  ├─ RoBERTa
│  └─ XLNet
└─ README.md
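
The numbering scheme for the result directories can be pictured with the following sketch. It assumes that a new run simply takes the highest existing `expN` number plus one; the helper name `next_experiment_dir` is hypothetical and train.py may implement this differently.

```python
import re
import tempfile
from pathlib import Path


def next_experiment_dir(root: Path) -> Path:
    """Return root/expN, where N is one more than the highest existing exp number."""
    pattern = re.compile(r"^exp(\d+)$")
    highest = 0
    for entry in root.iterdir():
        match = pattern.match(entry.name)
        # Only directories named exp<number> count; ERNIE, RoBERTa, etc. are ignored.
        if entry.is_dir() and match:
            highest = max(highest, int(match.group(1)))
    return root / f"exp{highest + 1}"


# Demonstration in a temporary directory standing in for experiment/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("exp1", "exp3", "RoBERTa"):
        (root / name).mkdir()
    next_name = next_experiment_dir(root).name
```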

Testing

To generate a submission, follow these steps.

1. Run the command

The specified directory must contain the parameters.yaml and best_model_state.bin files.

The argument after --directory is the name of a directory under experiment. The argument after --weight is the weight by which the predicted probabilities are multiplied.

We recommend a weight of 0.4 for RoBERTa, 0.35 for ERNIE, and 0.25 for XLNet.

python test2.py --directory RoBERTa --weight 0.4
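
The role of the weight can be illustrated as follows. This is a minimal sketch that assumes the predicted probabilities come from a softmax over the model's logits and are then scaled by the weight; the function names are hypothetical and the exact computation in test2.py may differ.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def weighted_probabilities(logits: List[float], weight: float) -> List[float]:
    """Scale each class probability by the model's ensemble weight."""
    return [weight * prob for prob in softmax(logits)]


# One sample with three classes, using the recommended RoBERTa weight.
example = weighted_probabilities([2.0, 1.0, 0.1], weight=0.4)
```

Because the probabilities of one sample sum to 1, the weighted values of that sample sum to the weight itself, here 0.4.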

2. Location of the testing result

The submission2.csv file is stored under experiment/specified_directory.

├─ data
├─ ...
├─ experiment
│  ├─ ERNIE
│  ├─ RoBERTa
│     ├─ figure
│        ├─ accuracy.png
│        └─ loss.png 
│     ├─ best_model_state.bin     <- You should have this file under this directory.
│     ├─ parameters.yaml          <- You should have this file under this directory.
│     ├─ record.csv
│     └─ submission2.csv          <- You can find the testing result in this position. 
│  └─ XLNet
└─ README.md

Ensemble

To combine different submissions, follow the steps below.

1. Run the command

The arguments after ensemble2.py are the names of directories under experiment. Each of these directories must contain a submission2.csv file.

python ensemble2.py RoBERTa ERNIE XLNet

2. Location of the ensemble result

The ensemble result, submission.csv, is stored under ensemble2/RoBERTa+ERNIE+XLNet.

├─ data
├─ ...
├─ experiment
│  ├─ ERNIE
│     ├─ ...
│     ├─ ...
│     └─ submission2.csv       <- You should have this file under this directory.
│  ├─ RoBERTa
│     ├─ ...
│     ├─ ...
│     └─ submission2.csv       <- You should have this file under this directory.
│  └─ XLNet
│     ├─ ...
│     ├─ ...
│     └─ submission2.csv       <- You should have this file under this directory.
├─ ensemble2                   <- The program will create ensemble2 directory.
│  └─ RoBERTa+ERNIE+XLNet      <- The program will create RoBERTa+ERNIE+XLNet directory.
│     └─ submission.csv        <- You can find the ensemble result in this position. 
└─ README.md
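
The ensemble step can be pictured with the sketch below. It assumes that each submission2.csv holds per-class probabilities already multiplied by that model's weight, and that the ensemble sums them and picks the class with the highest total; the function name `ensemble` and the toy numbers are hypothetical, and ensemble2.py may work differently.

```python
from typing import Dict, List


def ensemble(weighted: Dict[str, List[List[float]]]) -> List[int]:
    """Sum the weighted class probabilities of all models and take the arg-max per sample."""
    models = list(weighted.values())
    n_samples = len(models[0])
    predictions = []
    for i in range(n_samples):
        # Add up the scores of each class across the models for sample i.
        summed = [sum(scores) for scores in zip(*(model[i] for model in models))]
        predictions.append(max(range(len(summed)), key=summed.__getitem__))
    return predictions


# Two samples, two classes; weights 0.4, 0.35 and 0.25 are already applied.
weighted = {
    "RoBERTa": [[0.24, 0.16], [0.36, 0.04]],
    "ERNIE": [[0.105, 0.245], [0.21, 0.14]],
    "XLNet": [[0.05, 0.2], [0.075, 0.175]],
}
predictions = ensemble(weighted)
```

In the first sample RoBERTa alone would choose class 0, but the summed scores favour class 1, which shows how the other two models can overrule the strongest one.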

Experiment Result

In this project, we ran the following experiments to verify the choices behind our method. Each accuracy reported below is the average of multiple accuracy values.

1. Data Column

We found that combining Utterance and Prompt works best for all of the models used in this project.

| | Utterance | Prompt | Utterance+Prompt |
| --- | --- | --- | --- |
| Accuracy | 0.6126 | 0.6137 | 0.6649 |

2. Data Preprocess Method

We found that method NO. 5 works best for RoBERTa and ERNIE, while the combination of NO. 3 and NO. 5 works best for XLNet. Only methods NO. 3 and NO. 5 improve on applying no preprocessing at all.

| | NO. 1 | NO. 2 | NO. 3 | NO. 4 | NO. 5 | NO. 6 | NO. 7 | NO. 8 | NO. 9 | NO. 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 0.6606 | 0.6595 | 0.6635 | 0.6592 | 0.6649 | 0.6542 | 0.6552 | 0.6238 | 0.6527 | 0.6624 |

3. Maximum number of tokens in one sentence

We found that a maximum length of 160 suits RoBERTa and ERNIE, while 256 suits XLNet better.

| | 100 | 128 | 160 | 256 | 512 |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 0.6439 | 0.6458 | 0.6477 | 0.65308 | 0.6480 |
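
The effect of the maximum length on a single tokenized sentence can be sketched as below. In practice the Hugging Face tokenizers do this internally, including keeping special tokens when truncating, which this sketch ignores. The helper name `pad_or_truncate`, the pad id 0 and the `pad_left` switch are assumptions for illustration; XLNet's tokenizer, for example, pads on the left by default.

```python
from typing import List, Tuple


def pad_or_truncate(
    token_ids: List[int], max_len: int, pad_id: int = 0, pad_left: bool = False
) -> Tuple[List[int], List[int]]:
    """Cut or pad a token id list to max_len and build the matching attention mask."""
    ids = token_ids[:max_len]  # longer sentences lose their tail
    mask = [1] * len(ids)      # 1 marks real tokens
    n_pad = max_len - len(ids)
    padding, pad_mask = [pad_id] * n_pad, [0] * n_pad  # 0 marks padding
    if pad_left:
        return padding + ids, pad_mask + mask
    return ids + padding, mask + pad_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_len=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)
left_ids, left_mask = pad_or_truncate([5, 6, 7], max_len=5, pad_left=True)
```

A larger maximum length truncates fewer sentences but makes every batch more expensive, which is one way to read the trade-off in the table.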

4. Pretrained Model

We found that the large models outperform the base models; the top three models in this project are RoBERTa-large, ERNIE-2.0-large and XLNet-large.

| | BERT-BASE | RoBERTa-BASE | ERNIE-BASE | XLNet-BASE | BERT-LARGE | RoBERTa-LARGE | ERNIE-LARGE | XLNet-LARGE | YOSO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 0.5227 | 0.6018 | 0.6181 | 0.6025 | 0.6166 | 0.6649 | 0.6397 | 0.6379 | 0.5567 |

5. Dropout

We found that dropout helps ERNIE and XLNet but not RoBERTa, while hidden dropout and attention dropout help RoBERTa and ERNIE.

| | None | dropout | hidden_dropout_prob | attention_probs_dropout_prob | hidden_dropout_prob and attention_probs_dropout_prob |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 0.6445 | 0.6477 | 0.6471 | 0.6451 | 0.6524 |

6. Ensemble

In this section, we apply the ensemble method to the top three models.

| | RoBERTa-LARGE | ERNIE-LARGE | XLNet-LARGE | Ensemble |
| --- | --- | --- | --- | --- |
| F1-score | 0.64239 | 0.62535 | 0.61961 | 0.65633 |

Reference

[1] Y. Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv, arXiv:1907.11692, Jul. 2019. doi: 10.48550/arXiv.1907.11692.

[2] Y. Sun et al., “ERNIE 2.0: A Continual Pre-training Framework for Language Understanding,” arXiv, arXiv:1907.12412, Nov. 2019. doi: 10.48550/arXiv.1907.12412.

[3] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized Autoregressive Pretraining for Language Understanding,” arXiv, arXiv:1906.08237, Jan. 2020. doi: 10.48550/arXiv.1906.08237.
