Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

This is the Pytorch implementation for sparse progressive distillation (SPD). For more details about the motivation, techniques and experimental results, refer to our paper here.

Running

  • Environment Preparation (using python3)

    pip install -r requirements.txt
  • Dataset Preparation

    The original GLUE dataset could be downloaded here.

BERT_base fine-tuning on GLUE

We use finetuned BERT_base as the teacher. For each task of GLUE benchmark, we obtain the finetuned model using the original huggingface transformers code with the following script.

python run_glue.py \
          --model_name_or_path $INT_DIR \
          --task_name $TASK_NAME \
          --do_train \
          --do_eval \
          --data_dir $GLUE_DIR/$TASK_NAME/ \
          --max_seq_length 128 \
          --per_gpu_train_batch_size 32 \
          --per_gpu_eval_batch_size 32 \
          --learning_rate 3e-5 \
          --num_train_epochs 4.0 \
          --output_dir $OUT_DIR \
          --evaluate_during_training \
          --overwrite_output_dir \
          --logging_steps 400 \
          --logging_dir $OUT_DIR \
          --save_steps 10000

Sparse Progressive Distillation

We use run_glue.py to run the sparse progressive distillation. –num_prune_epochs is the epochs for pruning. –num_train_epochs is the total number of epochs (pruning, progressive distillation, finetuning).

python run_glue.py \
  --model_name_or_path PATH_TO_FINETUNED_MODEL \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 6.4e-4 \
  --save_steps 50 \
  --num_prune_epochs 30 \
  --num_train_epochs 60 \
  --sparsity 0.9 \
  --output_dir $OUT_DIR \
  --evaluate_during_training \
  --replacing_rate 0.8 \
  --overwrite_output_dir \
  --steps_for_replacing 0 \
  --scheduler_type linear

To Dos

  • Provide our teacher model for each task.

  • Provide best performed model checkpoint for each task.

GitHub

View Github