Awesome Efficient PLM Papers

Must-read papers on improving efficiency for pre-trained language models.

The paper list is mainly mantained by Lei Li and Shuhuai Ren.

Knowledge Distillation

  1. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter NeurIPS workshop

    Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf [pdf] [project]

  2. Patient Knowledge Distillation for BERT Model Compression EMNLP 2019

    Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu [pdf] [project]

  3. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models Preprint

    Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova [pdf] [project]

  4. TinyBERT: Distilling BERT for Natural Language Understanding Findings of EMNLP 2020

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu [pdf] [project]

  5. BERT-of-Theseus: Compressing BERT by Progressive Module Replacing EMNLP 2020

    Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou [pdf] [project]

  6. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers NeurIPS 2020

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou [pdf] [project]

  7. BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover’s Distance EMNLP 2020

    Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin [pdf] [project]

  8. MixKD: Towards Efficient Distillation of Large-scale Language Models ICLR 2021

    Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, Lawrence Carin [pdf]

  9. Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains ACL-IJCNLP 2021

    Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li, Jun Huang [pdf]

  10. MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation ACL-IJCNLP 2021

    Ahmad Rashid, Vasileios Lioutas, Mehdi Rezagholizadeh [pdf]

  11. Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor ACL-IJCNLP 2021

    Xinyu Wang, Yong Jiang, Zhaohui Yan, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, Kewei Tu [pdf] [project]

  12. Weight Distillation: Transferring the Knowledge in Neural Network Parameters ACL-IJCNLP 2021

    Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu [pdf]

  13. Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation ACL-IJCNLP 2021

    Yuanxin Liu, Fandong Meng, Zheng Lin, Weiping Wang, Jie Zhou [pdf]

  14. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers Findings of ACL-IJCNLP 2021

    Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, Furu Wei [pdf] [project]

  15. One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers Findings of ACL-IJCNLP 2021

    Chuhan Wu, Fangzhao Wu, Yongfeng Huang [pdf]

Dynamic Early Exiting

  1. DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference ACL 2020

    Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin [pdf] [project]

  2. FastBERT: a Self-distilling BERT with Adaptive Inference Time ACL 2020

    Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, Qi Ju [pdf] [project]

  3. The Right Tool for the Job: Matching Model and Instance Complexities ACL 2020

    Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, Noah A. Smith [pdf] [project]

  4. A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models NAACL 2021

    Kaiyuan Liao, Yi Zhang, Xuancheng Ren, Qi Su, Xu Sun, Bin He [pdf] [project]

  5. CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade Preprint

    Lei Li, Yankai Lin, Deli Chen, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun [pdf] [project]

  6. Early Exiting BERT for Efficient Document Ranking SustaiNLP 2020

    Ji Xin, Rodrigo Nogueira, Yaoliang Yu, and Jimmy Lin [pdf] [project]

  7. BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression EACL 2021

    Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin [pdf] [project]

  8. Accelerating BERT Inference for Sequence Labeling via Early-Exit ACL 2021

    Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang [pdf] [project]

  9. BERT Loses Patience: Fast and Robust Inference with Early Exit NeurIPS 2020

    Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei [pdf] [project]

  10. Early Exiting with Ensemble Internal Classifiers Preprint

    Tianxiang Sun, Yunhua Zhou, Xiangyang Liu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu [pdf]

Quantization

  1. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT AAAI 2020

    Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer [pdf] [project]

  2. TernaryBERT: Distillation-aware Ultra-low Bit BERT EMNLP 2020

    Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang, Qun Liu [pdf] [project]

  3. Q8BERT: Quantized 8Bit BERT NeurIPS 2019 Workshop

    Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat [pdf] [project]

  4. BinaryBERT: Pushing the Limit of BERT Quantization EMNLP 2020

    Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jing Jin, Xin Jiang, Qun Liu, Michael Lyu, Irwin King [pdf] [project]

  5. I-BERT: Integer-only BERT Quantization ICML 2021

    Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer [pdf] [project]

Pruning

  1. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned ACL 2019

    Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov [pdf] [project]

  2. Are Sixteen Heads Really Better than One? NeurIPS 2019

    Paul Michel, Omer Levy, Graham Neubig [pdf] [project]

  3. The Lottery Ticket Hypothesis for Pre-trained BERT Networks NeurIPS 2020

    Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, Michael Carbin [pdf] [project]

  4. Movement Pruning: Adaptive Sparsity by Fine-Tuning NeurIPS 2020

    Victor Sanh, Thomas Wolf, Alexander M. Rush [pdf] [project]

  5. Reducing Transformer Depth on Demand with Structured Dropout Preprint

    Angela Fan, Edouard Grave, Armand Joulin [pdf]

  6. When BERT Plays the Lottery, All Tickets Are Winning EMNLP 2020

    Sai Prasanna, Anna Rogers, Anna Rumshisky [pdf] [project]

  7. Structured Pruning of a BERT-based Question Answering Model Preprint

    J.S. McCarley, Rishav Chakravarti, Avirup Sil [pdf]

  8. Structured Pruning of Large Language Models EMNLP 2020

    Ziheng Wang, Jeremy Wohlwend, Tao Lei [pdf] [project]

  9. Rethinking Network Pruning — under the Pre-train and Fine-tune Paradigm NAACL 2021

    Dongkuan Xu, Ian E.H. Yen, Jinxi Zhao, Zhibin Xiao [pdf]

  10. Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization ACL 2021

    Chen Liang, Simiao Zuo, Minshuo Chen, Haoming Jiang, Xiaodong Liu, Pengcheng He, Tuo Zhao, Weizhu Chen [pdf] []

Contribution

If you find any related work not included in the list, do not hesitate to raise a PR to help us complete the list.

GitHub

GitHub - TobiasLee/Awesome-Efficient-PLM: Must-read papers on improving efficiency for pre-trained language models.
Must-read papers on improving efficiency for pre-trained language models. - GitHub - TobiasLee/Awesome-Efficient-PLM: Must-read papers on improving efficiency for pre-trained language models.