Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter

Code and checkpoints for the ACL 2021 paper "Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter"

arXiv link to the paper: https://arxiv.org/abs/2105.07148

If you have any questions, please contact: [email protected]

Requirements

  • Python 3.7.0
  • transformers 3.4.0
  • Numpy 1.18.5
  • Packaging 17.1
  • scikit-learn 0.23.2
  • torch 1.6.0+cu92
  • tqdm 4.50.2
  • multiprocess 0.70.10
  • tensorflow 2.3.1
  • tensorboardX 2.1
  • seqeval 1.2.1

Input Format

CoNLL format (the BIOES tag scheme is preferred), with one character and its label per line. Sentences are separated by a blank line.

美   B-LOC  
国   E-LOC  
的   O  
华   B-PER  
莱   I-PER  
士   E-PER  

我   O  
跟   O  
他   O  
谈   O  
笑   O  
风   O  
生   O   
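To make the format concrete, here is a minimal sketch of reading it and recovering the labeled spans. It is not taken from this repository; the function names `parse_conll` and `bioes_spans` are hypothetical, and it assumes each non-blank line holds a character followed by its tag, separated by whitespace.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (character, tag) pairs


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (character, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((parts[0], parts[-1]))
    if current:  # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


def bioes_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Return (surface string, type) for each well-formed B..E or S span."""
    spans, buffer, span_type = [], [], None
    for char, tag in sentence:
        prefix, _, label = tag.partition("-")
        if prefix == "S":  # single-character span
            spans.append((char, label))
            buffer, span_type = [], None
        elif prefix == "B":  # open a new span
            buffer, span_type = [char], label
        elif prefix in ("I", "E") and buffer and label == span_type:
            buffer.append(char)
            if prefix == "E":  # close the span
                spans.append(("".join(buffer), label))
                buffer, span_type = [], None
        else:  # "O" or an inconsistent tag discards any open span
            buffer, span_type = [], None
    return spans


sample = "美 B-LOC\n国 E-LOC\n的 O\n华 B-PER\n莱 I-PER\n士 E-PER\n\n我 O\n跟 O\n他 O\n"
for sent in parse_conll(sample):
    print(bioes_spans(sent))
```

Inconsistent tag sequences (for example an `I-` tag with no preceding `B-`) are silently dropped here; a stricter reader might raise an error instead.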

Chinese BERT, Chinese Word Embedding, and Checkpoints

Chinese BERT

Chinese BERT: https://huggingface.co/bert-base-chinese/tree/main

Chinese Word Embedding

Word Embedding: https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz
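Once the archive is unpacked, the embedding table can be read with plain Python. The sketch below is not this repository's loader: `load_word_vectors` is a hypothetical helper, and it assumes the file uses the word2vec text layout (a header line with the word count and dimension, then one word and its vector per line, separated by spaces). Check the unpacked file before relying on that assumption.

```python
from typing import Dict, List


def load_word_vectors(path: str, max_words: int = 100000) -> Dict[str, List[float]]:
    """Read up to max_words vectors from a word2vec-style text file."""
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().split()
        dim = int(header[1])  # assumed header: "<word count> <dimension>"
        for line in handle:
            if len(vectors) >= max_words:
                break
            parts = line.rstrip().split(" ")
            if len(parts) <= dim:  # skip malformed or empty lines
                continue
            # The last `dim` fields are the vector; anything before is the word,
            # which lets a word containing spaces survive intact.
            word = " ".join(parts[:-dim])
            vectors[word] = [float(value) for value in parts[-dim:]]
    return vectors
```

The `max_words` cap is there because the full table is large; reading only the first entries is a convenience for quick experiments, not something the paper prescribes.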

Checkpoints and Shells

Directory Structure of data

  • berts
    • bert
      • config.json
      • vocab.txt
      • pytorch_model.bin
  • dataset, you can download from here
    • NER
      • weibo
      • note4
      • msra
      • resume
    • POS
      • ctb5
      • ctb6
      • ud1
      • ud2
    • CWS
      • ctb6
      • msr
      • pku
  • vocab
    • tencent_vocab.txt, the vocabulary of the pre-trained word embedding table; download from here.
  • embedding
    • word_embedding.txt
  • result
    • NER
      • weibo
      • note4
      • msra
      • resume
    • POS
      • ctb5
      • ctb6
      • ud1
      • ud2
    • CWS
      • ctb6
      • msr
      • pku
  • log
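If it helps, the empty folder skeleton above can be created in one go. This is only a convenience sketch: `make_layout` is a hypothetical helper, and the name of the root folder you pass in (for example `data`) is an assumption. The files themselves (BERT weights, datasets, vocabulary, embeddings) still have to be downloaded and placed by hand.

```python
from pathlib import Path
from typing import List

# Task and dataset folder names as listed in the directory structure above.
TASK_DATASETS = {
    "NER": ["weibo", "note4", "msra", "resume"],
    "POS": ["ctb5", "ctb6", "ud1", "ud2"],
    "CWS": ["ctb6", "msr", "pku"],
}


def make_layout(root: str) -> List[Path]:
    """Create the empty folder skeleton under `root` and return the folders."""
    root_path = Path(root)
    folders = [
        root_path / "berts" / "bert",
        root_path / "vocab",
        root_path / "embedding",
        root_path / "log",
    ]
    for task, names in TASK_DATASETS.items():
        for name in names:
            folders.append(root_path / "dataset" / task / name)
            folders.append(root_path / "result" / task / name)
    for folder in folders:
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
    return folders
```

Calling `make_layout("data")` creates the folders relative to the current working directory and leaves any existing content untouched.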

Run

  • 1. Convert the .char.bmes files to .json files: python3 to_json.py

  • 2. Run the shell script: sh run_demo.sh
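For readers who want a feel for what the conversion step does, here is a rough sketch of turning CoNLL-style lines into one JSON object per sentence. It is not the repository's `to_json.py`: the function name `conll_to_json_lines` and the output keys `text` and `label` are assumptions for illustration, so use the provided script for real runs.

```python
import json
from typing import Iterable, Iterator


def conll_to_json_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one JSON string per sentence from character/tag lines."""
    chars, labels = [], []
    for line in lines:
        parts = line.split()
        if parts:
            chars.append(parts[0])    # the character
            labels.append(parts[-1])  # its tag
        elif chars:  # a blank line ends the sentence
            # "text" and "label" are assumed key names, not confirmed by the repo.
            yield json.dumps({"text": chars, "label": labels}, ensure_ascii=False)
            chars, labels = [], []
    if chars:  # flush a final sentence without a trailing blank line
        yield json.dumps({"text": chars, "label": labels}, ensure_ascii=False)


demo = ["美 B-LOC", "国 E-LOC", "", "我 O", "跟 O"]
for record in conll_to_json_lines(demo):
    print(record)
```

Passing `ensure_ascii=False` keeps the Chinese characters readable in the output instead of escaping them.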

If you want to load my checkpoints, you need to modify your copy of the transformers library.

My model was trained in distributed mode, so the checkpoints cannot be loaded directly in single-GPU mode. Follow the steps below to modify transformers before loading my checkpoints.

  • Enter the source code directory of transformers: cd source/transformers-master

  • Find modeling_utils.py and go to around line 995

  • Change the code as follows:
    [image: screenshot of the revised code]

  • Build and install the revised source code: python3 setup.py install
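As a possible alternative to patching the library, a common reason why checkpoints saved from distributed training fail to load on a single GPU is the `module.` prefix that PyTorch's DataParallel and DistributedDataParallel wrappers add to parameter names. Whether that is the issue with these checkpoints is an assumption; this is not the author's patch, and `strip_module_prefix` is a hypothetical helper.

```python
from collections import OrderedDict
from typing import Any, Mapping


def strip_module_prefix(state_dict: Mapping[str, Any], prefix: str = "module.") -> "OrderedDict[str, Any]":
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only a leading prefix is removed; keys without it are kept unchanged.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Possible use with PyTorch (assumes the checkpoint file is a plain state dict):
#   state = torch.load("pytorch_model.bin", map_location="cpu")
#   model.load_state_dict(strip_module_prefix(state))
```

If the keys in the checkpoint do not start with `module.`, this helper changes nothing and the steps above remain the way to go.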

Cite

@misc{liu2021lexicon,
      title={Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter}, 
      author={Wei Liu and Xiyan Fu and Yue Zhang and Wenming Xiao},
      year={2021},
      eprint={2105.07148},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

GitHub

https://github.com/liuwei1206/LEBERT