SapBERT: Self-alignment pretraining for BERT


This repo holds code, data, and pretrained weights for (1) the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining for Biomedical Entity Representations; (2) the cross-lingual SapBERT and a cross-lingual biomedical entity linking benchmark (XL-BEL) proposed in our ACL 2021 paper: Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking.

Huggingface Models

English Models: [SapBERT] and [SapBERT-mean-token]

Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained with UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. For [SapBERT], use [CLS] (before pooler) as the representation of the input; for [SapBERT-mean-token], use mean-pooling across all tokens.
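The two pooling choices above can be illustrated with a minimal sketch. It is not code from this repo: the helper names `pool` and `embed` are hypothetical, `model_name` must be supplied by the caller (the Hugging Face model identifier is not spelled out here), and the mean-pooling branch assumes padding positions are excluded from the average. The pooling itself is written in plain Python so it can be read and checked without loading a model.

```python
from typing import List


def pool(hidden: List[List[float]], mask: List[int], strategy: str = "cls") -> List[float]:
    """Pool one sequence's last-layer token vectors into a single vector.

    hidden: token vectors from the last encoder layer (position 0 is [CLS]).
    mask:   attention mask, 1 for real tokens and 0 for padding.
    """
    if strategy == "cls":
        # [CLS] before the pooler: the raw last-layer vector at position 0.
        return list(hidden[0])
    if strategy == "mean":
        # Assumption: average over real tokens only, skipping padding.
        kept = [vec for vec, m in zip(hidden, mask) if m]
        return [sum(col) / len(kept) for col in zip(*kept)]
    raise ValueError(f"unknown strategy: {strategy}")


def embed(names: List[str], model_name: str, strategy: str = "cls") -> List[List[float]]:
    """Encode entity names with a Hugging Face checkpoint (hypothetical helper)."""
    # Imported lazily so the pooling helper above has no heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tokenizer(names, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # last_hidden_state comes before the pooler; pooler_output is not used.
        hidden = model(**batch).last_hidden_state
    masks = batch["attention_mask"].tolist()
    return [pool(h, m, strategy) for h, m in zip(hidden.tolist(), masks)]


# Toy check of the pooling logic; the last row stands in for a padding token.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
cls_vec = pool(toy_hidden, toy_mask, "cls")
mean_vec = pool(toy_hidden, toy_mask, "mean")
```

Converting tensors to lists is slow for large batches; in practice the same indexing and masked averaging would be done directly on the tensors.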

Cross-Lingual Models: [SapBERT-XLMR] and [SapBERT-XLMR-large]

Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained with UMLS 2020AB (all languages), using xlm-roberta-base/xlm-roberta-large as the base model. Use [CLS] (before pooler) as the representation of the input.

Environment

The code is tested with Python 3.8, torch 1.7.0, and Hugging Face transformers 4.4.2. See requirements.txt for more details.

Train SapBERT

Extract training data from UMLS as instructed in training_data/generate_pretraining_data.ipynb (we cannot directly release the training file due to licensing issues).

Run:

cd train/
./pretrain.sh 0,1

where 0,1 specifies the GPU devices.

For cross-lingual SAP-tuning with general-domain parallel data (MUSE, wiki titles, or both), the data can be found in training_data/general_domain_parallel_data/. An example script is train/xling_train.sh.

Evaluate SapBERT

For evaluation (both monolingual and cross-lingual), please see evaluation/README.md for details. evaluation/xl_bel/ contains the XL-BEL benchmark proposed in [Liu et al., ACL 2021].
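As a rough illustration of how entity representations can be used for entity linking, the sketch below links a mention vector to its nearest dictionary entry. It is not the evaluation code in evaluation/: the function names, the toy vectors, and the concept labels are hypothetical, and cosine similarity is an assumed choice of similarity measure.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link(mention_vec: List[float], dictionary: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the dictionary entry most similar to the mention, with its score."""
    scored = ((concept_id, cosine(mention_vec, vec)) for concept_id, vec in dictionary.items())
    return max(scored, key=lambda pair: pair[1])


# Toy 2-d vectors stand in for real model embeddings of dictionary names.
toy_dictionary = {"concept_a": [1.0, 0.0], "concept_b": [0.0, 1.0]}
best_id, best_score = link([0.9, 0.1], toy_dictionary)
```

A linear scan like this is only practical for small dictionaries; for a full ontology, batched matrix multiplication or an approximate nearest-neighbour index would be the usual replacement.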

Citations

SapBERT:

@inproceedings{liu2021self,
	title={Self-Alignment Pretraining for Biomedical Entity Representations},
	author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
	booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
	pages={4228--4238},
	month = jun,
	year={2021}
}

Cross-lingual SapBERT and XL-BEL:

@inproceedings{liu2021learning,
	title={Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
	author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
	booktitle={Proceedings of ACL-IJCNLP 2021},
	month = aug,
	year={2021}
}