An open-source toolkit built on top of PyTorch and is developed

Glyce

Glyce is an open-source toolkit built on top of PyTorch and is developed by Shannon.AI.

What is Glyce ?

Glyce is a Chinese char representation based on Chinese glyph information. Glyce Chinese char embeddings are composed by two parts: (1) glyph-embeddings and (2) char-ID embeddings. The two parts are combined using concatenation, a highway network or a fully connected layer. Glyce word embeddings are also composed by two parts: (1) word glyph-embeddings and (2) word-ID embeddings. Glyph word embedding is the output of forwarding glyph embeddings for each char in the word through the max-pooling layer. Then glyph word embedding combine word-ID embedding using concatenation to get glyce word embedding.

Here are the some highlights in glyce:

1. Utilize Chinese Logographic Information

Glyce utilizes useful logographic information by encoding images of historical and contemporary scripts, along with the scripts of different writing styles.

2. Combine Glyce with Chinese Pre-trained BERT Model

We combine Glyce with Pre-trained Chinese BERT model and adopt specific layer to downstream tasks. The Glyce-BERT model outperforms BERT and sets new SOTA results for tagging (NER, CWS, POS), sentence pair classification, single sentence classification tasks.

3. Propose Tianzige-CNN(田字格) to Model Chinese Char

We propose the Tianzige-CNN(田字格) structure, which is tailored to Chinese character modeling. Tianzige-CNN(田字格) tackle the issue of small number of Chinese characters and small sacle of image compared to Imagenet.

4. Auxiliary Task Performs As a Regularizer

During training process, image-classification loss performs as an auxiliary training objective with the purpose of preventing overfitting and promiting the model's ability to generalize.

Experiment Results

1. Sequence Labeling Tasks

Named Entity Recognition (NER)

MSRA(Levow, 2006), OntoNotes 4.0(Weischedel et al., 2011), Resume(Zhang et al., 2018).

Model	P(onto)	R(onto)	F1(onto)	P(msra)	R(msra)	F1(msra)
CRF-LSTM	74.36	69.43	71.81	92.97	90.62	90.95
Lattice-LSTM	76.35	71.56	73.88	93.57	92.79	93.18
Lattice-LSTM+Glyce	82.06	68.74	74.81	93.86	93.92	93.89
			(+0.93)			(+0.71)
BERT	78.01	80.35	79.16	94.97	94.62	94.80
ERNIE	-	-	-	-	-	93.8
Glyce+BERT	81.87	81.40	80.62	95.57	95.51	95.54
			(+1.46)			(+0.74)

Model	P(resume)	R(resume)	F1(resume)	P(weibo)	R(weibo)	F1(weibo)
CRF-LSTM	94.53	94.29	94.41	51.16	51.07	50.95
Lattice-LSTM	94.81	94.11	94.46	52.71	53.92	53.13
Lattice-LSTM+Glyce	95.72	95.63	95.67	53.69	55.30	54.32
			(+1.21)			(+1.19)
BERT	96.12	95.45	95.78	67.12	66.88	67.33
Glyce+BERT	96.62	96.48	96.54	67.68	67.71	67.60
			(+0.76)			(+0.76)

Chinese Part-Of-Speech Tagging (POS)

CTB5, CTB6, CTB9 and UD1

Model	P(ctb5)	R(ctb5)	F1(ctb5)	P(ctb6)	R(ctb6)	F1(ctb6)
(Shao, 2017) (sig)	93.68	94.47	94.07	-	-	90.81
(Shao, 2017) (ens)	93.95	94.81	94.38	-	-
Lattice-LSTM	94.77	95.51	95.14	92.00	90.86	91.43
Glyce+Lattice-LSTM	95.49	95.72	95.61	92.72	91.14	91.92
			(+0.47)			(+0.49)
BERT	95.86	96.26	96.06	94.91	94.63	94.77
Glyce+BERT	96.50	96.74	96.61	95.56	95.26	95.41
			(+0.55)			(+0.55)

Model	P(ctb9)	R(ctb9)	F1(ctb9)	P(ud1)	R(ud1)	F1(ud1)
(Shao, 2017) (sig)	91.81	94.47	91.89	89.28	89.54	89.41
(Shao, 2017) (ens)	92.28	92.40	92.34	89.67	89.86	89.75
Lattice-LSTM	92.53	91.73	92.13	90.47	89.70	90.09
Glyce+Lattice-LSTM	92.28	92.85	92.38	91.57	90.19	90.87
			(+0.25)			(+0.78)
BERT	92.43	92.15	92.29	95.42	94.97	95.19
Glyce+BERT	93.49	92.84	93.15	96.19	96.10	96.14
			(+0.86)			(+1.35)

Chinese Word Segmentation (CWS)

PKU, CITYU, MSR and AS.

Model	P(pku)	R(pku)	F1(pku)	P(cityu)	R(cityu)	F1(cityu)
BERT	96.8	96.3	96.5	97.5	97.7	97.6
Glyce+BERT	97.1	96.4	96.7	97.9	98.0	97.9
			(+0.2)			(+0.3)

Model	P(msr)	R(msr)	F1(msr)	P(as)	R(as)	F1(as)
BERT	98.1	98.2	98.1	96.7	96.4	96.5
Glyce+BERT	98.2	98.3	98.3	96.6	96.8	96.7
			(+0.2)			(+0.2)

2. Sentence Pair Classification

Dataset Description

The BQ corpus, XNLI, LCQMC, NLPCC-DBQA

Model	P(bq)	R(bq)	F1(bq)	Acc(bq)	P(lcqmc)	R(lcqmc)	F1(lcqmc)	Acc(lcqmc)
BiMPM	82.3	81.2	81.7	81.9	77.6	93.9	85.0	83.4
BiMPM+Glyce	81.9	85.5	83.7	83.3	80.4	93.4	86.4	85.3
			(+2.0)	(+1.4)			(+1.4)	(+1.9)
BERT	83.5	85.7	84.6	84.8	83.2	94.2	88.2	87.5
ERNIE	-	-	-	-	-	-	-	87.4
Glyce+BERT	84.2	86.9	85.5	85.8	86.8	91.2	88.8	88.7
			(+0.9)	(+1.0)			(+0.6)	(+1.2)

Model	Acc(xnli)	P(nlpcc-dbqa)	R(nlpcc-dbqa)	F1(nlpcc-dbqa)
BIMPM	67.5	78.8	56.5	65.8
BIMPM+Glyce	67.7	76.3	59.9	67.1
	(+0.2)			(+1.3)
BERT	78.4	79.6	86.0	82.7
ERNIE	78.4	-	-	82.7
Glyce+BERT	79.2	81.1	85.8	83.4
	(+0.8)			(+0.7)

3.Single Sentence Classification

Dataset Description

ChnSentiCorp, Fudan and Ifeng.

Model	ChnSentiCorp	Fudan	iFeng
LSTM	91.7	95.8	84.9
LSTM+Glyce	93.1	96.3	85.8
	(+1.4)	(+0.5)	(+0.9)
BERT	95.4	99.5	87.1
ERNIE	95.4	-	-
Glyce+BERT	95.9	99.8	87.5
	(+0.5)	(+0.3)	(+0.4)

4. Chinese Semantic Role Labeling

Dataset Description

CoNLL-2009

Model Performance

Model	Precision	Recall	F1
Roth and Lapata (2016)	76.9	73.8	75.3
Marcheggiani and Titov (2017)	84.6	80.4	82.5
K-order pruning (He et al., 2018)	84.2	81.5	82.8
K-order pruning + Glyce-word	85.4	82.1	83.7
	(+0.8)	(+0.6)	(+0.9)

5. Chinese Dependency Parsing

Dataset Description

Chinese Penn TreeBank 5.1. Dataset splits follows (Dozat and Manning, 2016).

Model Performance

Model	UAS	LAS
Ballesteros et al. (2016)	87.7	86.2
Kiperwasser and Goldberg (2016)	87.6	86.1
Cheng et al. (2016)	88.1	85.7
Biaffine	89.3	88.2
Biaffine+Glyce-word	90.2	89.0
	(+0.9)	(+0.8)

Requirements

Python Version >= 3.6
GPU, Use NVIDIA TITAN Xp with 12G RAM
Chinese scripts could be found in Google Drive. Please refer to the description and download scripts files to glyce/glyce/fonts/.

NOTE: Some experimental results are obtained by training on multi-GPU machines. May use DIFFERENT PyTorch versions refer to previous open-source SOTA models. Experiment environment for Glyce-BERT is Python 3.6 and PyTorch 1.10.

Installation

# Clone glyce 
git clone [email protected]:ShannonAI/glyce.git
cd glyce 
python3.6 setup.py develop 

# Install package dependency 
pip install -r requirements.txt

Quick start of Glyce

Usage of Glyce Char/Word Embedding


import torch 
import torch.nn as nn 
from torch.nn import CrossEntropyLoss 


from glyce import GlyceConfig 
from glyce import CharGlyceEmbedding, WordGlyceEmbedding 


# load glyce hyper-params
glyce_config = GlyceConfig()


# if input ids is char-level
glyce_embedding = CharGlyceEmbedding(glyce_config)
# elif input ids is word-level 
glyce_embedding = WordGlyceEmbedding(glyce_config)


# forward input_ids into embedding layer 
glyce_embedding, glyph_loss = glyce_embedding(input_ids)


# utilize image classifiation loss as the auxiliary loss 
glyph_decay = 0.1 
glyph_ratio = 0.01 # the proportion of image classification loss to total loss 
current_epoch = 3 


# compute loss 
loss_fct = CrossEntropyLoss()
task_loss = loss_fct(logits, labels)
total_loss = task_loss * (1 - glyce_ratio) + glyph_ratio * glyph_decay ** (current_epoch+1) * glyph_loss

Usage of Glyce-BERT

1. Preparation

Download and unzip BERT-Base, Chinese pretrained model.
Install PyTorch pretrained bert by pip as follows:

pip install pytorch-pretrained-bert

Convert TF checkpoint into PyTorch

export BERT_BASE_DIR=/path/to/bert/chinese_L-12_H-768_A-12

pytorch_pretrained_bert convert_tf_checkpoint_to_pytorch \
$BERT_BASE_DIR/bert_model.ckpt \
$BERT_BASE_DIR/bert_config.json \
$BERT_BASE_DIR/pytorch_model.bin

2. Train/Dev/Test DataFormat

Sequence Labeling Task

模型的输入: [CLS] 我 爱 北 京 天 安 门
模型的输出: ‘我爱北京天安门' 对应的命名实体标签 O  O  B-LOC M-LOC M-LOC M-LOC E-LOC 

Model Input: [CLS] I like Bei Jing Tian An Men
Model Output: labels corresponds to 'I like BeiJing Tian An men' are O  O  B-LOC M-LOC M-LOC M-LOC E-LOC

Sentence Pair Classification

模型的输入: [CLS] 我 爱 北 京 天 安 门 [SEP] 北 京 欢 迎 您
模型的输出: [CLS] 位置对应的输出,假如是语意匹配任务则为 0,代表两句话的语意不相同

Model Input: [CLS] I like Bei Jing Tian An Men [SEP] Bei Jing Wel ##come you 
Model Output: the output in [CLS] position. 0 if task sign is semantic matching. It means that the semantic of two sentences are different.

Single Sentence Classification

模型的输入: [CLS] 我 爱 北 京 天 安 门
模型的输出: [CLS] 位置对应的输出,假如是情感分类任务则为1,代表情感为积极

Model Input: [CLS] I like Bei Jing Tian An Men. 
Model Output: the output in [CLS] position. 1 if task is sentiment analysis. It means the sentiment polarity is positive.

3. Start Train and Evaluate Glyce-BERT

scritps/*_bert.sh are the commands we used to finetune BERT.
scripts/*_glyce_bert.sh are the commands we used to obtained the results of Glyce-BERT.
scripts/ctb5_binaffine.sh is the command that we used to reimplement PREVIOUS SOTA result on CTB5 for dependency parsing.
scripts/ctb5_glyce_binaffine.sh is the command that we used to obtain the SOTA result on CTB5 for dependency parsing.

For example, training command of Glyce-BERT for sentence pair dataset BQ is included in scripts/glyce_bert/bq_glyce_bert.sh.
Start train and evaluate Glyce-BERT on BQ by,

bash scripts/glyce_bert/bq_glyce_bert.sh

Notes:

repo_path refer to work directory of glyce.
config_path refer to the path of configuration file.
bert_model refer to the the directory of the pre-trained Chinese BERT model.
task_name refer to the task signiature.
data_dir and output_dir refer to the directories of the "raw data" and "intermediate checkpoints" respectively.

An open-source toolkit built on top of PyTorch and is developed

Glyce

What is Glyce ?

1. Utilize Chinese Logographic Information

2. Combine Glyce with Chinese Pre-trained BERT Model

3. Propose Tianzige-CNN(田字格) to Model Chinese Char

4. Auxiliary Task Performs As a Regularizer

Experiment Results

1. Sequence Labeling Tasks

Named Entity Recognition (NER)

Chinese Part-Of-Speech Tagging (POS)

Chinese Word Segmentation (CWS)

2. Sentence Pair Classification

Dataset Description

3.Single Sentence Classification

Dataset Description

4. Chinese Semantic Role Labeling

Dataset Description

Model Performance

5. Chinese Dependency Parsing

Dataset Description

Model Performance

Requirements

Installation

Quick start of Glyce

Usage of Glyce Char/Word Embedding

Usage of Glyce-BERT

1. Preparation

2. Train/Dev/Test DataFormat

3. Start Train and Evaluate Glyce-BERT

GitHub

John

Create Rounded Segmented Column Charts

A simple neural network for python autocompletion

Glyce

What is Glyce ?

1. Utilize Chinese Logographic Information

2. Combine Glyce with Chinese Pre-trained BERT Model

3. Propose Tianzige-CNN(田字格) to Model Chinese Char

4. Auxiliary Task Performs As a Regularizer

Experiment Results

1. Sequence Labeling Tasks

Named Entity Recognition (NER)

Chinese Part-Of-Speech Tagging (POS)

Chinese Word Segmentation (CWS)

2. Sentence Pair Classification

Dataset Description

3.Single Sentence Classification

Dataset Description

4. Chinese Semantic Role Labeling

Dataset Description

Model Performance

5. Chinese Dependency Parsing

Dataset Description

Model Performance

Requirements

Installation

Quick start of Glyce

Usage of Glyce Char/Word Embedding

Usage of Glyce-BERT

1. Preparation

2. Train/Dev/Test DataFormat

3. Start Train and Evaluate Glyce-BERT

GitHub

Create Rounded Segmented Column Charts

A simple neural network for python autocompletion

You might also like...