Recasing and punctuation model based on Bert

Benoit Favre 2021

This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.

It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations.

The model predicts the following recasing labels:

lower: keep lowercase
upper: convert to upper case
capitalize: set first letter as upper case
other: left as is

And the following punctuation labels:

o: no punctuation
period: .
comma: ,
question: ?
exclamation: !

Input tokens are batched as sequences of length 256 that are processed independently without overlap.

In training, batches containing less that 256 tokens are simulated by drawing uniformly a length and replacing all tokens and labels after that point with padding (called Cut-drop).

Changelong:

Fix generation when input is smaller than max length

Installation

Use your favourite method for installing Python requirements. For example:

python -mvenv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Prediction

Predict from raw text:

python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt

Models

French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
Iterations: 19000
Batch size: 16
Max length: 256
Seed: 871253
Cut-drop probability: 0.1
Train loss: 0.021128975618630648
Valid loss: 0.015684964135289192
Recasing accuracy: 96.73
Punctuation accuracy: 95.02
All punctuation F-score: 67.79
Comma F-score: 67.94
Period F-score: 72.91
Question F-score: 57.57
Exclamation mark F-score: 15.78
Training data: First 100M words from Common Crawl

Training

Notes: You need to modify file names adequately. Training tensors are precomputed and loaded in CPU memory.

Stage 0: download text data

Stage 1: tokenize and normalize text with Moses tokenizer, and extract recasing and repunctuation labels

python recasepunc.py preprocess < input.txt > input.case+punc

Stage 2: sub-tokenize with Flaubert tokenizer, and generate pytorch tensors

python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y

Stage 3: train model

python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path

Stage 4: evaluate performance on a test set

python recasepunc.py eval checkpoint/path.iteration test.x test.y

Model for recasing and repunctuating ASR transcripts

Recasing and punctuation model based on Bert

Installation

Prediction

Models

Training

John

A web app to scan crypto markets based on candlestick pattern recognition from

Python tool for dumping flash via uboot reliably

Recasing and punctuation model based on Bert

Installation

Prediction

Models

Training

A web app to scan crypto markets based on candlestick pattern recognition from

Python tool for dumping flash via uboot reliably

You might also like...