Fast State-of-the-Art Tokenizers optimized for Research and Production

Jun 17, 2021 1 min read

tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

Train new vocabularies and tokenize, using today's most used tokenizers.
Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
less than 20 seconds to tokenize a GB of text on a server's CPU.
Easy to use, but also extremely versatile.
Designed for research and production.
Normalization comes with alignments tracking. It's always possible to get the part of the
original sentence that corresponds to a given token.
Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.

Bindings

We provide bindings to the following languages (more to come!):

Rust (Original implementation)
Python
Node.js

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

You can customize how pre-tokenization (e.g., splitting into words) is done:

from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

Then training your tokenizer on a set of files just takes two lines of codes:

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)

Once your tokenizer is trained, encode any text with just one line:

output = tokenizer.encode("Hello, y'all! How are you ? ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

Check the python documentation or the
python quicktour to learn
more!

GitHub

https://github.com/huggingface/tokenizers

Natural Language Processing

John was the first writer to have joined pythonawesome.com. He has since then inculcated very effective writing and reviewing culture at pythonawesome which rivals have found impossible to imitate.

Fast State-of-the-Art Tokenizers optimized for Research and Production

tokenizers

Main features:

Bindings

Quick example using Python:

GitHub

John

An open-source tool for data science and machine learning projects

A Terminal Client for MySQL with AutoCompletion and Syntax Highlighting

tokenizers

Main features:

Bindings

Quick example using Python:

GitHub

An open-source tool for data science and machine learning projects

A Terminal Client for MySQL with AutoCompletion and Syntax Highlighting

You might also like...