PyThaiNLP
PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK but with a focus on the Thai language.
Getting Started
- PyThaiNLP 2 requires Python 3.6+. Python 2.7 users can use PyThaiNLP 1.6. See 2.0 change log | Upgrading from 1.7 | Upgrading ThaiNER from 1.7
- PyThaiNLP Get Started notebook | API document | Tutorials
- Official website | PyPI | Facebook page
- Who uses PyThaiNLP?
- Model cards - for technical details, caveats, and ethical considerations of the models developed and used in PyThaiNLP
Capabilities
PyThaiNLP provides standard NLP functions for Thai, such as part-of-speech tagging and linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via a command-line interface.
List of Features
- Convenient character and word classes, like Thai consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`), comparable to constants like `string.ascii_letters`, `string.digits`, and `string.punctuation`
- Thai linguistic unit segmentation/tokenization, including sentence (`sent_tokenize`), word (`word_tokenize`), and subword segmentation based on Thai Character Cluster (`subword_tokenize`)
- Thai part-of-speech tagging (`pos_tag`)
- Thai spelling suggestion and correction (`spell` and `correct`)
- Thai transliteration (`transliterate`)
- Thai soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
- Thai collation (sort by dictionary order) (`collate`)
- Reading out numbers as Thai words (`bahttext`, `num_to_thaiword`)
- Thai datetime formatting (`thai_strftime`)
- Fix for text typed with the wrong Thai/English keyboard layout (`eng_to_thai`, `thai_to_eng`)
- Command-line interface for basic functions, like tokenization and POS tagging (run `thainlp` in your shell)
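As a rough illustration of the tokenization and tagging features listed above, the sketch below chains `word_tokenize` and `pos_tag`. The helper name `describe` and the sample sentence are assumptions made for this example, and both functions are called with their default engines; see the API document for the available options.

```python
def describe(text):
    """Split Thai text into words and attach a part-of-speech tag to each.

    `describe` is a hypothetical helper for this example, not part of PyThaiNLP.
    """
    # Imported inside the function so this file still loads
    # when PyThaiNLP has not been installed yet.
    from pythainlp import word_tokenize
    from pythainlp.tag import pos_tag

    words = word_tokenize(text)  # list of word strings
    return pos_tag(words)        # list of (word, tag) pairs
```

After `pip install pythainlp`, calling `describe("ผมรักภาษาไทย")` should return one `(word, tag)` pair per token; the exact split and tags depend on the engine and the installed version.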
Installation
pip install --upgrade pythainlp
This will install the latest stable release of PyThaiNLP.
Install different releases:
- Stable release:
pip install --upgrade pythainlp
- Pre-release (near ready):
pip install --upgrade --pre pythainlp
- Development (likely to break things):
pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
Installation Options
Some functionality, such as Thai WordNet, may require extra packages. To install those requirements, specify a comma-separated set of `[name]` extras immediately after `pythainlp`:
pip install pythainlp[extra1,extra2,...]
List of possible `extras`
- `full` (install everything)
- `attacut` (to support attacut, a fast and accurate tokenizer)
- `benchmarks` (for word tokenization benchmarking)
- `icu` (for ICU, International Components for Unicode, support in transliteration and tokenization)
- `ipa` (for IPA, International Phonetic Alphabet, support in transliteration)
- `ml` (to support ULMFiT models for classification)
- `thai2fit` (for Thai word vector)
- `thai2rom` (for machine-learnt romanization)
- `wordnet` (for Thai WordNet API)
For dependency details, see the `extras` variable in `setup.py`.
Data directory
- Some additional data, such as word lists and language models, may be downloaded automatically at runtime.
- PyThaiNLP caches this data under the directory `~/pythainlp-data` by default.
- The data directory can be changed by setting the environment variable `PYTHAINLP_DATA_DIR`.
- See the data catalog (`db.json`) at https://github.com/PyThaiNLP/pythainlp-corpus
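The lookup order described above (environment variable first, then the default directory) can be sketched with the standard library alone. This is only an illustration of that rule, not PyThaiNLP's own code; `resolve_data_dir` is a hypothetical helper introduced for this example.

```python
import os


def resolve_data_dir(env=None):
    """Return the directory where cached data would be stored (sketch)."""
    # Accept an explicit mapping for easy testing; default to the real environment.
    env = os.environ if env is None else env

    custom = env.get("PYTHAINLP_DATA_DIR")
    if custom:
        # The environment variable overrides the default location.
        return os.path.expanduser(custom)

    # Default location: ~/pythainlp-data
    return os.path.join(os.path.expanduser("~"), "pythainlp-data")
```

For instance, `resolve_data_dir({"PYTHAINLP_DATA_DIR": "/tmp/thai-data"})` yields the overridden path, while `resolve_data_dir({})` falls back to the directory under the user's home.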
Command-Line Interface
Some PyThaiNLP functionality can be used from the command line with the `thainlp` command.
For example, displaying a catalog of datasets:
thainlp data catalog
Showing usage help:
thainlp help