NLPretext

NLPretext gathers in a single library all the text preprocessing functions you need to ease your NLP project.

Installation

This package has been tested on Python 3.6, 3.7 and 3.8.

We strongly advise you to do the remaining steps in a virtual environment.

To install this library you just have to run the following command:

pip install nlpretext

This library uses spaCy as its tokenizer. The models currently supported are en_core_web_sm and fr_core_news_sm. If they are not installed, run the following commands:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.3.0/fr_core_news_sm-2.3.0.tar.gz
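Alternatively, assuming spaCy itself is already installed, the same models can be fetched with spaCy's own download command:

```shell
# Download the English and French small models via the spaCy CLI
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm
```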

Preprocessing pipeline

Default pipeline

Need to preprocess your text data but have no clue about which functions to use and in which order? The default preprocessing pipeline has you covered:

from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I  recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# "I just got the best dinner in my life !!! I recommend"

Create your custom pipeline

Alternatively, you can create a custom pipeline if you know exactly which functions to apply to your data. Here's an example:

from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters,
                                        remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I  recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"
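The pipe/run pattern above can be sketched in plain Python as follows. This is an illustration of the pattern only, not NLPretext's actual implementation: pipe registers each function with its optional arguments, and run applies them in registration order.

```python
class SimplePipeline:
    """Illustrative sketch of a pipe/run pattern: register functions, then apply them in order."""

    def __init__(self):
        self._steps = []  # list of (function, kwargs) pairs

    def pipe(self, func, args=None):
        # Register a preprocessing function and its optional keyword arguments
        self._steps.append((func, args or {}))

    def run(self, text):
        # Apply every registered function to the text, in registration order
        for func, kwargs in self._steps:
            text = func(text, **kwargs)
        return text

pipeline = SimplePipeline()
pipeline.pipe(str.lower)
pipeline.pipe(lambda t, char: t.replace(char, ""), args={"char": "!"})
print(pipeline.run("Hello World!!!"))  # hello world
```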

Take a look at all the available functions in the preprocess.py scripts of the basic, social, and token folders.

Individual Functions

Replacing emails

from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to [email protected]"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"

Replacing phone numbers

from nlpretext.basic.preprocess import replace_phone_numbers
example = "My phone number is 0606060606"
example = replace_phone_numbers(example, country_to_detect=["FR"], replace_with="*PHONE*")
print(example)
# "My phone number is *PHONE*"

Removing Hashtags

from nlpretext.social.preprocess import remove_hashtag
example = "This restaurant was amazing #food #foodie #foodstagram #dinner"
example = remove_hashtag(example)
print(example)
# "This restaurant was amazing"
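For illustration, the effect of remove_hashtag can be approximated with a simple regular expression. This is a sketch of the idea only, not NLPretext's actual implementation, which may handle more edge cases:

```python
import re

def remove_hashtags_sketch(text):
    # Drop every "#word" token, then collapse leftover whitespace
    text = re.sub(r"#\w+", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_hashtags_sketch("This restaurant was amazing #food #foodie #foodstagram #dinner"))
# This restaurant was amazing
```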

Extracting emojis

from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']

Data augmentation

The augmentation module helps you generate new texts from your given examples by modifying some words in the initial ones. In the case of NER tasks, any associated entities are kept unchanged. If you want words other than entities to remain unchanged, you can specify them with the stopwords argument. The modifications depend on the chosen method; the ones currently supported by the module are substitutions with synonyms, using either WordNet or BERT from the nlpaug library.

from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method="wordnet_synonym", entities=entities)
print(example)
# "I need to buy a small black pocketbook please."

Make HTML documentation

To build the HTML Sphinx documentation, run the following at the nlpretext root path:

sphinx-apidoc -f nlpretext -o docs/

This will generate the .rst files. You can then generate the documentation with:

cd docs && make html

You can now open the file index.html located in the build folder.

Project Organization


├── LICENSE
├── VERSION
├── CONTRIBUTING.md     <- Contribution guidelines
├── README.md           <- The top-level README for developers using this project.
├── .github/workflows   <- Where the CI lives
├── datasets/external   <- Bash scripts to download external datasets
├── docs                <- Sphinx HTML documentation
├── nlpretext           <- Main package. This is where the code lives
│   ├── preprocessor.py <- Main preprocessing script
│   ├── augmentation    <- Text augmentation scripts
│   ├── basic           <- Basic text preprocessing
│   ├── social          <- Social text preprocessing
│   ├── token           <- Token text preprocessing
│   ├── _config         <- Where the configuration and constants live
│   └── _utils          <- Where the preprocessing utility scripts live
├── tests               <- Where the tests live
├── setup.py            <- Makes the project pip installable (pip install -e .) so the package can be imported
├── requirements.txt    <- The requirements file for reproducing the analysis environment, e.g.
│                          generated with `pip freeze > requirements.txt`
└── pylintrc            <- The linting configuration file

GitHub

https://github.com/artefactory/NLPretext