A toolkit to integrate Excel to spaCy NLP training experiences

ExcelCy

ExcelCy is a toolkit that integrates Excel with the spaCy NLP training experience. It trains NER from XLSX files, with sentences collected from PDF, DOCX, PPT, PNG or JPG sources.

ExcelCy has a pipeline to match entities using PhraseMatcher, or Matcher with regular expressions.

ExcelCy is Powerful

The Simple Style Training example from the spaCy documentation demonstrates how to train NER using spaCy:

import random

import spacy

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
    ("Google rebrands its business apps", {'entities': [(0, 6, 'ORG')]}),
]

# spaCy v2 API: a blank pipeline needs an NER component before training
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
ner.add_label('ORG')
nlp.add_pipe(ner)
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')

TRAIN_DATA describes the sentences and the annotated entities to be trained. Counting character offsets by hand is cumbersome. With ExcelCy, the (start, end) character offsets can be omitted.

from excelcy import ExcelCy
# collect sentences, annotate Entities and train NER using spaCy
excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')
# use the nlp object as per spaCy API
doc = excelcy.nlp('Google rebrands its business apps')
# or save it for faster bootstrap for application
excelcy.nlp.to_disk('/model')
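
To illustrate what omitting the offsets saves, the following is a minimal sketch of deriving (start, end) from the entity text itself. It is not ExcelCy's implementation; the helper name annotate is hypothetical, and it assumes each phrase occurs at most once in the sentence.

```python
def annotate(text, phrases):
    """Build a spaCy-style training tuple by locating each phrase in the text.

    Hypothetical helper for illustration only; not part of ExcelCy.
    """
    entities = []
    for phrase, label in phrases:
        start = text.find(phrase)  # index of the first occurrence, or -1
        if start == -1:
            continue  # phrase not present in this sentence; skip it
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


example = annotate("Uber blew through $1 million a week", [("Uber", "ORG")])
print(example)
```

The result has the same shape as one TRAIN_DATA entry, so the character offsets never need to be counted by hand.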

ExcelCy is Friendly

ExcelCy training is divided into four phases. An example Excel file can be found in tests/data/test_data_01.xlsx:

  1. Discovery
    The first phase collects sentences from the data sources listed in sheet "source". A data source can be either:
      • Text: direct sentence values.
      • Files: PDF, DOCX, PPT, PNG or JPG files, parsed using textract.

  2. Preparation
    In the next phase, the sentences are analysed in sheet "prepare", based on:
      • Current data model: using the spaCy API, nlp(sentence).ents
      • Phrase pattern: Robertus Johansyah, Uber, Google, Amazon
      • Regex pattern: ^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$

  3. Training
    The main phase of NER training, as described in Simple Style Training. The data is iterated from sheet "train"; check sheet "config" to control the parameters.

  4. Consolidation
    The last phase is to test and save the results, and to repeat the phases if required.
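
The regex pattern listed under Preparation can be tried on its own with Python's re module. The sketch below only shows what the pattern accepts; how ExcelCy applies it internally is not shown here, and the helper name is_time is hypothetical.

```python
import re

# The regex pattern from the Preparation phase: a 24-hour HH:MM time
TIME_PATTERN = re.compile(r"^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$")


def is_time(value):
    """Return True when the whole value is a valid 24-hour time."""
    return TIME_PATTERN.match(value) is not None


for candidate in ["9:30", "23:59", "24:00", "12:60"]:
    print(candidate, is_time(candidate))
```

Because the pattern is anchored with ^ and $, it matches only when the entire value is a time, not a time embedded in a longer sentence.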

ExcelCy is Comprehensive

Under the hood, ExcelCy has strong and well-defined data storage. At any given phase above, the data can be inspected.


import json

from excelcy import ExcelCy
# assumed import path for Config
from excelcy.storage import Config

excelcy = ExcelCy()
# load configuration from XLSX or YML or JSON
# excelcy.load(file_path='test_data_01.xlsx')
# or define manually
excelcy.storage.config = Config(nlp_base='en_core_web_sm', train_iteration=2, train_drop=0.2)
print(json.dumps(excelcy.storage.items(), indent=2))

# add sources
excelcy.storage.source.add(kind='text', value='Robertus Johansyah is the maintainer ExcelCy')
excelcy.storage.source.add(kind='textract', value='tests/data/source/test_source_01.txt')
excelcy.discover()
print(json.dumps(excelcy.storage.items(), indent=2))

# add phrase matcher Robertus Johansyah -> PERSON
excelcy.storage.prepare.add(kind='phrase', value='Robertus Johansyah', entity='PERSON')
excelcy.prepare()
print(json.dumps(excelcy.storage.items(), indent=2))

# train it
excelcy.train()
print(json.dumps(excelcy.storage.items(), indent=2))

# test it
doc = excelcy.nlp('Robertus Johansyah is maintainer ExcelCy')
print(json.dumps(excelcy.storage.items(), indent=2))

Features

  • Load multiple data sources such as Word documents, PowerPoint presentations, PDF or images.
  • Import/Export configuration with JSON, YML or Excel.
  • Add custom Entity labels.
  • Rule-based phrase matching using PhraseMatcher.
  • Rule-based matching using regex + Matcher.
  • Train a Named Entity Recogniser with ease.

Install

Either install with pip, or clone this repository and run the setup.py file.

$ pip install excelcy
# ensure the language model is installed before training
$ python -m spacy download en

Train

To train the spaCy model:

from excelcy import ExcelCy
excelcy = ExcelCy.execute(file_path='test_data_01.xlsx')
