Patent Classification
Goal: To train a machine learning classifier that can classify international patents into one of eight categories based on the content of their titles/abstracts. More information on the taxonomy of the patent classes is available on the WIPO website.
- The patent data is available as raw XML from this URL: https://bulkdata.uspto.gov/
- Each large zipped archive contains a single file made up of multiple concatenated XML blocks
- This repo contains preprocessing code (preproc.py) that organizes these XML blocks into a parsable form and extracts the information relevant for classification.
Installation
This step assumes that Python 3.9+ is installed. Set up a virtual environment and install from requirements.txt:
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip3 install -r requirements.txt
For further development, simply activate the existing virtual environment.
$ source .venv/bin/activate
Preprocessing
The preprocessing script requires that an unzipped raw XML file (with information on hundreds of patents) exists in the raw_data/ directory. As an example, the following file is downloaded from the source, uncompressed, and stored at the path below:
raw_data/ipgb20200107_wk01/ipgb20200107.xml
Because the large XML file is not directly parsable, it needs to be broken down into individual blocks, each of which constitutes a valid XML tree. Each block can then be parsed and the relevant information extracted. Using this approach, we can organize the information into a form that can be used to train an ML classifier.
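The splitting step described above can be sketched as follows. This is a minimal illustration, not the code in preproc.py: it assumes that every block in the raw file begins with its own `<?xml ...?>` declaration on a new line, and the function name `split_xml_blocks` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterator

# Assumption: each block in the raw file starts with an XML declaration.
XML_DECLARATION = "<?xml"


def split_xml_blocks(text: str) -> Iterator[str]:
    """Yield each self-contained XML document from a concatenated string."""
    block = []
    for line in text.splitlines(keepends=True):
        # A new declaration closes the previous block, if there is one.
        if line.lstrip().startswith(XML_DECLARATION) and block:
            yield "".join(block)
            block = []
        block.append(line)
    # Emit the final block unless it is empty or whitespace only.
    if any(part.strip() for part in block):
        yield "".join(block)


# Tiny stand-in for the raw file: two documents back to back.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<patent><title>First</title></patent>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<patent><title>Second</title></patent>\n"
)

blocks = list(split_xml_blocks(sample))
# Each block now parses on its own as a valid XML tree.
titles = [ET.fromstring(b.encode("utf-8")).findtext("title") for b in blocks]
```

For a real raw file of this size, it would likely be preferable to stream the file line by line rather than read it into a single string; the splitting logic stays the same.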
Run the preprocessing script (after editing the path to the raw data appropriately) as follows:
$ python3 preproc.py
This produces a new directory with clean, parsable XML files, and writes out the data to a JSON file (data.json). The JSON data consists of the following key-value pairs:
data = {
"doc_id": doc_id,
"title": title,
"abstract": abstract,
"label": section_label,
}
Note that the section_label field refers to the top level of the classification hierarchy, and takes one of eight values: A, B, C, D, E, F, G or H. Each letter refers to a particular section of the IPC hierarchy (Physics, Chemistry, Engineering, etc.). More information on this can be found on the WIPO website:
Guide to the International Patent Classification, 2020 Edition, part II, p5.
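As a rough sketch of how one parsed block could be turned into the record shown above, the example below pulls the four fields out of a single XML tree and keeps only records whose label is one of the eight IPC sections. The element names (`publication-reference`, `invention-title`, `abstract`, `classification-ipcr/section`) are assumptions about the grant XML schema and may differ from what preproc.py actually reads; `extract_record`, `_text` and `IPC_SECTIONS` are hypothetical names used only for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The eight IPC sections and their titles.
IPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
}


def _text(element) -> str:
    """Flatten an element's nested text into one whitespace-normalized string."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_record(block: str):
    """Return the classification record for one block, or None if unlabeled."""
    root = ET.fromstring(block.encode("utf-8"))
    # Assumed element paths; adjust them to match the actual schema.
    doc_id = root.findtext(".//publication-reference/document-id/doc-number")
    section = root.findtext(".//classification-ipcr/section")
    label = (section or "").strip().upper()
    if label not in IPC_SECTIONS:
        return None  # skip blocks without a usable section label
    return {
        "doc_id": doc_id,
        "title": _text(root.find(".//invention-title")),
        "abstract": _text(root.find(".//abstract")),
        "label": label,
    }


# Hand-written stand-in for one cleaned block (not real patent data).
sample_block = """<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference>
      <document-id><doc-number>00000001</doc-number></document-id>
    </publication-reference>
    <classifications-ipcr>
      <classification-ipcr><section>G</section></classification-ipcr>
    </classifications-ipcr>
    <invention-title>Example <i>optical</i> sensor</invention-title>
  </us-bibliographic-data-grant>
  <abstract><p>A sensor that
  measures light.</p></abstract>
</us-patent-grant>
"""

record = extract_record(sample_block)
serialized = json.dumps(record)
section_name = IPC_SECTIONS[record["label"]]
```

Returning `None` for blocks without a recognized section keeps unlabeled examples out of the training data; whether to drop or keep such blocks is a design choice left to the preprocessing step.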