Camelot
PDF Table Extraction for Humans.
Camelot is a Python library that makes it easy for anyone to extract tables from PDF files!
Here's how you can extract tables from PDF files. Check out the PDF used in this example here.
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> tables[0].df # get a pandas DataFrame!
| Cycle Name | KI (1/km) | Distance (mi) | Percent Fuel Savings | | | |
|---|---|---|---|---|---|---|
| | | | Improved Speed | Decreased Accel | Eliminate Stops | Decreased Idle |
| 2012_2 | 3.30 | 1.3 | 5.9% | 9.5% | 29.2% | 17.4% |
| 2145_1 | 0.68 | 11.2 | 2.4% | 0.1% | 9.5% | 2.7% |
| 4234_1 | 0.59 | 58.7 | 8.5% | 1.3% | 8.5% | 3.3% |
| 2032_2 | 0.17 | 57.8 | 21.7% | 0.3% | 2.7% | 1.2% |
| 4171_1 | 0.07 | 173.9 | 58.1% | 1.6% | 2.1% | 0.5% |
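The extracted table above carries a two-row header ("Percent Fuel Savings" spans the last four columns). If you want single-row column names, one option is to merge the two rows yourself. The following is only a sketch using plain Python lists: `flatten_header` is a hypothetical helper, not part of Camelot, and it assumes the first two rows of the extracted table are the header rows, as the rendered table suggests.

```python
def flatten_header(top, sub, sep=": "):
    """Merge a two-row header into a single list of column names.

    An empty cell in the top row is treated as a continuation of the
    closest non-empty cell to its left (a spanning header).
    """
    names = []
    current = ""
    for top_cell, sub_cell in zip(top, sub):
        top_cell, sub_cell = top_cell.strip(), sub_cell.strip()
        if top_cell:
            current = top_cell  # a new header group starts here
        if current and sub_cell:
            names.append(f"{current}{sep}{sub_cell}")
        else:
            names.append(sub_cell or current)
    return names


# Header rows copied from the table above. With Camelot they could be read
# from the first two rows of tables[0].df (an assumption about the layout).
top = ["Cycle Name", "KI (1/km)", "Distance (mi)", "Percent Fuel Savings", "", "", ""]
sub = ["", "", "", "Improved Speed", "Decreased Accel", "Eliminate Stops", "Decreased Idle"]

columns = flatten_header(top, sub)
for name in columns:
    print(name)
```

The resulting names can then be assigned to the DataFrame's columns after dropping the two header rows.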
There's a command-line interface too!
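This README does not show a CLI invocation, so the following is a hedged sketch rather than a reference. It assumes the `camelot` executable takes global `--pages`, `--format` and `--output` options followed by a `lattice` or `stream` subcommand and the PDF path; run `camelot --help` to confirm this for your installed version. `camelot_command` is a hypothetical helper that only builds the argument list.

```python
import os
import shutil
import subprocess


def camelot_command(pdf_path, output_path, fmt="csv", flavor="lattice", pages="1"):
    """Build the argument list for a Camelot CLI call (assumed option names)."""
    return [
        "camelot",
        "--pages", pages,
        "--format", fmt,
        "--output", output_path,
        flavor,          # subcommand: "lattice" or "stream"
        pdf_path,
    ]


cmd = camelot_command("foo.pdf", "foo.csv")
print(" ".join(cmd))

# Only run the command when the executable and the input file both exist.
if shutil.which("camelot") and os.path.exists("foo.pdf"):
    subprocess.run(cmd, check=True)
```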
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
Why Camelot?
- You are in control: unlike other libraries and tools, which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
- Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
- Export to multiple formats, including JSON, Excel and HTML.
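As a rough illustration of discarding bad tables by metrics, the sketch below filters on the `accuracy` and `whitespace` keys of each table's `parsing_report`, as shown in the example at the top of this README. The thresholds and the helper name `keep_good_tables` are assumptions chosen for illustration, and the `SimpleNamespace` objects are stand-ins so the snippet runs without a PDF; in real use you would pass the `TableList` returned by `camelot.read_pdf`.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return only the tables whose parsing report passes both thresholds."""
    good = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            good.append(table)
    return good


# Stand-in tables. The first report is the one shown in this README;
# the second is a made-up example of a poor extraction.
tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 80.0, "order": 2, "page": 1}),
]

good = keep_good_tables(tables)
print(f"kept {len(good)} of {len(tables)} tables")
```

Suitable thresholds depend on your documents, so treat the defaults here as a starting point to tune.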
See comparison with other PDF table extraction libraries and tools.
Installation
Using conda
The easiest way to install Camelot is to install it with conda, which is the package manager that the Anaconda distribution is built upon.
First, let's add the conda-forge channel to conda's config:
$ conda config --add channels conda-forge
Now, you can simply use conda to install Camelot:
$ conda install -c camelot-dev camelot-py
Using pip
After installing the dependencies (tk and ghostscript), you can simply use pip to install Camelot:
$ pip install "camelot-py[all]"
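If an installation misbehaves, a quick check of the environment can help narrow things down. The snippet below is a rough sketch using only the standard library: it looks for the `tkinter` module, a Ghostscript executable on `PATH` (the names `gs`, `gswin64c` and `gswin32c` are assumed), and an importable `camelot` package. It only tests for presence, not for working versions, and `check_environment` is a hypothetical helper.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether Camelot and its dependencies appear to be present."""
    ghostscript_names = ("gs", "gswin64c", "gswin32c")  # assumed executable names
    return {
        "tkinter": importlib.util.find_spec("tkinter") is not None,
        "ghostscript": any(shutil.which(name) for name in ghostscript_names),
        "camelot": importlib.util.find_spec("camelot") is not None,
    }


status = check_environment()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```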
From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/socialcopsdev/camelot
and install Camelot using pip:
$ cd camelot
$ pip install ".[all]"
Development
The Contributor's Guide has detailed information about contributing code, documentation, tests and more. We've included some basic information in this README.
Source code
You can check out the latest sources with:
$ git clone https://www.github.com/socialcopsdev/camelot
Setting up a development environment
You can install the development dependencies easily, using pip:
$ pip install "camelot-py[dev]"
Testing
After installation, you can run tests using:
$ python setup.py test