Requirements:
- Python 3 (probably at least 3.4)
- pipenv (pip3 install pipenv)
- tesseract (brew install tesseract, at least if you have a Mac and Homebrew working)
- imagemagick / ghostscript
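A quick way to confirm the command-line requirements are on your PATH is sketched below. The executable names (pipenv, tesseract, gs for ghostscript, and either magick or convert for imagemagick) are assumptions about a typical install, and missing_tools is a hypothetical helper, not something this repository provides.

```python
import shutil

# Each entry lists acceptable executable names for one requirement.
# These names are assumptions about a typical install, not read from the Makefile.
REQUIRED_TOOLS = {
    "pipenv": ["pipenv"],
    "tesseract": ["tesseract"],
    "imagemagick": ["magick", "convert"],
    "ghostscript": ["gs"],
}


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the requirement names for which no executable is found on PATH."""
    return [
        name
        for name, candidates in required.items()
        if not any(which(exe) for exe in candidates)
    ]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty result only means the executables were found; it does not check their versions.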
Using this repository:
The working/ folder contains a subfolder for each page. Each one contains a page.png file that's the
baseline page. The build will attempt to auto-deskew and crop each page. If you want to manually override
this process, create a page-handcrop.png file in that page's working directory. Some already have them.
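The override rule can be pictured with a small sketch. It assumes that a page-handcrop.png, when present, is used in place of the automatic deskew/crop result; the Makefile is the authority on what actually happens, and source_image and pages_with_handcrop are hypothetical helpers, not part of this repository.

```python
from pathlib import Path


def source_image(page_dir):
    """Pick the image to process for one page folder.

    Assumption: a page-handcrop.png, when present, replaces the automatic
    deskew/crop result; otherwise the baseline page.png is used.
    """
    page_dir = Path(page_dir)
    handcrop = page_dir / "page-handcrop.png"
    return handcrop if handcrop.is_file() else page_dir / "page.png"


def pages_with_handcrop(working="working"):
    """List page folders under working/ that already have a manual crop."""
    return sorted(p.parent.name for p in Path(working).glob("*/page-handcrop.png"))
```

Calling pages_with_handcrop() from the top level shows which pages already carry a manual override.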
pipenv install
make all
Run at the top level, the two commands above should attempt to deskew, crop, split, and OCR everything,
building CSV output (page.csv) in each page's working dir.
pipenv shell
make setup
make all
After that, concatenating the page.csv files from all the working dirs should work, for example with csvstack (part of csvkit):
csvstack working/*/page.csv > all_data.csv
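If csvstack isn't available, roughly the same concatenation can be sketched with the Python standard library. This assumes every page.csv starts with the same header row; stack_csvs is a hypothetical helper, not part of this repository, and it does not reproduce csvstack's other options.

```python
import csv
import glob


def stack_csvs(pattern, out_path):
    """Concatenate CSV files matching pattern, writing the header row once.

    Assumption: every file starts with the same header row.
    Returns the number of data rows written.
    """
    rows_written = 0
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError("header mismatch in " + path)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For example, stack_csvs("working/*/page.csv", "all_data.csv") would write the combined file at the top level.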