# cpi4all

Repository with processed files from the COVID CPI to facilitate analysis.
## Organization
On the Senate website it is possible to find the list of [all documents](https://legis.senado.leg.br/comissoes/docsRecCPI?codcol=2441) collected by the COVID CPI.
The table on the website has the following structure:
No | Archives | Date of receipt | Sender | Origin | Description | Box | In Reply |
---|---|---|---|---|---|---|---|
1 | Link1 | ... | ... | ... | ... | ... | ... |
2 | Link2/link3 | ... | ... | ... | ... | ... | ... |
These links lead to the download of PDF files with the documents in question.
In this repository you can find the txt version of these files. The filename is formed as `<Document No>_<link number>`.
For example:

- link1 = `1_1`, because it belongs to document No 1 and is the first link.
- link2 = `2_1`, because it belongs to document No 2 and is the first link on that row.
- link3 = `2_2`, because it belongs to document No 2 and is the second link on that row.
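The naming convention can be expressed as a tiny helper (the function name here is only for illustration):

```python
def txt_name(doc_no: int, link_no: int) -> str:
    """Build the repository filename for a given document number and link index."""
    return f"{doc_no}_{link_no}"

# The examples above:
assert txt_name(1, 1) == "1_1"  # link1: first link of document No 1
assert txt_name(2, 1) == "2_1"  # link2: first link of document No 2
assert txt_name(2, 2) == "2_2"  # link3: second link of document No 2
```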
The text version of all documents is in the `database/txts/` folder.
Note 1: Not all files have been converted yet.
Note 2: The conversion uses image recognition (OCR) and can perform quite poorly at times, producing misspellings or unrelated words.
## For developers
The scripts work in the following sequence:
`extract_rows.py`: goes to the Senate website and extracts the information from each row of the table. All data is saved in `database/rows`.
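A minimal sketch of what the row-extraction step has to do, using only the standard library; the actual script may use a different HTML parser, and the markup below is a toy stand-in for the real Senate table:

```python
from html.parser import HTMLParser

class RowParser(HTMLParser):
    """Collect the cell texts and link hrefs of each <tr> in the document table."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per table row
        self._cells = None    # text of the current row's cells
        self._links = None    # hrefs found in the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._links = [], []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")              # open a new cell
        elif tag == "a" and self._links is not None:
            href = dict(attrs).get("href")
            if href:
                self._links.append(href)

    def handle_data(self, data):
        if self._cells:                         # inside a row with an open cell
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "tr" and self._cells is not None:
            self.rows.append({"cells": self._cells, "links": self._links})
            self._cells = None

parser = RowParser()
parser.feed('<table><tr><td>1</td><td><a href="doc1.pdf">link1</a></td></tr></table>')
print(parser.rows)  # [{'cells': ['1', 'link1'], 'links': ['doc1.pdf']}]
```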
`extract_headers.py`: for each link on each row, collects metadata about the file (size, type) that will be useful later. These data are saved in `database/headers`.
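This kind of metadata can be fetched with a HEAD request, which returns size and type without downloading the body. The function and field names below are illustrative, not the script's actual API:

```python
from urllib.request import Request, urlopen

def fetch_header(url: str) -> dict:
    """HEAD-request a link and keep the metadata used by the later steps."""
    req = Request(url, method="HEAD")
    with urlopen(req) as resp:
        return {
            "url": url,
            "size": resp.headers.get("Content-Length"),
            "type": resp.headers.get("Content-Type"),
        }
```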
`download_pdfs.py`: downloads all PDFs described in `database/headers` and saves them in `database/pdfs`.
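A sketch of the download step, reusing the `<Document No>_<link number>` naming scheme; the helper name, the `.pdf` suffix, and the chunked streaming are assumptions, not the script's exact behavior:

```python
import os
from urllib.request import urlopen

def download_pdf(url: str, doc_no: int, link_no: int,
                 out_dir: str = "database/pdfs") -> str:
    """Stream one PDF to out_dir, named after its document number and link index."""
    os.makedirs(out_dir, exist_ok=True)
    dest = os.path.join(out_dir, f"{doc_no}_{link_no}.pdf")
    with urlopen(url) as resp, open(dest, "wb") as out:
        while chunk := resp.read(1 << 16):  # 64 KiB chunks
            out.write(chunk)
    return dest
```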
`convert_pdf_to_jpg.py`: converts all PDFs in `database/pdfs` to images in `database/jpgs`.
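One common way to do this conversion is poppler's `pdftoppm` CLI; whether this repository uses poppler or another tool is an assumption, so the helper below only builds the command:

```python
def pdftoppm_cmd(pdf_path: str, out_prefix: str) -> list[str]:
    """Build a pdftoppm invocation that writes one JPEG per page as <prefix>-N.jpg."""
    return ["pdftoppm", "-jpeg", pdf_path, out_prefix]

cmd = pdftoppm_cmd("database/pdfs/1_1.pdf", "database/jpgs/1_1")
# Run with subprocess.run(cmd, check=True) once poppler is installed.
```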
`convert_jpg_to_txt.py`: converts all images in `database/jpgs` to text in `database/txts`.
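The OCR step matches the shape of Tesseract's CLI, which writes its output to `<out_base>.txt`. Using Tesseract, and the `-l por` Portuguese model in particular, is an assumption based on the documents' origin:

```python
def tesseract_cmd(jpg_path: str, out_base: str, lang: str = "por") -> list[str]:
    """Build a tesseract invocation; the recognized text lands in <out_base>.txt."""
    return ["tesseract", jpg_path, out_base, "-l", lang]

cmd = tesseract_cmd("database/jpgs/1_1-1.jpg", "database/txts/1_1")
# Run with subprocess.run(cmd, check=True) once tesseract is installed.
```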
For performance reasons, only the `database/rows`, `database/headers`, and `database/txts` folders are saved in this repository.
## TODO

- Improve this readme :)
- Use GitHub Pages to generate a static site that allows searching all txts.
- Finish converting all files.
- Investigate files where the conversion was bad.
- Automatically extract dates and provide a JSON with the chronological order of the files.