Repository with processed files from the COVID CPI (the Brazilian Senate's parliamentary inquiry), to make analysis easier.
On the Senate website you can find the list of [all documents](https://legis.senado.leg.br/comissoes/docsRecCPI?codcol=2441) collected by the COVID CPI.
The table on the website has the following structure:
| No | Archives | Date of receipt | Sender | Origin | Description | Box | In Reply |
|----|----------|-----------------|--------|--------|-------------|-----|----------|
The links in the Archives column lead to the download of PDF files with the documents in question.
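For scripts that consume the table, it can help to model one row explicitly. A minimal sketch; the field names are my own, mirroring the translated column headers:

```python
from dataclasses import dataclass


@dataclass
class TableRow:
    """One row of the Senate CPI documents table.

    Column names are translated from the site; the field names
    themselves are my own invention, not part of the repository.
    """
    number: int          # "No": document number
    archives: list       # "Archives": one or more PDF download links
    date_of_receipt: str # "Date of receipt"
    sender: str          # "Sender"
    origin: str          # "Origin"
    description: str     # "Description"
    box: str             # "Box"
    in_reply: str        # "In Reply"
```

Each entry in `archives` corresponds to one downloadable PDF, and hence to one txt file under the naming scheme described below.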
This repository contains the txt version of these files. Each filename has the form
`<Document No>_<link number>`. For example:

- link 1 → `1_1`, because it belongs to document No 1 and is the first link on that row.
- link 2 → `2_1`, because it belongs to document No 2 and is the first link on that row.
- link 3 → `2_2`, because it belongs to document No 2 and is the second link on that row.
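The naming scheme can be captured in a pair of hypothetical helpers (the `.txt` extension is my assumption; the text above only specifies the `<Document No>_<link number>` stem):

```python
def txt_filename(doc_no: int, link_no: int) -> str:
    """Build the repository filename for one link of one document row.

    The '.txt' extension is an assumption, not stated in the README.
    """
    return f"{doc_no}_{link_no}.txt"


def parse_txt_filename(name: str) -> tuple:
    """Recover (document number, link number) from a name like '2_1.txt'."""
    stem = name.rsplit(".", 1)[0]          # drop the extension, if any
    doc_no, link_no = stem.split("_")
    return int(doc_no), int(link_no)
```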
The text version of all documents is in the database/txts/ folder.
Note 1: Not all files have been converted yet.
Note 2: The conversion uses image recognition (OCR), and its output can be quite poor at times, producing misspellings or unrelated words.
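One way to act on Note 2 is a crude quality heuristic that flags suspicious OCR output for manual review. This is a sketch of my own, not part of the repository's scripts, and the threshold for "bad" is left to the caller:

```python
import re

def ocr_quality_score(text: str) -> float:
    """Fraction of whitespace-separated tokens that look like plain words
    (letters only, including accented Latin letters, length >= 2).

    A rough heuristic, not a validated metric: a low score suggests the
    OCR output for that file deserves manual inspection.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for t in tokens if re.fullmatch(r"[A-Za-zÀ-ÿ]{2,}", t))
    return wordlike / len(tokens)
```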
The scripts run in the following sequence:

- `extract_rows.py`: goes to the Senate website and extracts the information from each row of the table; the extracted data is saved under `database/`.
- `extract_headers.py`: for each link on each row, fetches file metadata (size, type) that will be useful later; this data is saved in `database/headers`.
- `download_pdfs.py`: downloads all PDFs described in `database/headers` and saves them in `database/pdfs`.
- `convert_pdf_to_jpg.py`: converts all PDFs in `database/pdfs` to images in `database/jpgs`.
- `convert_jpg_to_txt.py`: converts all images in `database/jpgs` to text in `database/txts`.

For performance reasons, only the `database/txts` folder is saved in this repository.
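The sequence above could be driven by a small runner. The script names come from this README; the invocation details (plain `python script.py`, no arguments) are assumptions:

```python
import subprocess
import sys

# Order matters: each stage consumes the previous stage's output.
PIPELINE = [
    "extract_rows.py",        # scrape the table rows from the Senate site
    "extract_headers.py",     # fetch size/type metadata for each link
    "download_pdfs.py",       # save PDFs under database/pdfs/
    "convert_pdf_to_jpg.py",  # rasterize PDFs into database/jpgs/
    "convert_jpg_to_txt.py",  # OCR images into database/txts/
]

def run_pipeline() -> None:
    """Run each stage in order, stopping on the first failure."""
    for script in PIPELINE:
        subprocess.run([sys.executable, script], check=True)
```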
- Improve this README :)
- Use GitHub Pages to generate a static site that allows searching all txt files
- Finish converting all files
- Investigate files where the conversion was bad
- Automatically extract dates and provide a JSON file with the chronological order of the files
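The date-extraction item could start from a sketch like this. The dd/mm/yyyy pattern (common in Brazilian documents) and the output keys are my assumptions:

```python
import json
import re
from datetime import datetime

# Matches Brazilian-style dates such as 20/01/2021 (dd/mm/yyyy).
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")

def chronological_index(texts: dict) -> str:
    """Given {filename: extracted text}, take the first dd/mm/yyyy date
    found in each file and return a JSON array sorted chronologically.

    Files with no recognizable date are silently skipped; a real
    implementation would want to report them.
    """
    entries = []
    for name, text in texts.items():
        m = DATE_RE.search(text)
        if not m:
            continue
        day, month, year = m.groups()
        date = datetime(int(year), int(month), int(day))
        entries.append({"file": name, "date": date.strftime("%Y-%m-%d")})
    entries.sort(key=lambda e: e["date"])  # ISO dates sort lexicographically
    return json.dumps(entries, ensure_ascii=False, indent=2)
```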