# cpi4all

Repository with processed files from the COVID CPI to facilitate analysis.

## Organization

On the Senate website it is possible to find the list of [all documents](https://legis.senado.leg.br/comissoes/docsRecCPI?codcol=2441) collected by the COVID CPI.

The table on the website has the following structure:

| No | Archives | Date of receipt | Sender | Origin | Description | Box | In Reply |
|----|----------|-----------------|--------|--------|-------------|-----|----------|
| 1  | Link1 | ... | ... | ... | ... | ... | ... |
| 2  | Link2/Link3 | ... | ... | ... | ... | ... | ... |

These links lead to the download of PDF files with the documents in question.

In this repository you can find the txt version of these files. Each filename in this repository is formed as `<Document No>_<link number>`.
For example:

link1 = 1_1 because it refers to file No. 1 and is the first link on that line.

link2 = 2_1 because it refers to file No. 2 and is the first link on that line.

link3 = 2_2 because it refers to file No. 2 and is the second link on that line.

The text version of all documents is in the database/txts/ folder.

Examples:

File No. 1, first link: 1_1

File No. 4, fourth link: 4_4

Note 1: Not all files have been converted yet.

Note 2: The conversion relies on OCR (image recognition), and its quality can be poor at times, producing misspellings or unrelated words.
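
Because each filename encodes the document number and the link index, the contents of database/txts/ can be listed in a structured way with a few lines of Python. The sketch below is illustrative only: it assumes the files carry a .txt extension, and the function and regex names are not part of the repository.

```python
import os
import re

TXT_DIR = "database/txts"  # folder described above; adjust if your checkout differs

# Filenames follow <Document No>_<link number>, e.g. "2_1.txt" (the .txt suffix is assumed).
FILENAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")

def list_documents(txt_dir=TXT_DIR):
    """Return a sorted list of (document_no, link_no, path) tuples."""
    entries = []
    for name in os.listdir(txt_dir):
        match = FILENAME_RE.match(name)
        if not match:
            continue  # skip anything that does not follow the naming scheme
        doc_no, link_no = int(match.group(1)), int(match.group(2))
        entries.append((doc_no, link_no, os.path.join(txt_dir, name)))
    return sorted(entries)

if __name__ == "__main__":
    for doc_no, link_no, path in list_documents():
        print(f"Document {doc_no}, link {link_no}: {path}")
```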

## For developers

The scripts work in the following sequence:

  1. extract_rows.py: Goes to the Senate website and extracts the information from each row of the table. All data are saved in database/rows.
  2. extract_headers.py: For each link on each row, fetches metadata about the file (size, type) that will be useful later. These data are saved in database/headers (a minimal sketch of this kind of request appears after this list).
  3. download_pdfs.py: Downloads all PDFs described in database/headers and saves them in database/pdfs.
  4. convert_pdf_to_jpg.py: Converts all PDFs in database/pdfs to images in database/jpgs.
  5. convert_jpg_to_txt.py: Converts all images in database/jpgs to text in database/txts (steps 4 and 5 are sketched further below).
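
Step 2 amounts to reading HTTP response headers without downloading the file. Below is a minimal sketch of that idea, assuming the requests library; the actual extract_headers.py may be implemented differently, and the URL is only a placeholder.

```python
import json
import requests

def fetch_header_metadata(url):
    """Request only the HTTP headers of a link and keep the size/type fields."""
    response = requests.head(url, allow_redirects=True, timeout=30)
    response.raise_for_status()
    return {
        "url": url,
        "content_type": response.headers.get("Content-Type"),
        "content_length": response.headers.get("Content-Length"),
    }

if __name__ == "__main__":
    # Placeholder URL: the real links come from the rows extracted in step 1.
    print(json.dumps(fetch_header_metadata("https://example.org/document.pdf"), indent=2))
```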

For performance reasons, only the database/rows, database/headers and database/txts folders are saved in this repository.
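
Steps 4 and 5 of the sequence above follow the common PDF → image → OCR pattern. The sketch below assumes the pdf2image and pytesseract libraries (which need poppler and the tesseract binary installed); whether the repository's scripts use these exact libraries, paths, and parameters is an assumption.

```python
import os
from pdf2image import convert_from_path  # assumed library, not confirmed by the repo
import pytesseract                        # assumed library, not confirmed by the repo

PDF_DIR = "database/pdfs"
JPG_DIR = "database/jpgs"
TXT_DIR = "database/txts"

def pdf_to_text(pdf_name):
    """Render each page of a PDF to JPEG, then OCR every page into one .txt file."""
    base = os.path.splitext(pdf_name)[0]
    pages = convert_from_path(os.path.join(PDF_DIR, pdf_name), dpi=200)
    text_parts = []
    for i, page in enumerate(pages, start=1):
        page.save(os.path.join(JPG_DIR, f"{base}_{i}.jpg"), "JPEG")
        # "por" selects the Portuguese language pack, since the documents are in Portuguese.
        text_parts.append(pytesseract.image_to_string(page, lang="por"))
    with open(os.path.join(TXT_DIR, f"{base}.txt"), "w", encoding="utf-8") as out:
        out.write("\n".join(text_parts))

if __name__ == "__main__":
    for name in os.listdir(PDF_DIR):
        if name.lower().endswith(".pdf"):
            pdf_to_text(name)
```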

## TODO

  0. Improve this readme :)
  1. Use GitHub Pages to generate a static site that allows searching all the txt files.
  2. Finish converting all files.
  3. Investigate files where the conversion was bad.
  4. Automatically extract dates and provide a JSON file with the chronological order of the documents (see the sketch below).
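
For the last item, one possible starting point is a regex pass over the converted texts that looks for dd/mm/yyyy dates and writes a chronologically sorted JSON index. The pattern, filenames, and output format below are only a suggestion, and OCR errors (see Note 2 above) will require extra cleanup.

```python
import json
import os
import re
from datetime import datetime

TXT_DIR = "database/txts"
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")  # dd/mm/yyyy, the usual Brazilian format

def first_date(path):
    """Return the first plausible date found in a converted text, or None."""
    with open(path, encoding="utf-8") as handle:
        for match in DATE_RE.finditer(handle.read()):
            day, month, year = map(int, match.groups())
            try:
                return datetime(year, month, day)
            except ValueError:
                continue  # OCR noise such as 45/13/2021 is silently skipped
    return None

if __name__ == "__main__":
    dated = []
    for name in sorted(os.listdir(TXT_DIR)):
        if not name.endswith(".txt"):
            continue
        date = first_date(os.path.join(TXT_DIR, name))
        if date:
            dated.append({"file": name, "date": date.strftime("%Y-%m-%d")})
    dated.sort(key=lambda item: item["date"])
    with open("chronology.json", "w", encoding="utf-8") as out:
        json.dump(dated, out, ensure_ascii=False, indent=2)
```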