Public CNPJ Data

  • Official source of the Federal Revenue of Brazil, here.
  • Layout of files, here.

The Brazilian Federal Revenue Service provides databases with public data from the national registry of legal entities (CNPJ).

In general, they contain the same information that we can see on the CNPJ card, when we make an individual consultation, plus other data from Simples Nacional, partners, etc. Very rich analyzes can come out of this data, from economic, market to investigations.

This repository contains an ETL process to i) download the files; ii) unzip; iii) read, treat and iv) insert into a PostgreSQL relational database.


Infrastructure required:

  • Python 3.8 - libraries:

    • wget
    • pandas
    • ftplib
    • datetime
    • gzip
    • urllib
    • bs4
      -re
    • you
    • zipfile
    • sqlalchemy
    • psycopg2
    • team
  • Database:


How to use:

  1. With Postgre installed, start the server instance (may be local) and create the database as the database.sql file.

  2. Depending on your environment, replace the variables below in the ETL_collect_data_and_write_BD.py file:

    • output_files: destination directory for downloading files
    • user: database user created by the database.sql file
    • passw: DB user password
    • host: DB connection host
    • port: DB connection port
    • database: name of the database in the instance (Data_RFB - according to the database.sql file)
  3. Execute the file ETL_collect_data_and_write_BD.py and wait for the process to finish.

    • Files are large: depending on the infrastructure this might take many hours to complete.
    • Files from 05/08/2021: 4.68 GB compressed and 17.1 GB uncompressed.

Generated tables:

  • For more information, see the layout.

    • company: company registration data at the matrix level
    • establishment: analytical data of the company by unit / establishment (phone numbers, address, branch, etc.)
    • partners: registration data of company partners
    • simple: MEI and Simple National data
    • cnae: code and description of CNAEs
    • quals: qualification table of individuals - partners, responsible and legal representative.
    • natju: table of legal natures - code and description.
    • moti: table of reasons for the registration situation - code and description.
    • country: country table - code and description.
    • munic: table of municipalities - code and description.
  • Due to the volume of data, the company, establishment, partners and simple tables have indexes for the cnpj_basico column, which is the main linking key between them.

GitHub

https://github.com/aphonsoar/Receita_Federal_do_Brasil_-_Dados_Publicos_CNPJ