DATA WAREHOUSE ON AWS

The purpose of this project is to build a data warehouse to accommodate user activity data for the music streaming application ‘Sparkify’. The data warehouse is implemented on AWS cloud infrastructure with the following components –

  • AWS S3 – Source datasets.
  • AWS Redshift
    > for staging the extracted data
    > for storing the resultant data model (facts and dimensions)

The data model designed for this project is a star schema.

Table and attribute details are listed below, followed by a sketch of the corresponding table definitions –

  • Fact Table
    songplays: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

  • Dimension Tables
    users: user_id, first_name, last_name, gender, level
    songs: song_id, title, artist_id, year, duration
    artists: artist_id, name, location, latitude, longitude
    time: start_time, hour, day, week, month, year, weekday
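
For illustration, a minimal sketch of how two of these tables might be declared in sql_queries.py is shown below. The column types and the IDENTITY-based surrogate key are assumptions, not the project's exact definitions:

    # sql_queries.py (sketch) – hypothetical DDL for the fact table and one dimension.
    songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INTEGER IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INTEGER NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INTEGER,
        location    VARCHAR,
        user_agent  VARCHAR
    );
    """

    user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INTEGER PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     CHAR(1),
        level      VARCHAR
    );
    """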

The source datasets to be extracted into the dimensional model are two sets of JSON files in S3 –

  • Song data: s3://udacity-dend/song_data – Data for all songs with their respective artists available in application library.
  • Log data: s3://udacity-dend/log_data – Data for user events and activity on the application.
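
Staging on Redshift is typically done with COPY commands pointing at these S3 prefixes. A hedged sketch follows; the staging table names (staging_events, staging_songs), the IAM role placeholder, the JSONPaths file, and the region are assumptions:

    # Sketch of the Redshift COPY statements used to stage the raw JSON (assumed names/paths).
    staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'  -- assumed JSONPaths file
    REGION 'us-west-2';
    """

    staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
    """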

The data warehouse is implemented on Amazon Redshift using PostgreSQL-compatible SQL.

The ETL pipeline that extracts and loads data from source to target is implemented in Python.
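
A minimal sketch of what the pipeline driver might look like, assuming a dwh.cfg file holding the cluster credentials and query lists (copy_table_queries, insert_table_queries) exported by sql_queries.py; the config keys and list names are assumptions:

    # etl.py (sketch) – connect to the Redshift cluster, run the staging COPYs,
    # then the INSERT ... SELECT statements that populate the star schema.
    import configparser
    import psycopg2

    # Assumed lists of SQL strings defined in sql_queries.py.
    from sql_queries import copy_table_queries, insert_table_queries

    config = configparser.ConfigParser()
    config.read('dwh.cfg')  # assumed config file with a [CLUSTER] section

    conn = psycopg2.connect(
        host=config['CLUSTER']['HOST'],
        dbname=config['CLUSTER']['DB_NAME'],
        user=config['CLUSTER']['DB_USER'],
        password=config['CLUSTER']['DB_PASSWORD'],
        port=config['CLUSTER']['DB_PORT'],
    )
    cur = conn.cursor()

    # Stage raw JSON from S3, then load the fact and dimension tables.
    for query in copy_table_queries + insert_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()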

TODO steps:

  • Create sql_queries.py – to design and build the tables for the proposed data model
  • Run create_tables.py – to create the tables by executing the database queries from sql_queries.py
  • Run etl.py – to run the data pipeline built over the data model, which extracts, stages, and loads data from AWS S3 to the DWH on AWS Redshift
  • Design and fire analytical queries on the populated data model to gain insights into user events on the streaming application (an example query follows this list)
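
For example, one such analytical query could report the busiest listening locations for paid users. This is an illustrative query against the schema above, not one taken from the project:

    # Illustrative analytical query: top 5 locations by song plays for paid users.
    top_locations_query = """
    SELECT location, COUNT(*) AS plays
    FROM songplays
    WHERE level = 'paid'
    GROUP BY location
    ORDER BY plays DESC
    LIMIT 5;
    """
    cur.execute(top_locations_query)   # cur: cursor from the connection sketch above
    for location, plays in cur.fetchall():
        print(location, plays)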

GitHub

GitHub - abhishekthakur-1594/data_engineering – Projects that implement various aspects of Data Engineering.