Untidy

A Python library for uncleaning your dataset.

Overview

Have you ever wondered how to introduce specific problems to your clean data? Now you can apply our out-of-the-box solution to untidy your data according to your needs.

The solution can be used primarily for educational purposes, where clean example data is made more realistic.

Real-world data is often plagued by missing values, datetime issues, data type mismatches, and string encoding problems.

You can introduce the following problems to your data:

  • Adding missing values
  • Adding outliers
  • Changing the encoding of strings
  • Changing the data type of numeric columns to strings
  • Adding duplicate rows
  • Adding duplicate columns
  • Adding extra characters to strings
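To make a few of these problems concrete, here is a minimal sketch of three of them (missing values, duplicate rows, and numbers stored as strings) written in plain pandas and NumPy. It is not how untidy implements them; the function names `add_missing_values`, `add_duplicate_rows`, and `numbers_to_strings` and the sample data are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def add_missing_values(df: pd.DataFrame, fraction: float,
                       rng: np.random.Generator) -> pd.DataFrame:
    """Return a copy of df with round(fraction * df.size) cells set to NaN."""
    n_cells = int(round(fraction * df.size))
    # Pick distinct cell positions in the flattened table.
    flat_positions = rng.choice(df.size, size=n_cells, replace=False)
    mask = np.zeros(df.size, dtype=bool)
    mask[flat_positions] = True
    # DataFrame.mask replaces cells where the condition is True with NaN.
    return df.mask(mask.reshape(df.shape))


def add_duplicate_rows(df: pd.DataFrame, n_rows: int,
                       rng: np.random.Generator) -> pd.DataFrame:
    """Return df with n_rows randomly chosen rows appended again."""
    positions = rng.integers(0, len(df), size=n_rows)
    return pd.concat([df, df.iloc[positions]], ignore_index=True)


def numbers_to_strings(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with one numeric column stored as text."""
    out = df.copy()
    out[column] = out[column].astype(str)
    return out


# Hypothetical clean data used only for this illustration.
clean = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "city": ["Oslo", "Lima", "Pune", "Kyiv"],
})
rng = np.random.default_rng(0)

with_nans = add_missing_values(clean, fraction=0.25, rng=rng)
with_dupes = add_duplicate_rows(clean, n_rows=2, rng=rng)
as_text = numbers_to_strings(clean, "age")
```

Each helper returns a new DataFrame and leaves the input untouched, so the clean data stays available for comparison.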

The package is designed to work with pandas DataFrames.

# clean_df is any pandas DataFrame you have already loaded
from untidy import untidyfy
messy_df = untidyfy(clean_df, 
                    corruption_level=4, # how much mess you want (0-10)
                    nans=True,
                    outliers=True,
                    text_noise=True,
                    mess_with_numbers=True,
                    mess_with_string_encodings=True,
                    duplicate_rows=True,
                    duplicate_columns=True)
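After untidying, it can help to check how much mess was actually introduced. The sketch below uses only pandas and does not call untidy; `mess_report` is a hypothetical helper, and the small DataFrame stands in for whatever `untidyfy` returns.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def mess_report(df: pd.DataFrame) -> dict:
    """Summarize a few common data problems in df."""
    return {
        # Total number of NaN/None cells across all columns.
        "missing_cells": int(df.isna().sum().sum()),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns whose dtype is not numeric (text, mixed, and so on).
        "non_numeric_columns": sum(
            not is_numeric_dtype(dtype) for dtype in df.dtypes
        ),
    }


# Hypothetical stand-in for a messy DataFrame.
example_df = pd.DataFrame({
    "age": [23.0, None, 41.0, 41.0],
    "city": ["Oslo", "Lima", "Pune", "Pune"],
    "score": ["1", "2", "3", "3"],  # numbers stored as strings
})
report = mess_report(example_df)
```

Running the same report on `clean_df` and `messy_df` and comparing the two dictionaries gives a quick view of what changed.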

Installation

Untidy can be installed with pip, either directly or by downloading the untidy-{release-version}.tar.gz file from the release section and running the command

pip install untidy-{release-version}.tar.gz
