compose
"Build better training examples in a fraction of the time."
Compose is a machine learning tool for automated prediction engineering. It allows you to structure prediction problems and generate labels for supervised learning. An end user defines an outcome of interest by writing a labeling function, then runs a search to automatically extract training examples from historical data. The result is then provided to Featuretools for automated feature engineering and subsequently to EvalML for automated machine learning. The workflow of an applied machine learning engineer then becomes three steps: prediction engineering with Compose, feature engineering with Featuretools, and machine learning with EvalML.
Install
Compose is available on PyPI and Conda-forge for Python 3.6 or later.
pip
To install from PyPI, run the command:
pip install composeml
conda
To install from Conda-forge, run the command:
conda install -c conda-forge composeml
Example
Will a customer spend more than 300 in the next hour of transactions?
In this example, we automatically generate new training examples from a historical dataset of transactions.
import composeml as cp
df = cp.demos.load_transactions()
df = df[df.columns[:7]]
df.head()
transaction_id | session_id | transaction_time | product_id | amount | customer_id | device |
---|---|---|---|---|---|---|
298 | 1 | 2014-01-01 00:00:00 | 5 | 127.64 | 2 | desktop |
10 | 1 | 2014-01-01 00:09:45 | 5 | 57.39 | 2 | desktop |
495 | 1 | 2014-01-01 00:14:05 | 5 | 69.45 | 2 | desktop |
460 | 10 | 2014-01-01 02:33:50 | 5 | 123.19 | 2 | tablet |
302 | 10 | 2014-01-01 02:37:05 | 5 | 64.47 | 2 | tablet |
First, we represent the prediction problem with a labeling function and a label maker.
def total_spent(ds):
    # ds holds the transactions that fall inside one window for one customer.
    return ds['amount'].sum()
label_maker = cp.LabelMaker(
    target_entity="customer_id",
    time_index="transaction_time",
    labeling_function=total_spent,
    window_size="1h",
)
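To make the roles of the labeling function and `window_size` concrete, here is a minimal sketch in plain Python of what a single one-hour window computes. It is not how Compose is implemented: the `window_total` helper is hypothetical, the half-open window boundary is an assumption, and the rows are only the five sample transactions shown above, so totals from the full dataset may differ.

```python
from datetime import datetime, timedelta

# The five sample transactions for customer 2 shown in the table above,
# as (transaction_time, amount) pairs.
rows = [
    (datetime(2014, 1, 1, 0, 0, 0), 127.64),
    (datetime(2014, 1, 1, 0, 9, 45), 57.39),
    (datetime(2014, 1, 1, 0, 14, 5), 69.45),
    (datetime(2014, 1, 1, 2, 33, 50), 123.19),
    (datetime(2014, 1, 1, 2, 37, 5), 64.47),
]


def window_total(rows, start, size=timedelta(hours=1)):
    """Sum the amounts whose timestamp falls in [start, start + size).

    Hypothetical helper: it mirrors what total_spent would return
    for one window, assuming a half-open interval.
    """
    end = start + size
    return sum(amount for time, amount in rows if start <= time < end)


# Total for the window starting at midnight and the one starting at 02:00.
first_hour = window_total(rows, datetime(2014, 1, 1, 0, 0, 0))
third_hour = window_total(rows, datetime(2014, 1, 1, 2, 0, 0))
print(round(first_hour, 2), round(third_hour, 2))
```

Each window yields one numeric value per customer and start time; the label maker's job is to slide such windows over the data and record the value with its cutoff time.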
Then, we run a search to automatically generate the training examples.
label_times = label_maker.search(
    df.sort_values('transaction_time'),
    num_examples_per_instance=2,
    minimum_data='2014-01-01',
    drop_empty=False,
    verbose=False,
)
label_times = label_times.threshold(300)
label_times.head()
customer_id | time | total_spent |
---|---|---|
1 | 2014-01-01 00:00:00 | True |
1 | 2014-01-01 01:00:00 | True |
2 | 2014-01-01 00:00:00 | False |
2 | 2014-01-01 01:00:00 | False |
3 | 2014-01-01 00:00:00 | False |
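The `True`/`False` values in this table come from the thresholding step, which turns each numeric total into a binary label. A rough sketch of the idea in plain Python follows. The `apply_threshold` helper and the sample totals are hypothetical, and the strict comparison is an assumption chosen to match the question "more than 300".

```python
def apply_threshold(totals, threshold):
    """Return True for each total strictly above the threshold."""
    return [total > threshold for total in totals]


# Hypothetical window totals, not taken from the dataset.
totals = [254.48, 187.66, 310.0, 300.0]
labels = apply_threshold(totals, 300)
print(labels)
```

Under this assumption a total of exactly 300 is labeled `False`, since the customer did not spend more than 300.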