Python package for Uplift Modeling in real-world business; applicable for both A/B testing and observational data.
Simply building a Machine Learning model and running promotion campaigns for the customers who are predicted to buy a product, for example, is not efficient.
Some customers will buy a product anyway even without promotion campaigns (called "Sure things").
It is even possible that the campaign triggers some customers to churn (called "Do Not Disturbs" or "Sleeping Dogs").
The solution is Uplift Modeling.
What is Uplift Modeling?
Uplift Modeling is a Machine Learning technique to find which customers (individuals) should be targeted ("treated") and which customers should not be targeted.
Uplift Modeling is also known as persuasion modeling, incremental modeling, treatment effects modeling, true lift modeling, and net modeling.
Applications of Uplift Modeling for business include:
- Increase revenue by finding which customers should be targeted for advertising/marketing campaigns and which customers should not.
- Retain revenue by finding which customers should be contacted to prevent churn and which customers should not.
How does Uplift Modeling work?
Uplift Modeling estimates uplift scores (a.k.a. CATE: Conditional Average Treatment Effect or ITE:
Individual Treatment Effect). An uplift score is the estimated increase in the conversion rate
caused by the campaign.
Suppose you are in charge of a marketing campaign to sell a product. If the estimated conversion
rate (probability of buying the product) of a customer is 50% if targeted and 40% if not targeted,
the uplift score of the customer is (50 - 40) = +10 percentage points.
Likewise, if the estimated conversion rate is 20% if targeted and 80% if not targeted,
the uplift score is (20 - 80) = -60 percentage points (a negative value).
Uplift scores range from -100 to +100 percentage points (-1 to +1).
It is recommended to target customers with high uplift scores and avoid customers with negative
uplift scores to optimize the marketing campaign.
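The arithmetic above can be written out as a minimal sketch. The function name `uplift_score` and the customer labels are hypothetical, chosen only for illustration; they are not part of CausalLift.

```python
def uplift_score(p_if_treated, p_if_untreated):
    """Uplift = estimated conversion rate if treated minus the rate if untreated."""
    return p_if_treated - p_if_untreated


# Hypothetical customers: (label, rate if targeted, rate if not targeted)
customers = [
    ("customer_a", 0.50, 0.40),  # the +10 percentage point example
    ("customer_b", 0.20, 0.80),  # the -60 percentage point example
]

for label, p_treated, p_untreated in customers:
    score = uplift_score(p_treated, p_untreated)
    # Target only customers whose estimated uplift is positive.
    decision = "target" if score > 0 else "do not target"
    print(f"{label}: uplift = {score:+.2f} ({decision})")
```

A positive score suggests the campaign helps for that customer; a negative score suggests the campaign would lower the chance of conversion.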
What are the advantages of the "CausalLift" package?
- CausalLift works with both A/B testing results and observational datasets.
- CausalLift can output intuitive metrics for evaluation.
Why was CausalLift developed?
In short, for use in real-world business.
Existing packages for Uplift Modeling assume the dataset is from A/B Testing (a.k.a. Randomized
Controlled Trial). In real-world business, however, observational datasets in which treatment
(campaign) targets were not chosen randomly are more common especially in the early stage of
evidence-based decision making. CausalLift supports observational datasets using a basic
methodology in Causal Inference called "Inverse Probability Weighting" based on the assumption that
propensity to be treated can be inferred from the available features.
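As a rough illustration of Inverse Probability Weighting, the sketch below weights each sample by the inverse of the probability of the treatment it actually received, then compares weighted conversion rates. It assumes propensity scores are already available; the function names and the data are hypothetical, and this is not a description of CausalLift's internal implementation.

```python
def ipw_weights(treatment, propensity):
    """Weight = 1/p for treated samples and 1/(1-p) for untreated samples."""
    return [
        1.0 / p if t == 1 else 1.0 / (1.0 - p)
        for t, p in zip(treatment, propensity)
    ]


def weighted_conversion_rate(outcome, weights, treatment, arm):
    """Weighted mean outcome within one arm (arm=1 treated, arm=0 untreated)."""
    num = sum(w * y for y, w, t in zip(outcome, weights, treatment) if t == arm)
    den = sum(w for w, t in zip(weights, treatment) if t == arm)
    return num / den


# Hypothetical observational data: treatment was not assigned at random,
# so the propensity to be treated differs between customers.
treatment = [1, 1, 1, 0, 0, 0]
outcome = [1, 0, 1, 0, 1, 0]
propensity = [0.8, 0.8, 0.5, 0.5, 0.2, 0.2]

weights = ipw_weights(treatment, propensity)
treated_rate = weighted_conversion_rate(outcome, weights, treatment, arm=1)
untreated_rate = weighted_conversion_rate(outcome, weights, treatment, arm=0)
print(f"weighted uplift estimate: {treated_rate - untreated_rate:+.3f}")
```

Samples that received an unlikely treatment get larger weights, which makes the weighted treated and untreated groups look more like a randomized sample, provided the propensity scores are reasonable.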
Uplift Modeling poses 2 challenges: explainability of the model and explainability of the
evaluation. To address these challenges, CausalLift utilizes a basic methodology of Uplift Modeling
called the Two Models approach (training 2 models independently for treated and untreated samples
to compute the CATE (Conditional Average Treatment Effect) or uplift scores).
- [Explainability of the model] Since it is relatively simple, it is less challenging to
explain how it works to stakeholders in the business.
- [Explainability of evaluation] To evaluate Uplift Modeling, metrics such as Qini and AUUC
(Area Under the Uplift Curve) are used in research, but these metrics are difficult to explain
to the stakeholders. For business, a metric that can estimate how much more profit can be
earned is more practical. Since CausalLift adopted the Two Models approach, the 2 models can be
reused to simulate the outcome of following the recommendation by the Uplift Model and can
estimate how much conversion rate (the proportion of people who took the desired action such as
buying a product) will increase using the uplift model.
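The idea of the Two Models approach, and of reusing the 2 models to simulate a recommendation, can be sketched as follows. To stay dependency-free, a toy "model" (the observed conversion rate per customer segment) stands in for a real classifier such as XGBoost; `fit_rate_model`, the segments, and the data are all hypothetical and do not reflect CausalLift's actual code.

```python
from collections import defaultdict


def fit_rate_model(rows):
    """Toy stand-in for a classifier: observed conversion rate per segment."""
    totals, hits = defaultdict(int), defaultdict(int)
    for segment, outcome in rows:
        totals[segment] += 1
        hits[segment] += outcome
    return {seg: hits[seg] / totals[seg] for seg in totals}


# Hypothetical training data: (segment, treatment, outcome)
train = [
    ("new", 1, 1), ("new", 1, 1), ("new", 0, 0), ("new", 0, 1),
    ("loyal", 1, 0), ("loyal", 1, 1), ("loyal", 0, 1), ("loyal", 0, 1),
]

# Two Models approach: fit one model on treated and one on untreated samples.
model_treated = fit_rate_model([(s, y) for s, t, y in train if t == 1])
model_untreated = fit_rate_model([(s, y) for s, t, y in train if t == 0])

# Uplift score per segment = P(outcome | treated) - P(outcome | untreated).
uplift = {s: model_treated[s] - model_untreated[s] for s in model_treated}

# Reuse both models to simulate following the recommendation:
# treat a customer only when the estimated uplift is positive.
test_segments = ["new", "new", "loyal", "loyal"]
n = len(test_segments)
rate_treat_all = sum(model_treated[s] for s in test_segments) / n
rate_treat_none = sum(model_untreated[s] for s in test_segments) / n
rate_recommended = sum(
    model_treated[s] if uplift[s] > 0 else model_untreated[s]
    for s in test_segments
) / n

print("uplift per segment:", uplift)
print(f"treat all: {rate_treat_all:.2f}, treat none: {rate_treat_none:.2f}, "
      f"follow recommendation: {rate_recommended:.2f}")
```

Comparing the simulated conversion rate under the recommendation with the rate under the existing targeting gives a figure that is easier to communicate to stakeholders than Qini or AUUC.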
What kind of data can be fed to CausalLift?
Table data including the following columns:
- Features
  - a.k.a. independent variables, explanatory variables, covariates
  - e.g. customer gender, age range, etc.
  - Note: Categorical variables need to be one-hot encoded so propensity can be estimated using logistic regression. pandas.get_dummies can be used.
- Outcome: binary (0 or 1)
  - a.k.a. dependent variable, target variable, label
  - e.g. whether the customer bought a product, clicked a link, etc.
- Treatment: binary (0 or 1)
  - a variable you can control and want to optimize for each individual (customer)
  - a.k.a. intervention
  - e.g. whether an advertising campaign was executed, whether a discount was offered, etc.
  - Note: if you cannot find a treatment column, you may need to ask stakeholders to get the data, which might take hours to years.
- [Optional] Propensity: continuous between 0 and 1
  - propensity (or probability) to be treated for observational datasets (not needed for A/B Testing results)
  - If not provided, CausalLift can estimate it from the features using logistic regression.
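The column layout described above can be sketched with pandas. The data, the feature names, and the column names `Treatment` and `Outcome` are illustrative assumptions; check the CausalLift API reference for the column names the package expects.

```python
import pandas as pd

# Hypothetical raw data with two categorical features.
raw_df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "age_range": ["20s", "30s", "30s", "20s"],
    "Treatment": [1, 0, 1, 0],  # whether the campaign was executed
    "Outcome": [1, 0, 0, 1],    # whether the customer bought the product
})

# One-hot encode only the categorical feature columns;
# dtype=int yields 0/1 columns instead of booleans.
df = pd.get_dummies(raw_df, columns=["gender", "age_range"], dtype=int)
print(df)
```

After encoding, every feature column is numeric, which is what a logistic regression for propensity estimation requires.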
Example table data
How to install CausalLift?
Option 1: install from PyPI
pip3 install causallift
Option 2: install from the GitHub repository
pip3 install git+https://github.com/Minyus/causallift.git
Option 3: clone the GitHub repository, cd into the downloaded repository, and run:
python setup.py install
How to install the latest pre-release version 1.x of CausalLift?
Option 1: install from the GitHub repository
pip3 install git+git://github.com/Minyus/[email protected]
Option 2: clone the v1.0 branch of the GitHub repository, cd into the downloaded repository, and run:
python setup.py install
Optional but recommended dependencies
How to use CausalLift?
Please see the demo code in Google Colab (free cloud CPU/GPU environment).
To run the code, navigate to "Runtime" >> "Run all".
To download the notebook file, navigate to "File" >> "Download .ipynb".
Here are the basic steps to use.
""" Step 0. Import CausalLift
"""
from causallift import CausalLift

""" Step 1. Feed datasets and optionally compute estimated propensity scores
using logistic regression if set enable_ipw = True.
"""
cl = CausalLift(train_df, test_df, enable_ipw=True)

""" Step 2. Train 2 classification models (XGBoost) for treated and untreated
samples independently and compute estimated CATE (Conditional Average Treatment
Effect), ITE (Individual Treatment Effect), or uplift score.
"""
train_df, test_df = cl.estimate_cate_by_2_models()

""" Step 3. Estimate how much conversion rate will increase by selecting treatment
(campaign) targets as recommended by the uplift modeling.
"""
estimated_effect_df = cl.estimate_recommendation_impact()
CausalLift flow diagram
New features introduced in version 1.0.0
CausalLift version 1.0.0 adopted Kedro to add the following new features:
- [Parallel execution] Train the 2 models in parallel
- [File management] Save and load intermediate files such as the trained models
- [Documentation] Generate the API document by Sphinx and visualize the process flow
Other enhancements include:
- [Logging] Show and/or log processing status such as timestamp and the running task
- [Model options] Specify models other than XGBoost and Logistic Regression for uplift
modeling and propensity modeling, respectively.
Details about the parameters
Please see [CausalLift API reference].
Supported Python versions
- Python 3.5
- Python 3.6
- Python 3.7