missing blocks for sklearn pipelines.
We love scikit-learn, but we often find ourselves writing custom transformers, metrics and models. This project aims to consolidate these into a single package with proper code quality and testing. It is a collaboration between multiple companies in the Netherlands.
Note that we're not formally affiliated with the scikit-learn project at all. The same holds for LEGO.
Install scikit-lego via pip with
pip install scikit-lego
Alternatively, to edit and contribute you can fork/clone and run:
pip install -e ".[dev]"
python setup.py develop
The documentation can be found here.
# the scikit-learn stuff we love
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# the scikit-lego stuff we add
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

...
Here's a list of features that this library currently offers:
sklego.preprocessing.PatsyTransformer applies a patsy formula
sklego.preprocessing.RandomAdder adds randomness in training
sklego.preprocessing.PandasTypeSelector selects columns based on pandas type
sklego.preprocessing.ColumnSelector selects columns based on column name
sklego.preprocessing.ColumnCapper limits extreme values of the model features
sklego.preprocessing.OrthogonalTransformer makes all features linearly independent
sklego.dummy.RandomRegressor benchmark that predicts random values
sklego.naive_bayes.GaussianMixtureNB classifies by training a 1D GMM per column per class
sklego.mixture.GMMClassifier classifies by training a GMM per class
sklego.mixture.GMMOutlierDetector detects outliers based on a trained GMM
sklego.pipeline.DebugPipeline adds debug information to make debugging easier
sklego.meta.DecayEstimator adds decay to the sample_weight that the model accepts
sklego.meta.GroupedEstimator can split the data into runs and run a model on each
sklego.meta.EstimatorTransformer adds a model output as a feature
sklego.metrics.correlation_score calculates correlation between model output and feature
sklego.metrics.p_percent_score proxy for model fairness with regard to a sensitive attribute
sklego.datasets.load_chicken loads in the joyful chickweight dataset
sklego.datasets.make_simpleseries makes a simulated time series
sklego.pandas_utils.log_step a simple logger-decorator for pandas pipeline steps
sklego.pandas_utils.add_lags adds lag values of certain columns in pandas
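To give a feel for what a "logger-decorator for pipeline steps" means, here is a minimal sketch using only the standard library. It is not sklego's implementation: the name `log_step_sketch`, the log format, and the plain-Python `drop_missing` step are all assumptions made for illustration.

```python
import functools
import time


def log_step_sketch(func):
    """Hypothetical stand-in for a logger-decorator; not sklego's implementation."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # pandas DataFrames expose .shape; fall back to len() for plain sequences
        size = getattr(result, "shape", None)
        if size is None:
            size = (len(result),)
        print(f"[{func.__name__}] size={size} time={elapsed:.4f}s")
        return result

    return wrapper


@log_step_sketch
def drop_missing(rows):
    # Plain-Python stand-in for a pandas step such as df.dropna()
    return [row for row in rows if None not in row.values()]


rows = [{"weight": 42, "diet": 1}, {"weight": None, "diet": 2}]
cleaned = drop_missing(rows)
```

Because the wrapper returns the step's result unchanged, a decorated function can still be chained like any other step; the logging is a side effect only.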
We want to be rather open in what we accept, but we do require three things before a feature is added to the project:
- any new feature contributes towards a demonstrable real-world use case
- any new feature passes standard unit tests (we have a few for transformers and predictors)
- the feature has been discussed in the issue list beforehand
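As a rough idea of what basic checks for a transformer might cover, here is a self-contained sketch. The project's actual test suite is not shown here, so the checks below are assumptions, and both `ClipTransformer` and `check_transformer_basics` are hypothetical names that only mimic the `fit`/`transform` convention.

```python
class ClipTransformer:
    """Hypothetical transformer that clips values to the range seen in fit."""

    def fit(self, X, y=None):
        # Learn the per-column minimum and maximum from the training rows
        columns = list(zip(*X))
        self.min_ = [min(col) for col in columns]
        self.max_ = [max(col) for col in columns]
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, X):
        # Clip each value into the fitted range, building new rows so X is not mutated
        return [
            [min(max(value, lo), hi) for value, lo, hi in zip(row, self.min_, self.max_)]
            for row in X
        ]


def check_transformer_basics(transformer, X_train, X_new):
    """Assumed checks for illustration, not the project's actual tests."""
    snapshot = [list(row) for row in X_new]
    assert transformer.fit(X_train) is transformer  # fit returns the estimator
    out = transformer.transform(X_new)
    assert len(out) == len(X_new)  # one output row per input row
    assert X_new == snapshot  # input data left untouched
    return out


X_train = [[0.0, 10.0], [1.0, 20.0]]
X_new = [[-5.0, 15.0], [3.0, 99.0]]
clipped = check_transformer_basics(ClipTransformer(), X_train, X_new)
```

Checks like these are generic on purpose: they apply to any object following the `fit`/`transform` convention, which is why a shared set of tests can cover many transformers at once.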