sk-dist
sk-dist is a Python module for machine learning built on top of scikit-learn and distributed under the Apache 2.0 software license. It can be thought of as "distributed scikit-learn": its core functionality is to extend scikit-learn's built-in joblib parallelization of meta-estimator training to Spark.
Main Features
- Distributed Training - sk-dist parallelizes the training of
  scikit-learn meta-estimators with PySpark. This allows distributed
  training of these estimators without being constrained by the
  physical resources of any one machine. In all cases, Spark artifacts
  are automatically stripped from the fitted estimator. These
  estimators can then be pickled and un-pickled for prediction tasks,
  operating identically at predict time to their scikit-learn
  counterparts. Supported tasks are:

  - Grid Search:
    `Hyperparameter optimization techniques <https://scikit-learn.org/stable/modules/grid_search.html>`__,
    particularly
    `GridSearchCV <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV>`__
    and
    `RandomizedSearchCV <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV>`__,
    are distributed such that each parameter set candidate is trained
    in parallel.
  - Multiclass Strategies:
    `Multiclass classification strategies <https://scikit-learn.org/stable/modules/multiclass.html>`__,
    particularly
    `OneVsRestClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier>`__
    and
    `OneVsOneClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html#sklearn.multiclass.OneVsOneClassifier>`__,
    are distributed such that each binary problem is trained in
    parallel.
  - Tree Ensembles:
    `Decision tree ensembles <https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees>`__
    for classification and regression, particularly
    `RandomForest <https://scikit-learn.org/stable/modules/ensemble.html#random-forests>`__
    and
    `ExtraTrees <https://scikit-learn.org/stable/modules/ensemble.html#extremely-randomized-trees>`__,
    are distributed such that each tree is trained in parallel.
- Distributed Prediction - sk-dist provides a prediction module which
  builds
  `vectorized UDFs <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs>`__
  for
  `PySpark <https://spark.apache.org/docs/latest/api/python/index.html>`__
  `DataFrames <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame>`__
  using fitted scikit-learn estimators. This distributes the
  ``predict`` and ``predict_proba`` methods of scikit-learn estimators,
  enabling large-scale prediction with scikit-learn.
- Feature Encoding - sk-dist provides a flexible feature encoding
  utility called ``Encoderizer`` which encodes mixed-type feature
  spaces using either default behavior or user-defined customizable
  settings. It is particularly aimed at text features, but it
  additionally handles numeric and dictionary type feature spaces.
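
The Grid Search item above states that each parameter set candidate is trained in parallel. The following is a minimal sketch of that fan-out pattern using only the Python standard library. It makes two loud assumptions: a thread pool stands in for Spark executors, and a one-feature ridge regression stands in for a scikit-learn estimator. The names ``fit_ridge``, ``evaluate_candidate`` and ``parallel_grid_search`` are hypothetical and are not part of sk-dist.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_ridge(xs, ys, alpha):
    """Closed-form 1-D ridge regression without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def evaluate_candidate(alpha, train, valid):
    """Fit one parameter candidate and return its validation error."""
    weight = fit_ridge(train[0], train[1], alpha)
    xs, ys = valid
    mse = sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return {"alpha": alpha, "weight": weight, "mse": mse}


def parallel_grid_search(alphas, train, valid, max_workers=4):
    # Every candidate is an independent task, so the fits can run concurrently.
    # Assumption: the thread pool plays the role of Spark executors here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(
            pool.map(lambda a: evaluate_candidate(a, train, valid), alphas)
        )
    # The best candidate is chosen only after all fits have returned.
    best = min(results, key=lambda r: r["mse"])
    return best, results


# Toy data following y = 2x, so the unregularized candidate fits exactly.
train = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
valid = ([4.0], [8.0])
best, results = parallel_grid_search([0.0, 1.0, 10.0, 100.0], train, valid)
print(best["alpha"], best["mse"])
```

The same shape applies to the other supported tasks: one task per binary problem for the multiclass strategies, and one task per tree for the ensembles.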
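
The Distributed Prediction item above describes wrapping a fitted estimator in a vectorized UDF, a function that is applied to batches of rows rather than to one row at a time. The sketch below illustrates only that batching idea in plain Python. It is not the sk-dist prediction module: ``ThresholdModel``, ``make_batch_predictor`` and ``predict_in_batches`` are hypothetical names, and the thread pool is an assumed stand-in for Spark partitions.

```python
from concurrent.futures import ThreadPoolExecutor


class ThresholdModel:
    """Hypothetical stand-in for a fitted estimator with a predict method."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        # One label per row: 1 if the feature sum exceeds the threshold.
        return [1 if sum(row) > self.threshold else 0 for row in rows]


def make_batch_predictor(model):
    """Close over a fitted model and return a function that scores a batch."""
    def predict_batch(batch):
        return model.predict(batch)
    return predict_batch


def predict_in_batches(model, rows, batch_size=2, max_workers=4):
    predict_batch = make_batch_predictor(model)
    # Split the rows into contiguous batches, as a partitioned dataset would be.
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = pool.map(predict_batch, batches)  # map preserves batch order
        return [label for part in parts for label in part]


model = ThresholdModel(threshold=1.0)
rows = [[0.2, 0.3], [0.9, 0.4], [1.5, 0.0], [0.1, 0.1], [0.6, 0.6]]
print(predict_in_batches(model, rows))
```

Because each batch is scored independently, the batched result matches a single call to ``model.predict(rows)`` on the full input.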
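
The Feature Encoding item above mentions text, numeric and dictionary type features in one feature space. To make that idea concrete, here is a rough sketch of type-driven encoding in plain Python. It is an illustration only and does not reproduce how ``Encoderizer`` works or what its default behavior is; the function names and the ``field__token`` column naming are assumptions made for this example.

```python
def encode_value(name, value):
    """Turn one raw field into numeric features, chosen by its Python type."""
    if isinstance(value, (int, float)):
        # Numeric fields pass through as a single feature.
        return {name: float(value)}
    if isinstance(value, str):
        # Text fields become token counts (a tiny bag-of-words).
        feats = {}
        for token in value.lower().split():
            key = f"{name}__{token}"
            feats[key] = feats.get(key, 0.0) + 1.0
        return feats
    if isinstance(value, dict):
        # Dictionary fields contribute one feature per key.
        return {f"{name}__{k}": float(v) for k, v in value.items()}
    raise TypeError(f"unsupported type for field {name!r}: {type(value).__name__}")


def encode_records(records):
    """Encode a list of mixed-type records into column names and a dense matrix."""
    encoded = []
    for record in records:
        row = {}
        for name, value in record.items():
            row.update(encode_value(name, value))
        encoded.append(row)
    # A sorted union of all feature names gives a stable column order.
    columns = sorted({key for row in encoded for key in row})
    matrix = [[row.get(col, 0.0) for col in columns] for row in encoded]
    return columns, matrix


records = [
    {"title": "red shoes", "price": 20, "attrs": {"size": 9}},
    {"title": "blue shoes blue", "price": 35.5, "attrs": {"size": 10, "wide": 1}},
]
columns, matrix = encode_records(records)
print(columns)
print(matrix)
```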