
Distributed scikit-learn meta-estimators in PySpark

sk-dist

sk-dist is a Python module for machine learning built on top of scikit-learn and is distributed under the Apache 2.0 software license. The sk-dist module can be thought of as "distributed scikit-learn": its core functionality extends scikit-learn's built-in joblib parallelization of meta-estimator training to Spark.

Main Features

  • Distributed Training - sk-dist parallelizes the training of
    scikit-learn meta-estimators with PySpark. This allows
    distributed training of these estimators without any constraint on
    the physical resources of any one machine. In all cases, Spark
    artifacts are automatically stripped from the fitted estimator. These
    estimators can then be pickled and un-pickled for prediction tasks,
    operating identically at predict time to their scikit-learn
    counterparts. Supported tasks are:

    • Grid Search: Hyperparameter optimization techniques <https://scikit-learn.org/stable/modules/grid_search.html>__,
      particularly
      GridSearchCV <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV>__
      and
      RandomizedSearchCV <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV>__,
      are distributed such that each parameter set candidate is trained
      in parallel.
    • Multiclass Strategies: Multiclass classification strategies <https://scikit-learn.org/stable/modules/multiclass.html>__,
      particularly
      OneVsRestClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier>__
      and
      OneVsOneClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html#sklearn.multiclass.OneVsOneClassifier>__,
      are distributed such that each binary problem is trained in
      parallel.
    • Tree Ensembles: Decision tree ensembles <https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees>__
      for classification and regression, particularly
      RandomForest <https://scikit-learn.org/stable/modules/ensemble.html#random-forests>__
      and
      ExtraTrees <https://scikit-learn.org/stable/modules/ensemble.html#extremely-randomized-trees>__,
      are distributed such that each tree is trained in parallel.
  • Distributed Prediction - sk-dist provides a prediction module
    which builds vectorized UDFs <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs>__
    for
    PySpark <https://spark.apache.org/docs/latest/api/python/index.html>__
    DataFrames <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame>__
    using fitted scikit-learn estimators. This distributes the
    predict and predict_proba methods of scikit-learn
    estimators, enabling large-scale prediction with scikit-learn.

  • Feature Encoding - sk-dist provides a flexible feature
    encoding utility called Encoderizer, which encodes mixed-type
    feature spaces using either default behavior or user-defined
    settings. It is particularly aimed at text features, but
    it also handles numeric and dictionary-type feature spaces.
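As a sketch of the distributed training feature, the snippet below shows sk-dist's DistGridSearchCV, which mirrors scikit-learn's GridSearchCV but takes a SparkContext so that each parameter set candidate is fit on a Spark executor. This assumes pyspark and sk-dist are installed and a Spark cluster (or local mode) is available; the dataset and parameter grid here are illustrative only.

```python
# Hedged sketch: distributed grid search with sk-dist.
# Assumes pyspark + sk-dist are installed and Spark is available.
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from pyspark.sql import SparkSession
from skdist.distribute.search import DistGridSearchCV

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

X, y = datasets.load_digits(return_X_y=True)

# DistGridSearchCV takes a SparkContext as an extra argument and
# trains each parameter set candidate in parallel on the cluster.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
model = DistGridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, sc, scoring="accuracy"
)
model.fit(X, y)  # Spark artifacts are stripped from the fitted estimator
print(model.best_score_)
```

After fitting, `model` carries no Spark references, so it can be pickled and shipped anywhere scikit-learn runs.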
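Because Spark artifacts are stripped after fitting, the resulting object pickles and predicts exactly like a plain scikit-learn estimator. A minimal, runnable illustration of that round-trip contract, using scikit-learn alone:

```python
import pickle

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Fit a meta-estimator, serialize it, and confirm the un-pickled copy
# predicts identically -- the same contract sk-dist estimators follow.
search = GridSearchCV(LogisticRegression(max_iter=500), {"C": [0.1, 1.0]}, cv=3)
search.fit(X, y)

restored = pickle.loads(pickle.dumps(search))
assert (restored.predict(X) == search.predict(X)).all()
```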
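The prediction module can be sketched as follows: a fitted scikit-learn estimator is wrapped in a vectorized UDF and applied column-wise to a PySpark DataFrame. This assumes pyspark and sk-dist are installed and that `model` and `X` come from the training sketch above; the exact keyword arguments of get_prediction_udf should be checked against the sk-dist documentation.

```python
# Hedged sketch: large-scale prediction with a fitted scikit-learn model.
# Assumes pyspark + sk-dist; `model` and `X` are from the training sketch.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from skdist.distribute.predict import get_prediction_udf

spark = SparkSession.builder.getOrCreate()

# A Spark DataFrame of feature columns to score at scale
df = spark.createDataFrame(pd.DataFrame(X))

# Wrap the fitted estimator's predict method in a vectorized UDF
predict = get_prediction_udf(model, method="predict")
cols = [F.col(str(c)) for c in df.columns]
predictions = df.select(predict(*cols).alias("prediction"))
predictions.show(5)
```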
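Finally, a sketch of the Encoderizer on a mixed-type feature space, under the assumption that it follows the standard scikit-learn fit/transform interface with a size preset for its default behavior; consult the sk-dist documentation for the available presets and per-column customization options.

```python
# Hedged sketch: encoding a mixed-type feature space with Encoderizer.
# Assumes sk-dist is installed; column names and the "small" preset
# are illustrative.
import pandas as pd
from skdist.distribute.encoder import Encoderizer

df = pd.DataFrame({
    "text": ["spark makes this fast", "sklearn makes this easy"],
    "number": [1.5, 2.5],
})

encoder = Encoderizer(size="small")  # default encoding behavior
X_t = encoder.fit_transform(df)      # encoded feature matrix
```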

GitHub