sk-dist

sk-dist is a Python module for machine learning built on top of scikit-learn and distributed under the Apache 2.0 software license. The sk-dist module can be thought of as "distributed scikit-learn": its core functionality extends scikit-learn's built-in joblib parallelization of meta-estimator training to Spark.

Main Features

  • Distributed Training - sk-dist parallelizes the training of
    scikit-learn meta-estimators with PySpark. This allows
    distributed training of these estimators without any constraint on
    the physical resources of any one machine. In all cases, Spark
    artifacts are automatically stripped from the fitted estimator. These
    estimators can then be pickled and un-pickled for prediction tasks,
    operating identically at predict time to their scikit-learn
    counterparts. Supported tasks are:

    • Grid Search: Hyperparameter optimization techniques
      <https://scikit-learn.org/stable/modules/grid_search.html>,
      particularly
      GridSearchCV <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV>
      and
      RandomizedSearchCV <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV>,
      are distributed such that each parameter set candidate is trained
      in parallel (a usage sketch follows this list).
    • Multiclass Strategies: Multiclass classification strategies
      <https://scikit-learn.org/stable/modules/multiclass.html>,
      particularly
      OneVsRestClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier>
      and
      OneVsOneClassifier <https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html#sklearn.multiclass.OneVsOneClassifier>,
      are distributed such that each binary problem is trained in
      parallel.
    • Tree Ensembles: Decision tree ensembles
      <https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees>
      for classification and regression, particularly
      RandomForest <https://scikit-learn.org/stable/modules/ensemble.html#random-forests>
      and
      ExtraTrees <https://scikit-learn.org/stable/modules/ensemble.html#extremely-randomized-trees>,
      are distributed such that each tree is trained in parallel.
  • Distributed Prediction - sk-dist provides a prediction module
    which builds vectorized UDFs <https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs>
    for PySpark <https://spark.apache.org/docs/latest/api/python/index.html>
    DataFrames <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame>
    using fitted scikit-learn estimators. This distributes the
    predict and predict_proba methods of scikit-learn
    estimators, enabling large-scale prediction with scikit-learn
    (a concept sketch follows this list).

  • Feature Encoding - sk-dist provides a flexible feature
    encoding utility called Encoderizer which encodes mixed-type
    feature spaces using either default behavior or user-defined
    settings. It is particularly aimed at text features, but it
    additionally handles numeric and dictionary-type feature spaces
    (a usage sketch follows this list).
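
Usage Sketches

To make distributed training concrete, here is a minimal grid search sketch. It assumes DistGridSearchCV from skdist.distribute.search, which mirrors scikit-learn's GridSearchCV but takes an active SparkContext so that each parameter set candidate is fit as a Spark task; check the sk-dist documentation for the exact signature in your version.

    from pyspark.sql import SparkSession
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from skdist.distribute.search import DistGridSearchCV  # assumed module path

    # SparkContext used to parallelize the candidate fits
    spark = SparkSession.builder.appName("skdist-grid-search").getOrCreate()
    sc = spark.sparkContext

    X, y = load_breast_cancer(return_X_y=True)

    # Each (parameter set, CV fold) fit runs as a Spark task instead of a joblib job
    model = DistGridSearchCV(
        LogisticRegression(solver="liblinear"),
        {"C": [0.01, 0.1, 1.0, 10.0]},
        sc,
        scoring="roc_auc",
        cv=5,
    )
    model.fit(X, y)

    # Spark artifacts are stripped after fitting, so the result pickles and
    # predicts like a plain scikit-learn estimator
    print(model.best_score_)
    print(model.predict(X[:5]))

The multiclass and tree ensemble tasks follow the same pattern, swapping the scikit-learn meta-estimator for its distributed sk-dist counterpart (for example a DistOneVsRestClassifier analog of OneVsRestClassifier); verify the class names against the package before relying on them.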
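
For distributed prediction, the idea is a vectorized (pandas) UDF that scores a whole batch of DataFrame rows per call with a fitted scikit-learn estimator. The sketch below illustrates the concept with plain PySpark rather than sk-dist's own prediction helpers, whose exact API may differ; the dataset and column names are made up for illustration.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    spark = SparkSession.builder.appName("skdist-predict-sketch").getOrCreate()

    # Fit a plain scikit-learn estimator on the driver
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Vectorized (pandas) UDF: each executor scores a whole batch of rows with
    # one predict call instead of looping row by row
    @pandas_udf("long")
    def predict_udf(f0: pd.Series, f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
        features = pd.concat([f0, f1, f2, f3], axis=1).values
        return pd.Series(clf.predict(features))

    # Hypothetical feature DataFrame with one column per input feature
    pdf = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
    df = spark.createDataFrame(pdf)
    df.withColumn("prediction", predict_udf("f0", "f1", "f2", "f3")).show(5)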
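
Feature encoding with Encoderizer follows the scikit-learn fit/transform convention. The snippet below is a hypothetical sketch: it assumes Encoderizer is importable from skdist.distribute.encoder and accepts a size hint for its default behavior, and the example DataFrame is made up; consult the package docs for the actual options.

    import pandas as pd
    from skdist.distribute.encoder import Encoderizer  # assumed module path

    # Mixed-type feature space: free text plus a numeric column
    df = pd.DataFrame({
        "text": ["spark makes training fast", "scikit-learn stays the same", "encode me"],
        "length": [4, 5, 2],
    })

    # Default per-column behavior chosen from a size hint; individual columns
    # can be overridden with user-defined settings
    encoder = Encoderizer(size="small")
    X = encoder.fit_transform(df)  # feature matrix ready for training
    print(X.shape)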
