lazycluster

lazycluster is a Python library intended to liberate data scientists and machine learning engineers by abstracting away cluster management and configuration so that they are able to focus on their actual tasks. Especially, the easy and convenient cluster setup with Python for various distributed machine learning frameworks is emphasized.

Highlights

  • High-Level API for starting clusters:
  • DASK
  • Hyperopt
  • More lazyclusters (e.g. Ray, PyTorch, Tensorflow, Horovod, Spark) to come ...
  • Lower-level API for:
  • Managing Runtimes or RuntimeGroups to:
  • A-/synchronously execute RuntimeTasks by leveraging the power of ssh
  • Expose services (e.g. a DB) from or to a Runtime or in a whole RuntimeGroup
  • Command line interface (CLI)
  • List all available Runtimes
  • Add a Runtime configuration
  • Delete a Runtime configuration


API layer
Concept Definition: Runtime
A Runtime is the logical representation of a remote host. Typically, the host is another server or a virtual machine / container on another server. This python class provides several methods for utilizing remote resources such as the port exposure from / to a Runtime as well as the execution of RuntimeTasks. A Runtime has a working directory. Usually, the execution of a RuntimeTask is conducted relatively to this directory if no other path is explicitly given. The working directory can be manually set during the initialization. Otherwise, a temporary directory gets created that might eventually be removed.
Concept Definition: RuntimeGroup
A RuntimeGroup is the representation of logically related Runtimes and provides convenient methods for managing those related Runtimes. Most methods are wrappers around their counterparts in the Runtime class. Typical usage examples are exposing a port (i.e. a service such as a DB) in the RuntimeGroup, transfer files, or execute a RuntimeTask on the Runtimes. Additionally, all concrete RuntimeCluster (e.g. the HyperoptCluster) implementations rely on RuntimeGroups for example.
Concept Definition: Manager
The manager refers to the host where you are actually using the lazycluster library, since all desired lazycluster entities are managed from here. Caution: It is not to be confused with the RuntimeManager class.
Concept Definition: RuntimeTask
A RuntimeTask is a composition of multiple elemantary task steps, namely send file, get file, run command (shell), run function (python). A RuntimeTask can be executed on a remote host either by handing it over to a Runtime object or standalone by handing over a fabric Connection object to the execute method of the RuntimeTask. Consequently, all invididual task steps are executed sequentially. Moreover, a RuntimeTask object captures the output (stdout/stderr) of the remote execution in its execution log. An example for a RuntimeTask could be to send a csv file to a Runtime, execute a python function that is transforming the csv file and finally get the file back.


Getting started

Installation

pip install lazycluster
# Most up-to-date development version
pip install --upgrade git+https://github.com/ml-tooling/lazycluster.git@develop

Prerequisites

For lazycluster usage on the manager:

Unix based OS

Python >= 3.6

ssh client (e.g. openssh-client)

Passwordless ssh access to the Runtime hosts (recommended)
Configure passwordless ssh access (click to expand...)

  • Create a key pair on the manager as described here or use an existing one
  • Install lazycluster on the manager
  • Create the ssh configuration for each host to be used as Runtime by using the lazycluster CLI command lazycluster add-runtime as described here and do not forget to specify the --id-file argument.
  • Finally, enable the passwordless ssh access by copying the public key to each Runtime as descibed here

Runtime host requirements:

  • Unix based OS
  • Python >= 3.6
  • ssh server (e.g. openssh-server)

Note:

Passwordless ssh needs to be setup for the hosts to be used as Runtimes for the most convenient user experience. Otherwise, you need to pass the connection details to Runtime.__init__ via connection_kwargs. These parameters will be passed on to the fabric.Connection.

Usage example high-level API

Start a Dask cluster.

from lazycluster import RuntimeManager
from lazycluster.cluster.dask_cluster import DaskCluster

# Automatically generate a group based on the ssh configuration
runtime_manager = RuntimeManager()
runtime_group = runtime_manager.create_group()

# Start the Dask cluster instances using the RuntimeGroup
dask_cluster = DaskCluster(runtime_group)
dask_cluster.start()

# => Now, you can start using the running Dask cluster

# Get Dask client to interact with the cluster
# Note: This will give you a dask.distributed.Client which is not
#       a lazycluster cluster but a Dask one instead
client = cluster.get_client()

Usage example lower-level API

Execute a Python function on a remote host and access the return data.

from lazycluster import RuntimeTask, Runtime

# Define a Python function which will be executed remotely
def hello(name:str):
    return 'Hello ' + name + '!'

# Compose a `RuntimeTask`
task = RuntimeTask('my-first_task').run_command('echo Hello World!') \
                                   .run_function(hello, name='World')

# Actually execute it remotely in a `Runtime`
task = Runtime('host-1').execute_task(task, execute_async=False)

# The stdout from from the executing `Runtime` can be accessed
# via the execution log of the `RuntimeTask`
task.print_log()

# Print the return of the `hello()` call
generator = task.function_returns
print(next(generator))

GitHub

GitHub - ml-tooling/lazycluster: ? Distributed machine learning made simple.
? Distributed machine learning made simple. Contribute to ml-tooling/lazycluster development by creating an account on GitHub.