lazycluster
lazycluster is a Python library intended to liberate data scientists and machine learning engineers by abstracting away cluster management and configuration so that they can focus on their actual tasks. In particular, it emphasizes easy and convenient cluster setup with Python for various distributed machine learning frameworks.
Highlights
- High-Level API for starting clusters:
  - DASK
  - Hyperopt
  - More lazyclusters (e.g. Ray, PyTorch, Tensorflow, Horovod, Spark) to come ...
- Lower-level API for:
  - Managing Runtimes or RuntimeGroups to:
    - A-/synchronously execute RuntimeTasks by leveraging the power of ssh
    - Expose services (e.g. a DB) from or to a Runtime or in a whole RuntimeGroup
- Command line interface (CLI)
  - List all available Runtimes
  - Add a Runtime configuration
  - Delete a Runtime configuration
Concept Definition: Runtime
A Runtime is the logical representation of a remote host. Typically, the host is another server or a virtual machine / container on another server. This Python class provides several methods for utilizing remote resources, such as port exposure from / to a Runtime as well as the execution of RuntimeTasks. A Runtime has a working directory. Usually, a RuntimeTask is executed relative to this directory if no other path is explicitly given. The working directory can be set manually during initialization. Otherwise, a temporary directory is created that might eventually be removed.
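The working directory rule can be pictured with a small local sketch. This is not lazycluster's implementation: `resolve_working_dir` and `resolve_task_path` are hypothetical helper names, and everything runs on the local machine with the standard library only.

```python
import os
import tempfile
from typing import Optional


def resolve_working_dir(working_dir: Optional[str] = None) -> str:
    """Return the explicitly given directory, or create a temporary one."""
    if working_dir is not None:
        return working_dir
    # No directory given: fall back to a freshly created temporary directory.
    return tempfile.mkdtemp(prefix="runtime-")


def resolve_task_path(working_dir: str, path: str) -> str:
    """Interpret relative paths against the working directory; keep absolute ones."""
    if os.path.isabs(path):
        return path
    return os.path.join(working_dir, path)


wd = resolve_working_dir()
print(resolve_task_path(wd, "data/input.csv"))      # resolved inside the working directory
print(resolve_task_path(wd, "/tmp/elsewhere.csv"))  # explicit absolute path is kept as-is
```

The sketch only covers path resolution; cleaning up the temporary directory is left out on purpose.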
Concept Definition: RuntimeGroup
A RuntimeGroup is the representation of logically related Runtimes and provides convenient methods for managing those related Runtimes. Most methods are wrappers around their counterparts in the Runtime class. Typical usage examples are exposing a port (i.e. a service such as a DB) in the RuntimeGroup, transferring files, or executing a RuntimeTask on the Runtimes. Additionally, all concrete RuntimeCluster implementations (e.g. the HyperoptCluster) rely on RuntimeGroups.
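The idea that group methods wrap their per-host counterparts can be illustrated with a toy model. `ToyRuntime` and `ToyGroup` are hypothetical classes that run callables locally; they are not part of lazycluster. The `execute_async` flag mirrors the parameter name used in the lower-level usage example, but its behaviour here is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class ToyRuntime:
    """Hypothetical stand-in for a remote host; it just runs callables locally."""

    def __init__(self, host: str) -> None:
        self.host = host

    def execute(self, func: Callable[[str], str]) -> str:
        return func(self.host)


class ToyGroup:
    """Hypothetical group whose method delegates to each member."""

    def __init__(self, runtimes: List[ToyRuntime]) -> None:
        self.runtimes = runtimes

    def execute(self, func: Callable[[str], str], execute_async: bool = False) -> Dict[str, str]:
        if not execute_async:
            # Synchronous: one member after the other.
            return {rt.host: rt.execute(func) for rt in self.runtimes}
        # Asynchronous: fan out to all members, then collect the results.
        with ThreadPoolExecutor(max_workers=len(self.runtimes)) as pool:
            futures = {rt.host: pool.submit(rt.execute, func) for rt in self.runtimes}
            return {host: future.result() for host, future in futures.items()}


group = ToyGroup([ToyRuntime("host-1"), ToyRuntime("host-2")])
results = group.execute(lambda host: "hello from " + host, execute_async=True)
print(results)
```

Both modes return the same mapping from host to result; only the scheduling differs.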
Concept Definition: Manager
The manager refers to the host where you are actually using the lazycluster library, since all desired lazycluster entities are managed from here. Caution: It is not to be confused with the RuntimeManager class.
Concept Definition: RuntimeTask
A RuntimeTask is a composition of multiple elementary task steps, namely send file, get file, run command (shell), and run function (Python). A RuntimeTask can be executed on a remote host either by handing it over to a Runtime object or standalone by handing over a fabric Connection object to the execute method of the RuntimeTask. In either case, the individual task steps are executed sequentially. Moreover, a RuntimeTask object captures the output (stdout/stderr) of the remote execution in its execution log. An example of a RuntimeTask could be to send a CSV file to a Runtime, execute a Python function that transforms the CSV file, and finally get the file back.
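To make the composition idea concrete, here is a minimal local model of a task built from chained steps that run in order, log command output, and collect function return values. `ToyTask` is a hypothetical class written for this illustration; it runs everything on the local machine and does not reflect how lazycluster implements RuntimeTask internally.

```python
import subprocess
from typing import Any, Callable, Iterator, List


class ToyTask:
    """Hypothetical local model of a task built from sequential steps."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._steps: List[Callable[[], None]] = []
        self.execution_log: List[str] = []
        self._returns: List[Any] = []

    def run_command(self, command: str) -> "ToyTask":
        def step() -> None:
            done = subprocess.run(
                command,
                shell=True,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # merge stderr into stdout
                universal_newlines=True,
            )
            self.execution_log.append(done.stdout)

        self._steps.append(step)
        return self  # returning self allows steps to be chained

    def run_function(self, func: Callable[..., Any], **kwargs: Any) -> "ToyTask":
        def step() -> None:
            self._returns.append(func(**kwargs))

        self._steps.append(step)
        return self

    def execute(self) -> "ToyTask":
        # Steps run strictly in the order in which they were added.
        for step in self._steps:
            step()
        return self

    @property
    def function_returns(self) -> Iterator[Any]:
        yield from self._returns


def hello(name: str) -> str:
    return "Hello " + name + "!"


task = ToyTask("demo").run_command("echo step one").run_function(hello, name="World")
task.execute()
print(task.execution_log)
print(next(task.function_returns))
```

Steps are stored as closures and only run when `execute` is called, which is what allows a task to be composed first and executed later.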
Getting started
Installation
pip install lazycluster
# Most up-to-date development version
pip install --upgrade git+https://github.com/ml-tooling/lazycluster.git@develop
Prerequisites
For lazycluster usage on the manager:
- Unix based OS
- Python >= 3.6
- ssh client (e.g. openssh-client)
- Passwordless ssh access to the Runtime hosts (recommended)
Configure passwordless ssh access
- Create a key pair on the manager as described here or use an existing one
- Install lazycluster on the manager
- Create the ssh configuration for each host to be used as Runtime by using the lazycluster CLI command lazycluster add-runtime as described here, and do not forget to specify the --id-file argument
- Finally, enable the passwordless ssh access by copying the public key to each Runtime as described here
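For orientation, an OpenSSH client configuration entry for a Runtime host typically looks like the text produced by the sketch below. The helper `render_ssh_host_entry` and all host values are hypothetical. Host, HostName, User, Port and IdentityFile are standard OpenSSH client options, but whether lazycluster add-runtime writes exactly these fields is an assumption.

```python
def render_ssh_host_entry(alias: str, hostname: str, user: str, port: int, id_file: str) -> str:
    """Render one ssh client config block for a host (illustrative only)."""
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        f"    Port {port}",
        f"    IdentityFile {id_file}",  # private key used for passwordless login
    ]
    return "\n".join(lines) + "\n"


# Example values only; replace them with the details of your own host.
entry = render_ssh_host_entry("host-1", "192.0.2.10", "root", 22, "~/.ssh/id_rsa")
print(entry)
```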
Runtime host requirements:
- Unix based OS
- Python >= 3.6
- ssh server (e.g. openssh-server)
Note:
Passwordless ssh needs to be set up for the hosts to be used as Runtimes for the most convenient user experience. Otherwise, you need to pass the connection details to Runtime.__init__ via connection_kwargs. These parameters will be passed on to the fabric.Connection.
Usage example high-level API
Start a Dask cluster.
from lazycluster import RuntimeManager
from lazycluster.cluster.dask_cluster import DaskCluster
# Automatically generate a group based on the ssh configuration
runtime_manager = RuntimeManager()
runtime_group = runtime_manager.create_group()
# Start the Dask cluster instances using the RuntimeGroup
dask_cluster = DaskCluster(runtime_group)
dask_cluster.start()
# => Now, you can start using the running Dask cluster
# Get Dask client to interact with the cluster
# Note: This will give you a dask.distributed.Client which is not
# a lazycluster cluster but a Dask one instead
client = dask_cluster.get_client()
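The example above builds its group from the ssh configuration. As a rough illustration of what discovering hosts from such a file can involve, the sketch below extracts concrete host aliases from ssh config text. `list_ssh_host_aliases` is a hypothetical helper, not a lazycluster function, and it handles only the common whitespace-separated `Host` syntax.

```python
from typing import List


def list_ssh_host_aliases(config_text: str) -> List[str]:
    """Collect concrete host aliases from ssh client config text."""
    aliases: List[str] = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0].lower() != "host":
            continue  # only Host lines declare aliases
        for pattern in parts[1:]:
            # Wildcard and negated patterns do not name a single host.
            if any(ch in pattern for ch in "*?!"):
                continue
            aliases.append(pattern)
    return aliases


# Example configuration text with made-up hosts.
sample = """
# Manager-side ssh configuration (example values)
Host host-1
    HostName 192.0.2.10
    User root

Host host-2 host-3
    HostName 192.0.2.11

Host *
    ServerAliveInterval 60
"""
print(list_ssh_host_aliases(sample))
```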
Usage example lower-level API
Execute a Python function on a remote host and access the return data.
from lazycluster import RuntimeTask, Runtime
# Define a Python function which will be executed remotely
def hello(name: str):
    return 'Hello ' + name + '!'
# Compose a `RuntimeTask`
task = RuntimeTask('my-first_task').run_command('echo Hello World!') \
                                   .run_function(hello, name='World')
# Actually execute it remotely in a `Runtime`
task = Runtime('host-1').execute_task(task, execute_async=False)
# The stdout from the executing `Runtime` can be accessed
# via the execution log of the `RuntimeTask`
task.print_log()
# Print the return of the `hello()` call
generator = task.function_returns
print(next(generator))