Note
Go to the end to download the full example code
Using Dask for single-machine parallel computing¶
This example shows the simplest usage of the Dask backend on your local machine.
This is useful for prototyping a solution, to later be run on a truly distributed Dask cluster, as the only change needed is the cluster class.
Another realistic usage scenario: combining dask code with joblib code, for instance using dask for preprocessing data, and scikit-learn for machine learning. In such a setting, it may be interesting to use distributed as a backend scheduler for both dask and joblib, to orchestrate the computation.
Setup the distributed client¶
from dask.distributed import Client, LocalCluster
# replace with whichever cluster class you're using
# https://docs.dask.org/en/stable/deploying.html#distributed-computing
cluster = LocalCluster()
# connect client to your cluster
client = Client(cluster)
# Monitor your computation with the Dask dashboard
print(client.dashboard_link)
http://127.0.0.1:8787/status
Run parallel computation using dask.distributed¶
import time
import joblib
def long_running_function(i):
time.sleep(0.1)
return i
The verbose messages below show that the backend is indeed the dask.distributed one
with joblib.parallel_config(backend="dask"):
joblib.Parallel(verbose=100)(
joblib.delayed(long_running_function)(i) for i in range(10)
)
[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.2s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.15952348709106445s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 0.2s
[Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.3s
[Parallel(n_jobs=-1)]: Done 6 tasks | elapsed: 0.5s
[Parallel(n_jobs=-1)]: Done 8 out of 10 | elapsed: 0.5s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 10 out of 10 | elapsed: 0.6s finished
Total running time of the script: (0 minutes 1.886 seconds)