Random state within joblib.Parallel¶
Randomness is affected differently by parallel execution depending on the backend.
In particular, when using multiple processes, the random sequence can be the same in all processes. This example illustrates the problem and shows how to work around it.
A utility function for the example¶
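This example relies on numpy and on joblib's Parallel and delayed helpers, together with a small print_vector utility used below to display the results. A minimal version of this utility (a sketch reconstructed to match the output format shown in the rest of the example) could look like this:
import numpy as np

from joblib import Parallel, delayed


def print_vector(vector, backend):
    """Helper function to print the generated vectors with a given backend."""
    print('\nThe different generated vectors using the {} backend are:\n {}'
          .format(backend, np.array(vector)))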
Sequential behavior¶
stochastic_function will generate five random integers. When calling the function several times, we expect to obtain different vectors. For instance, if we call the function five times in a sequential manner, we can check that the generated vectors are all different.
def stochastic_function(max_value):
"""Randomly generate integer up to a maximum value."""
return np.random.randint(max_value, size=5)
n_vectors = 5
random_vector = [stochastic_function(10) for _ in range(n_vectors)]
print('\nThe different generated vectors in a sequential manner are:\n {}'
      .format(np.array(random_vector)))
The different generated vectors in a sequential manner are:
[[1 0 2 6 3]
[6 6 4 8 5]
[5 5 9 2 3]
[2 6 3 8 6]
[8 5 5 6 7]]
Parallel behavior¶
Joblib provides three different backends: loky (default), threading, and multiprocessing.
backend = 'loky'
random_vector = Parallel(n_jobs=2, backend=backend)(delayed(
    stochastic_function)(10) for _ in range(n_vectors))
print_vector(random_vector, backend)
The different generated vectors using the loky backend are:
[[2 5 0 7 9]
[3 5 8 6 1]
[7 5 8 3 4]
[8 6 0 4 6]
[6 5 2 1 3]]
backend = 'threading'
random_vector = Parallel(n_jobs=2, backend=backend)(delayed(
    stochastic_function)(10) for _ in range(n_vectors))
print_vector(random_vector, backend)
The different generated vectors using the threading backend are:
[[3 5 2 6 0]
[6 8 0 5 4]
[4 8 1 0 5]
[6 4 2 7 0]
[1 4 6 1 3]]
The loky and threading backends behave exactly as in the sequential case and do not require more care. However, this is not the case for the multiprocessing backend with the “fork” or “forkserver” start method, because the global numpy random state will be exactly duplicated in all the workers.
Note: on platforms for which the default start method is “spawn”, we do not have this problem, but we cannot use this backend in a Python script without the if __name__ == "__main__" construct. So let’s skip this part of the example if that’s the case:
import multiprocessing as mp

if mp.get_start_method() != "spawn":
    backend = 'multiprocessing'
    random_vector = Parallel(n_jobs=2, backend=backend)(delayed(
        stochastic_function)(10) for _ in range(n_vectors))
    print_vector(random_vector, backend)
The different generated vectors using the multiprocessing backend are:
[[6 5 2 0 7]
[6 5 2 0 7]
[7 7 7 4 7]
[7 6 6 9 2]
[6 0 8 2 8]]
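The duplication can also be checked programmatically. A small sketch (not part of the original example), reusing the random_vector list produced by the previous call:
if mp.get_start_method() != "spawn":
    # With a duplicated global seed and n_jobs=2, fewer than n_vectors
    # distinct vectors are typically produced.
    unique_rows = {tuple(v) for v in random_vector}
    print('{} distinct vectors out of {}'
          .format(len(unique_rows), len(random_vector)))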
Some of the generated vectors are exactly the same, which can be a problem for the application. Technically, the reason is that all forked Python processes share the same exact random seed. As a result, we obtain the same randomly generated vectors twice because we are using n_jobs=2. A solution is to set the random state within the function which is passed to joblib.Parallel.
def stochastic_function_seeded(max_value, random_state):
    """Randomly generate integers up to a maximum value, from a given seed."""
    rng = np.random.RandomState(random_state)
    return rng.randint(max_value, size=5)
stochastic_function_seeded accepts a random seed as argument. We can reset this seed by passing None at every function call. In this case, we see that the generated vectors are all different.
if mp.get_start_method() != "spawn":
    random_vector = Parallel(n_jobs=2, backend=backend)(delayed(
        stochastic_function_seeded)(10, None) for _ in range(n_vectors))
    print_vector(random_vector, backend)
The different generated vectors using the multiprocessing backend are:
[[9 5 9 0 3]
[8 4 8 6 9]
[0 9 7 4 4]
[1 7 2 9 6]
[2 1 2 7 7]]
Fixing the random state to obtain deterministic results¶
The pattern of stochastic_function_seeded has another advantage: it allows controlling the random_state by passing a known seed. For instance, we can replicate the same generation of vectors by passing a fixed state as follows.
if mp.get_start_method() != "spawn":
    random_state = np.random.randint(np.iinfo(np.int32).max, size=n_vectors)

    random_vector = Parallel(n_jobs=2, backend=backend)(delayed(
        stochastic_function_seeded)(10, rng) for rng in random_state)
    print_vector(random_vector, backend)

    random_vector = Parallel(n_jobs=2, backend=backend)(delayed(
        stochastic_function_seeded)(10, rng) for rng in random_state)
    print_vector(random_vector, backend)
The different generated vectors using the multiprocessing backend are:
[[9 1 1 9 3]
[0 9 8 6 0]
[4 3 4 8 2]
[3 5 1 9 1]
[8 3 3 4 0]]
The different generated vectors using the multiprocessing backend are:
[[9 1 1 9 3]
[0 9 8 6 0]
[4 3 4 8 2]
[3 5 1 9 1]
[8 3 3 4 0]]
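The same pattern also works with the newer numpy Generator API. A minimal sketch (not part of the original example), where np.random.default_rng and a spawned SeedSequence replace np.random.RandomState and the explicit integer seeds:
def stochastic_function_rng(max_value, seed):
    """Variant of stochastic_function_seeded using np.random.default_rng."""
    rng = np.random.default_rng(seed)
    return rng.integers(max_value, size=5)


if mp.get_start_method() != "spawn":
    # Spawn one independent child seed per task to obtain distinct yet
    # reproducible streams across the workers.
    seeds = np.random.SeedSequence(42).spawn(n_vectors)
    random_vector = Parallel(n_jobs=2, backend=backend)(delayed(
        stochastic_function_rng)(10, seed) for seed in seeds)
    print_vector(random_vector, backend)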
Total running time of the script: (0 minutes 0.755 seconds)