On demand recomputing: the Memory class
Use case
The Memory class defines a context for lazy evaluation of functions: it stores their results in a store (by default on disk) and does not re-run a function twice for the same arguments.
It works by explicitly saving the output to a file and it is designed to work with non-hashable and potentially large input and output data types such as numpy arrays.
A simple example:
First, define the cache directory:
>>> cachedir = 'your_cache_location_directory'

Then, instantiate a memory context that uses this cache directory:

>>> from joblib import Memory
>>> memory = Memory(cachedir, verbose=0)

After these initial steps, just decorate a function to cache its output in this context:

>>> @memory.cache
... def f(x):
...     print('Running f(%s)' % x)
...     return x

Calling this function twice with the same argument does not execute it the second time: the output is just reloaded from a pickle file in the cache directory:

>>> print(f(1))
Running f(1)
1
>>> print(f(1))
1

However, calling the function with a different parameter executes it and recomputes the output:

>>> print(f(2))
Running f(2)
2
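When the cached values are no longer needed, the whole cache directory can be emptied. A quick sketch, using the clear method documented in the reference section below:

>>> memory.clear(warn=False)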
Comparison with memoize
The memoize decorator (https://code.activestate.com/recipes/52201/) caches in memory all the inputs and outputs of a function call. It can thus avoid running the same function twice, with very small overhead. However, it compares input objects with those in the cache on each call. As a result, for big objects there is a huge overhead. Moreover, this approach does not work with numpy arrays, or other objects subject to insignificant fluctuations. Finally, using memoize with large objects will consume all the memory, whereas with Memory objects are persisted to disk, using a persister optimized for speed and memory usage (joblib.dump()).
In short, memoize is best suited for functions with “small” input and output objects, whereas Memory is best suited for functions with complex input and output objects, and aggressive persistence to disk.
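As a quick illustration of the difference, here is a minimal sketch using the standard library's functools.lru_cache as a stand-in for the memoize recipe; since it keys the cache on hashed inputs, it cannot accept numpy arrays at all:

import functools
import numpy as np

@functools.lru_cache(maxsize=None)
def double(x):
    # results are kept in memory, keyed by hash of the argument
    return 2 * x

print(double(21))       # 42; plain scalars are hashable, so this caches fine
try:
    double(np.arange(3))
except TypeError as err:
    print(err)          # numpy arrays are unhashable, so memoization fails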
Using with numpy
The original motivation behind the Memory context was to have a memoize-like pattern on numpy arrays. Memory uses fast cryptographic hashing of the input arguments to check if they have been computed.
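The hashing that backs this check is exposed as joblib.hash, which accepts numpy arrays; a quick sketch of the behaviour one can expect:

>>> import numpy as np
>>> from joblib import hash
>>> hash(np.arange(3)) == hash(np.arange(3))  # identical content, identical hash
True
>>> hash(np.arange(3)) == hash(np.ones(3))    # different content, different hash
False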
An example
Define two functions: the first with a number as an argument, outputting an array, used by the second one. Both functions are decorated with Memory.cache:

>>> import numpy as np
>>> @memory.cache
... def g(x):
...     print('A long-running calculation, with parameter %s' % x)
...     return np.hamming(x)
>>> @memory.cache
... def h(x):
...     print('A second long-running calculation, using g(x)')
...     return np.vander(x)

If the function h is called with the array created by the same call to g, h is not re-run:
>>> a = g(3)
A long-running calculation, with parameter 3
>>> a
array([0.08, 1.  , 0.08])
>>> g(3)
array([0.08, 1.  , 0.08])
>>> b = h(a)
A second long-running calculation, using g(x)
>>> b2 = h(a)
>>> b2
array([[0.0064, 0.08  , 1.    ],
       [1.    , 1.    , 1.    ],
       [0.0064, 0.08  , 1.    ]])
>>> np.allclose(b, b2)
True
Using memmapping
Memmapping (memory mapping) speeds up cache lookup when reloading large numpy arrays:
>>> cachedir2 = 'your_cachedir2_location'
>>> memory2 = Memory(cachedir2, mmap_mode='r')
>>> square = memory2.cache(np.square)
>>> a = np.vander(np.arange(3)).astype(float)
>>> square(a)
________________________________________________________________________________
[Memory] Calling square...
square(array([[0., 0., 1.],
[1., 1., 1.],
[4., 2., 1.]]))
___________________________________________________________square - ...min
memmap([[ 0., 0., 1.],
[ 1., 1., 1.],
[16., 4., 1.]])
Note
Notice the debug mode used in the above example. It is useful for tracing what is being re-executed and where the time is spent.
If the square function is called with the same input argument, its return value is loaded from the disk using memmapping:
>>> res = square(a)
>>> print(repr(res))
memmap([[ 0., 0., 1.],
[ 1., 1., 1.],
[16., 4., 1.]])
The memmap file must be closed to avoid file locking on Windows; closing numpy.memmap objects is done with del, which flushes changes to the disk:
>>> del res
Note
If the memory mapping mode used was ‘r’, as in the above example, the array will be read-only, and it will be impossible to modify it in place.
On the other hand, using ‘r+’ or ‘w+’ will enable modification of the array, but will propagate these modifications to the disk, which will corrupt the cache. If you want modification of the array in memory, we suggest you use the ‘c’ mode: copy on write.
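For illustration, a minimal sketch of the copy-on-write mode, re-using the cache directory from above (the names memory3, square_c and res_c are ours):

>>> memory3 = Memory(cachedir2, mmap_mode='c', verbose=0)
>>> square_c = memory3.cache(np.square)
>>> res_c = square_c(a)  # loaded from the cache as a copy-on-write memmap
>>> res_c[0, 0] = 42.    # modifies the in-memory copy only
>>> del res_c            # the cached value on disk is left intact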
Shelving: using references to cached values
In some cases, it can be useful to get a reference to the cached result, instead of having the result itself. A typical example of this is when a lot of large numpy arrays must be dispatched across several workers: instead of sending the data themselves over the network, send a reference to the joblib cache, and let the workers read the data from a network filesystem, potentially taking advantage of some system-level caching too.
Getting a reference to the cache can be done using the call_and_shelve method on the wrapped function:
>>> result = g.call_and_shelve(4)
A long-running calculation, with parameter 4
>>> result
MemorizedResult(location="...", func="...g...", args_id="...")
Once computed, the output of g is stored on disk, and deleted from memory. Reading the associated value can then be performed with the get method:
>>> result.get()
array([0.08, 0.77, 0.77, 0.08])
The cache for this particular value can be cleared using the clear method. Its invocation causes the stored value to be erased from disk. Any subsequent call to get will cause a KeyError exception to be raised:
>>> result.clear()
>>> result.get()
Traceback (most recent call last):
...
KeyError: 'Non-existing cache value (may have been cleared).\nFile ... does not exist'
A MemorizedResult instance contains all that is necessary to read the cached value. It can be pickled for transmission or storage, and the printed representation can even be copy-pasted to a different Python interpreter.
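For instance, a minimal sketch of a pickle round-trip for such a reference (only the reference travels, not the array itself):

>>> import pickle
>>> result2 = g.call_and_shelve(5)
A long-running calculation, with parameter 5
>>> restored = pickle.loads(pickle.dumps(result2))
>>> restored.get()
array([0.08, 0.54, 1.  , 0.54, 0.08])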
Shelving when cache is disabled
In the case where caching is disabled (e.g. Memory(None)), the call_and_shelve method returns a NotMemorizedResult instance, which stores the full function output instead of just a reference (since there is nothing to point to). All the above remains valid though, except for the copy-pasting feature.
Gotchas
- Across sessions, a function's cache is identified by the function's name. Thus, if you assign the same name to different functions, their caches will override each other (i.e. there are ‘name collisions’), and unwanted re-runs will happen:
>>> @memory.cache
... def func(x):
...     print('Running func(%s)' % x)
>>> func2 = func
>>> @memory.cache
... def func(x):
...     print('Running a different func(%s)' % x)
As long as the same session is used, there are no collisions (in joblib 0.8 and above), although joblib does warn you that you are doing something dangerous:
>>> func(1)
Running a different func(1)
>>> func2(1)
Running func(1)
>>> func(1)  # No recomputation so far
>>> func2(1)  # No recomputation so far
If the interpreter is exited and then restarted, however, the cache will not be identified properly, and the functions will be re-run:
>>> func(1)
Running a different func(1)
>>> func2(1)
Running func(1)
As long as the same session is used, there is no needless recomputation:
>>> func(1)  # No recomputation now
>>> func2(1)  # No recomputation now
- lambda functions: beware that with Python 2.7, lambda functions cannot be separated out:
>>> def my_print(x):
...     print(x)
>>> f = memory.cache(lambda : my_print(1))
>>> g = memory.cache(lambda : my_print(2))
>>> f()
1
>>> f()
>>> g()
memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function '<lambda>'
2
>>> g()
>>> f()
1
- Memory cannot be used on some complex objects, e.g. a callable object with a __call__ method.
However, it works on numpy ufuncs:
>>> sin = memory.cache(np.sin)
>>> print(sin(0))
0.0
- caching methods: Memory is designed for pure functions and it is not recommended to use it for methods. If you want to use caching inside a class, the recommended pattern is to cache a pure function and use the cached function inside your class, i.e. something like this:
@memory.cache
def compute_func(arg1, arg2, arg3):
    # long computation
    return result

class Foo(object):

    def __init__(self, args):
        self.data = None

    def compute(self):
        self.data = compute_func(self.arg1, self.arg2, 40)
Using Memory for methods is not recommended, and has some caveats that make it very fragile from a maintenance point of view, because it is very easy to forget about these caveats as the software evolves. If this cannot be avoided (we would be interested in hearing about your use case, by the way), here are a few known caveats:
- a method cannot be decorated at class definition time, because when the class is instantiated, the first argument (self) is bound and no longer accessible to the Memory object. The following code won't work:
class Foo(object):

    @memory.cache  # WRONG
    def method(self, args):
        pass
The right way to do this is to decorate at instantiation time:
class Foo(object):

    def __init__(self, args):
        self.method = memory.cache(self.method)

    def method(self, ...):
        pass
- the cached method will have self as one of its arguments. That means that the result will be recomputed if anything in self changes. For example, if self.attr has changed, calling self.method will recompute the result even if self.method does not use self.attr in its body. Another example is changing self inside the body of self.method. The consequence is that self.method will create cache entries that will not be reused in subsequent calls. To alleviate these problems, and if you know that the result of self.method does not depend on self, you can use self.method = memory.cache(self.method, ignore=['self']).
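A minimal sketch of that last pattern, assuming the result of method depends only on its explicit argument x:

class Foo(object):

    def __init__(self):
        # exclude 'self' from the hashed arguments, so the cache key
        # depends only on x
        self.method = memory.cache(self.method, ignore=['self'])

    def method(self, x):
        return x ** 2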
- joblib cache entries may be invalidated after environment updates. Values returned by joblib.hash() are not guaranteed to stay constant across joblib versions. This means that all entries of a Memory cache can be invalidated when upgrading joblib. Invalidation can also happen when upgrading a third-party library (such as numpy): in such a case, only the cached function calls with parameters that are constructs (or contain references to constructs) defined in the upgraded library should potentially be invalidated after the upgrade.
- cache misses with objects that have non-reproducible pickle representations. The identifier of a cache entry is based on the pickle representation of the input arguments. Therefore, for objects that don't have a deterministic pickle representation, or whose representation depends on the way they are constructed, the cache will not work. In particular, torch.Tensor objects are known to have a non-deterministic pickle representation (see this issue). A good way to debug this is to check that two runs of the following script, with args and kwargs being the cached function's inputs, give the same output:

from joblib import hash

for x in args:
    print(f"{hash(x)}")
for k, x in kwargs.items():
    print(f"hash({k})={hash(x)}")

To avoid this issue, a good practice is to use Memory.cache with functions that take simple input arguments when possible.
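For example, a sketch of this practice (the names Params and area are illustrative): rather than caching a function of a complex object, cache a function of the simple values extracted from it:

class Params(object):
    def __init__(self, width, height):
        self.width = width
        self.height = height

# More robust than caching a function of the Params object itself:
# the cache key depends only on simple, reproducibly-hashable values
@memory.cache
def area(width, height):
    return width * height

params = Params(3.0, 4.0)
result = area(params.width, params.height)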
Ignoring some arguments
It may be useful not to recalculate a function when certain arguments
change, for instance a debug flag. Memory
provides the ignore
list:
>>> @memory.cache(ignore=['debug'])
... def my_func(x, debug=True):
... print('Called with x = %s' % x)
>>> my_func(0)
Called with x = 0
>>> my_func(0, debug=False)
>>> my_func(0, debug=True)
>>> # my_func was not reevaluated
Custom cache validation
In some cases, external factors can invalidate the cached results, and one wants to have more control over whether to reuse a result or not.
This is for instance the case if the results depend on database records that change over time: a small delay in the updates might be tolerable, but after a while the results might become invalid.
One can have finer control over cache validity by specifying a function via cache_validation_callback in cache(). For instance, one can cache only results that take more than 1s to compute.
>>> import time
>>> def cache_validation_cb(metadata):
... # Only retrieve cached results for calls that take more than 1s
... return metadata['duration'] > 1
>>> @memory.cache(cache_validation_callback=cache_validation_cb)
... def my_func(delay=0):
... time.sleep(delay)
... print(f'Called with {delay}s delay')
>>> my_func()
Called with 0s delay
>>> my_func(1.1)
Called with 1.1s delay
>>> my_func(1.1) # This result is retrieved from cache
>>> my_func() # This one is not and the call is repeated
Called with 0s delay
cache_validation_cb will be called with a single argument containing the metadata of the cached call as a dictionary with the following keys:
- duration: the duration of the function call,
- time: the timestamp when the cached call was recorded,
- input_args: a dictionary of keyword arguments for the cached function call.
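As a sketch, these keys can be combined in a custom callback; here we assume time holds a Unix timestamp, consistent with the description above:

>>> import time
>>> def validate_entry(metadata):
...     # reuse a cached entry if it is recent, or if it was expensive to compute
...     age = time.time() - metadata['time']
...     return age < 3600 or metadata['duration'] > 10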
Note that a validity duration for cached results can be defined via joblib.expires_after(), by providing arguments similar to the ones of datetime.timedelta:
>>> from joblib import expires_after
>>> @memory.cache(cache_validation_callback=expires_after(seconds=0.5))
... def my_func():
... print(f'Function run')
>>> my_func()
Function run
>>> my_func()
>>> time.sleep(0.5)
>>> my_func()
Function run
Reference documentation of the Memory class
- class joblib.Memory(location=None, backend='local', mmap_mode=None, compress=False, verbose=1, bytes_limit=None, backend_options=None)
A context object for caching a function’s return value each time it is called with the same input arguments.
All values are cached on the filesystem, in a deep directory structure.
Read more in the User Guide.
- Parameters
- location: str, pathlib.Path or None
The path of the base directory to use as a data store or None. If None is given, no caching is done and the Memory object is completely transparent. This option replaces cachedir since version 0.12.
- backend: str, optional
Type of store backend for reading/writing cache files. Default: ‘local’. The ‘local’ backend uses regular filesystem operations (open, mv, etc.) to manipulate data in the backend.
- mmap_mode: {None, ‘r+’, ‘r’, ‘w+’, ‘c’}, optional
The memmapping mode used when loading numpy arrays from the cache. See numpy.load for the meaning of the arguments.
- compress: boolean, or integer, optional
Whether to zip the stored data on disk. If an integer is given, it should be between 1 and 9, and sets the amount of compression. Note that compressed arrays cannot be read by memmapping.
- verbose: int, optional
Verbosity flag, controls the debug messages that are issued as functions are evaluated.
- bytes_limit: int | str, optional
Limit in bytes of the size of the cache. By default, the size of the cache is unlimited. When reducing the size of the cache, joblib keeps the most recently accessed items first. If a str is passed, it is converted to a number of bytes using units { K | M | G } for kilo, mega, giga.
Note: you need to call joblib.Memory.reduce_size() to actually reduce the cache size to be less than bytes_limit.
Note: this argument has been deprecated. One should give the value of bytes_limit directly to joblib.Memory.reduce_size().
- backend_options: dict, optional
Contains a dictionary of named parameters used to configure the store backend.
- __init__(location=None, backend='local', mmap_mode=None, compress=False, verbose=1, bytes_limit=None, backend_options=None)
- Parameters
- depth: int, optional
The depth of objects printed.
- name: str, optional
The namespace to log to. If None, defaults to joblib.
- cache(func=None, ignore=None, verbose=None, mmap_mode=False, cache_validation_callback=None)
Decorates the given function func to only compute its return value for input arguments not cached on disk.
- Parameters
- func: callable, optional
The function to be decorated
- ignore: list of strings
A list of argument names to ignore in the hashing
- verbose: integer, optional
The verbosity mode of the function. By default that of the memory object is used.
- mmap_mode: {None, ‘r+’, ‘r’, ‘w+’, ‘c’}, optional
The memmapping mode used when loading numpy arrays from the cache. See numpy.load for the meaning of the arguments. By default, that of the memory object is used.
- cache_validation_callback: callable, optional
Callable to validate whether or not the cache is valid. When the cached function is called with arguments for which a cache exists, this callable is called with the metadata of the cached result as its sole argument. If it returns True, then the cached result is returned, else the cache for these arguments is cleared and recomputed.
- Returns
- decorated_func: MemorizedFunc object
The returned object is a MemorizedFunc object, that is callable (behaves like a function), but offers extra methods for cache lookup and management. See the documentation for joblib.memory.MemorizedFunc.
- clear(warn=True)
Erase the complete cache directory.
- eval(func, *args, **kwargs)
Eval function func with arguments *args and **kwargs, in the context of the memory.
This method works similarly to the builtin apply, except that the function is called only if the cache is not up to date.
- format(obj, indent=0)
Return the formatted representation of the object.
- reduce_size(bytes_limit=None, items_limit=None, age_limit=None)
Remove cache elements to make the cache fit its limits.
The limitation can impose that the cache size fits in bytes_limit, that the number of cache items is no more than items_limit, and that all files in the cache are not older than age_limit.
- Parameters
- bytes_limit: int | str, optional
Limit in bytes of the size of the cache. By default, the size of the cache is unlimited. When reducing the size of the cache, joblib keeps the most recently accessed items first. If a str is passed, it is converted to a number of bytes using units { K | M | G } for kilo, mega, giga.
- items_limit: int, optional
Number of items to limit the cache to. By default, the number of items in the cache is unlimited. When reducing the size of the cache, joblib keeps the most recently accessed items first.
- age_limit: datetime.timedelta, optional
Maximum age of items to limit the cache to. When reducing the size of the cache, any items last accessed more than the given length of time ago are deleted.
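For illustration, a short sketch of trimming a cache with all three limits at once (the values shown are arbitrary):

>>> import datetime
>>> memory.reduce_size(bytes_limit='1G', items_limit=1000,
...                    age_limit=datetime.timedelta(days=30))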
Useful methods of decorated functions
Functions decorated by Memory.cache are MemorizedFunc objects that, in addition to behaving like normal functions, expose methods useful for cache exploration and management. For example, you can use func.check_call_in_cache to check whether a cache hit will occur for a decorated func, given a set of inputs, without actually needing to call the function itself:
>>> @memory.cache
... def func(x):
... print('Running func(%s)' % x)
... return x
>>> type(func)
<class 'joblib.memory.MemorizedFunc'>
>>> func(1)
Running func(1)
1
>>> func.check_call_in_cache(1) # cache hit
True
>>> func.check_call_in_cache(2) # cache miss
False
- class joblib.memory.MemorizedFunc(func, location, backend='local', ignore=None, mmap_mode=None, compress=False, verbose=1, timestamp=None, cache_validation_callback=None)
Callable object decorating a function for caching its return value each time it is called.
Methods are provided to inspect the cache or clean it.
- Attributes
- func: callable
The original, undecorated, function.
- location: string
The location of joblib cache. Depends on the store backend used.
- backend: str
Type of store backend for reading/writing cache files. Default is ‘local’, in which case the location is the path to a disk storage.
- ignore: list or None
List of variable names to ignore when choosing whether to recompute.
- mmap_mode: {None, ‘r+’, ‘r’, ‘w+’, ‘c’}
The memmapping mode used when loading numpy arrays from the cache. See numpy.load for the meaning of the different values.
- compress: boolean, or integer
Whether to zip the stored data on disk. If an integer is given, it should be between 1 and 9, and sets the amount of compression. Note that compressed arrays cannot be read by memmapping.
- verbose: int, optional
The verbosity flag, controls messages that are issued as the function is evaluated.
- cache_validation_callback: callable, optional
Callable to check if a result in cache is valid or is to be recomputed. When the function is called with arguments for which a cache exists, the callback is called with the cache entry’s metadata as its sole argument. If it returns True, the cached result is returned, else the cache for these arguments is cleared and the result is recomputed.
- __init__(func, location, backend='local', ignore=None, mmap_mode=None, compress=False, verbose=1, timestamp=None, cache_validation_callback=None)
- Parameters
- depth: int, optional
The depth of objects printed.
- name: str, optional
The namespace to log to. If None, defaults to joblib.
- call(*args, **kwargs)
Force the execution of the function with the given arguments.
The output values will be persisted, i.e., the cache will be updated with any new values.
- Parameters
- *args: arguments
The arguments.
- **kwargs: keyword arguments
Keyword arguments.
- Returns
- output: object
The output of the function call.
- metadata: dict
The metadata associated with the call.
- check_call_in_cache(*args, **kwargs)
Check if function call is in the memory cache.
Does not call the function or do any work besides func inspection and arg hashing.
- Returns
- is_call_in_cache: bool
Whether or not the result of the function has been cached for the input arguments that have been passed.
- clear(warn=True)
Empty the function’s cache.
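For example, a single decorated function's cache can be dropped without touching the rest of the store; a short sketch re-using func from the example above:

>>> func.clear(warn=False)
>>> func.check_call_in_cache(1)  # the previously cached call is gone
False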
Helper Reference
- joblib.expires_after(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)
Helper cache_validation_callback to force recompute after a duration.
- Parameters
- days, seconds, microseconds, milliseconds, minutes, hours, weeks: numbers
Arguments passed to a datetime.timedelta.