Persistence¶
Use case¶
joblib.dump()
and joblib.load()
provide a replacement for
pickle to work efficiently on arbitrary Python objects containing large data,
in particular large numpy arrays.
Warning
joblib.dump()
and joblib.load()
are based on the Python pickle
serialization model, which means that arbitrary Python code can be executed
when loading a serialized object with joblib.load()
.
joblib.load()
should therefore never be used to load objects from an
untrusted source or otherwise you will introduce a security vulnerability in
your program.
Note
As of Python 3.8 and numpy 1.16, pickle protocol 5 introduced in PEP 574 supports efficient serialization and de-serialization for large data buffers natively using the standard library:
pickle.dump(large_object, fileobj, protocol=5)
A simple example¶
First create a temporary directory:
>>> from tempfile import mkdtemp
>>> savedir = mkdtemp()
>>> import os
>>> filename = os.path.join(savedir, 'test.joblib')
Then create an object to be persisted:
>>> import numpy as np
>>> to_persist = [('a', [1, 2, 3]), ('b', np.arange(10))]
which is saved into filename:
>>> import joblib
>>> joblib.dump(to_persist, filename)
['...test.joblib']
The object can then be reloaded from the file:
>>> joblib.load(filename)
[('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]
Persistence in file objects¶
Instead of filenames, joblib.dump()
and joblib.load()
functions
also accept file objects:
>>> with open(filename, 'wb') as fo:
... joblib.dump(to_persist, fo)
>>> with open(filename, 'rb') as fo:
... joblib.load(fo)
[('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]
Compressed joblib pickles¶
Setting the compress argument to True in joblib.dump()
will allow to
save space on disk:
>>> joblib.dump(to_persist, filename + '.compressed', compress=True)
['...test.joblib.compressed']
If the filename extension corresponds to one of the supported compression methods, the compressor will be used automatically:
>>> joblib.dump(to_persist, filename + '.z')
['...test.joblib.z']
By default, joblib.dump()
uses the zlib compression method as it gives
the best tradeoff between speed and disk space. The other supported compression
methods are ‘gzip’, ‘bz2’, ‘lzma’ and ‘xz’:
>>> # Dumping in a gzip compressed file using a compress level of 3.
>>> joblib.dump(to_persist, filename + '.gz', compress=('gzip', 3))
['...test.joblib.gz']
>>> joblib.load(filename + '.gz')
[('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]
>>> joblib.dump(to_persist, filename + '.bz2', compress=('bz2', 3))
['...test.joblib.bz2']
>>> joblib.load(filename + '.bz2')
[('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]
The compress
parameter of the joblib.dump()
function also accepts a
string corresponding to the name of the compressor used. When using this, the
default compression level is used by the compressor:
>>> joblib.dump(to_persist, filename + '.gz', compress='gzip')
['...test.joblib.gz']
Note
Lzma and Xz compression methods are only available for python versions >= 3.3.
Compressor files provided by the python standard library can also be used to
compress pickle, e.g gzip.GzipFile
, bz2.BZ2File
, lzma.LZMAFile
:
>>> # Dumping in a gzip.GzipFile object using a compression level of 3.
>>> import gzip
>>> with gzip.GzipFile(filename + '.gz', 'wb', compresslevel=3) as fo:
... joblib.dump(to_persist, fo)
>>> with gzip.GzipFile(filename + '.gz', 'rb') as fo:
... joblib.load(fo)
[('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]
If the lz4
package is installed, this compression method is automatically
available with the dump function.
>>> joblib.dump(to_persist, filename + '.lz4')
['...test.joblib.lz4']
>>> joblib.load(filename + '.lz4')
[('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]
Note
LZ4 compression is only available with python major versions >= 3
More details can be found in the joblib.dump()
and
joblib.load()
documentation.
Registering extra compressors¶
Joblib provides joblib.register_compressor()
in order to extend the list
of default compressors available.
To fit with Joblib internal implementation and features, such as
joblib.load()
and joblib.Memory
, the registered compressor
should implement the Python file object interface.
Compatibility across python versions¶
Compatibility of joblib pickles across python versions is not fully supported. Note that, for a very restricted set of objects, this may appear to work when saving a pickle with python 2 and loading it with python 3 but relying on it is strongly discouraged.
If you are switching between python versions, you will need to save a different joblib pickle for each python version.
Here are a few examples or exceptions:
Saving joblib pickle with python 2, trying to load it with python 3:
Traceback (most recent call last): File "/home/lesteve/dev/joblib/joblib/numpy_pickle.py", line 453, in load obj = unpickler.load() File "/home/lesteve/miniconda3/lib/python3.4/pickle.py", line 1038, in load dispatch[key[0]](self) File "/home/lesteve/miniconda3/lib/python3.4/pickle.py", line 1176, in load_binstring self.append(self._decode_string(data)) File "/home/lesteve/miniconda3/lib/python3.4/pickle.py", line 1158, in _decode_string return value.decode(self.encoding, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1024: ordinal not in range(128) Traceback (most recent call last): File "<string>", line 1, in <module> File "/home/lesteve/dev/joblib/joblib/numpy_pickle.py", line 462, in load raise new_exc ValueError: You may be trying to read with python 3 a joblib pickle generated with python 2. This is not feature supported by joblib.Saving joblib pickle with python 3, trying to load it with python 2:
Traceback (most recent call last): File "<string>", line 1, in <module> File "joblib/numpy_pickle.py", line 453, in load obj = unpickler.load() File "/home/lesteve/miniconda3/envs/py27/lib/python2.7/pickle.py", line 858, in load dispatch[key](self) File "/home/lesteve/miniconda3/envs/py27/lib/python2.7/pickle.py", line 886, in load_proto raise ValueError, "unsupported pickle protocol: %d" % proto ValueError: unsupported pickle protocol: 3