The make function decorator saves the output of a function, and runs the function again only if its code or its input arguments have changed. Unlike the memoize decorator, it works by explicitly saving the output to a file in a specified format (thus potentially useful for later work), and it is designed to work with non-hashable input and output data types such as numpy arrays.
First we create a temporary directory, for the cache:
>>> from tempfile import mkdtemp
>>> cachedir = mkdtemp()
>>> from joblib.make import make, PickleFile

Then we define our function, specifying its cache directory, and that it persists its output using a pickle file in the cache directory:
>>> @make(cachedir=cachedir, output=PickleFile(cachedir+'/f_output'))
... def f(x):
...     print 'Running f(%s)' % x
...     return x

When we call this function twice with the same argument, it is not executed the second time, and the output is loaded from the pickle file:
>>> print f(1)
Running f(1)
1
>>> print f(1)
1

However, when we call it a third time, with a different argument, the output gets recomputed:
>>> print f(2)
Running f(2)
2
The memoize decorator caches in memory all the inputs and outputs of a function call. It can thus avoid running the same function twice, with very little overhead. However, it compares the input objects with those in the cache on each call, so for big objects this comparison becomes a significant overhead. Moreover, this approach does not work with numpy arrays, or other objects subject to non-significant fluctuations. Finally, using memoize with large objects will consume all the memory, and even though memoize can persist to disk, the resulting files are not easy to load in other software.
In short, memoize is best suited for functions with “small” input and output objects, whereas make is best suited for functions with complex input and output objects.
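To make the hashing problem concrete, here is a minimal sketch of a generic dict-based memoize decorator (not joblib's implementation, shown only for illustration); it breaks as soon as it receives a numpy array, because dictionary lookups require hashable arguments:

>>> def memoize(func):
...     cache = {}
...     def wrapped(*args):
...         # Dictionary lookup requires hashable arguments
...         if args not in cache:
...             cache[args] = func(*args)
...         return cache[args]
...     return wrapped
>>> @memoize
... def square(x):
...     return x**2
>>> square(2)
4
>>> import numpy as np
>>> square(np.arange(3))
Traceback (most recent call last):
    ...
TypeError: unhashable type: 'numpy.ndarray'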
The original motivation behind the make function decorator was to be able to apply a memoize-like pattern to numpy arrays. The difficulty is that numpy arrays cannot be hashed efficiently: a cache lookup is computationally expensive, and, due to small numerical errors, many cache comparisons fail for identical computations.
The time-stamp mechanism of make makes it robust to these problems. As long as numpy arrays (or any complex mutable objects) are created through a function decorated by make, the cache lookup will work:
We define two functions: the first takes a number as an argument and outputs an array, which is used by the second one. We decorate both functions with make, persisting the output in numpy files:
>>> import numpy as np
>>> from joblib.make import NumpyFile
>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
... def f(x):
...     print 'A long-running calculation, with parameter', x
...     return np.hamming(x)
>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/g.npy'))
... def g(x):
...     print 'A second long-running calculation, using f(x)'
...     return np.vander(x)

If we call the function g with the array created by the same call to f, g is not re-run:
>>> a = f(3)
A long-running calculation, with parameter 3
>>> a
array([ 0.08,  1.  ,  0.08])
>>> f(3)
array([ 0.08,  1.  ,  0.08])
>>> b = g(a)
A second long-running calculation, using f(x)
>>> b2 = g(a)
>>> b2
array([[ 0.0064,  0.08  ,  1.    ],
       [ 1.    ,  1.    ,  1.    ],
       [ 0.0064,  0.08  ,  1.    ]])
>>> np.allclose(b, b2)
True

This works even if the input parameter to g is not the same object, as long as it comes from the same call to f:
>>> a2 = f(3)
>>> b3 = g(a2)
>>> np.allclose(b, b3)
True

Note that a and a2 are not the same object even though they are numerically equivalent:
>>> a2 is a
False
>>> np.allclose(a2, a)
True
Only the last result is cached. As a consequence, if you call the same function with alternating values, it will be rerun:
>>> @make(cachedir=cachedir, output=None)
... def f(x):
...     print 'Running f(%s)' % x
>>> f(1)
Running f(1)
>>> f(2)
Running f(2)
>>> f(1)
Running f(1)
Workaround: you can register the function under different names, with different persistence if needed:
>>> def f(x):
...     print 'Running f(%s)' % x
>>> def g(x):
...     return make(func=f, name=repr(x), cachedir=cachedir,
...                 output=None)(x)
>>> g(1)
Running f(1)
>>> g(2)
Running f(2)
>>> g(1)
A function's cache is identified by the function's name. Thus if you give the same name to different functions, their caches will override each other's, and you will get unwanted re-runs:
>>> @make(cachedir=cachedir, output=None)
... def f(x):
...     print 'Running f(%s)' % x
>>> g = f
>>> @make(cachedir=cachedir, output=None)
... def f(x):
...     print 'Running a different f(%s)' % x
>>> f(1)
Running a different f(1)
>>> g(1)
Running f(1)
>>> f(1)
Running a different f(1)
>>> g(1)
Running f(1)
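If renaming the functions is not an option, a possible workaround (a sketch reusing the name argument shown above) is to register each function under an explicit, distinct cache name:

>>> def f(x):
...     print 'Running the first f(%s)' % x
>>> f1 = make(func=f, name='f_one', cachedir=cachedir)
>>> def f(x):
...     print 'Running the second f(%s)' % x
>>> f2 = make(func=f, name='f_two', cachedir=cachedir)
>>> f1(1)
Running the first f(1)
>>> f2(1)
Running the second f(1)
>>> f1(1)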
Beware that all lambda functions have the same name:
>>> def my_print(x):
...     print x
>>> f = make(func=lambda : my_print(1), cachedir=cachedir)
>>> g = make(func=lambda : my_print(2), cachedir=cachedir)
>>> f()
1
>>> g()
2
>>> f()
1
Thus to use lambda functions reliably, you have to specify the name used for caching:
>>> f = make(func=lambda : my_print(1), cachedir=cachedir, name='f')
>>> g = make(func=lambda : my_print(2), cachedir=cachedir, name='g')
>>> f()
1
>>> g()
2
>>> f()
make cannot be used on objects more complex than a function, e.g. an object with a __call__ method.
make cannot track changes made outside the functions it decorates. When mutable objects (such as numpy arrays) are modified outside of a function decorated by make, the changes go unnoticed:
>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
... def f(x):
...     return np.array(x)
>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/g.npy'))
... def g(x):
...     print "Running g(%s)" % x
...     return x**2
>>> a = f([1])
>>> a
array([1])
>>> b = g(a)
Running g([1])
>>> a *= 2
>>> b = g(a)
>>> b
array([1])
This is why, for better reliability, you should modify objects only in functions decorated by make: do not break the chain of trust.
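For instance, here is a sketch of the safe pattern, using a hypothetical make-decorated scale function instead of the in-place *= above; since the new array is created through a decorated function, g can track it and re-runs as expected:

>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/scale.npy'))
... def scale(x, factor):
...     # The new array is created inside a make-decorated function,
...     # so its history is tracked by the cache
...     return x*factor
>>> c = f([3])
>>> d = g(scale(c, 2))
Running g([6])
>>> d
array([36])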
Persisting can have side-effects:
>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
... def f(x):
...     return x
>>> f(1)
1
>>> f(1)
array(1)
In the above lines, the returned value is saved as a numpy file, and thus restored in the second call as an array.
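If preserving the exact type matters, you can choose a persister that round-trips it, for instance the PickleFile persister introduced at the beginning of this page (a sketch with a hypothetical function h):

>>> @make(cachedir=cachedir, output=PickleFile(cachedir+'/h_output'))
... def h(x):
...     # Pickle preserves the original Python type
...     return x
>>> h(1)
1
>>> h(1)
1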
Let us not forget to clean our cache dir once we are finished:
>>> import shutil
>>> shutil.rmtree(cachedir)
Decorate a function for lazy re-evaluation.
Parameters:
    func : a callable, optional
    output : persisters, optional
    cachedir : string, optional
    debug : boolean, optional
    name : string, optional
    force : boolean, optional

Returns:
    The decorated function if func is given; a decorator object otherwise.
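As the examples on this page illustrate, make can be used either as a decorator, or called directly with func to wrap an existing function:

>>> @make(cachedir=cachedir)
... def incr(x):
...     return x + 1
>>> def decr(x):
...     return x - 1
>>> decr = make(func=decr, cachedir=cachedir)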
Persist the data using a ‘.npy’ or ‘.npz’ file.
Persists the data using a nifti file.
Requires PyNifti to be installed.
header is the optional nifti header.
dtype is a numpy dtype and is used to force the loading of the results with a certain dtype. Useful when the types understood by nifti are not complete enough for your purpose.
Persists the data using a memmapped nifti file.
header is the optional nifti header.
dtype is a numpy dtype and is used to force the loading of the results with a certain dtype. Useful when the types understood by nifti are not complete enough for your purpose.
A persistence object inherits from joblib.make.Persister and exposes a save method, accepting the data as an argument, and a load method, returning the data. The filename is usually set in the initializer.
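For instance, here is a minimal sketch of a custom persister following this interface; the TextFile class and its repr-based format are hypothetical, for illustration only:

>>> from joblib.make import Persister
>>> class TextFile(Persister):
...     def __init__(self, filename):
...         # The filename is set in the initializer, as usual
...         self._filename = filename
...     def save(self, data):
...         # Store a textual representation; only suitable for
...         # simple literals that can be rebuilt with eval
...         open(self._filename, 'w').write(repr(data))
...     def load(self):
...         return eval(open(self._filename).read())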
Objects are tracked by their Python id. The make decorator stores information on the history of each object in the cache for the different functions, and re-runs a function only if the objects passed to it are newer or different than the objects used in the previous run, or if it cannot determine the history of these objects.