The make function decorator

Use case

The make function decorator is for saving the output of a function, and running the function again only if its code or its input arguments have changed. Unlike the memoize decorator, it works by explicitly saving the output to a file in a specified format (thus potentially useful for later work), and it is designed to work with non-hashable input and output data types such as numpy arrays.

A simple example:

First we create a temporary directory, for the cache:

>>> from tempfile import mkdtemp
>>> cachedir = mkdtemp()

>>> from joblib.make import make, PickleFile

Then we define our function, specifying its cache directory, and that it persists its output using a pickle file in the cache directory:

>>> @make(cachedir=cachedir, output=PickleFile(cachedir+'/f_output'))
... def f(x):
...     print 'Running f(%s)' % x
...     return x

When we call this function twice with the same argument, it does not get executed the second time, and the output is loaded from the pickle file:

>>> print f(1)
Running f(1)
1
>>> print f(1)
1

However, when we call it a third time, with a different argument, the output gets recomputed:

>>> print f(2)
Running f(2)
2

Comparison with memoize

The memoize decorator caches in memory all the inputs and outputs of a function call. It can thus avoid running the same function twice, with very small overhead. However, it compares the input objects with those in the cache on each call. As a result, for big objects there is a huge overhead. Moreover, this approach does not work with numpy arrays, or other objects subject to non-significant fluctuations. Finally, using memoize with large objects will consume all the memory, and even though memoize can persist to disk, the resulting files are not easy to load in other software.

In short, memoize is best suited for functions with “small” input and output objects, whereas make is best suited for functions with complex input and output objects.
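The trade-off can be illustrated with a minimal, hypothetical memoize decorator (a sketch, not joblib's implementation): it requires hashable arguments and keeps every result in memory, the two limitations that motivate make.

```python
import functools

def memoize(func):
    """Minimal in-memory memoize: caches every args -> result pair.

    Requires hashable arguments, and keeps all results in memory.
    """
    cache = {}
    calls = []  # record which argument tuples actually ran, for demonstration

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            calls.append(args)
            cache[args] = func(*args)
        return cache[args]

    wrapper.calls = calls
    return wrapper

@memoize
def square(x):
    return x * x

square(3)   # computed
square(3)   # served from the in-memory cache
square(4)   # computed; unlike make, the earlier entry for 3 is kept
```

Note that every distinct argument stays cached forever, which is exactly what becomes a problem with large outputs.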

Using with numpy

The original motivation behind the make function decorator was to be able to apply a memoize-like pattern to numpy arrays. The difficulty is that numpy arrays cannot be hashed well: a cache lookup is computationally expensive, and, due to small numerical errors, many cache comparisons fail for identical computations.
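Both problems can be seen with plain Python objects; a numpy array behaves like the list below (unhashable as a dictionary key), and floating-point results show the same last-bit fluctuations:

```python
import math

# Problem 1: mutable containers are not hashable, so they cannot be
# used directly as dictionary keys for a cache lookup.
try:
    hash([1.0, 2.0, 3.0])
    hashable = True
except TypeError:
    hashable = False

# Problem 2: mathematically identical computations can differ in the
# last bits, so exact equality comparisons fail spuriously.
a = 0.1 + 0.2
b = 0.3
exactly_equal = (a == b)            # False with IEEE-754 doubles
close_enough = math.isclose(a, b)   # True: equal within tolerance
```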

The time-stamp mechanism of make makes it robust to these problems. As long as numpy arrays (or any complex mutable objects) are created through a function decorated by make, the cache lookup will work.

An example

We define two functions: the first takes a number as an argument and outputs an array, which is then used by the second one. We decorate both functions with make, persisting their output in numpy files:

>>> import numpy as np
>>> from joblib.make import NumpyFile

>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
... def f(x):
...     print 'A long-running calculation, with parameter', x
...     return np.hamming(x)

>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/g.npy'))
... def g(x):
...     print 'A second long-running calculation, using f(x)'
...     return np.vander(x)

If we call the function g with the array created by the same call to f, g is not re-run:

>>> a = f(3)
A long-running calculation, with parameter 3
>>> a
array([ 0.08,  1.  ,  0.08])
>>> f(3)
array([ 0.08,  1.  ,  0.08])
>>> b = g(a)
A second long-running calculation, using f(x)
>>> b2 = g(a)
>>> b2
array([[ 0.0064,  0.08  ,  1.    ],
       [ 1.    ,  1.    ,  1.    ],
       [ 0.0064,  0.08  ,  1.    ]])
>>> np.allclose(b, b2)
True

This works even if the input parameter to g is not the same object, as long as it comes from the same call to f:

>>> a2 = f(3)
>>> b3 = g(a2)
>>> np.allclose(b, b3)
True

Note that a and a2 are not the same object even though they are numerically equivalent:

>>> a2 is a
False
>>> np.allclose(a2, a)
True

make as a persistence model and lazy-re-evaluation execution engine

Gotchas

  • Only the last result is cached. As a consequence, if you call the same function alternating between values, it is re-run each time:

    >>> @make(cachedir=cachedir, output=None)
    ... def f(x):
    ...     print 'Running f(%s)' % x
    
    >>> f(1)
    Running f(1)
    >>> f(2)
    Running f(2)
    >>> f(1)
    Running f(1)
    

    Workaround: you can wrap the function under different cache names, with different persistence if needed:

    >>> def f(x):
    ...     print 'Running f(%s)' % x
    
    >>> def g(x):
    ...     return make(func=f, name=repr(x), cachedir=cachedir,
    ...                 output=None)(x)
    
    >>> g(1)
    Running f(1)
    >>> g(2)
    Running f(2)
    >>> g(1)
    
  • The function cache is identified by the function’s name. Thus, if you give the same name to different functions, their caches will override each other, and you will get unwanted re-runs:

    >>> @make(cachedir=cachedir, output=None)
    ... def f(x):
    ...     print 'Running f(%s)' % x
    
    >>> g = f
    
    >>> @make(cachedir=cachedir, output=None)
    ... def f(x):
    ...     print 'Running a different f(%s)' % x
    
    >>> f(1)
    Running a different f(1)
    >>> g(1)
    Running f(1)
    >>> f(1)
    Running a different f(1)
    >>> g(1)
    Running f(1)
    

    Beware that all lambda functions have the same name:

    >>> def my_print(x):
    ...     print x
    
    >>> f = make(func=lambda : my_print(1), cachedir=cachedir)
    >>> g = make(func=lambda : my_print(2), cachedir=cachedir)
    
    >>> f()
    1
    >>> g()
    2
    >>> f()
    1
    

    Thus to use lambda functions reliably, you have to specify the name used for caching:

    >>> f = make(func=lambda : my_print(1), cachedir=cachedir, name='f')
    >>> g = make(func=lambda : my_print(2), cachedir=cachedir, name='g')
    
    >>> f()
    1
    >>> g()
    2
    >>> f()
    
  • make cannot be used on objects more complex than a function, e.g. an object with a __call__ method.

  • make cannot track changes made to mutable objects (such as numpy arrays) outside of the functions it decorates:

    >>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
    ... def f(x):
    ...     return np.array(x)
    
    >>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/g.npy'))
    ... def g(x):
    ...     print "Running g(%s)" % x
    ...     return x**2
    
    >>> a = f([1])
    >>> a
    array([1])
    >>> b = g(a)
    Running g([1])
    >>> a *= 2
    >>> b = g(a)
    >>> b
    array([1])
    

    This is why, for reliability, you should modify such objects only inside functions decorated by make: do not break the chain of trust.

  • Persisting can have side-effects:

    >>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
    ... def f(x):
    ...     return x
    
    >>> f(1)
    1
    >>> f(1)
    array(1)
    

    In the above lines, the returned value is saved as a numpy file, and thus restored in the second call as an array.
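This coercion can be reproduced with numpy's own save/load functions, independently of make; a minimal sketch using a temporary file:

```python
import os
import tempfile

import numpy as np

# Saving a plain Python scalar through a .npy file coerces it to a
# 0-d numpy array on reload: the side-effect described above.
fd, path = tempfile.mkstemp(suffix='.npy')
os.close(fd)

np.save(path, 1)
restored = np.load(path)

os.remove(path)
```

The value comes back numerically equal, but as an ndarray rather than an int, which can matter to downstream code that checks types.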

Let us not forget to clean our cache dir once we are finished:

>>> import shutil
>>> shutil.rmtree(cachedir)

Optional arguments to make

joblib.make.make(func=None, output=None, cachedir='cache', debug=False, name=None, raise_errors=False, force=False)

Decorate a function for lazy re-evaluation.

Parameters:

func : a callable, optional

If func is given, the function is returned decorated. Otherwise, the call to ‘make’ returns a decorator object that can be applied to a function.

output : persisters, optional

output can be a persistence object, or a list of persistence objects. This argument describes the mapping to the disk used to save the output of the decorated function.

cachedir : string, optional

Name of the directory used to store the function calls cache information.

debug : boolean, optional

If debug is true, joblib produces a verbose output that can be useful to understand why memoized functions are being re-evaluated.

name : string, optional

Identifier for the function used in the cache. If none is given, the function name is used. Changing the default value of this identifier is useful when you want to call the function several times with different arguments and store the results in different caches.

force : boolean, optional

If force is true, make tries to reload the results even if the input arguments have changed. This is useful to avoid a long recalculation when only minor details upstream in the pipeline have changed.

Returns:

The decorated function, if func is given; otherwise, a decorator object.

Persistence objects

Persistence objects provided with make

class joblib.make.PickleFile(filename)
Persist the data to a file using the pickle protocol.
class joblib.make.NumpyFile(filename, mmap_mode=None)

Persist the data to a file using a ‘.npy’ or ‘.npz’ file.

__init__(filename, mmap_mode=None)
mmap_mode is the mmap_mode argument to numpy.load. When given, memmapping is used to read the results. This can be much faster.
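The effect of mmap_mode can be seen with numpy.load directly; a sketch using a temporary file:

```python
import os
import tempfile

import numpy as np

fd, path = tempfile.mkstemp(suffix='.npy')
os.close(fd)

np.save(path, np.arange(10))

# With mmap_mode, np.load maps the file into memory instead of
# reading it eagerly; large results are then paged in on demand.
result = np.load(path, mmap_mode='r')
is_memmap = isinstance(result, np.memmap)
total = int(result.sum())

del result  # release the mapping before removing the file
os.remove(path)
```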
class joblib.make.NiftiFile(filename, header=None, dtype=None)

Persists the data using a nifti file.

Requires PyNifti to be installed.

__init__(filename, header=None, dtype=None)

header is the optional nifti header.

dtype is a numpy dtype and is used to force the loading of the results with a certain dtype. Useful when the types understood by nifti are not complete enough for your purpose.

class joblib.make.MemMappedNiftiFile(filename, header=None, dtype=None)

Persists the data using a memmapped nifti file.

__init__(filename, header=None, dtype=None)

header is the optional nifti header.

dtype is a numpy dtype and is used to force the loading of the results with a certain dtype. Useful when the types understood by nifti are not complete enough for your purpose.

Writing your own

A persistence object inherits from joblib.make.Persister and exposes a save method, accepting the data as an argument, and a load method, returning the data. The filename is usually set in the initializer.
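As an illustration, here is a hypothetical JSONFile persister following the interface described above. In actual use it would inherit from joblib.make.Persister; object is used here so the sketch is self-contained:

```python
import json
import os
import tempfile

class JSONFile(object):
    """A hypothetical persister storing the output as JSON.

    Exposes a save method accepting the data, and a load method
    returning it; the filename is set in the initializer.
    """

    def __init__(self, filename):
        self.filename = filename

    def save(self, data):
        with open(self.filename, 'w') as f:
            json.dump(data, f)

    def load(self):
        with open(self.filename) as f:
            return json.load(f)

# Round-trip check:
fd, path = tempfile.mkstemp(suffix='.json')
os.close(fd)
persister = JSONFile(path)
persister.save({'x': [1, 2, 3]})
restored = persister.load()
os.remove(path)
```

Such a persister is only suitable for JSON-serializable outputs; the point is the two-method interface, not the storage format.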

How it works

Objects are tracked by their Python id. The make decorator stores, in the cache, information on the history of each object across the different functions, and re-runs a function only if the objects given to it are newer or different than those used in the previous run, or if it cannot determine the history of these objects; otherwise the persisted results are reloaded.
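The idea can be sketched with a toy registry keyed by id and a monotonically increasing timestamp. This is an illustration of the principle only, not joblib's actual implementation (which also tracks function code and persists across processes); the names _stamp and lazy are hypothetical:

```python
import itertools

_counter = itertools.count()
_history = {}  # id(obj) -> timestamp at which the object was produced

def _stamp(obj):
    """Record obj as freshly produced and return it (hypothetical helper)."""
    _history[id(obj)] = next(_counter)
    return obj

def lazy(func):
    """Re-run func only when its argument's recorded timestamp changes.

    A toy illustration of timestamp-based lazy re-evaluation for a
    single-argument function.
    """
    state = {'seen': None, 'result': None, 'runs': 0}

    def wrapper(arg):
        stamp = _history.get(id(arg))
        # Re-run on unknown history, or on a different/newer object.
        if stamp is None or stamp != state['seen']:
            state['runs'] += 1
            state['result'] = _stamp(func(arg))
            state['seen'] = stamp
        return state['result']

    wrapper.state = state
    return wrapper

@lazy
def double(x):
    return [2 * v for v in x]

data = _stamp([1, 2, 3])   # pretend data came from a tracked function
r1 = double(data)
r2 = double(data)          # same tracked object: not re-run
data2 = _stamp([1, 2, 3])  # a new tracked object, even if equal in value
r3 = double(data2)         # re-run: different history
```

Because untracked objects have no recorded history, the wrapper conservatively re-runs for them, mirroring the "cannot determine the history" case above.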