The make function decorator

Use case

The make function decorator saves the output of a function and runs the function again only if its code or its input arguments have changed. Unlike the memoize decorator, it works by explicitly saving the output to a file in a specified format (which makes the output potentially useful for later work), and it is designed to work with non-hashable input and output data types such as numpy arrays.

A simple example:

First we create a temporary directory, for the cache:

>>> from tempfile import mkdtemp
>>> cachedir = mkdtemp()

>>> from joblib.make import make, PickleFile

Then we define our function, specifying its cache directory and stating that it persists its output using a pickle file in the cache directory:

>>> @make(cachedir=cachedir, output=PickleFile(cachedir+'/f_output'))
... def f(x):
...     print 'Running f(%s)' % x
...     return x

When we call this function twice with the same argument, it is not executed the second time, and the output is loaded from the pickle file:

>>> print f(1)
Running f(1)
1
>>> print f(1)
1

However, when we call it a third time, with a different argument, the output gets recomputed:

>>> print f(2)
Running f(2)
2
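The behavior shown above can be approximated in a few lines of plain Python. The following is only a sketch under stated assumptions, not joblib's implementation: the name make_sketch, the file name f_output.pkl, and the choice to compare the repr of the arguments and a hash of the function's bytecode are all hypothetical. It persists only the last result and re-runs the function when the argument or the code fingerprint differs from what was saved.

```python
import hashlib
import os
import pickle
import shutil
import tempfile


def make_sketch(output_path):
    """Hypothetical stand-in for make: persist only the last result."""
    def decorator(func):
        code = func.__code__
        # Crude fingerprint of the function's code (bytecode plus constants);
        # it ignores changes in helpers that the function calls.
        fingerprint = hashlib.sha1(
            code.co_code + repr(code.co_consts).encode("utf-8")
        ).hexdigest()

        def wrapper(*args):
            key = (fingerprint, repr(args))
            if os.path.exists(output_path):
                with open(output_path, "rb") as fh:
                    saved_key, saved_result = pickle.load(fh)
                if saved_key == key:
                    return saved_result  # nothing changed: reload from disk
            result = func(*args)
            with open(output_path, "wb") as fh:
                pickle.dump((key, result), fh)  # overwrite the previous result
            return result
        return wrapper
    return decorator


cachedir = tempfile.mkdtemp()
calls = []  # records the arguments of the calls that really executed


@make_sketch(os.path.join(cachedir, "f_output.pkl"))
def f(x):
    calls.append(x)
    return x


results = [f(1), f(1), f(2)]  # the second f(1) is served from the pickle file
shutil.rmtree(cachedir)
```

Because the output file is overwritten on every real run, this sketch also shares the "only the last result is cached" limitation described in the Gotchas section.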

Comparison with memoize

The memoize decorator caches in memory all the inputs and outputs of a function call. It can thus avoid running the same function twice, with very little overhead. However, it compares input objects with those in the cache on each call, so for big objects the overhead becomes large. Moreover, this approach does not work with numpy arrays, or with other objects subject to non-significant fluctuations. Finally, using memoize with large objects consumes a lot of memory, and even though memoize can persist to disk, the resulting files are not easy to load in other software.

In short, memoize is best suited for functions with “small” input and output objects, whereas make is best suited for functions with complex input and output objects.
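For contrast, here is a minimal sketch of the memoize pattern as described above. It is an illustration only: memoize_sketch is a hypothetical name, not the decorator the text refers to. Every input and output stays in memory, and each call compares its arguments with all stored ones, which is where the cost for large inputs comes from.

```python
def memoize_sketch(func):
    """Hypothetical in-memory memoize: keeps every (inputs, output) pair."""
    cache = []  # list of (args, result) pairs, never evicted

    def wrapper(*args):
        # Compare the arguments with every cached entry; the cost grows
        # with both the number of entries and the size of the inputs.
        for cached_args, cached_result in cache:
            if cached_args == args:
                return cached_result
        result = func(*args)
        cache.append((args, result))
        return result

    wrapper.cache = cache
    return wrapper


runs = []  # records the arguments of the calls that really executed


@memoize_sketch
def square(x):
    runs.append(x)
    return x * x


values = [square(2), square(3), square(2)]  # the last call hits the cache
```

Unlike make, this keeps results for alternating argument values, at the price of holding every input and output in memory.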

Using with numpy

The original motivation behind the make function decorator was to be able to use a memoize-like pattern on numpy arrays. The difficulty is that numpy arrays cannot be hashed well: a cache lookup is computationally expensive, and, due to small numerical errors, many cache comparisons fail for identical computations.

The time-stamp mechanism of make makes it robust to these problems. As long as numpy arrays (or any complex mutable objects) are created through a function decorated by make, the cache lookup will work.

An example

We define two functions, the first with a number as an argument, outputting an array, used by the second one. We decorate both functions with make, persisting the output in numpy files:

>>> import numpy as np
>>> from joblib.make import NumpyFile

>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
... def f(x):
...     print 'A long-running calculation, with parameter', x
...     return np.hamming(x)

>>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/g.npy'))
... def g(x):
...     print 'A second long-running calculation, using f(x)'
...     return np.vander(x)

If we call the function g with the array created by the same call to f, g is not re-run:

>>> a = f(3)
A long-running calculation, with parameter 3
>>> a
array([ 0.08,  1.  ,  0.08])
>>> f(3)
array([ 0.08,  1.  ,  0.08])
>>> b = g(a)
A second long-running calculation, using f(x)
>>> b2 = g(a)
>>> b2
array([[ 0.0064,  0.08  ,  1.    ],
       [ 1.    ,  1.    ,  1.    ],
       [ 0.0064,  0.08  ,  1.    ]])
>>> np.allclose(b, b2)
True

This works even if the input parameter to g is not the same object, as long as it comes from the same call to f:

>>> a2 = f(3)
>>> b3 = g(a2)
>>> np.allclose(b, b3)
True

Note that a and a2 are not the same object even though they are numerically equivalent:

>>> a2 is a
False
>>> np.allclose(a2, a)
True
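One way to picture how a lookup can succeed for a2 even though it is a different object from a is to identify objects by where they came from rather than by their content. The sketch below is an assumption about the general idea, not a description of joblib's internals: the names tracked, _provenance and _describe are hypothetical, plain lists stand in for numpy arrays, and copy.copy stands in for reloading a result from disk.

```python
import copy
import time

_provenance = {}  # id(obj) -> (producer name, argument key, time stamp)
_keepalive = []   # hold references so that id() values are never reused


def _describe(arg):
    # Objects produced by a tracked function are identified by their origin;
    # anything else falls back to its repr.
    return _provenance.get(id(arg), repr(arg))


def tracked(func):
    last = {}  # only the most recent call is remembered

    def wrapper(*args):
        key = tuple(_describe(a) for a in args)
        if last.get("key") == key:
            result = copy.copy(last["value"])  # stands in for a reload from disk
        else:
            last.update(key=key, stamp=time.time(), value=func(*args))
            result = last["value"]
        # Tag the returned object with the call that produced it.
        _provenance[id(result)] = (func.__name__, key, last["stamp"])
        _keepalive.append(result)
        return result
    return wrapper


calls = []


@tracked
def f(n):
    calls.append("f")
    return [0.5] * n  # a plain list stands in for a numpy array


@tracked
def g(x):
    calls.append("g")
    return [v * 2 for v in x]


a = f(3)
b = g(a)
a2 = f(3)   # not re-run: a fresh copy carrying the same provenance tag
b2 = g(a2)  # not re-run either, although a2 is not a
```

In this sketch the content of an argument is never inspected once it carries a provenance tag, so an in-place modification made outside a tracked function goes unnoticed.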

make as a persistence model and lazy-re-evaluation execution engine

Gotchas

  • Only the last result is cached. As a consequence, if you call the same function with alternating values, it will be rerun:

    >>> @make(cachedir=cachedir, output=None)
    ... def f(x):
    ...     print 'Running f(%s)' % x
    
    >>> f(1)
    Running f(1)
    >>> f(2)
    Running f(2)
    >>> f(1)
    Running f(1)
    

    Workaround: You can define different function names, with different persistence if needed:

    >>> def f(x):
    ...     print 'Running f(%s)' % x
    
    >>> def g(x):
    ...     return make(func=f, name=repr(x), cachedir=cachedir,
    ...                 output=None)(x)
    
    >>> g(1)
    Running f(1)
    >>> g(2)
    Running f(2)
    >>> g(1)
    
  • Function cache is identified by the function’s name. Thus if you give the same name to different functions, their caches will overwrite each other, and you will get unwanted re-runs:

    >>> @make(cachedir=cachedir, output=None)
    ... def f(x):
    ...     print 'Running f(%s)' % x
    
    >>> g = f
    
    >>> @make(cachedir=cachedir, output=None)
    ... def f(x):
    ...     print 'Running a different f(%s)' % x
    
    >>> f(1)
    Running a different f(1)
    >>> g(1)
    Running f(1)
    >>> f(1)
    Running a different f(1)
    >>> g(1)
    Running f(1)
    

    Beware that all lambda functions have the same name:

    >>> def my_print(x):
    ...     print x
    
    >>> f = make(func=lambda : my_print(1), cachedir=cachedir)
    >>> g = make(func=lambda : my_print(2), cachedir=cachedir)
    
    >>> f()
    1
    >>> g()
    2
    >>> f()
    1
    

    Thus to use lambda functions reliably, you have to specify the name used for caching:

    >>> f = make(func=lambda : my_print(1), cachedir=cachedir, name='f')
    >>> g = make(func=lambda : my_print(2), cachedir=cachedir, name='g')
    
    >>> f()
    1
    >>> g()
    2
    >>> f()
    
  • make cannot be used on objects more complex than a function, e.g. an object with a __call__ method.

  • make cannot track changes outside the functions it decorates. When mutable objects (such as numpy arrays) are modified outside those functions, make does not see the change:

    >>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
    ... def f(x):
    ...     return np.array(x)
    
    >>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/g.npy'))
    ... def g(x):
    ...     print "Running g(%s)" % x
    ...     return x**2
    
    >>> a = f([1])
    >>> a
    array([1])
    >>> b = g(a)
    Running g([1])
    >>> a *= 2
    >>> b = g(a)
    >>> b
    array([1])
    

    This is why, for more reliability, you should modify objects only in functions decorated by make: do not break the chain of trust.

  • Persisting can have side-effects:

    >>> @make(cachedir=cachedir, output=NumpyFile(cachedir+'/f.npy'))
    ... def f(x):
    ...     return x
    
    >>> f(1)
    1
    >>> f(1)
    array(1)
    

    In the above lines, the returned value is saved as a numpy file, and thus restored in the second call as an array.
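The two name-related gotchas above follow from standard Python behavior that can be checked without joblib. The short example below uses only the __name__ attribute of function objects: binding a second variable to a function does not change its name, and every lambda reports the same name.

```python
def f(x):
    return x


g = f  # a second variable bound to the same function object


def f(x):  # a different function that reuses the name 'f'
    return -x


# Both functions report the name 'f', so a cache keyed on the function
# name cannot tell them apart.
names = (f.__name__, g.__name__)

lam1 = lambda: 1
lam2 = lambda: 2

# Every lambda is named '<lambda>', whatever variable it is assigned to.
lambda_names = (lam1.__name__, lam2.__name__)
```

This is why an explicit name argument, as in the lambda example above, is needed whenever the function's own name is not unique.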

Optional arguments to make

Persistence objects

Persistence objects provided with make

Writing your own

How it works

Let us not forget to clean our cache dir once we are finished:

>>> import shutil
>>> shutil.rmtree(cachedir)