Application Developers: Managing Units

Sandboxes

During the life of a client connection, your application should create and use a Sandbox to manage the set of "live" Units. A Sandbox manages the in-memory lifecycle of Units: creation, identity, mutation, and destruction. Sandboxes route persistence operations on Units to the correct Storage Manager.

You can create Sandbox objects directly. They take a single argument, the top-level Arena object. Arenas also provide a convenience function, new_sandbox, which does this for you. The following lines are equivalent:

box = dejavu.Sandbox(myArena)

box = myArena.new_sandbox()
You might often choose the latter when you have a reference to the Arena object, and would rather avoid importing dejavu yet again just to obtain the Sandbox class.

Memorizing Units

When you create a Unit instance, it exists in isolation. There is no connection between that Unit and storage; your Unit will not be persisted, because Dejavu doesn't yet possess a reference to your Unit. To provide that link, you memorize your Unit (or rather, you tell your Sandbox to memorize it):

class Publisher(Unit):
    City = UnitProperty(unicode)

p = Publisher(ID='Walter J. Black')
box.memorize(p)

Memorization does several things. First, it places your new Unit into your Arena. That Unit instance will now be persisted by the appropriate Storage Manager. It can be recalled from storage when needed, using the built-in Expression syntax. It may have been given an ID (see Sequencing, below). Memorization also makes your Unit concrete; that is, your Unit will now possess a sandbox attribute. Units whose sandbox attribute is not set (is None) have no relationships, and their Unit Property triggers (if any) will not fire.

You may define special methods on your Units to provide start-of-life behaviors. If a Unit possesses an on_memorize method, it will be called after the Unit has been 'reserved' in storage and placed in the Sandbox cache.

Sequencing

Every Unit has one or more identifiers. The default ID property is of type int; however, you can override that to whatever type you like. As long as you provide your own identifier values for Units, nothing will break--you can memorize and recall Units without problems. However, if you memorize a Unit with an ID of None, the Sandbox may attempt to provide an ID for it.

The Unit base class possesses a sequencer attribute to help Sandboxes generate new IDs. The default value is an instance of UnitSequencerInteger, which examines all existing Units, finds the maximum integer ID, adds 1, and uses that value for the new ID. [StorageManagers are free to optimize any sequencer, whether builtin or custom, in order to take advantage of ID-generation tools in the database or other storage provider. All builtin database StorageManagers optimize UnitSequencerInteger in this way.]
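
The max-plus-one strategy described above can be sketched in plain Python. This is only an illustration of the idea, not Dejavu's implementation; the function name next_integer_id and the choice to start at 1 for an empty store are assumptions.

```python
def next_integer_id(existing_ids):
    """Return max(existing integer IDs) + 1.

    Hypothetical helper. IDs of None (not yet assigned) are skipped, and
    an empty store starts at 1; both choices are assumptions of this sketch.
    """
    ids = [i for i in existing_ids if i is not None]
    if not ids:
        return 1
    return max(ids) + 1


new_id = next_integer_id([1, 2, 7])
```

A scan like this touches every existing ID, which is presumably why a Storage Manager would rather delegate to the database's own ID generator.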

The other useful Sequencer is the base class UnitSequencer, which simply raises an error when asked to generate an ID. If you set ClassA.ID to a string or unicode type, you'll probably want to set ClassA.sequencer = dejavu.UnitSequencer(), and form ID values in your own code.

Recalling

Once you have memorized a Unit or two, you will probably want to recall them at some point. Sandboxes possess four member functions to accomplish this.

recall()

First, the appropriately named recall(cls, expr=None, inherit=False, **kwargs) function. This is the full-blown query method. As a first argument, you pass it the class (not the name of the class, but the actual class) of which you expect to retrieve instances. The second argument should be a lambda, or an instance of dejavu.logic.Expression, an object which encapsulates your specific query (see Querying). Alternately, you may supply keyword arguments, which will then be combined into an Expression for you. The following three examples are equivalent and all result in the same output:

>>> units = box.recall(Book, Year=1928)
>>> units = box.recall(Book, lambda x: x.Year == 1928)
>>> units = box.recall(Book, logic.Expression(lambda x: x.Year == 1928))
>>> [x.Title for x in units]
[u'The Giant Horse of Oz', u'Kai Lung Unrolls His Mat',
 u'Tarzan, The Lord of the Jungle']
If you do not supply an Expression or any keyword args, all Units of the given Unit class will be retrieved in a list.

If your Unit class defines an on_recall() method, it will be called when each Unit has been loaded from storage (at the end of the recall process). Once the unit is loaded into a Sandbox, however, on_recall will not be called; it's only called at the Sandbox/SM boundary. If on_recall raises UnrecallableError, the unit will not be yielded back to the caller, nor placed in the Sandbox cache.

Recalling multiple classes at once (JOINs)

In addition to providing a single class to recall, you have the option of providing a tree of classes, a nested set of UnitJoin(class1, class2, leftbiased=None) instances.

The "leftbiased" argument specifies how the results will be joined:

leftbiased | Join Type | Description | Operator
None | Inner Join | All related pairs of both classes will be returned. | & or +
True | Left Join | All related pairs of both classes will be returned. In addition, if any Unit in class1 has no match in class2, we return a single row with Unit1 and a "null Unit" (a Unit, all of whose properties are None). | <<
False | Right Join | All related pairs of both classes will be returned. In addition, if any Unit in class2 has no match in class1, we return a single row with a "null Unit" (a Unit, all of whose properties are None) and Unit2. | >>

Look hard? Fear not. There's a much easier way to join units than writing a big tree of UnitJoins. Use the &, <<, and >> operators directly with Unit classes:

tree = (Book << Publisher) & Author

This example will automatically produce a UnitJoin tree for you, with Book 'left joined' to Publisher, and then 'inner joined' to Author.
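
If the three join types are unfamiliar, the following sketch mimics them on plain Python lists. It does not use Dejavu at all: the join function, the sample tuples, and the use of None in place of a "null Unit" are all assumptions made for illustration.

```python
def join(left, right, related, leftbiased=None):
    """Pair items from two lists the way the three join types describe.

    related(a, b) decides whether two items match. None stands in for the
    "null Unit" here; that substitution is an assumption of this sketch.
    """
    # Inner join: every related pair.
    rows = [(a, b) for a in left for b in right if related(a, b)]
    if leftbiased is True:
        # Left join: also keep left-hand items that matched nothing.
        for a in left:
            if not any(related(a, b) for b in right):
                rows.append((a, None))
    elif leftbiased is False:
        # Right join: also keep right-hand items that matched nothing.
        for b in right:
            if not any(related(a, b) for a in left):
                rows.append((None, b))
    return rows


# (title, publisher id) and (publisher id, name): made-up sample data.
books = [("Book A", 1), ("Book B", 2), ("Book C", None)]
publishers = [(1, "Pub One"), (3, "Pub Three")]
same_publisher = lambda book, pub: book[1] == pub[0]

inner = join(books, publishers, same_publisher)
left = join(books, publishers, same_publisher, leftbiased=True)
right = join(books, publishers, same_publisher, leftbiased=False)
```

Here inner holds only the Book A pair, left adds Book B and Book C paired with None, and right adds Pub Three paired with None.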

When you provide multiple classes, the recall method returns a list of rows. Each row will be a list of units, one per class in the classes arg. The expr arg should be a logic.Expression which can evaluate all of the units in any given row at once.

for pub, book in box.recall(Publisher & Book, lambda p, b: p.ID == 4):
    print pub.ID, book.Title
This example will retrieve a series of [Publisher, Book] pairs. Note that all three constructs (the UnitJoins, the lambda arguments, and the resulting rows) have the same classes listed in order from left to right.

In database terminology, this technique performs a series of joins between each pair of classes in your UnitJoin tree. However, repeated units in the results will reference the same object; in the example above, each "pub" unit will be the same object, since we limited that expression to a single Publisher. So we might retrieve multiple (pub, book) pairs, but the first unit in each pair will be the same unit instance.

The relationships (joins) between each class are specified by Unit Associations.

xrecall()

Just like recall, but returns an iterator instead of a list. Use xrecall to load Units in a more lazy fashion.

unit()

The recall method can be verbose. When you want a one-liner and only expect a single Unit, use the unit(cls, expr=None, inherit=False, **kw) method of Sandboxes. Again, you pass the class of Units you wish to retrieve as the first argument. Then, supply a logic.Expression, or keyword arguments of the form "property_name=value". The method will form an equivalent Expression for you from the keyword args. For example:

>>> book = box.unit(Book, ID=1)
>>> if book:
...     print book.Title
u'Ladies in Hades'
If no Unit can be found that matches the criteria, None is returned. If multiple Units match the criteria, only the first one is returned (although the rest are probably loaded into memory).

"Magic recaller" methods

For each class you have registered with your Arena, the Sandbox will have a "magic recaller" method of the same name, to make single-unit lookups easier. Instead of the above example for box.unit(), we might just as well have written:

>>> book = box.Book(1)
Note that for the magic methods, unlike for the unit method, you may pass identifiers as positional arguments. If the class has multiple identifiers, you should probably stick to keyword arguments; otherwise, you must remember the order of the class' identifiers tuple.

Forgetting and Repressing

To forget a Unit is to destroy it forever. You have two options for forgetting Units: you can call sandbox.forget(unit) or the simpler version, unit.forget(). Either of these will clear the Unit from the Sandbox's cache, and the Sandbox will tell the appropriate Storage Manager to destroy the stored Unit data. If a Unit has not yet been memorized, you do not need to forget it.

In some circumstances, you may wish to only clear the Unit from the Sandbox without destroying it. You can do this by calling either sandbox.repress(unit) or the simpler version, unit.repress().

You may define special methods on your Units to provide end-of-life behaviors. If a Unit possesses an on_forget method, it will be called after the Unit has been destroyed. If a Unit possesses an on_repress method, it will be called before the Unit has been repressed. I'm sure there was a good reason for this disparity, but I've forgotten (or perhaps repressed) it.

Be aware that many of the things you put in an on_repress handler might also need to go into on_forget. The one doesn't call the other automatically, because sometimes you don't want the same behavior.

Flushing Sandboxes

When the client connection has closed, you should flush the Sandbox caches. In general, a single call to sandbox.flush_all() will do the trick. Notice that flush_all() calls any on_repress() handler for each Unit in the Sandbox.

Warning: You should NOT call flush_all() indiscriminately. You will rapidly get into concurrency trouble. You can stay out of trouble by following an easy rule: call flush_all only at the end of your client's connection.

If you want the "hard" rule, here it is. If you flush any Unit class, then any instances of that class hanging around need to be re-recalled. If you don't re-recall them, then any changes you make to the old instance won't be saved on the next flush, since flushing only iterates through units in the sandbox. For example, if you do this:

box = arena.new_sandbox()
thing = box.unit(Thing)
box.flush_all()
thing.Size += 12
box.flush_all()

...then the change you make to "Size" won't be persisted, since the Thing object is no longer in the sandbox--it's been flushed out. You have to recall it somehow to get it stuck in the sandbox again. You could go through all kinds of gyrations to save the old units directly to storage, but don't bother. Just get new references to them and save yourself a lot of headache.
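
To see why the stale reference loses its change, here is a toy model of a cache in front of a store. ToySandbox and Thing are hypothetical stand-ins written for this sketch; the real Sandbox does far more, but the flush behavior modeled here follows the description above.

```python
class Thing(object):
    """Hypothetical stand-in for a Unit with a Size property."""

    def __init__(self, ident, size):
        self.ident = ident
        self.size = size


class ToySandbox(object):
    """Hypothetical stand-in for a Sandbox: a cache in front of a dict."""

    def __init__(self, store):
        self.store = store   # "persistent" data: {id: size}
        self.cache = {}      # live objects:      {id: Thing}

    def unit(self, ident):
        # Load from the store only when the object is not cached yet.
        if ident not in self.cache:
            self.cache[ident] = Thing(ident, self.store[ident])
        return self.cache[ident]

    def flush_all(self):
        # Save only what is currently cached, then empty the cache.
        for thing in self.cache.values():
            self.store[thing.ident] = thing.size
        self.cache.clear()


store = {1: 10}
box = ToySandbox(store)

stale = box.unit(1)
box.flush_all()
stale.size += 12      # stale is no longer in the cache
box.flush_all()
lost = store[1]       # the change was never saved

fresh = box.unit(1)   # re-recall to get a cached instance again
fresh.size += 12
box.flush_all()
saved = store[1]      # this time the change reached the store
```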

Views

Sandboxes provide a view(cls, attrs, expr=None, **kwargs) function. This works like recall, but returns values, rather than Units. Put simply, it yields all values for the given attribute(s) of the Unit class provided; each unit will yield a tuple of its values in the same order as the attrs sequence you provide. Providing an expr argument (either a lambda or an Expression object, see below), or keyword arguments, will filter the set of Units before obtaining the value tuples.

>>> v = sandbox.view(zoo.Animal, ['Name', 'Lifespan'])
>>> [row for row in v]    # or list(v), or iterate over v...
[('Leopard', 73.5),
 ('Slug', .75),
 ('Tiger', None),
 ('Lion', None),
 ('Bear', None),
 ('Ostrich', 103.2),
 ('Centipede', None),
 ('Emperor Penguin', None),
 ('Adelie Penguin', None),
 ('Millipede', None)
 ]

In this example (pulled from the "zoo" test suite), we grab the Name and Lifespan for each Animal. The attrs argument must always be an iterable.

Sandboxes also provide a distinct(cls, attrs, expr=None, **kwargs) function. This works just like view, but returns distinct tuples rather than all tuples.

The distinct function can also be used as a count function by passing attrs = cls.identifiers. Sandboxes provide a count(cls, expr=None, **kwargs) method which does just this.
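
The relationship between view, distinct, and count can be sketched with ordinary Python objects. The three functions below are stand-ins that take a list of objects instead of a Unit class, the Legs attribute is made-up sample data, and keeping first-seen order in distinct is an assumption.

```python
from collections import namedtuple

# Made-up sample data; only the shape matters.
Animal = namedtuple("Animal", ["ID", "Name", "Legs"])
animals = [
    Animal(1, "Leopard", 4),
    Animal(2, "Tiger", 4),
    Animal(3, "Ostrich", 2),
]


def view(units, attrs, expr=None):
    """Yield one tuple of attribute values per unit that passes expr."""
    for unit in units:
        if expr is None or expr(unit):
            yield tuple(getattr(unit, name) for name in attrs)


def distinct(units, attrs, expr=None):
    """Like view, but each tuple appears only once (first-seen order)."""
    seen = set()
    result = []
    for row in view(units, attrs, expr):
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def count(units, expr=None):
    """Count units by counting their distinct identifier tuples."""
    return len(distinct(units, ["ID"], expr))


legs = list(view(animals, ["Legs"]))
unique_legs = distinct(animals, ["Legs"])
four_legged = count(animals, lambda a: a.Legs == 4)
```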

Dejavu 1.5 adds two new sandbox methods: range and sum. The range(cls, attr, expr, **kw) method takes a single attribute and returns the closed interval [min(attr), ..., max(attr)]. The sum(cls, attr, expr, **kw) method also takes a single attribute and returns the sum of all non-None values for the given cls.attr.
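
A rough sketch of the two aggregates over a plain list of values follows. The helper names are hypothetical, and while the text above states that sum skips None values, skipping them when computing the range bounds is an assumption of this sketch.

```python
# Lifespan values in the spirit of the zoo example; None means "unknown".
lifespans = [73.5, 0.75, None, None, None, 103.2, None]


def attr_bounds(values):
    """Return (min, max) of the non-None values, or None if there are none."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return (min(present), max(present))


def attr_sum(values):
    """Sum the values, skipping None."""
    return sum(v for v in values if v is not None)


bounds = attr_bounds(lifespans)
total = attr_sum(lifespans)
```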

Transactions

Dejavu supports distributed transactions at all levels (however, it does not yet use distributed two-phase commit! That's planned for 1.6). Most often, your code will call transaction methods on the current sandbox object. When you call sandbox.start(isolation=None), you are telling Dejavu to begin a transaction on all known stores. Note that this will start a transaction on all stores regardless of which classes are registered for each store; Dejavu has no way of knowing beforehand which classes or stores your next statements will affect.

The isolation argument to the start method is very important. It determines the "isolation level" of the transaction; that is, the degree to which the current transaction can see changes made by a concurrent transaction. The ANSI/ISO SQL92 standard defines four isolation levels, based on three phenomena:

Phenomena \ Level | Read Uncommitted | Read Committed | Repeatable Read | Serializable
Dirty Read        | Possible         | Not possible   | Not possible    | Not possible
Fuzzy Read        | Possible         | Possible       | Not possible    | Not possible
Phantom           | Possible         | Possible       | Possible        | Not possible

A "dirty read" occurs when TX 1 writes a value and TX 2 is able to read the change before TX 1 commits.

A "fuzzy read" (or "nonrepeatable read") occurs when TX 1 reads a value, TX 2 changes that value and commits, and TX 1 obtains the new value when it re-reads.

A "phantom" occurs when TX 1 reads a set, TX 2 adds to the set, and TX 1 obtains the new rows when it re-reads.

Dejavu supports a variety of stores, and not every store supports every isolation level. See the comparison chart for details. Some stores, like shelve and RAM, don't support transactions at all. In addition, different stores "prevent" the above phenomena in different ways. In some cases, the phenomenon is simply not allowed to exist. In other cases, the phenomenon raises an error immediately. In still other cases, the phenomenon is prevented by waiting (up to a timeout) until one of the offending transactions completes.
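
When choosing a level, it can help to treat the SQL92 table above as data. The sketch below encodes it as a dictionary and picks the weakest level that rules out the phenomena you care about. The level names are plain strings chosen for this example; they are not necessarily the values Dejavu's start method accepts.

```python
# Levels ordered from weakest to strongest, per the SQL92 table.
LEVELS = [
    "READ UNCOMMITTED",
    "READ COMMITTED",
    "REPEATABLE READ",
    "SERIALIZABLE",
]

# Phenomena that remain possible at each level.
POSSIBLE = {
    "READ UNCOMMITTED": {"dirty read", "fuzzy read", "phantom"},
    "READ COMMITTED": {"fuzzy read", "phantom"},
    "REPEATABLE READ": {"phantom"},
    "SERIALIZABLE": set(),
}


def weakest_level_preventing(unwanted):
    """Return the first (weakest) level at which none of `unwanted` can occur."""
    unwanted = set(unwanted)
    for level in LEVELS:
        if not (POSSIBLE[level] & unwanted):
            return level


choice = weakest_level_preventing(["dirty read", "fuzzy read"])
```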

Once you have finished executing statements, you should call flush_all, which will call commit() for you. Alternately, you may call rollback() if you need to ignore your changes.

If you're using a store that supports implicit transactions (also sometimes called "autocommit"), you can skip calling start by setting the Database attribute implicit_trans to True (it's False by default). This can be done in code or config. See the comparison chart for details.

Querying

When you retrieve Units, you often don't want to load the entire set for a given class. In Dejavu, you filter the set according to the UnitProperty attributes for each object. Naturally, there must be a way to express the filter you intend. Dejavu actually provides three ways, all in the dejavu.logic module: Expression, filter, and comparison.

The Expression class

Regardless of which technique you use to express your filter, you're going to end up with a logic.Expression object. You can build an Expression directly, passing a single lambda as an argument:

>>> from dejavu import logic
>>> import datetime
>>> f = lambda x: x.Date >= datetime.date(2004, 3, 1)
>>> e = logic.Expression(f)
>>> e
logic.Expression(lambda x: x.Date >= datetime.date(2004, 3, 1))
Neat, eh? I worked hard on that __repr__. ;)

It may be obvious, but we'll be explicit, here. The lambda which you pass into an Expression must possess a positional argument, which will always be bound to a Unit instance. In the example above, it's named 'x', but you can use any name you like. Using lambdas as a base means that we can simply call e(unit), and receive a boolean value indicating whether our Unit "passes the test". Attribute lookups on our 'x' object will apply to Unit Properties for that Unit object. That is, x.Date becomes unit.Date.
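
The "call it and get a boolean" behavior is easy to try with the bare lambda, before any Expression is involved. Sale is a hypothetical stand-in for a Unit class with a Date property.

```python
import datetime


class Sale(object):
    """Hypothetical stand-in for a Unit with a Date property."""

    def __init__(self, date):
        self.Date = date


f = lambda x: x.Date >= datetime.date(2004, 3, 1)

old = Sale(datetime.date(2003, 12, 31))
new = Sale(datetime.date(2004, 6, 15))

# x.Date inside the lambda is simply attribute lookup on the object passed in.
results = [f(old), f(new)]
```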

You can also do fancier things with Expressions (although the vast majority of the time, you won't need to in order to use Dejavu):

>>> logic.Expression(lambda x, y, z: "Dave" in x.Name and y.Age > 65)
logic.Expression(lambda x, y, z: ('Dave' in x.Name) and (y.Age > 65))
>>> logic.Expression(lambda *units, **kw: units and
...                  (units[0].Width > units[0].Height or
...                   units[0].Color in kw['Colors']))
logic.Expression(lambda *units, **kw: (units) and
                 ((units[0].Width > units[0].Height) or
                  (units[0].Color in kw['Colors'])))
>>> 

Early binding

What is not obvious from the above code snippet is perhaps the most important aspect of Expressions: any globals or cell references (from closures) in the supplied lambda get bound early. Compare the following disassemblies:

>>> import dis
>>> dis.dis(f)
  1           0 LOAD_FAST                0 (x)
              3 LOAD_ATTR                1 (Date)
              6 LOAD_GLOBAL              2 (datetime)
              9 LOAD_ATTR                3 (date)
             12 LOAD_CONST               1 (2004)
             15 LOAD_CONST               2 (3)
             18 LOAD_CONST               3 (1)
             21 CALL_FUNCTION            3
             24 COMPARE_OP               5 (>=)
             27 RETURN_VALUE        
>>> dis.dis(e.func)
  1           0 LOAD_FAST                0 (x)
              3 LOAD_ATTR                1 (Date)
              6 LOAD_CONST               6 (datetime.date(2004, 3, 1))
              9 COMPARE_OP               5 (>=)
             12 RETURN_VALUE        
As you can see, the function itself references the global 'datetime' module. Once we wrap it in the Expression, however, it becomes a constant! Thanks to Raymond Hettinger for inspiring this solution [1]. Early binding, however, implies two consequences:

First, any globals or cell references must be present in the lambda's scope when it is passed into Expression(). This is the norm and shouldn't require too much thought from you when you write Expressions. In the example above, we simply imported datetime as you would expect.

Second, any globals or cell references must also be present in the logic module's globals when the Expression is unpickled. Pickling occurs when Expressions are sent over sockets, and also if Expressions are themselves persisted to storage (for example, see Unit Engines, below). This means your application should inject globals into the logic module. Note that the logic module already tries to import datetime, fixedpoint and decimal.
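
The practical difference between late and early binding can be seen in plain Python. The sketch below captures a value with a default argument, which is a different mechanism from the bytecode rewriting that Expression performs; only the observable effect is comparable.

```python
threshold = 10

# Late bound: the global name is looked up on every call.
late = lambda x: x > threshold

# Early bound: the current value is captured when the lambda is defined.
early = lambda x, _t=threshold: x > _t

before = (late(12), early(12))

threshold = 100   # rebinding the global affects only the late-bound lambda

after = (late(12), early(12))
```

An early-bound test keeps meaning what it meant when it was written, which matters once the test is pickled and run somewhere else.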

External functions within Expressions

Dejavu provides additional functions which can be used in Expressions. For example, you can construct an Expression like:

logic.Expression(lambda x: x.Size < 3 and x.Date > dejavu.today())
In this example, the today() function breaks convention and is actually bound late. That is, if you construct this Expression now and use it six months later, the value of today() will change. Storage Managers "know about" these dejavu functions, and can use them to build more appropriate queries. Here are the functions supplied by the dejavu module:

Function             | Late bound? | Description
icontains(a, b)      |             | Case-insensitive test b in a. Note the operand order.
icontainedby(a, b)   |             | Case-insensitive test a in b. Note the operand order.
istartswith(a, b)    |             | True if a starts with b (case-insensitive), False otherwise.
iendswith(a, b)      |             | True if a ends with b (case-insensitive), False otherwise.
ieq(a, b)            |             | True if a == b (case-insensitive), False otherwise.
year(value)          |             | The year attribute of a date. If value is None, return None.
month(value)         |             | The month attribute of a date. If value is None, return None.
day(value)           |             | The day attribute of a date. If value is None, return None.
now()                | Y           | datetime.datetime.now()
today()              | Y           | datetime.date.today()
iscurrentweek(value) | Y           | If value is in the current week, return True, else False.

It is possible for you, the application developer, to define your own external functions by injecting them into the globals of the logic module. For example, logic.odd = lambda unit: (unit.num % 2) == 1. However, because the builtin Storage Managers are unaware of your new functions, they will not be able to optimize their use; instead, they will simply retrieve a larger set of objects from storage, evaluate each one against the function you provide, and return those Units which match your function. This isn't necessarily a bad thing; it provides the same functionality as if you wrote the test inline within your own code. By making that test a logic function, you also allow it to be stored in Engine rules (see Unit Engines, below). You may, of course, create your own Storage Manager which understands your external function (and can translate its logic into, say, SQL), and thereby achieve end-to-end functionality.

Using filter to form Expressions

The logic module also provides convenient methods to create common types of Expression objects via the filter and comparison factory functions.

The filter(**kwargs) function produces an Expression by taking the keyword arguments you supply, and rewriting them in lambda form. The only operator allowed is therefore the equals '==' operator. For example:

>>> logic.filter(Type='Cat', Mutation='Atomic')
logic.Expression(lambda x: (x.Type == 'Cat') and (x.Mutation == 'Atomic'))
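
What filter does with its keyword arguments can be approximated in a few lines. The make_filter helper below is hypothetical and returns an ordinary function rather than an Expression, so it shows only the logic, not the lambda rewriting.

```python
def make_filter(**kwargs):
    """Return a predicate that is true when every named attribute equals
    the given value (an 'and' of '==' tests)."""
    items = sorted(kwargs.items())

    def predicate(unit):
        return all(getattr(unit, name) == value for name, value in items)

    return predicate


class Pet(object):
    """Hypothetical stand-in for a Unit with Type and Mutation properties."""

    def __init__(self, Type, Mutation):
        self.Type = Type
        self.Mutation = Mutation


is_atomic_cat = make_filter(Type='Cat', Mutation='Atomic')
matches = [is_atomic_cat(Pet('Cat', 'Atomic')), is_atomic_cat(Pet('Cat', 'Plain'))]
```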

Using comparison to form Expressions

The comparison(attr, cmp_op, criteria) function allows you to form Expressions with dynamic operators. This can come in handy when you are constructing Expressions on the fly from user input. For example, a search page might prompt users for an attribute name, an operator, and an operand (the criteria).

Borrowing from opcode.cmp_op, the allowed values for our cmp_op argument are as follows:

Numeric Value (cmp_op) | Operator
0 | <
1 | <=
2 | ==
3 | !=
4 | >
5 | >=
6 | in
7 | not in
Most SM's only support the following with None:
8 | is
9 | is not

Here's an example of using comparison:

>>> logic.comparison('Name', 3, 'Mr. Kamikaze')
logic.Expression(lambda x: x.Name != 'Mr. Kamikaze')
Although the comparison function only allows a single comparison at a time, the resulting Expressions can be combined with the & and | operators (described below) to produce more complex Expressions.
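
The idea of selecting an operator by number can be sketched with the standard operator module. CMP_FUNCS and make_comparison are hypothetical names; the list order follows the table above, and the result is a plain function rather than an Expression.

```python
import operator

# Index positions follow the cmp_op table: 0 is <, 1 is <=, and so on.
CMP_FUNCS = [
    operator.lt, operator.le, operator.eq, operator.ne,
    operator.gt, operator.ge,
    lambda a, b: a in b,        # 6: in
    lambda a, b: a not in b,    # 7: not in
    operator.is_,               # 8: is
    operator.is_not,            # 9: is not
]


def make_comparison(attr, cmp_op, criteria):
    """Return a predicate comparing unit.<attr> to criteria with operator cmp_op."""
    func = CMP_FUNCS[cmp_op]
    return lambda unit: func(getattr(unit, attr), criteria)


class Person(object):
    """Hypothetical stand-in for a Unit with a Name property."""

    def __init__(self, Name):
        self.Name = Name


not_kamikaze = make_comparison('Name', 3, 'Mr. Kamikaze')
checks = [not_kamikaze(Person('Mr. Kamikaze')), not_kamikaze(Person('Dave'))]
```

Validating the attribute name and operator number before building the predicate would be wise when both come from user input.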

Combining Expressions

Expressions are combinable; by using the & operator, the two expressions are combined with an adjoining logical "and". For example:

>>> a = logic.Expression(lambda x: x.Size > 3)
>>> b = logic.Expression(lambda x: x.Size <= 15)
>>> c = a & b
>>> c
logic.Expression(lambda x: (x.Size > 3) and (x.Size <= 15))
The + operator works just like the & operator. The | operator combines the two Expressions with a logical 'or'.
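
Operator-based combination of this kind can be imitated with a small wrapper class. Predicate is a hypothetical class written for this sketch; unlike Expression, it simply chains function calls and does not merge the lambdas into a single function.

```python
class Predicate(object):
    """Hypothetical wrapper that lets tests be combined with &, + and |."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        return bool(self.func(*args, **kwargs))

    def __and__(self, other):
        return Predicate(lambda *a, **k: self(*a, **k) and other(*a, **k))

    # + behaves exactly like &.
    __add__ = __and__

    def __or__(self, other):
        return Predicate(lambda *a, **k: self(*a, **k) or other(*a, **k))


class Box(object):
    """Hypothetical stand-in for a Unit with a Size property."""

    def __init__(self, Size):
        self.Size = Size


a = Predicate(lambda x: x.Size > 3)
b = Predicate(lambda x: x.Size <= 15)

both = a & b
either = a | b
outcomes = [both(Box(10)), both(Box(20)), either(Box(20)), (a + b)(Box(2))]
```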

When you combine two Expressions with dissimilar argument lists, what happens? The Expression class doesn't really care what the argument names are, just their order, so the names might not come out as you might expect; however, the logic is preserved:

>>> f = logic.filter(Name='Bruce')
>>> f
logic.Expression(lambda x: x.Name == 'Bruce')
>>> g = logic.Expression(lambda a, b, **kw: a.Name + b.Surname == kw['Full Name'])
>>> 
>>> f + g
logic.Expression(lambda x, b, **kw: (x.Name == 'Bruce')
                 and (x.Name + b.Surname == kw['Full Name']))
>>> g + f
logic.Expression(lambda a, b, **kw: (a.Name + b.Surname == kw['Full Name'])
                 and (a.Name == 'Bruce'))

Specifying types for Expression kwargs

Up to now, we've constructed Expression objects with a single argument, the function which we're going to wrap. But Expression objects may take a second argument, called "kwtypes". This argument must be a dictionary of {keyword: type} pairs. Dejavu doesn't do anything internally with this information; it's simply a standard place to keep it for use by your own applications. However, the kwtypes attribute will be persisted when pickling and unpickling Expression objects, a very common operation.

Exporting the logic module

The logic module (and codewalk, on which it is built) isn't limited to Dejavu. Feel free to use it in some other framework or script! The only change you may have to make (if you relocate the module outside of the dejavu package) would be to the single line: from dejavu import codewalk, to point to the new location.

In particular, logic.Expression objects can operate on any Python objects, not just dejavu Unit instances. If you wish to provide additional logic functions (as dejavu does), simply inject them into logic's globals.

You may also find the underlying codewalk module useful for other purposes on its own. The Visitor base class can be very convenient for building bytecode hacks.

To make a long story short, Dejavu depends on logic throughout, but the reverse is not true.

Unit Engines

Once you've created and associated your Unit classes, you can begin to write "business logic" code (mostly inside those classes, we hope), and "presentation logic" code (mostly outside those classes). In most cases, you will construct Expressions within your own code manually to retrieve Units. Sometimes, however, you need to persist query parameters from your users; in other cases, you might store a list of Units which match a query (regardless of who formed the necessary Expression). Finally, you might wish to manipulate lists of Units as sets: differences, intersections, and unions. The engines module addresses all of these needs.

Collections: Lists of Units

The UnitCollection class provides a means of storing a list of Units, or rather, a list of Unit identifier values. You use its Type property to indicate the class of the indexed Units. That value should be the name of the Unit Class, not the class object itself (this is different than most other calls in Dejavu). If you need to retrieve the actual Unit class, call UnitCollection().unit_class().

UnitCollection itself subclasses dejavu.Unit; you can therefore persist Unit Collections via Dejavu Storage Managers (most SM's, anyway; it's recommended that SM's handle Unit Collections, but not required. Check your SM to see if it does).

Each Collection has a thread lock (an RLock, actually) which you should acquire() before you add an ID to the set, and release() afterward. If you use the add(ID) method, this locking is done for you.

When you need to retrieve the actual Units which are indexed by the Collection, call the units(quota=None) method, which will look up the Units and return them in a list. Since the Collection only stores identifier values, it is possible that one of the indexed Units may have been destroyed since the list was built. The units method simply passes over these "phantom" Units. You can inspect the full list of IDs in the Collection (whether they reference existing Units or not) with the ids() method.

Collections also provide a convenience function for grouping Units by attribute: xdict(attr). This function will look up each Unit in the Collection, inspect the attribute that you specify, and return a dictionary of the form {attr_val1: [Unit, Unit, ...]}. Each distinct attribute value will have its own key, with a list of matching Units as the value.
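
The grouping that xdict performs amounts to the following, shown here on plain objects. The group_by_attr function and the Safari sample data are hypothetical; the real method looks the Units up from the Collection's stored IDs first.

```python
from collections import namedtuple

# Made-up sample data.
Safari = namedtuple("Safari", ["Destination", "Year"])
trips = [
    Safari("Kenya", 2003),
    Safari("Botswana", 2003),
    Safari("Kenya", 2004),
]


def group_by_attr(units, attr):
    """Return {attribute value: [units having that value]}."""
    groups = {}
    for unit in units:
        groups.setdefault(getattr(unit, attr), []).append(unit)
    return groups


by_destination = group_by_attr(trips, "Destination")
```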

Engines

You can form Collections by hand, but a more powerful technique is the UnitEngine, a factory for Collections. Engines are very simple: they possess a set of rules which are executed when you want to take a snapshot of Units. The snapshot which is produced is a UnitCollection object. Whenever you call take_snapshot(), the Engine will maintain an association to the resulting Collection. You can access past snapshots with the snapshots() method.

Engines are themselves Units, and can be persisted via Storage Managers. The only properties they possess are: an ID, a Name, an Owner, a FinalClassName, and Created, the creation date of the Engine.

The Owner property should either be a user name, or one of the reserved names: "Public" and "System". By default, the permit() method allows a user read-access to the Engine if they are the Owner, or the Owner is "Public" or "System". Write-access is permitted if the user is the Owner, or the Owner is "Public". Feel free to override permit() in a subclass to provide different behaviors.
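
The default access rules just described boil down to a few comparisons. The function below is a standalone sketch with a made-up signature (owner, user, write); the actual permit() method's arguments may well differ.

```python
def permit(owner, user, write=False):
    """Apply the default read/write rules described above.

    Hypothetical signature; only the rules themselves come from the text.
    """
    if user == owner:
        return True
    if write:
        # Only "Public" engines are writable by everyone.
        return owner == "Public"
    # "Public" and "System" engines are readable by everyone.
    return owner in ("Public", "System")


access = {
    "own_write": permit("alice", "alice", write=True),
    "system_read": permit("System", "bob"),
    "system_write": permit("System", "bob", write=True),
    "other_read": permit("alice", "bob"),
}
```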

The FinalClassName is set for you as you add Rules to the Engine. You can use the value of this property, for example, to tell your users, "Engine #23569 is an 'Armadillo' engine," when it produces Collections of Armadillo Units. The only time you might want to set this value manually is when you first create the Engine, before you have added any Rules.

Rules

Just like Collections and Engines, UnitEngineRule is also a subclass of Unit, and can be persisted via Storage Managers. All three work together to provide a complete, dynamic, application-level query generator.

Okay, so what are Rules? You might say they're a "little language", with the following primitives, or "operations":

Operation | Operand(s) | Description
Operations on a single set
CREATE | The classname of the new Type | Creates a new Set of the specified Type. All Units of that Type are included in the new Set.
FILTER | A logic.Expression | Removes Units from the current Set which do not match the Expression.
FUNCTION | The name of a function in the Arena.engine_functions dict | Calls the function, passing the current Set. The function should modify the Set.
TRANSFORM | The classname of the new Type | Transform the current Set into a Set of associated Units (of another Type). The association must be present in the Arena.associations graph.
RETURN | Optional. If omitted, the last Set handled is returned as the snapshot. If supplied, the ID of the Set to return. |
Operations on two sets
COPY | The Set ID of the new Set | Copies the current Set to a new Set. The current Set is unchanged.
DIFFERENCE | The ID of the Set to mix in | Removes IDs from the current Set which exist in the second Set.
INTERSECTION | The ID of the Set to mix in | Removes IDs from the current Set which do not exist in the second Set.
UNION | The ID of the Set to mix in | Adds any IDs to the current Set which exist in the second Set.

Each Rule has an Operation property (a string, one of the above), a SetID, and an Operand. Here's an example ruleset:

Sequence | Operation  | SetID | Operand
1        | CREATE     | 1     | "Invoice"
2        | FILTER     | 1     | lambda x: x.Date > dejavu.today()
3        | CREATE     | 2     | "Inventory"
4        | FILTER     | 2     | logic.Expression(lambda z: z.ID < 10)
5        | TRANSFORM  | 2     | "Invoice"
6        | DIFFERENCE | 1     | 2
7        | RETURN     | 1     |

As you can see, every Rule operates on a Set of Units. The first rule is always to CREATE a set, declaring it to contain a certain Type of Units. In most cases, you will then FILTER that set. If you simply created a set and then returned it, it would contain all Units of the declared Type. When you filter a set, however, you remove Units from the whole which do not match the filter's Expression.

In the example above, we CREATE a second Set so that we can eventually obtain the DIFFERENCE between Set 1 and Set 2. The second Set contains Units of a different Type than the first. Once we filter Set 2, we then TRANSFORM it; for each Inventory Unit, we look up associated Invoice Units. Then, we find the difference between the two Invoice sets and RETURN it.
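
The same seven steps can be traced with ordinary Python sets of IDs. Everything here is made-up sample data: invoices maps an Invoice ID to whether its Date is after today, and inventory_to_invoices plays the role of the Inventory-to-Invoice association.

```python
# Made-up data: invoice id -> "is its Date after today?"
invoices = {101: True, 102: True, 103: False}
# Made-up association: inventory id -> associated invoice ids
inventory_to_invoices = {5: [101], 12: [102], 7: [103]}

set1 = set(invoices)                                  # 1 CREATE "Invoice"
set1 = {i for i in set1 if invoices[i]}               # 2 FILTER on Date
set2 = set(inventory_to_invoices)                     # 3 CREATE "Inventory"
set2 = {i for i in set2 if i < 10}                    # 4 FILTER on ID < 10
set2 = {inv for i in set2                             # 5 TRANSFORM to Invoice
        for inv in inventory_to_invoices[i]}
set1 -= set2                                          # 6 DIFFERENCE
snapshot = set1                                       # 7 RETURN set 1
```

With this data, set 1 starts as the two future invoices (101 and 102), set 2 becomes the invoices tied to inventory items 5 and 7 (101 and 103), and the difference leaves invoice 102.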

Rules are executed in order according to their Sequence attribute (lowest first). When you use the Engine.add_rule method, the next Sequence value is retrieved for you. Notice that each Rule belongs to one and only one Engine; they are not shared between Engines. Each Rule has its own EngineID attribute.

Engine Functions

The FUNCTION rule deserves special mention. The Operand of a FUNCTION rule is a string, a key in the Arena.engine_functions dictionary. When the rule is executed, that key is used to look up the function, which is then called, passing (sandbox, set). The function should mutate the set directly. Use FUNCTION rules to mutate sets in ways which are more complex than those provided by FILTER and TRANSFORM. For example, you might provide a function which removes all but the first Unit in the Set (according to some ordering algorithm).
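
A function of the kind described might look like the sketch below. It assumes the set passed in behaves like a Python set of IDs and that "first" means the smallest ID; both are assumptions, as is the plain dict standing in for Arena.engine_functions.

```python
def keep_first(sandbox, unit_ids):
    """Mutate unit_ids in place so only the smallest ID remains.

    The sandbox argument is unused in this sketch.
    """
    if unit_ids:
        first = min(unit_ids)
        unit_ids.intersection_update([first])


# Stand-in for the Arena.engine_functions dictionary.
engine_functions = {"keep_first": keep_first}

ids = {42, 7, 19}
engine_functions["keep_first"](None, ids)
```

Note that the function returns nothing; the caller sees the result because the set itself was changed.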

Analysis Tools

Dejavu includes various tools to help you manipulate groups of Units.

Sorting Units

When you recall Units, you receive a list. However, the recall method doesn't do any sorting; you must sort your list in your Python code. Dejavu provides a sort(attrs, descending=False) function to assist you in sorting Units. It returns a function, which you can then use in Python's sort function (which operates in place). Continuing our example:

people.sort(dejavu.sort('Size', 'Name'))
The most important issue (and the reason we don't just use Python 2.4's attrgetter) is that any Unit property must allow values of None, which tends to raise errors when compared to values of other types. The function which sort creates for you treats None as "less than" any other value.
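
The "None sorts first" rule can be reproduced with a key function on a current Python interpreter. This is a separate sketch, not the function dejavu.sort returns; none_low_key and the Person sample data are hypothetical.

```python
from collections import namedtuple

# Made-up sample data.
Person = namedtuple("Person", ["Name", "Size"])
people = [
    Person("Bob", 3),
    Person("Al", None),
    Person("Cy", 3),
    Person("Di", 1),
]


def none_low_key(*attrs):
    """Build a sort key that orders by attrs, treating None as lowest."""
    def key(obj):
        # (False, None) sorts before (True, value), so None comes first and
        # is never compared directly against a value of another type.
        return tuple(
            (getattr(obj, a) is not None, getattr(obj, a)) for a in attrs
        )
    return key


people.sort(key=none_low_key("Size", "Name"))
names = [p.Name for p in people]
```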

Cross-tabulation

Cross-tabs (also called aggregate tables or pivot tables) display aggregate information about objects by category. For example, rather than show a list of Safari records, one row per trip, you might wish to show a table where each row represents a Destination, and each column shows the count of Safaris to that Destination for each distinct Year. In this example, we say that the Safaris are "grouped by" their Destination values, and that we "pivot" on the Year values.

Dejavu helps you form such a table via the CrossTab class. You need to specify the group(s) you wish to use, and the pivot attribute. Finally, you must specify the aggregate function. Here's a code example:

>>> data = ["a", "b", "cc", "bddd", "a4", "b6"]
>>> group = lambda x: x.isalpha()
>>> pivot = lambda x: x[0]
>>> ctab = analysis.CrossTab(data, [group], pivot, dejavu.COUNT)
>>> data, columns = ctab.results()
>>> data
{(True,): {"a": 1, "b": 2, "c": 1},
 (False,): {"a": 1, "b": 1}}
>>> columns
["a", "b", "c"]
You may notice that we're not using Units in our example; the CrossTab class is designed to work with any objects. Here's one way to lay out that data:

Is Alpha | a | b | c
Y        | 1 | 2 | 1
N        | 1 | 1 | 0

The results method returns two values. First, the table itself in the form of a dictionary; each key is a tuple of group values, and the corresponding value is a sub-dictionary. Each sub-dict has keys which are the pivot attribute, and values which equal the aggregates. I know, that was confusing; look at the example. The second value to be returned is a list of the pivot column values; you'll notice they're sorted.

The groups and pivot arguments may be either strings or functions. If strings, they must be the names of attributes of the source objects. The final aggfunc argument defaults to COUNT, but may also be SUM. More aggfuncs may arrive in the future.
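
For readers who want to see the bookkeeping behind such a table, here is a minimal count-only cross-tab in plain Python. It is not the CrossTab class: crosstab_count is a hypothetical function that accepts only callables for the groups and pivot.

```python
def crosstab_count(source, groups, pivot):
    """Count objects per (group values) row and pivot-value column.

    Returns (table, columns): table maps a tuple of group values to a
    {pivot value: count} dict, and columns is the sorted list of pivot values.
    """
    table = {}
    columns = set()
    for obj in source:
        row_key = tuple(g(obj) for g in groups)
        col = pivot(obj)
        columns.add(col)
        row = table.setdefault(row_key, {})
        row[col] = row.get(col, 0) + 1
    return table, sorted(columns)


data = ["a", "b", "cc", "bddd", "a4", "b6"]
table, columns = crosstab_count(
    data,
    [lambda x: x.isalpha()],   # group: is the string purely alphabetic?
    lambda x: x[0],            # pivot: first character
)
```

Cells with no matching objects are simply absent from the sub-dictionaries, so a renderer would fill those in with 0.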


[1] Python Cookbook, Binding Constants at compile time