Orange provides two algorithms for the induction of association rules. The first is the basic Agrawal's algorithm, with dynamic induction of supported itemsets and rules, designed specifically for datasets with a large number of different items. It is, however, not really suitable for attribute-based machine learning problems, which are the primary focus of Orange. We have therefore adapted the original algorithm to be more efficient for the latter type of data and to induce rules in which, in contrast to Agrawal's rules, both sides contain not only attributes (like "bread, butter -> jam") but also their values ("bread = wheat, butter = yes -> jam = plum"). As a further variation, the algorithm can be limited to searching only for classification rules, in which the sole attribute appearing on the right-hand side is the class attribute.
It is also possible to extract item sets instead of association rules. These are often more interesting than the rules themselves.
Besides the association rule inducers, Orange also provides a rather simplified method for classification by association rules.
The class that induces rules by Agrawal's algorithm accepts examples in two forms. The first is the standard form, in which each example is described by the values of a fixed list of attributes defined in the domain. The algorithm, however, disregards the actual attribute values and only checks whether a value is defined or not. The rule shown above, "bread, butter -> jam", actually means that if "bread" and "butter" are defined, then "jam" is defined as well. It is expected that most values will be undefined; if this is not the case, you should use the other association rules inducer, described in the next chapter.
Since the usual representation of examples described above is rather unsuitable for sparse examples, AssociationRulesSparseInducer can also use examples represented a bit differently. Sparse examples have no fixed attributes: the examples' domain is empty, with neither ordinary nor class attributes. All values assigned to an example are given as meta-attributes, and all meta-attributes must be registered with the domain descriptor. If you have data of this kind, the most suitable format for it is the .basket format.
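For illustration, a minimal made-up .basket file might look like this, with one example per line and the items present in that example separated by commas:

```
bread, butter, jam
bread, jam
butter
```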
In both cases, the examples are first translated into AssociationRulesSparseInducer's internal format for sparse datasets. The algorithm first dynamically builds all itemsets (sets of attributes) that have at least the prescribed support; each of these is then used to derive rules with the requested confidence.
If the examples were given in the sparse form, so are the left and right sides of the induced rules; if they were given in the standard form, so are the examples appearing in the rules.
Attributes
The last attribute deserves some explanation. The algorithm's running time (and memory consumption) depends on the minimal support: the lower the requested support, the more eligible itemsets will be found. There is no general rule for setting the support in advance (a value around 0.3 is often reasonable, but this depends on the number of different items, the diversity of examples...), so it is very easy to set the limit too low. In that case, the algorithm can induce hundreds of thousands of itemsets until it runs out of memory. To prevent this, it stops inducing itemsets and reports an error when the prescribed maximum maxItemSets is exceeded. If this happens, you should increase the required support. Alternatively, you can (within reason) increase maxItemSets to as high a value as your computer can handle.
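To see why the number of itemsets explodes when the support is set too low, here is a minimal Apriori-style sketch in plain Python with a maxItemSets-like safeguard. This is not Orange's implementation; the function and parameter names are made up for illustration.

```python
from itertools import combinations

def apriori_itemsets(transactions, min_support, max_itemsets=10000):
    """Collect all itemsets whose support (the fraction of transactions
    containing them) reaches min_support; raise an error when more than
    max_itemsets are found, mimicking the maxItemSets safeguard."""
    n = float(len(transactions))
    sets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in sets if itemset <= t) / n

    # start with the supported 1-itemsets
    current = [frozenset([item]) for item in set.union(*map(set, sets))
               if support(frozenset([item])) >= min_support]
    found = {}
    while current:
        for itemset in current:
            found[itemset] = support(itemset)
            if len(found) > max_itemsets:
                raise RuntimeError("too many itemsets; increase min_support")
        # merge pairs of k-itemsets into candidate (k+1)-itemsets
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == len(a) + 1}
        current = [c for c in candidates if support(c) >= min_support]
    return found
```

Lowering min_support makes each level keep more itemsets, which in turn multiplies the number of candidates at the next level; the safeguard cuts this off before memory runs out.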
We shall test the rule inducer on a dataset consisting of a brief description of the Spanish Inquisition, given by Palin et al.:
NOBODY expects the Spanish Inquisition! Our chief weapon is surprise...surprise and fear...fear and surprise.... Our two weapons are fear and surprise...and ruthless efficiency.... Our *three* weapons are fear, surprise, and ruthless efficiency...and an almost fanatical devotion to the Pope.... Our *four*...no... *Amongst* our weapons.... Amongst our weaponry...are such elements as fear, surprise.... I'll come in again.
NOBODY expects the Spanish Inquisition! Amongst our weaponry are such diverse elements as: fear, surprise, ruthless efficiency, an almost fanatical devotion to the Pope, and nice red uniforms - Oh damn!
The text needs to be cleaned of punctuation marks and of capital letters at the beginnings of sentences; each sentence needs to be put on its own line, and commas need to be inserted between the words. The first three lines thus become:
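One possible way to perform this preprocessing is sketched below in plain Python. The helper name is made up and the cleaning rules are simplified (everything is lowercased, not just sentence beginnings); it is not part of Orange.

```python
import re

def to_basket_lines(text):
    """Split the text into sentences, drop punctuation, lowercase the
    words and join them with commas - one basket line per sentence."""
    lines = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if words:
            lines.append(", ".join(words))
    return lines
```

The resulting lines can then be saved into a .basket file and loaded as sparse examples.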
Inducing the rules is trivial.
assoc-agrawal.py (uses inquisition.basket)
The induced rules are surprisingly fear-full.
If the examples are weighted, the weight can be passed as an additional argument to the call operator.
To get only a list of supported itemsets, call the method getItemsets. The result is a list whose elements are two-element tuples. The first element is a tuple with the indices of the attributes in the itemset; since sparse examples are usually represented with meta-attributes, these indices will be negative. The second element is a list of indices of the examples supporting the itemset, that is, containing all the items in the set. If storeExamples is False, the second element is None.
assoc-agrawal.py (uses inquisition.basket)
Now itemsets
is a list of itemsets along with the examples supporting them since we set storeExamples
to True
.
The sixth itemset contains the attributes with indices -11 and -7, that is, the words "surprise" and "our". The examples supporting it are those with indices 1, 2, 3, 6 and 9.
This way of representing the itemsets is not very programmer-friendly, but it is much more memory efficient and faster to work with than using objects like Variable and Example.
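Working with this representation is straightforward in plain Python. In the sketch below, the itemset list and the index-to-word mapping are made-up stand-ins for what the inducer and the domain would actually provide:

```python
# hypothetical itemsets in the format described above:
# (tuple of attribute indices, list of supporting example indices)
itemsets = [((-11, -7), [1, 2, 3, 6, 9]),
            ((-11,), [1, 2, 3, 4, 6, 9])]

# hypothetical mapping from (negative) meta-attribute indices to words
words = {-11: "surprise", -7: "our"}

def describe_itemset(itemset):
    """Render an itemset as readable text with its support count."""
    indices, supporting = itemset
    names = [words[i] for i in indices]
    return "%s (%d examples)" % (", ".join(names), len(supporting))
```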
The other algorithm for association rules provided by Orange, AssociationRulesInducer, is optimized for non-sparse examples in the usual Orange form. Each example is described by the values of a fixed set of attributes. Unknown values are ignored, while attribute values are not (as opposed to the above-described algorithm for sparse rules). In addition, the algorithm can be directed to search only for classification rules, in which the only attribute on the right-hand side is the class attribute.
Attributes
The meaning of all attributes (except the new one, classificationRules) is the same as for AssociationRulesSparseInducer; see the description of maxItemSets there.
assoc.py (uses lenses.tab)
The found rules are
To limit the algorithm to classification rules, set classificationRules to 1.
part of assoc.py (uses lenses.tab)
The found rules are, naturally, a subset of the above rules.
part of assoc.py (uses lenses.tab)
AssociationRulesInducer can also work with weighted examples; the ID of the weight attribute should be passed as an additional argument in the call.
Itemsets are induced in a similar fashion as for sparse data, except that the first element of the tuple, the itemset, is represented not by indices of attributes, as before, but by tuples of (attribute-index, value-index) pairs.
part of assoc.py (uses lenses.tab)
This prints out
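Such (attribute-index, value-index) pairs can be decoded into readable form given the domain. A plain-Python illustration follows; the domain and itemset here are made up rather than taken from lenses.tab:

```python
# hypothetical domain: attribute names with their lists of values
domain = [("age", ["young", "pre-presbyopic", "presbyopic"]),
          ("astigmatic", ["no", "yes"])]

# an itemset as a tuple of (attribute-index, value-index) pairs
itemset = ((0, 2), (1, 1))

def describe_itemset(itemset, domain):
    """Translate index pairs into attribute=value strings."""
    parts = []
    for attr_index, value_index in itemset:
        name, values = domain[attr_index]
        parts.append("%s=%s" % (name, values[value_index]))
    return " ".join(parts)
```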
Both classes for induction of association rules return the induced rules in AssociationRules, which is basically a list of instances of AssociationRule.
Attributes
left, right (Example)
The left- and right-hand sides of the rule. In rules created by AssociationRulesSparseInducer from examples that contain all values as meta-values, left and right are examples in the same form. Otherwise, values in left that do not appear in the rule are "don't care", and values in right are "don't know". Both can, however, be tested by isSpecial (see the documentation on Value).

support = nAppliesBoth/nExamples
confidence = nAppliesBoth/nAppliesLeft
coverage = nAppliesLeft/nExamples
strength = nAppliesRight/nAppliesLeft
lift = nExamples * nAppliesBoth / (nAppliesLeft * nAppliesRight)
leverage = (nAppliesBoth * nExamples - nAppliesLeft * nAppliesRight)
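Given the four counts, all of these measures can be computed directly. The sketch below mirrors the formulas above in plain Python; the function name is made up, and the measure names follow the standard association-rule terminology:

```python
def rule_measures(nAppliesLeft, nAppliesRight, nAppliesBoth, nExamples):
    """Compute the rule-quality measures from the four counts,
    following the formulas listed above."""
    return {
        "support": nAppliesBoth / float(nExamples),
        "confidence": nAppliesBoth / float(nAppliesLeft),
        "coverage": nAppliesLeft / float(nExamples),
        "strength": nAppliesRight / float(nAppliesLeft),
        "lift": nExamples * nAppliesBoth / float(nAppliesLeft * nAppliesRight),
        "leverage": nAppliesBoth * nExamples - nAppliesLeft * nAppliesRight,
    }
```

For instance, a rule whose left side matches 10 of 20 examples, whose right side matches 8, and which matches both sides in 6 examples, has support 0.3 and confidence 0.6.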
If storeExamples was True during induction, examples contains a copy of the example table used to induce the rules. The attributes matchLeft and matchBoth are lists of integers, representing the indices of examples that match the left-hand side of the rule and both sides, respectively.
Methods
AssociationRules's constructor cannot compute anything from these two arguments.
Association rule inducers do not store evidence about which example supports which rule (although this information is available during induction, it is discarded afterwards). Let us write a function that finds the examples that confirm the rule (i.e., fit both sides of it) and those that contradict it (fit the left-hand side but not the right).
assoc-rule.py (uses lenses.tab)
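The logic of such a function can be sketched in plain Python, with the rule's sides and the examples represented as dicts. This is a simplification for illustration; the actual script works with Orange's Example objects and isSpecial tests.

```python
def split_by_rule(rule_left, rule_right, examples):
    """Return (confirming, contradicting): the examples matching both
    sides of the rule vs. those matching only its left-hand side.
    Attributes absent from a rule side are treated as don't-care."""
    confirming, contradicting = [], []
    for example in examples:
        if all(example.get(a) == v for a, v in rule_left.items()):
            if all(example.get(a) == v for a, v in rule_right.items()):
                confirming.append(example)
            else:
                contradicting.append(example)
    return confirming, contradicting
```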
The latter printouts get simpler and (way!) faster if we instruct the inducer to store the examples. We can then do, for instance, this.
(uses lenses.tab)
The "contradicting" examples are then those whose indices are found in matchLeft but not in matchBoth. A more memory-friendly and faster way to compute this is as follows.