A contingency matrix contains conditional distributions. Contingency matrices work for both discrete and continuous attributes; although the examples on this page are mostly limited to discrete attributes, the same could be done with continuous values.
part of contingency1.py (uses monk1.tab)
As this simple example shows, a contingency is similar to a dictionary (or a list; it is a bit ambiguous), where attribute values serve as keys and class distributions are the dictionary values. The attribute e is here called the outer attribute, and the class is the inner. That's not the only possible configuration of a contingency matrix; the class can also be the outer attribute, or there can be no class at all, in which case the matrix shows the distributions of one attribute's values given the value of another.
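The dictionary analogy can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the library's implementation: the toy examples and the name `contingency` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy examples: (value of the outer attribute e, class value).
examples = [("1", "1"), ("1", "1"), ("2", "0"), ("2", "1"), ("3", "0"), ("3", "0")]

# Outer attribute value -> distribution (here: raw counts) of the inner attribute.
contingency = defaultdict(Counter)
for e_value, class_value in examples:
    contingency[e_value][class_value] += 1

# Each key is a value of the outer attribute, each value a class distribution.
for e_value in sorted(contingency):
    print(e_value, dict(contingency[e_value]))
```

Swapping the two elements of each pair would put the class outside and the attribute inside, which is the other configuration mentioned above.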
There is a hierarchy of classes with contingencies:
Contingency
    ContingencyClass
        ContingencyClassAttr
        ContingencyAttrClass
    ContingencyAttrAttr
The base object is Contingency. Derived from it is ContingencyClass, in which one of the attributes is the class attribute; ContingencyClass is a base for two classes, ContingencyAttrClass and ContingencyClassAttr, the former having the class as the inner and the latter as the outer attribute. Class ContingencyAttrAttr is derived directly from Contingency and represents contingency matrices in which none of the attributes is the class attribute.
The most commonly used of the above classes is ContingencyAttrClass, which represents conditional probabilities of classes given the attribute value.
Here's what all contingency matrices have in common.
Attributes
e. e is <108.000, 108.000, 108.000, 108.000>
innerDistribution and the sum of all distributions in the matrix.
varType for the outer attribute (discrete, continuous...); varType equals outerVariable.varType and outerDistribution.varType.
Methods
keys, values and items.
Although keys returned by the above functions are strings, you can index the contingency with anything that converts into values of the outer attribute: strings, numbers or instances of Value.
Naturally, the length of Contingency equals the number of values of the outer attribute. The only weird thing is that iterating through a contingency (by using a for loop, for instance) doesn't return keys, as with dictionaries, but dictionary values.
If cont behaved like a normal dictionary, the above script would print out the keys, that is, strings from '1' to '4'.
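The difference from a normal dictionary can be mimicked in plain Python. The class below is a hypothetical stand-in written for this illustration (it is not the library's class, and the keys and counts are invented); it only shows what "iteration yields values instead of keys" means.

```python
class ToyContingency:
    """Hypothetical stand-in: dictionary-like, but iterates over its values."""

    def __init__(self, rows):
        self._rows = dict(rows)

    def keys(self):
        return list(self._rows.keys())

    def values(self):
        return list(self._rows.values())

    def items(self):
        return list(self._rows.items())

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, key):
        # Accept anything that converts to the string key, e.g. 1 or "1".
        return self._rows[str(key)]

    def __iter__(self):
        # Unlike dict, yield the stored distributions, not the keys.
        return iter(self._rows.values())


cont = ToyContingency({"1": [1, 2], "2": [3, 0]})
for distribution in cont:
    print(distribution)
```

A plain `dict` with the same content would print the keys in the same loop.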
innerDistribution or outerDistribution.
The base class is, for a change, not abstract. Its constructor expects two attribute descriptors, the first one for the outer and the second for the inner attribute. It initializes empty distributions and it's up to you to fill them. This is, for instance, how to manually reproduce the results of the script at the top of the page.
part of contingency2.py (uses monk1.tab)
The "reproduction" is not perfect. We didn't care about unknown values and haven't computed innerDistribution and outerDistribution. The better way to do it is by using the method add, so that the loop becomes:
It's not only simpler, but also correctly handles unknown values and updates innerDistribution and outerDistribution.
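To make it concrete what such an add has to take care of, here is a plain-Python sketch. It is not the library's code: the class name, the attribute names and, in particular, the policy for unknown values (an unknown is counted only in the marginal distribution of the attribute that is known) are assumptions chosen for illustration.

```python
from collections import Counter

UNKNOWN = None  # assumed marker for an unknown value


class ToyContingency:
    """Hypothetical sketch of a contingency that keeps marginals up to date."""

    def __init__(self):
        self.rows = {}                       # outer value -> Counter of inner values
        self.outer_distribution = Counter()  # marginal of the outer attribute
        self.inner_distribution = Counter()  # marginal of the inner attribute

    def add(self, outer_value, inner_value, weight=1):
        # Assumed policy: every known value updates its own marginal ...
        if outer_value is not UNKNOWN:
            self.outer_distribution[outer_value] += weight
        if inner_value is not UNKNOWN:
            self.inner_distribution[inner_value] += weight
        # ... but only fully known pairs enter the matrix itself.
        if outer_value is not UNKNOWN and inner_value is not UNKNOWN:
            self.rows.setdefault(outer_value, Counter())[inner_value] += weight


cont = ToyContingency()
for e_value, class_value in [("1", "1"), ("2", "0"), (UNKNOWN, "1"), ("2", UNKNOWN)]:
    cont.add(e_value, class_value)

print(cont.rows)
print(cont.outer_distribution, cont.inner_distribution)
```

The point of routing everything through one method is that the caller's loop stays a single call per example, while the bookkeeping lives in one place.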
ContingencyClass is an abstract base class for contingency matrices that contain the class attribute, either as the inner or the outer attribute. It offers a function that makes filling the contingency clearer.
After reading through the rest of this page you might ask yourself why we need to separate the classes ContingencyAttrClass, ContingencyClassAttr and ContingencyAttrAttr, given that the underlying matrix is the same. This is to avoid confusion about what is in the inner and the outer variable. Contingency matrices are most often used to compute conditional probabilities of classes or attributes. By separating the classes and giving them specialized methods for computing the probabilities that are most suitable to compute from a particular class, the user (i.e., you or the method that gets passed the matrix) is relieved from checking what kind of matrix it got, that is, where the class is and where the attribute is.
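Attributes
classVar: the class attribute descriptor; always equal to either innerVariable or outerVariable.
variable: the attribute descriptor; always equal to either innerVariable or outerVariable.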
Methods
add_attrclass: the difference from Contingency.add is that the attribute value is always the first argument and the class value the second, regardless of whether the attribute is actually the outer variable or the inner.
ContingencyAttrClass is derived from ContingencyClass. Here, the attribute is the outer variable (hence variable=outerVariable) and the class is the inner (classVar=innerVariable), so this form of contingency matrix is suitable for computing the conditional probabilities of classes given a value of an attribute.
Calling add_attrclass(v, c) is here equivalent to calling add(v, c). In addition to this, the class supports computation of contingency from examples, as you have already seen in the example at the top of this page.
Methods
Contingency's constructor.
p_class(attribute_value): returns the distribution of classes given the attribute_value. If the matrix is normalized, this is equivalent to returning self[attribute_value]. Result is returned as a normalized Distribution.
p_class(attribute_value, class_value): returns the conditional probability of class_value given the attribute_value. If the matrix is normalized, this is equivalent to returning self[attribute_value][class_value].
Don't confuse the order of arguments: the attribute value is the first, the class value is the second, just as in add_attrclass. Although in this instance counterintuitive (since the returned value represents the conditional probability P(class_value|attribute_value)), this order is uniform for all (applicable) methods of classes derived from ContingencyClass.
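The relation between the stored counts and the returned probabilities can be sketched in plain Python. The function below is a hypothetical free-standing helper that only borrows the name and argument order described above; the nested dictionary stands in for the matrix, and the counts are illustrative.

```python
def p_class(matrix, attribute_value, class_value=None):
    """Sketch: P(class | attribute_value) from a matrix of raw counts.

    Argument order follows the convention above: attribute value first,
    class value second. Assumes the selected row is not empty.
    """
    row = matrix[attribute_value]
    total = sum(row.values())
    normalized = {c: count / total for c, count in row.items()}
    if class_value is None:
        return normalized           # whole conditional distribution
    return normalized[class_value]  # single conditional probability


# Illustrative counts: attribute value -> {class value: count}.
counts = {"1": {"0": 0, "1": 108}, "2": {"0": 72, "1": 36}}

print(p_class(counts, "1"))
print(p_class(counts, "2", "1"))
```

If the rows already summed to 1, the division would change nothing, which is the sense in which the methods are "equivalent to" plain indexing on a normalized matrix.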
You have seen this form of matrix used already at the top of the page. We shall only explore the new stuff we've learned about it.
part of contingency3.py (uses monk1.tab)
The inner and the outer variable and their relations to the class and the attribute are as expected.
Distributions are normalized and probabilities are elements from the normalized distributions. Knowing that the target concept is y := (e=1) or (a=b), distributions are as expected: when e equals 1, class 1 has a 100% probability, while for the rest, the probability is one third, which agrees with the probability that two three-valued independent attributes have the same value.
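These numbers can be checked without the data file by enumerating the attribute combinations. The sketch assumes that a and b take the values 1 to 3, that e takes the values 1 to 4, and that every combination occurs equally often; the multiplier 12 for the remaining attributes is an assumption chosen so that each value of e gets the 108 examples quoted earlier on this page.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

REST = 12  # assumed number of combinations of the attributes not involved in y

# Outer attribute e -> counts of the class y.
cont = {e: Counter() for e in range(1, 5)}
for a, b, e in product(range(1, 4), range(1, 4), range(1, 5)):
    y = int(e == 1 or a == b)  # target concept y := (e=1) or (a=b)
    cont[e][y] += REST

for e, dist in cont.items():
    total = sum(dist.values())
    print(e, total, Fraction(dist[1], total))
```

For e other than 1 the class is 1 exactly when a equals b, which happens in 3 of the 9 combinations of a and b.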
Manual computation using add_attrclass is similar to (to be precise: exactly the same as) computation using add.
ContingencyClassAttr is similar to ContingencyAttrClass except that here the class attribute is the outer and the attribute the inner variable. As a consequence, this form of contingency matrix is suitable for computing conditional probabilities of attribute values given class values. The constructor and add_attrclass nevertheless get the arguments in the same order as for ContingencyAttrClass, that is, attribute first, class second.
Methods
Contingency's constructor, except that the argument order is reversed (in Contingency, the outer attribute is given first, while here the first argument, attribute, is the inner attribute).
p_attr(class_value): returns the distribution of attribute values given the class_value. If the matrix is normalized, this is equivalent to returning self[class_value]. Result is returned as a normalized Distribution.
p_attr(attribute_value, class_value): returns the conditional probability of attribute_value given the class_value. If the matrix is normalized, this is equivalent to returning self[class_value][attribute_value].
As you can see, the class is rather similar to ContingencyAttrClass, except that it has p_attr instead of p_class. If you, for instance, take the above script and replace the class name, the first bunch of prints print out
part of the output from contingency4.py (uses monk1.tab)
This is exactly the reverse of the printout from ContingencyAttrClass. To print out the distributions, the only difference now is that you need to iterate through values of the class attribute and call p_attr. For instance,
part of contingency4.py (uses monk1.tab)
will print
If the class value is '0', then attribute e cannot be '1' (the first value), but can be anything else, with equal probabilities of 0.333. If the class value is '1', e is '1' in exactly half of the examples (work out why this is so); in the remaining cases, e is again distributed uniformly.
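The same enumeration trick works with the class as the outer attribute. As before, this is a sketch under assumptions: a and b take the values 1 to 3, e takes the values 1 to 4, all combinations are equally frequent, and the multiplier 12 stands for the remaining attributes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

REST = 12  # assumed number of combinations of the attributes not involved in y

# Outer attribute: class y -> counts of the inner attribute e.
cont = {0: Counter(), 1: Counter()}
for a, b, e in product(range(1, 4), range(1, 4), range(1, 5)):
    y = int(e == 1 or a == b)  # target concept y := (e=1) or (a=b)
    cont[y][e] += REST

# Normalize each row to get P(e | y).
p_e_given_class = {
    y: {e: Fraction(dist[e], sum(dist.values())) for e in range(1, 5)}
    for y, dist in cont.items()
}

for y, dist in p_e_given_class.items():
    print(y, [str(p) for p in dist.values()])
```

Class 1 collects all 108 examples with e equal to 1 plus 36 examples (those with a equal to b) for each of the other three values of e, that is, another 108; hence the one half.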
ContingencyAttrAttr stores contingency matrices in which none of the attributes is the class attribute. This is rather similar to Contingency, except that it has an additional constructor and method for getting the conditional probabilities.
Methods
Contingency.
examples.
inner_value given the outer_value.
In the following example, we shall use ContingencyAttrAttr on the dataset "bridges" to determine which material is used for bridges of different lengths.
part of contingency5.py (uses bridges.tab)
The output tells us that short bridges are mostly wooden or iron, and the longer ones (and most of the middle-sized ones) are made of steel.
Like all other contingency matrices, this one can also be computed "manually".
part of contingency5.py (uses bridges.tab)
What happens if one or both attributes are continuous? First, contingencies can be built for such attributes as well. Just imagine a contingency as a dictionary with attribute values as keys and objects of type Distribution as values.
If the outer attribute is continuous, you can use either its values or ordinary floating point numbers for indexing. The index must be one of the values that do exist in the contingency matrix.
The following script will query for a distribution in between the first two keys, which triggers an exception.
part of the output from contingency6.py (uses iris.tab)
If you still find such contingencies useful, you need to take care about what you pass for indices. Always use the values from keys() directly, instead of manually entering the keys' values you see printed. If, for instance, you print out the first key, see it's 4.500 and then request cont[4.500], this can give an index error due to rounding.
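The rounding trap is easy to reproduce with an ordinary Python dictionary that has a floating point key. This is an analogy, not the library's behaviour: the key 0.1 + 0.2 and the stored string are made up for the demonstration.

```python
# The stored key is the result of a computation and is not exactly 0.3.
key = 0.1 + 0.2
cont = {key: "distribution stored under this key"}

# Printed with three decimals, the key looks like a harmless 0.300 ...
print("%5.3f" % key)

# ... but typing the printed value back in does not find it.
print(0.3 in cont)

# Taking the key from keys() always works.
first_key = next(iter(cont.keys()))
print(cont[first_key])
```

Typing 0.3 by hand here would raise a KeyError on indexing, for the same reason a hand-typed key can miss the stored value in a contingency.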
Contingencies with continuous inner attributes are more useful. First, indexing by discrete values is easier than by continuous ones. Second, class Distribution covers both discrete and continuous distributions, so even the methods p_class and p_attr will work, except that what they return is not the probability but the density (interpolated, if necessary). See the page about Distribution for more information.
For instance, if you build a ContingencyClassAttr on the iris dataset, you can enquire about the probability of the sepal length 5.5.
part of contingency7.py (uses iris.tab)
The script's output is
These numbers represent the number of examples with a sepal length of 5.5. If the matrix were normalized, the numbers would be divided by the total number of examples in classes setosa, versicolor and virginica, respectively.
Computing contingency matrices requires iteration through examples. We often need to compute ContingencyAttrClass or ContingencyClassAttr for all attributes in the dataset and it is obvious that this will be faster if we do it for all attributes at once. That's taken care of by class DomainContingency.
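The idea of filling the matrices for all attributes in one sweep over the data can be sketched in plain Python. The toy examples, the attribute names and the variable `contingencies` are made up for illustration; this is not how the library stores its data.

```python
from collections import Counter, defaultdict

attributes = ["a", "b", "e"]

# Hypothetical toy examples: (attribute values, class value).
examples = [
    ({"a": "1", "b": "1", "e": "2"}, "1"),
    ({"a": "1", "b": "2", "e": "1"}, "1"),
    ({"a": "2", "b": "3", "e": "3"}, "0"),
]

# One attribute-class contingency per attribute, all filled in a single pass.
contingencies = {name: defaultdict(Counter) for name in attributes}
for values, class_value in examples:
    for name in attributes:
        contingencies[name][values[name]][class_value] += 1

for name in attributes:
    print(name, {v: dict(d) for v, d in contingencies[name].items()})
```

The examples are read once, whereas building the three matrices separately would read them three times.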
DomainContingency is basically a list of contingencies, either of type ContingencyAttrClass or ContingencyClassAttr, with two additional fields and a constructor that computes the contingencies.
Attributes
ContingencyAttrClass or ContingencyClassAttr.
Methods
If classIsOuter is 0 (default), these will be ContingencyAttrClass; if 1, ContingencyClassAttr are used. It then iterates through examples and computes the contingencies.
The difference between DomainContingency and an ordinary Python list (except for the additional methods and fields, of course) is that its elements can be indexed not only by numbers, but also by attribute names and descriptors, as shown in the example below.
The following script will print the contingencies for attributes "a", "b" and "e" for the dataset Monk 1.
part of contingency8.py (uses monk1.tab)
The contingencies in the DomainContingency dc are of type ContingencyAttrClass and tell us conditional distributions of classes, given the value of the attribute. To compute the distribution of attribute values given the class, one needs to get a list of ContingencyClassAttr.
part of contingency8.py (uses monk1.tab)
Note that classIsOuter cannot be given as a positional argument, but needs to be passed by keyword.
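For readers who want to see both properties together (indexing by number or by name, and a flag that is accepted only by keyword), here is a small plain-Python sketch. The class ToyDomainContingency and the argument class_is_outer are hypothetical names invented for this illustration, and the keyword-only syntax used is that of Python 3; none of this is the library's code.

```python
class ToyDomainContingency(list):
    """Hypothetical list of per-attribute contingencies, indexable by name."""

    def __init__(self, attribute_names, *, class_is_outer=False):
        # One (empty) placeholder contingency per attribute.
        super().__init__({} for _ in attribute_names)
        self.names = list(attribute_names)
        self.class_is_outer = class_is_outer

    def __getitem__(self, index):
        # Translate an attribute name into its position; numbers pass through.
        if isinstance(index, str):
            index = self.names.index(index)
        return super().__getitem__(index)


dc = ToyDomainContingency(["a", "b", "e"], class_is_outer=True)
print(dc["e"] is dc[2])

try:
    ToyDomainContingency(["a", "b", "e"], True)  # positional flag is rejected
except TypeError as error:
    print("TypeError:", error)
```

Requiring the keyword keeps the call readable: a bare 1 or True at the end of an argument list says little about which way round the matrices will be.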