Removal of Unused Attribute Values

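It often happens that the definition of a discrete attribute (EnumVariable) declares values that do not actually appear in the data, either originally or as a consequence of some preprocessing. Such anomalies are handled by the class RemoveUnusedValues which, given an attribute and the data, determines whether there are any unused values and reduces the attribute if needed. There are four possible cases, each of which occurs in the example below: the attribute is retained as is, it is reduced to the values that are actually used, it is removed because none of its values appears in the data, or, if it has only one value, it is retained or removed depending on removeOneValued.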

Attributes

removeOneValued
Decides whether attributes with only one value are removed (True) or retained (False, the default).

Let us show the use of the class on a simple dataset with three examples, given by the following tab-delimited file.

    a      b      c         d         y
    0 1    0 1 2  discrete  discrete  discrete
                                      class
    0      0      ?         0         0
    1      2      ?         0         0
    0      0      ?         0         1

The script below constructs a list newattrs which contains, for each attribute of the original dataset, either the original attribute, None, or a reduced attribute.

part of unusedValues.py (uses unusedValues.tab)

    import orange
    data = orange.ExampleTable("unusedValues")

    # One entry per attribute: the original attribute, a reduced one, or None
    newattrs = [orange.RemoveUnusedValues(attr, data) for attr in data.domain.variables]

    print
    for attr in range(len(data.domain)):
        print data.domain[attr],
        if newattrs[attr] == data.domain[attr]:
            print "retained as is"
        elif newattrs[attr]:
            print "reduced, new values are", newattrs[attr].values
        else:
            print "removed"

And here's the script's output.

    EnumVariable 'a' retained as is
    EnumVariable 'b' reduced, new values are <0, 2>
    EnumVariable 'c' removed
    EnumVariable 'd' retained as is
    EnumVariable 'y' retained as is

Attributes a and y are OK and are left alone. In b, value 1 is not used and is removed (not in the original attribute, of course; a new attribute is created). c is useless and is removed altogether. d is retained since removeOneValued was left at False; if we set it to True, this attribute would be removed as well.
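The decisions described above can be sketched in plain Python. This is only an illustration of the logic, not Orange's implementation: the function reduce_values, its parameter remove_one_valued and the dictionaries below are hypothetical names, the value lists for c, d and y are assumed to be inferred from the data, and the case of an attribute that declares several values but uses only one is an assumption of the sketch.

```python
# Plain-Python sketch of the decision logic; it does not use Orange.
# reduce_values and the data layout are hypothetical, for illustration only.

def reduce_values(declared, column, remove_one_valued=False):
    """Return the declared list itself (retain), None (remove),
    or a new, shorter list of values (reduce)."""
    used = set(v for v in column if v != "?")    # "?" marks a missing value
    kept = [v for v in declared if v in used]    # keep the declared order
    if not kept:
        return None                              # no value ever appears
    if len(kept) == 1 and remove_one_valued:
        return None                              # one-valued, removal requested
    if len(kept) == len(declared):
        return declared                          # every declared value is used
    return kept                                  # some values unused: reduce

# Declared values and data columns of the example file (assumed layout).
declared = {"a": ["0", "1"], "b": ["0", "1", "2"], "c": [],
            "d": ["0"], "y": ["0", "1"]}
columns = {"a": ["0", "1", "0"], "b": ["0", "2", "0"], "c": ["?", "?", "?"],
           "d": ["0", "0", "0"], "y": ["0", "0", "1"]}

for name in ["a", "b", "c", "d", "y"]:
    result = reduce_values(declared[name], columns[name])
    if result is declared[name]:
        print(name + " retained as is")
    elif result:
        print(name + " reduced, new values are " + ", ".join(result))
    else:
        print(name + " removed")
```

Calling reduce_values(declared["d"], columns["d"], remove_one_valued=True) returns None in this sketch, which mirrors the removal of d described above.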

The values of the new attribute for b are automatically computed from the original. The script can thus proceed as follows.

part of unusedValues.py (uses unusedValues.tab)

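    # Drop the None entries, i.e. the attributes that were removed
    filteredattrs = filter(bool, newattrs)
    # Build a new domain and convert the original data to it
    newdata = orange.ExampleTable(orange.Domain(filteredattrs), data)

    print "\nOriginal example table"
    for ex in data:
        print ex

    print "\nReduced example table"
    for ex in newdata:
        print ex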

The list newattrs includes some original attributes (a, d and y), a new attribute (b) and a None (for c). The None is removed by the call to filter at the beginning of the script. We use filteredattrs to construct a new domain and then convert the original data to newdata. As the output shows, the two tables are the same except for the removed attribute c.

    Original example table
    ['0', '0', '?', '0', '0']
    ['1', '2', '?', '0', '0']
    ['0', '0', '?', '0', '1']

    Reduced example table
    ['0', '0', '0', '0']
    ['1', '2', '0', '0']
    ['0', '0', '0', '1']
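To make the conversion concrete, here is a minimal plain-Python sketch of what translating the rows to the reduced domain amounts to. It does not use Orange: rows, new_domain and convert are hypothetical names, and mapping a value that the reduced attribute no longer has to a missing value ("?") is an assumption of the sketch.

```python
# Plain-Python sketch, not Orange code: rows are lists of strings and
# new_domain pairs each kept column index with the values allowed there
# (all names here are hypothetical).

rows = [["0", "0", "?", "0", "0"],
        ["1", "2", "?", "0", "0"],
        ["0", "0", "?", "0", "1"]]

# (original column index, values of the kept attribute); column 2 (c) is dropped
new_domain = [(0, ["0", "1"]), (1, ["0", "2"]), (3, ["0"]), (4, ["0", "1"])]

def convert(row, domain):
    """Map a row to the new domain; a value the kept attribute
    does not have becomes missing ("?")."""
    return [row[i] if row[i] in values else "?" for i, values in domain]

new_rows = [convert(row, new_domain) for row in rows]
for row in new_rows:
    print(row)
```

Since value 1 of b never occurs in the data, no value is lost in the conversion and the printed rows match the reduced example table.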