Example-based automatic discretization is in essence similar to learning: given a set of examples, a discretization method proposes a list of suitable intervals to cut the attribute's values into. For this reason, Orange's structures for discretization resemble its structures for learning. Objects derived from orange.Discretization play the role of a "learner" that, upon observing the examples, constructs an orange.Discretizer whose role is to convert continuous values into discrete ones according to the rule found by the Discretization.
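This division of labor can be sketched in plain Python. The two toy classes below are invented for illustration and are not the Orange API: one plays the Discretization ("learner") role and proposes equal-width cut-offs, the other plays the Discretizer role and converts a value into an interval index.

```python
from bisect import bisect_left

class ToyDiscretization:
    """'Learner' role: observes values and proposes cut-off points."""
    def __call__(self, values, n_intervals=2):
        lo, hi = min(values), max(values)
        step = (hi - lo) / n_intervals
        points = [lo + step * i for i in range(1, n_intervals)]
        return ToyDiscretizer(points)

class ToyDiscretizer:
    """'Classifier' role: converts a continuous value into an interval index."""
    def __init__(self, points):
        self.points = points
    def __call__(self, value):
        # values lower than or equal to a cut-off fall to its left
        return bisect_left(self.points, value)

disc = ToyDiscretization()([4.0, 5.0, 6.0, 8.0], n_intervals=2)
print(disc.points)           # [6.0]
print(disc(5.0), disc(7.0))  # 0 1
```

In Orange, the same shape appears as an orange.Discretization producing an orange.Discretizer.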
Orange core now supports several methods of discretization; here's a list of the methods with the corresponding discretizer classes.

- EquiDistDiscretization (EquiDistDiscretizer)
- EquiNDiscretization (IntervalDiscretizer)
- EntropyDiscretization (IntervalDiscretizer)
- BiModalDiscretization (BiModalDiscretizer/IntervalDiscretizer)
- FixedDiscretization (IntervalDiscretizer)

Instances of classes derived from orange.Discretization define a single method: the call operator. The object can also be called through the constructor. Given an attribute, examples and, optionally, the id of the attribute with example weights, this function returns a discretized attribute. Argument attribute can be a descriptor, an index or the name of the attribute.

Here's an example.
part of discretization.py (uses iris.tab)
The discretized attribute sep_w
is constructed with a call to EntropyDiscretization
(instead of constructing it and calling it afterwards, we passed the arguments for calling to the constructor, as is often allowed in Orange). We then constructed a new ExampleTable
with attributes "sepal width" (the original continuous attribute), sep_w
and the class attribute. Script output is:
EntropyDiscretization
named the new attribute's values by the interval range (it also named the attribute as "D_sepal width"). The new attribute's values get computed automatically when they are needed.
As those who have read about Variable know, the answer to "How does this work?" is hidden in the attribute's field getValueFrom. This little dialog reveals the secret.
So, the select
statement in the above example converted all examples from data
to the new domain. Since the new domain includes the attribute sep_w
that is not present in the original, sep_w
's values are computed on the fly. For each example in data
, sep_w.getValueFrom
is called to compute sep_w's value (if you ever need to do this yourself, don't call getValueFrom directly but call computeValue instead). sep_w.getValueFrom looks for the value of "sepal width" in the original example. The original, continuous sepal width is passed to the transformer, which determines the interval from its field points. The transformer returns the discrete value, which is in turn returned by getValueFrom and stored in the new example.
You don't need to understand this mechanism exactly. It's important to know that there are two classes of objects for discretization. Those derived from Discretizer
(such as IntervalDiscretizer
that we've seen above) are used as transformers that translate a continuous value into a discrete one. Discretization algorithms are derived from Discretization
. Their job is to construct a Discretizer
and return a new variable with the discretizer stored in getValueFrom.transformer
.
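The chain just described can be mimicked in a few lines of plain Python; the function names and the toy example below are stand-ins for illustration, not the Orange API:

```python
from bisect import bisect_left

points = [2.9, 3.3]                    # cut-offs found by some discretization

def transformer(value):
    """Stand-in for the discretizer: pick the interval from 'points'
    (a value lower than or equal to a cut-off falls to its left)."""
    return bisect_left(points, value)  # 0: <=2.9, 1: (2.9, 3.3], 2: >3.3

def get_value_from(example):
    """Stand-in for sep_w.getValueFrom: fetch the original continuous
    value and pass it through the transformer."""
    return transformer(example["sepal width"])

example = {"sepal width": 3.5}         # toy example from the original domain
print(get_value_from(example))         # 2, i.e. the '>3.3' interval
```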
Different discretizers support different methods for conversion of continuous values into discrete. The most general is IntervalDiscretizer
that is also used by most discretization methods. Two other discretizers, EquiDistDiscretizer and ThresholdDiscretizer, could easily be replaced by IntervalDiscretizer but are used for speed and simplicity. The fourth discretizer, BiModalDiscretizer, is specialized for discretizations induced by BiModalDiscretization.
All discretizers support a handy method for construction of a new attribute from an existing one.

Methods

constructVariable(attribute)
    Constructs a descriptor for a new attribute whose values are computed from attribute by the discretizer. The new attribute's name equals attribute.name prefixed by "D_", and its symbolic values are discretizer specific. The above example shows what comes out from IntervalDiscretizer. Discretization algorithms actually first construct a discretizer and then call its constructVariable to construct an attribute descriptor. An example of how this method is used is shown in the following section about IntervalDiscretizer.
IntervalDiscretizer
is the most common discretizer. It made its first appearance in the example about general discretization schema and you will see more of it later. It has a single interesting attribute.
Attributes

points — the cut-off points. The number of intervals is thus len(points)+1.

Let us manually construct an interval discretizer with cut-off points at 3.0 and 5.0. We shall use the discretizer to construct a discretized sepal length.
part of discretization.py (uses iris.tab)
That's all. The first five examples of data2 are now
Can you use the same discretizer for more than one attribute? Yes, as long as they have the same cut-off points, of course. Simply call constructVariable for each continuous attribute.
part of discretization.py (uses iris.tab)
Each attribute now has its own ClassifierFromVar
in its getValueFrom
, but all use the same IntervalDiscretizer
, idisc
. Changing an element of its points
affects all attributes.
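The sharing can be mimicked in plain Python with two converters closing over the same list of cut-off points (a stand-in illustration, not the Orange API):

```python
from bisect import bisect_left

points = [3.0, 5.0]                 # a single shared list of cut-off points

# Two converters for two attributes, both closing over the same list --
# mirroring two attributes whose getValueFrom share one IntervalDiscretizer.
def d_sepal_length(v):
    return bisect_left(points, v)

def d_petal_length(v):
    return bisect_left(points, v)

print(d_sepal_length(4.5), d_petal_length(4.5))  # 1 1
points[1] = 4.0                     # change an element of the shared points...
print(d_sepal_length(4.5), d_petal_length(4.5))  # 2 2 -- both converters changed
```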
Do not change the length of points if the discretizer is used by any attribute. The length of points should always match the number of values of the attribute, which is determined by the length of the attribute's field values. Therefore, if attr is a discretized attribute, then len(attr.values) must equal len(attr.getValueFrom.transformer.points)+1. It always does, unless you deliberately change it. If the sizes don't match, Orange will probably crash, and it will be entirely your fault.
EquiDistDiscretizer
is a bit faster but more rigid than IntervalDiscretizer
: it uses intervals of fixed width.
Attributes

firstCut, firstVal, step, numberOfIntervals

Apart from these, EquiDistDiscretizer has the same interface as that of IntervalDiscretizer. All values below firstCut belong to the first interval (including possible values smaller than firstVal). Otherwise, value val's interval is floor((val-firstVal)/step). If this turns out to be greater than or equal to numberOfIntervals, it is decreased to numberOfIntervals-1.
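The rule can be rendered in pure Python; the helper below is an illustrative sketch of the described arithmetic, not the Orange API:

```python
from math import floor

def equidist_interval(val, first_cut, first_val, step, n_intervals):
    """Sketch of the fixed-width rule described above (not the Orange API)."""
    if val < first_cut:
        return 0                          # everything below firstCut
    idx = int(floor((val - first_val) / step))
    return min(idx, n_intervals - 1)      # clamp to the last interval

# intervals of width 1 counted from 2.0, first cut at 3.0, 6 intervals
print([equidist_interval(v, 3.0, 2.0, 1.0, 6) for v in (1.0, 3.5, 9.9)])  # [0, 1, 5]
```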
This discretizer is returned by EquiDistDiscretization; you can see an example in the corresponding section. You can also construct an EquiDistDiscretizer manually and call its constructVariable, just as already shown for the IntervalDiscretizer.
ThresholdDiscretizer converts continuous values into binary ones by comparing them with a threshold. This discretizer is not used by any discretization method, but you can use it for manual discretization. Orange needs this discretizer for binarization of continuous attributes in decision trees.

Attributes

threshold — the threshold with which the values are compared
This is the first discretizer that could not be replaced by an IntervalDiscretizer. It has two cut-off points, and values are discretized according to whether they belong to the middle region (which includes the lower but not the upper boundary) or not. The discretizer is returned by BiModalDiscretization if its field splitInTwo is true (which it is by default); see an example there.
Attributes
EquiDistDiscretization
discretizes the attribute by cutting it into the prescribed number of intervals of equal width. The examples are needed to determine the span of attribute values. The interval between the smallest and the largest is then cut into equal parts.
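The equal-width rule can be sketched in pure Python (the helper name is invented for illustration; this is not the Orange API):

```python
def equal_width_cuts(values, n):
    """Equal-width sketch: span from the smallest to the largest value
    and cut it into n equal parts, yielding n-1 cut-off points."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n
    return [lo + step * i for i in range(1, n)]

# six intervals over a toy range 4.0..7.0
print(equal_width_cuts([4.0, 5.5, 7.0], 6))  # [4.5, 5.0, 5.5, 6.0, 6.5]
```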
Attributes

numberOfIntervals — the number of intervals into which the attribute is to be discretized
For an example, we shall discretize all attributes of the Iris dataset into 6 intervals. We shall construct an ExampleTable with the discretized attributes and print descriptions of the attributes. The script's output is
Is there a more decent way for a script to find the interval boundaries than by parsing the symbolic values? Sure; they are hidden in the discretizer, which is, as usual, stored in attr.getValueFrom.transformer.
Compare the following with the values above.
Like all discretizers, EquiDistDiscretizer also has the method constructVariable. The following example discretizes all attributes into six equal intervals of width 1, the first interval
EquiNDiscretization
discretizes the attribute by cutting it into the prescribed number of intervals so that each of them contains an equal number of examples. The examples are obviously needed for this discretization, too.
Attributes

numberOfIntervals — the number of intervals into which the attribute is to be discretized
The use of this discretization is equivalent to the above one, except that we use EquiNDiscretization
instead of EquiDistDiscretization
. The resulting discretizer is IntervalDiscretizer
, hence it has points
instead of firstCut
/step
/numberOfIntervals
.
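The equal-frequency rule that produces such points can be sketched in pure Python (an invented helper, not the Orange API):

```python
def equal_freq_cuts(values, n):
    """Equal-frequency sketch: choose cut-off points so that each of the
    n intervals holds about the same number of examples."""
    s = sorted(values)
    return [s[len(s) * i // n] for i in range(1, n)]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(equal_freq_cuts(data, 3))  # [5, 9] -- four examples per interval
```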
Fayyad-Irani's discretization method works without a predefined number of intervals. Instead, it recursively splits intervals at the cut-off point that minimizes the entropy, until the entropy decrease is smaller than the increase of MDL induced by the new point.
An interesting thing about this discretization technique is that an attribute can be discretized into a single interval, if no suitable cut-off points are found. If this is the case, the attribute is rendered useless and can be removed. This discretization can therefore also serve for feature subset selection.
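One step of that recursion can be sketched in pure Python. The helper below is illustrative only: it finds the entropy-minimizing cut-off for (value, class) pairs and omits the MDL stopping criterion that decides whether the cut is worth keeping.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_cut(pairs):
    """Try every boundary between two different attribute values and keep
    the cut-off that minimizes the weighted entropy of the two halves."""
    pairs = sorted(pairs)
    n, best = len(pairs), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[0]:
            best = (score, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best[1]

# toy attribute: small values belong to class 'a', large ones to 'b'
pairs = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (4.0, "b"), (4.5, "b")]
print(best_cut(pairs))  # 3.0
```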
Attributes

forceAttribute — if true, at least one cut-off point is induced even when the entropy criterion alone would leave the attribute in a single interval (default: false).

part of discretization.py (uses iris.tab)
The output shows that all attributes are discretized into three intervals:
BiModalDiscretization sets two cut-off points so that the class distribution of the examples between them is as different from the overall distribution as possible. The difference is measured by the chi-square statistic. All possible pairs of cut-off points are tried; the discretization thus runs in O(n²).
This discretization method is especially suitable for attributes in which the middle region corresponds to normal and the outer regions to abnormal values. Depending on the nature of the attribute, we can treat the lower and higher values separately, discretizing the attribute into three intervals, or together, yielding a binary attribute whose values correspond to normal and abnormal.
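The search described above can be sketched in pure Python: score every pair of candidate boundaries by the chi-square distance between the class distribution inside the middle region and the overall distribution, and keep the best pair. This is an illustrative brute-force sketch, not the Orange implementation.

```python
def chi2_middle_vs_all(examples, low, high):
    """Chi-square distance between the class distribution of the middle
    region [low, high) -- lower boundary included, upper excluded, as in
    the text -- and the overall class distribution."""
    classes = sorted({c for _, c in examples})
    total = {c: sum(1 for _, cc in examples if cc == c) for c in classes}
    mid = [c for v, c in examples if low <= v < high]
    n_mid, n_all = len(mid), len(examples)
    score = 0.0
    for c in classes:
        expected = total[c] * n_mid / n_all   # count expected under overall dist.
        observed = mid.count(c)
        if expected:
            score += (observed - expected) ** 2 / expected
    return score

# toy data: class 'y' dominates the middle values
examples = [(1, "x"), (2, "x"), (3, "y"), (4, "y"), (5, "x"), (6, "x")]
bounds = sorted({v for v, _ in examples})
best = max(((chi2_middle_vs_all(examples, lo, hi), lo, hi)
            for i, lo in enumerate(bounds) for hi in bounds[i + 1:]),
           key=lambda t: t[0])
print(best[1], best[2])  # 3 5 -- the middle region where 'y' dominates
```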
Attributes

splitInTwo — decides whether the attribute is discretized into three intervals or two. If true (which is the default), we have three intervals and the discretizer is of type BiModalDiscretizer. If false, the result is the ordinary IntervalDiscretizer.

The Iris dataset has a three-valued class attribute; the classes are setosa, virginica and versicolor. As the picture below shows, the sepal lengths of versicolors lie between those of setosas and virginicas (the picture itself is drawn using LOESS probability estimation; see the documentation on the naive Bayesian learner).
If we merge classes setosa and virginica into one, we can observe whether the bi-modal discretization would correctly recognize the interval in which versicolors dominate.
In this script, we have constructed a new class attribute which tells whether an iris is versicolor or not. With a simple lambda function, we have specified how this attribute's value is computed from the original class value. Finally, we have constructed a new domain and converted the examples. Now for discretization.
Script prints out the middle intervals:
Judging by the graph, the cut-off points for "sepal length" make sense.