Prev: Other Techniques for Orange Scripting, Next: Feature Subset Selection, Up: Other Techniques for Orange Scripting
Data discretization (in machine learning sometimes also referred to as categorization) is a procedure that takes a data set and converts all continuous attributes to categorical ones. In other words, it discretizes the continuous attributes. Orange's core supports three discretization methods: equal-width intervals (orange.EquiDistDiscretization), equal-frequency intervals (orange.EquiNDiscretization), and the class-aware discretization introduced by Fayyad & Irani (AAAI-92), which uses entropy and MDL to find the best cut-off points (orange.EntropyDiscretization). The discretization methods are invoked by calling the preprocessor orange.Preprocessor_discretize, which takes a data set and a discretization method and returns a data set in which every continuous attribute has been discretized.
In machine learning and data mining, discretization may be used for different purposes. It may be interesting in itself to find informative cut-off points in the data (for instance, finding that the cut-off for blood acidity is 7.3 may mean something to physicians). In machine learning, discretization may also enable the use of learning algorithms that cannot handle continuous-valued attributes (for instance, the naive Bayes classifier in Orange).
Here is an Orange script that illustrates the basic use of Orange's discretization functions.
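The listing below is a minimal sketch of such a script; it assumes the iris data set is available as iris.tab and uses the (old) orange scripting module:

    # a sketch; assumes iris.tab is in the working directory
    import orange
    data = orange.ExampleTable("iris")

    # Fayyad-Irani's entropy-based discretization of all continuous attributes
    data_ent = orange.Preprocessor_discretize(data,
        method=orange.EntropyDiscretization())

    # equal-frequency discretization into four intervals (quartiles)
    data_n = orange.Preprocessor_discretize(data,
        method=orange.EquiNDiscretization(numberOfIntervals=4))

    print "Entropy-based discretization:"
    for attr in data_ent.domain.attributes:
        print "  %s: %s" % (attr.name, attr.values)

    print "Equal-frequency discretization:"
    for attr in data_n.domain.attributes:
        print "  %s: %s" % (attr.name, attr.values)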
Two types of discretization are used in the script: Fayyad-Irani's method and equal-frequency interval discretization. In the printout, note that Orange also changes the name of the original attribute being discretized by adding “D_” at its start. Further notice that with Fayyad-Irani's discretization, all four attributes were found to have at least two meaningful cut-off points.
In the example above, all continuous attributes were discretized using the same method. This may be fine [in fact, this is how machine learning people most often do discretization], but it may not be the right way to do it, especially if you want to tailor discretization to specific attributes. For this, you may want to apply a different kind of discretization to each attribute: discretize each attribute separately, and then use the newly crafted attributes to form the domain of a new data set. We have not yet told you anything about working with example domains, so if you want to learn more, jump to the Basic Data Manipulation section of this tutorial and then come back. Those of you who trust us in what we are doing can just read on.
In Orange, when converting examples (transforming one data set to another), an attribute's values can be computed from the values of other attributes when needed. This is exactly how discretization works. Let's take the iris data set again. We shall replace petal length by a quartile-discretized attribute called pl. For sepal length, we'll keep the original attribute, but add a version discretized using quartiles (sl) and one discretized using Fayyad-Irani's algorithm (sl_ent). We shall also keep the original (continuous) attribute sepal width.
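Here is a sketch of the code, under the same assumptions as above; pl, sl and sl_ent are the names used in the text:

    import orange
    data = orange.ExampleTable("iris")

    # discretization methods; each returns a new attribute descriptor
    # when called with an attribute and an example set
    equiN = orange.EquiNDiscretization(numberOfIntervals=4)
    entropy = orange.EntropyDiscretization()

    pl = equiN("petal length", data)        # quartiles of petal length
    sl = equiN("sepal length", data)        # quartiles of sepal length
    sl_ent = entropy("sepal length", data)  # entropy-based sepal length

    # the new domain: discretized petal length, original sepal length with
    # its two discretized versions, original sepal width, and the class
    newdomain = orange.Domain([pl, data.domain["sepal length"], sl, sl_ent,
                               data.domain["sepal width"], data.domain.classVar])
    data2 = orange.ExampleTable(newdomain, data)
    for ex in data2[:3]:
        print ex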
Running the script prints the first few examples of the new data set, with the discretized attributes next to the original ones.
Again, EquiNDiscretization and EntropyDiscretization are two of the classes that perform different kinds of discretization: the first prepares four quartiles, while the second performs Fayyad-Irani's discretization based on entropy and MDL. Both are derived from a common ancestor, Discretization; another discretization we could use is EquiDistDiscretization, which discretizes onto a given number of intervals of equal width.
Called with an attribute (name, index or descriptor) and an example set, a discretization prepares a descriptor of the discretized attribute. The constructed attribute is able to compute its value from the value of the original continuous attribute, and this is why conversion by select can work.
The names of a discretized attribute's values tell the boundaries of the intervals. The output is thus informative, but not easily readable. You can, however, always change the names of the values, as long as their number remains the same, by adding a single line to the script.
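For instance, for the four-interval quartile attribute sl constructed above, a line like this would do (the value names here are our own invention):

    sl.values = ["very short", "short", "long", "very long"]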
Want to know the cut-off points for the discretized attributes?
This requires a little knowledge about the computation mechanics. How does a discretized attribute know from which attribute it should compute its values, and how? An attribute descriptor has a property getValueFrom, which is a kind of classifier (it can indeed be a classifier!) that is given an original example and returns the value for the attribute. When converting examples from one domain to another, getValueFrom is called for all attributes of the new domain that do not occur in the original. getValueFrom takes the value of the original attribute and calls its property transformer to discretize it.
Both EquiNDiscretization and EntropyDiscretization construct transformer objects of type IntervalDiscretizer, whose cut-off points are stored in a list called points:
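For the attributes constructed in the script above, the cut-offs could be inspected with something like this:

    for attr in [pl, sl, sl_ent]:
        print "%s: %s" % (attr.name, attr.getValueFrom.transformer.points)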
For each attribute, this prints its name followed by the list of its cut-off points.
Sometimes you may not like the cut-offs suggested by the functions in Orange. In fact, in our experience domain experts want cut-offs at least rounded, if not changed to something else completely. To do this, simply assign new values to the cut-off points. Remember that when a new attribute is crafted (like sl), this specifies only the domain of the attribute and how it is derived; we have not created a data set with this attribute yet, so before that there is still time to change what the discretization will actually do to the data. In the following example, we round the cut-off points for the attribute pl. [A note is in place here: pl is the Python variable that stores a pointer to our attribute. The name of the attribute itself is derived from the name of the original attribute (petal length) by adding the prefix D_. You may not like this, and you can change it by assigning something else to its name, like pl.name = "pl".]
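A sketch of such rounding, applied to the quartile-discretized pl from above (rounding to one decimal is our arbitrary choice):

    # round pl's cut-off points; the precision is an arbitrary choice
    transformer = pl.getValueFrom.transformer
    transformer.points = [round(p, 1) for p in transformer.points]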
Don't try this with attributes discretized by EquiDistDiscretization. Instead of an IntervalDiscretizer, these use an EquiDistDiscretizer, with the fields firstVal, step and numberOfIntervals.
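For illustration, here is a sketch that builds such an attribute and inspects these fields (assuming the fields are named as stated above):

    # equal-width discretization of sepal width into four intervals
    equiDist = orange.EquiDistDiscretization(numberOfIntervals=4)
    sw = equiDist("sepal width", data)
    t = sw.getValueFrom.transformer
    print t.firstVal, t.step, t.numberOfIntervals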
What we have done above is something very close to manual discretization, except that the number of intervals used was the same as suggested by EquiNDiscretization. To do everything manually, we need to construct the same structures as the discretization algorithms described above. We need to define a descriptor, along with its name, type, values and getValueFrom; the getValueFrom should use an IntervalDiscretizer, through which we specify the cut-off points.
Let's now discretize iris' attribute petal length into a new attribute pl with three intervals, using cut-off points 2.0 and 4.0.
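Here is a sketch of the code, assuming the ClassifierFromVar/transformer machinery described above:

    import orange
    data = orange.ExampleTable("iris")

    # a new categorical attribute with three hand-named intervals
    pl = orange.EnumVariable("pl", values=["low", "medium", "high"])

    # compute pl from the original continuous petal length, cutting at 2.0 and 4.0
    pl.getValueFrom = orange.ClassifierFromVar(whichVar=data.domain["petal length"])
    pl.getValueFrom.transformer = orange.IntervalDiscretizer(points=[2.0, 4.0])

    # a data set showing both the original and the discretized attribute
    newdata = data.select(["petal length", pl, data.domain.classVar])
    for ex in newdata[:3]:
        print ex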
Notice that we have also named each of the three intervals, and constructed a data set that shows both the original and the discretized attribute.
In machine learning, you would often discretize the learning set. How does one then apply the same discretization to the test set? For discretized attributes, Orange remembers how they were converted from their original continuous versions, so you only need to convert the testing examples to the new (discretized) domain.
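Here is a sketch, using a hypothetical 70:30 split of iris into learning and testing examples:

    import orange
    data = orange.ExampleTable("iris")

    # a hypothetical 70:30 split into learning and testing examples
    indices = orange.MakeRandomIndices2(data, p0=0.7)
    train = data.select(indices, 0)
    test = data.select(indices, 1)

    # discretize the learning set
    train_d = orange.Preprocessor_discretize(train,
        method=orange.EntropyDiscretization())

    # convert the testing examples to the learning set's discretized domain
    test_d = orange.ExampleTable(train_d.domain, test)

    print "first learning example:", train_d[0]
    print "first testing example: ", test_d[0]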
The printout shows that both the learning and the testing examples now live in the same discretized domain.