C4.5 is a standard benchmark in machine learning. For this reason it is incorporated in Orange, although Orange has its own implementation of decision trees. C45Learner uses Quinlan's original code for learning, so the tree you get is exactly like the one that would be built by the standalone C4.5. Upon return, however, the original tree is copied into Orange components that contain exactly the same information plus what is needed to make them visible from Python. To make sure that the algorithm behaves just like the original, we use a dedicated class, orange.C45TreeNode, instead of reusing the components of Orange's tree inducer (i.e., orange.TreeNode). This could be changed, and probably will be in the future; we would still retain orange.C45TreeNode, but offer a transformation to orange.TreeNode so that routines that work on Orange trees would also be usable for C4.5 trees.
C45Learner and C45Classifier behave like any other Orange learner and classifier. Unlike most Orange learning algorithms, however, C4.5 does not accept weighted examples.
We haven't been able to obtain the legal rights to distribute C4.5 and therefore couldn't link it statically into Orange. Instead, it is incorporated as a plug-in which you will need to build yourself. The procedure is trivial, but you will need a C compiler. On Windows, the scripts we provide work with MS Visual C, and the files CL.EXE and LINK.EXE must be on the PATH. On Linux you are already equipped with gcc. Mac OS X comes without gcc, but you can download it for free from Apple.
Orange must be installed prior to building C4.5. (This is because the build script copies the created file next to Orange, which it obviously can't do if Orange isn't there yet.) Once the build finishes, check that it worked by calling orange.C45Learner(). If this fails, something went wrong; see the diagnostic messages from buildC45.py and read the paragraph below.

If the process fails, here is what buildC45.py really does: it creates .h files that wrap Quinlan's .i files and ensure that they are not included twice, and it modifies the C4.5 sources to include the .h files instead of the .i files. This step can hardly fail. Then follows the platform-dependent step, which compiles ensemble.c (which includes all the Quinlan's .c files it needs) into c45.dll or c45.so and puts it next to Orange. If this step fails but you have a C compiler and linker and know how to use them, you can compile ensemble.c into a dynamic library yourself; see the compile and link steps in buildC45.py if it helps. In any case, after doing this, check that the built C4.5 gives the same results as the original.
C45Learner's attributes have two names each: the one you know from the C4.5 command line and the corresponding name of C4.5's internal variable. All defaults are set as in C4.5; if you change nothing, you are running C4.5.
Attributes

gainRatio (g): determines whether to use information gain (false, default) or gain ratio for selection of attributes (true)

subset (s): enables subsetting (default: false, no subsetting)

probThresh (p): probabilistic threshold for continuous attributes (default: false)

batch (b): batch mode (default: true)

C45Learner also offers another way of setting the arguments: it provides a method commandline which is given a string and parses it the same way as C4.5 would parse its command line.
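As an illustration of the idea behind such a method, here is a minimal sketch of an option-string parser. It is our own code, not Orange's implementation, and it assumes that -g, -s and -p toggle gainRatio, subset and probThresh, as in C4.5:

```python
# Hypothetical sketch: map a C4.5-style option string onto attribute
# settings. This is an illustration, not Orange's actual implementation.
def parse_c45_options(cmdline):
    # Boolean switches assumed here: -g (gain ratio), -s (subsetting),
    # -p (probabilistic thresholds).
    flags = {"-g": "gainRatio", "-s": "subset", "-p": "probThresh"}
    settings = {name: False for name in flags.values()}  # C4.5-like defaults
    for token in cmdline.split():
        if token not in flags:
            raise ValueError("unrecognised option: " + token)
        settings[flags[token]] = True
    return settings
```

For example, parse_c45_options("-g -s") would report gainRatio and subset as switched on, leaving probThresh at its default.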
C45Classifier contains a faithful reimplementation of Quinlan's classification function from C4.5. The only difference (and the only reason it has been rewritten) is that it uses a tree composed of orange.C45TreeNodes instead of C4.5's original tree structure.
Attributes

tree: the root of the induced tree; the whole tree is composed of C45TreeNodes.
This class is a reimplementation of the corresponding struct
from Quinlan's C4.5 code.
Attributes

nodeType: the type of the node, one of C45TreeNode.Leaf (0), C45TreeNode.Branch (1), C45TreeNode.Cut (2) and C45TreeNode.Subset (3). "Leaves" are leaves, "branches" split examples based on the values of a discrete attribute, "cuts" cut them according to a threshold value of a continuous attribute, and "subsets" use discrete attributes but with subsetting, so that several values can go into the same branch.

leaf: the Value returned by the leaf. The field is defined for internal nodes as well.

classDist: class distribution for the node (a DiscDistribution).

tested: the attribute the node tests. If the node is a leaf, tested is None; if the node is of type Branch or Subset, tested is a discrete attribute; and if the node is of type Cut, tested is a continuous attribute.

cut: the threshold for nodes of type Cut. Undefined otherwise.

mapping: the value-to-branch mapping for nodes of type Subset. Element mapping[i] gives the branch index for an example whose value of tested is i. Here, i denotes an index of value, not a Value.

branch: the list of the node's subtrees.

items: the number of learning examples in the node.
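To make the traversal semantics of nodeType, cut and mapping concrete, here is a small sketch that walks such a tree by hand. The MockNode class and the walk function are our own mock-ups for illustration; only the branching rules mirror the description above, and for simplicity a single value stands in for a whole example:

```python
# Mock-up of C45TreeNode traversal; node types as described above:
# Leaf = 0, Branch = 1, Cut = 2, Subset = 3.
class MockNode:
    def __init__(self, nodeType, leaf=None, cut=None, mapping=None, branch=None):
        self.nodeType, self.leaf = nodeType, leaf
        self.cut, self.mapping, self.branch = cut, mapping, branch

def walk(node, value):
    if node.nodeType == 0:        # Leaf: return the stored class value
        return node.leaf
    if node.nodeType == 1:        # Branch: value index selects the branch
        return walk(node.branch[value], value)
    if node.nodeType == 2:        # Cut: compare a continuous value to cut
        return walk(node.branch[0 if value <= node.cut else 1], value)
    if node.nodeType == 3:        # Subset: mapping[i] gives the branch index
        return walk(node.branch[node.mapping[value]], value)

# A Subset node sending values 0 and 2 into branch 0, and value 1 into branch 1.
subset_tree = MockNode(3, mapping=[0, 1, 0],
                       branch=[MockNode(0, leaf="yes"), MockNode(0, leaf="no")])
```

Here walk(subset_tree, 2) follows mapping[2] == 0 into the first branch and returns "yes".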
The simplest way to use C45Learner
is to call it. This
script constructs the same learner as you would get by calling the usual C4.5.
part of c45.py (uses lenses.tab)
Arguments can be set by the usual mechanism (the two lines below do the same, except that one uses command-line symbols and the other the internal variable names)
Veteran C4.5 users might prefer to set the options through the method commandline.
There's nothing special about using C45Classifier; it behaves just like any other classifier. To demonstrate what the structure of C45TreeNodes looks like, we will show a script that prints a tree out in the same format as C4.5 does. (You can find the script in the module orngC45.)
def printTree0(node, classvar, lev):
    var = node.tested
    if node.nodeType == 0:
        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
    elif node.nodeType == 1:
        for i, val in enumerate(var.values):
            print ("\n"+"| "*lev + "%s = %s:") % (var.name, val),
            printTree0(node.branch[i], classvar, lev+1)
    elif node.nodeType == 2:
        print ("\n"+"| "*lev + "%s <= %.1f:") % (var.name, node.cut),
        printTree0(node.branch[0], classvar, lev+1)
        print ("\n"+"| "*lev + "%s > %.1f:") % (var.name, node.cut),
        printTree0(node.branch[1], classvar, lev+1)
    elif node.nodeType == 3:
        for i, branch in enumerate(node.branch):
            inset = filter(lambda a:a[1]==i, enumerate(node.mapping))
            inset = [var.values[j[0]] for j in inset]
            if len(inset)==1:
                print ("\n"+"| "*lev + "%s = %s:") % (var.name, inset[0]),
            else:
                print ("\n"+"| "*lev + "%s in {%s}:") % (var.name, reduce(lambda x,y:x+", "+y, inset)),
            printTree0(branch, classvar, lev+1)

def printTree(tree):
    printTree0(tree.tree, tree.classVar, 0)
    print
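The script above is Python 2 and needs a built C4.5 plug-in to have something to print. As a self-contained check of the same formatting idea, here is a Python 3 sketch that renders leaf and cut nodes on mock objects; all the names in it are our own mock-ups, not Orange's:

```python
# Python 3 sketch of the printing logic for Leaf (0) and Cut (2) nodes,
# using a mock node class in place of Orange's tree structures.
class Node:
    def __init__(self, nodeType, leaf=None, items=0.0, cut=None, branch=None):
        self.nodeType, self.leaf, self.items = nodeType, leaf, items
        self.cut, self.branch = cut, branch

def render(node, class_values, attr_name, lev=0):
    if node.nodeType == 0:    # leaf: class name and example count
        return " %s (%.1f)" % (class_values[int(node.leaf)], node.items)
    out = ""
    if node.nodeType == 2:    # cut: two branches around the threshold
        for op, child in zip(("<=", ">"), node.branch):
            out += "\n" + "|   " * lev + "%s %s %.1f:" % (attr_name, op, node.cut)
            out += render(child, class_values, attr_name, lev + 1)
    return out

mock = Node(2, cut=1.5, branch=[Node(0, leaf=0, items=3.0),
                                Node(0, leaf=1, items=2.0)])
print(render(mock, ["no", "yes"], "width"))
```

This prints "width <= 1.5: no (3.0)" on one line and "width > 1.5: yes (2.0)" on the next, mimicking C4.5's layout.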
Leaves are the simplest: we just print out the value contained in node.leaf. Since this is not a qualified value (i.e., C45TreeNode does not know to which attribute it belongs), we need to convert it to a string through classVar, which is passed as an extra argument to the recursive part of printTree.
For discrete splits without subsetting, we print out all attribute values and recursively call the function for all branches. Continuous splits are equally easy to handle.
For discrete splits with subsetting, we iterate over the branches, collect into inset the values that go into each branch, turn the values into strings and print them out, treating separately the case when only a single value goes into the branch.
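The inset computation can be isolated and checked on its own. The following sketch (our code, independent of Orange) groups attribute values by the branch index that mapping assigns to them, which is exactly what the subsetting case of printTree0 computes before turning indices into names:

```python
def values_per_branch(mapping, values):
    # mapping[i] is the branch index for the i-th value, as in
    # C45TreeNode.mapping; values are the attribute's value names.
    groups = [[] for _ in range(max(mapping) + 1)]
    for value_index, branch_index in enumerate(mapping):
        groups[branch_index].append(values[value_index])
    return groups
```

For a three-valued attribute whose first and third values share a branch, values_per_branch([0, 1, 0], ["low", "medium", "high"]) groups "low" and "high" together.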