Here we show how to use the schema that allows us to build our own learners/classifiers for bagging. While bagging, boosting, and other ensemble methods are available in the orngEnsemble module, we thought that explaining how to code bagging in Python would make for a nice example. The following pseudo-code (after Witten & Frank: Data Mining) illustrates the main idea of bagging:
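MODEL GENERATION
Let n be the number of instances in the training data.
For each of t iterations:
    Sample n instances with replacement from the training data.
    Apply the learning algorithm to the sample.
    Store the resulting model.

CLASSIFICATION
For each of the t models:
    Predict the class of the instance using the model.
Return the class that has been predicted most often.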
Using the above idea, our Learner_Class will need to build t classifiers and pass them to Classifier, which, when given a data instance, will use them for classification. We will allow the parameter t to be specified by the user, with 10 as the default.
The code for Learner_Class is therefore:
class Learner_Class from bagging.py
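A minimal sketch of how this class might look, written to match the description that follows; the import lines and the default values for t and name are assumptions:

import random
import orange  # old-style Orange (1.x) module; assumed throughout this section

class Learner_Class:
    def __init__(self, learner, t=10, name='bagged classifier'):
        self.learner = learner  # base learner to be bagged
        self.t = t              # number of models to build
        self.name = name        # name of the classifier

    def __call__(self, examples, weight=None):
        n = len(examples)
        classifiers = []
        for i in range(self.t):
            # sample n instance indices with replacement ...
            selection = [random.randrange(n) for j in range(n)]
            # ... and use them to draw a bootstrap sample
            data = examples.getitems(selection)
            classifiers.append(self.learner(data))
        return Classifier(classifiers=classifiers, name=self.name,
                          domain=examples.domain)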
Upon invocation, __init__ stores the base learner (the one that will be bagged), the value of the parameter t, and the name of the classifier. Note that while the learner requires the base learner to be specified, the parameters t and name are optional.
When the learner is called with examples, a list of t classifiers is built and stored in the variable classifiers. Notice that for sampling the data with replacement, a list of data instance indices is built (selection) and then used to draw the sample from the training examples (examples.getitems). Finally, a Classifier is called with the list of classifiers, the name, and the domain information.
class Classifier from bagging.py
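Again, a minimal sketch consistent with the description that follows; the resultType argument and the orange.GetValue/orange.GetProbabilities constants follow Orange's usual classifier convention and are assumptions here:

# continues bagging.py; orange was imported above
class Classifier:
    def __init__(self, **kwds):
        # store all parameters the classifier was invoked with
        # (classifiers, name, domain)
        self.__dict__.update(kwds)

    def __call__(self, example, resultType=orange.GetValue):
        # freq[i] counts the models that vote for the i-th class
        freq = [0.0] * len(self.domain.classVar.values)
        for c in self.classifiers:
            freq[int(c(example))] += 1
        index = freq.index(max(freq))
        value = orange.Value(self.domain.classVar, index)
        # "probabilities" as the proportion of votes for each class
        freq = [f / len(self.classifiers) for f in freq]
        if resultType == orange.GetValue:
            return value
        elif resultType == orange.GetProbabilities:
            return freq
        else:
            return (value, freq)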
For initialization, Classifier stores all the parameters it was invoked with. When called with a data instance, a list freq is initialized; its length equals the number of classes, and it records how many models classify the instance into each class. The class that the majority of models voted for is returned. While it would be possible to return the class's index, or even its name, by convention classifiers in Orange return a Value object instead.
Notice that while bagging was not originally intended to compute class probabilities, we compute these as the proportion of models that voted for a certain class (this is probably incorrect, but it suffices for our example and does no harm if only class values, and not probabilities, are used).
Here is the code that tests the bagging learner we have just implemented. It compares a decision tree and its bagged variant. Run it yourself to see which one does better!
bagging_test.py (uses bagging.py and adult_sample.tab)
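A sketch of how such a test might look, assuming the old-style Orange modules orngTree, orngTest, and orngStat and ten-fold cross-validation; the learner names and settings are illustrative:

import orange, orngTree, orngTest, orngStat
import bagging

data = orange.ExampleTable("adult_sample")

# a plain decision tree and its bagged variant
tree = orngTree.TreeLearner()
tree.name = "tree"
bagged = bagging.Learner_Class(learner=tree, t=10, name="bagged tree")

# estimate classification accuracy by 10-fold cross-validation
learners = [tree, bagged]
results = orngTest.crossValidation(learners, data, folds=10)
for i in range(len(learners)):
    print learners[i].name, "%5.3f" % orngStat.CA(results)[i]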