Imputation is the procedure of replacing missing attribute values with appropriate substitutes. It is needed by methods (learning algorithms and others) that cannot handle unknown values, such as logistic regression.
Missing values sometimes have a special meaning, so they need to be replaced by a designated value. Sometimes we know what to replace the missing value with; for instance, in a medical problem, some laboratory tests might not be done when it is already known what their results would be, and in that case we impute a fixed value in place of the missing one. In the most complex case, we assign values that are computed from some model; we can, for instance, impute the average or the majority value, or even a value predicted from the values of other, known attributes by a classifier.
In a learning/classification process, imputation is needed on two occasions: before learning, the imputer must process the training examples; afterwards, the imputer is called for each example to be classified.
In general, the imputer itself needs to be trained. This is, of course, not needed when it imputes a fixed value. However, an imputer that imputes the average or the majority value must first compute the statistics on the training examples and then use them for imputing values into both training and testing examples.
While reading this document, bear in mind that imputation is a part of the learning process. If we fit the imputation model, for instance, by learning how to predict the attribute's value from other attributes, or even if we simply compute the average or the minimal value for the attribute and use it in imputation, this should only be done on learning data. If cross validation is used for sampling, imputation should be done on training folds only. Orange provides simple means for doing that.
This page will first explain how to construct various imputers. Then follow examples of the proper use of imputers. Finally, quite often you will want imputation with special requests, such as some attributes' missing values being replaced by constants and others by values computed using models induced from specified other attributes. For instance, in one of the studies we worked on, some attributes' values were replaced by the most pessimistic ones, while the patient's pulse rate was estimated with regression trees induced from the scope of the patient's injuries, sex and age. If you are using learners that need the imputer as a component, you will need to write your own imputer constructor. This is trivial and is explained at the end of this page.
As is common in Orange, imputation is done by a pair of classes: one that does the work and another that constructs it.
ImputerConstructor is the abstract root of a hierarchy of classes that are given the training data (and an optional id of the weight meta-attribute) and construct an instance of a class derived from Imputer. An Imputer can be called with an Example, in which case it returns a new example with the missing values imputed (the original example is left intact!). If the imputer is called with an ExampleTable, it returns a new example table with imputed examples.
Attributes of ImputerConstructor
imputeClass: tells whether to impute the class value as well; defaults to true.
The simplest imputers always impute the same value for a particular attribute, disregarding the values of other attributes. They all use the same imputer class, Imputer_defaults.
Attributes of Imputer_defaults
defaults: an example with the default values that are imputed instead of the missing ones. Examples to be imputed must be from the same domain as defaults.
Imputer_defaults is constructed by ImputerConstructor_minimal, ImputerConstructor_maximal and ImputerConstructor_average. For continuous attributes, these impute the smallest, the largest or the average value encountered in the training examples, respectively. For discrete attributes, they impute the lowest value (the one with index 0, e.g. attr.values[0]), the highest (attr.values[-1]) or the most common value encountered in the data. The first two imputers will mostly be used when the discrete values are ordered according to their impact on the class (for instance, possible values for symptoms of some disease can be ordered according to their seriousness). The minimal and maximal imputers then represent the optimistic and the pessimistic imputation.
The following code loads the bridges data, then imputes the values first in a single example and then in the whole table.
part of imputation.py (uses bridges.tab)
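The listing itself is not preserved on this page; the following is a minimal sketch of what it might look like, assuming the Orange 2.x API (the example index 19 is arbitrary, chosen only for illustration):

    import orange

    data = orange.ExampleTable("bridges")
    imputer = orange.ImputerConstructor_minimal(data)

    # impute a single example; the original example is left intact
    print "Example with missing values:"
    print data[19]
    print "Imputed:"
    print imputer(data[19])

    # impute the whole table at once
    impdata = imputer(data)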
This example shows what the imputer does, not how it is to be used. Do not impute all the data and then use it for cross-validation; as warned at the top of this page, see the instructions for the actual use of imputers.
Note that ImputerConstructors are another Orange class with a schizophrenic constructor: if you give the constructor the data, it will return an Imputer rather than an imputer constructor - the above call is equivalent to calling orange.ImputerConstructor_minimal()(data).
You can also construct an Imputer_defaults yourself and specify your own defaults, or leave some values unspecified, in which case the imputer won't impute them, as in the following example.
part of imputation.py (uses bridges.tab)
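Again only a sketch of the omitted listing, continuing with the data loaded above; the attribute name "LENGTH" is the one discussed in the text below:

    imputer = orange.Imputer_defaults(data.domain)
    imputer.defaults["LENGTH"] = 1234
    imputed = imputer(data[0])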
Here, the only attribute whose values will get imputed is "LENGTH"; the imputed value will be 1234.
The Imputer_defaults's constructor will accept an argument of type Domain (in which case it will construct an empty example for defaults) or an example. Be careful with the latter: Imputer_defaults will keep a reference to the example, not a copy. You can make a copy yourself to avoid problems: instead of orange.Imputer_defaults(data[0]) you may want to write orange.Imputer_defaults(orange.Example(data[0])).
Imputer_random imputes random values. The corresponding constructor is ImputerConstructor_random.
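A sketch of its use, continuing with the same data; the calling pattern is assumed to match the other constructors:

    imputer = orange.ImputerConstructor_random(data)
    impdata = imputer(data)   # missing values are replaced by random values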
Model-based imputers learn to predict the attribute's value from the values of other attributes. ImputerConstructor_model is given a learning algorithm (two, actually - one for discrete and one for continuous attributes) and constructs a classifier for each attribute. The constructed imputer, Imputer_model, stores a list of classifiers which are used when needed.
Attributes of ImputerConstructor_model
learnerDiscrete, learnerContinuous: the learners used for discrete and continuous attributes, respectively.
useClass: tells whether the imputer is allowed to use the class value; defaults to false. It can however be useful for a more complex design in which we would use one imputer for learning examples (this one would use the class value) and another for testing examples (which would not use the class value, as it is unavailable at that moment).

Attributes of Imputer_model
models: a list of classifiers, one per attribute; the classVar's of the models should equal the examples' attributes. If any classifier is missing (that is, if the corresponding element of the list is None), the corresponding attribute's values will not be imputed.

The following imputer predicts the missing attribute values using classification and regression trees with a minimum of 20 examples in a leaf.
part of imputation.py (uses bridges.tab)
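A sketch of the omitted listing; the parameter name minSubset is an assumption about how orngTree.TreeLearner expresses the 20-examples limit:

    import orange, orngTree

    data = orange.ExampleTable("bridges")
    imputer = orange.ImputerConstructor_model()
    # one tree learner serves both attribute types (see the explanation below)
    imputer.learnerContinuous = imputer.learnerDiscrete = \
        orngTree.TreeLearner(minSubset=20)
    imputer = imputer(data)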
We could even use the same learner for discrete and continuous attributes! (The way this functions is rather tricky. If you desire to know: orngTree.TreeLearner is a learning algorithm written in Python. Orange doesn't mind; it wraps Python-written learners into a C++ wrapper which then calls back the Python code. When given the examples to learn from, orngTree.TreeLearner checks the class type. If it is continuous, it sets up orange.TreeLearner to construct regression trees, and if it is discrete, it sets up the components for classification trees. The common parameters, such as the minimal number of examples in leaves, are used in both cases.)
You can of course use different learning algorithms for discrete and continuous attributes. A common setup would be BayesLearner for discrete attributes and MajorityLearner (which just remembers the average) for continuous ones, as follows.
part of imputation.py (uses bridges.tab)
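A sketch under the same assumptions as above:

    imputer = orange.ImputerConstructor_model()
    imputer.learnerContinuous = orange.MajorityLearner()
    imputer.learnerDiscrete = orange.BayesLearner()
    imputer = imputer(data)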
You can also construct an Imputer_model yourself. You will do this if different attributes need different treatment. Brace yourself for an example that will be a bit more complex. First we shall construct an Imputer_model and initialize an empty list of models.
part of imputation.py (uses bridges.tab)
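A sketch of this first step:

    import orange

    data = orange.ExampleTable("bridges")
    imputer = orange.Imputer_model()
    imputer.models = [None] * len(data.domain)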
Attributes "LANES" and "T-OR-D" will always be imputed values 2 and "THROUGH". Since "LANES" is continuous, it suffices to construct a DefaultClassifier
with the default value 2.0 (don't forget the decimal part, or else Orange will think you talk about an index of a discrete value - how could it tell?). For the discrete attribute "T-OR-D", we could construct a DefaultClassifier
and give the index of value "THROUGH" as an argument. But we shall do it nicer, by constructing a Value
. Both classifiers will be stored at the appropriate places in imputer.models
.
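A sketch of these two steps, continuing the snippet above:

    # continuous attribute: a classifier that always returns 2.0
    imputer.models[data.domain.index("LANES")] = orange.DefaultClassifier(2.0)

    # discrete attribute: construct a Value and wrap it in a DefaultClassifier
    tord = orange.DefaultClassifier(orange.Value(data.domain["T-OR-D"], "THROUGH"))
    imputer.models[data.domain.index("T-OR-D")] = tord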
"LENGTH" will be computed with a regression tree induced from "MATERIAL", "SPAN" and "ERECTED" (together with "LENGTH" as the class attribute, of course). Note that we initialized the domain by simply giving a list with the names of the attributes, with the domain as an additional argument in which Orange will look for the named attributes.
We printed the tree just to see what it looks like.
Small and nice. Now for "SPAN". Wooden bridges and walkways are short, while the others are mostly medium. This could be done with a ClassifierByLookupTable, which would be faster than what we plan here; see the corresponding documentation on lookup classifiers. Here we are going to do it with a Python function.
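A sketch of such a function; the attribute names "TYPE" and "PURPOSE" and the values "WOOD", "WALK", "SHORT" and "MEDIUM" are assumptions about bridges.tab:

    spanVar = data.domain["SPAN"]

    def computeSpan(example, returnWhat=orange.Classifier.GetValue):
        # wooden bridges and walkways are short, the others mostly medium
        if example["TYPE"] == "WOOD" or example["PURPOSE"] == "WALK":
            return orange.Value(spanVar, "SHORT")
        else:
            return orange.Value(spanVar, "MEDIUM")

    imputer.models[data.domain.index("SPAN")] = computeSpan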
computeSpan could also be written as a class, if you'd prefer that. What is important is that it behaves like a classifier, that is, it gets an example and returns a value. The second argument tells, as usual, what the caller expects the classifier to return: a value, a distribution or both. Since the caller, Imputer_model, always wants values, we shall ignore the argument (at the risk of having problems in the future, when imputers might handle distributions as well).
OK, that's enough. Other attributes' values will remain unknown.
Missing values sometimes have a special meaning. The fact that something was not measured can sometimes tell a lot. Be cautious, however, when using such values in decision models; if the decision not to measure something (for instance, not performing a laboratory test on a patient) is based on the expert's knowledge of the class value, such unknown values clearly should not be used in models.
ImputerConstructor_asValue constructs a new domain in which each discrete attribute is replaced with a new attribute that has one value more: "NA". The new attribute computes its values on the fly from the old one, copying the normal values and replacing the unknowns with "NA".
For continuous attributes, ImputerConstructor_asValue will construct a two-valued discrete attribute with values "def" and "undef", telling whether the continuous attribute was defined or not. The new attribute's name will equal the original's with "_def" appended. The original continuous attribute will remain in the domain and its unknowns will be replaced by averages.
ImputerConstructor_asValue has no specific attributes.
The constructed imputer is named Imputer_asValue (I bet you wouldn't guess). It converts the example into the new domain, which imputes the values for discrete attributes. If continuous attributes are present, it also replaces their missing values by the averages.
Attributes of Imputer_asValue
domain: the new domain with the replacement attributes, as constructed by ImputerConstructor_asValue.

Here's a script that shows what this imputer actually does to the domain.
part of imputation.py (uses bridges.tab)
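A sketch of the omitted script (the example index 19 is again arbitrary):

    import orange

    data = orange.ExampleTable("bridges")
    imputer = orange.ImputerConstructor_asValue(data)

    original = data[19]
    imputed = imputer(data[19])

    print original.domain
    print imputed.domain

    # print each attribute of the original next to its imputed counterpart
    for i in original.domain:
        print "%s: %s -> %s" % (i.name, original[i], imputed[i.name]),
        if i.varType == orange.VarTypes.Continuous:
            print "(%s)" % imputed[i.name + "_def"]
        else:
            print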
The script's output looks like this.
Seemingly, the two examples have the same attributes (with imputed having a few additional ones). But if you check this with original.domain[0] == imputed.domain[0], you will see that the first impression is False. The attributes only have the same names; they are different attributes. If you have read the documentation on attribute descriptors (which is already a bit advanced), you know that Orange does not really care about attribute names.
Therefore, if we wrote imputed[i], the program would fail, since imputed has no attribute i. But it has an attribute with the same name (which usually even has the same value). We therefore use i.name to index the attributes of imputed. (Using names for indexing is not fast, though; if you do it a lot, compute the integer index with imputed.domain.index(i.name).)
For each continuous attribute there is an additional attribute with "_def" appended to the name; we get it by i.name+"_def". Not really nice, but it works.
The first continuous attribute, "ERECTED", is defined. Its value remains 1874 and the additional attribute "ERECTED_def" has the value "def". Not so for "LENGTH": its undefined value is replaced by the average (1567) and the new attribute has the value "undef". The undefined discrete attribute "CLEAR-G" (like all other undefined discrete attributes) is assigned the value "NA".
To properly use the imputation classes in learning process, they must be trained on training examples only. Imputing the missing values and subsequently using the data set in cross-validation will give overly optimistic results.
Orange learners that cannot handle missing values will generally provide a slot for the imputer component. An example of such a class is the logistic regression learner, with an attribute called imputerConstructor. To it you can assign an imputer constructor: one of the above constructors or a specific constructor you wrote yourself. When given learning examples, LogRegLearner will pass them to imputerConstructor to get an imputer (again, one of the above or a specific imputer you programmed) and immediately use it to impute the missing values in the learning data set, so that the data can be used by the actual learning algorithm. Besides, when the classifier (LogRegClassifier) is constructed, the imputer is stored in its attribute imputer. At classification, the imputer is used to impute the missing values in the (testing) examples.
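A sketch of this setup; the data set name is a placeholder, and the learner is assumed to expose the attributes described above:

    import orange

    data = orange.ExampleTable("heart_disease")   # hypothetical data with missing values

    lr = orange.LogRegLearner()
    lr.imputerConstructor = orange.ImputerConstructor_average()

    classifier = lr(data)       # the imputer is trained and applied inside the learner
    imp = classifier.imputer    # the trained imputer, reused at classification time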
Although details may vary from algorithm to algorithm, this is how imputation is generally used in Orange's learners. If you write your own learners, it is recommended that you use imputation according to the described procedure.
Most of Orange's learning algorithms do not use imputers because they can appropriately handle the missing values. Bayesian classifier, for instance, simply skips the corresponding attributes in the formula, while classification/regression trees have components for handling the missing values in various ways.
If for any reason you want these algorithms to run on imputed data, you can use the wrappers provided in the module orngImpute. The module's description is a matter for a separate page, but we shall show its code here as another demonstration of how to use the imputers - logistic regression is implemented essentially the same way as the classes below.
The complete code of module orngImpute.py
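The listing itself has not been preserved here; below is a minimal reconstruction that follows the description in the next paragraph (the names and the simplifications are as described there):

    import orange

    class LearnerWithImputation(orange.Learner):
        def __new__(cls, examples=None, weightID=0, **kwds):
            self = orange.Learner.__new__(cls, **kwds)
            self.__dict__.update(kwds)
            # if examples are given, train and return a classifier immediately
            if examples is not None:
                return self(examples, weightID)
            return self

        def __call__(self, data, weight=0):
            trainedImputer = self.imputer(data, weight)    # train the imputer
            imputedData = trainedImputer(data)             # impute the training data
            baseClassifier = self.baseLearner(imputedData, weight)
            return ClassifierWithImputation(baseClassifier=baseClassifier,
                                            imputer=trainedImputer)

    class ClassifierWithImputation:
        def __init__(self, **kwds):
            self.__dict__.update(kwds)

        def __call__(self, example, returnWhat=orange.GetValue):
            # impute the missing values, then ask the base classifier
            return self.baseClassifier(self.imputer(example), returnWhat)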
LearnerWithImputation puts the keyword arguments into the instance's dictionary. You are expected to call it like LearnerWithImputation(baseLearner=<someLearner>, imputer=<someImputerConstructor>). When the learner is called with examples, it trains the imputer, imputes the data, induces a baseClassifier with the baseLearner and constructs a ClassifierWithImputation that stores the baseClassifier and the imputer. For classification, the missing values are imputed and the base classifier's prediction is returned.
Note that this code is slightly simplified; the omitted details handle non-essential technical issues that are unrelated to imputation.
Imputation classes provide the Python callback functionality (not all Orange classes do; refer to the documentation on subtyping Orange classes in Python for a list). If you want to write your own imputation constructor or imputer, you simply need to program a Python function that behaves like the built-in Orange classes. For an imputer it is even less: you only need to write a function that gets an example as an argument; imputation of example tables will then use that function on each example.
You will most often write an imputation constructor when you have a special imputation procedure, or separate procedures for various attributes, as demonstrated in the description of ImputerConstructor_model. You basically only need to pack everything we wrote there into an imputer constructor that accepts a data set and the id of the weight meta-attribute (ignore it if you will, but you must accept two arguments) and returns the imputer (probably an Imputer_model). The benefit of implementing an imputer constructor, as opposed to what we did above, is that you can use such a constructor as a component for Orange learners (like logistic regression) or for the wrappers from module orngImpute, and in that way properly use it in classifier testing procedures.
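A sketch of such a constructor, reusing the bridges example; imputing just the single attribute "LANES" is only for illustration:

    import orange

    def myImputerConstructor(data, weightID=0):
        # build an Imputer_model by hand, as in the example above
        imputer = orange.Imputer_model()
        imputer.models = [None] * len(data.domain)
        imputer.models[data.domain.index("LANES")] = orange.DefaultClassifier(2.0)
        return imputer

    # the function can now be used as a component, for instance:
    # lr = orange.LogRegLearner(imputerConstructor=myImputerConstructor)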