This module contains various measures of quality for classification and regression. Most functions require an argument named res, an instance of ExperimentResults as computed by functions from orngTest, which contains predictions obtained through cross-validation, leave-one-out, testing on training data or on test set examples.
To prepare some data for the examples on this page, we shall load the voting data set (the problem of predicting a congressman's party, republican or democrat, based on a selection of votes) and evaluate a naive Bayesian learner, a classification tree and a majority classifier using cross-validation. For examples requiring a multivalued class problem, we shall do the same with the vehicle data set (telling whether a vehicle described by features extracted from a picture is a van, a bus, or an Opel or Saab car).
part of statExamples.py
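The snippet itself is not reproduced here. A minimal sketch of such a setup, in the Python 2 syntax of the Orange 2.x interface, might look as follows; the variable names res and resVeh and the ordering of the learners are assumptions reused by the other sketches on this page.

    import orange, orngTest, orngTree

    # load the two data sets used throughout these examples
    voting = orange.ExampleTable("voting")
    vehicle = orange.ExampleTable("vehicle")

    # a naive Bayesian learner, a classification tree and a majority classifier
    learners = [orange.BayesLearner(name="bayes"),
                orngTree.TreeLearner(name="tree"),
                orange.MajorityLearner(name="majority")]

    # ten-fold cross-validation on both problems
    res = orngTest.crossValidation(learners, voting, folds=10)
    resVeh = orngTest.crossValidation(learners, vehicle, folds=10)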
If examples are weighted, the weights are taken into account. This can be disabled by giving unweighted=1 as a keyword argument. Another way of disabling weights is to clear the ExperimentResults' flag weights.
Computes classification accuracy, i.e. the percentage of matches between predicted and actual classes. The function returns a list of classification accuracies of all classifiers tested. If reportSE is set to true, the list will contain tuples with accuracies and standard errors.
If the results are from multiple repetitions of experiments (like those returned by orngTest.crossValidation or orngTest.proportionTest), the standard error (SE) is estimated from the deviation of classification accuracy across folds (SD), as SE = SD/sqrt(N), where N is the number of repetitions (e.g. the number of folds).
If the results are from a single repetition, we assume independence of examples and treat the classification accuracy as distributed according to the binomial distribution. This can be approximated by the normal distribution, so we report the SE as sqrt(CA*(1-CA)/N), where CA is the classification accuracy and N is the number of test examples.
Instead of ExperimentResults, this function can be given a list of confusion matrices (see below). In this case, standard errors are estimated using the latter method.
So, let's compute all this and print it out.
part of statExamples.py
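The omitted snippet might, under the setup sketched above, look roughly like this; orngStat.CA and its reportSE argument are the ones described in this section.

    import orngStat

    # classification accuracies, one per tested classifier
    for learner, ca in zip(learners, orngStat.CA(res)):
        print "%s: %5.3f" % (learner.name, ca)

    # the same, with standard errors
    for learner, (ca, se) in zip(learners, orngStat.CA(res, reportSE=True)):
        print "%s: %5.3f +- %5.3f" % (learner.name, ca, se)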
The output should look like this.
Script statExamples.py contains another example that also prints out the standard errors.
This function can compute two different forms of confusion matrix: one in which a certain class is marked as positive and the other(s) negative, and another in which no class is singled out. The way to specify what we want is somewhat confusing due to backward compatibility issues.
A positive-negative confusion matrix is computed (a) if the class is binary, unless the classIndex argument is -2, or (b) if the class is multivalued and classIndex is non-negative. The argument classIndex then tells which class is positive. In case (a), classIndex may be omitted; the first class is then negative and the second positive, unless the baseClass attribute of the object with results has a non-negative value, in which case baseClass is the index of the target class. The baseClass attribute of the results object should be set manually. The result of the function is a list of instances of class ConfusionMatrix, containing the (weighted) number of true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN).
We can also add the keyword argument cutoff (e.g. confusionMatrices(results, cutoff=0.3)); if we do, confusionMatrices will disregard the classifiers' class predictions and observe the predicted probabilities instead, considering the prediction "positive" if the predicted probability of the positive class is higher than the cutoff.
The example below shows how lowering the cutoff threshold from the default 0.5 to 0.2 affects the confusion matrices for the naive Bayesian classifier.
part of statExamples.py
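The snippet is not shown; a sketch of what it might do, assuming the res results from the setup above and that the naive Bayesian classifier is the first learner in the list:

    # confusion matrices for the naive Bayesian classifier (index 0)
    cm05 = orngStat.confusionMatrices(res)[0]              # default cutoff 0.5
    cm02 = orngStat.confusionMatrices(res, cutoff=0.2)[0]  # lowered cutoff

    for label, cm in [("cutoff 0.5", cm05), ("cutoff 0.2", cm02)]:
        print "%s: TP=%i FN=%i FP=%i TN=%i" % (label, cm.TP, cm.FN, cm.FP, cm.TN)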
The output:
To observe how good the classifiers are at detecting vans in the vehicle data set, we would compute the matrix like this:
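The call itself is missing from the text; assuming the resVeh results from above, it might look like this (looking up the index of "van" among the class values is an assumption about the data set):

    # single out "van" as the positive class
    vanIndex = list(vehicle.domain.classVar.values).index("van")
    for learner, cm in zip(learners, orngStat.confusionMatrices(resVeh, classIndex=vanIndex)):
        print "%s: TP=%i FN=%i FP=%i TN=%i" % (learner.name, cm.TP, cm.FN, cm.FP, cm.TN)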
A general confusion matrix is computed (a) in the case of a binary class, when classIndex is set to -2, or (b) when we have a multivalued class and the caller doesn't specify the classIndex of the positive class. When called in this manner, the function cannot use the argument cutoff.
The function then returns a three-dimensional matrix, where the element A[learner][actualClass][predictedClass] gives the number of examples belonging to 'actualClass' for which the 'learner' predicted 'predictedClass'. We shall compute and print out the matrix for the naive Bayesian classifier.
part of statExamples.py
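The snippet is omitted, but the description below is detailed enough to reconstruct its gist; here is a sketch along those lines, again assuming resVeh and that the naive Bayesian classifier comes first:

    # general (multi-class) confusion matrix of the naive Bayesian classifier on vehicle
    cm = orngStat.confusionMatrices(resVeh)[0]
    classes = list(vehicle.domain.classVar.values)

    print "\t" + "\t".join(classes)
    for className, classConfusions in zip(classes, cm):
        print ("%s" + ("\t%i" * len(classes))) % ((className, ) + tuple(classConfusions))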
Sorry for the language, but it's time you learn to talk dirty in Python, too. "\t".join(classes) will join the strings from the list classes by putting tabulators between them. zip merges two lists, element by element, hence it will create a list of tuples containing a class name from classes and a list telling how many examples from this class were classified into each possible class. Finally, the format string consists of a %s for the class name and one tabulator and %i for each class. The data we provide for this format string is (className, ) (a tuple containing the class name), plus the misclassification list converted to a tuple.
So, here's what this nice piece of code gives:
Vans are clearly simple: 189 vans were classified as vans (we know this already, we've printed it out above), and the 10 misclassified pictures were classified as buses (6) and Saab cars (4). In all other classes, there were more examples misclassified as vans than correctly classified examples. The classifier is obviously quite biased towards vans.
With the confusion matrix defined in terms of positive and negative classes, you can also compute the sensitivity [TP/(TP+FN)], specificity [TN/(TN+FP)], positive predictive value [TP/(TP+FP)] and negative predictive value [TN/(TN+FN)]. In information retrieval, positive predictive value is called precision (the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved), and sensitivity is called recall (the ratio of the number of relevant records retrieved to the total number of relevant records in the database). The harmonic mean of precision and recall is called the F-measure; depending on the relative weighting of precision and recall, it is implemented as F1 [2*precision*recall/(precision+recall)] or, for the general case, Falpha [(1+alpha)*precision*recall / (alpha*precision + recall)]. The Matthews correlation coefficient (http://en.wikipedia.org/wiki/Matthews_correlation_coefficient) is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction.
If the argument confm is a single confusion matrix, a single result (a number) is returned. If confm is a list of confusion matrices, a list of scores is returned, one for each confusion matrix.
Note that weights are taken into account when computing the matrix, so these functions don't check the 'weighted' keyword argument.
Let us print out sensitivities and specificities of our classifiers.
part of statExamples.py
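The snippet might, for the voting results, look something like this; orngStat.sens and orngStat.spec are assumed to be the sensitivity and specificity functions of this module.

    # positive-negative confusion matrices on the voting data
    cm = orngStat.confusionMatrices(res)
    print "Learner\tsensitivity\tspecificity"
    for learner, m in zip(learners, cm):
        print "%s\t%5.3f\t%5.3f" % (learner.name, orngStat.sens(m), orngStat.spec(m))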
Receiver Operating Characteristic (ROC) analysis was initially developed for binary classification problems, and there is no consensus on how to apply it to multi-class problems, nor do we know for sure how to do ROC analysis after cross-validation and similar multiple-sampling techniques. If you are interested in the area under the curve, the function AUC will deal with those problems as specifically described below.
The optional argument method tells what kind of averaging to use when the class is multivalued:

AUC.ByWeightedPairs (or 0): AUC is computed for each pair of classes (ignoring examples of all other classes) and the results are averaged, weighted by the probabilities of the pairs; this is the default.
AUC.ByPairs (or 1): as above, except that the average over class pairs is not weighted.
AUC.WeightedOneAgainstAll (or 2): for each class, AUC is computed for this class against all others (treated as a single class), and the results are averaged, weighted by the class probabilities.
AUC.OneAgainstAll (or 3): as above, except that the average is not weighted.

In case of multiple folds (for instance if the data comes from cross validation), the computation goes like this. When computing the partial AUCs for individual pairs of classes or singled-out classes, AUC is computed for each fold separately and then averaged (ignoring the number of examples in each fold; it's just a simple average). However, if a certain fold doesn't contain any examples of a certain class (from the pair), the partial AUC is computed treating the results as if they came from a single fold. This is not entirely correct, since the class probabilities from different folds are not necessarily comparable; but as this will most often occur in leave-one-out experiments, comparability shouldn't be a problem.
Computing and printing out the AUCs looks just like printing out the classification accuracies (except that we call AUC instead of CA, of course):
part of statExamples.py
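A sketch of that snippet, under the assumptions above:

    for learner, auc in zip(learners, orngStat.AUC(res)):
        print "%s: %5.3f" % (learner.name, auc)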
For vehicle, you can run exactly the same code; it will compute AUCs for all pairs of classes and return the average weighted by the probabilities of the pairs. Or, you can specify the averaging method yourself, like this:
part of statExamples.py
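The call is not shown; assuming the method constants listed above, it might look like this (the method is passed positionally, since only the constants are documented here):

    # unweighted average of one-against-all AUCs instead of the default
    for learner, auc in zip(learners, orngStat.AUC(resVeh, orngStat.AUC.OneAgainstAll)):
        print "%s: %5.3f" % (learner.name, auc)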
The class given by classIndex is singled out, and all other classes are treated as a single class. To find out how good our classifiers are at distinguishing between vans and other vehicles, call the function like this:
part of statExamples.py
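The omitted call might look like the sketch below; that the single-class variant is named orngStat.AUC_single is an assumption about this module's API, and resVeh is the cross-validation result from above.

    vanIndex = list(vehicle.domain.classVar.values).index("van")
    for learner, auc in zip(learners, orngStat.AUC_single(resVeh, classIndex=vanIndex)):
        print "%s: %5.3f" % (learner.name, auc)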
The remaining functions, which plot the curves and statistically compare them, require that the results come from a test with a single iteration, and they always compare one chosen class against all others. If you have cross-validation results, you can either use splitByIterations to split the results by folds, call the function for each fold separately and then sum the results up however you see fit, or you can set the ExperimentResults' attribute numberOfIterations to 1, to cheat the function (at your own responsibility for the statistical correctness). Regarding multi-class problems, if you don't choose a specific class, orngStat will use the class attribute's baseValue at the time when the results were computed. If baseValue was not given at that time, 1 (that is, the second class) is used as the default.
We shall use the following code to prepare suitable experimental results:
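The code is not included in this text; a plausible preparation, producing results from a single train/test split (and thus a single iteration), could be:

    # split voting 60:40 and test on the held-out part; res1 has a single iteration
    indices = orange.MakeRandomIndices2(voting, p0=0.6)
    train = voting.select(indices, 0)
    test = voting.select(indices, 1)
    res1 = orngTest.learnAndTestOnTestData(learners, train, test)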
Computes the area under the ROC curve (AUC) and its standard error using Wilcoxon's approach as proposed by Hanley and McNeil (1982). If classIndex is not specified, the first class is used as "the positive" and the others are negative. The result is a list of tuples (aROC, standard error).
To compute the AUCs with the corresponding confidence intervals for our experimental results, simply call
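Assuming the function is orngStat.AUCWilcoxon and res1 is the single-iteration result prepared above, the call might be:

    for learner, (auc, se) in zip(learners, orngStat.AUCWilcoxon(res1)):
        print "%s: AUC=%5.3f +- %5.3f" % (learner.name, auc, se)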
Compares the ROC curves of the learning algorithms with indices learner1 and learner2. The function returns three tuples; the first two contain the areas under the ROC curves and their standard errors for the two learners, and the third is the difference of the areas and its standard error: ((AUC1, SE1), (AUC2, SE2), (AUC1-AUC2, SE(AUC1)+SE(AUC2)-2*COVAR)).
This function is broken at the moment: it returns some numbers, but they're wrong.
Computes a ROC curve as a list of (x, y) tuples, where x is 1-specificity and y is sensitivity.
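A brief usage sketch, assuming the function is named computeROC, takes the results and an optional class index, and returns one curve per learner:

    # ROC curve of the first learner for the default positive class
    curve = orngStat.computeROC(res1)[0]
    for x, y in curve[:5]:
        print "1-specificity=%5.3f  sensitivity=%5.3f" % (x, y)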
These two functions are obsolete and shouldn't be called. Use AUC instead.
Several alternative measures, as given below, can be used to evaluate the success of numeric prediction:
The following code uses most of the above measures to score several regression methods.
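The code itself is not included here. A sketch of such a comparison, assuming the housing data set is available, that the chosen learners handle a continuous class, and that MSE, RMSE, MAE and R2 are among the measures mentioned above:

    import orange, orngTest, orngTree, orngStat

    # Boston housing: a regression problem (continuous class)
    housing = orange.ExampleTable("housing")
    rLearners = [orange.kNNLearner(k=5, name="knn"),
                 orngTree.TreeLearner(name="regr. tree")]
    resReg = orngTest.crossValidation(rLearners, housing, folds=10)

    print "Learner       MSE   RMSE    MAE     R2"
    for i, learner in enumerate(rLearners):
        print "%-12s %6.2f %6.2f %6.2f %6.3f" % (learner.name,
            orngStat.MSE(resReg)[i], orngStat.RMSE(resReg)[i],
            orngStat.MAE(resReg)[i], orngStat.R2(resReg)[i])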
The code above produces the following output:
Needs matplotlib to work.
The code above produces the following graph: