orngVizRank: Orange VizRank module

The orngVizRank module implements the VizRank algorithm (Leban et al, 2004; Leban et al, 2005), which ranks data projections generated by two visualization methods: scatterplot and radviz. For a given class-labeled data set, VizRank creates the possible data projections and assigns each one a score of interestingness. The score reflects how well the classes are separated in the projection: if the classes are well separated, the projection gets a high score; otherwise the score is correspondingly lower. After evaluation, it is sensible to focus on the top-ranked projections, which provide the greatest insight into how to separate the classes.
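
The scoring idea can be illustrated with a small, self-contained sketch. It is not the module's implementation: it assumes a k-nearest-neighbour classifier (the settings described later refer to kNN), leave-one-out testing, plain Euclidean distance and the average probability of the correct class as the score. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def knn_projection_score(points, labels, k):
    """Leave-one-out kNN score of one 2-D projection, on a 0..100 scale."""
    total = 0.0
    for i, point in enumerate(points):
        # Distance from example i to every other example in this projection.
        others = sorted(
            (math.hypot(point[0] - q[0], point[1] - q[1]), labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        neighbours = [label for _, label in others[:k]]
        # Share of the k nearest neighbours that carry the true class.
        total += neighbours.count(labels[i]) / float(len(neighbours))
    return 100.0 * total / len(points)


def rank_scatterplots(columns, labels, k):
    """Score every attribute pair and return (score, pair) tuples, best first."""
    results = []
    for a, b in combinations(sorted(columns), 2):
        points = list(zip(columns[a], columns[b]))
        results.append((knn_projection_score(points, labels, k), (a, b)))
    results.sort(reverse=True)
    return results


# Toy data (made up): "x" separates the two classes, "n1" and "n2" are noise.
columns = {
    "x": [0.0, 1.0, 2.0, 10.0, 11.0, 12.0],
    "n1": [0.0, 0.3, 0.1, 0.2, 0.0, 0.3],
    "n2": [0.2, 0.0, 0.3, 0.0, 0.1, 0.2],
}
labels = ["a", "a", "a", "b", "b", "b"]
ranking = rank_scatterplots(columns, labels, k=2)
for score, pair in ranking:
    print("%6.2f  %s" % (score, pair))
```

In this toy data, the projections that include "x" separate the classes perfectly and rank first, while the pair of noise attributes ranks last.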

The rest of this document discusses two visualization methods, scatterplot and radviz. While scatterplot is well known, radviz is less familiar; readers interested in this method should see (Hoffman, 1997).


VizRank in Orange

The easiest way to use VizRank in Orange is through Orange widgets. Widgets such as Scatterplot, Radviz and Polyviz (found in the Visualize tab in Orange Canvas) contain a "VizRank" button that opens VizRank's dialog, where you can change the settings and find interesting data projections.

More advanced users may also want to use VizRank in scripts, through the orngVizRank module.

The rest of this document covers only the use of VizRank in scripts. For those using VizRank in Orange widgets, we have provided extensive tooltips that should clarify the meaning of the different settings.

Creating a VizRank instance

First, let's look at a very simple example of how VizRank can be used in scripts:

>>> import orange
>>> data = orange.ExampleTable("wine.tab")
>>> from orngVizRank import *
>>> # use SCATTERPLOT to evaluate scatterplot projections or RADVIZ to evaluate radviz projections
>>> vizrank = VizRank(SCATTERPLOT)
>>> vizrank.setData(data)            # set the data set
>>> vizrank.evaluateProjections()    # evaluate possible projections
>>> print vizrank.results[0]
(86.88861657813024, (86.88861657813024, [87.603105074268271, 82.08174408531525, 93.120556697249413], [59, 71, 48]), 178, ['A7', 'A10'], 5, [])

In this example we created a VizRank instance, evaluated scatterplot projections of the UCI wine data set and printed information about the best-ranked projection. The best projection scored 86.89 (on a scale from 0 to 100) and shows attributes 'A7' and 'A10'.

Below is a list of settings and functions that can be used to modify VizRank's behaviour.

kValue
    the number of nearest examples (the k in kNN) used in predicting the class value. By default it is set to N/c, where N is the number of examples in the data set and c is the number of class values.
percentDataUsed
    when handling large data sets, the kNN method might take a lot of time to evaluate each projection. We can still get a good estimate of projection interestingness if we consider only a subset of examples. You can specify a value between 0 and 100. Default: 100
qualityMeasure
    there are different measures of prediction success that one can use to evaluate a classifier. You can use classification accuracy (CLASS_ACCURACY), average probability of correct classification (AVERAGE_CORRECT) or Brier score (BRIER_SCORE). Default: AVERAGE_CORRECT
testingMethod
    how the accuracy of the classifier is computed. You can use leave-one-out (LEAVE_ONE_OUT), 10-fold cross-validation (TEN_FOLD_CROSS_VALIDATION) or testing on the learning set (TEST_ON_LEARNING_SET). Default: TEN_FOLD_CROSS_VALIDATION
attrCont
    the method used to evaluate continuous attributes. Attributes are ranked and projections with top-ranked attributes are evaluated first. Possible options are ReliefF (CONT_MEAS_RELIEFF), Signal to Noise (CONT_MEAS_S2N), a modification of the Signal to Noise measure (CONT_MEAS_S2NMIX) or no measure (CONT_MEAS_NONE). Default: CONT_MEAS_RELIEFF
attrDisc
    the method used to evaluate discrete attributes. Attributes are ranked and projections with top-ranked attributes are evaluated first. Possible options are ReliefF (DISC_MEAS_RELIEFF), Gain ratio (DISC_MEAS_GAIN), Gini index (DISC_MEAS_GINI) or no measure (DISC_MEAS_NONE). Default: DISC_MEAS_RELIEFF
useExampleWeighting
    if the class distribution is very uneven, example weighting can be used. Default: 0
evaluationTime
    time in minutes that we want to spend evaluating projections. Since there might be a large number of possible projections, this lets us stop the evaluation before all projections have been evaluated. Because of the search heuristic (attrCont and attrDisc), the top-ranked projections will most likely be evaluated at the beginning of the evaluation. Default: 2
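
To make the three qualityMeasure options concrete, the following sketch computes them from a list of predicted class probability distributions. It uses the usual textbook definitions and is an assumption about what the options mean, not code taken from the module; in particular, how VizRank rescales the Brier score (where lower is better) to its 0 to 100 score is not covered here. All names and numbers are hypothetical.

```python
def class_accuracy(distributions, true_classes):
    """Fraction of examples whose most probable class is the true class."""
    hits = 0
    for dist, true in zip(distributions, true_classes):
        predicted = max(dist, key=dist.get)
        if predicted == true:
            hits += 1
    return hits / float(len(true_classes))


def average_correct(distributions, true_classes):
    """Mean probability assigned to the true class."""
    total = sum(dist[true] for dist, true in zip(distributions, true_classes))
    return total / float(len(true_classes))


def brier_score(distributions, true_classes):
    """Mean squared difference between predicted and ideal (0/1) probabilities."""
    total = 0.0
    for dist, true in zip(distributions, true_classes):
        for cls, p in dist.items():
            target = 1.0 if cls == true else 0.0
            total += (p - target) ** 2
    return total / float(len(true_classes))


# Made-up predictions for three examples with classes "a" and "b".
distributions = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.6, "b": 0.4},
    {"a": 0.3, "b": 0.7},
]
true_classes = ["a", "b", "b"]

accuracy = class_accuracy(distributions, true_classes)
avg_correct = average_correct(distributions, true_classes)
brier = brier_score(distributions, true_classes)
```

The second example is misclassified, so the accuracy is 2/3, while the other two measures also reflect how confident each prediction was.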

Radviz-specific settings:
optimizationType
    for description see attributeCount below. Possible values are EXACT_NUMBER_OF_ATTRS and MAXIMUM_NUMBER_OF_ATTRS. Default: MAXIMUM_NUMBER_OF_ATTRS
attributeCount
    maximum number of attributes in a projection that we will consider. If optimizationType == MAXIMUM_NUMBER_OF_ATTRS then we will consider projections that have between 3 and attributeCount attributes. If optimizationType == EXACT_NUMBER_OF_ATTRS then we will consider only projections that have exactly attributeCount attributes. Default: 4
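
The effect of the two radviz settings on the number of candidate projections can be sketched as follows. The constants below are hypothetical stand-ins for the module constants of the same name, and the helper functions are illustrative only. The sketch counts attribute subsets and ignores the order of attributes around the radviz circle, which also affects the projection.

```python
from itertools import combinations

# Hypothetical stand-ins for the constants defined in orngVizRank.
EXACT_NUMBER_OF_ATTRS = "exact"
MAXIMUM_NUMBER_OF_ATTRS = "maximum"


def projection_sizes(optimization_type, attribute_count):
    """Numbers of attributes per projection that would be considered."""
    if optimization_type == EXACT_NUMBER_OF_ATTRS:
        return [attribute_count]
    # MAXIMUM_NUMBER_OF_ATTRS: every size from 3 up to attribute_count.
    return list(range(3, attribute_count + 1))


def candidate_attribute_sets(attributes, optimization_type, attribute_count):
    """All attribute subsets of the permitted sizes (ordering ignored)."""
    subsets = []
    for size in projection_sizes(optimization_type, attribute_count):
        subsets.extend(combinations(attributes, size))
    return subsets


attributes = ["A1", "A2", "A3", "A4", "A5"]
with_maximum = candidate_attribute_sets(attributes, MAXIMUM_NUMBER_OF_ATTRS, 4)
with_exact = candidate_attribute_sets(attributes, EXACT_NUMBER_OF_ATTRS, 4)
```

With five attributes and attributeCount set to 4, the first call yields 15 subsets (10 with three attributes and 5 with four), and the second yields only the 5 subsets with four attributes.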

Methods:
setData(data)
    set the example table to evaluate
evaluateProjections()
    start projection evaluation. If not all projections have been evaluated by then, the evaluation stops automatically after evaluationTime minutes.
save(filename)
    save the list of evaluated projections
load(filename)
    load a file with evaluated projections
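
The interplay between the attribute-ranking heuristic and the time limit of evaluateProjections() can be pictured with the sketch below. It is a simplified illustration under stated assumptions, not the module's code: candidate pairs are ordered by the sum of their attribute scores (how VizRank actually combines attribute ranks is not described here), and the evaluate callback, the clock parameter and all numbers are hypothetical.

```python
import time
from itertools import combinations


def evaluate_with_budget(attribute_scores, evaluate, budget, clock=time.time):
    """Evaluate attribute pairs, most promising first, until the budget runs out.

    attribute_scores: dict mapping attribute name to a heuristic score.
    evaluate: callable (a, b) -> projection score.
    budget: time allowed, in the units returned by clock.
    Returns (score, pair) tuples for the evaluated pairs, best first.
    """
    ranked = sorted(attribute_scores, key=attribute_scores.get, reverse=True)
    candidates = sorted(
        combinations(ranked, 2),
        key=lambda pair: attribute_scores[pair[0]] + attribute_scores[pair[1]],
        reverse=True,
    )
    deadline = clock() + budget
    results = []
    for pair in candidates:
        if clock() >= deadline:
            break  # time is up; remaining candidates stay unevaluated
        results.append((evaluate(*pair), pair))
    results.sort(reverse=True)
    return results


# Made-up heuristic scores and projection scores for three attributes.
attribute_scores = {"A": 0.9, "B": 0.5, "C": 0.1}
projection_scores = {("A", "B"): 80.0, ("A", "C"): 90.0, ("B", "C"): 95.0}

# A fake clock that advances by one unit per call keeps the demo deterministic.
ticks = iter(range(100))
results = evaluate_with_budget(
    attribute_scores,
    lambda a, b: projection_scores[(a, b)],
    budget=3,
    clock=lambda: next(ticks),
)
```

Here only the two most promising pairs are evaluated before the budget expires. The pair ("B", "C") is never scored, which shows the trade-off: a heuristic ordering usually finds good projections early, but a short time limit can miss some.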

VizRank as a learner

VizRank can also be used as a learning method. You can construct a learner by creating an instance of the VizRankLearner class.

learner = VizRankLearner(SCATTERPLOT)

VizRankLearner accepts up to three parameters. The first is the visualization method to use (SCATTERPLOT or RADVIZ). The second is an instance of the VizRank class; if it is not given, a new instance is created. The third is a graph instance, either an orngScaleScatterPlotData or an orngScaleRadvizData instance; if it is not specified, a new instance is created.

To change VizRank's settings, simply access them through the learner.VizRank instance (e.g. learner.VizRank.kValue = 10).

The learner instance can be used like any other learner. Given the examples, it returns a classifier of type VizRankClassifier, which can be used like any other classifier:

classifier = learner(data)

When classifying, the VizRank classifier uses the evaluated projections to predict the class of a new example. The evaluated projections serve as arguments for the class values. Arguments have different values (weights), and the example is assigned to the class with the highest sum of argument values.

VizRank's settings that are relevant when using VizRank as a classifier:

argumentCount
    number of arguments (projections) used when predicting the class value
argumentValueFormula
    how argument values are computed. Two things matter for an argument value: the score of the projection and the class probability distribution computed using kNN on the given projection. It makes sense for an argument to be stronger if the projection has a high score or if the probability of one class value is high. There are three ways to compute argument values:
        0 - argument value is the same as projection score
        1 - argument value is 0.5 * projection score + 0.5 * predicted class probability
        2 - argument value is the same as predicted class probability
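
The three formulas and the summing of argument values can be sketched as follows. This is an illustration under assumptions rather than the classifier's actual code: it assumes that each projection argues for the class that kNN predicts in it, and that projection scores and class probabilities are both expressed on a 0 to 1 scale. The function names and the numbers are hypothetical.

```python
def argument_value(formula, projection_score, class_probability):
    """Value of one argument under argumentValueFormula 0, 1 or 2."""
    if formula == 0:
        return projection_score
    if formula == 1:
        return 0.5 * projection_score + 0.5 * class_probability
    if formula == 2:
        return class_probability
    raise ValueError("formula must be 0, 1 or 2")


def classify(arguments, formula, argument_count):
    """Sum argument values per class and return (winning class, sums).

    arguments: (projection_score, predicted_class, class_probability)
    tuples, ordered from the best-ranked projection downwards.
    """
    sums = {}
    for score, predicted, probability in arguments[:argument_count]:
        value = argument_value(formula, score, probability)
        sums[predicted] = sums.get(predicted, 0.0) + value
    return max(sums, key=sums.get), sums


# Made-up arguments from the three best projections for one new example.
arguments = [
    (0.9, "Iris-setosa", 0.6),
    (0.8, "Iris-versicolor", 1.0),
    (0.7, "Iris-versicolor", 0.9),
]

one_argument = classify(arguments, formula=0, argument_count=1)
three_arguments = classify(arguments, formula=0, argument_count=3)
mixed = classify(arguments, formula=1, argument_count=2)
```

With a single argument the best projection alone decides (Iris-setosa); with three arguments the two projections that favour Iris-versicolor outweigh it.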

A simple example:

>>> import orange
>>> from orngVizRank import *
>>> data = orange.ExampleTable("iris.tab")
>>> learner = VizRankLearner(SCATTERPLOT)
>>> learner.VizRank.argumentCount = 3
>>> classifier = learner(data)
>>> for i in range(5):
...     print classifier(data[i]), data[i].getclass()
...
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa
(<orange.Value 'iris'='Iris-setosa'>, <1.000, 0.000, 0.000>) Iris-setosa

References

Leban, G., Bratko, I., Petrovic, U., Curk, T., Zupan, B. VizRank: finding informative data projections in functional genomics by machine learning. Bioinformatics 21, 413-414 (2005).

Leban, G., Mramor, M., Bratko, I., Zupan, B. Simple and effective visual models for gene expression cancer diagnostics. KDD-2005, 167-177 (Chicago, 2005).