Converting data to Numeric/numarray/numpy

Besides being a great language in itself, Python has a huge library of scientific functions called SciPy. SciPy is centered around a module called numpy, which evolved from numarray and Numeric (sometimes mistakenly referred to as NumPy), where numarray is a compatibility-breaking variation of Numeric (the one called NumPy) to which all numerical Python code was supposed to migrate. When it did not, module numpy again joined numarray and Numeric (aka NumPy). Or so they say. Orange proudly supports integration with this mess in the hope of making a lasting contribution to it. ;)

Data from an ExampleTable can be converted to and from Numeric, numarray and numpy arrays. Numpy also has a new type, the record array. Because its C API is poorly documented and, even more, because it is doubtful that this type would be very useful, Orange does not support it at the moment.

Note: Binary distributions do not necessarily include the listed modules; conversion will yield an error if they are not installed on your system. To build Orange from sources you will need to have numpy installed. Header files for Numeric and numarray are not needed since the necessary objects are binary compatible (or so it seems).

From ExampleTable to an Array

ExampleTable's methods for conversion into various array types are called toNumeric, toNumericMA, toNumarray, toNumarrayMA, toNumpy and toNumpyMA. The functions with the "MA" suffix create masked arrays, where the mask denotes the defined values. Functions without the suffix will raise an error if the data includes any undefined values.

All functions accept the same set of three optional arguments: a contents string, a weight ID and a flag that tells how to treat multinomial attributes.

The contents string can include 'A' or 'a' for attribute values (excluding the class), 'C' or 'c' for the class value, 'W' or 'w' for the weight and '0' and '1' for constants 0.0 and 1.0. The same symbol can occur more than once. For instance, to convert an example table 'data' to a Numeric array that will have the class value in the first column, followed by the attributes and finally by two columns of zeros (to which we can later set some other data), we would call data.toNumeric("CA00").

If the data set doesn't have a class attribute, but the contents string includes 'C', an exception is raised. If it includes only the lower case 'c', the corresponding column is omitted without an error or even a warning. The same goes for 'W' and 'w': the former raises an exception if the weightID (the second argument to a call) is omitted or zero, while the latter simply omits the column. Finally, if 'A' is given and there are no attributes (except, possibly, the class attribute and ignored discrete attributes), an exception is raised.

In addition to returning the matrix, the functions can return vectors of classes or weights. This is requested by appending a slash to the contents string, followed by c, C, w and/or W. As before, capital letters will raise an exception if the class or weight is absent, while for lower case letters None is returned instead of the corresponding vector.

The result of the function is a tuple containing the array and the requested vectors. If a certain element is requested but unavailable (e.g., we want the class, but the data is classless), None is used as a placeholder. If the slash is the first character of the contents string, there will be no array. If there is no slash, or it is the last character, we will get a one-element tuple containing only the array.

The default contents string is "a/cw": a matrix with attribute values and separate vectors with classes and weights. Specifying an empty string has the same effect. If you, for some reason, wanted a matrix with the attribute values, two columns of class values and three columns of 0's and, besides that, a separate vector of classes and three vectors of weights, you would request this with "acc000/cwww". The three weight vectors will, however, be one and the same Python object, so modifying one will change all three of them. You can also repeat a's: the combination "ACC1Aw/ccw" will put two copies of attribute values in the matrix, separated by two columns of classes and a column of 1's, and followed by the weights if they exist. In addition, it will return two copies of the vector of classes and a vector of weights (or None if there are none).
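The rules for the contents string can be modelled in a few lines of plain Python. The sketch below only illustrates the behaviour described above and is not Orange's implementation: build_contents and its arguments are hypothetical names, and nested lists stand in for arrays.

```python
def build_contents(attrs, classes=None, weights=None, contents="a/cw"):
    """Hypothetical model of the contents string; returns a tuple."""
    contents = contents or "a/cw"      # an empty string means the default
    matrix_spec, _, vector_spec = contents.partition("/")

    def lookup(symbol):
        # Map c/C/w/W to its vector; capital letters insist that it exists.
        if symbol not in "cCwW":
            raise ValueError("unknown symbol %r" % symbol)
        vector = classes if symbol in "cC" else weights
        if vector is None and symbol.isupper():
            raise ValueError("%r requested but not available" % symbol)
        return vector

    result = []
    if matrix_spec:                    # a leading slash means "no array"
        rows = [[] for _ in attrs]
        for symbol in matrix_spec:
            if symbol in "aA":
                if symbol == "A" and (not attrs or not attrs[0]):
                    raise ValueError("'A' requested but no attributes")
                for row, values in zip(rows, attrs):
                    row.extend(values)
            elif symbol in "01":
                for row in rows:
                    row.append(float(symbol))
            else:
                vector = lookup(symbol)
                if vector is not None:  # lower case: column silently omitted
                    for row, value in zip(rows, vector):
                        row.append(value)
        result.append(rows)
    # Separate vectors; None is the placeholder for an absent c or w.
    result.extend(lookup(symbol) for symbol in vector_spec)
    return tuple(result)


# Made-up data: two examples with two attributes each, no weights.
attrs = [[5.1, 3.5], [7.4, 2.8]]
classes = [0.0, 2.0]

matrix, = build_contents(attrs, classes, contents="ca1cc0")
a, c, w = build_contents(attrs, classes)            # default "a/cw"
```

Here matrix has seven columns per row (class, two attributes, a 1, two class columns and a 0), and w is None because no weights were given. Repeated vectors, as in "/cc", come back as the same list object in this model, mirroring the remark about the shared weight vectors.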

The third argument to the function specifies the treatment of non-continuous non-binary values (for binary values we have no problem: they are translated to continuous 0.0 or 1.0). The argument's value can be ExampleTable.Multinomial_Ignore (such attributes are omitted), ExampleTable.Multinomial_AsOrdinal (the attribute's values' indices are treated as continuous numbers) or ExampleTable.Multinomial_Error (such attributes are forbidden, so an exception is raised if they are encountered). Default treatment is ExampleTable.Multinomial_AsOrdinal.

When the class attribute is discrete and has more than two values, an exception is raised unless multinomial attributes are treated as ordinal.
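The three treatments can be illustrated with a small stand-alone function. This is a sketch under stated assumptions, not Orange code: the constants below are hypothetical stand-ins for ExampleTable.Multinomial_Ignore, Multinomial_AsOrdinal and Multinomial_Error (their real numeric values are not given in this text), and encode_discrete is an invented helper.

```python
# Hypothetical stand-ins for the ExampleTable.Multinomial_* flags.
MULTINOMIAL_IGNORE, MULTINOMIAL_AS_ORDINAL, MULTINOMIAL_ERROR = range(3)


def encode_discrete(column, values, treatment=MULTINOMIAL_AS_ORDINAL):
    """Turn one discrete column into floats, or return None if it is omitted.

    column -- the symbolic values of one attribute, one per example
    values -- the attribute's list of possible values, in order
    """
    if len(values) > 2:                       # multinomial attribute
        if treatment == MULTINOMIAL_IGNORE:
            return None                       # attribute is left out
        if treatment == MULTINOMIAL_ERROR:
            raise ValueError("multinomial attribute encountered")
    # Binary attributes, and multinomial ones treated as ordinal,
    # are represented by the index of the value as a float.
    return [float(values.index(v)) for v in column]


binary = encode_discrete(["no", "yes", "no"], ["no", "yes"], MULTINOMIAL_ERROR)
ordinal = encode_discrete(["versicolor", "virginica"],
                          ["setosa", "versicolor", "virginica"])
```

In this model a binary attribute is always accepted, whichever flag is given, while a three-valued one is converted, dropped or rejected depending on the flag.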

The treatment of multinomial attributes offered by these functions is very limited. There are way more versatile

a part of matrix.py

>>> data = orange.ExampleTable("../datasets/iris")
>>> a, c, w = data.toNumpy()
>>> a.shape
(150, 4)
>>> c.shape
(150,)
>>> w
>>> a[0]
array([ 5.0999999 , 3.5 , 1.39999998, 0.2 ])
>>> c[0]
0.0
>>> c[120]
2.0

When the array is to be used in linear regression, one would typically want the array to include a column of 1's, say as the first column.

a part of matrix.py

>>> a, c, w = data.toNumpy("1A/cw")
>>> print a.shape
(150, 5)
>>> print a[0]
[ 1. 5.0999999 3.5 1.39999998 0.2 ]
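The reason for the column of 1's is that the intercept of a linear model then becomes an ordinary coefficient, so a prediction is a single dot product. A minimal sketch in plain Python follows; the coefficients are made up for illustration and predict is a hypothetical helper, not part of Orange.

```python
def predict(row, coefficients):
    """Linear prediction as a plain dot product."""
    return sum(x * b for x, b in zip(row, coefficients))


# First row of the array requested with "1A/cw" (values rounded).
row = [1.0, 5.1, 3.5, 1.4, 0.2]
# Made-up coefficients; coefficients[0] plays the role of the intercept.
coefficients = [0.5, 0.1, -0.2, 0.3, 0.0]

with_ones = predict(row, coefficients)
# The same prediction with the intercept handled separately.
explicit = coefficients[0] + predict(row[1:], coefficients[1:])
```

Both expressions give the same number (up to floating point rounding), which is why fitting routines can treat the leading 1 as just another attribute.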

For a more perverse example, let's pack the array with a few additional columns: a column with class values will be followed by the attributes, then a column of 1's, two more class columns and a column of zeros. This is just an exercise - probably nobody will ever need anything like this.

a part of matrix.py

>>> a, = data.toNumpy("ca1cc0")
>>> a[0]
array([ 0. , 5.0999999 , 3.5 , 1.39999998, 0.2 , 1. , 0. , 0. , 0. ])
>>> a[130]
array([ 2. , 7.4000001 , 2.79999995, 6.0999999 , 1.89999998, 1. , 2. , 2. , 0. ])

If you prefer one of the other two numerical modules for Python, Numeric or numarray, you just need to call a different function, and they will wrap the array into a different class. Everything else stays the same.

Finally, when there is missing data, you should use toNumpyMA (or its equivalents for other modules).

From a matrix to an ExampleTable

Arrays can be converted into ExampleTables. This conversion can be explicit or implicit - generally any method that requires an example table will also accept an array and convert it on the fly. This method may not be desirable, though, since the attributes will get generic names and types, and won't be related to any other attributes. Most methods will fail if you attempt this without knowing what you are doing.

There are two general scenarios for interfacing numeric libraries and Orange. In the first, the data originates from an ExampleTable, is converted into an array, something is done to or with it, and then we want to convert it back to an ExampleTable. In the second, the data comes from somewhere else, we have it in an array and want to put it into an ExampleTable.

Our examples will suppose that a is an array with attribute and class values from the Iris data set:

>>> data = orange.ExampleTable("../datasets/iris")
>>> a = data.toNumarray("ac")[0]

The cleaner way to create an ExampleTable is to construct or reuse a Domain, and call the ExampleTable's constructor, giving it a domain and the matrix. If the attribute is discrete, the value from the matrix is rounded to the closest integer which is then used as the attribute value's index.
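The rounding rule for discrete attributes can be pictured as follows. This is only a sketch of the rule as stated above: value_from_matrix is a hypothetical helper, and how exact ties (such as 0.5) and out-of-range numbers are handled is an assumption, since the text does not say.

```python
def value_from_matrix(x, values):
    """Map a number from the matrix to one of a discrete attribute's values."""
    index = int(round(x))              # closest integer; ties are unspecified
    if not 0 <= index < len(values):   # assumption: out-of-range is an error
        raise ValueError("no value with index %d" % index)
    return values[index]


class_values = ["setosa", "versicolor", "virginica"]
first = value_from_matrix(0.2, class_values)    # rounds to index 0
last = value_from_matrix(1.9, class_values)     # rounds to index 2
```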

Constructing a domain is trivial.

a part of matrix.py

columns = "sep length", "sep width", "pet length", "pet width"
classValues = "setosa", "versicolor", "virginica"
d4 = orange.Domain(map(orange.FloatVariable, columns),
                   orange.EnumVariable("type", values=classValues))
t4 = orange.ExampleTable(d4, a)

This approach is suitable when the data doesn't come from an existing ExampleTable. When it does, we should reuse the domain, like this.

a part of matrix.py

t3 = orange.ExampleTable(data.domain, a)

There is another, quick and dirty conversion from an array to an ExampleTable: just call the ExampleTable's constructor with the array as the only argument.

a part of matrix.py

>>> t2 = orange.ExampleTable(a)
>>> print t2.domain.attributes, t2.domain.classVar
<FloatVariable 'a1', FloatVariable 'a2', FloatVariable 'a3', FloatVariable 'a4', FloatVariable 'a5'> None
>>> print t2[0]
[5.100, 3.500, 1.400, 0.200, 0.000]

Lacking any information on the attributes' names and types, all attributes are continuous (FloatVariable) and have generic names (a1, a2...). There is no class attribute. Note that if you construct two such tables (even if you do it from the same matrix), the attributes will have the same names but will be essentially different attributes. Avoid doing this; it's almost as bad as implicit conversions.
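What "same names but essentially different attributes" means can be shown with a toy model. It assumes, as the paragraph above implies, that attribute descriptors are distinguished by identity rather than by name; Descriptor and generic_domain are hypothetical names used only for this illustration.

```python
class Descriptor(object):
    """Hypothetical stand-in for an attribute descriptor."""

    def __init__(self, name):
        self.name = name


def generic_domain(n_columns):
    # Every call creates brand new descriptors named a1, a2, ...
    return [Descriptor("a%d" % (i + 1)) for i in range(n_columns)]


first = generic_domain(5)
second = generic_domain(5)

same_names = [d.name for d in first] == [d.name for d in second]
shared_objects = any(x is y for x, y in zip(first, second))
```

In this model same_names is true while shared_objects is false: the two tables merely look alike, so anything that matches attributes by descriptor would treat them as unrelated. Reusing one domain for both tables avoids the problem.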