Tab-delimited and similar formats

Besides supporting several common file formats used in machine learning (C4.5...), Orange introduces a more capable format that supports many additional features. It comes in several variations. The most powerful is the old-style tab-delimited file, whose header gives the names of the attributes, their types and their roles (ordinary attribute, class, meta-attribute). The new-style tab-delimited file has a simpler header; when attribute types are omitted, Orange attempts to guess them. Comma-separated files are essentially the same as new-style tab-delimited files, except that fields are separated by commas instead of tabs. Finally, Orange for Windows can also read files in Excel format, provided that you have Excel installed. The organization of such a file is again similar to that of new-style tab-delimited files.

All file formats begin with a header; in the old style it has three lines and in the new style only one. The remaining lines contain examples. These are given as lists of symbolic values, separated by tabs in .tab and .txt files and by commas in .csv files, or occupying a row in an Excel file. Lines in .tab, .txt and .csv files that begin with "|" are comment lines and are ignored. There are no comments in Excel. Lines that are entirely empty (except for the delimiters) are skipped.
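The comment and empty-line rules above can be sketched in a few lines of plain Python. This is only an illustration and not Orange's own reader: the function name read_rows is hypothetical, and the naive split ignores any quoting that real comma-separated files may use.

```python
def read_rows(lines, delimiter="\t"):
    """Yield lists of fields, skipping '|' comments and rows with no content."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("|"):
            continue  # comment line
        fields = line.split(delimiter)
        if not any(field.strip() for field in fields):
            continue  # nothing but delimiters (or nothing at all)
        yield fields


# Hypothetical input: a data line, a comment, a delimiter-only line, a data line.
sample = ["a\tb\n", "| a comment\n", "\t\n", "1\t2\n"]
rows = list(read_rows(sample))
```

Only the two lines that carry values survive; the header lines would be handled separately by whichever header format the file uses.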

Domain description - older version (.tab)

The first line of the file in the older format contains names of the attributes. Names can contain any character but CR, LF, NUL or TAB. Spaces are allowed.

The second line contains types of attributes, one entry for each attribute. Attributes can be of the following four types.

Change: The (undocumented) symbols that could be used for declaring continuous and discrete attributes ('f' and 'float', and 'e' and 'enum') have been removed.

Note that at the moment, Orange learning methods can only use discrete and continuous attributes. String and Python attributes can be used as meta-attributes describing examples, or you can use them in specific learning methods and other algorithms.

The third line of the file contains optional flags.

i or ignore
Attributes with this flag are ignored, i.e. not read into the table.
c or class
denotes a class attribute. There can be at most one such attribute; if there are none, the last attribute is the class.
m or meta
denotes a meta-attribute. Such attributes are not used by any learning algorithm (or by algorithms for, say, measuring distances between examples) but are stored with examples. Meta attributes are most often used for weighting the examples.
-dc
followed by a value that serves as "don't care" symbol for this attribute. This option can be used more than once for each attribute if don't cares are specified with different symbols. See below for the details.

A basket column can be ignored, while other flags have no effect on it.

The first few lines of iris dataset in this format look like this:

sepal length    sepal width    petal length    petal width    iris
c               c              c               c              d
                                                              class
5.1             3.5            1.4             0.2            Iris-setosa
4.9             3.0            1.4             0.2            Iris-setosa
4.7             3.2            1.3             0.2            Iris-setosa
4.6             3.1            1.5             0.2            Iris-setosa
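To make the three header lines concrete, here is a minimal sketch of how they could be split into per-column descriptions. It is not Orange's parser: parse_flags and parse_old_style_header are hypothetical names, only the flags listed above are recognized, and the rule that the last attribute becomes the class when none is flagged is deliberately left out.

```python
def parse_flags(cell):
    """Split one third-line cell into a role and a list of don't-care symbols."""
    roles = {"i": "ignore", "ignore": "ignore",
             "c": "class", "class": "class",
             "m": "meta", "meta": "meta"}
    role, dont_cares = "attribute", []
    tokens = cell.split()
    position = 0
    while position < len(tokens):
        token = tokens[position]
        if token == "-dc" and position + 1 < len(tokens):
            dont_cares.append(tokens[position + 1])  # the symbol follows the flag
            position += 1
        elif token in roles:
            role = roles[token]
        position += 1
    return role, dont_cares


def parse_old_style_header(names_line, types_line, flags_line):
    """Return one (name, type, role, dont_cares) tuple per column."""
    names = names_line.split("\t")
    types = types_line.split("\t")
    flags = flags_line.split("\t")
    flags += [""] * (len(names) - len(flags))  # trailing empty cells may be missing
    return [(name, type_) + parse_flags(flag)
            for name, type_, flag in zip(names, types, flags)]


header = parse_old_style_header(
    "sepal length\tsepal width\tpetal length\tpetal width\tiris",
    "c\tc\tc\tc\td",
    "\t\t\t\tclass",
)
```

For the iris header, the first four columns come out as ordinary continuous attributes and the last one as the discrete class.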

Domain description - new version (.txt, .csv and .xls)

The newer version of tab-delimited formats is much simpler yet still powerful. The domain description is given in a single line which, in its most simple form, contains only the names of the attributes. In this case, Orange will recognize the attribute types itself, using this procedure:

  1. If an attribute descriptor with the same name is found among the known descriptors (passed by use or determined by reuse), it is used, which also fixes the attribute type.
  2. If the attribute is new, its values in the file are checked:
    • attributes whose values are digits from 0-9 (or some subset of these) are discrete; this covers domains with coded attribute values,
    • attributes whose values can be parsed as numbers (in .txt) or whose cells contain numbers (in Excel) are continuous,
    • attributes which have more than 20 different values, fewer than half of which appear in more than one example, are strings and are put among meta attributes,
    • other attributes are discrete.
    • Symbols representing unknown values ("?", "~", "NA"...) are ignored.

The last non-ignored non-meta attribute will be a class attribute. It is not possible to specify a classless domain in those two file formats.
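The value checks in step 2 can be approximated by the following sketch. It is a reading of the rules as stated above, not Orange's actual code: guess_type is a hypothetical name, the set of unknown symbols is an assumption based on the reserved symbols this page mentions, and Python's float() accepts a few spellings (such as "1e5" or "nan") that Orange may treat differently.

```python
from collections import Counter

# Assumed set of symbols for unknown values; they are skipped when guessing.
UNKNOWN_SYMBOLS = {"", "?", "~", "*", "NA"}


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_type(values):
    """Guess 'discrete', 'continuous' or 'string' from a column of raw values."""
    known = [value for value in values if value not in UNKNOWN_SYMBOLS]
    if all(len(value) == 1 and value in "0123456789" for value in known):
        return "discrete"  # single digits are taken as coded values
    if all(is_number(value) for value in known):
        return "continuous"
    counts = Counter(known)
    repeated = sum(1 for count in counts.values() if count > 1)
    if len(counts) > 20 and repeated < len(counts) / 2:
        return "string"  # many values, mostly unique: treated as a meta string
    return "discrete"
```

For example, guess_type(["5.1", "4.9", "?"]) gives "continuous", while a column of thirty distinct identifiers gives "string".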

This procedure is not foolproof. You can have continuous attributes whose values happen to be only digits from 0-9. Or you can have a discrete attribute with values 1.1, 1.2, 1.3 and 2.1. You may also want to designate some other attribute as the class attribute, ignore another attribute and have a few meta attributes. This can be achieved with prefixes.

Prefixed attributes contain a one- or two-letter prefix, followed by "#" and the name. The first letter of the prefix can be "m" for meta-attributes, "i" to ignore the attribute, or "c" to define the class attribute. As always, only one attribute can be a class attribute. The second letter denotes the attribute type: "D" for discrete, "C" for continuous, "S" for string attributes and "B" for baskets.

In most cases, however, the attribute detection mechanism will suffice. Therefore, Iris can be given like this:

sepal length    sepal width    petal length    petal width    iris
5.1             3.5            1.4             0.2            Iris-setosa
4.9             3.0            1.4             0.2            Iris-setosa
4.7             3.2            1.3             0.2            Iris-setosa
4.6             3.1            1.5             0.2            Iris-setosa

If you would like to ignore the first attribute, use the second as a class, explicitly require the third attribute to be discrete and have the fourth attribute be a continuous weight, you would "correct" the first line like this:

i#sepal length    c#sepal width    D#petal length    mC#petal width    iris
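A prefixed header field can be taken apart with a small regular expression. The sketch below is an interpretation of the prefix rules and of the header line above, in which a prefix may carry only the role letter or only the type letter; parse_descriptor is a hypothetical helper, not part of Orange.

```python
import re

# Optional role letter, optional type letter, then "#" and the attribute name.
PREFIX = re.compile(r"^([mic]?)([DCSB]?)#(.*)$")
ROLES = {"m": "meta", "i": "ignore", "c": "class"}
TYPES = {"D": "discrete", "C": "continuous", "S": "string", "B": "basket"}


def parse_descriptor(field):
    """Return (name, role, type); role and type are None when not given."""
    match = PREFIX.match(field)
    if match is None or not (match.group(1) or match.group(2)):
        return field, None, None  # no prefix: the whole field is the name
    role_letter, type_letter, name = match.groups()
    return name, ROLES.get(role_letter), TYPES.get(type_letter)


fields = ["i#sepal length", "c#sepal width", "D#petal length",
          "mC#petal width", "iris"]
parsed = [parse_descriptor(field) for field in fields]
```

Fields without a prefix, such as iris, fall back to automatic type detection and the default class rule.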

Baskets

Baskets can be used for storing sparse data in tab-delimited files. They were specifically designed for text mining needs. If text mining and sparse data are not your business, you can skip this section.

Baskets are given as a list of space-separated <name>=<value> atoms. A continuous meta attribute named <name> will be created and added to the domain as optional if it is not already there. A meta value for that variable will be added to the example. If the value is 1, you can omit the =<value> part.

It is not possible to put meta attributes of types other than continuous in the basket.

A tab delimited file with a basket can look like this:

K       Ca      b_foo      Ba    y
c       c       basket     c     c
        meta               i     class
0.06    8.75    a b a c    0     1
0.48            b=2 d      0     1
0.39    7.78               0     1
0.57    8.22    c=13       0     1

These are the examples read from such a file:

[0.06, 1], {"Ca":8.75, "a":2.000, "b":1.000, "c":1.000}
[0.48, 1], {"Ca":?, "b":2.000, "d":1.000}
[0.39, 1], {"Ca":7.78}
[0.57, 1], {"Ca":8.22, "c":13.000}

It is recommended to have the basket as the last column, especially if it contains a lot of data.

Note a few things. The basket column's name, b_foo, is not used. In the first example, the value of a is 2 since it appears twice. The ordinary meta attribute, Ca, appears in all examples, even in those where its value is undefined. Meta attributes from the basket appear only where they are defined. This is due to the different nature of these meta attributes: Ca is required while the others are optional.
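The way a basket cell turns into meta values can be mimicked in plain Python. This sketch only mirrors the behaviour described above; parse_basket is a hypothetical name, and summing the values of repeated atoms is an assumption that generalizes the documented case of a bare name appearing twice.

```python
def parse_basket(cell):
    """Turn 'a b=2 a' style text into a {name: value} dictionary of floats."""
    values = {}
    for atom in cell.split():
        name, separator, value = atom.partition("=")
        amount = float(value) if separator else 1.0  # a bare name counts as 1
        values[name] = values.get(name, 0.0) + amount
    return values


first = parse_basket("a b a c")
second = parse_basket("b=2 d")
```

Here first holds a value of 2.0 for a, matching the first example read from the file, and an empty cell yields an empty dictionary, which is why the third example carries no optional meta attributes.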

>>> d.domain.getmetas()
{-6: FloatVariable 'd', -22: FloatVariable 'Ca', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}
>>> d.domain.getmetas(False)
{-22: FloatVariable 'Ca'}
>>> d.domain.getmetas(True)
{-6: FloatVariable 'd', -5: FloatVariable 'c', -4: FloatVariable 'b', -3: FloatVariable 'a'}

To fully understand all this, you should read the documentation on meta attributes in Domain and on the basket file format (a simple format that is limited to baskets only).

Comma separated files

Comma separated files are just like new-style tab-delimited files, except that commas are used instead of tabs. For instance, for the documentation on censoring, we downloaded the new Wisconsin breast cancer data from UCI:

119513,N,31,18.02,27.6,117.5,1013,0.09489,0.1036, <...>
8423,N,61,17.99,10.38,122.8,1001,0.1184,0.2776, <...>
842517,N,116,21.37,17.44,137.5,1373,0.08836,0.1189, <...>
843483,N,123,11.42,20.38,77.58,386.1,0.1425,0.2839, <...>
<...>

To import this data to Orange, we only needed to add the first line, describing the attribute names.

m#ID,c#recur,time,radius,texture,perimeter,area,smoothness, <...>

Since we don't want the ID to be used for learning, we turned it into a meta-attribute. Besides, we needed to tell Orange that "recur" is the class attribute.

Undefined values

By default, empty fields, "?" and "NA" are interpreted as "don't care", and "~" and "*" as "don't know". You can't change this: these symbols are reserved.

You can, however, specify additional symbols to denote undefined values. This can be done either per attribute or for all attributes at once. Per-attribute unknowns are specified using the -dc option in old-style tab-delimited files. For instance, if unknowns for some attribute are given as "UNK", add -dc UNK in the third line. There is no similar option in .txt or .csv files.

General symbols for unknown values are not specified in the file but given as keyword arguments to ExampleTable. Three keywords are recognized: DC and DK give symbols for don't cares and don't knows, and NA for both; DC and DK take priority over NA. Only one symbol can be specified for each kind of undefined value.

Although we can also load data in other formats (such as C4.5) by calling ExampleTable, these keyword arguments only affect the formats described on this page.

To show how this works, we shall use the file undefineds.tab which looks like this.

a                b                  c
d                d                  d
-dc X -dc UNK    -dc UNAVAILABLE
0                0                  0
1                1                  1
                                    1
*                *                  *
?                ?                  ?
.                .                  .
GDC              GDC                GDC
GDK              GDK                GDK
X                X                  X
UNK              UNK                UNK
UNAVAILABLE      UNAVAILABLE        UNAVAILABLE

Let's load and print it.

part of undefineds.py (uses undefineds.tab)

import orange
data = orange.ExampleTable("undefineds", DK="GDK", DC="GDC")
for ex in data:
    print ex

Here's how the file is interpreted.

['0', '0', '0']
['1', '1', '1']
['?', '?', '1']
['~', '~', '~']
['?', '?', '?']
['?', '?', '?']
['?', '?', '?']
['~', '~', '~']
['?', 'X', 'X']
['?', 'UNK', 'UNK']
['UNAVAILABLE', '?', 'UNAVAILABLE']

As the call to ExampleTable specifies, symbols GDC and GDK stand for don't care and don't know for all attributes. In addition, X and UNK denote don't cares for the first attribute and UNAVAILABLE for the second. For other attributes, these symbols are just normal values.
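The interpretation above can be reproduced without Orange by a small stand-alone function. This is a sketch of the rules as described on this page, not the library's implementation: interpret is a hypothetical name, the NA keyword is left out, and "." is included among the reserved don't-care symbols only because the printed output above shows it being read that way.

```python
# Reserved symbols; "." is an assumption taken from the sample output.
RESERVED_DONT_CARE = {"", "?", "NA", "."}
RESERVED_DONT_KNOW = {"~", "*"}


def interpret(value, attribute_dont_cares=(), DC=None, DK=None):
    """Map a raw field to '?' (don't care), '~' (don't know) or itself."""
    if value in attribute_dont_cares or value == DC or value in RESERVED_DONT_CARE:
        return "?"
    if value == DK or value in RESERVED_DONT_KNOW:
        return "~"
    return value


# Per-attribute -dc symbols for columns a, b and c of undefineds.tab.
per_attribute = [{"X", "UNK"}, {"UNAVAILABLE"}, set()]
raw_rows = [["0"] * 3, ["1"] * 3, ["", "", "1"], ["*"] * 3, ["?"] * 3,
            ["."] * 3, ["GDC"] * 3, ["GDK"] * 3, ["X"] * 3, ["UNK"] * 3,
            ["UNAVAILABLE"] * 3]
table = [[interpret(value, per_attribute[column], DC="GDC", DK="GDK")
          for column, value in enumerate(row)]
         for row in raw_rows]
```

The resulting table matches the printed examples row by row, including the last three rows, where the per-attribute symbols only affect the columns that declare them.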

As you may have noticed, undefined values are printed as "?" and "~", regardless of how they were specified in the files they were read from. Orange cannot remember such details.

However, when saving the data back to files, you can specify the symbols to be used (for all attributes, not per-attribute). This is done in a similar fashion as when reading the data - by giving additional keyword arguments DC, DK and/or NA to the function saveTabDelimited, saveTxt or saveCSV.

For instance, if we save the file by

orange.saveTabDelimited("undefined-saved-dc-dk", data, DC="GDC", DK="GDK")

all don't cares ("?") are written as "GDC" and don't knows as "GDK".
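The effect of the DC and DK arguments on output can be pictured with a short stand-alone sketch. It is not what saveTabDelimited does internally: encode_undefined is a hypothetical helper that works on values already shown as "?" and "~", and it writes a single tab-delimited line.

```python
def encode_undefined(row, DC="?", DK="~"):
    """Join a row with tabs, replacing '?' with DC and '~' with DK."""
    return "\t".join(DC if value == "?" else DK if value == "~" else value
                     for value in row)


line = encode_undefined(["?", "UNK", "~"], DC="GDC", DK="GDK")
```

Values that are not undefined, such as UNK here, are written unchanged, since the symbols apply to all attributes rather than to individual ones.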

This mechanism should provide for easier exporting to other data mining programs that can handle tab- or comma-delimited files. For specific problems, such as having more names denoting different types of unknowns, possibly in combination with other attribute values, you can easily program your own input/output routines in Python.