Besides supporting several common file formats used in machine learning (C4.5...), Orange introduces a more capable format of its own that supports many additional features. It comes in several variations. The most powerful is the old-style tab-delimited file, whose header gives the names of the attributes, their types and their roles (ordinary attribute, class, meta-attribute). The new-style tab-delimited file has a simpler header; when attribute types are omitted, Orange tries to guess them. Comma-separated files are essentially the same as new-style tab-delimited files, except that they use commas instead of tabs. Finally, Orange for Windows can also read files in Excel format, provided that you have Excel installed. These files are organized similarly to new-style tab-delimited files.
All file formats begin with a header; in the old style it has three lines and in the new style only one. The remaining lines contain examples. These are given as lists of symbolic values, separated by tabs in .tab and .txt files, by commas in .csv files, or occupying a row in an Excel file. Lines in .tab, .txt and .csv files that begin with "|" are comment lines and are ignored. There are no comments in Excel. Lines that are entirely empty (except for the delimiters) are skipped.
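The comment and empty-line rules can be illustrated in plain Python. This is only a sketch of the rules as stated above, not Orange's own reader; the function name data_lines is hypothetical, and the sketch treats header lines like any other line.

```python
def data_lines(text, delimiter="\t"):
    """Yield each line as a list of fields, skipping comments and empty lines."""
    for line in text.splitlines():
        if line.startswith("|"):
            continue  # comment line
        fields = line.split(delimiter)
        if all(field == "" for field in fields):
            continue  # nothing but delimiters on this line
        yield fields


# A tiny tab-delimited sample: one comment line and one line holding only a tab.
sample = "a\tb\n| a comment\n\t\n1\t2\n"
rows = list(data_lines(sample))
print(rows)
```

For comma-separated input, pass delimiter="," instead.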
The first line of the file in the older format contains names of the attributes. Names can contain any character but CR, LF, NUL or TAB. Spaces are allowed.
The second line contains types of attributes, one entry for each attribute. Attributes can be of the following four types.
d or discrete: for discrete attributes. Alternatively, you can list the possible values of the attribute instead of "d" or "discrete"; values should be separated by spaces. Spaces contained in values must be "escaped" (that is, preceded by a backslash): a value named "light blue" would be written as light\ blue. Listing the values is useful since it prescribes their order; if you only describe the attribute type by "d", the values will be ordered as they are encountered when reading the examples. The corresponding attribute descriptor is of type EnumVariable.
c or continuous: for continuous attributes. They are described by an instance of FloatVariable.
string: for string attributes, described by StringVariable.
basket: this does not create a single attribute but rather tells the parser that this column will list values of optional continuous meta attributes. There can only be one basket. The attribute needs a name to simplify the parser, yet the name is not used anywhere. More on baskets in a dedicated chapter.
Python attribute type.
Change: The (undocumented) symbols that could be used for declaring continuous and discrete attributes ('f' and 'float', and 'e' and 'enum') have been removed.
Note that at the moment, Orange learning methods can only use discrete and continuous attributes. String and Python attributes can be used as meta-attributes describing examples, or you can use them in specific learning methods and other algorithms.
The third line of the file contains optional flags.
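i or ignore: the attribute is ignored.
c or class: the attribute is the class attribute.
m or meta: the attribute is a meta-attribute.
-dc: followed by a symbol that denotes an unknown value for this attribute.
A basket column can be ignored; the other flags have no effect on it.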
The first few lines of iris dataset in this format look like this:
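As an illustration of the three-line header, the following sketch assembles such a file as a string. The column names are assumptions chosen for this example; the three data rows are the first three examples of the well-known Iris data set.

```python
# Three header lines: attribute names, types and flags (names are assumed).
names = ["sepal length", "sepal width", "petal length", "petal width", "iris"]
types = ["c", "c", "c", "c", "d"]
flags = ["", "", "", "", "class"]

# The first three examples of the Iris data set.
examples = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa"],
    ["4.9", "3.0", "1.4", "0.2", "Iris-setosa"],
    ["4.7", "3.2", "1.3", "0.2", "Iris-setosa"],
]

# Join fields with tabs and lines with newlines.
lines = ["\t".join(row) for row in [names, types, flags] + examples]
old_style = "\n".join(lines) + "\n"
print(old_style)
```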
The newer version of the tab-delimited format is much simpler yet still powerful. The domain description is given in a single line which, in its simplest form, contains only the names of the attributes. In this case, Orange will recognize the attribute types itself, using this procedure:
use or determine by reuse), it will be used, thus specifying the attribute type as well.
The last non-ignored non-meta attribute will be a class attribute. It is not possible to specify a classless domain in those two file formats.
This procedure is not foolproof. You can have continuous attributes whose values happen to be only the digits 0-9, or a discrete attribute with values 1.1, 1.2, 1.3 and 2.1. You may also want to designate some other attribute as the class attribute, ignore another attribute and have a few meta attributes. This can be achieved with prefixes.
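To make the two pitfalls concrete, here is a hypothetical guessing heuristic. It is an assumption made for illustration only and not Orange's actual procedure: columns holding nothing but single digits are taken as discrete, other all-numeric columns as continuous, and everything else as discrete.

```python
DIGITS = set("0123456789")


def is_number(text):
    """Return True if text parses as a floating-point number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def guess_type(values):
    """Guess "d" or "c" for a column (assumed heuristic, not Orange's own)."""
    if all(v in DIGITS for v in values):
        return "d"  # only single digits: looks like a small set of codes
    if all(is_number(v) for v in values):
        return "c"  # every value is numeric
    return "d"      # symbolic values


print(guess_type(["3", "7", "1"]))                # a measurement misread as discrete
print(guess_type(["1.1", "1.2", "1.3", "2.1"]))   # labels misread as continuous
```

Under such a heuristic, both problem cases from the paragraph above are guessed wrongly, which is why explicit type information is sometimes needed.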
A prefixed attribute name consists of a one- or two-letter prefix, followed by "#" and the name. The first letter of the prefix can be "m" for meta-attributes, "i" to ignore the attribute, or "c" to define the class attribute. As always, only one attribute can be a class attribute. The second letter denotes the attribute type: "D" for discrete, "C" for continuous, "S" for string attributes and "B" for baskets.
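The prefix rules can be sketched as a small parser. The function name parse_header_entry is hypothetical, and two behaviors are assumptions rather than documented facts: a one-letter prefix may give either the role or the type alone, and anything that does not form a valid prefix is treated as part of the name.

```python
ROLES = {"m": "meta", "i": "ignore", "c": "class"}
TYPES = {"D": "discrete", "C": "continuous", "S": "string", "B": "basket"}


def parse_header_entry(entry):
    """Split a header entry into (role, type, name); missing parts are None."""
    prefix, sep, name = entry.partition("#")
    if not sep or len(prefix) not in (1, 2):
        return None, None, entry  # no prefix at all
    role = None
    var_type = None
    rest = prefix
    if rest[0] in ROLES:
        role, rest = ROLES[rest[0]], rest[1:]
    if rest and rest[0] in TYPES:
        var_type, rest = TYPES[rest[0]], rest[1:]
    if rest:
        return None, None, entry  # leftover letters: not a valid prefix
    return role, var_type, name


print(parse_header_entry("mC#petal width"))
print(parse_header_entry("c#recur"))
print(parse_header_entry("iris"))
```

Note that case matters: lowercase "c" marks the class, while uppercase "C" marks a continuous attribute.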
In most cases, however, the attribute detection mechanism will suffice. Therefore, Iris can be given like this:
If you would like to ignore the first attribute, use the second as a class, explicitly require the third attribute to be discrete and have the fourth attribute be a continuous weight, you would "correct" the first line like this:
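A header of this kind can be assembled from the prefix rules. The sketch below is one reading of the description: the attribute names are assumed, the helper header_entry is hypothetical, and the weight is taken to be a continuous meta attribute.

```python
ROLE_LETTERS = {"meta": "m", "ignore": "i", "class": "c"}
TYPE_LETTERS = {"discrete": "D", "continuous": "C", "string": "S", "basket": "B"}


def header_entry(name, role=None, var_type=None):
    """Build one header entry, adding a prefix only when a role or type is given."""
    prefix = ROLE_LETTERS.get(role, "") + TYPE_LETTERS.get(var_type, "")
    return prefix + "#" + name if prefix else name


entries = [
    header_entry("sepal length", role="ignore"),
    header_entry("sepal width", role="class"),
    header_entry("petal length", var_type="discrete"),
    header_entry("petal width", role="meta", var_type="continuous"),
    header_entry("iris"),
]
header = "\t".join(entries)
print(header)
```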
Baskets can be used for storing sparse data in tab-delimited files. They were specifically designed for text mining needs. If text mining and sparse data are not your business, you can skip this section.
Baskets are given as a list of space-separated <name>=<value> atoms. A continuous meta attribute named <name> will be created and added to the domain as optional if it is not already there. A meta value for that variable will be added to the example. If the value is 1, you can omit the =<value> part.
It is not possible to put meta attributes of other types than continuous in the basket.
A tab delimited file with a basket can look like this:
It is recommended to have the basket as the last column, especially if it contains a lot of data.
Note a few things. The basket column's name, b_foo, is not used. In the first example, the value of a is 2 since it appears twice. The ordinary meta attribute, Ca, appears in all examples, even in those where its value is undefined. Meta attributes from the basket appear only where they are defined. This is due to the different nature of these meta attributes: Ca is required while the others are optional.
See also the documentation on Domain and on the basket file format (a simple format that is limited to baskets only).
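The basket rules described in this section can be sketched as a small parser for one basket cell. The function name parse_basket is hypothetical. The text states that an atom appearing twice yields the value 2; summing repeated atoms that carry explicit values is an assumption, and escaped spaces in names are not handled.

```python
def parse_basket(cell):
    """Turn a basket cell into a {name: value} dict of continuous meta values."""
    values = {}
    for atom in cell.split():
        name, sep, raw = atom.partition("=")
        amount = float(raw) if sep else 1.0  # a bare name counts as 1
        values[name] = values.get(name, 0.0) + amount
    return values


print(parse_basket("a b a c=2.5"))
```

Only names that occur in the cell appear in the result, mirroring the optional nature of basket meta attributes.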
Comma-separated files are just like new-style tab-delimited files, except that commas are used instead of tabs. For instance, for documentation on censoring, we downloaded the new Wisconsin breast cancer data from UCI:
To import this data to Orange, we only needed to add the first line, describing the attribute names.
Since we don't want the ID to be used for learning, we turned it into a meta-attribute. Besides, we needed to tell Orange that "recur" is the class attribute.
By default, empty fields, ? and NA are interpreted as "don't care", and "~" and "*" as "don't know". You can't change this: these symbols are reserved.
You can, however, specify additional symbols to denote undefined values. This can be done either per attribute or for all attributes at once. Per-attribute unknowns are specified using the -dc option in old-style tab-delimited files. For instance, if unknowns for some attribute are given as "UNK", add -dc UNK in the third line. There is no similar option in .txt or .csv files.
General symbols for unknown values are not specified in the file but given as keyword arguments to ExampleTable. Three keywords are recognized: DC and DK give symbols for don't cares and don't knows, and NA for both; DC and DK take priority over NA. Only one symbol can be specified for each kind of undefined value.
Although we can also load data in other formats (such as C4.5) by calling ExampleTable, these keyword arguments only affect the formats described on this page.
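The interplay of the reserved symbols and the three keywords can be sketched in plain Python. This mirrors the rules as described above and is not Orange's implementation; the function name unknown_symbols is hypothetical.

```python
def unknown_symbols(DC=None, DK=None, NA=None):
    """Return (don't-care symbols, don't-know symbols) for the given keywords."""
    dont_care = {"", "?", "NA"}  # reserved don't-care symbols
    dont_know = {"~", "*"}       # reserved don't-know symbols
    # DC and DK take priority; NA is the fallback for whichever is missing.
    extra_dc = DC if DC is not None else NA
    extra_dk = DK if DK is not None else NA
    if extra_dc is not None:
        dont_care.add(extra_dc)
    if extra_dk is not None:
        dont_know.add(extra_dk)
    return dont_care, dont_know


dc, dk = unknown_symbols(DC="GDC", DK="GDK")
print(sorted(dc), sorted(dk))
```

When only NA is given, the same symbol ends up in both sets, which matches the description of NA standing for both kinds of undefined values.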
To show how this works, we shall use the file undefineds.tab which looks like this.
Let's load and print it.
part of undefineds.py (uses undefineds.tab)
Here's how the file is interpreted.
As the call to ExampleTable specifies, symbols GDC and GDK stand for don't care and don't know for all attributes. In addition, X and UNK denote don't cares for the first attribute and UNAVAILABLE for the second. For other attributes, these symbols are just normal values.
As you may have noticed, undefined values are printed as "?" and "~", regardless of how they were specified in the files they were read from. Orange cannot remember such details.
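The combined effect of global and per-attribute symbols can be sketched as follows. The function name normalize and the sample row are hypothetical, and printing don't cares as "?" and don't knows as "~" is an assumption based on the reserved symbols.

```python
def normalize(field, column, per_attribute_dc, global_dc="GDC", global_dk="GDK"):
    """Map a raw field to "?" (don't care), "~" (don't know) or itself."""
    if (field in ("", "?", "NA") or field == global_dc
            or field in per_attribute_dc.get(column, ())):
        return "?"
    if field in ("~", "*") or field == global_dk:
        return "~"
    return field


# Per-attribute don't-care symbols, keyed by column index (as in the text above).
per_attribute_dc = {0: {"X", "UNK"}, 1: {"UNAVAILABLE"}}

row = ["UNK", "UNK", "GDC", "GDK"]  # hypothetical example row
printed = [normalize(field, i, per_attribute_dc) for i, field in enumerate(row)]
print(printed)
```

Here UNK is undefined only in the first column; in the second column it stays an ordinary value.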
However, when saving the data back to files, you can specify the symbols to be used (for all attributes, not per-attribute). This is done in a similar fashion as when reading the data: by giving additional keyword arguments DC, DK and/or NA to the function saveTabDelimited, saveTxt or saveCSV.
For instance, if we save the file by
This mechanism should provide for easier exporting to other data mining programs that can handle tab- or comma-delimited files. For specific problems, such as having more names denoting different types of unknowns, possibly in combination with other attribute values, you can easily program your own input/output routines in Python.
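As a starting point for such a custom output routine, the sketch below writes one row with user-chosen symbols for undefined values. It assumes that undefined values are held in memory as the strings "?" and "~", that these are also the defaults on output, and that DC and DK take priority over NA as they do when reading. The function name format_row is hypothetical.

```python
def format_row(row, DC=None, DK=None, NA=None):
    """Join a row with tabs, substituting the chosen symbols for undefined values."""
    dc_symbol = DC if DC is not None else (NA if NA is not None else "?")
    dk_symbol = DK if DK is not None else (NA if NA is not None else "~")
    out = []
    for value in row:
        if value == "?":
            out.append(dc_symbol)  # don't care
        elif value == "~":
            out.append(dk_symbol)  # don't know
        else:
            out.append(value)
    return "\t".join(out)


print(format_row(["1", "?", "~"], NA="UNK"))
```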