orngSQL: Orange's Interface to SQL databases
The orngSQL module provides access to relational databases from Orange.
It currently supports MySQL through MySQL for Python
and Postgres through
Psycopg 2.
Support for other Python DB-API 2.0
compliant databases can and will be added upon request.
Unlike OrngMySQL, orngSQL
does not support the automatic detection of discrete attributes - this was intentional, since the required "select distinct" can be too slow on large databases. The support for
writing data back into the database exists, but is rudimentary.
SQLReader
SQLReader
establishes a connection with a database and provides the methods needed to fetch the data from the database into Orange.
Attributes
- className
- The name of the class variable
- metaNames
- The list of the names of meta attributes
- discreteNames
- The list of the names of the variables whose type is to be discrete
- query
- The SQL query to be executed
- domain
- The Orange domain used.
Methods
- execute(statement [,domain=None])
- Executes SQL select statement
statement
.
If the domain
argument is provided, the new domain is used.
- update()
- Executes a pending SQL query.
- data()
- returns The Orange ExampleTable produced by the last executed query.
- connect(uri)
- Connects to the database, specified in the
uri
.
- disconnect()
- Disconnects from the database.
SQLWriter
establishes a connection with a database and provides the methods needed to create
an appropriate table in the database and/or write the data from an ExampleTable into
the database.
Methods
- write(self, data, table, renameDict = None)
- Writes the data into the table. If
renameDict
is provided, the names used are remapped - the orange attribute "X" is written
into the database column renameDict["X"]
of the table.
- create(self, data, table, renameDict = None, typeDict = None)
- Creates the required SQL table, then writes the data into it. If
typeDict
is provided, the SQL type of the orange attribute "X" is set to
typeDict["X"]
.
- connect(uri)
- Connects to the database, specified in the
uri
.
- disconnect()
- Disconnects from the database.
loadSQL(filename, dontCheckStored = False, domain = None)
Executes the sql query stored in the file under filename
. This method allows the standard file reading functions with .sql files.
saveSQL
Does nothing. This function will be implemented as soon as someone suggests a good way to do so.
Attribute Names and Types
The rows returned by an SQL query have to be converted into orange
Example
s. Each column in a row has to be converted into a
certain attribute type. The following conversions between SQL and Orange
types are used:
STRING
and DATETIME
attributes are converted to Orange strings
- The attributes whose names are contained in
discreteNames
are converted
to Orange discrete attributes.
- The attributes whose names are listed in
metaNames
- Other attributes are converted to continuous Orange attributes
- The attribute in
className
is set as the class attribute. If no className
is set,
the column with the name "class" in the returned SQL query is set as the class attribute. If no such column exists,
the last column is set as the class attribute.
Orange domain handling
In Orange, attributes in different ExampleTables can represent different things, even if they share the same names.
To remedy this, Orange employs domains.. To make the reuse and creation of new domains easier,
Orange provides the DomainDepot.
When constructing an ExampleTable of results, sqlReader
also has to somehow create a
domain. This can be done in three ways:
- By default, the domain is constructed automatically, based on
className
, metaNames
, and discreteNames
.
- The domain can be provided as an argument to
execute()
.
- The domain can be set explicitly. The set domain is then used with the query, affecting the attribute types of
the
ExampleTable
, returned by data()
Examples
The examples below were tested with the postgress adapter.
To use them, you will probably have to modify them according to your
database setup. In the example, the name of the database is assumed to be "test",
the username "user" and the password "somepass"
We start with setting-up a small data table in the database to be used in
our examples:
DROP TABLE bus;
CREATE TABLE bus
(id varchar(5),
line integer,
daytime varchar,
temp float,
weather varchar,
arrival varchar);
\COPY "bus" FROM 'bus.txt' USING DELIMITERS ' '
SELECT * from bus;
DROP TABLE bus;
CREATE TABLE bus
(id varchar(5),
line integer,
daytime varchar,
temp float,
weather varchar,
arrival varchar);
LOAD DATA LOCAL INFILE 'bus.txt' INTO TABLE bus;
SELECT * from bus;
If everything goes well, running the script from shell by something like
mysql -u user -p < bus.sql or
psql -u user -p < bus-postgres.sql
should produce something along the lines of:
id line daytime temp weather arrival
1 10 morning 10 sunny late
2 11 morning 13 rainy late
3 9 morning 15 rainy late
4 10 evening 25 sunny on-time
5 9 evening 29 sunny on-time
6 11 morning 26 sunny late
7 9 evening 9 rainy on-time
8 9 midday 20 rainy late
9 11 midday 21 sunny late
10 10 evening 5 rainy on-time
11 10 midday 8 rainy late
12 9 morning 5 rainy on-time
Let us now write the script that reads some data from this table from
the database and produces an ExampleTable.
Running this script reports on four data instances and lists the
attributes (with types) in new Orange data table:
['1', '10', 'morning', 10.000, 'sunny', 'late']
['4', '10', 'evening', 25.000, 'sunny', 'on-time']
['10', '10', 'evening', 5.000, 'rainy', 'on-time']
['11', '10', 'midday', 8.000, 'rainy', 'late']
StringVariable 'id'
EnumVariable 'line'
EnumVariable 'daytime'
FloatVariable 'temp'
EnumVariable 'weather'
EnumVariable 'arrival'
Now let us suppose that some of the attributes should actually be treated
as discrete variables. The arrival column will now be explicitly set as the
class variable.
import orange, orngMySQL
t = orngMySQL.Connect('localhost','root','','test')
data = t.query("SELECT * FROM busclass WHERE line='10'")
for x in data:
print x
print
for a in data.domain.attributes:
print a
print 'Class:', data.domain.classVar
By running it, we get:
['10', 'morning', 10.000, 'sunny', 'late'], {"id":'1'}
['10', 'evening', 25.000, 'sunny', 'on-time'], {"id":'4'}
['10', 'evening', 5.000, 'rainy', 'on-time'], {"id":'10'}
['10', 'midday', 8.000, 'rainy', 'late'], {"id":'11'}
EnumVariable 'line'
EnumVariable 'daytime'
FloatVariable 'temp'
EnumVariable 'weather'
Class: EnumVariable 'arrival'
For our final examples, we load Iris data set in Orange, write it to
a data base, and read back only the data records whose sepal
length is below 5 mm. By default, all discrete and string orange attributes
are treated as VARCHARs and all continuous are treated as FLOATs.
# Description: Writes a data set to and reads from an SQL database
# Category: file formats
# Classes: ExampleTable, orngSQL.SQLReader, orngSQL.SQLWriter
# Uses: iris.tab
# Referenced: orngSQL.htm
import orange, orngSQL, orngTree
data = orange.ExampleTable("iris")
print "Input data domain:"
for a in data.domain.variables:
print a
r = orngSQL.SQLReader('mysql://user:somepass@localhost/test')
w = orngSQL.SQLWriter('mysql://user:somepass@localhost/test')
t.write('iris', data, overwrite=True)
sel = t.query("SELECT petal_width, petal_length FROM iris WHERE sepal_length<5.0")
print "\n%d instances returned" % len(sel)
print "Output data domain:"
for a in sel.domain.variables:
print a
The output of the script above is:
Input data domain:
FloatVariable 'sepal length'
FloatVariable 'sepal width'
FloatVariable 'petal length'
FloatVariable 'petal width'
EnumVariable 'iris'
22 instances returned
Output data domain:
FloatVariable 'petal_width'
FloatVariable 'petal_length'
Usually, one needs more control over the sql names and types. Here's the same example
as above, with some names and types set explicitly:
# Description: Writes a data set to and reads from an SQL database
# Category: file formats
# Classes: ExampleTable, orngSQL.SQLReader, orngSQL.SQLWriter
# Uses: iris.tab
# Referenced: orngSQL.htm
import orange, orngSQL, orngTree
data = orange.ExampleTable("iris")
print "Input data domain:"
for a in data.domain.variables:
print a
r = orngSQL.SQLReader('mysql://user:somepass@localhost/test')
w = orngSQL.SQLWriter('mysql://user:somepass@localhost/test')
# the following line only works with mysql because it uses the enum type.
w.create('iris', data,
renameDict = {'sepal length':'seplen',
'sepal width':'sepwidth',
'petal length':'petlen',
'petal width':'petwidth'},
typeDict = {'iris':"""enum('Iris-setosa', 'Iris-versicolor', 'Iris-virginica')"""})
r.execute("SELECT petwidth, petlen FROM iris WHERE seplen<5.0;")
data = r.data()
print "\n%d instances returned" % len(data)
print "Output data domain:"
for a in data.domain.variables:
print a
And the output:
Input data domain:
FloatVariable 'sepal length'
FloatVariable 'sepal width'
FloatVariable 'petal length'
FloatVariable 'petal width'
EnumVariable 'iris'
CREATE TABLE "iris" ( "seplen" DOUBLE, "sepwidth" DOUBLE, "petlen" DOUBLE, "petwidth" DOUBLE, "iris" enum('Iris-setosa', 'Iris-versicolor', 'Iris-virginica') );
22 instances returned
Output data domain:
FloatVariable 'petwidth'
FloatVariable 'petlen'
When accessing data, stored in a relational database, SQL queries are often reused.
Unfortunately, an sql query is not sufficient to retrieve the data. Before executing the
query, one must first connect to a database. It is also impossible to infer important
properties, such as the desired class variable, from the retrieved data.
The simplest method of accessing data, stored in a relational database, from Orange, in most cases,
is through .sql files with special comments. The examples listed below should be more or less self-explanatory.
The statements after the keywords uri
, discrete
, class
and metas
are actually evaluated by the python interpeter. The returned values are then used appropriately.
For now, there is no write support for .sql files. To transfer data from Orange back into an SQL database,
the orng should be used directly.
--orng uri 'mysql://user:somepass@localhost/test'
--orng discrete ['registration', 'num', 'time of day', 'arrival']
--orng meta ['weather', 'arrival', 'time']
--orng class ['arrival']
SELECT
"id" as registration,
line as num,
daytime as "time of day",
temp as temperature,
weather,
arrival
FROM
bus
WHERE
line='10';
(uses sql-new-embed.sql)
# Description: Reads data from a database
# Category: file formats
# Classes: orngSQL.SQLReader
# Referenced: orngSQL.htm
import orange, orngSQLFile
orange.registerFileType("SQL", orngSQLFile.loadSQL, None, ".sql")
data = orange.ExampleTable('sql-new-embed.sql')
for x in data:
print x
print
for a in data.domain.attributes:
print a
The expected output:
(('registration', 253, 2, 2, 2, 31, 0), ('num', 254, 2, 2, 2, 0, 1), ('time of day', 254, 7, 7, 7, 0, 1), ('temperature', 4, 2, 12, 12, 31, 1), ('weather', 254, 5, 5, 5, 0, 1), ('arrival', 254, 7, 7, 7, 0, 1))
(('registration', 253, 2, 2, 2, 31, 0), ('num', 254, 2, 2, 2, 0, 1), ('time of day', 254, 7, 7, 7, 0, 1), ('temperature', 4, 2, 12, 12, 31, 1), ('weather', 254, 5, 5, 5, 0, 1), ('arrival', 254, 7, 7, 7, 0, 1))
['id', '10', 'morning', 10.0], {"weather":'sunny', "arrival":'late'}
['id', '10', 'evening', 25.0], {"weather":'sunny', "arrival":'on-time'}
['id', '10', 'evening', 5.0], {"weather":'rainy', "arrival":'on-time'}
['id', '10', 'midday', 8.0], {"weather":'rainy', "arrival":'late'}
EnumVariable 'registration'
EnumVariable 'num'
EnumVariable 'time of day'
FloatVariable 'temperature'