Class Query
object --+
|
Query
A Class representing a structured database query
Objects of this class have properties that model the attributes of
the query, and methods for performing the request.
SYNOPSIS
example:
>>> service = Service("http://www.flymine.org/query/service")
>>> query = service.new_query()
>>>
>>> query.add_view("Gene.symbol", "Gene.pathways.name", "Gene.proteins.symbol")
>>> query.add_sort_order("Gene.pathways.name")
>>>
>>> query.add_constraint("Gene", "LOOKUP", "eve")
>>> query.add_constraint("Gene.pathways.name", "=", "Phosphate*")
>>>
>>> query.set_logic("A or B")
>>>
>>> for row in query.rows():
...     handle_row(row)
OR, using an SQL style DSL:
>>> s = Service("www.flymine.org/query")
>>> query = s.query("Gene").\
... select("*", "pathways.*").\
... where("symbol", "=", "H").\
... outerjoin("pathways").\
... order_by("symbol")
>>> for row in query.rows(start=10, size=5):
...     handle_row(row)
OR, for a more SQLAlchemy-like, ORM style:
>>> for gene in s.query(s.model.Gene).filter(s.model.Gene.symbol == ["zen", "H", "eve"]).add_columns(s.model.Gene.alleles):
...     handle(gene)
Query objects represent structured requests for information over
the database housed at the data warehouse whose webservice you are
querying. They use some of the concepts of relational databases,
within an object-relational mapping (ORM) context. If you don't know
what that means, don't worry: you don't need to write SQL, and the
queries will be fast.
To make things slightly more familiar to those with knowledge of
SQL, some syntactic sugar is provided to make constructing queries
a bit more recognisable.
PRINCIPLES
The data model represents tables in the databases as classes, with
records within tables as instances of that class. The columns of the
database are the fields of that object:
The Gene table - showing two records/objects
+----+--------+--------+---------------+----------+
| id | symbol | length | cyto-location | organism |
+----+--------+--------+---------------+----------+
| 01 | eve    | 1539   | 46C10-46C10   | 01       |
+----+--------+--------+---------------+----------+
| 02 | zen    | 1331   | 84A5-84A5     | 01       |
+----+--------+--------+---------------+----------+
...
The organism table - showing one record/object
+----+-----------------+----------+
| id | name            | taxon id |
+----+-----------------+----------+
| 01 | D. melanogaster | 7227     |
+----+-----------------+----------+
Columns that contain a meaningful value are known as 'attributes'
(in the tables above, that is everything except the id and
"organism" columns). The other columns (such as "organism" in the
gene table) reference records of other tables (i.e. other objects),
and are called references. You can refer to any field in any class
that is connected, however indirectly, to a table by using dotted
path notation:
Gene.organism.name -> the name column in the organism table, referenced by a record in the gene table
These paths, and the connections between records and tables they
represent, are the basis for the structure of InterMine queries.
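The way a dotted path walks from one table to another can be illustrated with a small, self-contained sketch. It does not use the InterMine client: the TABLES and REFERENCES dictionaries and the resolve function are hypothetical stand-ins built from the two example tables above, and the field name taxonId is an assumed spelling of the "taxon id" column.

```python
# Toy in-memory copies of the two example tables (illustrative only).
TABLES = {
    "Gene": {
        1: {"symbol": "eve", "length": 1539, "organism": 1},
        2: {"symbol": "zen", "length": 1331, "organism": 1},
    },
    "Organism": {
        1: {"name": "D. melanogaster", "taxonId": 7227},
    },
}

# Which fields are references, and which table each one points to.
REFERENCES = {("Gene", "organism"): "Organism"}


def resolve(path, record_id):
    """Follow a dotted path such as 'Gene.organism.name' from one root record."""
    parts = path.split(".")
    table, fields = parts[0], parts[1:]
    record = TABLES[table][record_id]
    for field in fields[:-1]:
        # Every intermediate step must be a reference to another table.
        target = REFERENCES[(table, field)]
        record = TABLES[target][record[field]]
        table = target
    # The final step must be an attribute (a meaningful value).
    return record[fields[-1]]


print(resolve("Gene.organism.name", 1))
print(resolve("Gene.symbol", 2))
```

The first call crosses the "organism" reference before reading an attribute; the second reads an attribute directly from the root table.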
THE STRUCTURE OF A QUERY
A query has two principal sets of properties:
- its view: the set of output columns
- its constraints: the set of rules for what to include
A query must have at least one output column in its view, but
constraints are optional - if you don't include any, you will get
back every record from the table (every object of that type).
In addition, the query must be coherent: if you have information
about an organism, and you want a list of genes, then the
"Gene" table should be the basis for your query, and so
the Gene class, which represents this table, should be the root
of all the paths that appear in it.
To take a simple example: I have an organism name, and I want a
list of genes. The view is the list of things I want to know about
those genes:
>>> query.add_view("Gene.name")
>>> query.add_view("Gene.length")
>>> query.add_view("Gene.proteins.sequence.length")
Note I can freely mix attributes and references, as long as every
view ends in an attribute (a meaningful value). As a short-cut I can
also write:
>>> query.add_views("Gene.name", "Gene.length", "Gene.proteins.sequence.length")
or:
>>> query.add_views("Gene.name Gene.length Gene.proteins.sequence.length")
They are all equivalent. You can also use common SQL style
shortcuts such as "*" for all attribute fields:
>>> query.add_views("Gene.*")
You can also use "select" in place of "add_view"; note that
"select" replaces the current output columns rather than appending
to them.
Now I can add my constraints. As mentioned, I have information
about an organism, so:
>>> query.add_constraint("Gene.organism.name", "=", "D. melanogaster")
(note, here I can also use "where", which works like
"add_constraint" but returns the query itself so that calls can be
chained)
If I run this query, I will get literally millions of results - it
needs to be filtered further:
>>> query.add_constraint("Gene.proteins.sequence.length", "<", 500)
If that doesn't restrict things enough I can add more filters:
>>> query.add_constraint("Gene.symbol", "ONE OF", ["eve", "zen", "h"])
Now I am guaranteed to get only information on genes I am
interested in.
Note, though, that because I have included the link (or
"join") from Gene -> Protein, this, by default, means
that I only want genes that have protein information associated with
them. If in fact I want information on all genes, and just want to
know the protein information if it is available, then I can specify
that with:
>>> query.add_join("Gene.proteins", "OUTER")
And if perhaps my query is not as simple as a strict cumulative
filter, but I want all D. mel genes that EITHER have a short protein
sequence OR come from one of my favourite genes (as unlikely as that
sounds), I can specify the logic for that too:
>>> query.set_logic("A and (B or C)")
Each letter refers to one of the constraints - the codes are
assigned in the order you add the constraints. If you want to be
absolutely certain about the constraints you mean, you can use the
constraint objects themselves:
>>> gene_is_eve = query.add_constraint("Gene.symbol", "=", "eve")
>>> gene_is_zen = query.add_constraint("Gene.symbol", "=", "zen")
>>>
>>> query.set_logic(gene_is_eve | gene_is_zen)
By default the logic is a straight cumulative filter (ie: A and B
and C and D and ...)
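To make the effect of the logic concrete, here is a minimal sketch that does not touch the InterMine client. It assigns codes in insertion order, builds the default and-ed logic string, and compares the strict filter with "A and (B or C)" over a few made-up records. The record fields and the helper names a, b and c are hypothetical.

```python
import string

# The three constraints from the walkthrough, in the order they were added.
constraints = [
    ("Gene.organism.name", "=", "D. melanogaster"),
    ("Gene.proteins.sequence.length", "<", 500),
    ("Gene.symbol", "ONE OF", ["eve", "zen", "h"]),
]

# Codes are assigned in insertion order: A, B, C, ...
coded = dict(zip(string.ascii_uppercase, constraints))
default_logic = " and ".join(coded)

# Made-up, flattened records used only for illustration.
genes = [
    {"symbol": "eve", "organism": "D. melanogaster", "protein_length": 376},
    {"symbol": "zen", "organism": "D. melanogaster", "protein_length": 900},
    {"symbol": "abc", "organism": "D. melanogaster", "protein_length": 120},
    {"symbol": "xyz", "organism": "D. melanogaster", "protein_length": 900},
]


def a(gene):
    return gene["organism"] == "D. melanogaster"


def b(gene):
    return gene["protein_length"] < 500


def c(gene):
    return gene["symbol"] in ("eve", "zen", "h")


# Default logic: a straight cumulative filter.
strict = [g["symbol"] for g in genes if a(g) and b(g) and c(g)]
# Custom logic: "A and (B or C)".
relaxed = [g["symbol"] for g in genes if a(g) and (b(g) or c(g))]

print(default_logic)
print(strict)
print(relaxed)
```

The relaxed logic keeps every record that satisfies A plus either of the other two constraints, so it returns a superset of the strict result.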
Putting it all together:
>>> query.add_view("Gene.name", "Gene.length", "Gene.proteins.sequence.length")
>>> query.add_constraint("Gene.organism.name", "=", "D. melanogaster")
>>> query.add_constraint("Gene.proteins.sequence.length", "<", 500)
>>> query.add_constraint("Gene.symbol", "ONE OF", ["eve", "zen", "h"])
>>> query.add_join("Gene.proteins", "OUTER")
>>> query.set_logic("A and (B or C)")
This can be made more concise and readable with a little DSL
sugar:
>>> query = service.query("Gene")
>>> query.select("name", "length", "proteins.sequence.length").\
...       where('organism.name', '=', 'D. melanogaster').\
...       where("proteins.sequence.length", "<", 500).\
...       where('symbol', 'ONE OF', ['eve', 'h', 'zen']).\
...       outerjoin('proteins').\
...       set_logic("A and (B or C)")
And the query is defined.
Result Processing: Rows
Calling ".rows()" on a query will return an iterator of
rows, where each row is a ResultRow object, which can be treated as
both a list and a dictionary.
This means you can refer to columns by name:
>>> for row in query.rows():
...     print("name is %s" % row["name"])
...     print("length is %d" % row["length"])
As well as using list indices:
>>> for row in query.rows():
...     print("The first column is %s" % row[0])
Iterating over a row iterates over the cell values as a list:
>>> for row in query.rows():
...     for column in row:
...         do_something(column)
Here each row will have a gene name, a gene length, and a sequence
length, eg:
>>> print(row.to_l())
["even skipped", "1359", "376"]
To make that clearer, you can ask for a dictionary instead of a
list:
>>> for row in query.rows():
...     print(row.to_d())
{"Gene.name":"even skipped","Gene.length":"1359","Gene.proteins.sequence.length":"376"}
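The idea of a row that behaves as both a list and a dictionary can be sketched in a few lines. The RowSketch class below is a hypothetical stand-in written for illustration; it is not the library's ResultRow, and it assumes that a column may be looked up either by its full path or by the path without the root class.

```python
class RowSketch(object):
    """A hypothetical row that supports both list-style and dict-style access."""

    def __init__(self, views, values):
        self.views = list(views)
        self.values = list(values)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.values[key]  # list-style access by position
        for index, view in enumerate(self.views):
            # Accept "Gene.name" as well as the rootless form "name".
            if key == view or key == view.partition(".")[2]:
                return self.values[index]
        raise KeyError(key)

    def __iter__(self):
        return iter(self.values)

    def to_l(self):
        return list(self.values)

    def to_d(self):
        return dict(zip(self.views, self.values))


row = RowSketch(
    ["Gene.name", "Gene.length", "Gene.proteins.sequence.length"],
    ["even skipped", 1359, 376],
)
print(row["name"], row[1], row["Gene.proteins.sequence.length"])
print(row.to_d())
```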
If you just want the raw results, for printing to a file, or for
piping to another program, you can request strings instead:
>>> for row in query.results("string"):
...     print(row)
Result Processing: Results
Results can also be processed on a record by record basis. If you
have a query that has output columns of "Gene.symbol",
"Gene.pathways.name" and
"Gene.proteins.proteinDomains.primaryIdentifier", then
processing it by records will return one object per gene, and that
gene will have a property named "pathways" which contains
objects which have a name property. Likewise there will be a proteins
property which holds a list of proteins, each with a proteinDomains
property whose members all have a primaryIdentifier property, and so
on. This allows a more object-oriented approach to database records,
familiar to users of other ORMs.
This is the format used when you choose to iterate over a query
directly, or can be explicitly chosen by invoking intermine.query.Query.results:
>>> for gene in query:
...     print(gene.symbol, [pathway.name for pathway in gene.pathways])
The structure of the object and the information it contains
depend entirely on the output columns selected. A field may be None
because the record has no value for it, but any field that is valid
according to the data model will also be None if it was not selected
for output. Attempts to access invalid properties (such as
gene.favourite_colour) will cause exceptions to be thrown.
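The relationship between flat rows and nested result objects can be sketched without the client library. In this hypothetical example the Record class and the nest function are illustrative names, and the rows are made-up data for the views Gene.symbol and Gene.pathways.name.

```python
from collections import OrderedDict


class Record(object):
    """Hypothetical record: plain attribute access, AttributeError if unknown."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


def nest(rows):
    """Group flat (symbol, pathway name) rows into one object per gene."""
    genes = OrderedDict()
    for symbol, pathway_name in rows:
        if symbol not in genes:
            # 'name' is valid in the model but was not selected, so it is None.
            genes[symbol] = Record(symbol=symbol, name=None, pathways=[])
        genes[symbol].pathways.append(Record(name=pathway_name))
    return list(genes.values())


# Made-up rows for the views Gene.symbol and Gene.pathways.name.
rows = [("eve", "pathway-1"), ("eve", "pathway-2"), ("zen", "pathway-3")]
for gene in nest(rows):
    print(gene.symbol, [p.name for p in gene.pathways])
```

Three rows collapse into two gene objects, an unselected but valid field reads as None, and an unknown attribute raises AttributeError.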
Getting us to Generate your Code
Not that you have to actually write any of this! The webapp will
happily generate the code for any query (and template) you can build
in it. A good way to get started is to use the webapp to generate
your code, and then run it as scripts to speed up your queries. You
can always tinker with and edit the scripts you download.
To get generated queries, look for the "python" link at
the bottom of query-builder and template form pages; it looks a bit
like this:
+=====================================+
|                                     |
|    Perl  |  Python  |  Java  [Help] |
|                                     |
+=====================================+
__init__(self, model, service=None, validate=True, root=None) - Construct a
new query for making database queries against an InterMine data warehouse.
__iter__(self) - Return an iterator over all the objects returned by this
query.
__len__(self) - Return the number of rows this query will return.
verify(self) - Invalid queries will fail to run, and it is not always obvious
why.
clear_view(self) - Deletes all entries currently in the view list.
where(self, *args, **kwargs) - In contrast to add_constraint, this method
also adds all attributes to the query if no view has been set, and returns
self to support method chaining.
column(self, string) - This method is part of the SQLAlchemy compatible API.
outerjoin(self, column) - Alias for add_join(column, "OUTER").
validate_logic(self, logic=None) - Attempts to validate the logic by checking
that every coded_constraint is included at least once.
get_subclass_dict(self) -> dict(string, string) - This method returns a
mapping of classes used by the model for assessing whether certain paths are
valid.
rows(self, start=0, size=None) -> iterable<intermine.webservice.ResultRow> -
This is a shortcut for results("rr").
one(self, row='jsonobjects') - Return one result, and raise an error if the
result size is not 1.
first(self, row='jsonobjects', start=0) - Return the first result, or None if
the results are empty.
get_results_list(self, *args, **kwargs) - This method is a shortcut so that
you do not have to do a list comprehension yourself on the iterator that is
normally returned.
count(self) -> int - Obtain the number of rows a particular query will
return, without having to fetch and parse all the actual data.
to_Node(self) -> xml.minidom.Node - This is an intermediate step in the
creation of the xml serialised version of the query.
to_xml(self) -> string - This method serialises the current state of the
query to an xml string, suitable for storing, or sending over the internet to
the webservice.
to_formatted_xml(self) -> string - This method serialises the current state
of the query to an xml string, suitable for storing, or sending over the
internet to the webservice, only more readably.
clone(self) - This method will produce a clone that is independent, and can
be altered without affecting the original, but starts off with the exact same
state as it.
Inherited from object: __delattr__, __format__, __getattribute__, __hash__,
__new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__,
__subclasshook__
SO_SPLIT_PATTERN = re.compile(r'(?i)\s*(asc|desc)\s*')
LOGIC_SPLIT_PATTERN = re.compile(r'(?i)\s*(?:and|or|\(|\))\s*')
TRAILING_OP_PATTERN = re.compile(r'(?i)\s*(and|or)\s*$')
LEADING_OP_PATTERN = re.compile(r'(?i)^\s*(and|or)\s*')
LOGIC_OPS = ['and', 'or']
LOGIC_PRODUCT = [('and', 'and'), ('and', 'or'), ('or', 'and'), ...
x = 'or'
y = 'or'
__init__(self, model, service=None, validate=True, root=None)
(Constructor)
Construct a new Query
Construct a new query for making database queries against an
InterMine data warehouse.
Normally you would not need to use this constructor directly, but
instead use the factory method on intermine.webservice.Service, which
will handle construction for you.
- Parameters:
model - an instance of intermine.model.Model. Required
service - an instance of intermine.webservice.Service. Optional, but you
will not be able to make requests without one.
validate - a boolean - defaults to True. If set to False, the query will not
try to validate itself. You should not set this to False.
- Overrides:
object.__init__
|
from_xml(cls, xml, *args, **kwargs)
Class Method
Deserialise a query serialised to XML
This method is used to instantiate serialised queries. It is used by
intermine.webservice.Service objects to instantiate Template objects
and it can be used to read in queries you have saved to a file.
- Parameters:
xml - The xml as a file name, url, or string
- Returns: Query
- Raises:
|
__str__(self)
(Informal representation operator)
Return the XML serialisation of this query
- Overrides:
object.__str__
|
verify(self)
Validate the query
Invalid queries will fail to run, and it is not always obvious why.
The validation routine checks to see that the query will not cause
errors on execution, and tries to provide informative error
messages.
This method is called immediately after a query is fully
deserialised.
- Raises:
|
Replace the current selection of output columns with this one
example:
query.select("*", "proteins.name")
This method is intended to provide an API familiar to those with
experience of SQL or other ORM layers. This method, in contrast to
other view manipulation methods, replaces the selection of output
columns, rather than appending to it.
Note that any sort orders that are no longer in the view will be
removed.
- Parameters:
paths - The output columns to add
|
Add one or more views to the list of output columns
example:
query.add_view("Gene.name Gene.organism.name")
This is the main method for adding views to the list of output
columns. As well as appending views, it will also split a single, space
or comma delimited string into multiple paths, and flatten out lists,
or any combination. It will also immediately try to validate the
views.
Output columns must be valid paths according to the data model, and
they must represent attributes of tables.
Also available as:
- add_views
- add_column
- add_columns
- add_to_select
- See Also: intermine.model.Model, intermine.model.Path, intermine.model.Attribute
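The splitting and flattening behaviour described for this method can be sketched as follows. This is an illustration under stated assumptions, not the library's code: flatten_paths is a hypothetical helper, and the exact delimiter handling (any run of commas and whitespace) is assumed.

```python
import re


def flatten_paths(*args):
    """Turn strings, delimited strings and nested lists into a flat path list."""
    paths = []
    for arg in args:
        if isinstance(arg, (list, tuple)):
            # Lists may be nested to any depth, so recurse into them.
            paths.extend(flatten_paths(*arg))
        else:
            # Split on commas and/or whitespace, dropping empty pieces.
            paths.extend(p for p in re.split(r"[,\s]+", arg) if p)
    return paths


print(flatten_paths("Gene.name Gene.organism.name"))
print(flatten_paths("Gene.name, Gene.length", ["Gene.symbol", ["Gene.id"]]))
```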
|
Check to see if the views given are valid
This method checks to see if the views:
- are valid according to the model
- represent attributes
- Raises:
|
Add a constraint (filter on records)
example:
query.add_constraint("Gene.symbol", "=", "zen")
This method will try to make a constraint from the arguments given,
trying each of the classes it knows of in turn to see if they accept
the arguments. This allows you to add constraints of different types
without having to know or care what their classes or implementation
details are. All constraints derive from
intermine.constraints.Constraint, and they all have a path attribute,
but are otherwise diverse.
Before adding the constraint to the query, this method will also try
to check that the constraint is valid by calling
Query.verify_constraint_paths()
- Returns: intermine.constraints.Constraint
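The "try each class in turn" approach can be sketched with two toy constraint classes. Everything here is hypothetical: BinarySketch, MultiSketch and make_constraint are illustrative names, and the operator sets are an assumed subset chosen only to make the example work.

```python
class BinarySketch(object):
    """Toy constraint accepting (path, op, single value)."""

    OPS = {"=", "!=", "<", ">", "<=", ">="}

    def __init__(self, path, op, value):
        if op not in self.OPS or isinstance(value, (list, tuple)):
            raise TypeError("not a binary constraint")
        self.path, self.op, self.value = path, op, value


class MultiSketch(object):
    """Toy constraint accepting (path, op, list of values)."""

    OPS = {"ONE OF", "NONE OF"}

    def __init__(self, path, op, values):
        if op not in self.OPS or not isinstance(values, (list, tuple)):
            raise TypeError("not a multi-value constraint")
        self.path, self.op, self.values = path, op, list(values)


def make_constraint(*args):
    """Return the first constraint type whose constructor accepts the arguments."""
    for cls in (BinarySketch, MultiSketch):
        try:
            return cls(*args)
        except TypeError:
            continue  # wrong shape for this class, so try the next one
    raise TypeError("No constraint type accepts %r" % (args,))


print(type(make_constraint("Gene.symbol", "=", "zen")).__name__)
print(type(make_constraint("Gene.symbol", "ONE OF", ["eve", "zen"])).__name__)
```

The caller never names a constraint class; the shape of the arguments alone decides which one is built.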
|
where(self, *args, **kwargs)
Add a constraint to the query
In contrast to add_constraint, this method also adds all attributes
to the query if no view has been set, and returns self to support
method chaining.
Also available as Query.filter
|
column(self, string)
Return a Column object suitable for using to construct constraints with
This method is part of the SQLAlchemy compatible API.
Also available as Query.c
|
Check that the constraints are valid
This method will check the path attribute of each constraint. In
addition it will:
- Check that BinaryConstraints and MultiConstraints have an Attribute
as their path
- Check that TernaryConstraints have a Reference as theirs
- Check that SubClassConstraints have a correct subclass relationship
- Check that LoopConstraints have a valid loopPath, of a compatible
type
- Check that ListConstraints refer to an object
- Parameters:
cons - The constraints to check (defaults to all constraints on the
query)
- Raises:
|
Returns the constraint with the given code
Returns the constraint with the given code, if it exists. If no such
constraint exists, it throws a ConstraintError.
- Returns: intermine.constraints.CodedConstraint
- the constraint corresponding to the given code
|
Add a join statement to the query
example:
query.add_join("Gene.proteins", "OUTER")
A join statement determines whether a reference restricts the result
set to records for which the referenced objects exist. For example,
if one had a query with the view:
"Gene.name", "Gene.proteins.name"
Then in the normal case (that of an INNER join), we would only get
Genes that also have at least one protein that they reference. Simply
by asking for this output column you are placing a restriction on the
information you get back.
If in fact you wanted all genes, regardless of whether they had
proteins associated with them or not, but wanted to know which
proteins where they exist, then you need to specify this reference
to be an OUTER join:
query.add_join("Gene.proteins", "OUTER")
Now you will get many more rows of results, some of which will have
"null" values where the protein name would have been.
This method will also attempt to validate the join by calling
Query.verify_join_paths(). Joins must have a valid path, and the path
must be a reference. The style can be either INNER or OUTER; it
defaults to OUTER, since all references start out as inner joins and
the user never needs to specify those explicitly.
- Returns: intermine.pathfeatures.Join
- Raises:
ModelError - if the path is invalid
TypeError - if the join style is invalid
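The difference between the two join styles can be shown on a tiny in-memory data set. This is a sketch only: the genes list holds made-up records, and join_rows is a hypothetical helper that mimics how rows for the view "Gene.name", "Gene.proteins.name" would be produced under each style.

```python
# Made-up records: gene-2 has no associated proteins.
genes = [
    {"name": "gene-1", "proteins": ["protein-1"]},
    {"name": "gene-2", "proteins": []},
]


def join_rows(genes, style="INNER"):
    """Build (gene name, protein name) rows under the given join style."""
    if style not in ("INNER", "OUTER"):
        raise TypeError("Unknown join style: %s" % style)
    rows = []
    for gene in genes:
        if gene["proteins"]:
            rows.extend((gene["name"], protein) for protein in gene["proteins"])
        elif style == "OUTER":
            # An outer join keeps the gene and fills the gap with None.
            rows.append((gene["name"], None))
    return rows


print(join_rows(genes, "INNER"))
print(join_rows(genes, "OUTER"))
```

Under the inner join the protein-less gene disappears; under the outer join it is kept with an empty protein column.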
|
Check that the joins are valid
Joins must have valid paths, and they must refer to references.
- Raises:
|
add_path_description(self, *args, **kwargs)
Add a path description to the query
example:
query.add_path_description("Gene.proteins.proteinDomains", "Protein Domain")
This allows you to alias the components of long paths to improve the
way they are displayed as column headers in a variety of
circumstances. In the above example, if the view included the
unwieldy path "Gene.proteins.proteinDomains.primaryIdentifier", it
would (depending on the mine) be displayed as "Protein Domain > DB
Identifier". These settings are taken into account by the webservice
when generating column headers for flat-file results with the
columnheaders parameter given, and are always supplied when requesting
jsontable results.
- Returns: intermine.pathfeatures.PathDescription
|
Check that the path of the path description is valid
Checks for consistency with the data model
- Raises:
|
Returns the logic expression for the query
This returns the up to date logic expression. The default value is
the representation of all coded constraints joined with "and".
If the logic is empty and there are no constraints, returns an empty
string.
The LogicGroup object stringifies to a string that can be parsed to
obtain itself (eg: "A and (B or C or D)").
- Returns: intermine.constraints.LogicGroup
|
Sets the Logic given the appropriate input
example:
Query.set_logic("A and (B or C)")
This sets the logic to the appropriate value. If the value is
already a LogicGroup, it is accepted, otherwise the string is tokenised
and parsed.
The logic is then validated with a call to validate_logic()
- Raises:
LogicParseError - if there is a syntax error in the logic
|
validate_logic(self, logic=None)
Validates the query logic
Attempts to validate the logic by checking that every
coded_constraint is included at least once.
- Raises:
QueryError - if not every coded constraint is represented
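The "every coded constraint appears at least once" check can be sketched with a regular expression modelled on the LOGIC_SPLIT_PATTERN class variable. The helpers codes_in and check_logic are hypothetical, the pattern assumes the spaces shown in the class variable listing are formatting artefacts, and ValueError stands in for the library's QueryError.

```python
import re

# Modelled on LOGIC_SPLIT_PATTERN: operators and brackets act as separators.
LOGIC_SPLIT = re.compile(r"(?i)\s*(?:and|or|\(|\))\s*")


def codes_in(logic):
    """Return the set of constraint codes mentioned in a logic string."""
    return {token for token in LOGIC_SPLIT.split(logic) if token}


def check_logic(logic, codes):
    """Raise if any coded constraint is missing from the logic string."""
    missing = sorted(set(codes) - codes_in(logic))
    if missing:
        # ValueError is a stand-in for the library's QueryError.
        raise ValueError("Constraints missing from logic: " + ", ".join(missing))


check_logic("A and (B or C)", ["A", "B", "C"])  # passes silently
try:
    check_logic("A or B", ["A", "B", "C"])
except ValueError as error:
    print(error)
```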
|
add_sort_order(self, path, direction='asc')
Adds a sort order to the query
example:
Query.add_sort_order("Gene.name", "DESC")
This method adds a sort order to the query. A query can have
multiple sort orders, which are assessed in sequence.
If a query has two sort-orders, for example, the first being
"Gene.organism.name asc", and the second being
"Gene.name desc", you would have the list of genes grouped by
organism, with the lists within those groupings in reverse alphabetical
order by gene name.
This method will try to validate the sort order by calling
validate_sort_order()
Also available as Query.order_by
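The effect of assessing several sort orders in sequence can be reproduced on plain Python data. This sketch sorts made-up rows by organism name ascending and then gene name descending; the rows and the sort_orders list are illustrative, and the sorting here happens client-side purely for demonstration.

```python
# Made-up result rows keyed by view path.
rows = [
    {"Gene.organism.name": "org-b", "Gene.name": "alpha"},
    {"Gene.organism.name": "org-a", "Gene.name": "beta"},
    {"Gene.organism.name": "org-a", "Gene.name": "alpha"},
    {"Gene.organism.name": "org-b", "Gene.name": "gamma"},
]

# Sort orders in priority order, as (path, direction) pairs.
sort_orders = [("Gene.organism.name", "asc"), ("Gene.name", "desc")]

# Python's sort is stable, so apply the least significant key first.
for path, direction in reversed(sort_orders):
    rows.sort(key=lambda row: row[path], reverse=(direction.lower() == "desc"))

for row in rows:
    print(row["Gene.organism.name"], row["Gene.name"])
```

Genes end up grouped by organism, with names in reverse alphabetical order inside each group.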
|
Check the validity of the sort order
Checks that the sort order paths are:
- Raises:
|
get_subclass_dict(self)
Return the current mapping of class to subclass
This method returns a mapping of classes used by the model for
assessing whether certain paths are valid. For instance, if you subclass
MicroArrayResult to be FlyAtlasResult, you can refer to the
.presentCall attributes of fly atlas results. MicroArrayResults do not
have this attribute, and a path such as:
Gene.microArrayResult.presentCall
would be marked as invalid unless the dictionary is provided.
Users most likely will never need to call this method.
- Returns: dict(string, string)
|
results(self, row='object', start=0, size=None)
Return an iterator over result rows
Usage:
>>> for gene in query.results():
...     print(gene.symbol)
Note that if your query contains any kind of collection, it is
highly likely that start and size won't do what you think, as they
operate only on the underlying rows used to build up the returned
objects. If you want rows back, you are recommended to use the simpler
rows method.
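Why start and size can surprise you with collections is easiest to see on a toy example. The sketch below pages over made-up flat rows before grouping them into objects, which is the behaviour the note describes; page_objects is a hypothetical helper and does not reflect the library's internals.

```python
from itertools import groupby, islice

# Made-up flat rows: (gene, pathway). gene-1 spans three rows.
rows = [
    ("gene-1", "pathway-1"),
    ("gene-1", "pathway-2"),
    ("gene-1", "pathway-3"),
    ("gene-2", "pathway-4"),
    ("gene-3", "pathway-5"),
]


def page_objects(rows, start=0, size=None):
    """Apply start/size to the rows, then build one object per gene."""
    stop = None if size is None else start + size
    window = islice(rows, start, stop)  # paging happens on rows first
    return [
        (gene, [pathway for _, pathway in group])
        for gene, group in groupby(window, key=lambda row: row[0])
    ]


print(page_objects(rows, start=0, size=2))
print(len(page_objects(rows)))
```

Asking for a size of 2 yields a single gene object, and that object is missing one of its pathways, because the window was cut on rows rather than on objects.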
- Parameters:
row (string) - the format for the row. Defaults to "object". Valid
options are "rr", "dict", "list",
"jsonrows", "object", "jsonobjects",
"tsv", "csv".
- Returns: intermine.webservice.ResultIterator
- Raises:
|
rows(self, start=0, size=None)
Return the results as rows of data
This is a shortcut for results("rr")
Usage:
>>> for row in query.rows(start=10, size=10):
...     print(row["proteins.name"])
- Returns: iterable<intermine.webservice.ResultRow>
|
get_results_list(self, *args, **kwargs)
Get a list of result rows
This method is a shortcut so that you do not have to do a list
comprehension yourself on the iterator that is normally returned. If
you have a very large result set (and these can easily reach hundreds
of thousands of rows) you will not want to have the whole
list in memory at once, but there may be other circumstances when you
might want to keep the whole list in one place.
It takes all the same arguments and parameters as Query.results
Also available as Query.all
|
count(self)
Return the total number of rows this query returns
Obtain the number of rows a particular query will return, without
having to fetch and parse all the actual data. This method makes a
request to the server to report the count for the query, and is sugar
for a results call.
Also available as Query.size
- Returns: int
- Raises:
|
Returns the uri to use to create a list from this query
Query.get_list_upload_uri() -> str
This method is used internally when performing list operations on
queries.
- Returns: str
|
Returns the uri to use to append the results of this query to a list
Query.get_list_append_uri() -> str
This method is used internally when performing list operations on
queries.
- Returns: str
|
Returns the path section pointing to the REST resource
Query.get_results_path() -> str
Internally, this just calls a constant property in
intermine.service.Service
- Returns: str
|
Returns the child objects of the query
This method is used during the serialisation of queries to xml. It
is unlikely you will need access to this as a whole. Consider using
"path_descriptions", "joins",
"constraints" instead
- Returns: list
- the child elements of this query
- See Also: Query.path_descriptions, Query.joins, Query.constraints
|
Returns the parameters to be passed to the webservice
The query is responsible for producing its own query parameters.
These consist simply of:
- query: the xml representation of the query
- Returns: dict
|
to_Node(self)
Returns a DOM node representing the query
This is an intermediate step in the creation of the xml serialised
version of the query. You probably won't need to call this
directly.
- Returns: xml.minidom.Node
|
to_xml(self)
Return an XML serialisation of the query
This method serialises the current state of the query to an xml
string, suitable for storing, or sending over the internet to the
webservice.
- Returns: string
- the serialised xml string
|
to_formatted_xml(self)
Return a readable XML serialisation of the query
This method serialises the current state of the query to an xml
string, suitable for storing, or sending over the internet to the
webservice, only more readably.
- Returns: string
- the serialised xml string
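The difference between a compact and a more readable serialisation can be illustrated with the standard library's xml.dom.minidom. This is only a sketch: the element and attribute names (query, constraint, view, path, op, value) are assumptions made for the example and are not taken from this page.

```python
from xml.dom import minidom

# Build a small DOM tree by hand; names here are illustrative assumptions.
doc = minidom.Document()
query = doc.createElement("query")
query.setAttribute("view", "Gene.symbol Gene.length")

constraint = doc.createElement("constraint")
constraint.setAttribute("path", "Gene.symbol")
constraint.setAttribute("op", "=")
constraint.setAttribute("value", "eve")
query.appendChild(constraint)

compact = query.toxml()                   # single line, good for sending
pretty = query.toprettyxml(indent="  ")   # indented, easier to read

print(compact)
print(pretty)
```

Both strings describe the same tree; only the whitespace differs.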
|
clone(self)
Performs a deep clone
This method will produce a clone that is independent, and can be
altered without affecting the original, but starts off with the exact
same state as it.
The only shared elements should be the model and the service, which
are shared by all queries that refer to the same webservice.
- Returns: an object of the same class as the caller
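The cloning behaviour described here (independent state, shared model and service) can be sketched with the standard copy module. QuerySketch is a hypothetical class with only the attributes needed for the illustration; it is not the library's implementation.

```python
import copy


class QuerySketch(object):
    """Hypothetical query holding a shared model/service and private state."""

    def __init__(self, model, service=None):
        self.model = model
        self.service = service
        self.views = []
        self.constraints = []

    def clone(self):
        # The model and service are passed by reference, so they are shared.
        twin = self.__class__(self.model, self.service)
        # Everything else is deep-copied, so the clone can change freely.
        twin.views = copy.deepcopy(self.views)
        twin.constraints = copy.deepcopy(self.constraints)
        return twin


original = QuerySketch(model=object())
original.views.append("Gene.name")

duplicate = original.clone()
duplicate.views.append("Gene.length")

print(original.views)
print(duplicate.views)
print(duplicate.model is original.model)
```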
|
LOGIC_PRODUCT
- Value:
[('and', 'and'), ('and', 'or'), ('or', 'and'), ('or', 'or')]
|
|
constraints
Returns the constraints of the query
Query.constraints → list(intermine.constraints.Constraint)
Constraints are returned in the order of their code (normally the
order they were added to the query), with any subclass constraints at
the end.
- Get Method: Query.constraints → list(intermine.constraints.Constraint)
- Type: list(Constraint)
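The ordering rule (coded constraints by code, uncoded subclass constraints last) can be expressed as a single sort key. The dictionaries below are made-up stand-ins for constraint objects, used only to show the ordering.

```python
# Made-up stand-ins for constraints; the subclass constraint has no code.
constraints = [
    {"path": "Gene.symbol", "code": "B"},
    {"path": "Gene.microArrayResults", "subclass": "FlyAtlasResult"},
    {"path": "Gene.length", "code": "A"},
]

# Sort coded constraints by code, and push uncoded ones to the end.
ordered = sorted(
    constraints,
    key=lambda con: ("code" not in con, con.get("code", "")),
)

for con in ordered:
    print(con.get("code", "-"), con["path"])
```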
|
coded_constraints
Returns the list of constraints that have a code
Query.coded_constraints → list(intermine.constraints.CodedConstraint)
This returns an up to date list of the constraints that can be used
in a logic expression. The only kind of constraint that this excludes,
at present, is the SubClassConstraint.
- Get Method: Query.coded_constraints → list(intermine.constraints.CodedConstraint)
- Type: list(intermine.constraints.CodedConstraint)
|