AdvancedQuery

Introduction

AdvancedQuery is a Zope product extending Zope's search engine ZCatalog with the following key features:

Query Objects

Queries are specified by full-blown Python objects. They are constructed in the following way:

Each entry below gives the expression, how it is printed, and its meaning.

Eq(index, value, filter=False)
    printed as: index = value
    meaning: the documents indexed by index under value
Le(index, value, filter=False)
    printed as: index <= value
    meaning: the documents indexed by index under a value less than or equal to value
Ge(index, value, filter=False)
    printed as: index >= value
    meaning: the documents indexed by index under a value greater than or equal to value
Between(index, min, max, filter=False)
    printed as: min <= index <= max
    meaning: the documents indexed by index under a value between min and max
In(index, sequence, filter=False)
    printed as: index in sequence
    meaning: the documents indexed by index under a value in sequence
Generic(index, value, filter=False)
    printed as: index ~~ value
    meaning: this query type is used to pass any search expression to index as understood by it. Such search expressions usually take the form of a dictionary with query as the most essential key. Generic is necessary to use the full power of specialized indexes, such as the level argument for PathIndex searches.
Indexed(index)
    printed as: Indexed(index)
    meaning: the documents that are indexed by index. This does not work for all index types.
MatchGlob(index, pattern, filter=False)
    printed as: index =~ pattern
    meaning: the documents indexed by index under a value matching the glob pattern. A glob pattern can contain the wildcards * (matches any sequence of characters) and ? (matches any single character). This query type is only supported for indexes which can be adapted to IKeyedIndex. In addition, the index must index text values.
MatchRegexp(index, regexp, filter=False)
    printed as: index =~~ regexp
    meaning: the documents indexed by index under a value matching the regular expression regexp. See the re module documentation in the Python Library Reference for a description of regular expressions. This query type is only supported for indexes which can be adapted to IKeyedIndex. In addition, the index must index text values.
Filter(index, filter)
    printed as: Filtered(index, filter)
    meaning: filter out documents not accepted by filter. filter is called with the document's indexed value; if it returns a true value, the document is accepted, otherwise it is rejected. Note that you must know precisely how index determines a document's indexed value to use this properly.
LiteralResultSet(set)
    printed as: LiteralResultSet(set)
    meaning: the documents specified by set. set must be an IISet, IITreeSet or sequence of catalog "data_record_id_"s. This can e.g. be used to further restrict the document set previously obtained through a query (e.g. for "faceting").
~ query
    printed as: ~ query
    meaning: Not: the documents that do not satisfy query
query1 & query2
    printed as: (query1 & query2)
    meaning: And: the documents satisfying both query1 and query2
And(*queries)
    printed as: (query1 & ... & queryn)
    meaning: And: the documents satisfying all queries; if queries is empty, any document satisfies this And query
query1 | query2
    printed as: (query1 | query2)
    meaning: Or: the documents satisfying either query1 or query2 (or both)
Or(*queries)
    printed as: (query1 | ... | queryn)
    meaning: Or: the documents satisfying (at least) one of queries; if queries is empty, no document satisfies this Or query

A true filter value calls for incremental filtering. It is supported only for indexes which can be adapted to IFilterIndex.

And and Or queries are so-called CompositeQuerys. They provide a method addSubquery(query) to add an additional subquery.
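
The way query objects compose can be illustrated with a small plain-Python toy model. This is only a sketch: the Toy* classes below are hypothetical stand-ins that mimic the operators, the printed form and addSubquery described above, and they match against plain dictionaries rather than catalog indexes. They are not the classes provided by Products.AdvancedQuery.

```python
# Toy model of query composition; NOT the real Products.AdvancedQuery classes.
class ToyQuery(object):
    def __and__(self, other):
        return ToyAnd(self, other)

    def __or__(self, other):
        return ToyOr(self, other)

    def __invert__(self):
        return ToyNot(self)


class ToyEq(ToyQuery):
    def __init__(self, index, value):
        self.index, self.value = index, value

    def matches(self, doc):
        return doc.get(self.index) == self.value

    def __str__(self):
        return "%s = %r" % (self.index, self.value)


class ToyNot(ToyQuery):
    def __init__(self, query):
        self.query = query

    def matches(self, doc):
        return not self.query.matches(doc)

    def __str__(self):
        return "~ %s" % self.query


class ToyComposite(ToyQuery):
    op = None        # operator used in the printed form
    combine = None   # all() or any(), set by the subclasses

    def __init__(self, *queries):
        self.subqueries = list(queries)

    def addSubquery(self, query):
        self.subqueries.append(query)

    def matches(self, doc):
        return self.combine(q.matches(doc) for q in self.subqueries)

    def __str__(self):
        return "(%s)" % (" %s " % self.op).join(str(q) for q in self.subqueries)


class ToyAnd(ToyComposite):
    op, combine = "&", staticmethod(all)   # an empty And matches everything


class ToyOr(ToyComposite):
    op, combine = "|", staticmethod(any)   # an empty Or matches nothing


query = ToyEq('portal_type', 'News') & ~ToyEq('state', 'archived')
```

Because the composite classes rely on all() and any(), the toy model reproduces the rule that an empty And is satisfied by any document while an empty Or is satisfied by none.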

The constructors are imported from Products.AdvancedQuery.

AdvancedQuery uses monkey patching to give ZCatalog the new method makeAdvancedQuery(catalogSearchSpec). A catalogSearchSpec is a search specification as described in the Zope Book for ZCatalog searches (essentially a dictionary mapping index names to search specifications). makeAdvancedQuery returns the equivalent AdvancedQuery search object.

Query evaluation

AdvancedQuery also uses monkey patching to give ZCatalog the new methods evalAdvancedQuery(query, sortSpecs=(), withSortValues=_notPassed, **kw) and _unrestrictedEvalAdvancedQuery(query, sortSpecs=(), withSortValues=_notPassed, restricted=False, **kw).

evalAdvancedQuery evaluates query and then sorts the document result set according to sortSpecs.
If withSortValues is not passed in, it is set to True if sortSpecs contains a ranking specification (as you are probably interested in the rank) and to False otherwise.
If withSortValues, then the data_record_score_ attribute of the returned proxies is abused to hold the sort value. It is a tuple with one component per component in sortSpecs. The attribute data_record_normalized_score_ is set to None.
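
The default for withSortValues follows the common "sentinel default" pattern. The sketch below only illustrates that rule; ToyRankSpec and resolve_with_sort_values are hypothetical names, and how the real code recognizes a ranking specification is an assumption.

```python
_not_passed = object()   # sentinel: distinguishes "not passed" from False/None


class ToyRankSpec(object):
    """Hypothetical stand-in for a ranking specification."""


def resolve_with_sort_values(sort_specs=(), with_sort_values=_not_passed):
    """Return the effective withSortValues flag for the given sort specs."""
    if with_sort_values is _not_passed:
        # Default: report sort values only when a ranking spec is present.
        return any(isinstance(spec, ToyRankSpec) for spec in sort_specs)
    return bool(with_sort_values)
```

An explicit value always wins; the ranking check is consulted only when the caller leaves the argument out.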

Classes derived from ZCatalog can by default automatically restrict queries. For example, Products.CMFCore.CatalogTool.CatalogTool automatically restricts queries to those documents for which the current user has View rights and which are "active". _unrestrictedEvalAdvancedQuery lets you bypass this automatic restriction.

Sorting

AdvancedQuery supports incremental multi-level lexicographic sorting via field-index-like indexes. If an index used for sorting is not field-index-like (i.e. does not index an object under at most one value), you may get surprising (and partly non-deterministic) results.

Sorting is specified by a sequence of sort specifications, one for each level. Such a specification is either an index name, an (index name, direction) pair or a ranking specification (see below). A direction is 'asc' (ascending) or 'desc' (descending); if the direction is not specified, 'asc' is assumed.

When the result contains documents not indexed by a sorting index, such documents are delivered after the indexed documents. This always happens, independent of the sort direction.
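
The sorting rules above can be sketched in plain Python. This is a non-incremental illustration over dictionaries, not the implementation used by AdvancedQuery; multi_level_sort and normalize_spec are hypothetical helper names, and ranking specifications are left out.

```python
def normalize_spec(spec):
    """Turn 'index' or ('index', direction) into an (index, direction) pair."""
    if isinstance(spec, str):
        return spec, 'asc'
    index, direction = spec
    return index, direction


def multi_level_sort(docs, sort_specs):
    """Sort dict-like docs lexicographically by several levels.

    Works from the least significant level to the most significant one and
    relies on the stability of list.sort. Documents without a value for a
    level are placed after the indexed ones, whatever the direction.
    """
    result = list(docs)
    for spec in reversed(list(sort_specs)):
        index, direction = normalize_spec(spec)
        indexed = [d for d in result if index in d]
        missing = [d for d in result if index not in d]
        indexed.sort(key=lambda d: d[index], reverse=(direction == 'desc'))
        result = indexed + missing
    return result


docs = [
    {'id': 'a', 'modified': 3, 'Creator': 'bob'},
    {'id': 'b', 'modified': 5, 'Creator': 'zed'},
    {'id': 'c', 'Creator': 'cy'},                  # not indexed by 'modified'
    {'id': 'd', 'modified': 5, 'Creator': 'al'},
]
ordered = multi_level_sort(docs, (('modified', 'desc'), 'Creator'))
```

Here the two documents modified at 5 come first (ordered by Creator), then document a, and the document without a modified value comes last even though the direction is 'desc'.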

Incremental Filtering

From version 1.1 on, AdvancedQuery supports incremental filtering. Incremental filtering can be very promising for an unspecific subquery inside an otherwise specific And query, especially for large Le, Ge, Between and range subqueries. If the index is used in the normal way, a huge Or query is constructed for such subqueries, and even dm.incrementalsearch cannot fully optimize the search against this huge Or query. With incremental filtering, the index is not used in the normal way. Instead, the remaining And subqueries are used to produce a set of document candidates. These are then filtered by the filtering subquery, discarding documents that do not match it. Provided that the other And subqueries have already reduced the document set sufficiently, incremental filtering can save a lot of time.

You request incremental filtering for an (elementary) subquery with the filter keyword argument. Usually, you use it only for some subqueries of specific And queries. Otherwise, incremental filtering may not reduce but increase the query time (even considerably).

If you have more than a single filtering subquery in an And query, their order might be relevant for efficiency. Put the filtering subqueries that are likely to reduce the document set the most before the other filtering subqueries.

Incremental filtering requires that the affected index can be adapted to IFilterIndex; otherwise, the filter argument is ignored. In addition, you should consider the use of dm.incrementalsearch when you make significant use of incremental filtering. dm.incrementalsearch can globally optimize incremental filtering while otherwise only a local optimization is possible.
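
The difference between a normal range lookup and filtering can be sketched with plain sets and dictionaries. The data layout below (a mapping from key to document ids plus the reverse mapping) and the function names are assumptions made for illustration; real indexes use BTrees structures and the IFilterIndex adapter.

```python
def range_lookup(index, low, high):
    """Normal index use: union the document sets of every key in [low, high]."""
    result = set()
    for key, doc_ids in index.items():
        if low <= key <= high:
            result |= doc_ids
    return result


def filter_candidates(candidates, indexed_value, accept):
    """Filtering: test only the candidates, one indexed value at a time."""
    return set(d for d in candidates
               if d in indexed_value and accept(indexed_value[d]))


# Hypothetical data: key -> document ids, and the reverse mapping.
effective = {1: {10, 11}, 2: {12}, 3: {13, 14}, 9: {15}}
indexed_value = {d: k for k, ds in effective.items() for d in ds}
specific = {11, 15}   # candidates produced by the other, specific subqueries

via_lookup = specific & range_lookup(effective, 1, 3)
via_filter = filter_candidates(specific, indexed_value, lambda v: 1 <= v <= 3)
```

Both approaches give the same documents. The lookup touches every key in the range, while the filter does work proportional to the number of candidates, which is why filtering pays off only when the other subqueries are specific.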

Ranking

From version 2.0 on, AdvancedQuery supports incremental ranking. Ranking is a form of sorting. Therefore, you specify it as a sort spec. Ranking can be combined with other sort specs in the usual way (leading to multi-level sorting).

Like sorting in general, ranking is performed incrementally -- only as far as you actually look at the result. Therefore, although ranking in general is very expensive, its cost can be small if you only look at the first few (hundred) result objects (rather than several hundred thousand).

Currently, the ranking specifications RankByQueries_Sum and RankByQueries_Max are supported. In both cases, you call the constructors with one or more pairs (q, vq), i.e. with a sequence of weighted queries.
The rank of a document is the sum or the maximum of the weights for queries matching the document, respectively.
Note that the runtime behaviour for RankByQueries_Sum is exponential, that of RankByQueries_Max linear in the number of queries involved in the ranking.
Note that you probably want to normalize the document ranks. The ranking classes above have methods getQueryValueSum() and getQueryValueMax(), respectively, that can help with this.
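
The two rank definitions and the normalization can be sketched as follows. Queries are modelled as plain predicates on dictionaries, a document matching no query is assumed to get rank 0, and the normalization constant is assumed to be the sum (or maximum) of all weights, which is what getQueryValueSum() and getQueryValueMax() are taken to return here. All function names are hypothetical.

```python
def rank_sum(doc, weighted_queries):
    """Sum of the weights of all queries matching doc."""
    return sum(weight for query, weight in weighted_queries if query(doc))


def rank_max(doc, weighted_queries):
    """Largest weight among the queries matching doc (0 if none matches)."""
    weights = [weight for query, weight in weighted_queries if query(doc)]
    return max(weights) if weights else 0


def normalized(rank, upper_bound):
    """Map a rank in [0, upper_bound] into (0, 1]."""
    return (1.0 + rank) / (1.0 + upper_bound)


term = 'ranking'
weighted = [
    (lambda doc: term in doc.get('Subject', ()), 16),
    (lambda doc: term in doc.get('Title', ''), 8),
]
value_sum = sum(weight for _, weight in weighted)   # assumed getQueryValueSum()
value_max = max(weight for _, weight in weighted)   # assumed getQueryValueMax()

both = {'Title': 'incremental ranking', 'Subject': ('ranking', 'search')}
title_only = {'Title': 'incremental ranking', 'Subject': ('search',)}
```

With these weights, a document matching both queries has sum rank 24 and max rank 16, while a document matching only the Title query has rank 8 under both definitions.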

Examples

from Products.AdvancedQuery import Eq, Between, Le, Ge

# search for objects below 'a/b/c' with ids between 'a' and 'z~'
query = Eq('path','a/b/c') & Between('id', 'a', 'z~')

# evaluate and sort descending by 'modified' and ascending by 'Creator'
context.Catalog.evalAdvancedQuery(query, (('modified','desc'), 'Creator',))

# search 'News' not yet archived and 'File's not yet expired.
now = context.ZopeTime()
query = (Eq('portal_type', 'News') & ~ Le('ArchivalDate', now)
         | Eq('portal_type', 'File') & ~ Le('expires', now))
context.Catalog.evalAdvancedQuery(query)

# search 'News' containing 'AdvancedQuery' and filter out
# not yet effective or already expired documents.
query = Eq('portal_type', 'News') & Eq('SearchableText', 'AdvancedQuery') \
  & Ge('expires', now, filter=True) & Le('effective', now, filter=True)
context.Catalog.evalAdvancedQuery(query)

# search for 'ranking' in 'SearchableText' and rank very high
# when the term is in 'Subject' and high when it is in 'Title'.
# print the id and the normalized rank
from Products.AdvancedQuery import RankByQueries_Sum
term = 'ranking'
rs = RankByQueries_Sum((Eq('Subject', term),16), (Eq('Title', term),8))
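norm = 1.0 + rs.getQueryValueSum()  # float, so the division is not truncated
for r in context.Catalog.evalAdvancedQuery(
    Eq('SearchableText', term), (rs,)
    ):
    # data_record_score_ is a tuple with one component per sort spec
    print('%s %s' % (r.getId, (1 + r.data_record_score_[0]) / norm))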

Important note about caching

You must not cache the result of an AdvancedQuery unless you have ensured that sorting has finished (e.g. by accessing the last element in the result). This is because AdvancedQuery uses incremental sorting with BTrees iterators. Like any iterator, they do not tolerate changes to the base object during iteration. Nasty (apparently) non-deterministic errors can happen when the index changes during sorting.
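
The underlying problem is the general one of a lazy result that still reads from a live data structure. The sketch below reproduces it with an ordinary dictionary and a generator; it is an analogy only, since the real results iterate over BTrees, whose failure modes can be less obvious than the RuntimeError raised here. lazy_ids is a hypothetical helper.

```python
def lazy_ids(index):
    """Yield document ids while reading the live `index` mapping."""
    for key in index:
        for doc_id in index[key]:
            yield doc_id


index = {1: [10], 2: [20], 3: [30]}

# Safe: exhaust the lazy result before anything else can touch the index.
safe = list(lazy_ids(index))

# Unsafe: keep a half-consumed result around while the index changes.
result = lazy_ids(index)
first = next(result)
index[4] = [40]
try:
    list(result)
    failed = False
except RuntimeError:   # "dictionary changed size during iteration"
    failed = True
```

Fully consuming the result first, as in the safe variant, plays the same role as accessing the last element of an AdvancedQuery result before caching it.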

Download and installation

The current version supports Zope 4 (and above), is maintained on PyPI and can be pip installed. To use it, its configure.zcml must be "executed" at startup (which typically happens automatically).

For the use in Plone (version 5.2+), the companion package dm.plone.advancedquery must be installed and its configure.zcml "executed" at startup.

License

This software is open source and licensed under a BSD style license. See the license file in the distribution for details.

Optimizations

Earlier versions relied entirely on dm.incrementalsearch for optimizations. To reach the full potential, the indexes had to know about dm.incrementalsearch as well and use it for their lookup; likely only Products.ManagableIndex indexes did this. From version 4 on, optimizations no longer rely on dm.incrementalsearch (even though it is still used, if installed). Optimizations now rely on (conditional) adapters. In fact, (almost) the complete query evaluation is controlled via adapters -- and by overriding the package's adapters, you could (in principle) take complete control over the query evaluation. Most likely you will not do this, but you may register additional adapters to provide optimizations for new indexes.

Query evaluation

Query evaluation proceeds in the following steps:

  1. The query is "optimized" on the query level. For example, "and" subqueries in an "and" query are dissolved by moving their subqueries to the enclosing query; nested empty queries are eliminated; Generic queries are transformed into specific queries (if possible).
    You could e.g. define an adapter for this step to make the optimizations of CompositeIndex available.
  2. The query is transformed into an evaluation tree. The leaves of those trees are "Set"s, "Lookup"s or "Filter"s, the inner nodes correspond to "and", "or" and "not" combinations of the subtrees. To get the subtree corresponding to an index query (i.e. an elementary query with a parameter index), an adapter for the index and the query is looked up. If the index supports the query, then such an adapter (based on Products.PluginIndexes.interfaces.IPluggableIndex) is available. By defining a more specific adapter, the index's lookup can be "white boxed" by specifying how the lookup result is combined from more elementary lookups via "and", "or", "not" (and potentially "filter"). This allows for more optimizations over the case that the index is treated as a "black box".
  3. The evaluation tree is optimized -- using elementary properties of "and", "or" and "not".
  4. The evaluation tree is evaluated into a set of document ids -- the result set of the query.
  5. The result set is optionally sorted.
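
Steps 1 and 4 can be sketched on a toy tree representation. The tuple encoding ('and', ...), ('or', ...), ('not', child) with string leaves, and the names optimize and evaluate, are assumptions made for this illustration; the real evaluation trees and their adapters are considerably richer.

```python
def optimize(node):
    """Step 1 (partly): dissolve 'and' subqueries into the enclosing 'and'."""
    if not isinstance(node, tuple):
        return node                       # a leaf: the name of a lookup
    op = node[0]
    children = [optimize(child) for child in node[1:]]
    if op == 'and':
        flat = []
        for child in children:
            if isinstance(child, tuple) and child[0] == 'and':
                flat.extend(child[1:])    # move subqueries one level up
            else:
                flat.append(child)
        children = flat
    return (op,) + tuple(children)


def evaluate(node, lookups, universe):
    """Step 4: evaluate a tree into a set of document ids."""
    if not isinstance(node, tuple):
        return set(lookups[node])
    op = node[0]
    sets = [evaluate(child, lookups, universe) for child in node[1:]]
    if op == 'and':
        result = set(universe)            # an empty 'and' matches everything
        for s in sets:
            result &= s
        return result
    if op == 'or':
        result = set()                    # an empty 'or' matches nothing
        for s in sets:
            result |= s
        return result
    if op == 'not':
        return set(universe) - sets[0]
    raise ValueError('unknown operator: %r' % (op,))


universe = {1, 2, 3, 4, 5}
lookups = {'news': {1, 2, 3}, 'recent': {2, 3, 4}, 'archived': {3}}
tree = optimize(('and', 'news', ('and', 'recent', ('not', 'archived'))))
doc_ids = evaluate(tree, lookups, universe)
```

After optimization the nested "and" has become a single three-way "and", and evaluation yields the documents that are news, recent and not archived.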

Supporting a new index

AdvancedQuery should be able to work with any index implementing Products.PluginIndexes.interfaces.IPluggableIndex out of the box. No index specific configuration should be necessary for search features also supported by ZCatalog.

If AdvancedQuery extensions should be supported for the new index (e.g. filtering or matching) or if searches involving the index should benefit from index specific optimizations, then it might become necessary to register corresponding adapters for the new index. Those adapters would typically have as "provided" interface IQueryNodeOptimizer, IQueryConverter, IFilterIndex, IIndexedValue, IMultiplicityAware, ITermValueMatch, IIndexed, IKeyedIndex, IKeyNormalizingIndex, ILookupIndex, or ILookupTreeIndex, all defined in Products.AdvancedQuery.eval.interfaces. It is typically not necessary to define adapters for all those interfaces. For example, the IQueryNodeOptimizer adapter is necessary only when the index wants to perform optimizations on the query level (as e.g. CompositeIndex does). IFilterIndex, IIndexedValue, IMultiplicityAware and ITermValueMatch may be relevant for filtering. IMultiplicityAware is used in the optimization of not, if available. IIndexed is required for an index when the Indexed query should be supported for this index. IKeyedIndex is typically required for the matching queries and is used for optimized conversions of Le, Ge and Between queries. If the new index normalizes its search terms and you define an IKeyedIndex or IFilterIndex adapter, then an IKeyNormalizingIndex adapter is likely required. The "Lookup" and IQueryConverter adapters are always optional and used for optimizations; typically, at most one of those would be defined for an index.

There are roughly two cases:

  1. The index is fairly simple. Then there is a good chance that all AdvancedQuery extensions can be supported. One would register adapters for the "provided" interfaces IFilterIndex, IIndexedValue, IMultiplicityAware, IIndexed, IKeyedIndex, IKeyNormalizingIndex and optionally for ILookupIndex or ILookupTreeIndex. Many of those adapters could be taken over from those for UnIndex. Examples are in Products.AdvancedQuery.eval.adapter.*index.
  2. The index is fairly complex. In this case, one would likely do without AdvancedQuery extensions (such as filtering, Indexed queries, ...) and either define no adapter at all or define one or several IQueryConverter adapters for this index. Examples are in Products.AdvancedQuery.eval.adapter.query.converter.*index.

Conditional adaptation

Wherever this documentation speaks of adaptation, it actually means "conditional adaptation". A conditional adapter is a zope.interface "subscription adapter", usually with an associated condition. Products.AdvancedQuery.eval.adapter contains functions to define and look up conditional adapters as well as typical conditions.

The new concept "conditional adapter" is necessary because Zope's standard adapter concept makes assumptions not valid in our context. For example, an adapter defined for an index I would be considered adequate for any index J inheriting from I unless this adapter was overridden by another adapter registered for index K inheriting from I and either J is K or inherits from it. The adapters employed by AdvancedQuery for an index I are typically not adequate for all indexes J inheriting from I. If AdvancedQuery used "normal" adapters, then such an index J would require the registration of an adequate overriding adapter for J; otherwise search results involving J could be wrong. As Zope's index system is open (flexibly extendable), the risk would be too great. Therefore, AdvancedQuery uses conditional adapters with a condition typically of the form "applicable to index I and derived indexes provided they do not override any of the following methods". A conditional adapter is looked up like a "normal" adapter with the exception that non-applicable adapters are skipped. This makes it possible for a more general adapter to override a more specific one -- provided that the latter is not applicable.
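
The lookup rule can be sketched without zope.interface. The registry, the function names and the index classes below are hypothetical and only model the idea of "most specific applicable adapter wins"; the actual helpers live in Products.AdvancedQuery.eval.adapter and are built on subscription adapters. The method identity check assumes Python 3.

```python
class FieldIndexLike(object):
    """Hypothetical index class with a lookup method."""
    def lookup(self, value):
        return set()


class Derived(FieldIndexLike):
    """Inherits lookup unchanged."""


class Overriding(FieldIndexLike):
    """Overrides lookup, so adapters written for the base may be wrong."""
    def lookup(self, value):
        return set([value])


def does_not_override(base, *method_names):
    """Condition: the index class still uses base's methods (Python 3)."""
    def condition(index):
        cls = type(index)
        return all(getattr(cls, name) is getattr(base, name)
                   for name in method_names)
    return condition


REGISTRY = []   # (index class, condition, adapter)


def register(index_class, condition, adapter):
    REGISTRY.append((index_class, condition, adapter))


def lookup_adapter(index):
    """Walk the class hierarchy, skipping adapters that are not applicable."""
    for cls in type(index).__mro__:
        for registered_class, condition, adapter in REGISTRY:
            if registered_class is cls and condition(index):
                return adapter
    return None


register(object, lambda index: True, 'generic black-box adapter')
register(FieldIndexLike, does_not_override(FieldIndexLike, 'lookup'),
         'optimized adapter')
```

A Derived instance gets the optimized adapter, while an Overriding instance falls through to the generic one because the more specific adapter's condition fails.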


Dieter Maurer
Last modified: Mon Apr 22 07:50:26 CEST 2019