ManagableIndex can mean different things for different people. For a content manager, it brings flexible field, keyword and efficient range indexes manageable via the ZMI to portal_type to the Subject index transforming it into a sequence (as required by a keyword index). For a developer, ManagableIndex provides a framework for index definition, improving on PluginIndexes. It provides for manageability, automatically and intelligently handles unindexing when an object is reindexed and implements "and", "or" and range queries (for not too complex indexes).
The main tasks of an index are to index objects under a set of index terms derived from the object and efficiently locate all objects indexed under a given term.
Indexing consists of 3 stages: evaluation of the object to obtain the object's value (with respect to the index), deriving index terms from the value and storing the association term/object in a way such that the objects associated with a term can quickly be retrieved.
Evaluation is specified by a sequence of ValueProviders associated with the index.
A ValueProvider is a device that returns a value for an object. If the return value is not None, then it is interpreted as the object's value with respect to this ValueProvider.
A ValueProvider can have an associated IgnorePredicate, a TALES expression. When the IgnorePredicate yields true for a value, then it is ignored. You may e.g. specify that empty strings should be ignored.
A ValueProvider can have an associated Normalizer, a TALES expression. The Normalizer is applied to values that are not ignored to transform them into a standard form; e.g. the normalizer for a keyword index can transform a string into a one element tuple containing the string, as a keyword index requires a sequence value.
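The evaluation stage can be pictured with a minimal sketch in plain Python. It is not ManagableIndex's implementation: the dict keys lookup, ignore and normalize, the Document class and the evaluate function are hypothetical stand-ins, and ordinary callables take the place of the TALES expressions.

```python
def evaluate(obj, providers):
    """Return the normalized, non-ignored values the providers yield for obj."""
    values = []
    for provider in providers:
        value = provider["lookup"](obj)
        if value is None:
            continue  # None means: no value with respect to this provider
        ignore = provider.get("ignore")
        if ignore is not None and ignore(value):
            continue  # the ignore predicate rejected the value
        normalize = provider.get("normalize")
        if normalize is not None:
            value = normalize(value)  # bring the value into a standard form
        values.append(value)
    return values


class Document:
    """Hypothetical object to be indexed."""

    def __init__(self, subject):
        self.subject = subject


# Hypothetical provider: look up "subject", ignore empty strings and
# wrap a plain string into a one element tuple (keyword index style).
subject_provider = {
    "lookup": lambda obj: getattr(obj, "subject", None),
    "ignore": lambda value: value == "",
    "normalize": lambda value: (value,) if isinstance(value, str) else tuple(value),
}

print(evaluate(Document("zope"), [subject_provider]))
print(evaluate(Document(""), [subject_provider]))
```

The first call yields one value, the tuple containing "zope"; the second yields no value at all because the empty string is ignored.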
The most essential ValueProviders are AttributeLookups. An AttributeLookup determines the object's value through the lookup of an attribute. The AttributeLookup's configuration determines whether acquisition can be used, whether a callable value should be called and whether exceptions should be ignored.
ExpressionEvaluators are another type of ValueProvider. These are TALES expressions defining an object's value. ExpressionEvaluators often make it unnecessary to define simple scripts just for indexing purposes.
Warning: Until Zope 2.7, TALES expressions were trusted when used outside of Zope; from Zope 2.8 on, TALES expression evaluation is always subject to Zope's security restrictions, even when used outside of Zope. This may have strange consequences when you perform index management (e.g. mass reindexing) in an external script (run outside of Zope). In such cases, you should probably let the script log in as a Zope user with sufficient privileges.
When an index has several ValueProviders, their values must be combined to define the object's value. None and ignored values are always ignored and are not combined.
Currently, there are three combiners:
useFirst
union
aggregate (see RangeIndex)
Once the object's value (with respect to the index) has been determined, the value is split into a set of index terms in an index specific way (e.g. a FieldIndex uses the value directly, a KeywordIndex requires the value to be a sequence and uses its elements as index terms, and a TextIndex splits the value into words and uses the words).
The terms are then standardized.
Standardization consists of several steps: prenormalization such as case normalization, stemming, phonetic normalization, ...; elimination of stop terms; normalization such as type conversion, type checking and copying.
An index can define a term prenormalizer, a TALES expression. The prenormalizer is applied to terms before term expansion in queries and always as the first step of normalization during indexing. It can be used e.g. for case normalization, stemming, phonetic normalization, synonym normalization, ...
An index can define a StopTermPredicate, a TALES expression. When the predicate yields true for a term, the term is ignored.
An index can define a NormalizeTerm TALES expression. The expression can be used to transform the term into some standard form, e.g. convert a descriptive string or complex object into a code used for efficient indexing.
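The first three standardization steps can be sketched as a small pipeline. This is only an illustration under assumptions: standardize, STOP_TERMS and COLOR_CODES are hypothetical names, and plain Python callables stand in for the TALES expressions.

```python
def standardize(terms, prenormalize=None, is_stop_term=None, normalize=None):
    """Apply prenormalization, stop term elimination and normalization in order."""
    result = []
    for term in terms:
        if prenormalize is not None:
            term = prenormalize(term)  # e.g. case normalization
        if is_stop_term is not None and is_stop_term(term):
            continue  # stop terms are not indexed
        if normalize is not None:
            term = normalize(term)  # e.g. map a descriptive string to a code
        result.append(term)
    return result


STOP_TERMS = {"the", "a", "of"}          # hypothetical stop terms
COLOR_CODES = {"red": 1, "green": 2}     # hypothetical string -> code table

codes = standardize(
    ["The", "Red", "GREEN"],
    prenormalize=str.lower,
    is_stop_term=STOP_TERMS.__contains__,
    normalize=lambda term: COLOR_CODES.get(term, 0),
)
print(codes)
```

Because the prenormalizer runs first, the stop term check sees "the" rather than "The", and the normalizer only has to know the lower case spellings.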
The BTrees used in the index implementation require that any index term be persistently comparable to any other index term in this index (otherwise, the index gets corrupted). To help maintain this essential property, the index term's type can be restricted.
The following type restrictions are defined:
not checked
numeric (int, float or long)
string
ustring: TermTypeExtra can specify an encoding used for conversion.
integer
DateTime: DateTime object or float -- the index stores the date as a float (seconds since epoch).
DateTimeInteger: DateTime object or integer -- the index stores the date as an integer (seconds since epoch; truncated, if necessary).
DateInteger: DateTime object or integer -- the index stores an integer representing the date (400 * year + 31 * (month-1) + (day-1)).
tuple with tuplespec in TermTypeExtra: tuplespec uses n (numeric), s (string), u (unicode string) and d (datetime). An encoding for unicode conversion can be specified after tuplespec, separated by ;.
instance with fullClassName in TermTypeExtra: the class is expected to have a __cmp__ method. It is assumed without check that this __cmp__ implements a persistent comparison.
expression checked with TermTypeExtra specifying checking TALES expression: if the expression returns None or raises an exception, the type is considered unacceptable; otherwise the value is used instead of the original term.
All checkers try to convert a term into an acceptable type. Only when this fails, an exception is raised.
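The DateInteger encoding quoted above is easy to check by hand. The following sketch applies the formula to a standard library date; using datetime.date instead of Zope's DateTime, and the function name date_integer, are assumptions made only to keep the example self-contained.

```python
from datetime import date


def date_integer(d):
    """Encode a date as 400 * year + 31 * (month-1) + (day-1)."""
    return 400 * d.year + 31 * (d.month - 1) + (d.day - 1)


print(date_integer(date(2004, 3, 15)))
# The encoding preserves the order of dates, also across a year boundary.
print(date_integer(date(2004, 12, 31)) < date_integer(date(2005, 1, 1)))
```

The month and day part is at most 31 * 11 + 30 = 371, which stays below 400, so two different dates never share an integer and the integers sort like the dates.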
If you choose one of the integer types, i.e. integer, DateTimeInteger or DateInteger, an especially efficient index type is built. You must clear and reindex the index when the index type is changed. You must clear the index even when it is already empty (because the data structures may need to be changed).
It is dangerous to index mutable types. If an indexed mutable object is later changed, its ordering with respect to other indexed objects may change, corrupting the index. Corruption of this kind can be avoided when the mutable object is copied before it is stored in the index. TermCopy specifies whether the value should be directly used, shallow copied or deep copied.
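The difference between the three choices can be sketched with the standard copy module. The function copy_term and the mode names "none", "shallow" and "deep" are hypothetical; the actual TermCopy values should be taken from the product itself.

```python
import copy


def copy_term(term, mode="none"):
    """Return the term as it would be stored: as is, shallow or deep copied."""
    if mode == "none":
        return term
    if mode == "shallow":
        return copy.copy(term)
    if mode == "deep":
        return copy.deepcopy(term)
    raise ValueError("unknown copy mode: %r" % (mode,))


original = [3, [1, 2]]
direct = copy_term(original, "none")
shallow = copy_term(original, "shallow")
deep = copy_term(original, "deep")

# Later modification of the indexed object:
original.append(4)       # changes the outer list
original[1].append(99)   # changes the nested list

print(direct)    # follows every change
print(shallow)   # protected against the outer change only
print(deep)      # unaffected
```

A shallow copy is sufficient when the term is a flat sequence of immutable values; nested mutable values need the deep copy.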
Currently, ManagableIndex supports 3 types of indexes: FieldIndex, KeywordIndex and RangeIndex. Further types can easily be implemented.
FieldIndex
A FieldIndex indexes an object under a single term, the object's value with respect to the index.
You get efficient DateTime and Date indexes by selecting DateTime (stored as float), DateTimeInteger (stored as integer) or DateInteger (stored as integer) as TermType. You must clear and reindex when you change the type.
You can very efficiently sort search results via a FieldIndex. The ZCatalog's method searchResults (aka __call__) supports this via its sort_on argument. My AdvancedQuery product supports arbitrary levels of sorting via FieldIndexes.
AdvancedQuery's FieldIndex based ascending sorting is much more efficient than descending sorting. To make descending sorting more efficient, you can tell your FieldIndex to maintain the sort keys (also) in reverse order. This will slow indexing down a bit but make descending sorts much faster. When you change this attribute, you must clear and reindex the index (otherwise, the reverse order is not consistent).
KeywordIndex
A KeywordIndex expects the value of an object to be a sequence. It indexes the object under each element of this sequence.
RangeIndex
A RangeIndex expects an object's value to be a pair specifying an interval low to high. The index can efficiently locate documents for which a given term t lies within the document's low and high bounds.
The object's value can either be constructed by a single attribute or with an aggregate combiner.
To provide for a partial plugin replacement for CMF's effective and expires indexes, RangeIndex supports a Boundary names property. If set, it should be a pair of names low and high. The index will then execute queries of the form low <= val <= high. To be compatible with AdvancedQuery, the index replacing effective and expires should have the name ValidityRange.
RangeIndex efficiently supports improper ranges, i.e. those where at least one boundary is unlimited. You use its Minimal value and Maximal value properties to define which values should be considered as unlimited. These properties are TALES valued and are evaluated when the index is cleared. All values at or below (the value of) Minimal value are identified and interpreted as no lower limit; similarly, all values at or above (the value of) Maximal value are identified and interpreted as no upper limit.
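The query semantics, including the treatment of improper ranges, can be sketched as follows. This shows only what is matched, not how: the real index is organized for efficient lookup, whereas this sketch scans all ranges. The names normalize_range and matching and the use of infinities for "no limit" are assumptions of the sketch.

```python
NO_LOWER_LIMIT = float("-inf")
NO_UPPER_LIMIT = float("inf")


def normalize_range(low, high, minimal, maximal):
    """Map bounds at or beyond the minimal/maximal value to 'no limit'."""
    if low <= minimal:
        low = NO_LOWER_LIMIT
    if high >= maximal:
        high = NO_UPPER_LIMIT
    return (low, high)


def matching(ranges, val):
    """Return the ids of all documents with low <= val <= high (linear scan)."""
    return sorted(doc_id for doc_id, (low, high) in ranges.items()
                  if low <= val <= high)


MINIMAL, MAXIMAL = 0, 10 ** 9   # hypothetical Minimal value / Maximal value
ranges = {
    1: normalize_range(100, 200, MINIMAL, MAXIMAL),
    2: normalize_range(0, 150, MINIMAL, MAXIMAL),        # no lower limit
    3: normalize_range(180, 10 ** 9, MINIMAL, MAXIMAL),  # no upper limit
}

print(matching(ranges, 120))
print(matching(ranges, 10 ** 12))
print(matching(ranges, -5))
```

Document 3 matches even a value far above Maximal value, and document 2 matches a value below Minimal value, because their respective bounds were identified with "no limit" when they were stored.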
The boolean property Organisation 'high-then-low' controls the index organisation. With high-then-low organisation, the high index is primary and the low index is subordinate; low-then-high indicates the opposite organisation. You should use high-then-low when val <= high is less likely than low <= val. This is to be expected for date ranges and typical queries against them.
WordIndex
A WordIndex indexes an object under a set of words. It uses a ZCTextIndex.PLexicon for splitting a text into a sequence of words.
A WordIndex lies between a KeywordIndex and a TextIndex. Like a TextIndex, it uses a Lexicon to split values into a sequence of words, stores integer word ids in its index and does not support range queries. Like a KeywordIndex, it indexes an object under a set of words -- no near or phrase queries and no relevancy ranking. Due to these restrictions, a WordIndex is very efficient with respect to transaction and load size.
The motivation to implement a WordIndex came from my observation that almost all our ZEO loads above 100ms were caused by loading large word frequency IOBuckets used by TextIndexes for relevancy ranking -- a feature we do not need and use. Many of these loads transferred buckets of several tens of kB and a considerable portion of them took more than a second. As word frequency information is necessary for each document in a hit, you can imagine how fast our queries were.
On the other hand, a WordIndex is only useful when you use a flexible query framework, such as e.g. AdvancedQuery or CatalogQuery. The standard ZCatalog framework is too weak as it does not allow several subqueries against the same index in a query. That's the reason why TextIndexes come with their own query parser. I live in Germany and therefore the standard query parser is useless for us (we use und, oder, nicht instead of and, or and not) and I have AdvancedQuery -- thus I did not care to give the new WordIndex a query parser. You could easily provide one -- should you feel a need for it.
SimpleTextIndex
A simple text index can perform word or phrase queries. It uses a ZCLexicon-like lexicon to parse text into a word sequence.
Unlike almost all Zope text indexes, it does not have a built in query parser. If you need more complex queries, use something like AdvancedQuery to specify them. Moreover, SimpleTextIndex does not support ranking.
Search terms are either a text or a sequence of ints. A text is converted via the lexicon into a sequence of ints. If an int sequence is passed in, then it is assumed that the text -> wordid conversion was performed outside. The query is either interpreted as an and query of the given words or as a phrase query, dependent on the phrase option.
For phrase queries, the use of IncrementalSearch2 is strongly recommended as it drastically speeds up phrase queries.
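The two query interpretations can be sketched on word id sequences. Everything here is a toy model under assumptions: ToyLexicon with its word_ids method is a hypothetical stand-in for the real lexicon, and matches checks a single document by scanning, which is not how the index evaluates queries.

```python
class ToyLexicon:
    """Hypothetical lexicon: splits on whitespace and assigns integer word ids."""

    def __init__(self):
        self._ids = {}

    def word_ids(self, text):
        return [self._ids.setdefault(word, len(self._ids) + 1)
                for word in text.lower().split()]


def matches(doc_ids, query, lexicon, phrase=False):
    """Check one document's word id sequence against a text or int sequence."""
    if isinstance(query, str):
        ids = lexicon.word_ids(query)   # text -> word ids via the lexicon
    else:
        ids = list(query)               # conversion was performed outside
    if not phrase:
        return set(ids) <= set(doc_ids)  # "and" query: all words occur
    n = len(ids)
    # phrase query: the words occur consecutively and in order
    return any(doc_ids[i:i + n] == ids for i in range(len(doc_ids) - n + 1))


lexicon = ToyLexicon()
doc = lexicon.word_ids("the quick brown fox")

print(matches(doc, "fox quick", lexicon))                 # and query
print(matches(doc, "fox quick", lexicon, phrase=True))    # phrase query
print(matches(doc, "quick brown", lexicon, phrase=True))  # phrase query
```

The same two words satisfy the and query in any order, while the phrase query additionally requires adjacency and order.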
PathIndex
A PathIndex indexes an object under a path. A path is either a tuple or a '/' separated string. The index supports path queries. For a given path, it locates objects that contain this path as a subpath in their path value. Two search parameters, level and depth, control where in the object's path the given path may occur.
level and depth can each be either None or an integer. When level and depth are both greater than or equal to 0, an object with path op matches path p with respect to level and depth iff op = p1 p p2 (the concatenation of p1, p and p2) with len(p1) = level and len(p2) = depth. This means that level controls p's distance from the beginning of op and depth its distance from the end. A None value means that there is no restriction for the respective side. A negative value means that the distance from the respective side may be up to (and including) the negated value. E.g. a "-2" allows up to 2 segments.
The default value for level is 0, that for depth is None.
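The matching rule for level and depth can be written down directly. The sketch below is a brute force check for a single object path, not the index's lookup algorithm; the helper names are hypothetical, and dropping empty path segments (such as the one produced by a leading '/') is an assumption of the sketch.

```python
def as_tuple(path):
    """Accept a tuple or a '/' separated string; empty segments are dropped."""
    if isinstance(path, str):
        path = path.split("/")
    return tuple(segment for segment in path if segment)


def distance_allowed(distance, limit):
    """None: no restriction; >= 0: exact distance; < 0: up to -limit segments."""
    if limit is None:
        return True
    if limit >= 0:
        return distance == limit
    return distance <= -limit


def path_matches(op, p, level=0, depth=None):
    """True iff op = p1 p p2 with len(p1), len(p2) allowed by level, depth."""
    op, p = as_tuple(op), as_tuple(p)
    n = len(p)
    for start in range(len(op) - n + 1):
        if (op[start:start + n] == p
                and distance_allowed(start, level)
                and distance_allowed(len(op) - start - n, depth)):
            return True
    return False


print(path_matches("/a/b/c", "a/b"))             # default level 0: prefix match
print(path_matches("/a/b/c", "b"))               # "b" is not at level 0
print(path_matches("/a/b/c", "b", level=1))      # exactly one segment before
print(path_matches("/a/b/c", "c", level=-2))     # up to two segments before
print(path_matches("/a/b/c", "b/c", level=None, depth=0))  # must end the path
```

With the defaults (level 0, depth None), the query path must be a prefix of the object's path, which corresponds to the usual "everything below this folder" search.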
Note that getPhysicalPath requires acquisition to work properly. You must not set the acquisition type for such a value provider to none.
Most string and unicode based indexes (exception: RangeIndex) support regular expression and glob matching. You use the query option match with a value of 'glob' or 'regexp' to call for this feature.
Note that for large indexes, the match term should begin with an (easily recognizable) plain text string. Otherwise, this query type can be very inefficient.
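Why a plain text prefix matters can be illustrated on a sorted list of terms. The sketch is not ManagableIndex's implementation; it only assumes that terms are kept in sorted order (as in a BTree), so that a literal prefix narrows the candidates to one contiguous slice, while a pattern starting with a wildcard forces a scan of all terms. The function names are hypothetical.

```python
import bisect
import fnmatch
import re


def literal_prefix(pattern):
    """Return the part of a glob pattern before its first wildcard character."""
    for position, char in enumerate(pattern):
        if char in "*?[":
            return pattern[:position]
    return pattern


def glob_terms(sorted_terms, pattern):
    """Return the terms matching the glob pattern, using the prefix if any."""
    matcher = re.compile(fnmatch.translate(pattern)).match
    prefix = literal_prefix(pattern)
    if prefix:
        # All terms sharing the prefix are contiguous in sorted order.
        low = bisect.bisect_left(sorted_terms, prefix)
        high = low
        while high < len(sorted_terms) and sorted_terms[high].startswith(prefix):
            high += 1
        candidates = sorted_terms[low:high]
    else:
        candidates = sorted_terms  # no usable prefix: every term is inspected
    return [term for term in candidates if matcher(term)]


terms = sorted(["apple", "apply", "banana", "band", "bandit"])
print(glob_terms(terms, "ban*"))
print(glob_terms(terms, "*ly"))
```

For "ban*" only the three terms starting with "ban" are tested against the pattern; for "*ly" all five terms must be tested.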
ManagableIndex uses TALES expressions in many places. They can always use the predefined variables:
index
catalog
root
modules
nothing (None)
value
object
If the TALES expression evaluates to a callable object, then this is called on the value and the result used; otherwise, the evaluation result is used directly.
The module Utils contains some useful functions: convertToDateTime(value), convertToDateTimeInteger(value, exc=0) and convertToDateInteger(value, round_dir=-1). Please see the source documentation for details.
A primary goal for ManagableIndex is the easy and flexible configuration via the Zope Management Interface (ZMI). Occasionally, however, you want to create your ManagableIndexes programmatically and want to configure them in the same step. The ZCatalog's [manage_]addIndex allows you to pass configuration information down to the index construction in the extra argument.
For the creation of a ManagableIndex, extra must have a dict value. Its key 'ValueProviders' determines the index's value providers. All other keys can provide values for the index's properties.
If the key 'ValueProviders' is present, its value must be a sequence of value provider specifications which define the value providers for the index. If the key is not present, a default value provider is created.
A value provider specification is a dict with the mandatory keys 'id' and 'type'. type specifies the value provider's type ('AttributeLookup' or 'ExpressionEvaluator'). Additional keys can define values for the value provider's properties.
If the configuration dicts contain keys different from the above mentioned special ones and not corresponding to a property, a ValueError exception is raised.
Look for the _properties definitions in the source to determine the available properties for the various types of objects.
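A configuration dict of the described shape might look like the following sketch. Several details are assumptions and should be checked against the _properties definitions: that the index property is spelled 'TermType' and accepts the value 'DateTimeInteger', and the index type name in the commented addIndex call. The helper check_provider_specs is hypothetical; it only mirrors the rule about the mandatory keys and is not part of the product.

```python
KNOWN_PROVIDER_TYPES = ("AttributeLookup", "ExpressionEvaluator")


def check_provider_specs(extra):
    """Hypothetical pre-check of the 'ValueProviders' part of an extra dict."""
    for spec in extra.get("ValueProviders", ()):
        missing = {"id", "type"} - set(spec)
        if missing:
            raise ValueError("missing mandatory keys: %s" % sorted(missing))
        if spec["type"] not in KNOWN_PROVIDER_TYPES:
            raise ValueError("unknown value provider type: %r" % (spec["type"],))


extra = {
    # One value provider looking up the attribute "modified".
    "ValueProviders": [
        {"id": "modified", "type": "AttributeLookup"},
    ],
    # Index property (assumed spelling of name and value).
    "TermType": "DateTimeInteger",
}

check_provider_specs(extra)

# Inside Zope, the dict would then be handed to the catalog, e.g.:
# catalog.addIndex("modified", "Managable FieldIndex", extra)  # type name assumed
```

Checking the dict before calling addIndex is optional; the index itself rejects unknown keys with a ValueError as described above.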