10. Full Text Indexing in CubicWeb

When an attribute is tagged as fulltext-indexable in the datamodel, CubicWeb automatically triggers hooks that update the internal fulltext index (i.e. the appears SQL table) each time this attribute is modified.

CubicWeb also provides a db-rebuild-fti command to rebuild the whole fulltext index on demand:

cubicweb@esope~$ cubicweb-ctl db-rebuild-fti my_tracker_instance

You can also rebuild the fulltext index for a given set of entity types:

cubicweb@esope~$ cubicweb-ctl db-rebuild-fti my_tracker_instance Ticket Version

In the above example, only the fulltext index of the Ticket and Version entity types will be rebuilt.

10.1. Standard FTI process

Considering an entity type ET, the default FTI process is to:

  1. fetch all entities of type ET
  2. for each entity, adapt it to IFTIndexable (see IFTIndexableAdapter)
  3. call get_words() on the adapter, which is expected to return a dictionary mapping each weight to a list of words, in the form expected by index_object(). Each attribute value is tokenized by tokenize().

See IFTIndexableAdapter for more documentation.
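The three steps above can be sketched in plain Python, without a CubicWeb installation. Everything in this sketch is a hypothetical stand-in: FakeEntity, FakeIFTIndexable, the regex-based tokenize() and build_index() only mimic the shape of the process, and the real tokenize() and index_object() may behave differently. The 'C' weight is an assumption made for the sketch.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


class FakeEntity:
    """Stand-in for an entity: an eid plus its fulltext-indexed attributes."""

    def __init__(self, eid, **attributes):
        self.eid = eid
        self.attributes = attributes


class FakeIFTIndexable:
    """Stand-in for the IFTIndexable adapter."""

    weight = "C"  # assumed weight for every attribute in this sketch

    def __init__(self, entity):
        self.entity = entity

    def get_words(self):
        # Return a dictionary mapping a weight to a list of words.
        words = defaultdict(list)
        for value in self.entity.attributes.values():
            words[self.weight].extend(tokenize(value))
        return dict(words)


def build_index(entities):
    """Iterate over entities, adapt each one, and collect its words."""
    index = {}
    for entity in entities:                      # 1. the fetched entities
        adapter = FakeIFTIndexable(entity)       # 2. adapt to IFTIndexable
        index[entity.eid] = adapter.get_words()  # 3. weight -> list of words
    return index


tickets = [
    FakeEntity(1, title="Crash on login"),
    FakeEntity(2, title="Update docs", description="Fix typos"),
]
index = build_index(tickets)
```

In this toy model the index is a plain dictionary keyed by eid; in CubicWeb the result of get_words() is handed to index_object() instead.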

10.2. Yams and fulltext_container

It is possible in the datamodel to indicate that fulltext-indexed attributes defined for an entity type will be used to index not the entity itself but a related entity. This is especially useful for composite entities. Let’s take a look at (a simplified version of) the base schema defined in CubicWeb (see cubicweb.schemas.base):

from yams.buildobjs import EntityType, RelationDefinition, String, Password
from cubicweb.schema import WorkflowableEntityType


class CWUser(WorkflowableEntityType):
    login     = String(required=True, unique=True, maxsize=64)
    upassword = Password(required=True)


class EmailAddress(EntityType):
    address = String(required=True, fulltextindexed=True,
                     indexed=True, unique=True, maxsize=128)


class use_email_relation(RelationDefinition):
    name = 'use_email'
    subject = 'CWUser'
    object = 'EmailAddress'
    cardinality = '*?'
    composite = 'subject'

The schema above states that there is a relation between CWUser and EmailAddress and that the address field of EmailAddress is fulltext indexed. Therefore, in your application, if you use fulltext search to look for an email address, CubicWeb will return the EmailAddress itself. But the objects we’d like to index are more likely to be the associated CWUser than the EmailAddress itself.

The simplest way to achieve that is to tag the use_email relation in the datamodel, declaring its subject (the CWUser) as the fulltext container:

class use_email(RelationType):
    fulltext_container = 'subject'
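To make the effect of this tag concrete, here is a toy model in plain Python. It is only a sketch: container_of(), the string identifiers, and the triple-based representation of relations are hypothetical and do not reflect how CubicWeb resolves containers internally.

```python
def container_of(entity, relations, containers):
    """Return the entity that should receive `entity`'s indexed words.

    relations:  list of (subject, relation_name, object) triples.
    containers: maps a relation name to 'subject' or 'object', in the
                spirit of the fulltext_container property.
    """
    for subj, rname, obj in relations:
        role = containers.get(rname)
        if role == "subject" and obj == entity:
            return subj  # the subject is the container of the object
        if role == "object" and subj == entity:
            return obj   # the object is the container of the subject
    return entity        # no container declared: the entity indexes itself


relations = [("CWUser#1", "use_email", "EmailAddress#10")]

# With the tag, the words of the email address are attached to the user.
target = container_of("EmailAddress#10", relations, {"use_email": "subject"})

# Without the tag, the email address is indexed on its own.
untagged = container_of("EmailAddress#10", relations, {})
```

With the tag in place, a fulltext search on an email address leads to the CWUser rather than to the EmailAddress.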

10.3. Customizing how entities are fetched during db-rebuild-fti

db-rebuild-fti will call the cw_fti_index_rql_queries() class method on your entity type.

classmethod AnyEntity.cw_fti_index_rql_queries(req)

Return the list of RQL queries used to fetch the entities to fulltext-index.

The default is to fetch all entities at once and to prefetch their indexable attributes. You could instead iterate over smaller result sets if the table is very big, or return only the subset of entities that match some business-logic condition.

If you have a huge table to index, you probably don’t want to fetch all entities at once. Here is a simple customized example that processes blocks of 10000 entities:

from cubicweb.entities import AnyEntity


class MyEntityClass(AnyEntity):
    __regid__ = 'MyEntityClass'

    @classmethod
    def cw_fti_index_rql_queries(cls, req):
        # take the default RQL query and insert LIMIT / OFFSET instructions
        base_rql = super(MyEntityClass, cls).cw_fti_index_rql_queries(req)[0]
        selected, restrictions = base_rql.split(' WHERE ')
        rql_template = '%s ORDERBY X LIMIT %%(limit)s OFFSET %%(offset)s WHERE %s' % (
            selected, restrictions)
        # count how many entities you'll have to index
        count = req.execute('Any COUNT(X) WHERE X is MyEntityClass')[0][0]
        # iterate by blocks of 10000 entities
        chunksize = 10000
        for offset in range(0, count, chunksize):
            yield rql_template % {'limit': chunksize, 'offset': offset}

Since you have access to req, you can more or less fetch whatever you want.
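To check what such a method produces without running an instance, the string manipulation can be isolated in a standalone function. This is a sketch only: chunked_rql_queries() is a hypothetical helper, and the base query passed to it is an assumed example of what the default implementation might return.

```python
def chunked_rql_queries(base_rql, count, chunksize=10000):
    """Yield one RQL query per block of `chunksize` entities.

    base_rql is expected to look like 'Any X, ... WHERE ...'; the
    ORDERBY / LIMIT / OFFSET clauses are inserted before WHERE.
    """
    selected, restrictions = base_rql.split(" WHERE ", 1)
    for offset in range(0, count, chunksize):
        yield "%s ORDERBY X LIMIT %d OFFSET %d WHERE %s" % (
            selected, chunksize, offset, restrictions)


# Assumed base query for a Ticket type with one indexed attribute.
queries = list(chunked_rql_queries(
    "Any X, T WHERE X is Ticket, X title T", count=25000))
for rql in queries:
    print(rql)
```

Ordering by X keeps the blocks stable from one query to the next, so that no entity is skipped or fetched twice as long as the table does not change during the rebuild.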

10.4. Customizing get_words()

You can also customize the FTI process by providing your own get_words() implementation:

from cubicweb.entities.adapters import IFTIndexableAdapter
# is_instance lives in cubicweb.selectors in releases before 3.17
from cubicweb.predicates import is_instance


class SearchIndexAdapter(IFTIndexableAdapter):
    __regid__ = 'IFTIndexable'
    __select__ = is_instance('MyEntityClass')

    def fti_containers(self, _done=None):
        """this should yield any entity that must be considered to
        fulltext-index self.entity

        CubicWeb's default implementation will look for yams'
        ``fulltext_container`` property.
        """
        yield self.entity
        yield self.entity.some_related_entity

    def get_words(self):
        # implement any logic here
        # see http://www.postgresql.org/docs/9.1/static/textsearch-controls.html
        # for the actual meaning of 'C'
        return {'C': ['any', 'word', 'I', 'want']}
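The dictionary returned by get_words() can hold several weights at once. The following plain-Python sketch builds such a dictionary from attribute values; it assumes the weights are PostgreSQL's A to D labels (A being the highest), as the link above suggests. The words_by_weight() helper, its attr_weight mapping, and the regex tokenization are hypothetical and only illustrate the shape of the result.

```python
import re


def words_by_weight(values, attr_weight, default_weight="C"):
    """Group the words of each attribute under that attribute's weight.

    values:      maps an attribute name to its text value.
    attr_weight: maps an attribute name to a weight label ('A' to 'D');
                 attributes not listed get `default_weight`.
    """
    words = {}
    for attr, text in values.items():
        weight = attr_weight.get(attr, default_weight)
        tokens = re.findall(r"\w+", text.lower())
        words.setdefault(weight, []).extend(tokens)
    return words


words = words_by_weight(
    {"title": "Login crash", "description": "Crash after the login page"},
    attr_weight={"title": "A"},  # rank title matches above the rest
)
```

A get_words() implementation could return a dictionary of this shape so that matches in a title rank higher than matches in a description.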