Package turbolucene

Source Code for Package turbolucene

   1  # -*- coding: utf-8 -*- 
   2   
   3  #---Header--------------------------------------------------------------------- 
   4   
   5  #============================================================================== 
   6  # turbolucene/__init__.py 
   7  # 
   8  # This is part of the TurboLucene project (http://dev.krys.ca/turbolucene/). 
   9  # 
  10  # Copyright (c) 2007 Krys Wilken <krys AT krys DOT ca> 
  11  # 
  12  # This software is licensed under the MIT license.  See the LICENSE file for 
  13  # licensing information. 
  14  # 
  15  #============================================================================== 
  16   
  17  """Provides search functionality for TurboGears_ using PyLucene_. 
  18   
  19  This module uses PyLucene to do all the heavy lifting, but as a result this 
  20  module does some fancy things with threads. 
  21   
  22  PyLucene requires that all threads that use it inherit from 
  23  ``PythonThread``.  This means either patching CherryPy_ and/or TurboGears, or 
  24  having the CherryPy thread hand off the request to a ``PythonThread`` and, in 
  25  the case of searching, wait for the result.  The second method was chosen so 
  26  that a patched CherryPy or TurboGears does not have to be maintained. 
  27   
  28  The other advantage to the chosen method is that indexing happens in a separate 
  29  thread so the web request can return more quickly by not waiting for the 
  30  results. 
  31   
  32  The main disadvantage with PyLucene and CherryPy, however, is that *autoreload* 
  33  does not work with it.  You **must** disable it by adding 
  34  ``autoreload.on = False`` to your ``dev.cfg``. 
  35   
  36  Configuration options 
  37  ===================== 
  38   
  39  TurboLucene_ uses the following configuration options: 
  40   
  41    **turbolucene.search_fields**: 
  42      The list of fields that should be searched by default when a specific field 
  43      is not specified.  (e.g. ``['id', 'title', 'text', 'categories']``) 
  44      (Default: ``['id']``) 
  45    **turbolucene.default_language**: 
  46      The default language to use if a language is not given when calling 
  47      `add`/`update`/`search`/etc.  (Default: ``'en'``) 
  48    **turbolucene.languages**: 
  49      The list of languages to support.  This is a list of ISO language codes 
  50      that you want to support in your application.  The languages must be 
  51      supported by PyLucene and must be configured in the languages 
  52      configuration file.  Currently the choice of languages that are possible 
  53      out-of-the-box are: *Czech (cs)*, *Danish (da)*, *German (de)*, *Greek 
  54      (el)*, *English (en)*, *Spanish (es)*, *Finnish (fi)*, *French (fr)*, 
  55      *Italian (it)*, *Japanese (ja)*, *Korean (ko)*, *Dutch (nl)*, *Norwegian 
  56      (no)*, *Portuguese (pt)*, *Brazilian (pt-br)*, *Russian (ru)*, *Swedish 
  57      (sv)*, and *Chinese (zh)*.  (Default: ``[<default_language>]``) 
  58    **turbolucene.default_operator**: 
  59      The default search operator to use between search terms when none is 
  60      specified.  (Default: ``'AND'``)  This must be a valid operator object from 
  61      the ``PyLucene.MultiFieldQueryParser.Operator`` namespace. 
  62    **turbolucene.optimize_days**: 
  63      The list of days to schedule index optimization.  Index optimization cleans 
  64      up and compacts the indexes so that searches happen faster.  This is a list 
  65      of day numbers (Sunday = 1).  Optimization of all indexes will occur on 
  66      those days.  (Default: ``[1, 2, 3, 4, 5, 6, 7]``, i.e. every day) 
  67    **turbolucene.optimize_time**: 
  68      A tuple containing the hour (24 hour format) and minute of the time to run 
  69      the scheduled index optimizations.  (Default: ``(00, 00)``, i.e. midnight) 
  70    **turbolucene.index_root**: 
  71      The base path in which to store the indexes.  There is one index per 
  72      supported language.  Each index is a directory.  Those directories will be 
  73      sub-directories of this base path.  If the path is relative, it is 
  74      relative to your project's root.  Normally you should not need to override 
  75      this unless you specifically need the indexes to be located somewhere else. 
  76      (Default: ``u'index'``) 
  77    **turbolucene.languages_file**: 
  78      The path to the languages configuration file.  The languages configuration 
  79      file provides the configuration information for all the languages that 
  80      *TurboLucene* supports.  Normally you should not need to override this. 
  81      (Default: the ``u'languages.cfg'`` file in the `turbolucene` package) 
  82    **turbolucene.languages_file_encoding**: 
  83      The encoding of the languages file.  (Default: ``'utf-8'``) 
  84    **turbolucene.stopwords_root**: 
  85      The languages file can specify files that contain stopwords.  If a 
  86      stopwords file path is relative, this path will be prepended to it.  This 
  87      allows for all stopword files to be customized without needing to specify 
  88      full paths for every one.  Normally you should not need to override this. 
  89      (Default: the ``stopwords`` directory in the `turbolucene` package) 
  90   
  91  All of these options are optional, but at a minimum, you will likely want to 
  92  ``turbolucene.search_fields``. 
  93   
  94  :See: `_load_language_data` for details about the languages configuration file. 
  95   
  96  :Warning: Do not forget to turn off *autoreload* in ``dev.cfg``. 
  97   
  98  :Requires: TurboGears_ and PyLucene_ 
  99   
 100  .. _TurboGears: http://turbogears.org/ 
 101  .. _PyLucene: http://pylucene.osafoundation.org/ 
 102  .. _CherryPy: http://cherrypy.org/ 
 103  .. _TurboLucene: http://dev.krys.ca/turbolucene/ 
 104   
 105  :newfield api_version: API Version 
 106  :newfield revision: Revision 
 107   
 108  :group Objects to use in make_document: Document, Field, STORE, COMPRESS, 
 109    TOKENIZED, UN_TOKENIZED 
 110  :group Public API: start, add, update, remove, search 
 111   
 112  """ 
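Putting the options above together, here is a minimal sketch of a ``dev.cfg`` excerpt. The field names and language choices are illustrative only, not defaults shipped with the package; TurboGears configuration files accept Python-style list and tuple literals.

```ini
# Illustrative dev.cfg excerpt -- values are examples, not defaults.
autoreload.on = False

turbolucene.search_fields = ['title', 'text', 'categories']
turbolucene.default_language = 'en'
turbolucene.languages = ['en', 'fr', 'de']
turbolucene.optimize_days = [1]
turbolucene.optimize_time = (3, 30)
turbolucene.index_root = 'index'
```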
 113   
 114  __author__ = 'Krys Wilken' 
 115  __contact__ = 'krys AT krys DOT ca' 
 116  __copyright__ = '(c) 2007 Krys Wilken' 
 117  __license__ = 'MIT' 
 118  __version__ = '0.2' 
 119  __api_version__ = '2.0' 
 120  __revision__ = '$Id: __init__.py 47 2007-04-01 22:36:05Z krys $' 
 121  __docformat__ = 'restructuredtext en' 
 122  __all__ = ['start', 'add', 'update', 'remove', 'search', 'Document', 'Field', 
 123    'STORE', 'COMPRESS', 'TOKENIZED', 'UN_TOKENIZED'] 
 124   
 125   
 126  #---Imports-------------------------------------------------------------------- 
 127   
 128  #---  Standard library imports 
 129  from Queue import Queue 
 130  from os.path import exists, join, isabs 
 131  from logging import getLogger 
 132  from atexit import register 
 133  from codecs import EncodedFile, open as codecs_open 
 134   
 135  #---  Framework imports 
 136  from turbogears import scheduler, config 
 137  from configobj import ConfigObj 
 138  # PyLint does not like this setuptools voodoo, but it works. 
 139  from pkg_resources import resource_stream # pylint: disable-msg=E0611 
 140   
 141  #---  Third-party imports 
 142  import PyLucene 
 143  from PyLucene import (PythonThread, IndexModifier, JavaError, Term, 
 144    IndexSearcher, MultiFieldQueryParser) 
 145  # For use in make_document 
 146  from PyLucene import Document, Field 
 147   
 148   
 149  #---Globals-------------------------------------------------------------------- 
 150   
 151  #: Default language to use if none is specified in `config`. 
 152  _DEFAULT_LANGUAGE = 'en' 
 153  # These are intentionally module-level globals, so C0103 does not apply. 
 154  #: Logger for this module 
 155  _log = getLogger('turbolucene') # pylint: disable-msg=C0103 
 156  #: This will hold the language support data read from file. 
 157  _language_data = None # pylint: disable-msg=C0103 
 158  #: This will hold the `_Indexer` singleton class. 
 159  _indexer = None # pylint: disable-msg=C0103 
 160  #: This will hold the `_SearcherFactory` singleton class. 
 161  _searcher_factory = None # pylint: disable-msg=C0103 
 162   
 163  #---  Convenience constants 
 164   
 165  #: Tells `Field` not to compress the field data 
 166  STORE = Field.Store.YES 
 167  #: Tells `Field` to compress the field data 
 168  COMPRESS = Field.Store.COMPRESS 
 169  #: Tells `Field` to tokenize and do stemming on the field data 
 170  TOKENIZED = Field.Index.TOKENIZED 
 171  #: Tells `Field` not to tokenize and do stemming on the field data 
 172  UN_TOKENIZED = Field.Index.UN_TOKENIZED 
 173   
 174   
 175  #---Functions------------------------------------------------------------------ 
 176   
 177  def _load_language_data(): 
 178      """Load all the language data from the configured languages file. 
 179   
 180      The languages configuration file can be set with the 
 181      ``turbolucene.languages_file`` configuration option and its encoding is 
 182      set with ``turbolucene.languages_file_encoding``. 
 183   
 184      Configuration file format 
 185      ========================= 
 186   
 187      The languages file is an INI-type (ConfigObj_) file.  Each section is 
 188      defined by an ISO language code (``en``, ``de``, ``el``, ``pt-br``, etc.). 
 189      In each section the following keys are possible: 
 190   
 191      **analyzer_class**: 
 192        The PyLucene analyzer class to use for this language.  (e.g. 
 193        ``SnowballAnalyzer``)  (Required) 
 194      **analyzer_class_args**: 
 195        Any arguments that should be passed to the analyzer class.  (e.g. 
 196        ``Danish``)  (Optional) 
 197      **stopwords**: 
 198        A list of stopwords (words that do not get indexed) to pass to the 
 199        analyzer class.  This is not normally used as ``stopwords_file`` is 
 200        generally preferred.  (Optional) 
 201      **stopwords_file**: 
 202        The path to the file that contains the list of stopwords to pass to the 
 203        analyzer class.  (e.g. ``stopwords_da.txt``)  (Optional) 
 204      **stopwords_file_encoding**: 
 205        The encoding of the stopwords file.  (e.g. ``windows-1252``) 
 206   
 207      If neither ``stopwords`` nor ``stopwords_file`` is defined for a language, 
 208      then any stopwords that are used are determined automatically by the 
 209      analyzer class' constructor. 
 210   
 211      Example 
 212      ------- 
 213   
 214      :: 
 215   
 216        # German 
 217        [de] 
 218        analyzer_class = SnowballAnalyzer 
 219        analyzer_class_args = German2 
 220        stopwords_file = stopwords_de.txt 
 221        stopwords_file_encoding = windows-1252 
 222   
 223      :Exceptions: 
 224        - `IOError`: Raised if the languages configuration file could not be 
 225          opened. 
 226        - `configobj.ParseError`: Raised if the languages configuration file 
 227          contains errors. 
 228   
 229      :See: 
 230        - `turbolucene` (module docstring) for details about configuration 
 231          settings. 
 232        - `_read_stopwords` for details about stopwords files. 
 233   
 234      .. _ConfigObj: http://www.voidspace.org.uk/python/configobj.html 
 235   
 236      """ 
 237      # Use of global here is intentional and necessary.  W0603 does not apply. 
 238      global _language_data # pylint: disable-msg=W0603 
 239      languages_file = config.get('turbolucene.languages_file', None) 
 240      languages_file_encoding = config.get('turbolucene.languages_file_encoding', 
 241          'utf-8') 
 242      if languages_file: 
 243          _log.info(u'Loading custom language data from "%s"' % languages_file) 
 244      else: 
 245          _log.info(u'Loading default language data') 
 246          languages_file = resource_stream(__name__, u'languages.cfg') 
 247      _language_data = ConfigObj(languages_file, 
 248          encoding=languages_file_encoding, file_error=True, raise_errors=True) 
 249   
 250   
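The languages file described above is plain INI. The module itself reads it with ConfigObj, but the same section layout can be sketched with only the standard-library ``configparser`` (the sample content mirrors the docstring's German example):

```python
from configparser import ConfigParser

SAMPLE = """
# German
[de]
analyzer_class = SnowballAnalyzer
analyzer_class_args = German2
stopwords_file = stopwords_de.txt
stopwords_file_encoding = windows-1252
"""

parser = ConfigParser()
parser.read_string(SAMPLE)
# Each section name is an ISO language code; keys mirror the docstring above.
analyzer = parser["de"]["analyzer_class"]
```

Note that ConfigObj additionally parses comma-separated values into Python lists, which ``configparser`` does not do; this sketch only shows the file shape.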
 251  def _schedule_optimization(): 
 252      """Schedule index optimization using the TurboGears scheduler. 
 253   
 254      This function reads its configuration data from 
 255      ``turbolucene.optimize_days`` and ``turbolucene.optimize_time``. 
 256   
 257      :Exceptions: 
 258        - `TypeError`: Raised if ``turbolucene.optimize_time`` is invalid. 
 259   
 260      :See: `turbolucene` (module docstring) for details about configuration 
 261        settings. 
 262   
 263      """ 
 264      optimize_days = config.get('turbolucene.optimize_days', range(1, 8)) 
 265      optimize_time = config.get('turbolucene.optimize_time', (00, 00)) 
 266      scheduler.add_weekday_task(_optimize, optimize_days, optimize_time) 
 267      _log.info(u'Index optimization scheduled on %s at %s' % (unicode( 
 268          optimize_days), unicode(optimize_time))) 
 269   
 270   
 271  def _get_index_path(language): 
 272      """Return the path to the index for the given language. 
 273   
 274      This function gets its configuration data from ``turbolucene.index_root``. 
 275   
 276      :Parameters: 
 277        language : `str` 
 278          An ISO language code.  (e.g. ``en``, ``pt-br``, etc.) 
 279   
 280      :Returns: The path to the index for the given language. 
 281      :rtype: `unicode` 
 282   
 283      :See: `turbolucene` (module docstring) for details about configuration 
 284        settings. 
 285   
 286      """ 
 287      index_base_path = config.get('turbolucene.index_root', u'index') 
 288      return join(index_base_path, language) 
 289   
 290   
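The per-language index layout this produces is simply one sub-directory per language code under the configured root. A minimal stand-alone sketch (using ``posixpath`` so the illustration is platform-independent; the real function uses ``os.path.join``):

```python
from posixpath import join  # forward-slash joins for a portable illustration

def get_index_path(index_root, language):
    """Mirror of _get_index_path: one index directory per language code."""
    return join(index_root, language)

# With the default index_root of 'index':
paths = [get_index_path('index', code) for code in ('en', 'fr', 'pt-br')]
```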
 291  def _read_stopwords(file_path, encoding): 
 292      """Read the stopwords from the given stopwords file path. 
 293   
 294      Stopwords are words that should not be indexed because they are too common 
 295      or have no significant meaning (e.g. *the*, *in*, *with*, etc.)  They are 
 296      language dependent. 
 297   
 298      This function gets its configuration data from 
 299      ``turbolucene.stopwords_root``. 
 300   
 301      If `file_path` is not an absolute path, then it will be appended to the 
 302      path configured in ``turbolucene.stopwords_root``. 
 303   
 304      Stopwords files are text files (in the given encoding), with one stopword 
 305      per line.  Comments are marked by a ``|`` character.  This is for 
 306      compatibility with the stopwords files found at 
 307      http://snowball.tartarus.org/. 
 308   
 309      :Parameters: 
 310        file_path : `unicode` 
 311          The path to the stopwords file to read. 
 312        encoding : `str` 
 313          The encoding of the stopwords file. 
 314   
 315      :Returns: The list of stopwords from the file. 
 316      :rtype: `list` of `unicode` strings 
 317   
 318      :Exceptions: 
 319        - `IOError`: Raised if the stopwords file could not be opened. 
 320   
 321      :See: `turbolucene` (module docstring) for details about configuration 
 322        settings. 
 323   
 324      """ 
 325      stopwords_base_path = config.get('turbolucene.stopwords_root', None) 
 326      if isabs(file_path) or stopwords_base_path: 
 327          if not isabs(file_path): 
 328              file_path = join(stopwords_base_path, file_path) 
 329          _log.info(u'Reading custom stopwords file "%s"' % file_path) 
 330          stopwords_file = codecs_open(file_path, 'r', encoding) 
 331      else: 
 332          _log.info(u'Reading default stopwords file "%s"' % file_path) 
 333          stopwords_file = EncodedFile(resource_stream(__name__, join( 
 334              u'stopwords', file_path)), encoding) 
 335      stopwords = [] 
 336      for line in stopwords_file: 
 337          # Stopword files can have comments after a '|' character on each line. 
 338          # This is to support the stopword files that come from 
 339          # http://snowball.tartarus.org/ 
 340          stopword = line.split(u'|')[0].strip() 
 341          if stopword: 
 342              stopwords.append(stopword) 
 343      stopwords_file.close() 
 344      return stopwords 
 345   
 346   
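The comment-stripping rule used above (everything after a ``|`` is a comment, as in the Snowball project's stopword files) can be exercised in isolation. This sketch repeats the parsing loop on an in-memory list instead of a file:

```python
def parse_stopwords(lines):
    """Return non-empty stopwords, dropping '|' comments (Snowball format)."""
    stopwords = []
    for line in lines:
        # Everything after the first '|' is a comment; blank results are skipped.
        stopword = line.split(u'|')[0].strip()
        if stopword:
            stopwords.append(stopword)
    return stopwords

sample = [
    u'und            | and',
    u'aber           | but',
    u'               | a pure comment line',
    u'',
]
result = parse_stopwords(sample)
```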
 347  def _analyzer_factory(language): 
 348      """Produce an analyzer object appropriate for the given language. 
 349   
 350      This function uses the data that was read in from the languages 
 351      configuration file to determine and instantiate the analyzer object. 
 352   
 353      :Parameters: 
 354        language : `str` or `unicode` 
 355          An ISO language code that is configured in the languages configuration 
 356          file. 
 357   
 358      :Returns: An instance of the configured analyzer class for the given language. 
 359      :rtype: ``PyLucene.Analyzer`` sub-class 
 360   
 361      :Exceptions: 
 362        - `KeyError`: Raised if the given language is not configured or if the 
 363          configuration for that language does not have an *analyzer_class* key. 
 364        - `PyLucene.InvalidArgsError`: Raised if any of the parameters passed to 
 365          the analyzer class are invalid. 
 366   
 367      :See: `_load_language_data` for details about the language configuration 
 368        file. 
 369   
 370      """ 
 371      ldata = _language_data[language] 
 372      args = (u'analyzer_class_args' in ldata and ldata[u'analyzer_class_args'] 
 373          or []) 
 374      if not isinstance(args, list): 
 375          args = [args] 
 376      # Note: It seems that the <LANGUAGE>_STOP_WORDS class variables are not 
 377      # exposed very often in PyLucene.  They are also not very complete anyway, 
 378      # so I use stopwords from other sources. 
 379      stopwords = [] 
 380      if u'stopwords' in ldata and ldata[u'stopwords']: 
 381          stopwords = [ldata.stopwords] 
 382      elif u'stopwords_file' in ldata and u'stopwords_file_encoding' in ldata: 
 383          stopwords = [_read_stopwords(ldata[u'stopwords_file'], 
 384              ldata[u'stopwords_file_encoding'])] 
 385      # This function assumes that the stopwords parameter is always the last 
 386      # argument to the analyzer constructor.  According to the Lucene docs, this 
 387      # is true in all cases so far. 
 388      args += stopwords 
 389      # Use of *args here is deliberate and necessary, so W0142 does not apply. 
 390      return getattr(PyLucene, ldata[ #pylint: disable-msg=W0142 
 391          u'analyzer_class'])(*args) 
 392   
 393   
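The stopwords-last argument convention the factory relies on can be sketched without PyLucene. ``FakeAnalyzer`` here is a hypothetical stand-in that just records its constructor arguments; it is not a PyLucene class:

```python
class FakeAnalyzer(object):
    """Hypothetical stand-in for a PyLucene analyzer; records its args."""
    def __init__(self, *args):
        self.args = args

def build_analyzer(analyzer_class, class_args, stopwords=None):
    """Mirror _analyzer_factory's convention: the stopwords list goes last."""
    args = list(class_args)
    if stopwords:
        args.append(stopwords)
    return analyzer_class(*args)

# Roughly what happens for the docstring's German example:
analyzer = build_analyzer(FakeAnalyzer, [u'German2'], [u'und', u'aber'])
```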
 394  def _stop(): 
 395      """Shutdown search engine threads.""" 
 396      _searcher_factory.stop() 
 397      _indexer('stop') 
 398      _log.info(u'Search engine stopped.') 
 399   
 400   
 401  def _optimize(): 
 402      """Tell the search engine to optimize its index.""" 
 403      _indexer('optimize') 
 404   
 405   
 406  #---  Public API 
 407   
 408  def start(make_document, results_formatter=None): 
 409      """Initialize and start the search engine threads. 
 410   
 411      This function loads the language configuration information, starts the 
 412      search engine threads, makes sure the search engine will be shut down 
 413      when TurboGears shuts down, and starts the optimization scheduler to run 
 414      at the configured times. 
 415   
 416      The `make_document` and `results_formatter` parameters are 
 417      callables.  Here are examples of how they should be defined: 
 418   
 419      Example `make_document` function: 
 420      ================================= 
 421   
 422      .. python:: 
 423   
 424        def make_document(entry): 
 425            '''Make a new PyLucene Document instance from an Entry instance.''' 
 426            document = Document() 
 427            # An 'id' string field is required. 
 428            document.add(Field('id', str(entry.id), STORE, UN_TOKENIZED)) 
 429            document.add(Field('posted_on', entry.rendered_posted_on, STORE, 
 430                TOKENIZED)) 
 431            document.add(Field('title', entry.title, STORE, TOKENIZED)) 
 432            document.add(Field('text', strip_tags(entry.etree), COMPRESS, 
 433                TOKENIZED)) 
 434            categories = ' '.join([unicode(category) for category in 
 435                entry.categories]) 
 436            document.add(Field('category', categories, STORE, TOKENIZED)) 
 437            return document 
 438   
 439      Example `results_formatter` function: 
 440      ===================================== 
 441   
 442      .. python:: 
 443   
 444        def results_formatter(results): 
 445            '''Return the results as SQLObject instances. 
 446   
 447            Returns either an empty list or a SelectResults object. 
 448   
 449            ''' 
 450            if results: 
 451                return Entry.select_with_identity(IN(Entry.q.id, [int(id) for id 
 452                    in results])) 
 453   
 454      :Parameters: 
 455        make_document : callable 
 456          `make_document` is a callable that will return a PyLucene `Document` 
 457          object based on the object passed in to `add`, `update` or `remove`. 
 458          The `Document` object must have at least a field called ``id`` that is 
 459          a string.  This function operates inside a PyLucene ``PythonThread``. 
 460        results_formatter : callable 
 461          `results_formatter`, if provided, is a callable that will return 
 462          a formatted version of the search results that are passed to it by 
 463          `_Searcher.__call__`.  Generally the `results_formatter` will take the 
 464          list of ``id`` strings that is passed to it and return a list of 
 465          application-specific objects (like SQLObject_ instances, for example.) 
 466          This function operates outside of any PyLucene ``PythonThread`` objects 
 467          (like in the CherryPy thread, for example).  (Optional) 
 468   
 469      :See: 
 470        - `turbolucene` (module docstring) for details about configuration 
 471          settings. 
 472        - `_load_language_data` for details about the language configuration 
 473          file. 
 474   
 475      .. _SQLObject: http://sqlobject.org/ 
 476   
 477      """ 
 478      _load_language_data() 
 479      # Use of global here is deliberate.  W0603 does not apply. 
 480      global _indexer, _searcher_factory #pylint: disable-msg=W0603 
 481      _indexer = _Indexer(make_document) 
 482      _searcher_factory = _SearcherFactory(results_formatter) 
 483      # Using atexit instead of call_on_shutdown so that tg-admin shell will also 
 484      # shutdown properly. 
 485      register(_stop) 
 486      _schedule_optimization() 
 487      _log.info(u'Search engine started.') 
 488   
 489   
 490  def add(object_, language=None): 
 491      """Tell the search engine to add the given object to the index. 
 492   
 493      This function returns immediately.  It does not wait for the indexer to be 
 494      finished. 
 495   
 496      :Parameters: 
 497        `object_` 
 498          This can be any object that ``make_document`` knows how to handle. 
 499        language : `str` 
 500          This is the ISO language code of the language of the object.  If 
 501          `language` is given, then it must be one that was previously configured 
 502          in ``turbolucene.languages``.  If `language` is not given, then 
 503          the language configured in ``turbolucene.default_language`` will be 
 504          used.  (Optional) 
 505   
 506      :See: 
 507        - `turbolucene` (module docstring) for details about configuration 
 508          settings. 
 509        - `start` for details about ``make_document``. 
 510   
 511      """ 
 512      _indexer('add', object_, language) 
 513   
 514   
 515  def update(object_, language=None): 
 516      """Tell the search engine to update the index for the given object. 
 517   
 518      This function returns immediately.  It does not wait for the indexer to be 
 519      finished. 
 520   
 521      :Parameters: 
 522        `object_` 
 523          This can be any object that ``make_document`` knows how to handle. 
 524        language : `str` 
 525          This is the ISO language code of the language of the object.  If 
 526          `language` is given, then it must be one that was previously configured 
 527          in ``turbolucene.languages``.  If `language` is not given, then 
 528          the language configured in ``turbolucene.default_language`` will be 
 529          used.  (Optional) 
 530   
 531      :See: 
 532        - `turbolucene` (module docstring) for details about configuration 
 533          settings. 
 534        - `start` for details about ``make_document``. 
 535   
 536      """ 
 537      _indexer('update', object_, language) 
 538   
 539   
 540  def remove(object_, language=None): 
 541      """Tell the search engine to remove the given object from the index. 
 542   
 543      This function returns immediately.  It does not wait for the indexer to be 
 544      finished. 
 545   
 546      :Parameters: 
 547        `object_` 
 548          This can be any object that ``make_document`` knows how to handle. 
 549        language : `str` 
 550          This is the ISO language code of the language of the object.  If 
 551          `language` is given, then it must be one that was previously configured 
 552          in ``turbolucene.languages``.  If `language` is not given, then 
 553          the language configured in ``turbolucene.default_language`` will be 
 554          used.  (Optional) 
 555   
 556      :See: 
 557        - `turbolucene` (module docstring) for details about configuration 
 558          settings. 
 559        - `start` for details about ``make_document``. 
 560   
 561      """ 
 562      _indexer('remove', object_, language) 
 563   
 564   
 565  def search(query, language=None): 
 566      """Return results from the search engine that match the query. 
 567   
 568      If a ``results_formatter`` function was passed to `start` then the results 
 569      will be passed through the formatter before returning.  If not, the 
 570      returned value is a list of strings that are the ``id`` fields of matching 
 571      objects. 
 572   
 573      :Parameters: 
 574        query : `str` or `unicode` 
 575          This is the search query to give to PyLucene.  All of Lucene's query 
 576          syntax (field identifiers, wild cards, etc.) is available. 
 577        language : `str` 
 578          This is the ISO language code of the language of the object.  If 
 579          `language` is given, then it must be one that was previously configured 
 580          in ``turbolucene.languages``.  If `language` is not given, then 
 581          the language configured in ``turbolucene.default_language`` will be 
 582          used.  (Optional) 
 583   
 584      :Returns: The results of the search. 
 585      :rtype: iterable 
 586   
 587      :See: 
 588        - `start` for details about ``results_formatter``. 
 589        - `turbolucene` (module docstring) for details about configuration 
 590          settings. 
 591        - http://lucene.apache.org/java/docs/queryparsersyntax.html for details 
 592          about Lucene's query syntax. 
 593   
 594      """ 
 595      return _searcher_factory()(query, language) 
 596   
 597   
 598  #---Classes-------------------------------------------------------------------- 
 599   
 600  class _Indexer(PythonThread): 
 601   
 602      """Responsible for updating and maintaining the search engine index. 
 603   
 604      A single `_Indexer` thread is created to handle all index modifications. 
 605   
 606      Once the thread is started, messages are sent to it by calling the instance 
 607      with a task and an object, where the task is one of the following strings: 
 608   
 609      - ``add``: Adds the object to the index. 
 610      - ``remove``: Removes the object from the index. 
 611      - ``update``: Updates the index of an object. 
 612   
 613      and the object is any object that ``make_document`` knows how to handle. 
 614   
 615      To properly shutdown the thread, send the ``stop`` task with `None` as the 
 616      object.  (This is normally handled by the `turbolucene._stop` function.) 
 617   
 618      To optimize the index, which can take a while, pass the ``optimize`` 
 619      task with `None` for the object.  (This is normally handled by the 
 620      TurboGears scheduler as set up by `_schedule_optimization`.) 
 621   
 622      :See: `turbolucene.start` for details about ``make_document``. 
 623   
 624      :group Public API: __init__, __call__ 
 625      :group Threaded methods: run, _add, _remove, _update, _optimize, _stop 
 626   
 627      """ 
 628   
 629      #---Public API 
 630   
 631      def __init__(self, make_document): 
 632          """Initialize the message queue and the PyLucene indexes. 
 633   
 634          One PyLucene index is created/opened for each of the configured 
 635          supported languages. 
 636   
 637          This method uses the ``turbolucene.default_language`` and 
 638          ``turbolucene.languages`` configuration settings. 
 639   
 640          :Parameters: 
 641            make_document : callable 
 642              A callable that takes the object to index as a parameter and 
 643              returns an appropriate `Document` object. 
 644   
 645          :Note: Instantiating this class starts the thread automatically. 
 646   
 647          :See: 
 648            - `turbolucene` (module docstring) for details about configuration 
 649              settings. 
 650            - `turbolucene.start` for details about ``make_document``. 
 651            - `_get_index_path` for details about the directory location of each 
 652              index. 
 653            - `_analyzer_factory` for details about the analyzer used for each 
 654              index. 
 655   
 656          """ 
 657          PythonThread.__init__(self) # PythonThread is an old-style class 
 658          self._make_document = make_document 
 659          self._task_queue = Queue() 
 660          self._indexes = {} 
 661          default_language = config.get('turbolucene.default_language', 
 662              _DEFAULT_LANGUAGE) 
 663          # Create indexes 
 664          languages = config.get('turbolucene.languages', [default_language]) 
 665          for language in languages: 
 666              index_path = _get_index_path(language) 
 667              self._indexes[language] = IndexModifier(index_path, 
 668                  _analyzer_factory(language), not exists(index_path) and True or 
 669                  False) 
 670          self.start() 
 671   
 672      def __call__(self, task, object_=None, language=None): 
 673          """Pass `task`, `object_` and `language` to the thread for processing. 
 674   
 675          If `language` is `None`, then the default language configured in 
 676          ``turbolucene.default_language`` is used. 
 677   
 678          If `task` is ``stop``, then the `_Indexer` thread is shut down and this 
 679          method will wait until the shutdown is complete. 
 680   
 681          :Parameters: 
 682            task : `str` 
 683              The task to perform. 
 684            `object_` 
 685              Any object that ``make_document`` knows how to handle.  (Default: 
 686              `None`) 
 687            language : `str` 
 688              The ISO language code of the language of the object.  This 
 689              specifies which PyLucene index to use. 
 690   
 691          :See: 
 692            - `turbolucene` (module docstring) for details about configuration 
 693              settings. 
 694            - `turbolucene.start` for details about ``make_document``. 
 695   
 696          """ 
 697          if not language: 
 698              language = config.get('turbolucene.default_language', 
 699                  _DEFAULT_LANGUAGE) 
 700          self._task_queue.put((task, object_, language)) 
 701          if task == 'stop': 
 702              self.join() 
 703   
 704      #---Threaded methods 
 705   
 706      def run(self): 
 707          """Main thread loop to do dispatching based on messages in the queue. 
 708   
 709          This method expects that the queue will contain 3-tuples in the form of 
 710          (task, object, language), where task is one of ``add``, ``update``, 
 711          ``remove``, ``optimize`` or ``stop``, object is any object that 
 712          ``make_document`` can handle or `None` in the case of ``optimize`` and 
 713          ``stop``, and language is the ISO language code of the indexer. 
 714   
 715          If the task is ``stop``, then the thread shuts down. 
 716   
 717          :Note: This method is run in the thread. 
 718   
 719          :See: 
 720            - `_add`, `_update`, `_remove`, `_optimize` and `_stop` for details 
 721              about each respective task. 
 722            - `turbolucene.start` for details about ``make_document``. 
 723   
 724          """ 
 725          while True: 
 726              task, object_, language = self._task_queue.get() 
 727              method = getattr(self, '_' + task) 
 728              if task in ('optimize', 'stop'): 
 729                  method() 
 730              else: 
 731                  method(object_, language) 
 732              if task == 'stop': 
 733                  break 
 734              self._indexes[language].flush() # This is essential. 
 735   
 736      def _add(self, object_, language, document=None): 
 737          """Add a new object to the index. 
 738   
 739          If `document` is not provided, then this method passes the object off 
 740          to ``make_document`` and then indexes the resulting `Document` object. 
 741          Otherwise it just indexes the `document` object. 
 742   
 743          :Parameters: 
 744            `object_` 
 745              The object to be indexed.  It will be passed to ``make_document`` 
 746              (unless `document` is provided). 
 747            language : `str` 
 748              The ISO language code of the indexer to use. 
 749            document : `Document` 
 750              A pre-built `Document` object for the given object, if it exists. 
 751              This is used internally by `_update`.  (Default: `None`) 
 752   
 753          :Note: This method is run in the thread. 
 754   
 755          :See: `turbolucene.start` for details about ``make_document``. 
 756   
 757          """ 
 758          if not document: 
 759              document = self._make_document(object_) 
 760          _log.info(u'Adding object "%s" (id %s) to the %s index.' % (unicode( 
 761              object_), document['id'], language)) 
 762          self._indexes[language].addDocument(document) 
 763   
 764      def _remove(self, object_, language, document=None): 
 765          """Remove an object from the index. 
 766   
 767          If `document` is not provided, then this method passes the object off 
 768          to ``make_document`` and then removes the resulting `Document` object 
 769          from the index.  Otherwise it just removes the `document` object. 
 770   
 771          :Parameters: 
 772            `object_` 
 773              The object to be removed from the index.  It will be passed to 
 774              ``make_document`` (unless `document` is provided). 
 775            language : `str` 
 776              The ISO language code of the indexer to use. 
 777            document : `Document` 
 778              A pre-built `Document` object for the given object, if it exists. 
 779              This is used internally by `_update`.  (Default: `None`) 
 780   
 781          :Note: This method is run in the thread. 
 782   
 783          :See: `turbolucene.start` for details about ``make_document``. 
 784   
 785          """ 
 786          if not document: 
 787              document = self._make_document(object_) 
 788          _log.info(u'Removing object "%s" (id %s) from %s index.' % (unicode( 
 789              object_), document['id'], language)) 
 790          self._indexes[language].deleteDocuments(Term('id', document['id'])) 
 791   
 792      def _update(self, object_, language): 
 793          """Update an object in the index by replacing it. 
 794   
 795          This method updates the index by removing and then re-adding the 
 796          object. 
 797   
 798          :Parameters: 
 799            `object_` 
 800              The object to update in the index.  It will be passed to 
 801              ``make_document`` and the resulting `Document` object will be 
 802              updated. 
 803            language : `str` 
 804              The ISO language code of the indexer to use. 
 805   
 806          :Note: This method is run in the thread. 
 807   
 808          :See: 
 809            - `_remove` and `_add` for details about the removal and 
 810              re-addition. 
 811            - `turbolucene.start` for details about ``make_document``. 
 812   
 813          """ 
 814          document = self._make_document(object_) 
 815          self._remove(object_, language, document) 
 816          self._add(object_, language, document) 
 817   
818 - def _optimize(self):
819 """Optimize all of the indexes. This can take a while. 820 821 :Note: This method is run in the thread. 822 823 """ 824 _log.info(u'Optimizing indexes.') 825 for index in self._indexes.values(): 826 index.optimize() 827 _log.info(u'Indexes optimized.')
828
829 - def _stop(self):
830 """Shutdown all of the indexes. 831 832 :Note: This method is run in the thread. 833 834 """ 835 for index in self._indexes.values(): 836 index.close()
837 838
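The remove-then-re-add strategy that `_update` uses exists because Lucene has no in-place document update: you delete by the unique ``id`` term and add the rebuilt document.  A minimal sketch of that pattern, using a hypothetical in-memory ``ToyIndex`` (not TurboLucene or PyLucene code) so it can be read without a running Lucene:

```python
# Hypothetical sketch, not part of TurboLucene: a toy index keyed by an
# 'id' field, demonstrating the delete-then-add update strategy.

class ToyIndex(object):
    def __init__(self):
        self._docs = {}

    def add_document(self, document):
        # Index (or re-index) a document under its unique id.
        self._docs[document['id']] = document

    def delete_documents(self, id_):
        # Remove any document matching the id term; missing ids are ignored.
        self._docs.pop(id_, None)

    def update(self, document):
        # No in-place update: delete by the unique id, then re-add.
        self.delete_documents(document['id'])
        self.add_document(document)

index = ToyIndex()
index.add_document({'id': '1', 'body': 'old text'})
index.update({'id': '1', 'body': 'new text'})
```

As in `_update`, the document is built once and handed to both the removal and the re-addition, so the index never ends up with two copies under the same id.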
class _Searcher(PythonThread):

    """Responsible for searching an index and returning results.

    A `_Searcher` thread is created for each search that is requested.  After
    the search is completed, the thread dies.

    To search, a `_Searcher` class is instantiated and then called with the
    query and the ISO language code for the index to search.  It returns the
    results as a list of object id strings unless ``results_formatter`` was
    provided.  If it was, then the list of id strings is passed to
    ``results_formatter`` to process and its results are returned.

    The thread is garbage collected when it goes out of scope.

    The catch to all this is that a CherryPy thread cannot directly
    instantiate a `_Searcher` thread because of PyLucene restrictions.  To
    get around that, see the `_SearcherFactory` class.

    :See: `turbolucene.start` for details about ``results_formatter``.

    :group Public API: __init__, __call__
    :group Threaded methods: run

    """

    #---Public API

    def __init__(self, results_formatter):
        """Initialize message queues and start the thread.

        :Note: The thread is started as soon as the class is instantiated.

        """
        PythonThread.__init__(self)  # PythonThread is an old-style class
        self._results_formatter = results_formatter
        self._query_queue = Queue()
        self._results_queue = Queue()
        self.start()

    def __call__(self, query, language=None):
        """Send `query` and `language` to the thread, wait and return results.

        If `language` is `None`, then the default language configured in
        ``turbolucene.default_language`` is used.

        :Parameters:
          query : `str` or `unicode`
            The search query to give to PyLucene.  All of Lucene's query
            syntax (field identifiers, wildcards, etc.) is available.
          language : `str`
            The ISO language code of the indexer to use.

        :Returns: An iterable of id field strings that match the query, or
          the results produced by ``results_formatter`` if it was provided.
        :rtype: iterable

        :See:
          - `turbolucene` (module docstring) for details about configuration
            settings.
          - `turbolucene.start` for details about ``results_formatter``.
          - http://lucene.apache.org/java/docs/queryparsersyntax.html for
            details about Lucene's query syntax.

        """
        if not language:
            language = config.get('turbolucene.default_language',
                _DEFAULT_LANGUAGE)
        self._query_queue.put((query, language))
        results = self._results_queue.get()
        # The join is causing a segfault and I don't know why.  In theory the
        # join should not be necessary, but I thought it good practice to
        # include it.  Apparently I am wrong.
        ## self.join()
        if self._results_formatter:
            return self._results_formatter(results)
        return results

    #---Threaded methods

    def run(self):
        """Search the language index for the query and send back the results.

        The results are an iterable of id field strings that match the query.

        This method uses the ``turbolucene.search_fields`` configuration
        setting for the default fields to search if none are specified in the
        query itself, and ``turbolucene.default_operator`` for the default
        operator to use when joining terms.

        :Exceptions:
          - `AttributeError`: Raised when the configured default operator is
            not valid.

        :Note: This method is run in the thread.

        :Note: The thread dies after one search.

        :See:
          - `turbolucene` (module docstring) for details about configuration
            settings.
          - `_get_index_path` for details about the directory location of the
            index.
          - `_analyzer_factory` for details about the analyzer used for the
            index.
          - http://lucene.apache.org/java/docs/queryparsersyntax.html for
            details about Lucene's query syntax.

        """
        query, language = self._query_queue.get()
        searcher = IndexSearcher(_get_index_path(language))
        search_fields = config.get('turbolucene.search_fields', ['id'])
        parser = MultiFieldQueryParser(search_fields, _analyzer_factory(
            language))
        default_operator = getattr(parser.Operator, config.get(
            'turbolucene.default_operator', 'AND').upper())
        parser.setDefaultOperator(default_operator)
        try:
            hits = searcher.search(parser.parse(query))
            results = [document['id'] for _, document in hits]
        except JavaError:
            results = []
        self._results_queue.put(results)
        searcher.close()
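The two-queue hand-off in `_Searcher` — the caller puts the query on one queue and blocks on a second queue for the results, while the worker thread does the opposite — can be sketched without PyLucene.  The following is a hypothetical, modernized Python 3 sketch (plain ``threading.Thread`` stands in for ``PythonThread``, and a stub ``stub_search`` function stands in for ``IndexSearcher``); none of these names come from TurboLucene itself:

```python
# Hypothetical sketch, not part of TurboLucene: the single-use,
# two-queue searcher thread pattern used by _Searcher.

from queue import Queue
from threading import Thread

def stub_search(query):
    # Stand-in for the real Lucene search: return the ids of the
    # documents whose text contains the query string.
    corpus = {'1': 'green eggs', '2': 'green ham', '3': 'plain toast'}
    return [id_ for id_, text in corpus.items() if query in text]

class SingleUseSearcher(Thread):
    def __init__(self):
        Thread.__init__(self)
        self._query_queue = Queue()
        self._results_queue = Queue()
        self.start()  # started on instantiation, like _Searcher

    def __call__(self, query):
        # Caller's thread: hand the query to the worker, block for results.
        self._query_queue.put(query)
        return self._results_queue.get()

    def run(self):
        # Worker thread: serve exactly one query, then die.
        query = self._query_queue.get()
        self._results_queue.put(stub_search(query))

searcher = SingleUseSearcher()
results = searcher('green')
```

Because ``run`` handles exactly one query, each search gets a fresh thread and no locking is needed around the searcher state, which mirrors why `_Searcher` is single-use.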
class _SearcherFactory(PythonThread):

    """Produces running `_Searcher` threads.

    ``PythonThread`` threads can only be started by the main program or by
    other ``PythonThread`` threads, so this ``PythonThread``-based class
    creates and starts single-use `_Searcher` threads.  This thread is
    created and started by the main program, as a singleton, during
    TurboGears initialization.

    To get a `_Searcher` thread, call the `_SearcherFactory` instance.  Then
    pass the query to the `_Searcher` thread that was returned.

    :group Public API: __init__, __call__, stop
    :group Threaded methods: run

    """

    #---Public API

    def __init__(self, *searcher_args, **searcher_kwargs):
        """Initialize message queues and start the thread.

        :Note: The thread is started as soon as the class is instantiated.

        """
        PythonThread.__init__(self)  # PythonThread is an old-style class
        self._searcher_args = searcher_args
        self._searcher_kwargs = searcher_kwargs
        self._request_queue = Queue()
        self._searcher_queue = Queue()
        self.start()

    def __call__(self):
        """Send a request for a running `_Searcher` instance, then return it.

        :Returns: A running instance of the `_Searcher` class.
        :rtype: `_Searcher`

        """
        self._request_queue.put('request')
        return self._searcher_queue.get()

    def stop(self):
        """Stop the `_SearcherFactory` thread."""
        self._request_queue.put('stop')
        self.join()

    #---Threaded methods

    def run(self):
        """Listen for requests and create `_Searcher` instances.

        If the request message is ``stop``, then the thread will be shut
        down.

        :Note: This method is run in the thread.

        """
        while True:
            request = self._request_queue.get()
            if request == 'stop':
                break
            # * and ** are used here for simplicity and transparency.
            self._searcher_queue.put(_Searcher(  # pylint: disable-msg=W0142
                *self._searcher_args, **self._searcher_kwargs))
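The factory-thread workaround — a long-lived thread whose only job is to create and start other threads, so that request threads never have to start one themselves — can also be sketched in isolation.  This is a hypothetical Python 3 sketch (``threading.Thread`` and ``WorkerFactory`` stand in for ``PythonThread`` and `_SearcherFactory`; plain threads have no such start restriction, so only the message flow is being illustrated):

```python
# Hypothetical sketch, not part of TurboLucene: a factory thread that
# hands out already-started worker threads on request, and shuts down
# when it receives a 'stop' message, mirroring _SearcherFactory.

from queue import Queue
from threading import Thread

class WorkerFactory(Thread):
    def __init__(self):
        Thread.__init__(self)
        self._request_queue = Queue()
        self._worker_queue = Queue()
        self.start()  # started on instantiation, like _SearcherFactory

    def __call__(self):
        # Any thread may ask the factory for a new, already-started worker.
        self._request_queue.put('request')
        return self._worker_queue.get()

    def stop(self):
        # Tell the factory loop to exit, then wait for it to finish.
        self._request_queue.put('stop')
        self.join()

    def run(self):
        # Factory loop: create and start a worker per request.
        while True:
            if self._request_queue.get() == 'stop':
                break
            worker = Thread(target=lambda: None)
            worker.start()
            self._worker_queue.put(worker)

factory = WorkerFactory()
worker = factory()   # returns a running worker thread
worker.join()
factory.stop()
```

In TurboLucene the same shape is what lets a CherryPy request thread obtain a running `_Searcher` despite the PyLucene rule that ``PythonThread`` threads must be started by the main program or by another ``PythonThread``.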