"""Provides search functionality for TurboGears_ using PyLucene_.

This module uses PyLucene to do all the heavy lifting, but as a result it
does some fancy things with threads.

PyLucene requires that all threads that use it inherit from ``PythonThread``.
This means either patching CherryPy_ and/or TurboGears, or having the
CherryPy thread hand off the request to a ``PythonThread`` and, in the case
of searching, wait for the result. The second method was chosen so that a
patched CherryPy or TurboGears does not have to be maintained.

The other advantage of the chosen method is that indexing happens in a
separate thread, so the web request can return more quickly by not waiting
for the results.

The main disadvantage of combining PyLucene and CherryPy, however, is that
*autoreload* does not work. You **must** disable it by adding
``autoreload.on = False`` to your ``dev.cfg``.

Configuration options
=====================

TurboLucene_ uses the following configuration options:

**turbolucene.search_fields**:
    The list of fields that should be searched by default when a specific
    field is not specified. (e.g. ``['id', 'title', 'text', 'categories']``)
    (Default: ``['id']``)
**turbolucene.default_language**:
    The default language to use if a language is not given when calling
    `add`/`update`/`search`/etc. (Default: ``'en'``)
**turbolucene.languages**:
    The list of languages to support. This is a list of ISO language codes
    that you want to support in your application. The languages must be
    supported by PyLucene and must be configured in the languages
    configuration file. The languages supported out-of-the-box are: *Czech
    (cs)*, *Danish (da)*, *German (de)*, *Greek (el)*, *English (en)*,
    *Spanish (es)*, *Finnish (fi)*, *French (fr)*, *Italian (it)*, *Japanese
    (ja)*, *Korean (ko)*, *Dutch (nl)*, *Norwegian (no)*, *Portuguese (pt)*,
    *Brazilian (pt-br)*, *Russian (ru)*, *Swedish (sv)* and *Chinese (zh)*.
    (Default: ``[<default_language>]``)
**turbolucene.default_operator**:
    The default search operator to use between search terms when none is
    specified. This must be the name of a valid operator in the
    ``PyLucene.MultiFieldQueryParser.Operator`` namespace, e.g. ``'AND'`` or
    ``'OR'``. (Default: ``'AND'``)
**turbolucene.optimize_days**:
    The list of days on which to schedule index optimization. Index
    optimization cleans up and compacts the indexes so that searches happen
    faster. This is a list of day numbers (Sunday = 1). Optimization of all
    indexes will occur on those days. (Default: ``[1, 2, 3, 4, 5, 6, 7]``,
    i.e. every day)
**turbolucene.optimize_time**:
    A tuple containing the hour (24 hour format) and minute of the time to
    run the scheduled index optimizations. (Default: ``(00, 00)``, i.e.
    midnight)
**turbolucene.index_root**:
    The base path in which to store the indexes. There is one index per
    supported language. Each index is a directory. Those directories will be
    sub-directories of this base path. If the path is relative, it is
    relative to your project's root. Normally you should not need to
    override this unless you specifically need the indexes to be located
    somewhere else. (Default: ``u'index'``)
**turbolucene.languages_file**:
    The path to the languages configuration file. The languages
    configuration file provides the configuration information for all the
    languages that *TurboLucene* supports. Normally you should not need to
    override this. (Default: the ``u'languages.cfg'`` file in the
    `turbolucene` package)
**turbolucene.languages_file_encoding**:
    The encoding of the languages file. (Default: ``'utf-8'``)
**turbolucene.stopwords_root**:
    The languages file can specify files that contain stopwords. If a
    stopwords file path is relative, this path will be prepended to it. This
    allows all stopword files to be customized without needing to specify
    full paths for every one. Normally you should not need to override this.
    (Default: the ``stopwords`` directory in the `turbolucene` package)

All options are optional, but at a minimum you will likely want to specify
``turbolucene.search_fields``.
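
For example, the relevant part of a ``dev.cfg`` might look something like
this (the option values are purely illustrative)::

    [global]
    autoreload.on = False
    turbolucene.search_fields = ['title', 'text', 'categories']
    turbolucene.default_language = 'en'
    turbolucene.languages = ['en', 'fr', 'de']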

:See: `_load_language_data` for details about the languages configuration file.

:Warning: Do not forget to turn off *autoreload* in ``dev.cfg``.

:Requires: TurboGears_ and PyLucene_

.. _TurboGears: http://turbogears.org/
.. _PyLucene: http://pylucene.osafoundation.org/
.. _CherryPy: http://cherrypy.org/
.. _TurboLucene: http://dev.krys.ca/turbolucene/

:newfield api_version: API Version
:newfield revision: Revision

:group Objects to use in make_document: Document, Field, STORE, COMPRESS,
    TOKENIZED, UN_TOKENIZED
:group Public API: start, add, update, remove, search

"""

__author__ = 'Krys Wilken'
__contact__ = 'krys AT krys DOT ca'
__copyright__ = '(c) 2007 Krys Wilken'
__license__ = 'MIT'
__version__ = '0.2'
__api_version__ = '2.0'
__revision__ = '$Id: __init__.py 47 2007-04-01 22:36:05Z krys $'
__docformat__ = 'restructuredtext en'
__all__ = ['start', 'add', 'update', 'remove', 'search', 'Document', 'Field',
    'STORE', 'COMPRESS', 'TOKENIZED', 'UN_TOKENIZED']


from Queue import Queue
from os.path import exists, join, isabs
from logging import getLogger
from atexit import register
from codecs import EncodedFile, open as codecs_open

from turbogears import scheduler, config
from configobj import ConfigObj
from pkg_resources import resource_stream

import PyLucene
from PyLucene import (PythonThread, IndexModifier, JavaError, Term,
    IndexSearcher, MultiFieldQueryParser)
from PyLucene import Document, Field


_DEFAULT_LANGUAGE = 'en'

_log = getLogger('turbolucene')

_language_data = None

_indexer = None

_searcher_factory = None


STORE = Field.Store.YES

COMPRESS = Field.Store.COMPRESS

TOKENIZED = Field.Index.TOKENIZED

UN_TOKENIZED = Field.Index.UN_TOKENIZED


def _load_language_data():
    """Load all the language data from the configured languages file.

    The languages configuration file can be set with the
    ``turbolucene.languages_file`` configuration option and its encoding is
    set with ``turbolucene.languages_file_encoding``.

    Configuration file format
    =========================

    The languages file is an INI-type (ConfigObj_) file. Each section is
    named by an ISO language code (``en``, ``de``, ``el``, ``pt-br``, etc.).
    In each section the following keys are possible:

    **analyzer_class**:
        The PyLucene analyzer class to use for this language. (e.g.
        ``SnowballAnalyzer``) (Required)
    **analyzer_class_args**:
        Any arguments that should be passed to the analyzer class. (e.g.
        ``Danish``) (Optional)
    **stopwords**:
        A list of stopwords (words that do not get indexed) to pass to the
        analyzer class. This is not normally used, as ``stopwords_file`` is
        generally preferred. (Optional)
    **stopwords_file**:
        The path to the file that contains the list of stopwords to pass to
        the analyzer class. (e.g. ``stopwords_da.txt``) (Optional)
    **stopwords_file_encoding**:
        The encoding of the stopwords file. (e.g. ``windows-1252``)

    If neither ``stopwords`` nor ``stopwords_file`` is defined for a
    language, then any stopwords that are used are determined automatically
    by the analyzer class' constructor.

    Example
    -------

    ::

        # German
        [de]
        analyzer_class = SnowballAnalyzer
        analyzer_class_args = German2
        stopwords_file = stopwords_de.txt
        stopwords_file_encoding = windows-1252

    :Exceptions:
        - `IOError`: Raised if the languages configuration file could not be
          opened.
        - `configobj.ParseError`: Raised if the languages configuration file
          contains errors.

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `_read_stopwords` for details about stopwords files.

    .. _ConfigObj: http://www.voidspace.org.uk/python/configobj.html

    """
    global _language_data
    languages_file = config.get('turbolucene.languages_file', None)
    languages_file_encoding = config.get('turbolucene.languages_file_encoding',
        'utf-8')
    if languages_file:
        _log.info(u'Loading custom language data from "%s"' % languages_file)
    else:
        _log.info(u'Loading default language data')
        languages_file = resource_stream(__name__, u'languages.cfg')
    _language_data = ConfigObj(languages_file,
        encoding=languages_file_encoding, file_error=True, raise_errors=True)


def _schedule_optimization():
    """Schedule index optimization using the TurboGears scheduler.

    This function reads its configuration data from
    ``turbolucene.optimize_days`` and ``turbolucene.optimize_time``.

    :Exceptions:
        - `TypeError`: Raised if ``turbolucene.optimize_time`` is invalid.

    :See: `turbolucene` (module docstring) for details about configuration
        settings.

    """
    optimize_days = config.get('turbolucene.optimize_days', range(1, 8))
    optimize_time = config.get('turbolucene.optimize_time', (00, 00))
    scheduler.add_weekday_task(_optimize, optimize_days, optimize_time)
    _log.info(u'Index optimization scheduled on %s at %s' % (unicode(
        optimize_days), unicode(optimize_time)))


def _get_index_path(language):
    """Return the path to the index for the given language.

    This function gets its configuration data from
    ``turbolucene.index_root``.

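    For example, with the default ``turbolucene.index_root`` and on a POSIX
    system, ``_get_index_path('en')`` would return ``u'index/en'``.
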
    :Parameters:
        language : `str`
            An ISO language code. (e.g. ``en``, ``pt-br``, etc.)

    :Returns: The path to the index for the given language.
    :rtype: `unicode`

    :See: `turbolucene` (module docstring) for details about configuration
        settings.

    """
    index_base_path = config.get('turbolucene.index_root', u'index')
    return join(index_base_path, language)


def _read_stopwords(file_path, encoding):
    """Read the stopwords from the given stopwords file path.

    Stopwords are words that should not be indexed because they are too
    common or have no significant meaning (e.g. *the*, *in*, *with*, etc.)
    They are language dependent.

    This function gets its configuration data from
    ``turbolucene.stopwords_root``.

    If `file_path` is not an absolute path, then it will be appended to the
    path configured in ``turbolucene.stopwords_root``.

    Stopwords files are text files (in the given encoding), with one stopword
    per line. Comments are marked by a ``|`` character. This is for
    compatibility with the stopwords files found at
    http://snowball.tartarus.org/.
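
    For example, a stopwords file might contain lines like these (everything
    after the ``|`` is ignored)::

        the    | definite article
        in
        with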

    :Parameters:
        file_path : `unicode`
            The path to the stopwords file to read.
        encoding : `str`
            The encoding of the stopwords file.

    :Returns: The list of stopwords from the file.
    :rtype: `list` of `unicode` strings

    :Exceptions:
        - `IOError`: Raised if the stopwords file could not be opened.

    :See: `turbolucene` (module docstring) for details about configuration
        settings.

    """
    stopwords_base_path = config.get('turbolucene.stopwords_root', None)
    if isabs(file_path) or stopwords_base_path:
        if not isabs(file_path):
            file_path = join(stopwords_base_path, file_path)
        _log.info(u'Reading custom stopwords file "%s"' % file_path)
        stopwords_file = codecs_open(file_path, 'r', encoding)
    else:
        _log.info(u'Reading default stopwords file "%s"' % file_path)
        stopwords_file = EncodedFile(resource_stream(__name__, join(
            u'stopwords', file_path)), encoding)
    stopwords = []
    for line in stopwords_file:
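        # Everything before the first ``|`` is the (possibly empty) stopword;
        # the rest of the line is a comment.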
        stopword = line.split(u'|')[0].strip()
        if stopword:
            stopwords.append(stopword)
    stopwords_file.close()
    return stopwords


def _analyzer_factory(language):
    """Produce an analyzer object appropriate for the given language.

    This function uses the data that was read in from the languages
    configuration file to determine and instantiate the analyzer object.

    :Parameters:
        language : `str` or `unicode`
            An ISO language code that is configured in the languages
            configuration file.

    :Returns: An instance of the configured analyzer class for the given
        language.
    :rtype: ``PyLucene.Analyzer`` sub-class

    :Exceptions:
        - `KeyError`: Raised if the given language is not configured or if
          the configuration for that language does not have an
          *analyzer_class* key.
        - `PyLucene.InvalidArgsError`: Raised if any of the parameters passed
          to the analyzer class are invalid.

    :See: `_load_language_data` for details about the language configuration
        file.

    """
    ldata = _language_data[language]
    args = (u'analyzer_class_args' in ldata and ldata[u'analyzer_class_args']
        or [])
    if not isinstance(args, list):
        args = [args]
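    # Stopwords may be listed inline in the languages file or read from a
    # separate stopwords file; either way they are wrapped in a list so they
    # can be appended to the analyzer's constructor arguments.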
    stopwords = []
    if u'stopwords' in ldata and ldata[u'stopwords']:
        stopwords = [ldata[u'stopwords']]
    elif u'stopwords_file' in ldata and u'stopwords_file_encoding' in ldata:
        stopwords = [_read_stopwords(ldata[u'stopwords_file'],
            ldata[u'stopwords_file_encoding'])]
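    # Pass any stopwords on to the analyzer as its final constructor
    # argument.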
    args += stopwords
    return getattr(PyLucene, ldata[u'analyzer_class'])(*args)


def _stop():
    """Shut down the indexer and searcher factory threads."""
    # Note: assumed shutdown sequence -- send the ``stop`` task to the
    # indexer and stop the searcher factory singleton (see `_Indexer` and
    # `_SearcherFactory`).
    _indexer('stop')
    _searcher_factory.stop()


def _optimize():
    """Tell the search engine to optimize its indexes."""
    _indexer('optimize')


def start(make_document, results_formatter=None):
    """Initialize and start the search engine threads.

    This function loads the language configuration information, starts the
    search engine threads, makes sure the search engine will be shut down
    when TurboGears shuts down, and starts the optimization scheduler to run
    at the configured times.

    The `make_document` and `results_formatter` parameters are callables.
    Here are examples of how they should be defined:

    Example `make_document` function:
    =================================

    .. python::

        def make_document(entry):
            '''Make a new PyLucene Document instance from an Entry instance.'''
            document = Document()
            # An 'id' string field is required.
            document.add(Field('id', str(entry.id), STORE, UN_TOKENIZED))
            document.add(Field('posted_on', entry.rendered_posted_on, STORE,
                TOKENIZED))
            document.add(Field('title', entry.title, STORE, TOKENIZED))
            document.add(Field('text', strip_tags(entry.etree), COMPRESS,
                TOKENIZED))
            categories = ' '.join([unicode(category) for category in
                entry.categories])
            document.add(Field('category', categories, STORE, TOKENIZED))
            return document

    Example `results_formatter` function:
    =====================================

    .. python::

        def results_formatter(results):
            '''Return the results as SQLObject instances.

            Returns either an empty list or a SelectResults object.

            '''
            if results:
                return Entry.select_with_identity(IN(Entry.q.id, [int(id) for id
                    in results]))

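    Wiring them up:
    ===============

    A minimal sketch of hooking these into application start-up (where and
    how you call `start` is up to your application):

    .. python::

        import turbolucene

        turbolucene.start(make_document, results_formatter)
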
    :Parameters:
        make_document : callable
            `make_document` is a callable that will return a PyLucene
            `Document` object based on the object passed in to `add`,
            `update` or `remove`. The `Document` object must have at least a
            field called ``id`` that is a string. This function operates
            inside a PyLucene ``PythonThread``.
        results_formatter : callable
            `results_formatter`, if provided, is a callable that will return
            a formatted version of the search results that are passed to it
            by `_Searcher.__call__`. Generally the `results_formatter` will
            take the list of ``id`` strings that is passed to it and return
            a list of application-specific objects (like SQLObject_
            instances, for example.) This function operates outside of any
            PyLucene ``PythonThread`` objects (like in the CherryPy thread,
            for example). (Optional)

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `_load_language_data` for details about the language configuration
          file.

    .. _SQLObject: http://sqlobject.org/

    """
    _load_language_data()
    global _indexer, _searcher_factory
    _indexer = _Indexer(make_document)
    _searcher_factory = _SearcherFactory(results_formatter)
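    # Make sure the search engine threads are shut down when TurboGears
    # exits.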
    register(_stop)
    _schedule_optimization()
    _log.info(u'Search engine started.')


def add(object_, language=None):
    """Tell the search engine to add the given object to the index.

    This function returns immediately. It does not wait for the indexer to
    finish.

    :Parameters:
        `object_`
            This can be any object that ``make_document`` knows how to
            handle.
        language : `str`
            This is the ISO language code of the language of the object. If
            `language` is given, then it must be one that was previously
            configured in ``turbolucene.languages``. If `language` is not
            given, then the language configured in
            ``turbolucene.default_language`` will be used. (Optional)

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `start` for details about ``make_document``.

    """
    _indexer('add', object_, language)


def update(object_, language=None):
    """Tell the search engine to update the index for the given object.

    This function returns immediately. It does not wait for the indexer to
    finish.

    :Parameters:
        `object_`
            This can be any object that ``make_document`` knows how to
            handle.
        language : `str`
            This is the ISO language code of the language of the object. If
            `language` is given, then it must be one that was previously
            configured in ``turbolucene.languages``. If `language` is not
            given, then the language configured in
            ``turbolucene.default_language`` will be used. (Optional)

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `start` for details about ``make_document``.

    """
    _indexer('update', object_, language)


def remove(object_, language=None):
    """Tell the search engine to remove the given object from the index.

    This function returns immediately. It does not wait for the indexer to
    finish.

    :Parameters:
        `object_`
            This can be any object that ``make_document`` knows how to
            handle.
        language : `str`
            This is the ISO language code of the language of the object. If
            `language` is given, then it must be one that was previously
            configured in ``turbolucene.languages``. If `language` is not
            given, then the language configured in
            ``turbolucene.default_language`` will be used. (Optional)

    :See:
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - `start` for details about ``make_document``.

    """
    _indexer('remove', object_, language)


def search(query, language=None):
    """Return results from the search engine that match the query.

    If a ``results_formatter`` function was passed to `start` then the
    results will be passed through the formatter before being returned. If
    not, the returned value is a list of strings that are the ``id`` fields
    of matching objects.

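    Example (a minimal sketch, assuming no ``results_formatter`` was given to
    `start`, so raw ``id`` strings come back; the query is illustrative):

    .. python::

        for entry_id in search(u'title:lucene AND threads'):
            print entry_id
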
    :Parameters:
        query : `str` or `unicode`
            This is the search query to give to PyLucene. All of Lucene's
            query syntax (field identifiers, wildcards, etc.) is available.
        language : `str`
            This is the ISO language code of the index to search. If
            `language` is given, then it must be one that was previously
            configured in ``turbolucene.languages``. If `language` is not
            given, then the language configured in
            ``turbolucene.default_language`` will be used. (Optional)

    :Returns: The results of the search.
    :rtype: iterable

    :See:
        - `start` for details about ``results_formatter``.
        - `turbolucene` (module docstring) for details about configuration
          settings.
        - http://lucene.apache.org/java/docs/queryparsersyntax.html for
          details about Lucene's query syntax.

    """
    return _searcher_factory()(query, language)


class _Indexer(PythonThread):

    """Responsible for updating and maintaining the search engine index.

    A single `_Indexer` thread is created to handle all index modifications.

    Once the thread is started, messages are sent to it by calling the
    instance with a task and an object, where the task is one of the
    following strings:

    - ``add``: Adds the object to the index.
    - ``remove``: Removes the object from the index.
    - ``update``: Updates the index entry for an object.

    and the object is any object that ``make_document`` knows how to handle.

    To properly shut down the thread, send the ``stop`` task with `None` as
    the object. (This is normally handled by the `turbolucene._stop`
    function.)

    To optimize the index, which can take a while, pass the ``optimize`` task
    with `None` for the object. (This is normally handled by the TurboGears
    scheduler as set up by `_schedule_optimization`.)
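
    For example (illustrative only; ``entry`` is whatever your
    ``make_document`` callable accepts):

    .. python::

        indexer = _Indexer(make_document)
        indexer('add', entry)           # index a new object
        indexer('update', entry, 'en')  # re-index it in the English index
        indexer('optimize')             # compact the indexes
        indexer('stop')                 # shut the thread down and wait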

    :See: `turbolucene.start` for details about ``make_document``.

    :group Public API: __init__, __call__
    :group Threaded methods: run, _add, _remove, _update, _optimize, _stop

    """

    def __init__(self, make_document):
        """Initialize the message queue and the PyLucene indexes.

        One PyLucene index is created/opened for each of the configured
        supported languages.

        This method uses the ``turbolucene.default_language`` and
        ``turbolucene.languages`` configuration settings.

        :Parameters:
            make_document : callable
                A callable that takes the object to index as a parameter and
                returns an appropriate `Document` object.

        :Note: Instantiating this class starts the thread automatically.

        :See:
            - `turbolucene` (module docstring) for details about
              configuration settings.
            - `turbolucene.start` for details about ``make_document``.
            - `_get_index_path` for details about the directory location of
              each index.
            - `_analyzer_factory` for details about the analyzer used for
              each index.

        """
        PythonThread.__init__(self)
        self._make_document = make_document
        self._task_queue = Queue()
        self._indexes = {}
        default_language = config.get('turbolucene.default_language',
            _DEFAULT_LANGUAGE)
        languages = config.get('turbolucene.languages', [default_language])
        for language in languages:
            index_path = _get_index_path(language)
            self._indexes[language] = IndexModifier(index_path,
                _analyzer_factory(language), not exists(index_path))
        self.start()

    def __call__(self, task, object_=None, language=None):
        """Pass `task`, `object_` and `language` to the thread for processing.

        If `language` is `None`, then the default language configured in
        ``turbolucene.default_language`` is used.

        If `task` is ``stop``, then the `_Indexer` thread is shut down and
        this method will wait until the shutdown is complete.

        :Parameters:
            task : `str`
                The task to perform.
            `object_`
                Any object that ``make_document`` knows how to handle.
                (Default: `None`)
            language : `str`
                The ISO language code of the language of the object. This
                specifies which PyLucene index to use.

        :See:
            - `turbolucene` (module docstring) for details about
              configuration settings.
            - `turbolucene.start` for details about ``make_document``.

        """
        if not language:
            language = config.get('turbolucene.default_language',
                _DEFAULT_LANGUAGE)
        self._task_queue.put((task, object_, language))
        if task == 'stop':
            self.join()

    def run(self):
        """Main thread loop that dispatches based on messages in the queue.

        This method expects that the queue will contain 3-tuples in the form
        of (task, object, language), where the task is one of ``add``,
        ``update``, ``remove``, ``optimize`` or ``stop``, the object is any
        object that ``make_document`` can handle (or `None` in the case of
        ``optimize`` and ``stop``), and the language is the ISO language code
        of the index to use.

        If the task is ``stop``, then the thread shuts down.

        :Note: This method is run in the thread.

        :See:
            - `_add`, `_update`, `_remove`, `_optimize` and `_stop` for
              details about each respective task.
            - `turbolucene.start` for details about ``make_document``.

        """
        while True:
            task, object_, language = self._task_queue.get()
            method = getattr(self, '_' + task)
            if task in ('optimize', 'stop'):
                method()
            else:
                method(object_, language)
            if task == 'stop':
                break
            self._indexes[language].flush()

    def _add(self, object_, language, document=None):
        """Add a new object to the index.

        If `document` is not provided, then this method passes the object off
        to ``make_document`` and then indexes the resulting `Document`
        object. Otherwise it just indexes the given `document` object.

        :Parameters:
            `object_`
                The object to be indexed. It will be passed to
                ``make_document`` (unless `document` is provided).
            language : `str`
                The ISO language code of the index to use.
            document : `Document`
                A pre-built `Document` object for the given object, if it
                exists. This is used internally by `_update`. (Default:
                `None`)

        :Note: This method is run in the thread.

        :See: `turbolucene.start` for details about ``make_document``.

        """
        if not document:
            document = self._make_document(object_)
        _log.info(u'Adding object "%s" (id %s) to the %s index.' % (unicode(
            object_), document['id'], language))
        self._indexes[language].addDocument(document)

    def _remove(self, object_, language, document=None):
        """Remove an object from the index.

        If `document` is not provided, then this method passes the object off
        to ``make_document`` and then removes the resulting `Document` object
        from the index. Otherwise it just removes the given `document`
        object.

        :Parameters:
            `object_`
                The object to be removed from the index. It will be passed
                to ``make_document`` (unless `document` is provided).
            language : `str`
                The ISO language code of the index to use.
            document : `Document`
                A pre-built `Document` object for the given object, if it
                exists. This is used internally by `_update`. (Default:
                `None`)

        :Note: This method is run in the thread.

        :See: `turbolucene.start` for details about ``make_document``.

        """
        if not document:
            document = self._make_document(object_)
        _log.info(u'Removing object "%s" (id %s) from %s index.' % (unicode(
            object_), document['id'], language))
        self._indexes[language].deleteDocuments(Term('id', document['id']))

    def _update(self, object_, language):
        """Update an object in the index by replacing it.

        This method updates the index by removing and then re-adding the
        object.

        :Parameters:
            `object_`
                The object to update in the index. It will be passed to
                ``make_document`` and the resulting `Document` object will
                be updated.
            language : `str`
                The ISO language code of the index to use.

        :Note: This method is run in the thread.

        :See:
            - `_remove` and `_add` for details about the removal and
              re-addition.
            - `turbolucene.start` for details about ``make_document``.

        """
        document = self._make_document(object_)
        self._remove(object_, language, document)
        self._add(object_, language, document)

    def _optimize(self):
        """Optimize all of the indexes. This can take a while.

        :Note: This method is run in the thread.

        """
        _log.info(u'Optimizing indexes.')
        for index in self._indexes.values():
            index.optimize()
        _log.info(u'Indexes optimized.')

    def _stop(self):
        """Shut down all of the indexes.

        :Note: This method is run in the thread.

        """
        for index in self._indexes.values():
            index.close()


class _Searcher(PythonThread):

    """Responsible for searching an index and returning results.

    `_Searcher` threads are created for each search that is requested. After
    the search is completed, the thread dies.

    To search, a `_Searcher` class is instantiated and then called with the
    query and the ISO language code for the index to search. It returns the
    results as a list of object id strings unless ``results_formatter`` was
    provided. If it was, then the list of id strings is passed to
    ``results_formatter`` to process and its results are returned.

    The thread is garbage collected when it goes out of scope.

    The catch to all this is that a CherryPy thread cannot directly
    instantiate a `_Searcher` thread because of PyLucene restrictions. To
    get around that, see the `_SearcherFactory` class.

    :See: `turbolucene.start` for details about ``results_formatter``.

    :group Public API: __init__, __call__
    :group Threaded methods: run

    """

    def __init__(self, results_formatter=None):
        """Initialize message queues and start the thread.

        :Note: The thread is started as soon as the class is instantiated.

        """
        PythonThread.__init__(self)
        self._results_formatter = results_formatter
        self._query_queue = Queue()
        self._results_queue = Queue()
        self.start()

    def __call__(self, query, language=None):
        """Send `query` and `language` to the thread, wait and return results.

        If `language` is `None`, then the default language configured in
        ``turbolucene.default_language`` is used.

        :Parameters:
            query : `str` or `unicode`
                The search query to give to PyLucene. All of Lucene's query
                syntax (field identifiers, wildcards, etc.) is available.
            language : `str`
                The ISO language code of the index to search.

        :Returns: An iterable of id field strings that match the query, or
            the results produced by ``results_formatter`` if it was provided.
        :rtype: iterable

        :See:
            - `turbolucene` (module docstring) for details about
              configuration settings.
            - `turbolucene.start` for details about ``results_formatter``.
            - http://lucene.apache.org/java/docs/queryparsersyntax.html for
              details about Lucene's query syntax.

        """
        if not language:
            language = config.get('turbolucene.default_language',
                _DEFAULT_LANGUAGE)
        self._query_queue.put((query, language))
        results = self._results_queue.get()
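        # Give the application a chance to turn the raw id strings into
        # something more useful (e.g. database objects).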
        if self._results_formatter:
            return self._results_formatter(results)
        return results

    def run(self):
        """Search the language index for the query and send back the results.

        The result is an iterable of id field strings that match the query.

        This method uses the ``turbolucene.search_fields`` configuration
        setting for the default fields to search if none are specified in
        the query itself, and ``turbolucene.default_operator`` for the
        default operator to use when joining terms.

        :Exceptions:
            - `AttributeError`: Raised when the configured default operator
              is not valid.

        :Note: This method is run in the thread.

        :Note: The thread dies after one search.

        :See:
            - `turbolucene` (module docstring) for details about
              configuration settings.
            - `_get_index_path` for details about the directory location of
              the index.
            - `_analyzer_factory` for details about the analyzer used for
              the index.
            - http://lucene.apache.org/java/docs/queryparsersyntax.html for
              details about Lucene's query syntax.

        """
        query, language = self._query_queue.get()
        searcher = IndexSearcher(_get_index_path(language))
        search_fields = config.get('turbolucene.search_fields', ['id'])
        parser = MultiFieldQueryParser(search_fields, _analyzer_factory(
            language))
        default_operator = getattr(parser.Operator, config.get(
            'turbolucene.default_operator', 'AND').upper())
        parser.setDefaultOperator(default_operator)
        try:
            hits = searcher.search(parser.parse(query))
            results = [document['id'] for _, document in hits]
        except JavaError:
            results = []
        self._results_queue.put(results)
        searcher.close()


class _SearcherFactory(PythonThread):

    """Produces running `_Searcher` threads.

    ``PythonThread`` threads can only be started by the main program or by
    other ``PythonThread`` threads, so this ``PythonThread``-based class
    creates and starts single-use `_Searcher` threads. This thread is
    created and started by the main program during TurboGears initialization
    as a singleton.

    To get a `_Searcher` thread, call the `_SearcherFactory` instance. Then
    pass the query to the `_Searcher` thread that was returned.

    :group Public API: __init__, __call__, stop
    :group Threaded methods: run

    """

    def __init__(self, *searcher_args, **searcher_kwargs):
        """Initialize message queues and start the thread.

        :Note: The thread is started as soon as the class is instantiated.

        """
        PythonThread.__init__(self)
        self._searcher_args = searcher_args
        self._searcher_kwargs = searcher_kwargs
        self._request_queue = Queue()
        self._searcher_queue = Queue()
        self.start()

    def __call__(self):
        """Send a request for a running `_Searcher` instance, then return it.

        :Returns: A running instance of the `_Searcher` class.
        :rtype: `_Searcher`

        """
        self._request_queue.put('request')
        return self._searcher_queue.get()

    def stop(self):
        """Stop the `_SearcherFactory` thread."""
        self._request_queue.put('stop')
        self.join()

    def run(self):
        """Listen for requests and create `_Searcher` instances.

        If the request message is ``stop``, then the thread will be shut
        down.

        :Note: This method is run in the thread.

        """
        while True:
            request = self._request_queue.get()
            if request == 'stop':
                break
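            # Each `_Searcher` is single-use, so hand a freshly started one
            # back to the caller.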
            self._searcher_queue.put(_Searcher(
                *self._searcher_args, **self._searcher_kwargs))
1029