This module contains various general utility functions.
Objects which inherit from this class have save/load functions, which pickle them to disk and unpickle them from it again.
This uses cPickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions.
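For illustration, a minimal sketch of what such a mixin could look like, assuming plain pickle-to-file semantics (the class shown here is a simplified stand-in written for Python 3, not this module's actual implementation):

    import pickle  # cPickle on Python 2

    class SaveLoadSketch(object):
        """Mixin: inheriting objects gain save/load persistence via pickling."""

        def save(self, fname):
            # serialize the whole object; fails if any attribute is unpicklable
            with open(fname, 'wb') as fout:
                pickle.dump(self, fout, protocol=pickle.HIGHEST_PROTOCOL)

        @classmethod
        def load(cls, fname):
            # deserialize a previously saved object from disk
            with open(fname, 'rb') as fin:
                return pickle.load(fin)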
Remove accentuation from the given string.
Input text is either a unicode string or a UTF-8 encoded bytestring. Return the input string with accents removed, as unicode.
>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
u'Sef chomutovskych komunistu dostal postou bily prasek'
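One common way to achieve this is Unicode NFD decomposition followed by stripping the combining marks; the sketch below only illustrates that idea (the function name is hypothetical and the real implementation may differ in details):

    import unicodedata

    def deaccent_sketch(text):
        # accept utf8 bytestrings as well as unicode strings
        if isinstance(text, bytes):
            text = text.decode('utf8')
        # decompose accented characters, drop the combining accent marks, recompose
        norm = unicodedata.normalize('NFD', text)
        stripped = ''.join(ch for ch in norm if not unicodedata.combining(ch))
        return unicodedata.normalize('NFC', stripped)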
Scan the corpus for all word ids that appear in it, then construct and return a mapping which maps each wordId -> str(wordId).
This function is used whenever words need to be displayed (as opposed to just their ids) but no wordId->word mapping was provided. The resulting mapping only covers words actually used in the corpus, up to the highest wordId found.
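A rough sketch of how such a fallback mapping could be built (the helper name is hypothetical; it is shown only to illustrate the behaviour described above):

    def fake_id2word(corpus):
        # find the highest word id used anywhere in the corpus
        max_id = max(
            (word_id for document in corpus for word_id, _ in document),
            default=-1,
        )
        # map every id up to that maximum to its own string representation
        return {word_id: str(word_id) for word_id in range(max_id + 1)}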
Check whether obj is a corpus.
NOTE: when called on an empty corpus (one containing no documents), this will return False.
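The check can be approximated by peeking at the first document and verifying that it is a sequence of 2-tuples. The sketch below is a simplified, hypothetical version, not this module's actual code; note that peeking consumes one element if obj is a generator, which a real implementation has to account for:

    def looks_like_corpus(obj):
        # a corpus is an iterable of documents, each a sequence of (word_id, weight) pairs
        try:
            first_doc = next(iter(obj))
        except (TypeError, StopIteration):
            return False  # not iterable, or empty => not considered a corpus
        try:
            return all(len(entry) == 2 for entry in first_doc)
        except TypeError:
            return False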
Iteratively yield tokens as unicode strings, optionally also lowercasing them and removing accent marks.
Input text may be either unicode or utf8-encoded byte string.
The tokens on output are maximal contiguous sequences of alphabetic characters (no digits!).
>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc=True))
[u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']
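Putting the pieces together, a simplified sketch of such a tokenizer could look like the following (the regex and function name are illustrative assumptions, not this module's actual code):

    import re
    import unicodedata

    # maximal runs of alphabetic characters: word characters minus digits and underscore
    PAT_ALPHABETIC = re.compile(r'[^\W\d_]+', re.UNICODE)

    def tokenize_sketch(text, lowercase=False, deacc=False):
        if isinstance(text, bytes):
            text = text.decode('utf8')  # accept utf8 bytestrings
        if lowercase:
            text = text.lower()
        if deacc:
            # strip accents via NFD decomposition, as in the deaccent sketch above
            norm = unicodedata.normalize('NFD', text)
            text = ''.join(ch for ch in norm if not unicodedata.combining(ch))
        for match in PAT_ALPHABETIC.finditer(text):
            yield match.group()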