corpora.dictionary – Construct word<->id mappings

This module implements the concept of Dictionary – a mapping between words and their integer ids.

Dictionaries can be created from a corpus and can later be pruned according to document frequency (removing (un)common words via the Dictionary.filterExtremes() method), saved to/loaded from disk (via the Dictionary.save() and Dictionary.load() methods), and so on.
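As a rough end-to-end sketch (the texts and thresholds below are made up for illustration; the constructor is assumed to forward the documents to addDocuments, as its documents argument suggests):

>>> from gensim.corpora.dictionary import Dictionary
>>> texts = [doc.split() for doc in ["human computer interface", "graph of trees", "human tree graph"]]
>>> dictionary = Dictionary(texts)                      # collect word <-> id mappings from the corpus
>>> dictionary.filterExtremes(noBelow=2, noAbove=0.8)   # prune rare and overly common words
>>> bow = dictionary.doc2bow("human graph".split())     # convert a new document to bag-of-words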

class gensim.corpora.dictionary.Dictionary(documents=None)

Dictionary encapsulates mappings between normalized words and their integer ids.

The main function is doc2bow, which converts a collection of words to its bag-of-words representation, optionally also updating the dictionary mapping with newly encountered words and their ids.

addDocuments(documents)

Build a dictionary from a collection of documents. Each document is a list of tokens (i.e. tokenized and normalized utf-8 encoded strings).

This is only a convenience wrapper for calling doc2bow on each document with allowUpdate=True.

>>> print Dictionary.fromDocuments(["máma mele maso".split(), "ema má máma".split()])
Dictionary(5 unique tokens)
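The dictionary can also be grown incrementally, one batch of documents at a time; for example (the printed count follows from the five distinct tokens in the two documents):

>>> dictionary = Dictionary()
>>> dictionary.addDocuments(["máma mele maso".split()])
>>> dictionary.addDocuments(["ema má máma".split()])
>>> print dictionary
Dictionary(5 unique tokens)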
doc2bow(document, allowUpdate=False)

Convert document (a list of words) into bag-of-words format: a list of (tokenId, tokenCount) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string.

If allowUpdate is set, also update the dictionary in the process: create ids for new words. At the same time, update document frequencies: for each word appearing in this document, increase its self.docFreq by one.

If allowUpdate is not set, this function is const, i.e. read-only: the dictionary is left unchanged.
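For example (the ids shown in the comments are placeholders; the actual ids depend on the order in which tokens were first seen):

>>> dictionary = Dictionary(["máma mele maso".split()])
>>> bow = dictionary.doc2bow("mele mele maso".split())      # e.g. [(1, 2), (2, 1)]; exact ids may differ
>>> bow = dictionary.doc2bow("mele kouzlo".split(), allowUpdate=True)  # 'kouzlo' is new and receives a fresh id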

filterExtremes(noBelow=5, noAbove=0.5, keepN=None)

Filter out tokens that appear in

  1. less than noBelow documents (absolute number), or
  2. more than noAbove documents (fraction of total corpus size, not absolute number);
  3. after (1) and (2), keep only the first keepN most frequent tokens (or keep all if keepN is None).

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
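A usage sketch (the thresholds are arbitrary; note that any previously computed bag-of-words vectors become stale once ids are reassigned):

>>> dictionary.filterExtremes(noBelow=20, noAbove=0.1, keepN=50000)
>>> print dictionary   # prints the new, smaller vocabulary size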

filterTokens(badIds=None, goodIds=None)

Remove the selected badIds tokens from all dictionary mappings, or keep only the selected goodIds in the mapping and remove the rest.

badIds is a collection of word ids to be removed; goodIds is a collection of word ids to be kept.
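For example, to drop a stoplist of words by id (the stoplist is made up; token2id is assumed to be the word -> id mapping attribute maintained by the dictionary):

>>> stoplist = set("a and of the".split())
>>> stop_ids = [dictionary.token2id[word] for word in stoplist if word in dictionary.token2id]
>>> dictionary.filterTokens(badIds=stop_ids)   # remove the stopword ids from all mappings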

classmethod load(fname)

Load a previously saved object from file (also see save).
rebuildDictionary()

Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filterTokens() and there are gaps in the id series; calling this method will remove the gaps.
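A sketch of the intended call sequence (continuing the hypothetical stopword example above):

>>> dictionary.filterTokens(badIds=stop_ids)   # removal leaves gaps in the id series
>>> dictionary.rebuildDictionary()             # reassign ids into a contiguous range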

save(fname)

Save the object to file via pickling (also see load).
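A minimal round-trip sketch (the file name is arbitrary):

>>> dictionary.save('/tmp/deerwester.dict')               # pickle the dictionary to disk
>>> dictionary = Dictionary.load('/tmp/deerwester.dict')  # restore it later, e.g. in another process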