This module contains basic interfaces used throughout the whole gensim package.
The interfaces are realized as abstract base classes (i.e., some optional functionality is provided in the interface itself, so that the interfaces can be subclassed).
Interface for corpora. A corpus is simply an iterable, where each iteration step yields one document. A document is a list of (fieldId, fieldValue) 2-tuples.
Note that although a default len() method is provided, it is very inefficient: it performs a linear scan through the corpus to determine its length. Wherever the corpus size is needed and is known in advance (or at least does not change, so that it can be cached), the len() method should be overridden.
See the gensim.corpora.mmcorpus module for an example of a corpus.
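The corpus contract described above can be sketched in plain Python. This is a minimal illustration, not gensim's own code: the class name TinyCorpus is hypothetical, and it assumes the whole corpus fits in memory as a list, which is what makes a constant-time len() possible.

```python
class TinyCorpus:
    """Hypothetical in-memory corpus; an illustration, not part of gensim."""

    def __init__(self, documents):
        # Each document is a list of (fieldId, fieldValue) 2-tuples.
        self.documents = list(documents)

    def __iter__(self):
        # Each iteration step yields one document.
        for document in self.documents:
            yield document

    def __len__(self):
        # The size is known in advance, so no linear scan is needed.
        return len(self.documents)


corpus = TinyCorpus([
    [(0, 1.0), (2, 3.0)],
    [(1, 2.0)],
    [],  # an empty document is still a document
])

# What a linear-scan length computation amounts to, for comparison.
scanned_length = sum(1 for _ in corpus)
```

Here len(corpus) and scanned_length agree, but only the latter has to walk through every document.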
Abstract interface for similarity searches over a corpus.
In all instances, there is a corpus against which we want to perform the similarity search.
For each similarity search, the input is a document and the output is its similarities to the individual corpus documents.
Similarity queries are realized by calling self[query_document].
There is also a convenience wrapper, where iterating over self yields similarities of each document in the corpus against the whole corpus (i.e., the query is each corpus document in turn).
Return similarity of a sparse vector doc to all documents in the corpus.
The document is assumed to be either of unit length or empty.
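The query protocol above might look as follows in a toy implementation. This is a sketch under stated assumptions: TinySimilarity and the helper unit() are hypothetical names, the similarity measure is assumed to be a plain dot product (which equals cosine similarity for unit-length vectors), and the corpus is held in memory.

```python
import math


def unit(doc):
    """Hypothetical helper: scale a sparse document to unit length."""
    norm = math.sqrt(sum(value * value for _, value in doc))
    if norm == 0.0:
        return []
    return [(field_id, value / norm) for field_id, value in doc]


class TinySimilarity:
    """Hypothetical similarity index; an illustration, not part of gensim."""

    def __init__(self, corpus):
        # Store unit-length copies of the corpus documents.
        self.corpus = [unit(doc) for doc in corpus]

    def __getitem__(self, query):
        # The query is assumed to be of unit length or empty.
        weights = dict(query)
        return [
            sum((weights.get(field_id, 0.0) * value for field_id, value in doc), 0.0)
            for doc in self.corpus
        ]

    def __iter__(self):
        # Convenience wrapper: each corpus document is the query in turn.
        for doc in self.corpus:
            yield self[doc]


index = TinySimilarity([[(0, 3.0), (1, 4.0)], [(0, 1.0)], []])
sims = index[unit([(0, 2.0)])]  # one similarity per corpus document
all_sims = list(index)          # one row of similarities per corpus document
```

An empty document has zero similarity to everything, including itself, which is why the interface allows empty queries alongside unit-length ones.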
Interface for transformations. A ‘transformation’ is any object which accepts a sparse document via the dictionary notation [] and returns another sparse document in its stead.
See the gensim.models.tfidfmodel module for an example of a transformation.
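As a rough illustration of the [] convention, the sketch below defines a transformation that rescales a sparse document to unit length. The class name UnitLengthTransformation is hypothetical and the normalization is chosen only for brevity; it is not a description of how the tf-idf model works.

```python
import math


class UnitLengthTransformation:
    """Hypothetical transformation; an illustration, not part of gensim."""

    def __getitem__(self, doc):
        # Accept one sparse document and return another sparse document.
        norm = math.sqrt(sum(value * value for _, value in doc))
        if norm == 0.0:
            return []
        return [(field_id, value / norm) for field_id, value in doc]


to_unit = UnitLengthTransformation()
unit_doc = to_unit[[(0, 3.0), (1, 4.0)]]  # the transformation is applied with []
```

Because the output is again a sparse document, the result of one transformation can be passed to another, or used as a unit-length query in a similarity search.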