models.ldamodel – Latent Dirichlet Allocation

This module encapsulates functionality for the Latent Dirichlet Allocation algorithm.

It allows both model estimation from a training corpus and inference on new, unseen documents.

The implementation is based on Blei et al., Latent Dirichlet Allocation, 2003, and on Blei’s LDA-C software [1] in particular. This means it uses variational EM inference rather than Gibbs sampling to estimate model parameters. NumPy is used heavily here, but the code is still much slower than the original C version. The upside is that training is streamed (documents come in sequentially, no random access), runs in constant memory w.r.t. the number of documents (input corpus size) and is distributed (makes use of a cluster of machines, if available).

[1] http://www.cs.princeton.edu/~blei/lda-c/
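
A typical streamed workflow might look like the following sketch (the file names are hypothetical; it assumes the corpus was previously serialized in Matrix Market format, together with a Dictionary word-id mapping):

>>> from gensim import corpora, models
>>> mm = corpora.MmCorpus('/tmp/corpus.mm') # documents are streamed from disk, one at a time
>>> id2word = corpora.Dictionary.load('/tmp/vocab.dict') # mapping between word ids and words
>>> lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, numTopics=100)
>>> doc_bow = [(0, 1), (3, 2)] # a new, unseen document as (wordId, wordCount) pairs
>>> print lda[doc_bow] # its inferred topic distribution
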
class gensim.models.ldamodel.LdaModel(corpus=None, numTopics=200, id2word=None, distributed=False, chunks=None, alpha=None, initMode='random', dtype=<type 'numpy.float64'>)

Objects of this class allow building and maintaining a model of Latent Dirichlet Allocation.

The constructor estimates model parameters based on a training corpus:

>>> lda = LdaModel(corpus, numTopics=10)

You can then infer topic distributions on new, unseen documents, with:

>>> doc_lda = lda[doc_bow]

Model persistence is achieved via its load/save methods.

numTopics is the number of requested latent topics to be extracted from the training corpus.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.

initMode can be either ‘random’, for a fast random initialization of the model parameters, or ‘seeded’, for an initialization based on a handful of real documents. The ‘seeded’ mode requires an extra sweep over the entire input corpus, and is thus much slower.

alpha is either None (to be estimated during training) or a number in the interval (0.0, 1.0).

Turn on distributed to force distributed computing (see the web tutorial on how to set up a cluster).

Example:

>>> lda = LdaModel(corpus, numTopics=100)
>>> print lda[doc_tfidf] # get topic probability distribution for a document
>>> lda.addDocuments(corpus2) # update LDA with additional documents
>>> print lda[doc_tfidf]
addDocuments(corpus, chunks=None)

Run LDA parameter estimation on a training corpus, using the EM algorithm.

This effectively updates the underlying LDA model on new documents from corpus (or initializes the model if this is the first call).
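
A minimal sketch of incremental training, assuming the model was constructed with an id2word mapping but no corpus (corpus=None), and that more_documents is another streamed corpus prepared elsewhere:

>>> lda = LdaModel(numTopics=100, id2word=id2word) # no corpus yet: model not estimated
>>> lda.addDocuments(corpus) # first call: estimates the model parameters
>>> lda.addDocuments(more_documents) # later calls: update the model with new documents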

computeLikelihood(doc, phi, gamma)
Compute the document likelihood, given all model parameters.
countsFromCorpus(corpus, numInitDocs=1)
Initialize the model word counts from the corpus. Each topic will be initialized from numInitDocs randomly selected documents. The corpus is only iterated over once.
docEStep(corpus)
Find optimizing parameters for phi and gamma, and update sufficient statistics.
inference(doc)

Perform inference on a single document.

Return a 3-tuple of (likelihood of this document, word-topic distribution phi, expected word counts gamma (~topic distribution)).

A document is simply a bag-of-words collection which supports len() and iteration over (wordIndex, wordCount) 2-tuples.

The model itself is not affected in any way (this function is read-only aka const).
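
For illustration, a tiny hypothetical document and the returned triple:

>>> doc = [(0, 3), (4, 1), (7, 2)] # bag-of-words: (wordIndex, wordCount) 2-tuples
>>> likelihood, phi, gamma = lda.inference(doc)
>>> print gamma # expected counts, roughly proportional to the document's topic distribution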

classmethod load(fname)
Load a previously saved object from file (also see save).
mle(estimateAlpha)

Maximum likelihood estimate.

This maximizes the lower bound on log likelihood with respect to the alpha and beta parameters.

optAlpha(max_iter=1000, newton_thresh=1.0000000000000001e-05)
Estimate new topic priors (actually just one scalar shared across all topics).
printTopics(numTopics=5, numWords=10, pretty=True)

Print the top numWords words for numTopics topics, along with the log of their probability.

If pretty is set, use probs2scores() to determine what the ‘top words’ are. Otherwise, order the words directly by their word-topic probability.
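
For example, to print the five top words of three topics on a trained model:

>>> lda.printTopics(numTopics=3, numWords=5)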

probs2scores(numTopics=10)

Transform topic-word probability distribution into more human-friendly scores, in hopes these scores make topics more easily interpretable.

The transformation is a sort of TF-IDF score, where a word gets a higher score if it is probable within this topic (the TF part) and a lower score if it is probable across all topics (the IDF part).

The exact formula is taken from Blei & Lafferty: “Topic Models”, 2009.

The numTopics transformed scores are yielded iteratively, one topic after another.
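
Because the scores are yielded lazily, they can be consumed one topic at a time, for example:

>>> for scores in lda.probs2scores(numTopics=5):
...     print scores # one vector of transformed word scores per topic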

save(fname)
Save the object to file via pickling (also see load).
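
A short persistence round-trip (the file name is hypothetical):

>>> lda.save('/tmp/model.lda')
>>> lda = LdaModel.load('/tmp/model.lda')
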
class gensim.models.ldamodel.LdaState

Encapsulate information returned by distributed computation of the E training step.

Objects of this class are sent over the network at the end of each corpus iteration, so try to keep this class lean to reduce traffic.

classmethod load(fname)
Load a previously saved object from file (also see save).
merge(other)

Merge the result of an E step from one node with that of another node.

The merging is trivial: after merging the results from all cluster nodes, we have exactly the same result as if the computation had run on a single node (no approximation).

reset(mat=None)
Prepare the state for a new iteration.
save(fname)
Save the object to file via pickling (also see load).
gensim.models.ldamodel.dirichlet_expectation(alpha)
For a vector theta ~ Dir(alpha), computes E[log(theta)] given alpha.
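
This expectation has the closed form E[log(theta_i)] = psi(alpha_i) - psi(sum_j alpha_j), where psi is the digamma function. A small standalone sketch of that formula using SciPy (for illustration only, not necessarily how the module computes it internally):

>>> import numpy
>>> from scipy.special import psi # digamma function
>>> alpha = numpy.array([0.1, 0.2, 0.7])
>>> print psi(alpha) - psi(alpha.sum()) # E[log(theta)] for theta ~ Dir(alpha)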