models.ldamodel – Latent Dirichlet Allocation

This module encapsulates functionality for the Latent Dirichlet Allocation algorithm.

It allows both model estimation from a training corpus and inference of topic distribution on new, unseen documents.

The core estimation code is directly adapted from the onlineldavb.py script by M. Hoffman [1]; see Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.

The algorithm:

  • is streamed: training documents come in sequentially, no random access (see the streaming sketch below),
  • runs in constant memory w.r.t. the number of documents: size of the training corpus does not affect memory footprint, and
  • is distributed: makes use of a cluster of machines, if available, to speed up model estimation.
[1] http://www.cs.princeton.edu/~mdhoffma
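
Because training is streamed, the corpus only needs to be an iterable that yields documents in bag-of-words format; it never has to reside in memory all at once. A minimal sketch, assuming a word-id mapping dictionary has already been built (see id2word below) and mycorpus.txt is a hypothetical file with one document per line:

>>> class MyCorpus(object):  # hypothetical streamed corpus
...     def __iter__(self):
...         for line in open('mycorpus.txt'):
...             yield dictionary.doc2bow(line.lower().split())
>>> lda = LdaModel(MyCorpus(), id2word=dictionary, num_topics=10)
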
class gensim.models.ldamodel.LdaModel(corpus=None, num_topics=100, id2word=None, distributed=False, chunksize=2000, passes=1, update_every=1, alpha=None, eta=None, decay=0.5)

The constructor estimates Latent Dirichlet Allocation model parameters based on a training corpus:

>>> lda = LdaModel(corpus, num_topics=10)

You can then infer topic distributions on new, unseen documents, with

>>> doc_lda = lda[doc_bow]

The model can be updated (trained) with new documents via

>>> lda.update(other_corpus)

Model persistence is achieved through its load/save methods.
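
For example (the file name is only a placeholder):

>>> lda.save('/tmp/model.lda')
>>> lda = LdaModel.load('/tmp/model.lda')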

num_topics is the number of requested latent topics to be extracted from the training corpus.

id2word is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
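
A common way to build such a mapping is gensim's corpora.Dictionary; the toy documents below are purely illustrative:

>>> from gensim import corpora
>>> texts = [['human', 'computer', 'interaction'], ['graph', 'trees']]
>>> id2word = corpora.Dictionary(texts)
>>> corpus = [id2word.doc2bow(text) for text in texts]
>>> lda = LdaModel(corpus, id2word=id2word, num_topics=2)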

alpha and eta are hyperparameters that affect sparsity of the document-topic (theta) and topic-word (lambda) distributions. Both default to a symmetric 1.0/num_topics (but can be set to a vector, for asymmetric priors).
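
For instance, an asymmetric document-topic prior could be requested with (the weights, one per topic, are chosen arbitrarily):

>>> lda = LdaModel(corpus, num_topics=3, alpha=[0.05, 0.05, 0.9])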

Turn on distributed to force distributed computing (see the web tutorial on how to set up a cluster of machines for gensim).
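
Assuming the cluster of workers from that tutorial is already up and running, distributed estimation is requested with a single flag:

>>> lda = LdaModel(corpus, num_topics=100, distributed=True)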

Example:

>>> lda = LdaModel(corpus, num_topics=100)
>>> print lda[doc_bow] # get topic probability distribution for a document
>>> lda.update(corpus2) # update the LDA model with additional documents
>>> print lda[doc_bow]
bound(corpus, gamma=None)

Estimate the variational bound of documents from corpus.

gamma are the variational parameters on topic weights (one for each document in corpus). If not supplied, will be automatically inferred from the model.
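
A sketch of using the bound to track training progress (higher, i.e. less negative, values indicate a better fit):

>>> score = lda.bound(corpus)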

clear()

Clear model state (free up some memory). Used in the distributed algorithm.

do_estep(chunk, state=None)

Perform inference on a chunk of documents, and accumulate the collected sufficient statistics in state (or self.state if None).

do_mstep(rho, other)

M step: use linear interpolation between the existing topics and collected sufficient statistics in other to update the topics.

inference(chunk, collect_sstats=False)

Given a chunk of sparse document vectors, estimate gamma (parameters controlling the topic weights) for each document in the chunk.

This function does not modify the model (it is read-only, i.e. const). The whole input chunk of documents is assumed to fit in RAM; chunking of a large corpus must be done earlier in the pipeline.

If collect_sstats is True, also collect sufficient statistics needed to update the model’s topic-word distributions, and return a 2-tuple (gamma, sstats). Otherwise, return (gamma, None). gamma is of shape len(chunk) x topics.
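
A minimal sketch, assuming chunk is a list of bag-of-words documents small enough to fit in RAM:

>>> gamma, sstats = lda.inference(chunk)  # sstats is None, since collect_sstats defaults to False
>>> gamma.shape  # (len(chunk), num_topics)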

classmethod load(fname)

Load a previously saved object from file (also see save).

save(fname)

Save the object to file via pickling (also see load).

show_topics(topics=10, topn=10, log=False, formatted=True)

Print the topn most probable words for topics randomly selected topics (i.e. topics gives the number of topics to show, topn the number of words per topic). Set topics=-1 to print all topics.

Unlike LSA, there is no natural ordering between the topics in LDA. The printed subset of at most self.num_topics topics is therefore arbitrary and may change between two LDA training runs.
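
For example, assuming that with the default formatted=True each selected topic is returned as a formatted string:

>>> for topic in lda.show_topics(topics=3, topn=5):
...     print topic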

update(corpus, chunksize=None, decay=None, passes=None, update_every=None)

Train the model with new documents, by EM-iterating over corpus until the topics converge (or until the maximum number of allowed iterations is reached).

In distributed mode, the E step is distributed over a cluster of machines.

This update also supports updating an already trained model (self) with new documents from corpus; the two models are then merged in proportion to the number of old vs. new documents. This feature is still experimental for non-stationary input streams.

For stationary input (no topic drift in new documents), on the other hand, this equals the online update of Hoffman et al. and is guaranteed to converge for any decay in (0.5, 1.0].
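
For example (parameter values are chosen only for illustration):

>>> lda.update(other_corpus, chunksize=1000, passes=2)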

class gensim.models.ldamodel.LdaState(eta, shape)

Encapsulate information for distributed computation of LdaModel objects.

Objects of this class are sent over the network, so try to keep them lean to reduce traffic.

blend(rhot, other, targetsize=None)

Given LdaState other, merge it with the current state. Stretch both to targetsize documents before merging, so that they are of comparable magnitude.

Merging is done by weighted averaging: in the extremes, rhot=0.0 means other is completely ignored; rhot=1.0 means self is completely ignored.

This procedure corresponds to the stochastic gradient update from Hoffman et al., algorithm 2 (eq. 14).
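
In terms of the state's sufficient statistics the weighting amounts to (a sketch only; sstats and numdocs are the attribute names assumed here, and both terms are first rescaled to targetsize documents):

  self.sstats = (1.0 - rhot) * (targetsize / self.numdocs) * self.sstats
                + rhot * (targetsize / other.numdocs) * other.sstats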

blend2(rhot, other, targetsize=None)

An alternative, simpler blend.

classmethod load(fname)

Load a previously saved object from file (also see save).

merge(other)

Merge the result of an E step from one node with that of another node (summing up sufficient statistics).

The merging is trivial and after merging all cluster nodes, we have the exact same result as if the computation was run on a single node (no approximation).

reset()

Prepare the state for a new EM iteration (reset sufficient stats).

save(fname)

Save the object to file via pickling (also see load).

gensim.models.ldamodel.dirichlet_expectation(alpha)

For a vector theta~Dir(alpha), compute E[log(theta)].
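
Componentwise, this is psi(alpha_k) - psi(sum_j alpha_j), where psi is the digamma function. A minimal numerical sketch, assuming NumPy and SciPy are installed:

>>> import numpy
>>> from scipy.special import psi  # digamma
>>> alpha = numpy.array([0.1, 0.2, 0.7])
>>> expected = psi(alpha) - psi(alpha.sum())  # elementwise E[log(theta_k)]
>>> dirichlet_expectation(alpha)  # should match `expected`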