Topics and Transformations

Don’t forget to set

>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) # show events at INFO level and above

if you want to see logging events.

Transformation interface

In the previous tutorial on Corpora and Vector Spaces, we created a corpus of documents represented as a stream of vectors. To continue, let’s fire up gensim and use that corpus:

>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm')
>>> print corpus
MmCorpus(9 documents, 12 features, 28 non-zero entries)

In this tutorial, we will show how to transform documents from one vector representation into another. This process serves two goals:

  1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more realistic way.
  2. To make the document representation more compact. This both improves efficiency (new representation consumes less resources) and efficacy (marginal data trends are ignored, noise-reduction).

Creating a transformation

The transformations are standard Python objects, typically initialized by means of a training corpus:

>>> tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

We used our old corpus to initialize (train) the transformation model. Different transformations may require different initialization parameters; in the case of TfIdf, the “training” consists simply of going through the supplied corpus once and computing document frequencies of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is much more involved and, consequently, takes much more time.

Note

Transformations always convert between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failing to use the same input feature space, for example by applying different string preprocessing, using different feature ids, or feeding bag-of-words vectors where TfIdf vectors are expected, will result in feature mismatch during transformation calls, and consequently in garbage output, runtime exceptions, or both.

Transforming vectors

From now on, tfidf is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):

>>> doc_bow = [(0, 1), (1, 1)]
>>> print tfidf[doc_bow] # step 2 -- use the model to transform vectors
[(0, 0.70710678), (1, 0.70710678)]
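
For intuition, here is a rough sketch of the arithmetic behind this output (not part of the original tutorial). Both features happen to occur in the same number of training documents, so their raw tf-idf weights come out equal, and normalizing to unit Euclidean length then turns each of them into 1/sqrt(2) ≈ 0.7071. The document frequency of 2 below is an assumption for illustration; gensim's default idf weight is log base 2 of (number of documents / document frequency):

>>> from math import log, sqrt
>>> num_docs, df = 9, 2                            # df = 2 is assumed here, for illustration only
>>> raw = [1 * log(float(num_docs) / df, 2)] * 2   # tf * log2(N / df), identical for both features
>>> length = sqrt(sum(w * w for w in raw))         # Euclidean length of the raw vector
>>> print [round(w / length, 8) for w in raw]      # normalize to unit length
[0.70710678, 0.70710678]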

Or to apply a transformation to a whole corpus:

>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
>>>     print doc

In this particular case, we are transforming the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialized, it can be used on any vectors (provided they come from the correct vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA, and so on.
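
As a quick illustration (a sketch, not part of the original text), you can fold a completely unseen document into the tf-idf space, as long as it is preprocessed with the same dictionary from the previous tutorial:

>>> new_doc = "human computer interaction"
>>> new_bow = dictionary.doc2bow(new_doc.lower().split())  # same preprocessing, same feature ids
>>> print tfidf[new_bow]  # the unseen document, expressed as tf-idf weights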

Note

Calling model[corpus] only creates a wrapper around the old corpus document stream – actual conversions are done on-the-fly, during document iteration. Converting the whole corpus at the time of calling corpus2 = model[corpus] would mean storing the result in main memory, which contradicts gensim’s objective of memory-independence. If you will be iterating over the transformed corpus2 multiple times, and the transformation is costly, serialize the resulting corpus to disk first and continue using that.
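
A minimal sketch of that serialize-then-reuse pattern might look like the following (the file path is just an example, and MmCorpus.serialize is assumed here; older gensim versions call this helper MmCorpus.saveCorpus):

>>> corpora.MmCorpus.serialize('/tmp/corpus_tfidf.mm', corpus_tfidf)  # materialize the transformed corpus on disk, in Matrix Market format
>>> corpus_tfidf = corpora.MmCorpus('/tmp/corpus_tfidf.mm')  # from now on, iterate over the pre-computed vectors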

Transformations can also be applied one on top of another, in a sort of chain:

>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary.id2token, numTopics=2) # initialize an LSI transformation
>>> corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Here we transformed our Tf-Idf corpus via Latent Semantic Indexing into a latent 2-D space (2-D because we set numTopics=2). Now you’re probably wondering: what do these two latent dimensions stand for? Let’s inspect with models.LsiModel.printTopics():

>>> for topicNo in range(lsi.numTopics):
>>>     print 'topic %i: %s' % (topicNo, lsi.printTopic(topicNo))
topic 0: -0.703 * "trees" + -0.538 * "graph" + -0.402 * "minors" + -0.187 * "survey" + -0.061 * "system" + -0.060 * "time" + -0.060 * "response" + -0.058 * "user" + -0.049 * "computer" + -0.035 * "interface" + -0.035 * "eps" + -0.030 * "human"
topic 1: 0.460 * "system" + 0.373 * "user" + 0.332 * "eps" + 0.328 * "interface" + 0.320 * "time" + 0.320 * "response" + 0.293 * "computer" + 0.280 * "human" + 0.171 * "survey" + -0.161 * "trees" + -0.076 * "graph" + -0.029 * "minors"

It appears that according to LSI, “trees”, “graph” and “minors” are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic, while the remaining four documents relate more strongly to the first topic:

>>> for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
>>>     print doc
[(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)] # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)] # "Graph minors A survey"
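
As a small sanity check on that interpretation (a sketch, not part of the original tutorial), you can assign each document to whichever of the two topics carries the larger absolute weight:

>>> for docNo, doc in enumerate(corpus_lsi):
>>>     topicNo, weight = max(doc, key=lambda item: abs(item[1]))  # pick the dominant topic
>>>     print 'document %i -> topic %i (weight %.3f)' % (docNo, topicNo, weight)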

Model persistence is achieved with the save() and load() functions:

>>> lsi.save('/tmp/model.lsi') # same for tfidf, lda, ...
>>> lsi = models.LsiModel.load('/tmp/model.lsi')

The next question might be: just how exactly similar are those documents to each other? Is there a way to formalize the similarity, so that for a given input document, we can order some other set of documents according to their similarity? Similarity queries are covered in the next tutorial.

Available transformations

Gensim implements several popular Vector Space Model algorithms:

  • Term Frequency * Inverse Document Frequency, Tf-Idf expects a bag-of-words (integer values) training corpus during initialization. During transformation, it will take a vector and return another vector of the same dimensionality, except that features which were rare in the training corpus will have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length.

    >>> model = tfidfmodel.TfidfModel(bow_corpus, normalize=True)
    
  • Latent Semantic Indexing, LSI (or sometimes LSA) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of a lower dimensionality. For the toy corpus above we used only 2 latent dimensions, but on real corpora, a target dimensionality of 200–500 is recommended as a “gold standard” [1].

    >>> model = lsimodel.LsiModel(tfidf_corpus, id2word=dictionary.id2token, numTopics=300)
    

    LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meantime!

    >>> model.addDocuments(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
    >>> lsi_vec = model[tfidf_vec] # convert a new document into the LSI space, without affecting the model
    >>> ...
    >>> model.addDocuments(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
    >>> lsi_vec = model[tfidf_vec]
    >>> ...
    

    See the gensim.models.lsimodel documentation for details on how to make LSI gradually “forget” old observations in infinite streams and how to tweak parameters affecting speed vs. memory footprint vs. numerical precision of the algorithm.

    For a discussion of two-pass vs. one-pass LSA algorithms (both are available in gensim), look at Experiments on Wikipedia.

  • Random Projections, RP aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness. Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.

    >>> model = rpmodel.RpModel(tfidf_corpus, numTopics=500)
    
  • Latent Dirichlet Allocation, LDA is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is much slower than the other algorithms, so we are currently looking into ways of making it faster (see e.g. [2], [3]). If you could help, let us know!

    >>> model = ldamodel.LdaModel(bow_corpus, id2word=dictionary.id2token, numTopics=200)
    

Adding new VSM transformations (such as different weighting schemes) is rather trivial; see the API reference or the Python code directly for more info and examples.
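
As a rough illustration of what such an addition involves, here is a minimal sketch of a toy transformation, assuming it subclasses gensim.interfaces.TransformationABC (a real model should also detect whole-corpus input and wrap it lazily, the way the built-in models do):

>>> from gensim import interfaces, matutils
>>>
>>> class UnitLengthModel(interfaces.TransformationABC):
>>>     """Toy weighting scheme: rescale each vector to unit Euclidean length."""
>>>     def __init__(self, corpus=None):
>>>         pass  # nothing to train for this trivial scheme
>>>     def __getitem__(self, bow):
>>>         return matutils.unitvec(bow)  # works on sparse (id, weight) vectors
>>>
>>> model = UnitLengthModel()
>>> print model[[(0, 3.0), (1, 4.0)]]  # a vector of length 5.0, rescaled to length 1.0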

It is worth repeating that these are all unique, incremental implementations, which do not require the whole training corpus to be present in main memory all at once. With memory taken care of, we are now implementing Distributed Computing, to improve CPU efficiency, too. If you feel you could contribute, please let us know!


[1] Bradford, R. B., 2008. An empirical study of required dimensionality for large-scale latent semantic indexing applications.
[2] Asuncion, A., 2009. On Smoothing and Inference for Topic Models.
[3] Yao, L., Mimno, D., McCallum, A., 2009. Efficient Methods for Topic Model Inference on Streaming Document Collections.