What’s new?
Version 0.7 is out!
If you have a cluster of computers, the time needed to process a given corpus with our distributed LSA algorithm drops almost linearly with the number of machines. The option of incrementally adding new documents to an existing decomposition, without recomputing everything from scratch, carries over from the previous version. This means that your document input stream can even be infinite in size, with new documents arriving asynchronously.
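Below is a minimal sketch of what the incremental workflow could look like; the file paths are placeholders, and the addDocuments call (following this release's camelCase naming) is an assumption rather than something shown on this page.

>>> from gensim import corpora, models
>>>
>>> # build an initial LSI decomposition from the documents available now
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>>
>>> # later: fold a new batch of documents into the existing decomposition,
>>> # without recomputing anything from scratch (method name assumed, see note above)
>>> new_docs = corpora.MmCorpus('/path/to/new_docs.mm')
>>> lsi.addDocuments(new_docs)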
For an introduction to what gensim does (or does not do), go to the introduction.
To download and install gensim, consult the install page.
For examples of how to use it, try the tutorials.
>>> from gensim import corpora, models, similarities
>>>
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>>
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>>
>>> # perform a similarity query of another vector in LSI space against the whole corpus;
>>> # query_bow stands for a sparse bag-of-words vector prepared the same way as the corpus documents
>>> query = lsi[query_bow]
>>> sims = index[query]
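The result is one similarity score per document in the index. A quick, library-agnostic way to look at the best matches might be:

>>> # sort document ids by similarity score, best matches first, and show the top ten
>>> ranked = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print(ranked[:10])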