Gensim – Python Framework for Vector Space Modelling

What’s new?

Version 0.7 is out!

  • Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are now faster and consume less memory.
  • Optimizations to vocabulary generation.
  • The input corpus iterator can now read from a compressed file (bzip2, gzip, ...), which saves disk space when dealing with very large corpora.
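
As an illustration of the compressed-input point above, here is a minimal sketch of a corpus iterator that streams documents from a bzip2 or gzip file. It relies only on the Python standard library and on the convention that a corpus is any iterable yielding documents as lists of (token id, count) pairs. The names `open_maybe_compressed` and `CompressedTextCorpus`, the one-document-per-line layout, and the plain `token2id` dictionary are assumptions made for this example, not part of gensim.

```python
import bz2
import gzip
from collections import Counter


def open_maybe_compressed(path):
    """Open a text file, decompressing on the fly for .bz2 and .gz paths."""
    if path.endswith('.bz2'):
        return bz2.open(path, 'rt', encoding='utf-8')
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', encoding='utf-8')
    return open(path, 'r', encoding='utf-8')


class CompressedTextCorpus(object):
    """Hypothetical streaming corpus: one document per line, whitespace tokens."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id  # plain dict mapping token -> integer id

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly;
        # only one line is held in memory at a time.
        with open_maybe_compressed(self.path) as lines:
            for line in lines:
                ids = (self.token2id[tok] for tok in line.lower().split()
                       if tok in self.token2id)
                # Sparse bag-of-words vector: sorted (token id, count) pairs
                yield sorted(Counter(ids).items())
```

Because the file is decompressed while it is read, the corpus never has to exist uncompressed on disk, at the cost of some extra CPU time on each pass.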

gensim now completes LSI of the English Wikipedia (3.2 million documents) in 5 hours 14 minutes, using a one-pass SVD algorithm, on a single MacBook Pro laptop. Be sure to check out the distributed mode, too.

For an overview of what gensim does (or does not do), go to the introduction.

To download and install gensim, consult the install page.

For examples of how to use it, try the tutorials.

Quick Reference Example

>>> from gensim import corpora, models, similarities
>>>
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>>
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>>
>>> # perform a similarity query of another vector (`query`, already in LSI space) against the whole corpus
>>> sims = index[query]
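
The final query yields one similarity score per corpus document, and `MatrixSimilarity` scores documents by cosine similarity. The pure-Python sketch below shows that measure on small dense vectors. It is an illustration only, not gensim's implementation: `cosine_similarities` is a hypothetical helper, and returning 0.0 for zero-length vectors is a convention chosen for this example.

```python
import math


def cosine_similarities(query, documents):
    """Cosine similarity of a dense query vector against each dense document vector."""
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    q_norm = norm(query)
    sims = []
    for doc in documents:
        d_norm = norm(doc)
        if q_norm == 0.0 or d_norm == 0.0:
            sims.append(0.0)  # convention chosen here for zero vectors
            continue
        dot = sum(q * d for q, d in zip(query, doc))
        sims.append(dot / (q_norm * d_norm))
    return sims
```

A score near 1 means the document points in almost the same direction as the query in the latent space, while a score near 0 means the two are close to orthogonal.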