Gensim is a Python framework that aims to make converting natural language texts to the Vector Space Model as simple and natural as possible.
Gensim contains algorithms for unsupervised learning from raw, unstructured digital texts, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections. These algorithms discover hidden (latent) corpus structure. Once found, documents can be succinctly expressed in terms of this structure, queried for topical similarity and so on.
If the previous paragraphs left you confused, you can read more about the Vector Space Model and unsupervised document analysis on Wikipedia.
Note
Gensim’s target audience is the NLP research community and the interested general public; gensim is not meant to be a production tool for commercial environments.
gensim includes the following features:
The creation of gensim was motivated by a perceived lack of available, scalable software frameworks that implement topic modelling, and by the overwhelming internal complexity of those that do exist (hail Java!). You can read more about the motivation in our LREC 2010 workshop paper. If you want to cite gensim in your own work, please refer to that article.
The principal design objectives behind gensim are:
Gensim is licensed under the OSI-approved GNU LGPL license and can be downloaded either from its SVN repository or from the Python Package Index.
See also
See the install page for more info on package deployment.
The whole gensim package revolves around the concepts of corpus, vector and model.
In the Vector Space Model (VSM), each document is represented by an array of features. For example, a single feature may be thought of as a question-answer pair:
The question is usually represented only by its integer id, so that the representation of a document becomes a series of pairs like (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we may leave them implicit and simply write (0.0, 2.0, 5.0). This sequence of answers can be thought of as a high-dimensional (in our case 3-dimensional) vector. For practical purposes, only questions to which the answer is (or can be converted to) a single real number are allowed.
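The two notations above, explicit (id, answer) pairs and an implicit sequence of answers, can be sketched in plain Python. This is only an illustration, not gensim's own API: the helper names `sparse_to_dense` and `dense_to_sparse` are hypothetical, and the ids are assumed to be 1-based as in the example pairs.

```python
def sparse_to_dense(pairs, num_features):
    """Expand (question_id, answer) pairs into a plain list of answers.

    Assumes 1-based question ids, as in the pairs (1, 0.0), (2, 2.0), (3, 5.0).
    Questions that do not appear in `pairs` get the answer 0.0.
    """
    dense = [0.0] * num_features
    for question_id, answer in pairs:
        dense[question_id - 1] = float(answer)
    return dense


def dense_to_sparse(dense):
    """Keep only the non-zero answers, together with their 1-based question ids."""
    return [(i + 1, value) for i, value in enumerate(dense) if value != 0.0]


# The document from the text, written as (question_id, answer) pairs.
document = [(1, 0.0), (2, 2.0), (3, 5.0)]

# With all three questions known in advance, the ids can be left implicit.
dense = sparse_to_dense(document, num_features=3)
print(dense)                   # [0.0, 2.0, 5.0]

# Going back, zero answers can be dropped without losing information.
print(dense_to_sparse(dense))  # [(2, 2.0), (3, 5.0)]
```

Dropping the zero answers is what makes the pair notation attractive when there are many questions and most answers are zero.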
The questions are the same for each document, so that by looking at two vectors (representing two documents), we will hopefully be able to draw conclusions such as “The numbers in these two vectors are very similar, and therefore the original documents must be similar, too”. Of course, whether such conclusions correspond to reality depends on how well we picked our questions.
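One way to make "the numbers in these two vectors are very similar" concrete is cosine similarity, which compares the direction of two vectors and ignores their length. The text above does not prescribe a particular measure, so treat this as a minimal sketch under that assumption; the function name and the three example documents are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors.

    Returns 0.0 if either vector is all zeros, since the angle is undefined.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, each answering the same three questions.
doc_a = [0.0, 2.0, 5.0]
doc_b = [0.0, 4.0, 10.0]  # same proportions as doc_a, twice as large
doc_c = [3.0, 0.0, 0.0]   # non-zero only where doc_a is zero

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0.0: no overlap at all
```

Whether a high score means the documents are truly similar still depends on how well the questions were chosen.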
See also
For some examples of how this works out in code, go to the tutorials.