corpora.bleicorpus – Corpus in Blei’s LDA-C format

Blei’s LDA-C format.

class gensim.corpora.bleicorpus.BleiCorpus(fname, fnameVocab=None)

Corpus in Blei’s LDA-C format.

The corpus is represented as two files: one describing the documents, and another describing the mapping between words and their ids.

Each document is one line:

N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN

The vocabulary is a file with words, one word per line; word at line K has an implicit id=K.
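The two file layouts above can be illustrated with a minimal plain-Python sketch. The helpers `parse_ldac_line` and `read_vocab` are hypothetical names introduced for illustration and are not part of gensim. The sketch assumes ids are counted from zero and that field values may be parsed as floats.

```python
def parse_ldac_line(line):
    """Parse one document line into a list of (word_id, value) pairs."""
    parts = line.split()
    declared = int(parts[0])  # N: number of id:value fields that follow
    fields = parts[1:]
    if declared != len(fields):
        raise ValueError("expected %d fields, found %d" % (declared, len(fields)))
    doc = []
    for field in fields:
        word_id, value = field.split(":")
        doc.append((int(word_id), float(value)))
    return doc


def read_vocab(lines):
    """Give each word the implicit id of its line (assumption: counting from 0)."""
    return {word_id: line.strip() for word_id, line in enumerate(lines)}


doc = parse_ldac_line("3 0:2 4:1 7:5")
vocab = read_vocab(["apple\n", "banana\n"])
```

Checking the declared count N against the number of fields actually present is a design choice of this sketch; it catches truncated lines early.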

Initialize the corpus from a file.

fnameVocab is the file with vocabulary; if not specified, it defaults to fname.vocab.

classmethod load(fname)

Load a previously saved object from file (also see save).

save(fname)

Save the object to file via pickling (also see load).
static saveCorpus(fname, corpus, id2word=None)

Save a corpus in the LDA-C format.

There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
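As a rough illustration of the two-file output, the following plain-Python sketch writes a small corpus to fname and fname.vocab. It is not gensim's implementation: `write_ldac` is a hypothetical helper, the corpus is assumed to be a list of documents given as (word_id, count) pairs with integer counts, and `id2word` is assumed to be a dict covering ids 0 to len(id2word) - 1.

```python
import os
import tempfile


def write_ldac(fname, corpus, id2word):
    """Write documents to `fname` and the vocabulary to `fname + '.vocab'`."""
    with open(fname, "w", encoding="utf-8") as out:
        for doc in corpus:
            fields = " ".join("%d:%d" % (word_id, count) for word_id, count in doc)
            # One document per line: field count first, then the id:value fields.
            out.write(("%d %s" % (len(doc), fields)).rstrip() + "\n")
    with open(fname + ".vocab", "w", encoding="utf-8") as out:
        # One word per line, so the line position carries the implicit id.
        for word_id in range(len(id2word)):
            out.write(id2word[word_id] + "\n")


corpus = [[(0, 2), (2, 1)], [(1, 3)]]
id2word = {0: "apple", 1: "banana", 2: "cherry"}
path = os.path.join(tempfile.mkdtemp(), "corpus.lda-c")
write_ldac(path, corpus, id2word)

with open(path, encoding="utf-8") as f:
    doc_lines = f.read().splitlines()
with open(path + ".vocab", encoding="utf-8") as f:
    vocab_lines = f.read().splitlines()
```

Writing the vocabulary in id order keeps the implicit line-to-id mapping consistent with the ids used in the document file.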
