Metadata-Version: 2.4
Name: anchortopics
Version: 0.1.0
Summary: Fast, low-memory spectral topic modeling via on-the-fly rectification
Keywords: topic modeling,spectral methods,anchor words,natural language processing,nlp,machine learning
Author: David Mimno
Author-email: David Mimno <mimno@cornell.edu>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Python: >=3.11
Project-URL: Homepage, https://github.com/mimno/anchortopics
Project-URL: Repository, https://github.com/mimno/anchortopics
Project-URL: Paper, https://arxiv.org/abs/2111.06580
Description-Content-Type: text/markdown

# anchortopics

[![PyPI](https://img.shields.io/pypi/v/anchortopics.svg)](https://pypi.org/project/anchortopics/)
[Source code](https://github.com/mimno/anchortopics)

A fast, low-memory Python implementation of **On-the-Fly Rectification (OTFR)**
spectral topic modeling: Moontae Lee, Sungjun Cho, Kun Dong, David Mimno, and
David Bindel,
[*On-the-Fly Rectification for Robust Large-Vocabulary Topic Inference*](https://arxiv.org/abs/2111.06580)
(ICML 2021), building on the Joint-Stochastic Matrix Factorization /
Anchor Word framework (Arora et al. 2013; Lee et al. 2015,
[JSMF](https://github.com/moontae/JSMF) /
[pyJSMF-RAW](https://github.com/sc782/pyJSMF-RAW)).

Spectral ("anchor word") topic models recover topics from low-order word
co-occurrence statistics in one shot — no sampling, no EM, just an
eigendecomposition and some linear algebra. The catch is that empirical
co-occurrence matrices are noisy, indefinite, and dense once rectified, so
the classical pipeline costs `O(N^2)` space and `O(N^2 K)` time in the
vocabulary size `N`. OTFR avoids ever forming the `N x N` matrix: it
maintains the rectified co-occurrence as an implicit low-rank-plus-sparse
operator (`Y Y^T + E + r * 1 1^T`) and recovers anchor words directly from
the low-rank factor `Y`, bringing the cost down to `O(N K)` space and
`O(N K^2)` time.
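
To make the implicit representation concrete, here is a minimal sketch of a low-rank-plus-sparse operator that applies `Y Y^T + E + r * 1 1^T` to a vector without forming the `N x N` matrix. The helper name `implicit_operator` and the toy data are assumptions for illustration, not the package's API.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator


def implicit_operator(Y, E, r):
    """LinearOperator for Y @ Y.T + E + r * ones((N, N)), never formed densely.

    Y is a dense (N, K) factor, E a sparse symmetric (N, N) correction,
    and r a scalar. Hypothetical helper for illustration only.
    """
    n = Y.shape[0]

    def matvec(v):
        v = np.asarray(v).ravel()
        low_rank = Y @ (Y.T @ v)          # O(N K)
        sparse = E @ v                    # O(nnz(E))
        rank_one = r * v.sum() * np.ones(n)
        return low_rank + sparse + rank_one

    # The operator is symmetric, so rmatvec can reuse matvec.
    return LinearOperator((n, n), matvec=matvec, rmatvec=matvec, dtype=float)


# Toy data (assumed shapes and values, not drawn from any real corpus).
rng = np.random.default_rng(0)
N, K = 500, 5
Y = rng.standard_normal((N, K))
E = sp.random(N, N, density=0.01, random_state=0, format="csr")
E = (E + E.T) * 0.5                       # symmetrize the sparse part
r = 1e-3

op = implicit_operator(Y, E, r)
v = rng.standard_normal(N)
result = op.matvec(v)                     # costs O(N K + nnz(E)), not O(N^2)
```

Storage is the `N x K` factor plus the nonzeros of `E`, which is where the `O(N K)` space figure comes from when `E` stays sparse.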

## Install

```bash
uv add anchortopics   # or: pip install anchortopics
```

## Quickstart

```python
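# scikit-learn is only used to build the count matrix; install it separately.
from sklearn.feature_extraction.text import CountVectorizer
from anchortopics import OTFR

# `documents` is your corpus: an iterable of raw text strings.
vectorizer = CountVectorizer(max_features=20000, stop_words="english")
X = vectorizer.fit_transform(documents)  # (n_documents, n_vocab) sparse counts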

model = OTFR(n_topics=20).fit(X)

for k, words in enumerate(model.top_words(vectorizer.get_feature_names_out(), n_words=10)):
    print(k, words)
```

`model.components_` is the `(n_topics, n_vocab)` word-topic distribution
matrix (`p(word | topic)`), `model.topic_correlation_` is the `(n_topics,
n_topics)` topic-topic correlation matrix, and `model.anchors_` holds the
vocabulary indices selected as anchor words.

If you already have a co-occurrence matrix `C` (dense or sparse, `N x N`),
use `model.fit_cooccurrence(C)` instead of `fit(X)`.

For very large vocabularies/corpora, pass `randomized_init=True` to
initialize the rectification with a one-pass randomized eigendecomposition
computed directly from the word-document counts (Halko, Martinsson & Tropp
2011), rather than ARPACK.
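
For intuition, the sketch below shows the basic randomized eigendecomposition from Halko, Martinsson & Tropp, applied to `X^T X` through products with a sparse matrix `X`. It is the simpler multi-pass variant with power iterations, not necessarily the one-pass scheme that `randomized_init=True` uses. The function name, parameters, and toy matrix are assumptions.

```python
import numpy as np
import scipy.sparse as sp


def randomized_eigh(apply_A, n, k, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top-k eigenpairs of a symmetric PSD operator.

    apply_A maps an (n, m) block of vectors to A @ block. Hypothetical
    helper: a basic randomized range finder followed by Rayleigh-Ritz.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(apply_A(omega))                # basis for the range of A
    for _ in range(n_power_iter):                      # sharpen the basis
        Q, _ = np.linalg.qr(apply_A(Q))
    B = Q.T @ apply_A(Q)                               # small projected matrix
    B = (B + B.T) / 2                                  # remove rounding asymmetry
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:k]                      # largest eigenvalues first
    return w[top], Q @ V[:, top]


# Toy sparse matrix standing in for word-document counts (assumed data).
X = sp.random(2000, 300, density=0.02, random_state=0, format="csr")


def apply_A(block):
    # Applies X^T X without forming the 300 x 300 product.
    return X.T @ (X @ block)


w, U = randomized_eigh(apply_A, n=300, k=5)
```

Each call to `apply_A` is one pass over the nonzeros of `X`, so the number of passes is fixed in advance by `n_power_iter`.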

## How it works

1. **`CooccurrenceOperator`** (`anchortopics.cooccurrence`) builds the unbiased
   joint-stochastic co-occurrence estimator of Arora et al. as an implicit
   linear operator over a sparse word-document count matrix, with
   `O(nnz(X))` matrix-vector products — the dense matrix is never formed.
2. **`enn_rectify`** (`anchortopics.enn`) runs Epsilon Non-Negative (ENN)
   rectification: iteratively project toward the nearest joint-stochastic,
   rank-`K`, (epsilon-)non-negative matrix, representing the non-negativity
   correction as a sparse matrix `E` rather than a dense one.
3. **`law`** (`anchortopics.law`) runs the Low-rank Anchor Word algorithm: selects
   anchor words via column-pivoted QR performed in the `K`-dimensional
   compressed space (equivalent to pivoting on the full matrix, per Lemma 1
   of the paper, but `O(N K^2)` instead of `O(N^2 K)`), then recovers the
   word-topic matrix `B` and topic correlation matrix `A`.
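
Step 1 can be illustrated with a small sketch. It assumes the standard unbiased estimator from Arora et al., in which each document `d` of length `n_d` contributes `(x_d x_d^T - diag(x_d)) / (n_d (n_d - 1))` and the result is averaged over documents. `cooccurrence_matvec` is a hypothetical helper, not the package's `CooccurrenceOperator`.

```python
import numpy as np
import scipy.sparse as sp


def cooccurrence_matvec(X, v):
    """Apply the document-averaged unbiased co-occurrence estimate to v.

    X is a (n_documents, n_vocab) count matrix. The N x N matrix is never
    built; the cost is proportional to nnz(X). Illustrative sketch only.
    """
    X = sp.csr_matrix(X, dtype=float)
    n_d = np.asarray(X.sum(axis=1)).ravel()        # document lengths
    keep = n_d >= 2                                # need at least one word pair
    X, n_d = X[keep], n_d[keep]
    w = 1.0 / (n_d * (n_d - 1.0))                  # per-document weight
    Xw = sp.diags(w) @ X                           # rows of X scaled by w_d
    diag = np.asarray(Xw.sum(axis=0)).ravel()      # sum_d w_d * x_d
    # (1/M) * [X^T diag(w) X v - diag(sum_d w_d x_d) v]
    return (X.T @ (Xw @ v) - diag * v) / X.shape[0]


# Toy counts (assumed data): 50 documents over a 30-word vocabulary.
rng = np.random.default_rng(0)
X = sp.csr_matrix(rng.poisson(0.3, size=(50, 30)))
v = rng.standard_normal(30)
Cv = cooccurrence_matvec(X, v)

# Joint-stochastic check: all entries of the implicit matrix sum to 1.
ones = np.ones(30)
total = ones @ cooccurrence_matvec(X, ones)
```

Subtracting the diagonal term removes each word's co-occurrence with itself at the same position, which is what makes the estimate unbiased for the pairwise joint distribution.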

`anchortopics.model.OTFR` wires these together behind a small, sklearn-style `fit`
API.
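
The anchor-selection idea in step 3 can be sketched with SciPy's column-pivoted QR. The example shows one way a compressed pivoting can match pivoting on the full matrix: if `Y = Q R` is a thin QR, then `Y Y^T = Q (R Y^T)`, and because `Q` has orthonormal columns, the `K x N` matrix `R Y^T` has the same column geometry as `Y Y^T`. This is an illustrative construction under that assumption. The paper's Lemma 1 may be stated differently, and any row normalization the real pipeline applies is left out. `select_anchors_compressed` is a hypothetical name.

```python
import numpy as np
from scipy.linalg import qr


def select_anchors_compressed(Y, n_anchors):
    """Choose anchor indices by column-pivoted QR on a K x N matrix.

    Illustrative sketch: pivots on Z = R @ Y.T, where Y = Q R, instead of
    the N x N product Y @ Y.T. Cost is O(N K^2).
    """
    _, R = np.linalg.qr(Y)              # thin QR: R is (K, K)
    Z = R @ Y.T                         # (K, N), satisfies Y Y^T = Q Z
    _, _, piv = qr(Z, mode="economic", pivoting=True)
    return piv[:n_anchors]


# Toy low-rank factor (assumed data).
rng = np.random.default_rng(0)
N, K = 300, 6
Y = rng.standard_normal((N, K))
anchors = select_anchors_compressed(Y, K)

# Reference: the same pivoting on the dense N x N matrix, O(N^2 K).
_, _, piv_full = qr(Y @ Y.T, mode="economic", pivoting=True)
same_pivots = list(anchors) == list(piv_full[:K])
```

Only the first `K` pivots are compared, since `Y Y^T` has rank `K` and later pivots on the dense matrix are decided by rounding noise.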

## Diagnostics

`anchortopics.diagnostics` computes everything below from the fitted model's own
attributes — no held-out data, human judgments, or extra passes over the
corpus required:

```python
from anchortopics import diagnostics

vocab = vectorizer.get_feature_names_out()
report = diagnostics.summary(model, vocabulary=vocab)
print(report["specificity"], report["dissimilarity"])
for c in report["stopword_candidates"]:
    print(c["word"], c["topic_entropy"])
```

- **`stopword_candidates`**: ranks frequent words whose posterior over
  topics is close to uniform (high entropy) as candidates to add to a
  stoplist. On a real abstracts corpus this reliably surfaces academic
  boilerplate ("furthermore", "state-of-the-art", "leveraging", "address")
  rather than topical words.
- **`relative_approximation`**: `‖C − BAB^T‖_F / ‖C‖_F` against the
  *original* (unrectified) co-occurrence, where `B` is the `N x K`
  word-topic matrix (the transpose of `model.components_`). It is estimated
  in `O(NK)` via a Hutchinson trace estimator, without ever materializing
  the dense matrix.
- **`relative_recovery`**: how well the selected anchors reconstruct the
  rest of the normalized co-occurrence space.
- **`relative_dominancy`**: how concentrated the topic-correlation matrix
  `A` is on its diagonal (independence) vs. off-diagonal (correlation).
- **`specificity`**: average KL divergence of each topic from the corpus
  unigram distribution; low values flag generic, high-frequency-word-driven
  topics.
- **`dissimilarity`**: fraction of each topic's top words that don't
  recur in any other topic's top words; low values flag redundant topics.
- **`eigengap`**: relative gap between the K-th and (K+1)-th eigenvalues of
  the rectified spectrum; a shrinking gap as K grows is a sign you're
  past the number of topics the data actually supports.
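
Several of the diagnostics above reduce to a few lines of NumPy. The sketch below gives plausible definitions for the topic entropy behind `stopword_candidates`, for `specificity`, for `dissimilarity`, and for a Hutchinson-style estimate of the relative approximation error. These are assumptions written from the descriptions above, not the code in `anchortopics.diagnostics`. The function signatures and toy numbers are hypothetical.

```python
import numpy as np


def topic_entropy(B, topic_weights):
    """Entropy of p(topic | word) per word; B is (K, N), rows p(word | topic)."""
    joint = B * topic_weights[:, None]                 # p(word, topic)
    post = joint / joint.sum(axis=0, keepdims=True)    # p(topic | word)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(post > 0, post * np.log(post), 0.0)
    return -terms.sum(axis=0)


def specificity(B, topic_weights):
    """Mean KL divergence of each topic from the unigram mixture."""
    unigram = topic_weights @ B
    with np.errstate(divide="ignore", invalid="ignore"):
        kl = np.where(B > 0, B * np.log(B / unigram), 0.0).sum(axis=1)
    return kl.mean()


def dissimilarity(B, n_words=10):
    """Per-topic fraction of top words absent from every other topic's top words."""
    top = [set(np.argsort(row)[::-1][:n_words].tolist()) for row in B]
    out = []
    for k, words in enumerate(top):
        others = set().union(*(t for j, t in enumerate(top) if j != k))
        out.append(len(words - others) / n_words)
    return np.array(out)


def hutchinson_relative_error(apply_C, B, A, n_probes=256, seed=0):
    """Estimate ||C - B^T A B||_F / ||C||_F using only matrix products."""
    rng = np.random.default_rng(seed)
    Z = rng.choice([-1.0, 1.0], size=(B.shape[1], n_probes))  # Rademacher probes
    CZ = apply_C(Z)
    RZ = CZ - B.T @ (A @ (B @ Z))          # residual applied to the probes
    return np.sqrt(np.sum(RZ**2) / np.sum(CZ**2))


# Toy model (assumed numbers): 2 topics, 5 words, word 4 shared equally.
B = np.array([[0.40, 0.30, 0.05, 0.05, 0.20],
              [0.05, 0.05, 0.40, 0.30, 0.20]])
weights = np.array([0.5, 0.5])
A = np.array([[0.4, 0.1], [0.1, 0.4]])
rng = np.random.default_rng(1)
noise = 1e-3 * rng.standard_normal((5, 5))
C = B.T @ A @ B + (noise + noise.T) / 2    # dense only because the toy is tiny

H = topic_entropy(B, weights)              # word 4 has the highest entropy
spec = specificity(B, weights)
dis = dissimilarity(B, n_words=3)
est = hutchinson_relative_error(lambda Z: C @ Z, B, A, n_probes=2000)
```

The Hutchinson estimate relies on `E[‖R z‖^2] = ‖R‖_F^2` for random sign vectors `z`, so in a real setting `apply_C` would be an implicit operator rather than a dense product.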

These diagnostics catch degenerate cases (redundant or generic topics, over-large `K`)
rather than confirming topics are meaningful — they're a complement to,
not a replacement for, reading the topics. (Not implemented: the paper's
MST-Incoherence metric, which needs an NPMI graph over prominent +
"characteristic" words per topic — a reasonable follow-up.)

## Development

```bash
uv sync
uv run pytest
uv run python examples/synthetic_example.py
```
