Metadata-Version: 2.4
Name: topic-stability
Version: 0.1.0
Summary: Measure and visualize topic model stability across multiple runs
Project-URL: Homepage, https://github.com/mimno/TopicStability
Project-URL: Repository, https://github.com/mimno/TopicStability
Project-URL: Issues, https://github.com/mimno/TopicStability/issues
Author-email: David Mimno <mimno@cornell.edu>
License: MIT License
        
        Copyright (c) 2026 mimno
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: BERTopic,LDA,NLP,stability,text analysis,topic modeling
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: numpy
Requires-Dist: scipy
Provides-Extra: all
Requires-Dist: matplotlib; extra == 'all'
Requires-Dist: sentence-transformers; extra == 'all'
Requires-Dist: umap-learn; extra == 'all'
Provides-Extra: embed
Requires-Dist: sentence-transformers; extra == 'embed'
Provides-Extra: umap
Requires-Dist: umap-learn; extra == 'umap'
Provides-Extra: viz
Requires-Dist: matplotlib; extra == 'viz'
Description-Content-Type: text/markdown

# topic-stability

Measure and visualize the stability of topic models across multiple runs.

Topic models are stochastic: two runs with the same settings generally produce topics in a different order and with different index labels. **topic-stability** aligns topics across runs using sentence-embedding centroids, then scores each topic by how consistently the same documents are assigned to it, measured with Jensen-Shannon divergence. The result is a per-topic stability score in [0, 1] and a small-multiples UMAP visualization with the stability score annotated on each panel.

Works with any topic model that produces a document-topic matrix, including LDA, NMF, and BERTopic.

## Install

```bash
pip install topic-stability                      # core (numpy + scipy only)
pip install "topic-stability[embed]"             # + sentence-transformers
pip install "topic-stability[umap,viz]"          # + UMAP + matplotlib
pip install "topic-stability[all]"               # everything
```

## Quick start

### sklearn (LDA, NMF, …)

```python
from sklearn.decomposition import LatentDirichletAllocation
from topic_stability import TopicRun, StabilityAnalysis, DocumentEmbedder

# Embed documents once and cache to disk
embedder = DocumentEmbedder(cache_path="embeddings.npy")
embeddings = embedder.embed(texts, ids=doc_ids)

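# Train several runs, one per random seed.
# X is assumed to be a document-term matrix built from `texts` (e.g. with CountVectorizer).
runs = [TopicRun.from_sklearn(
            LatentDirichletAllocation(n_components=20, random_state=seed).fit(X), X
        ) for seed in range(5)]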

analysis = StabilityAnalysis(runs, embeddings=embeddings)
analysis.align()

print(analysis.topic_stability())   # array of shape (n_topics,)
print(analysis.overall_stability()) # scalar

analysis.visualize("topics.png")    # requires topic-stability[umap,viz]
```

### Pass precomputed embeddings (e.g. from BERTopic)

```python
from topic_stability.integrations.bertopic import from_bertopic

run, embeddings = from_bertopic(model, embeddings=precomputed_embeddings)
```

See [BERTopic notes](#bertopic) below for important differences.

### From files (Mallet / CSV pipeline)

```python
from topic_stability import TopicRun, StabilityAnalysis, DocumentEmbedder

runs = [
    TopicRun.from_csv(
        f"model_42_run{i}/doc_topic_avg.csv",
        word_topic_path=f"model_42_run{i}/word_topic_avg.csv",
    )
    for i in range(1, 6)
]

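# Load embeddings cached by an earlier embedder.embed(...) call
embedder = DocumentEmbedder(cache_path="embeddings.npy")
embeddings, _ = embedder.load()

analysis = StabilityAnalysis(runs, embeddings=embeddings)
analysis.align()
# precomputed_umap: 2D coordinates computed beforehand, assumed shape (n_docs, 2)
analysis.visualize("topics.png", umap_coords=precomputed_umap)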
```

## API

### `TopicRun`

One run's topic distributions.

| Constructor | Use when |
|---|---|
| `TopicRun.from_matrix(doc_topic, *, doc_ids, word_topic, vocab)` | You have numpy arrays |
| `TopicRun.from_sklearn(model, X, *, doc_ids, vocab)` | You have a fitted sklearn model with a `transform()` method |
| `TopicRun.from_csv(doc_topic_path, *, word_topic_path)` | You have CSV files from the CLI pipeline |
| `TopicRun.from_mallet_states(model_dir, *, iterations, tsv_path)` | You have Mallet `.gz` state files |

### `DocumentEmbedder`

```python
embedder = DocumentEmbedder(model="all-MiniLM-L6-v2", cache_path="embeddings.npy")
embeddings = embedder.embed(texts, ids=doc_ids)  # computes and caches
embeddings, ids = embedder.load()                # load from cache
```

Pass the returned array directly to `StabilityAnalysis(runs, embeddings=embeddings)`.

### `StabilityAnalysis`

```python
analysis = StabilityAnalysis(runs, embeddings, *, doc_ids=None)
analysis.align(reference=0)         # must call before scoring
analysis.topic_stability()          # ndarray (n_topics,) in [0, 1]
analysis.overall_stability()        # float
analysis.umap_projection(**kwargs)  # ndarray (n_docs, 2)
analysis.visualize(path, *, reference_run=0, umap_coords=None)
```

**Alignment** uses cosine similarity of per-topic embedding centroids
(`centroid_k = Σ_d θ_dk · e_d`, normalised) matched with the Hungarian
algorithm. No shared vocabulary is required, so runs from different model
types can be compared.
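
The alignment step can be sketched with numpy and scipy alone. This is an illustrative reconstruction from the description above, not the package's own code; the helper names `topic_centroids` and `align_to_reference` are hypothetical, and the synthetic data only demonstrates that a shuffled run is mapped back to the reference order.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def topic_centroids(doc_topic, embeddings):
    """Return unit-length topic centroids, shape (n_topics, dim).

    centroid_k = sum_d theta[d, k] * e[d], then L2-normalised.
    """
    centroids = doc_topic.T @ embeddings
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids / np.clip(norms, 1e-12, None)


def align_to_reference(ref_doc_topic, other_doc_topic, embeddings):
    """For each reference topic, return the index of its match in the other run."""
    ref = topic_centroids(ref_doc_topic, embeddings)
    other = topic_centroids(other_doc_topic, embeddings)
    similarity = ref @ other.T  # cosine similarity, since rows are unit length
    # linear_sum_assignment minimises cost, so negate to maximise similarity.
    ref_idx, other_idx = linear_sum_assignment(-similarity)
    return other_idx[np.argsort(ref_idx)]


# Synthetic check: the second run is the first with its topic columns shuffled.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 8))
theta_ref = rng.dirichlet(np.ones(4), size=50)
permutation = np.array([2, 0, 3, 1])
theta_other = theta_ref[:, permutation]

mapping = align_to_reference(theta_ref, theta_other, embeddings)
theta_aligned = theta_other[:, mapping]  # columns now follow the reference order
```

Because matching happens in embedding space, the two matrices only need to cover the same documents in the same row order; their vocabularies never enter the computation.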

**Stability score** for topic k: mean pairwise `1 − JS(p, q)` where p and
q are the normalised document-profile columns `θ[:,k]` (treated as a
distribution over documents) from each pair of aligned runs.
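
As a rough illustration of the score, the sketch below computes the mean pairwise `1 − JS` per topic from already-aligned document-topic matrices. It assumes base-2 logarithms so that the divergence is bounded by 1, and that every topic column has positive total mass; the function names are hypothetical and the package may handle edge cases differently.

```python
from itertools import combinations

import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topic_stability(aligned_runs):
    """Mean pairwise 1 - JS for each topic across aligned doc-topic matrices."""
    n_topics = aligned_runs[0].shape[1]
    scores = np.zeros(n_topics)
    pairs = list(combinations(aligned_runs, 2))
    for a, b in pairs:
        for k in range(n_topics):
            scores[k] += 1.0 - js_divergence(a[:, k], b[:, k])
    return scores / len(pairs)


# Toy matrices (4 documents, 2 topics), chosen for their columns only.
# Topic 0 covers the same documents in both runs; topic 1 covers disjoint ones.
run_a = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
run_b = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

scores = topic_stability([run_a, run_b])  # topic 0 scores 1, topic 1 scores 0
```

Identical document profiles give a divergence of 0 and a score of 1, while profiles with no documents in common give a divergence of 1 and a score of 0.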

## BERTopic

```python
from topic_stability.integrations.bertopic import from_bertopic

run, embeddings = from_bertopic(model, docs=None, *, embeddings=None, doc_ids=None)
```

Returns `(TopicRun, embeddings_array)`.

**Key differences from LDA/NMF:**

- BERTopic assigns each document to exactly one cluster (hard assignment). The
  `doc_topic` matrix is binary: 1 for the assigned topic, 0 elsewhere.
  Documents that HDBSCAN assigns to topic −1 (outliers) get an all-zero row.
- `model.probabilities_` contains HDBSCAN soft-membership scores, not
  topic-weight distributions. We do not use them — they are a geometric
  property of the embedding space, not comparable to LDA posterior weights.
- Word representations come from c-TF-IDF scores, not a generative word
  distribution. Cross-model word-based comparison is not meaningful.
- Stability scores measure whether the *same documents* cluster together
  across runs, not whether the same word distributions recur.
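
To make the hard-assignment point concrete, here is a minimal sketch of how a binary `doc_topic` matrix could be built from per-document topic ids. The helper `hard_assignment_matrix` is hypothetical and is not the code inside `from_bertopic`; it only mirrors the behaviour described above, including the all-zero rows for outliers.

```python
import numpy as np


def hard_assignment_matrix(topics, n_topics):
    """Build a binary (n_docs, n_topics) matrix from per-document topic ids.

    Documents labelled -1 (outliers) keep an all-zero row.
    """
    topics = np.asarray(topics)
    doc_topic = np.zeros((len(topics), n_topics))
    assigned = topics >= 0
    doc_topic[np.flatnonzero(assigned), topics[assigned]] = 1.0
    return doc_topic


# Five documents, three topics; the third document is an outlier.
doc_topic = hard_assignment_matrix([0, 2, -1, 1, 2], n_topics=3)
```

With a matrix like this, each topic column is an indicator of cluster membership, so the stability score compares which documents share a cluster from run to run.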

## CLI pipeline (Mallet / RustMallet)

The package includes CLI wrappers for a full file-based workflow:

```bash
# 1. Embed documents
topic-stability-embed corpus.tsv embeddings.npy

# 2. Project to 2D
topic-stability-project embeddings.npy umap_2d.csv

# 3. Estimate distributions from Mallet states
topic-stability-estimate model_42_run1/ 42 corpus.tsv

# 4. Visualize a single run
topic-stability-visualize umap_2d.csv model_42_run1/doc_topic_avg.csv \
    model_42_run1/word_topic_avg.csv topics.png
```

## License

MIT
