Metadata-Version: 2.4
Name: semaclust
Version: 0.4.2
Summary: Semantic text clustering using sentence embeddings and agglomerative clustering.
Project-URL: Homepage, https://github.com/cobanov/semaclust
Project-URL: Repository, https://github.com/cobanov/semaclust
Project-URL: Issues, https://github.com/cobanov/semaclust/issues
Project-URL: Changelog, https://github.com/cobanov/semaclust/blob/main/CHANGELOG.md
Author-email: Mert Cobanov <mertcobanov@gmail.com>
License: MIT
License-File: LICENSE
Keywords: clustering,embeddings,nlp,semantic,sentence-transformers,text
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: numpy>=1.23
Requires-Dist: scikit-learn>=1.2
Requires-Dist: sentence-transformers>=2.2
Requires-Dist: typer>=0.12
Provides-Extra: benchmarks
Requires-Dist: einops>=0.8.2; extra == 'benchmarks'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-randomly>=3.15; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
Description-Content-Type: text/markdown

# semaclust

[![CI](https://github.com/cobanov/semaclust/actions/workflows/ci.yml/badge.svg)](https://github.com/cobanov/semaclust/actions/workflows/ci.yml)
[![PyPI version](https://img.shields.io/pypi/v/semaclust.svg)](https://pypi.org/project/semaclust/)
[![Python versions](https://img.shields.io/pypi/pyversions/semaclust.svg)](https://pypi.org/project/semaclust/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

**semaclust** (semantic + clustering) is a small Python library for clustering
similar strings using sentence embeddings and agglomerative clustering. It is
useful for deduplicating free-text fields, normalizing user-entered values, and
collapsing spelling or formatting variants into a canonical form.

**Documentation:** https://cobanov.github.io/semaclust/

## Installation

```bash
pip install semaclust
# or
uv add semaclust
```

Install from source:

```bash
pip install git+https://github.com/cobanov/semaclust.git
```

## Quickstart

```python
from semaclust import TextClusterer

texts = [
    "New York", "NYC", "new york city",
    "Los Angeles", "LA",
    "San Francisco", "San Fran", "SF",
]

clusterer = TextClusterer(distance_threshold=1.0)
clusterer.fit(texts)

print(clusterer.n_clusters_)
# 3

print(clusterer.clusters_)
# {0: ['New York', 'NYC', 'new york city'],
#  1: ['Los Angeles', 'LA'],
#  2: ['San Francisco', 'San Fran', 'SF']}

print(clusterer.transform())
# ['NYC', 'NYC', 'NYC', 'LA', 'LA', 'SF', 'SF', 'SF']
```

`fit_transform` is the one-call shortcut:

```python
TextClusterer(distance_threshold=1.0).fit_transform(texts)
```

## API at a glance

| Method / attribute | Returns | Purpose |
|---|---|---|
| `fit(texts)` | `self` | Cluster and store fitted attributes |
| `fit_predict(texts)` | `ndarray[int]` | Cluster labels per input text |
| `fit_transform(texts)` | `list[str]` | Each text replaced with its representative |
| `transform(texts=None)` | `list[str]` | Replace texts seen at fit time with their representatives; defaults to the fitted texts |
| `get_replacement_map()` | `dict[str, str]` | Mapping from original to representative |
| `result_` | `ClusterResult` | Frozen dataclass with labels, clusters, reps |

Fitted attributes (sklearn convention): `labels_`, `clusters_`,
`representatives_`, `n_clusters_`, `texts_`.
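
As a rough illustration of how the mapping returned by `get_replacement_map()`
might be used downstream, the sketch below applies such a map to new values.
The `replacement_map` literal and the `normalize` helper are hypothetical: the
literal only mimics the `dict[str, str]` shape listed in the table, and passing
unseen values through unchanged is one possible policy, not documented
behavior.

```python
# Hypothetical stand-in for clusterer.get_replacement_map(); the real mapping
# depends on the encoder, the threshold, and the representative selector.
replacement_map: dict[str, str] = {
    "New York": "NYC",
    "NYC": "NYC",
    "new york city": "NYC",
    "Los Angeles": "LA",
    "LA": "LA",
}


def normalize(values: list[str], mapping: dict[str, str]) -> list[str]:
    """Replace each value with its representative, leaving unseen values as-is."""
    return [mapping.get(value, value) for value in values]


rows = ["new york city", "LA", "Chicago"]
normalized = normalize(rows, replacement_map)
print(normalized)
```

Here `"Chicago"` was never part of the mapping, so it is returned untouched
rather than raising an error.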

## Choosing a model and threshold

The default encoder is `all-MiniLM-L6-v2` (23M params, 384-dim, ~80 MB). It's fast,
small, and a sensible default, but it isn't the strongest option for every task.
The table below summarizes nine encoders on three workloads of escalating
difficulty: short text with abbreviations (cities), medium text with synonyms
(job titles), and long sentences with topical themes (customer feedback). See
[benchmarks.md](benchmarks.md) for the full methodology, raw thresholds, and
per-test breakdowns.

ARI (Adjusted Rand Index) of 1.0 means an exact match with the ground-truth
clustering. Bold marks the best score in each column.

| Model | Params | Dim | cities | job titles | feedback | Good threshold |
|---|---|---|---|---|---|---|
| `all-MiniLM-L6-v2` (default) | 23M | 384 | **1.000** | 0.672 | **1.000** | 1.0 (cities) / 1.3 (feedback) |
| `all-MiniLM-L12-v2` | 33M | 384 | **1.000** | 0.512 | **1.000** | 1.20 |
| `all-mpnet-base-v2` | 109M | 768 | 0.619 | 0.512 | **1.000** | - |
| `BAAI/bge-small-en-v1.5` | 33M | 384 | **1.000** | 0.672 | **1.000** | 0.95 - 1.00 |
| `BAAI/bge-m3` | 568M | 1024 | **1.000** | 0.520 | **1.000** | - |
| `nomic-ai/nomic-embed-text-v1.5` * | 137M | 768 | **1.000** | 0.520 | **1.000** | 0.70 - 0.85 |
| `nomic-ai/nomic-embed-text-v2-moe` * | 475M | 768 | **1.000** | 0.672 | **1.000** | 1.15 - 1.35 |
| `mixedbread-ai/mxbai-embed-large-v1` | 335M | 1024 | **1.000** | **0.815** | **1.000** | 0.95 - 1.05 |
| `Qwen/Qwen3-Embedding-0.6B` | 596M | 1024 | 0.529 | 0.672 | **1.000** | - |

\* Requires a `clustering: ` prefix on each input. The benchmark applies this
automatically; if you swap the model in directly, wrap it in a custom encoder
that prepends the prefix.
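
A prefixing wrapper could look roughly like the sketch below. `PrefixEncoder`
and `SupportsEncode` are hypothetical names, not part of semaclust; the only
thing assumed about the wrapped object is that it exposes
`encode(texts: list[str])`, as the `Encoder` protocol requires.

```python
from typing import Any, Protocol


class SupportsEncode(Protocol):
    """Anything with an encode(list[str]) method, e.g. a SentenceTransformer."""

    def encode(self, texts: list[str]) -> Any: ...


class PrefixEncoder:
    """Hypothetical wrapper that prepends a task prefix before encoding."""

    def __init__(self, inner: SupportsEncode, prefix: str = "clustering: ") -> None:
        self.inner = inner
        self.prefix = prefix

    def encode(self, texts: list[str]) -> Any:
        # Prefix every input, then delegate to the wrapped encoder unchanged.
        return self.inner.encode([f"{self.prefix}{text}" for text in texts])
```

In practice `inner` would be a loaded model such as
`SentenceTransformer("nomic-ai/nomic-embed-text-v1.5")` (which may need
`trust_remote_code=True`), and the wrapper would be passed as
`TextClusterer(encoder=PrefixEncoder(model))`.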

**Practical takeaways:**

- **Bigger isn't always better.** `all-mpnet-base-v2` and
  `Qwen/Qwen3-Embedding-0.6B` both *fail* the simplest test (cities): they
  merge Los Angeles with San Francisco before merging SF with San Francisco.
- **Best small drop-in**: `BAAI/bge-small-en-v1.5` is in the default's size
  class and has a single threshold window (0.95-1.00) that works for both
  cities and feedback.
- **Best overall**: `mixedbread-ai/mxbai-embed-large-v1` is the only model to
  exceed 0.8 ARI on job titles, and a single threshold window (0.95-1.05)
  works for both cities and feedback. It has roughly 15x the parameters of
  the default.
- **No model solves the job-titles case.** Synonymy across roles
  (`SWE` ~ Programmer ~ Software Engineer, Product Manager ~ Product Owner)
  defeats every encoder we tested.
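
To make the ARI figures in the table above more concrete, here is a small
pure-Python sketch of the standard pair-counting formula. It is not the
benchmark's code (which may well use `sklearn.metrics.adjusted_rand_score`);
the function name and the toy labelings are illustrative assumptions and do
not reproduce any score in the table.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth: list[int], pred: list[int]) -> float:
    """Adjusted Rand Index between two labelings of the same items."""
    if len(truth) != len(pred):
        raise ValueError("labelings must have the same length")
    n = len(truth)
    if n < 2:
        return 1.0  # nothing to disagree about

    # Count item pairs that share a cluster in both, in truth, and in pred.
    both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred).values())

    expected = in_truth * in_pred / comb(n, 2)  # agreement expected by chance
    maximum = (in_truth + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one big cluster
    return (both - expected) / (maximum - expected)


truth = [0, 0, 0, 1, 1, 2, 2, 2]            # three cities, as in the quickstart
same_up_to_renaming = [5, 5, 5, 7, 7, 9, 9, 9]
two_cities_merged = [0, 0, 0, 1, 1, 1, 1, 1]

print(adjusted_rand_index(truth, same_up_to_renaming))
print(adjusted_rand_index(truth, two_cities_merged))
```

Cluster IDs are arbitrary, so the renamed labeling still scores a perfect 1.0,
while merging two of the three true clusters lowers the score.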

To swap the encoder, pass a string or a custom `Encoder`:

```python
from semaclust import TextClusterer

clusterer = TextClusterer(
    encoder="BAAI/bge-small-en-v1.5",
    distance_threshold=1.0,
)
```

### Hardware acceleration

`device="auto"` is the default. It picks CUDA if a GPU is visible, then MPS on
Apple Silicon, and otherwise lets sentence-transformers fall back to CPU.
On Apple Silicon Macs the GPU is used automatically; no code change is needed.

```python
TextClusterer(device="auto")   # default; cuda > mps > cpu
TextClusterer(device="mps")    # force MPS
TextClusterer(device="cuda")   # force CUDA
TextClusterer(device="cpu")    # force CPU
TextClusterer(device=None)     # delegate to sentence-transformers (no MPS auto-pick)
```
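
The `cuda > mps > cpu` priority can be sketched as below. This illustrates the
ordering only and is not semaclust's actual implementation: `pick_device` and
`detect_device` are hypothetical names, and the sketch returns `"cpu"`
explicitly where the library is described as deferring to
sentence-transformers.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the cuda > mps > cpu priority to two availability flags."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for accelerators; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


print(detect_device())
```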

## Plugging in your own encoder

Any object implementing `encode(texts: list[str]) -> np.ndarray` satisfies the
`Encoder` protocol:

```python
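import numpy as np

from semaclust import Encoder, TextClusterer


class MyEncoder:
    """Toy encoder: random vectors, only to illustrate the interface."""

    def encode(self, texts: list[str]) -> np.ndarray:
        # One 384-dim float32 row per input text.
        return np.random.rand(len(texts), 384).astype(np.float32)


encoder: Encoder = MyEncoder()  # structural typing: no inheritance needed
clusterer = TextClusterer(encoder=encoder)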
```

## CLI

semaclust ships a small `typer`-based CLI:

```bash
# Cluster lines from a file, write JSON
semaclust cluster items.txt --threshold 1.0 --output clusters.json

# Replace each line with its cluster representative
cat items.txt | semaclust replace --threshold 1.0
```

Run `semaclust --help` for the full reference.

## Migration from 0.1.x

The 0.3 release was a breaking change. The single `cluster(texts)` entry point
is gone; the new API mirrors scikit-learn:

| 0.1.x | 0.3.x |
|---|---|
| `clusterer.cluster(texts)` | `clusterer.fit(texts).clusters_` |
| `clusterer.get_replacement_map(texts)` | `clusterer.fit(texts).get_replacement_map()` |
| `clusterer.replace_values(texts)` | `clusterer.fit_transform(texts)` |

The `representative_selector` argument moved from per-call to the constructor.
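
If existing call sites are hard to update all at once, thin wrappers over the
new API can ease the transition. The sketch below follows the table above; the
three functions are hypothetical helpers, not part of semaclust, and each one
refits the clusterer on every call.

```python
from typing import Any


def cluster(clusterer: Any, texts: list[str]) -> dict[int, list[str]]:
    """Old-style cluster(): fit, then return the cluster-id -> texts mapping."""
    return clusterer.fit(texts).clusters_


def get_replacement_map(clusterer: Any, texts: list[str]) -> dict[str, str]:
    """Old-style get_replacement_map(texts): fit, then read the mapping."""
    return clusterer.fit(texts).get_replacement_map()


def replace_values(clusterer: Any, texts: list[str]) -> list[str]:
    """Old-style replace_values(): now a direct alias for fit_transform()."""
    return clusterer.fit_transform(texts)
```

Any object with `fit`, `fit_transform`, `clusters_`, and
`get_replacement_map` works here, so a `TextClusterer` instance can be passed
as `clusterer`.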

## Development

```bash
git clone https://github.com/cobanov/semaclust.git
cd semaclust
uv sync --extra dev
uv run pytest
uv run ruff check src tests
uv run mypy src/semaclust
```

Install the pre-commit hooks once:

```bash
uv run pre-commit install
```

## License

MIT, see [LICENSE](LICENSE).
