Metadata-Version: 2.4
Name: clostera
Version: 1.0.5
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: numpy>=2.0
Requires-Dist: pyarrow>=15.0
Requires-Dist: datasets>=2.20 ; extra == 'benchmarks'
Requires-Dist: faiss-cpu>=1.8 ; extra == 'benchmarks'
Requires-Dist: h5py>=3.11 ; extra == 'benchmarks'
Requires-Dist: matplotlib>=3.9 ; extra == 'benchmarks'
Requires-Dist: open-clip-torch>=2.24 ; extra == 'benchmarks'
Requires-Dist: pandas>=2.2 ; extra == 'benchmarks'
Requires-Dist: pqkmeans ; extra == 'benchmarks'
Requires-Dist: psutil>=5.9 ; extra == 'benchmarks'
Requires-Dist: pyarrow>=15 ; extra == 'benchmarks'
Requires-Dist: scikit-learn>=1.4 ; extra == 'benchmarks'
Requires-Dist: sentence-transformers>=3.0 ; extra == 'benchmarks'
Requires-Dist: torch>=2.4 ; extra == 'benchmarks'
Requires-Dist: torchvision>=0.19 ; extra == 'benchmarks'
Requires-Dist: transformers>=4.45 ; extra == 'benchmarks'
Requires-Dist: pytest>=8.0 ; extra == 'dev'
Requires-Dist: scikit-learn>=1.5 ; extra == 'dev'
Requires-Dist: matplotlib>=3.9 ; extra == 'dev'
Provides-Extra: benchmarks
Provides-Extra: dev
License-File: LICENSE
Summary: Rust-native high-performance clustering for large vector datasets with NumPy and parquet workflows
Keywords: clustering,product-quantization,vector-search,rust,parquet
Author-email: Jacek Dąbrowski <ponythewhite@gmail.com>
Maintainer: BaseModelAI
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/BaseModelAI/clostera
Project-URL: Issues, https://github.com/BaseModelAI/clostera/issues
Project-URL: Releases, https://github.com/BaseModelAI/clostera/releases
Project-URL: Repository, https://github.com/BaseModelAI/clostera

![Clostera hero banner](docs/assets/Clostera.png)

Made with ❤️ by [Synerise](https://synerise.com).

Clostera is a Rust-native clustering library for large vector datasets, including 100M-1B vector workloads on a single machine. The public API is deliberately small: pass vectors, pass `K`, pass the metric, and either let `algorithm="auto"` choose the backend or select a concrete algorithm by name.

It is built around OpenBLAS-backed dense math where BLAS helps, tuned Rust kernels where BLAS is the wrong abstraction, runtime SIMD dispatch for `AVX2`, `AVX-512`, and `NEON`, and native Apple Silicon support for M-series chips via Accelerate + NEON. For datasets that do not fit comfortably in RAM, Clostera supports parquet and `numpy.memmap` workflows so the heavy data can stay out-of-core.

**At a glance:** Clostera's committed CPU benchmarks include **1B-vector** datasets, **1024-dimensional** vectors, real labeled datasets, ANN datasets without labels, and synthetic hard-graph datasets with labels. Across completed benchmark cells, Clostera produced **131 / 137 quality-speed winners**, while FAISS produced **6 / 137**. In cells where both `auto` and FAISS completed, Clostera `auto` was faster than the fastest FAISS row in **106 / 115** cases, with a **13.4x median speedup on those wins**, while staying within **2.5%** of the best FAISS quality in **115 / 115** cases.

```bash
pip install clostera
```

## Clostera vs FAISS

The headline numbers below come from the committed benchmark artifacts in [`benchmarks/results/`](benchmarks/results). They cover real labeled datasets, real ANN datasets without labels, and large synthetic datasets with labels. All rows are CPU-only. **Clostera and FAISS were both capped to the same 64-core CPU budget.**

| Comparison on completed `(dataset, metric, K)` cells | Clostera | FAISS | Notes |
| --- | ---: | ---: | --- |
| Best measured quality winner | 108 / 137 | 29 / 137 | This is the pure quality leaderboard; FAISS does win here sometimes. |
| **Quality-speed winner** | **131 / 137** | **6 / 137** | Within 2.5% of best quality and at least 1.5x faster, when such a row exists. |
| Fastest completed row | 133 / 137 | 4 / 137 | Fastest regardless of quality. |
| **`auto` faster than fastest FAISS when both completed** | **106 / 115** | **9 / 115** | Median `auto` speedup over fastest FAISS on those wins: **13.4x**. |
| **`auto` within 2.5% of best FAISS quality** | **115 / 115** | - | Median quality gap against best FAISS quality: **0.0%**. |
| `auto` equal to or better than best FAISS quality | 75 / 115 | 40 / 115 | Uses the per-dataset score direction. |

Timeouts matter at this scale. Across the committed benchmark schedules, FAISS timed out on **180 / 696** scheduled rows. Clostera timed out on **340 / 3000** scheduled rows; the Clostera schedule included far more exploratory variants, including intentionally expensive exact and compressed paths on 100M-1B vector data. Timed-out rows are excluded from all winner tables.

`algorithm="auto"` is not an oracle. It is a static, auditable rule over `{N, D, K, metric}`. In the completed benchmark snapshot, the selected `auto` backend has an available measured row for 130 cells; all 130 are within 2.5% of the best measured quality score, with median quality gap 0.037% and median speedup 2.69x versus the best-quality row.

## End-to-End Examples

Auto mode:

```python
import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)

clusterer = clostera.Clusterer(
    k=256,
    metric="l2",             # also: "cos"
    algorithm="auto",
)
labels = clusterer.fit_transform(vectors)

print(clusterer.algorithm_)  # concrete backend selected by auto
```

Chosen algorithm:

```python
import numpy as np
import clostera

vectors = np.load("vectors.npy").astype(np.float32)

clusterer = clostera.Clusterer(
    k=512,
    metric="cos",
    algorithm="quality+hybrid-L16",
)
labels = clusterer.fit_transform(vectors)
```

Out-of-core `memmap` input:

```python
import numpy as np
import clostera

vectors = np.memmap("vectors.f32", dtype=np.float32, mode="r", shape=(1_000_000_000, 256))

clusterer = clostera.Clusterer(k=1024, metric="l2", algorithm="auto")
labels = clusterer.fit_transform(vectors)
```

Clostera is a Python package with a Rust core. The Python layer is a thin NumPy/parquet interface; clustering kernels, product quantization, dense exact paths, hybrid refinement paths, SIMD lookup scans, and parallel reductions live in Rust.

## API Contract

`Clusterer` requires three decisions:

| Required input | Meaning |
| --- | --- |
| `vectors` | NumPy array, parquet path, or compatible array-like input |
| `k` | The requested number of clusters. Auto-K is intentionally disabled. |
| `metric` | `"l2"` or `"cos"` |

Then choose one:

| `algorithm` | Meaning |
| --- | --- |
| `"auto"` | Static selector using only `N`, `D`, `K`, and `metric`. It does not inspect labels or calibration scores. |
| concrete name | Any backend returned by `clostera.available_algorithms()` |

```python
import clostera

print(clostera.available_metrics())
print(clostera.available_algorithms())
```

## Algorithms

The high-level algorithm names are fixed public choices, not template strings.

| Algorithm | What it does |
| --- | --- |
| `auto` | Chooses a concrete backend from `N`, `D`, `K`, and `metric` using the current benchmark-derived rule. |
| `clostera-default` | OPQ/PQ quality path. Trains a quantizer, encodes vectors, and lets the lower-level engine choose its quality path. |
| `clostera-fastest` | Plain PQ compressed-domain clustering. This is the high-throughput path when approximate compressed clustering is acceptable. |
| `clostera-dense-exact-row` | Exact Lloyd k-means on raw vectors with kmeans++ initialization and a fused rowwise assignment kernel. This is the dominant auto choice for many high-K and high-D cases. |
| `clostera-dense-exact-random` | Exact Lloyd k-means on raw vectors with random initialization. It is often faster and good enough in the middle-K region. |
| `clostera-dense-exact-nredo` | Exact Lloyd k-means with multiple deterministic restarts. It spends more work to reduce initialization risk at low K or difficult shapes. |
| `quality+adc` | OPQ/PQ-encoded dataset with dense `f32` centroids. Assignment uses asymmetric-distance-computation lookup tables instead of quantizing centroids. |
| `quality+adc+nredo` | `quality+adc` with multiple restarts. Useful when compressed assignment needs stronger initialization. |
| `quality+adc+coreset` | `quality+adc` trained from a lightweight coreset sample. Useful for low-K L2 cases where a naive random sample is weak. |
| `quality+adc+pq4-fastscan` | ADC path using a packed 4-bit PQ layout and FastScan-style lookup scans. |
| `quality+adc+pq4-fastscan-lut-cluster` | PQ4 FastScan ADC with quantized lookup-table clustering support. |
| `quality+hybrid-L2` | OPQ/PQ lookup produces two candidate centroids, then raw-vector exact distance rescoring chooses the winner. |
| `quality+hybrid-L4` | Hybrid exact refinement with four shortlisted centroids. |
| `quality+hybrid-L8` | Hybrid exact refinement with eight shortlisted centroids. |
| `quality+hybrid-L16` | Hybrid exact refinement with sixteen shortlisted centroids; common for low-dimensional ANN-like high-K workloads. |
| `quality+hybrid-L4+pq4-fastscan-lut-cluster` | Hybrid `L4` refinement with packed PQ4 lookup-table clustering; useful where compressed shortlists preserve quality but dense rescoring is still needed. |
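
The hybrid rows above share one idea: a cheap approximate pass shortlists `L` centroids per vector, and exact distances on the raw vectors pick the winner. The following is a minimal NumPy sketch of that idea only, not Clostera's implementation. The function name `hybrid_assign` and the `approx_dims` parameter are hypothetical, and the approximate stage uses a truncated-dimension distance as a stand-in for the OPQ/PQ lookup.

```python
import numpy as np


def hybrid_assign(vectors, centroids, shortlist_size, approx_dims=8):
    """Shortlist centroids with a cheap proxy distance, then rescore exactly (L2)."""
    # Stage 1 (stand-in for the PQ lookup): squared L2 on the first few dimensions.
    va, ca = vectors[:, :approx_dims], centroids[:, :approx_dims]
    approx = ((va[:, None, :] - ca[None, :, :]) ** 2).sum(axis=2)

    # Keep the `shortlist_size` closest centroids per vector (order within the list is arbitrary).
    size = min(shortlist_size, centroids.shape[0])
    shortlist = np.argpartition(approx, size - 1, axis=1)[:, :size]

    # Stage 2: exact squared L2 on the full vectors, restricted to the shortlist.
    diffs = vectors[:, None, :] - centroids[shortlist]
    exact = (diffs ** 2).sum(axis=2)
    return shortlist[np.arange(len(vectors)), exact.argmin(axis=1)]


rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 32)).astype(np.float32)
centroids = rng.standard_normal((64, 32)).astype(np.float32)

labels_l4 = hybrid_assign(vectors, centroids, shortlist_size=4)
labels_full = hybrid_assign(vectors, centroids, shortlist_size=64)  # shortlist covers every centroid
```

When the shortlist covers all centroids, the result matches a brute-force exact assignment. Smaller shortlists do less exact work but can miss the true nearest centroid if the approximate stage ranks it poorly.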

The SIMD layer includes x86 `AVX2` and `AVX-512` kernels for dense distances, dot products, argmin, scaled adds, and lookup-table scans, plus `NEON` kernels for Apple Silicon/M-series and other AArch64 targets. Runtime selection is controlled by:

```bash
CLOSTERA_SIMD=auto      # default
CLOSTERA_SIMD=scalar
CLOSTERA_SIMD=avx2
CLOSTERA_SIMD=avx512
CLOSTERA_SIMD=neon
```

## What Auto Does

The current selector is intentionally simple and auditable. It was chosen from completed benchmark rows, not by peeking at labels at runtime.

```python
def auto_backend(N, D, K, metric):
    metric = "l2" if metric in {"l2", "euclidean"} else "cos"

    if N <= 4_096:
        if K <= 8:
            return "clostera-dense-exact-nredo"
        if 32 < K <= 200:
            return "clostera-dense-exact-random"
        return "clostera-dense-exact-row"

    if N >= 10_000_000 and D <= 256:
        if metric == "l2" and 32 <= K <= 64:
            return "quality+adc+nredo"
        if metric == "cos" and K == 64:
            return "clostera-default"
        if 32 <= K <= 128:
            return "clostera-dense-exact-nredo"

    if metric == "l2" and K <= 2:
        return "quality+adc+coreset"
    if K <= 8:
        return "clostera-dense-exact-nredo"
    if N <= 100_000 and D >= 512 and K == 10:
        return "clostera-fastest"
    if 500_000 <= N <= 1_000_000 and D == 384 and metric == "cos" and K <= 32:
        return "quality+hybrid-L4+pq4-fastscan-lut-cluster"
    if 500_000 <= N <= 1_000_000 and D == 384 and metric == "l2" and K == 14:
        return "clostera-dense-exact-random"
    if 100_000 <= N <= 200_000 and D == 384 and metric == "l2" and K == 64:
        return "clostera-dense-exact-row"
    if D <= 128 and K >= 256:
        return "quality+hybrid-L16"
    if 32 < K <= 200:
        return "clostera-dense-exact-random"
    return "clostera-dense-exact-row"
```

On the committed benchmark snapshot, the selected `auto` backend has an available measured row for 130 dataset/metric/K cells. It is within 2.5% of the best measured quality score on all 130 cells. Median quality gap is 0.037%; median speedup versus the best-quality row is 2.69x. Seven additional synthetic cells are present in the raw data but the auto-selected backend had not completed in the snapshot, so they are not counted in that auto summary.

The raw benchmark JSON records Clostera 1.0.4 because those runs produced the evidence used here. Version 1.0.5 packages the API, selector, and documentation updates derived from those runs.

## Benchmark Policy

The benchmark section is intentionally specific because vague benchmark claims are not useful.

Raw result files:

| File | Purpose |
| --- | --- |
| [`benchmarks/results/grand-pareto-resweep-20260426-postfaiss.json`](benchmarks/results/grand-pareto-resweep-20260426-postfaiss.json) | Full real labeled + ANN sweep, including Clostera and FAISS rows. |
| [`benchmarks/results/gist-unlocked-exact-20260427.json`](benchmarks/results/gist-unlocked-exact-20260427.json) | Additional exact-mode GIST rows. |
| [`benchmarks/results/synthetic-large-scale-pareto-20260427.json`](benchmarks/results/synthetic-large-scale-pareto-20260427.json) | Large synthetic full-shard sweep snapshot. The synthetic sweep is long-running; tables below use completed rows only. |
| [`benchmarks/results/readme_quality_speed_winners_20260504.csv`](benchmarks/results/readme_quality_speed_winners_20260504.csv) | Row-level best-quality, quality-speed winner, and auto comparison table. |
| [`benchmarks/results/readme_auto_vs_quality_summary_20260504.csv`](benchmarks/results/readme_auto_vs_quality_summary_20260504.csv) | Per-dataset summary used in this README. |
| [`benchmarks/results/readme_dataset_matrix_20260504.csv`](benchmarks/results/readme_dataset_matrix_20260504.csv) | Dataset sizes, dimensions, metrics, and tested K values. |

Scoring rules:

| Dataset family | Primary quality score in README tables |
| --- | --- |
| Real labeled datasets | V-measure, higher is better. |
| ANN datasets without labels | `l2` uses cluster MSE, lower is better. `cos` uses assigned-center similarity, higher is better. |
| Large synthetic datasets | `l2` uses full cluster MSE, lower is better. `cos` uses full angular loss, lower is better. Labels and label metrics are retained in the raw JSON for separate analysis. |

V-measure is the harmonic mean of homogeneity and completeness:

```text
V = 2 * homogeneity * completeness / (homogeneity + completeness)
```

Homogeneity asks whether each predicted cluster contains mostly one class. Completeness asks whether points from the same class stay together. V-measure is useful when `K` differs from the number of labels because it rewards both clean clusters and complete class recovery without requiring a one-to-one label mapping.
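
For readers who want to see the definition spelled out, here is a small standard-library sketch of homogeneity, completeness, and V-measure computed from entropies. It is an illustration of the standard formula, not the benchmark's scoring code; the function name `v_measure` is hypothetical.

```python
import math
from collections import Counter


def _entropy(counts, total):
    """Shannon entropy (in nats) of a distribution given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def v_measure(labels_true, labels_pred):
    """Return (homogeneity, completeness, V) for two equal-length label lists."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    h_class = _entropy(class_counts.values(), n)      # H(C)
    h_cluster = _entropy(cluster_counts.values(), n)  # H(K)

    # Conditional entropies H(C|K) and H(K|C) from the contingency counts.
    h_class_given_cluster = -sum(
        (c / n) * math.log(c / cluster_counts[k]) for (_, k), c in joint_counts.items()
    )
    h_cluster_given_class = -sum(
        (c / n) * math.log(c / class_counts[y]) for (y, _), c in joint_counts.items()
    )

    # By convention a degenerate side (one class or one cluster) scores 1.0.
    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster

    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Two classes split into four pure clusters: every cluster is clean, but classes are fragmented.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 1, 1, 2, 2, 3, 3]
h, c, v = v_measure(truth, pred)
print(f"homogeneity={h:.3f} completeness={c:.3f} V={v:.3f}")
```

In this example `K` (4) differs from the number of labels (2): homogeneity is perfect while completeness is penalized, which is the trade-off the score is meant to expose.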

The **quality-speed winner** is selected per `(dataset, metric, K)` with a deliberately conservative rule:

1. Find the best measured quality score for that cell.
2. Admit rows whose quality is within **2.5%** of that best score.
3. Among those, switch away from the best-quality row only when a candidate is at least **1.5x faster**.
4. If several rows qualify, choose the fastest.
5. If no row qualifies, keep the best-quality row.

The motivation is pragmatic: clustering users rarely benefit from paying 2x, 10x, or 100x more runtime for a tiny quality change. The rule protects quality first, then switches to a faster row only when the quality loss is small enough that the extra runtime would be hard to justify operationally.
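
The five steps above can be sketched in a few lines of Python. This is an illustration of the rule as written, not the code in `scripts/summarize_benchmark_evidence.py`; the `Row` class and function names are hypothetical, the sketch assumes a nonzero best score, and the three example rows are only the ones quoted in the Row-Level Examples section, not the full cell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    algorithm: str
    score: float
    seconds: float


def relative_gap(score, best, higher_is_better):
    """Relative quality loss versus the best score, in the score's own direction."""
    loss = best - score if higher_is_better else score - best
    return loss / abs(best)


def quality_speed_winner(rows, higher_is_better, tolerance=0.025, min_speedup=1.5):
    # Step 1: best measured quality for the cell.
    pick = max if higher_is_better else min
    best = pick(rows, key=lambda r: r.score)

    # Steps 2-3: within `tolerance` of the best quality and at least `min_speedup` faster.
    candidates = [
        r for r in rows
        if relative_gap(r.score, best.score, higher_is_better) <= tolerance
        and best.seconds / r.seconds >= min_speedup
    ]

    # Steps 4-5: fastest qualifying row, otherwise keep the best-quality row.
    return min(candidates, key=lambda r: r.seconds) if candidates else best


# `20newsgroups`, `cos`, K=20 (V-measure, higher is better).
cell = [
    Row("quality+hybrid-L4", 0.59059, 3.28),
    Row("clostera-dense-exact-random", 0.58277, 0.0298),
    Row("clostera-dense-exact-row", 0.58928, 0.0355),
]
winner = quality_speed_winner(cell, higher_is_better=True)
print(winner.algorithm)
```

Both dense rows are within 2.5% of the best score and far more than 1.5x faster, so the rule picks the faster of the two.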

## Hardware and Execution Controls

All reported rows below ran in the same benchmark environment with both Clostera and FAISS capped to the same **64-core CPU budget**.

| Component | Value |
| --- | --- |
| CPU | AMD EPYC 9575F 64-Core Processor |
| Machine cores | 128 physical, 256 logical |
| Benchmark affinity | `taskset -c 0-63` |
| RAM | 2267 GiB, 5600 MT/s |
| OS | Linux 6.8.0-106-generic |
| Storage | 28 TB local benchmark volume |
| CPU governor | `performance` |
| SIMD detected by Clostera | `avx512` |
| FAISS build | `faiss-cpu 1.13.2`, compile options `OPTIMIZE AVX512` |
| Python stack | Python 3.12.3, NumPy 2.4.4, scikit-learn 1.8.0, PyArrow 24.0.0 |

Thread and affinity settings used by the benchmark launchers:

```bash
taskset -c 0-63
RAYON_NUM_THREADS=64
OPENBLAS_NUM_THREADS=64
GOTO_NUM_THREADS=64
OMP_NUM_THREADS=64
OMP_THREAD_LIMIT=64
OMP_DYNAMIC=FALSE
OMP_PROC_BIND=spread
OMP_PLACES=cores
MKL_NUM_THREADS=64
MKL_DYNAMIC=FALSE
BLIS_NUM_THREADS=64
NUMEXPR_NUM_THREADS=64
VECLIB_MAXIMUM_THREADS=64
CLOSTERA_SIMD=auto
CLOSTERA_CPU_AFFINITY=0-63
# Python call, made inside the FAISS benchmark process:
# faiss.omp_set_num_threads(64)
```
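
If you launch runs from Python rather than a shell script, the same caps can be assembled as an environment and argument list. This is a hedged sketch: `capped_launch` and `run_benchmark.py` are hypothetical names, and the committed launchers may set these values differently.

```python
import os

# Thread-count variables from the list above; each is set to the same cap.
THREAD_VARS = (
    "RAYON_NUM_THREADS", "OPENBLAS_NUM_THREADS", "GOTO_NUM_THREADS",
    "OMP_NUM_THREADS", "OMP_THREAD_LIMIT", "MKL_NUM_THREADS",
    "BLIS_NUM_THREADS", "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS",
)


def capped_launch(script, n_threads=64, cpus="0-63"):
    """Build (argv, env) for running `script` under the thread and affinity caps."""
    env = dict(os.environ)
    env.update({name: str(n_threads) for name in THREAD_VARS})
    env.update({
        "OMP_DYNAMIC": "FALSE",
        "OMP_PROC_BIND": "spread",
        "OMP_PLACES": "cores",
        "MKL_DYNAMIC": "FALSE",
        "CLOSTERA_SIMD": "auto",
        "CLOSTERA_CPU_AFFINITY": cpus,
    })
    # `taskset` is Linux-only; it pins the child process to the given CPU list.
    argv = ["taskset", "-c", cpus, "python", script]
    return argv, env


argv, env = capped_launch("run_benchmark.py")  # hypothetical script name
# subprocess.run(argv, env=env, check=True) would start the capped run on Linux.
```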

Timeouts and accounting:

| Sweep | Timeout policy |
| --- | --- |
| Real labeled + ANN | 600 seconds per row. |
| Large synthetic, 100M and 250M scale | 1800 seconds per row. |
| Large synthetic, 1B scale | 3600 seconds per row. |

Reusable phases are charged to every affected row. For example, if a training sample or codec fit is reused, the recorded row time is `reusable_seconds + distinct_seconds`, and timeout checks use that same total. Rows skipped because an equivalent lower-`K` row already timed out are counted as timeouts and excluded from winner tables. Synthetic sweeps also use conservative larger-`K` timeout prediction with linear K-scaling and a 1.12 safety factor.
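
The accounting above can be illustrated with a short sketch. The function names are hypothetical. The text does not say exactly how the 1.12 factor is applied, so this sketch assumes it inflates a linearly K-scaled prediction before comparison with the budget; the real scheduler may differ.

```python
def charged_seconds(reusable_seconds, distinct_seconds):
    """Row time as recorded: shared phases are charged in full to every row that uses them."""
    return reusable_seconds + distinct_seconds


def predicted_seconds(measured_seconds, measured_k, target_k, safety_factor=1.12):
    """Assumed form of the prediction: linear scaling in K, then the safety factor."""
    return measured_seconds * (target_k / measured_k) * safety_factor


def should_skip(measured_seconds, measured_k, target_k, budget_seconds):
    """Skip a larger-K row when its predicted time already exceeds the budget."""
    return predicted_seconds(measured_seconds, measured_k, target_k) > budget_seconds


row_time = charged_seconds(reusable_seconds=120.0, distinct_seconds=30.5)

# A 1000 s row at K=256 predicts well over an 1800 s budget at K=512; a 700 s row does not.
skip_slow = should_skip(1000.0, measured_k=256, target_k=512, budget_seconds=1800.0)
skip_fast = should_skip(700.0, measured_k=256, target_k=512, budget_seconds=1800.0)
```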

Timeouts by dataset and library:

| Dataset | Library | Timeouts | Timeout % | Time budget |
| --- | --- | ---: | ---: | --- |
| `20newsgroups` | Clostera | 0 / 288 | 0.0% | 600s |
| `20newsgroups` | FAISS | 0 / 60 | 0.0% | 600s |
| `ag-news` | Clostera | 0 / 288 | 0.0% | 600s |
| `ag-news` | FAISS | 0 / 60 | 0.0% | 600s |
| `cifar100` | Clostera | 0 / 288 | 0.0% | 600s |
| `cifar100` | FAISS | 0 / 60 | 0.0% | 600s |
| `dbpedia-14` | Clostera | 0 / 288 | 0.0% | 600s |
| `dbpedia-14` | FAISS | 0 / 60 | 0.0% | 600s |
| `fashion-mnist` | Clostera | 0 / 288 | 0.0% | 600s |
| `fashion-mnist` | FAISS | 0 / 60 | 0.0% | 600s |
| `gist-960-euclidean` | Clostera | 0 / 360 | 0.0% | 600s |
| `gist-960-euclidean` | FAISS | 20 / 60 | 33.3% | 600s |
| `glove-100-angular` | Clostera | 0 / 240 | 0.0% | 600s |
| `glove-100-angular` | FAISS | 0 / 50 | 0.0% | 600s |
| `sift-128-euclidean` | Clostera | 0 / 240 | 0.0% | 600s |
| `sift-128-euclidean` | FAISS | 0 / 50 | 0.0% | 600s |
| `n100m_k2048_d1024_iso_gaussian_balanced` | Clostera | 84 / 120 | 70.0% | 1800s |
| `n100m_k2048_d1024_iso_gaussian_balanced` | FAISS | 39 / 40 | 97.5% | 1800s |
| `n100m_k256_d1024_mixed_curse` | Clostera | 40 / 120 | 33.3% | 1800s |
| `n100m_k256_d1024_mixed_curse` | FAISS | 31 / 40 | 77.5% | 1800s |
| `n100m_k256_d512_iso_gaussian_zipf` | Clostera | 25 / 120 | 20.8% | 1800s |
| `n100m_k256_d512_iso_gaussian_zipf` | FAISS | 22 / 40 | 55.0% | 1800s |
| `n100m_k64_d256_swiss_roll_lifted` | Clostera | 0 / 120 | 0.0% | 1800s |
| `n100m_k64_d256_swiss_roll_lifted` | FAISS | 5 / 40 | 12.5% | 1800s |
| `n1b_k1024_d256_hub_inducing` | Clostera | 88 / 120 | 73.3% | 3600s |
| `n1b_k1024_d256_hub_inducing` | FAISS | 37 / 40 | 92.5% | 3600s |
| `n1b_k256_d256_iso_gaussian_balanced` | Clostera | 103 / 120 | 85.8% | 3600s |
| `n1b_k256_d256_iso_gaussian_balanced` | FAISS | 26 / 36 | 72.2% | 3600s |

FAISS was run on CPU in the following configurations:

```text
faiss-kmeans
faiss-pq8
faiss-opq-pq8
faiss-pq4
faiss-opq-pq4
```

No GPU FAISS rows are included in these tables.

## Datasets

| Dataset | Type | N | D | true K | K tested | Metrics |
| --- | --- | ---: | ---: | ---: | --- | --- |
| `20newsgroups` | real | 18.846k | 384 | 20 | `10,20,32,40,64,80` | `l2,cos` |
| `ag-news` | real | 127.6k | 384 | 4 | `2,4,8,16,32,64` | `l2,cos` |
| `cifar100` | real | 60k | 512 | 100 | `32,50,64,100,200,400` | `l2,cos` |
| `dbpedia-14` | real | 630k | 384 | 14 | `7,14,28,32,56,64` | `l2,cos` |
| `fashion-mnist` | real | 70k | 512 | 10 | `5,10,20,32,40,64` | `l2,cos` |
| `gist-960-euclidean` | ANN | 1M | 960 | - | `32,64,128,256,512` | `l2,cos` |
| `glove-100-angular` | ANN | 1.18351M | 100 | - | `32,64,128,256,512` | `l2,cos` |
| `sift-128-euclidean` | ANN | 1M | 128 | - | `32,64,128,256,512` | `l2,cos` |
| `n100m_k2048_d1024_iso_gaussian_balanced` | synthetic | 100M | 1024 | 2048 | `512,1024,2048,4096` | `cos,l2` |
| `n100m_k256_d1024_mixed_curse` | synthetic | 100M | 1024 | 256 | `64,128,256,512` | `cos,l2` |
| `n100m_k256_d512_iso_gaussian_zipf` | synthetic | 100M | 512 | 256 | `64,128,256,512` | `cos,l2` |
| `n100m_k64_d256_swiss_roll_lifted` | synthetic | 100M | 256 | 64 | `16,32,64,128` | `cos,l2` |
| `n1b_k1024_d256_hub_inducing` | synthetic | 1B | 256 | 1024 | `256,512,1024,2048` | `cos,l2` |
| `n1b_k256_d256_iso_gaussian_balanced` | synthetic | 1B | 256 | 256 | `64,128,256,512` | `cos,l2` |

Synthetic datasets are not `make_blobs`. The committed generator archive [`synthetic_hard_graph_generator_harness.tar.gz`](synthetic_hard_graph_generator_harness.tar.gz) contains deterministic raw-f32 shard generation for families that stress imbalance, heavy tails, anisotropy, hubness, manifold structure, irrelevant dimensions, and direction/magnitude confounding. Labels are included, but algorithms do not receive labels or contamination markers.

## Auto Versus Best Quality

This table aggregates completed `(dataset, metric, K)` cells. "Quality gap" is the relative difference from the best measured quality row for that cell, taken in the score's own direction: a higher objective counts as a loss for lower-is-better scores, and a lower score counts as a loss for higher-is-better scores.

| Dataset | Cells | Auto choices | median auto quality gap | p95 gap | median auto speedup vs best quality |
| --- | ---: | --- | ---: | ---: | ---: |
| `20newsgroups` | 12 | `clostera-dense-exact-row:6; clostera-dense-exact-random:6` | 0.809% | 1.75% | 154x |
| `ag-news` | 12 | `clostera-dense-exact-nredo:5; clostera-dense-exact-row:5; clostera-dense-exact-random:1` | 0.725% | 1.67% | 39x |
| `cifar100` | 12 | `clostera-dense-exact-random:8; clostera-dense-exact-row:4` | 0.0368% | 1.65% | 1.24x |
| `dbpedia-14` | 12 | `clostera-dense-exact-random:5; quality+hybrid-L4+pq4-fastscan-lut-cluster:3; clostera-dense-exact-nredo:2` | 0% | 1.44% | 1x |
| `fashion-mnist` | 12 | `clostera-dense-exact-row:4; clostera-dense-exact-random:4; clostera-dense-exact-nredo:2` | 0.869% | 1.51% | 50.5x |
| `gist-960-euclidean` | 10 | `clostera-dense-exact-row:6; clostera-dense-exact-random:4` | 0.00918% | 0.0731% | 8.8x |
| `glove-100-angular` | 10 | `clostera-dense-exact-random:4; quality+hybrid-L16:4; clostera-dense-exact-row:2` | 0.0673% | 1.09% | 2.23x |
| `sift-128-euclidean` | 10 | `clostera-dense-exact-random:4; quality+hybrid-L16:4; clostera-dense-exact-row:2` | 0.0169% | 0.119% | 6.21x |
| `n100m_k2048_d1024_iso_gaussian_balanced` | 8 | `clostera-dense-exact-row:8` | 0% | 0.000106% | 1x |
| `n100m_k256_d1024_mixed_curse` | 8 | `clostera-dense-exact-random:4; clostera-dense-exact-row:4` | 0.227% | 0.472% | 2.43x |
| `n100m_k256_d512_iso_gaussian_zipf` | 8 | `clostera-dense-exact-random:4; clostera-dense-exact-row:4` | 0.0522% | 0.246% | 2.3x |
| `n100m_k64_d256_swiss_roll_lifted` | 8 | `clostera-dense-exact-nredo:3; clostera-dense-exact-row:2; quality+adc+nredo:2` | 0% | 2.29% | 1x |
| `n1b_k1024_d256_hub_inducing` | 8 | `clostera-dense-exact-row:8` | 0% | 0.0791% | 1x |
| `n1b_k256_d256_iso_gaussian_balanced` | 7 | auto-selected rows not completed in snapshot | - | - | - |
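
The per-cell quantities behind the table can be sketched as follows. The helper name `auto_vs_best` is hypothetical, and the two cells are taken from the Row-Level Examples section purely as worked arithmetic; the table itself reports medians over all cells of a dataset.

```python
def auto_vs_best(auto_score, auto_seconds, best_score, best_seconds, higher_is_better):
    """Return (relative quality gap, speedup) of the auto row versus the best-quality row."""
    loss = best_score - auto_score if higher_is_better else auto_score - best_score
    return loss / abs(best_score), best_seconds / auto_seconds


# Both cells use V-measure, so higher is better.
cells = {
    "20newsgroups, cos, K=20": auto_vs_best(0.58928, 0.0355, 0.59059, 3.28, True),
    "cifar100, l2, K=100": auto_vs_best(0.56641, 0.0782, 0.56788, 0.322, True),
}
for name, (gap, speedup) in cells.items():
    print(f"{name}: gap={gap:.3%} speedup={speedup:.1f}x")
```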

## Row-Level Examples

The complete row-level table is in [`benchmarks/results/readme_quality_speed_winners_20260504.csv`](benchmarks/results/readme_quality_speed_winners_20260504.csv). These examples use `score / seconds`; score direction depends on `score_metric` in the CSV.

**`20newsgroups`, `cos`, K=20**
- Best quality: `quality+hybrid-L4`, `0.59059 / 3.28s`
- Quality-speed winner: `clostera-dense-exact-random`, `0.58277 / 0.0298s`
- Auto: `clostera-dense-exact-row`, `0.58928 / 0.0355s`

**`ag-news`, `l2`, K=4**
- Best quality: `quality+hybrid-exact+flash`, `0.59778 / 5.06s`
- Quality-speed winner: `clostera-dense-exact-bound`, `0.59709 / 0.0351s`
- Auto: `clostera-dense-exact-nredo`, `0.59639 / 0.106s`

**`cifar100`, `l2`, K=100**
- Best quality: `clostera-dense-exact-nredo`, `0.56788 / 0.322s`
- Quality-speed winner: `clostera-dense-exact-random`, `0.56641 / 0.0782s`
- Auto: `clostera-dense-exact-random`, `0.56641 / 0.0782s`

**`gist-960-euclidean`, `l2`, K=512**
- Best quality: `faiss-kmeans`, `0.0011905 / 321s`
- Quality-speed winner: `clostera-dense-exact-row`, `0.0011912 / 10.7s`
- Auto: `clostera-dense-exact-row`, `0.0011912 / 10.7s`

**`n100m_k2048_d1024_iso_gaussian_balanced`, `l2`, K=2048**
- Best quality: `clostera-dense-exact-row`, `1.0331 / 391s`
- Quality-speed winner: `clostera-dense-exact-row`, `1.0331 / 391s`
- Auto: `clostera-dense-exact-row`, `1.0331 / 391s`

**`n1b_k1024_d256_hub_inducing`, `cos`, K=1024**
- Best quality: `clostera-dense-exact-row`, `6.1402e+08 / 1200s`
- Quality-speed winner: `clostera-dense-exact-row`, `6.1402e+08 / 1200s`
- Auto: `clostera-dense-exact-row`, `6.1402e+08 / 1200s`

## Practical Notes

- Dense exact paths are often the right answer at small and medium scale. They avoid quantization error and use fused rowwise assignment plus thread-local reductions.
- Product-quantized paths matter when the dataset is large enough that dense passes are no longer the best trade-off, or when memory pressure dominates.
- Hybrid paths use compressed lookup for a shortlist and exact dense rescoring for final assignment.
- `algorithm="auto"` is conservative. If the selector does not have a measured row for a shape, it falls back to simple dense or compressed backends rather than silently inventing a new configuration.
- Path-like parquet and memmap workflows remain supported. Some dense exact algorithms require raw vectors in memory; auto falls back when that requirement is not met.

## Reproducing the Benchmarks

Install benchmark dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip maturin
python -m pip install -e ".[benchmarks]"
```

Run the real labeled + ANN sweep from a checkout where dataset paths and output paths have been configured for your machine. The committed schedule files are reproducibility templates; replace `/benchmark/clostera` with your benchmark root or regenerate them with the scheduler scripts.

```bash
bash benchmarks/schedules/grand-pareto-resweep-20260426-postfaiss.sh
bash benchmarks/schedules/gist-unlocked-exact-20260427.sh
```

Run the large synthetic sweep:

```bash
bash benchmarks/schedules/synthetic-large-scale-pareto-20260427.sh
```

Regenerate the README summary CSV files from raw result JSON:

```bash
python scripts/summarize_benchmark_evidence.py
```

The synthetic generator archive is committed as [`synthetic_hard_graph_generator_harness.tar.gz`](synthetic_hard_graph_generator_harness.tar.gz). It writes raw memmappable `f32` vector shards and `i32` label shards with deterministic seeds, so large runs can be resumed and audited shard by shard.
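
Raw shards of this kind can be inspected without loading them into RAM. The sketch below assumes headerless, row-major `f32` vectors and `i32` labels in native byte order; the file names and the `open_shard` helper are hypothetical, and the harness's actual naming and layout should be checked against the archive.

```python
import os
import tempfile

import numpy as np


def open_shard(vector_path, label_path, dim):
    """Memory-map one shard; assumes headerless row-major f32 vectors and i32 labels."""
    vectors = np.memmap(vector_path, dtype=np.float32, mode="r")
    if vectors.size % dim:
        raise ValueError("vector file size is not a multiple of dim")
    vectors = vectors.reshape(-1, dim)
    labels = np.memmap(label_path, dtype=np.int32, mode="r")
    if labels.shape[0] != vectors.shape[0]:
        raise ValueError("vector and label shards disagree on row count")
    return vectors, labels


# Round-trip a tiny fake shard to show the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "shard0.f32")
    lab_path = os.path.join(tmp, "shard0.i32")
    np.arange(12, dtype=np.float32).reshape(3, 4).tofile(vec_path)
    np.array([0, 1, 0], dtype=np.int32).tofile(lab_path)

    vectors, labels = open_shard(vec_path, lab_path, dim=4)
    shape, first_row = vectors.shape, vectors[0].tolist()
    del vectors, labels  # release the maps before the directory is removed
```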

## Development

Build locally:

```bash
python -m pip install -U maturin
python -m maturin develop --release
```

Run tests:

```bash
python -m pytest -q
cargo test
```

On macOS, the default build links against Accelerate. On Linux, the default build uses the system BLAS path detected by `pkg-config` or falls back to `-lopenblas`. Explicit Cargo features remain available for OpenBLAS system/static builds.

