Metadata-Version: 2.4
Name: static-embed-runner
Version: 0.1.0
Summary: Lightweight runner for static-embedding / bag-of-embeddings sentence models (numpy only, CPU, Windows-friendly)
Author-email: jun76 <jun76.main@gmail.com>
License-Expression: MIT
Keywords: embeddings,sentence-embeddings,static-embedding,bag-of-embeddings,retrieval,japanese,tokenizer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Provides-Extra: rust
Requires-Dist: tokenizers>=0.15; extra == "rust"
Dynamic: license-file

# static-embed-runner

**English** | [日本語](README.ja.md)

A lightweight runner for static-embedding / bag-of-embeddings sentence models, powered by **numpy only**.

It runs the full pipeline — tokenize → mean pooling → (optional head) → L2 normalization —
in pure Python with a total dependency footprint of ~50 MB, without torch / transformers /
sentence-transformers. No native build step is required, so it runs anywhere
(including Windows, where compiler toolchains and runtime DLLs are a common source of pain —
here `pip install` is all you need).

## Verified models

Loads static-embedding models in `tokenizer.json` + safetensors form, from a local
directory or a Hugging Face Hub repo id. The same verification and benchmark flow has
been run successfully on both Windows 11 and WSL Ubuntu 24.04.3 LTS.

The models below were verified end-to-end on a 224-sentence ja/en corpus plus edge
cases, against each model's reference implementation:

| Model                                                                                                                                                              | Format          | Verification                                                  |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------- | -------------------------------------------------------------- |
| [sentence-transformers/static-retrieval-mrl-en-v1](https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1) (English retrieval, official)           | StaticEmbedding | token ids 224/224; max emb diff 5.2e-8 vs sentence-transformers |
| [sentence-transformers/static-similarity-mrl-multilingual-v1](https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1) (multilingual, official) | StaticEmbedding | token ids 224/224; max emb diff 4.5e-8 (benchmark target below) |
| [minishlab/potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) and other Model2Vec models                                                              | Model2Vec       | token ids 224/224; max emb diff 1.2e-7 vs `model2vec` reference |
| [RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en](https://huggingface.co/RikkaBotan/quantized-stable-static-embedding-fast-retrieval-mrl-bilingual-ja-en) (ja/en bilingual, 4-bit quantized) | SSE q4          | token ids 224/224; max emb diff 6e-8 vs sentence-transformers   |

For Model2Vec models the runner mirrors the reference implementation's semantics of
dropping the unknown token before pooling.
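
As an illustration of that rule, here is a minimal sketch of mean pooling that skips the unknown token. The function name `mean_pool_drop_unk`, the toy table, and the zero-vector result for an all-unknown input are assumptions made for this example, not the runner's actual code.

```python
import numpy as np

def mean_pool_drop_unk(token_ids, table, unk_id):
    """Mean-pool the rows of `table` for `token_ids`, ignoring `unk_id`."""
    kept = [t for t in token_ids if t != unk_id]
    if not kept:
        # Assumption for this sketch: nothing left to pool gives a zero vector.
        return np.zeros(table.shape[1], dtype=np.float32)
    return table[kept].mean(axis=0, dtype=np.float32)

# Toy table: 4 tokens, dim 3; token 0 plays the role of the unknown token.
table = np.arange(12, dtype=np.float32).reshape(4, 3)
vec = mean_pool_drop_unk([1, 0, 3], table, unk_id=0)   # averages rows 1 and 3 only
```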

The tokenizer is a pure-Python implementation driven by `tokenizer.json`, covering
Unigram (Viterbi, byte_fallback) and WordPiece (including BertNormalizer /
BertPreTokenizer). Configurations outside this subset automatically fall back to the
`tokenizers` package when installed.

## Installation

```powershell
pip install static-embed-runner         # numpy is the only dependency
pip install static-embed-runner[rust]   # optional: Rust tokenizer backend
```

## Usage

```python
from static_embed_runner import StaticEmbedRunner, similarity

# Accepts a HF Hub repo id or a local directory. For Hub ids, only the files
# the runner needs are downloaded (via urllib) and cached under
# ~/.cache/static-embed-runner.
runner = StaticEmbedRunner.load("minishlab/potion-base-8M")

emb = runner.encode(["こんにちは", "Hello"])         # (2, dim) float32, L2-normalized
emb64 = runner.encode("Hello", truncate_dim=64)      # Matryoshka truncation (MRL models)

sim = similarity(emb, emb)                           # bundled cosine-similarity helper (2, 2)
```

CLI:

```powershell
static-embed-runner minishlab/potion-base-8M "Hello world" --bench --out emb.npy
```

### API scope

What this library produces is **raw, L2-normalized embedding vectors (numpy arrays)**.
Similarity computation and search are fundamentally the caller's responsibility, but since
dot product = cosine for normalized vectors, a thin `similarity(a, b)` helper is bundled
(essentially `a @ b.T`). ANN indexes, storage, and reranking are out of scope — the numpy
arrays plug directly into faiss / hnswlib / sqlite-vec and friends.
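
For small corpora, brute-force search over the normalized vectors needs nothing beyond numpy. The sketch below is one way a caller might do it; `top_k` is a hypothetical helper written for this example (it is not part of the library), and the random unit vectors stand in for real `encode` output.

```python
import numpy as np

def top_k(query_emb, corpus_emb, k=3):
    """Return (indices, scores) of the k most similar corpus rows per query."""
    scores = query_emb @ corpus_emb.T          # cosine, because rows are unit-norm
    k = min(k, scores.shape[1])
    idx = np.argsort(-scores, axis=1)[:, :k]   # highest score first
    return idx, np.take_along_axis(scores, idx, axis=1)

# Stand-in for runner.encode(...) output: 5 unit vectors of dim 8.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((5, 8)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

idx, scores = top_k(corpus[:2], corpus, k=2)   # each query's best hit is itself
```

A full `argsort` is fine at this scale; for large corpora an ANN index is the better fit, as noted above.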

### Options

- `table="q4"` (default): for 4-bit quantized models, keeps the table packed in memory
  (~26 MB RAM) and dequantizes only the rows actually looked up.
- `table="f32"`: dequantizes the whole table at load time (~200 MB RAM). Fastest lookups.
- `tokenizer_backend` (default: auto): `"lite"` forces the pure-Python tokenizer and
  `"rust"` uses the `tokenizers` package.
- `encode(..., normalize=False)`: when you need pre-normalization vectors.

## Benchmarks

Target model: [sentence-transformers/static-similarity-mrl-multilingual-v1](https://huggingface.co/sentence-transformers/static-similarity-mrl-multilingual-v1)
(official multilingual static embedding model, vocab 105,879 × 1024 dims).
Baseline: the same model running on sentence-transformers + torch (CPU).
Environment: Windows 11 / i7-14700 (20C/28T) / Python 3.13.
Corpus: 224 mixed ja/en sentences (`bench/texts.py`). The runner's output matches the
baseline with **identical token ids (224/224)** and a max embedding diff of 4.5e-8
(float32 rounding only).

| Configuration                                 |   Deps size |   Single p50 | Batch ms/text |    Throughput |
| --------------------------------------------- | ----------: | -----------: | ------------: | ------------: |
| Baseline: sentence-transformers + torch (CPU) |    993.6 MB |     0.419 ms |        0.0246 |  40,587 txt/s |
| **runner (numpy only)**                       | **52.5 MB** | **0.020 ms** |    **0.0159** |  62,848 txt/s |
| runner + `tokenizers` (optional)              |     96.4 MB |     0.047 ms |        0.0216 |  46,301 txt/s |

- Dependency size: **~1/19** (52.5 MB vs 993.6 MB)
- Single-text latency: **~21× faster** (0.020 ms vs 0.419 ms p50)
- Batch throughput: **~1.5× higher** (62,848 vs 40,587 txt/s)
- Load time: 0.28 s for the runner vs 7.0 s (import + load) for the baseline (with warm HF cache)

Cold numbers (fresh process, empty caches): single-text p50 is 0.049 ms, and the first
batch call pays a one-time ~0.1 s warm-up (first touch of the 433 MB table + BLAS thread
spin-up) before settling at the steady-state numbers above.
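
The headline ratios and the 433 MB table size follow directly from the figures quoted in this section. The short calculation below reproduces them; the only assumption is that the table stores 4-byte float32 entries.

```python
# Ratios taken from the benchmark table above.
deps_ratio = 993.6 / 52.5             # dependency size, baseline vs runner
latency_ratio = 0.419 / 0.020         # single-text p50, baseline vs runner
throughput_ratio = 62_848 / 40_587    # batch throughput, runner vs baseline

# Embedding table: vocab 105,879 x 1024 dims, assuming 4 bytes per float32 entry.
table_mb = 105_879 * 1024 * 4 / 1e6

print(f"deps {deps_ratio:.1f}x, latency {latency_ratio:.1f}x, "
      f"throughput {throughput_ratio:.2f}x, table {table_mb:.1f} MB")
```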

4-bit quantized models (SSE format) show the same trend, with the table held packed at
~26 MB RAM in `table="q4"` mode.

With the word-level cache, the pure-Python tokenizer wins batch throughput on typical
corpora too; the `[rust]` backend (+44 MB) mainly pays off for bulk indexing of
low-redundancy text and for `tokenizer.json` configs outside the built-in subset.

All speedups come from algorithms and BLAS; no OS-specific optimizations are used.

### Running the benchmark locally

Model weights are not included in this repository. The real-model smoke tests look
for `./model` by default and are skipped when it is missing. To run them without
manually preparing `./model`, pass a Hugging Face repo id or a local model directory
with `RUNNER_MODEL`:

```bash
python -m venv .venv
.venv/bin/python -m pip install -e . pytest
RUNNER_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv/bin/python -m pytest -q
RUNNER_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv/bin/python bench/bench_runner.py
```

`bench_runner.py` downloads only the files this runner needs and writes reproducible
artifacts under `results/`. Its result name includes the detected model format, actual
table storage, and tokenizer backend, for example
`runner_format=static-embedding_table=f32_tok=lite` or
`runner_format=sse-q4_table=q4_tok=lite`.

The sentence-transformers baseline uses a separate environment so its large dependency
tree does not affect the runner dependency-size measurement:

```bash
python -m venv .venv-baseline
.venv-baseline/bin/python -m pip install sentence-transformers
BASELINE_MODEL=sentence-transformers/static-similarity-mrl-multilingual-v1 \
  .venv-baseline/bin/python bench/bench_baseline.py
```

## Implementation notes

The pipeline is `tokenize → mean pooling → head (if any) → MRL truncation → L2 normalize`.
Everything runs on CPU; no GPU is used (static embeddings are just table lookups plus a
mean, so transfer overhead would dwarf the compute on a GPU; the baseline was also run
with `device="cpu"` for fairness).
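
The stages after tokenization fit in a few lines of numpy. The sketch below is an illustrative reimplementation under stated assumptions, not the runner's source: `encode_ids` is a hypothetical name, the head is modeled as scalar `(alpha, beta, bias)` parameters, and empty inputs are pooled to a zero vector.

```python
import numpy as np

def encode_ids(batch_ids, table, head=None, truncate_dim=None, normalize=True):
    """mean pooling -> head (if any) -> MRL truncation -> L2 normalize."""
    out = np.zeros((len(batch_ids), table.shape[1]), dtype=np.float32)
    for i, ids in enumerate(batch_ids):
        if ids:                                   # assumption: empty input stays all-zero
            out[i] = table[ids].mean(axis=0)
    if head is not None:
        alpha, beta, bias = head                  # hypothetical scalar head parameters
        out = beta * np.tanh(alpha * out + bias)
    if truncate_dim is not None:
        out = out[:, :truncate_dim]               # Matryoshka: keep the leading dims
    if normalize:
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        out = out / np.maximum(norms, 1e-12)      # guard against zero vectors
    return out.astype(np.float32)

table = np.eye(4, dtype=np.float32)               # toy table: 4 tokens, dim 4
emb = encode_ids([[0, 1], [2]], table)
emb2 = encode_ids([[0, 1], [2]], table, head=(1.0, 1.0, 0.0), truncate_dim=2)
```

Truncation happens before normalization, so truncated vectors are still unit-length (except rows that become all-zero, as the second row of `emb2` does here).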

### 1. numpy as the only dependency (993.6 MB → 52.5 MB)

A static embedding model is really just "an embedding table plus three lines of math",
yet the baseline drags in torch (518 MB), scipy, transformers, and more. So everything
around the table is reimplemented from scratch:

- **safetensors reader** (`safetensors_lite.py`, ~40 lines): the format is just
  "8-byte header length + JSON metadata + raw buffer", readable with `struct` + `json` + numpy.
- **Tokenizer** (`hf_tokenizer_lite.py`): pure-Python, driven by `tokenizer.json`.
  Supports Unigram (Viterbi, byte_fallback, unk penalty) and WordPiece plus the major
  normalizers / pre-tokenizers. Anything outside this subset falls back to the
  `tokenizers` package.
- **EmbeddingBag / head / normalization**: a few lines of numpy each
  (e.g. `beta * tanh(alpha * x + bias)`).
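
To make the safetensors point concrete, here is a minimal reader in the spirit of the description above. It is a sketch written for this README rather than a copy of `safetensors_lite.py`: the function name `read_safetensors` and the small dtype map are assumptions, and it reads the whole file into memory on a little-endian host.

```python
import json
import os
import struct
import tempfile

import numpy as np

# Subset of safetensors dtype tags, enough for this sketch.
_DTYPES = {"F32": np.float32, "F16": np.float16, "I64": np.int64,
           "I32": np.int32, "U8": np.uint8}

def read_safetensors(path):
    """Return {name: ndarray} from a safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # 8-byte little-endian length
        header = json.loads(f.read(header_len))          # JSON metadata
        buf = f.read()                                   # raw tensor buffer
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":                       # optional free-form metadata
            continue
        start, end = info["data_offsets"]                # offsets relative to the buffer
        arr = np.frombuffer(buf[start:end], dtype=_DTYPES[info["dtype"]])
        tensors[name] = arr.reshape(info["shape"])
    return tensors

# Round-trip demo: write a tiny file by hand, then read it back.
demo = np.arange(6, dtype=np.float32).reshape(2, 3)
header = json.dumps({"emb": {"dtype": "F32", "shape": [2, 3],
                             "data_offsets": [0, demo.nbytes]}}).encode()
path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
with open(path, "wb") as f:
    f.write(struct.pack("<Q", len(header)) + header + demo.tobytes())
loaded = read_safetensors(path)
```

The arrays returned by `np.frombuffer` are read-only views over the bytes object, which is sufficient for lookups.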

### 2. Making a pure-Python tokenizer compete with parallel Rust

Initially the batch path was 6× slower than the Rust tokenizer (which parallelizes with
rayon across 20 cores). What closed the gap:

- **First-char → candidate-piece-length table**: the Viterbi inner loop only tries piece
  lengths that can actually start at the current position, drastically cutting failed
  dict probes.
- **Chunking + memoization** (Unigram): the vocab is inspected, and if the only pieces
  containing a non-leading `▁` are pure `▁` runs, the optimal segmentation provably never
  crosses a word-end → `▁` boundary. The input is then split into `▁+word` chunks and
  Viterbi results are cached per chunk (an exact divide-and-conquer, not a greedy
  approximation). On English or repetitive corpora most chunks become cache hits.
- **Per-char memoized normalization + raw-word cache** (WordPiece): BertNormalizer's
  transforms (clean / CJK padding / strip accents / lowercase) all act on one character
  at a time, so the pipeline collapses into a lazily built char → replacement table —
  this is what makes CJK text fast, where every character otherwise goes through
  `unicodedata`. Because the normalizer is per-char, it commutes with whitespace
  splitting, so whole raw words can additionally be memoized straight from the input.
- Smaller wins: a single piece → `(id, score)` dict halves lookups; tuple allocations
  eliminated from the backtracking arrays.
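
The first two ideas can be sketched together on a toy Unigram vocabulary. Everything here is illustrative: the vocabulary and scores are made up, `text.split()` plus a `▁` prefix is a simplification of the real pre-processing, and unknown characters raise instead of going through byte fallback or an unk penalty. The toy vocab has no piece with a non-leading `▁`, so per-chunk Viterbi is exact.

```python
import math
from functools import lru_cache

# Toy vocab: piece -> log-probability score (hypothetical values).
VOCAB = {"▁": -4.0, "▁he": -2.0, "▁hello": -1.0, "llo": -2.5, "l": -3.0,
         "o": -3.0, "h": -3.5, "e": -3.5, "▁world": -1.2,
         "w": -4.0, "r": -4.0, "d": -4.0}

# First char -> sorted piece lengths that can start with that char.
_by_char = {}
for piece in VOCAB:
    _by_char.setdefault(piece[0], set()).add(len(piece))
LENGTHS = {c: sorted(s) for c, s in _by_char.items()}

@lru_cache(maxsize=None)            # memoize the best segmentation per chunk
def viterbi(chunk):
    n = len(chunk)
    best = [-math.inf] * (n + 1)    # best[j]: best score of a segmentation of chunk[:j]
    back = [0] * (n + 1)            # back[j]: start index of the last piece ending at j
    best[0] = 0.0
    for i in range(n):
        if best[i] == -math.inf:
            continue
        for length in LENGTHS.get(chunk[i], ()):   # only lengths that can start here
            j = i + length
            if j > n:
                break                              # lengths are sorted ascending
            score = VOCAB.get(chunk[i:j])
            if score is not None and best[i] + score > best[j]:
                best[j], back[j] = best[i] + score, i
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {chunk!r} with the toy vocab")
    pieces, j = [], n
    while j > 0:
        pieces.append(chunk[back[j]:j])
        j = back[j]
    return tuple(reversed(pieces))

def tokenize(text):
    """Split into '▁word' chunks and segment each chunk via the cached Viterbi."""
    return [p for word in text.split() for p in viterbi("▁" + word)]

tokens = tokenize("hello world hello")   # the second "hello" is a cache hit
```

`viterbi.cache_info()` shows how many chunks were served from the cache, which is the effect the chunking bullet describes for repetitive corpora.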

### 3. Batch pooling as a BLAS matmul (~3× over reduceat)

The naive version (gather all token embeddings, then `np.add.reduceat` segment sums) was
the biggest bottleneck in profiles. Since bag-of-embeddings discards order, the math
rearranges into a single sgemm:

> count matrix over the batch's unique tokens `C (B×U)` @ unique embeddings `E (U×dim)`

This wins twice: (a) BLAS uses all cores, and (b) gathering and 4-bit decoding only touch
the U unique tokens instead of every token occurrence. An implicit **float64 promotion via
integer division was also eliminated** (it had been doubling the cost of tanh and the matmul).
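
A minimal sketch of both variants, plus the dtype pitfall, is shown below. The function names are hypothetical and the code assumes every input has at least one token; it illustrates the rearrangement rather than reproducing the runner's implementation.

```python
import numpy as np

def pool_reduceat(batch_ids, table):
    """Naive mean pooling: gather every token occurrence, then segment sums."""
    flat = np.concatenate(batch_ids)
    lengths = np.array([len(ids) for ids in batch_ids])
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    sums = np.add.reduceat(table[flat], offsets, axis=0)
    return sums / lengths[:, None].astype(np.float32)   # cast keeps float32

def pool_matmul(batch_ids, table):
    """Mean pooling as C (B x U) @ E (U x dim) over the batch's unique tokens."""
    flat = np.concatenate(batch_ids)
    lengths = np.array([len(ids) for ids in batch_ids])
    uniq, inverse = np.unique(flat, return_inverse=True)
    rows = np.repeat(np.arange(len(batch_ids)), lengths)
    counts = np.zeros((len(batch_ids), len(uniq)), dtype=np.float32)
    np.add.at(counts, (rows, inverse), 1.0)              # token counts per sentence
    counts /= lengths[:, None].astype(np.float32)        # fold the mean into C
    return counts @ table[uniq]                          # single sgemm; gathers U rows

rng = np.random.default_rng(0)
table = rng.standard_normal((50, 16)).astype(np.float32)
batch = [[3, 3, 7], [7, 1], [4]]
a = pool_reduceat(batch, table)
b = pool_matmul(batch, table)

# The pitfall: dividing float32 by an integer array silently promotes to float64.
promoted = (table[:3] / np.array([3, 2, 1])[:, None]).dtype
```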

### 4. 4-bit table: one-shot LUT decode + two storage modes

- A precomputed **256×2 LUT** maps each packed byte (0–255) to its dequantized `(hi, lo)`
  float32 pair, so unpacking is a single fancy-index — no per-row bit twiddling.
- `table="q4"` (default): keep the table packed (~26 MB RAM), decode only referenced rows.
- `table="f32"`: dequantize everything at load (~200 MB RAM), making lookups a pure gather. Fastest.
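
The LUT idea can be sketched as follows. The quantization scheme here (a single global `SCALE` and `ZERO`, high nibble first) is a stand-in chosen for the example; the actual SSE q4 layout and its parameters are not described in this README, so treat those details as assumptions.

```python
import numpy as np

# Hypothetical scheme for this sketch: value = (nibble - ZERO) * SCALE.
SCALE, ZERO = 0.1, 8

nibble_values = (np.arange(16, dtype=np.float32) - ZERO) * SCALE
byte = np.arange(256)
# LUT[b] = (dequantized high nibble, dequantized low nibble) of byte b.
LUT = np.stack([nibble_values[byte >> 4], nibble_values[byte & 0x0F]], axis=1)

def decode_rows(packed, row_ids):
    """Dequantize only the requested rows of a packed (vocab, dim // 2) uint8 table."""
    rows = packed[row_ids]                         # (n, dim // 2) packed bytes
    return LUT[rows].reshape(len(row_ids), -1)     # (n, dim // 2, 2) -> (n, dim)

packed = np.array([[0x80, 0x9F],
                   [0x08, 0xF0]], dtype=np.uint8)  # toy table: 2 rows, dim 4
decoded = decode_rows(packed, [0])                 # q4 mode: decode referenced rows
full = decode_rows(packed, np.arange(len(packed))) # f32 mode: decode everything once
```

In this sketch the two storage modes differ only in when `decode_rows` is called: per lookup for the packed table, or once at load time for the full float32 table.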

### 5. Correctness pinned against the baseline

After every optimization, two checks run against the reference implementation: exact
token-id equality (including edge cases: full-width characters, ZWJ emoji, control
characters, soft hyphens) and max absolute embedding error, which stays at float32
rounding level (at most 6e-8 against sentence-transformers and 1.2e-7 against
`model2vec` for the verified models). Edge cases like empty strings (`tokenizers`
returns an empty id list) were caught and matched this way.
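
The two checks reduce to a small comparison helper. The sketch below uses a hypothetical `compare_with_reference` function and toy data; in practice the inputs would come from the runner and from the reference implementation.

```python
import numpy as np

def compare_with_reference(ids, ref_ids, emb, ref_emb):
    """Return (matching id lists, total, max absolute embedding difference)."""
    matches = sum(list(a) == list(b) for a, b in zip(ids, ref_ids))
    diff = np.asarray(emb, dtype=np.float64) - np.asarray(ref_emb, dtype=np.float64)
    return matches, len(ref_ids), float(np.max(np.abs(diff)))

# Toy data: the second text is an empty string, so its id list is empty.
ids = [[101, 7, 102], []]
ref_ids = [[101, 7, 102], []]
emb = np.array([[0.6, 0.8], [0.0, 0.0]], dtype=np.float32)
ref_emb = emb.copy()
ref_emb[0, 0] = np.nextafter(ref_emb[0, 0], np.float32(1.0))   # one float32 step away

matches, total, max_diff = compare_with_reference(ids, ref_ids, emb, ref_emb)
print(f"token ids {matches}/{total}; max emb diff {max_diff:.1e}")
```

A difference of one float32 step near 0.6 is about 6e-8, which is the scale the text refers to as float32 rounding.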

## License

MIT
