Metadata-Version: 2.4
Name: spectraltm
Version: 0.1.1
Requires-Dist: numpy>=1.20
Summary: Sparse Spectral Encoding for cold-tier vector memory (Rust port with PyO3 bindings)
Author-email: Gerald Enrique Nelson Mc Kenzie <lordxmen2k@gmail.com>
License: Apache-2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# SpectraLTM

[![PyPI version](https://img.shields.io/badge/pypi-v0.1.1-blue)](https://pypi.org/project/spectraltm/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

> **Rust + PyO3 port of the Sparse Spectral Encoding algorithm for cold-tier vector memory**

`spectraltm` is the AVX2-accelerated Rust implementation of the `spectral_codes` reference from the [Sparse Spectral Encoding for Cold-Tier Vector Memory](https://github.com/lordxmen2k/sparse-spectral-encoding) research project. It runs the same algorithm 5–10× faster than the Python reference and is exposed as a drop-in Python module.

## Features

- **SIMD-accelerated inner loop**: AVX2 fast-path on x86_64; portable scalar fallback for other platforms
- **Drop-in Python API**: `SpectralEncoder`, `SpectralIndex`, `.encode()`, `.add_embeddings()`, `.search()` — identical to the Python reference
- **Batch-first**: encode and search batches, not single vectors
- **24–192 B per chunk**: tiny cold-tier storage vs float32 embeddings (~1.5 KB per vector at dim=384)
- **Sub-millisecond search** on N=5K corpora at K=8 (0.22–0.32 ms/query on BEIR scifact); latency scales linearly with N and grows with K
- **Parity-tested**: `tests/test_spectraltm_rust_parity.py` cross-checks against the Python reference for top-1/top-10 agreement

## Installation

```bash
pip install spectraltm
```

Requires Python 3.10+ and `numpy>=1.20`. As of v0.1.1 the Python
bindings only accept plain Python lists (use `arr.tolist()` on any
numpy array before passing it in), due to a PyO3 / numpy ABI mismatch
that crashes on numpy >= 2.4. See the Quick start example for the
working pattern.
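
Because only plain lists cross the binding boundary, it can help to normalize and validate a batch before handing it over. The helper below is a minimal sketch, not part of `spectraltm`: the name `as_plain_lists` and its `dim` check are illustrative assumptions. It accepts anything with a `.tolist()` method (such as a numpy array) or any iterable of rows.

```python
def as_plain_lists(vectors, dim):
    """Return a 2-D batch as nested Python lists of floats.

    Hypothetical helper (not a spectraltm API): converts array-likes via
    .tolist() and checks that every row has exactly `dim` entries.
    """
    if hasattr(vectors, "tolist"):
        vectors = vectors.tolist()
    rows = [[float(x) for x in row] for row in vectors]
    for i, row in enumerate(rows):
        if len(row) != dim:
            raise ValueError(f"row {i} has length {len(row)}, expected {dim}")
    return rows


batch = as_plain_lists([(1, 2, 3), (4, 5, 6)], dim=3)
print(batch)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

Failing early on a ragged or mis-sized row gives a Python traceback instead of an error raised from inside the extension module.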

## Quick start

```python
import numpy as np
from spectraltm import SpectralEncoder, SpectralIndex

# Build the encoder (calibrate first on a representative sample).
# NOTE: pass plain Python lists, not numpy arrays. The Rust PyO3 binding
# crashes on numpy >= 2.4 due to an ABI mismatch (PyO3 0.29 was built
# against an older numpy ABI). The .tolist() path works on every numpy
# version and is what the parity test suite uses.
enc = SpectralEncoder(dim=384, top_k=64, mag_bits=8, phase_bits=8, norm_bits=8)
sample = np.random.randn(1000, 384).astype(np.float32)
enc.calibrate(sample.reshape(-1).tolist())

# Build the index
db = np.random.randn(10_000, 384).astype(np.float32)
idx = SpectralIndex(enc)
idx.add_embeddings(db.tolist())

# Search
queries = np.random.randn(100, 384).astype(np.float32)
top_idx, top_scores = idx.search_batch(queries.tolist(), top_k=10)
# search_batch returns nested Python lists of shape (n_queries, top_k);
# convert to numpy arrays if you want shape/dtype ergonomics.
top_idx = np.array(top_idx)        # shape: (100, 10)
top_scores = np.array(top_scores)  # shape: (100, 10)
```
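
Since the index is lossy, it is worth measuring how much of an exact top-k it recovers on your own data. The function below is a hedged sketch of such a check. `topk_overlap` is a hypothetical name, and the exact ids are assumed to come from your own brute-force baseline. It works directly on nested id lists of the shape `search_batch` returns.

```python
def topk_overlap(approx_ids, exact_ids):
    """Mean fraction of the exact top-k ids that the approximate search found.

    Both arguments are nested lists of shape (n_queries, top_k).
    Hypothetical helper for illustration; not part of spectraltm.
    """
    if not exact_ids or len(approx_ids) != len(exact_ids):
        raise ValueError("need the same non-zero number of queries in both lists")
    fractions = []
    for approx, exact in zip(approx_ids, exact_ids):
        exact_set = set(exact)
        fractions.append(len(exact_set & set(approx)) / len(exact_set))
    return sum(fractions) / len(fractions)


# Query 0 recovers 2 of 3 exact neighbours, query 1 recovers all 3.
print(topk_overlap([[1, 2, 3], [4, 5, 6]], [[1, 2, 9], [6, 5, 4]]))
```

The overlap ignores rank order inside the top-k; use nDCG or MRR, as in the benchmark table, when ordering matters.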

## Benchmarks — BEIR scifact (k=10)

spectraltm is a **lossy compressed index**, not a replacement for brute-force cosine. The table reports the K=8 setting; pick K based on your quality bar.

```
==========================================================================================
SUMMARY  (BEIR scifact, k=10, spec_k=8)
==========================================================================================
engine                                       nDCG@10  MRR@10  R@10    encode doc ms  encode q ms  search ms/q  B/chunk  total MB
------------------------------------------------------------------------------------------
brute-force cosine (all-MiniLM-L6-v2)        0.6451   0.6047  0.7833  2514.8         0.1          0.021        -        -
spectraltm K=8 (all-MiniLM-L6-v2)            0.1732   0.1512  0.2569  2514.8         0.1          0.224        24       0.12
brute-force cosine (BAAI/bge-large-en-v1.5)  0.7346   0.7013  0.8592  44604.3        0.5          0.023        -        -
spectraltm K=8 (BAAI/bge-large-en-v1.5)      0.1346   0.1198  0.1973  44604.3        0.5          0.317        26       0.13
```
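
The relative quality loss at K=8 can be read straight off the nDCG@10 column. The snippet below is a small worked calculation using only the four values in the table; the function name `relative_loss` is illustrative.

```python
def relative_loss(baseline, compressed):
    """Fraction of the baseline metric lost by the compressed index."""
    return (baseline - compressed) / baseline


# nDCG@10 pairs copied from the table: (brute-force cosine, spectraltm K=8).
ndcg_at_10 = {
    "all-MiniLM-L6-v2": (0.6451, 0.1732),
    "BAAI/bge-large-en-v1.5": (0.7346, 0.1346),
}

for model, (brute, spectral) in ndcg_at_10.items():
    print(f"{model}: {relative_loss(brute, spectral):.1%} of nDCG@10 lost at K=8")
# all-MiniLM-L6-v2: 73.2% of nDCG@10 lost at K=8
# BAAI/bge-large-en-v1.5: 81.7% of nDCG@10 lost at K=8
```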

Honest framing:

- **Brute-force cosine is the gold-standard reference.** Spectral layer quality is a steep function of K. At K=8 the table above shows a relative nDCG@10 loss of about 73% (all-MiniLM-L6-v2) to 82% (bge-large-en-v1.5); at K=64 the loss is ~12%. Pick K based on your quality bar.
- **`bytes_per_chunk` scales linearly with K** (24 B at K=8, 192 B at K=64 for dim=384). Storage cost is real, not just latency.
- **Search latency grows with K** (each of the K query bins triggers one column scan over all N entries). For BEIR scifact (N=5K) it's < 3 ms at K=64; at N=50K expect 10–30 ms with the scalar inner loop, much faster with AVX2.
- **`encode doc ms` / `encode q ms`** come from the same embedder call — they are not per-engine costs. The engine cost is bytes/chunk and total MB.
- **The right K depends on storage budget and quality bar.** K=32 is a reasonable middle ground (96 B/chunk, ~14% nDCG gap).
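
The storage figures above follow from simple arithmetic. The sketch below assumes 3 bytes per kept coefficient, which is inferred from the quoted sizes (24 B at K=8, 96 B at K=32, 192 B at K=64) rather than taken from the crate's on-disk format. It ignores any per-chunk overhead, such as the extra 2 B the bge-large row reports.

```python
def bytes_per_chunk(k, bytes_per_coeff=3):
    """Approximate encoded size; 3 B/coefficient is an inferred assumption."""
    return k * bytes_per_coeff


def float32_bytes(dim):
    """Size of one uncompressed float32 embedding."""
    return 4 * dim


dim = 384
for k in (8, 32, 64):
    size = bytes_per_chunk(k)
    ratio = float32_bytes(dim) / size
    print(f"K={k}: {size} B/chunk, {ratio:.0f}x smaller than float32")
# K=8: 24 B/chunk, 64x smaller than float32
# K=32: 96 B/chunk, 16x smaller than float32
# K=64: 192 B/chunk, 8x smaller than float32

# Whole-corpus footprint at N=5K and K=8, matching the table's "total MB".
print(5_000 * bytes_per_chunk(8) / 1e6)  # 0.12
```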

## When to use spectraltm

Use it when:
- You have a large cold-tier corpus (≥ 100K vectors) and can't keep float32 embeddings online
- Storage bandwidth dominates query latency (disk-resident index, RAM-constrained)
- You're willing to trade retrieval quality for ~64× storage compression at K=8 (24 B/chunk vs 1,536 B per float32 vector at dim=384)

Don't use it when:
- Brute-force cosine fits in memory (small corpus, fast disk) — the quality gap isn't worth the engineering
- You need exact top-K recall (spectral layer is lossy by design)
- Latency-critical path requires < 0.1 ms search (use a flat, HNSW, or IVF index instead)

## Project layout

```
SpectraLTM/
├── src/
│   ├── codes.rs             # SpectralCodes struct (Rust) + quantization
│   ├── encoder.rs           # SpectralEncoder (calibrate + encode)
│   ├── index.rs             # SpectralIndex (inverted mag/phase grids)
│   ├── simd.rs              # AVX2 + scalar inner loop
│   ├── error.rs             # SpectraltmError
│   └── python_bindings.rs   # PyO3 module + SpectralCodes/Index wrappers
├── tests/
│   ├── test_spectraltm_rust_parity.py    # cross-check vs Python reference
│   ├── generate_trace_vectors.py         # corpus fixtures
│   └── test_hypothesis_python_encoded.py # property tests
├── examples/
│   └── beir_scifact_eval.py  # the benchmark that produced the table above
└── pyproject.toml            # maturin build config
```

## How it works

```
embeddings ─┐
            ├─► [encoder.calibrate]  ─► quantizer grids (mags, phases)
            │
            ├─► [encoder.encode]      ─► SpectralCodes (idx, mag_q, phase_q, norm_q)
            │
            └─► [index.add_embeddings] ─► dense (N, F) mag + phase grids

queries  ───┬─► [encoder.encode] ─► SpectralCodes (query)
            │
            └─► [index.search]   ─► (N, F) grid lookup ─► top-K scores
```
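
The encode step in the diagram can be illustrated with a pure-Python sketch. Everything here is an assumption made for illustration, since this README does not spell out the algorithm: that the spectrum is a real DFT of the embedding, that the `top_k` largest-magnitude bins are kept, and that magnitudes and phases are quantized uniformly. The names `naive_rfft`, `quantize`, and `encode_sketch` are hypothetical, `mag_max` stands in for whatever `calibrate` learns, and `norm_q` is omitted.

```python
import cmath
import math


def naive_rfft(x):
    """O(n^2) real DFT returning bins 0..n//2 (fine for a demo)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n // 2 + 1)
    ]


def quantize(value, lo, hi, bits):
    """Uniformly map value in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def encode_sketch(vec, top_k, mag_max, mag_bits=8, phase_bits=8):
    """Keep the top_k strongest bins and quantize their magnitude and phase."""
    spectrum = naive_rfft(vec)
    strongest = sorted(range(len(spectrum)), key=lambda f: abs(spectrum[f]), reverse=True)
    keep = sorted(strongest[:top_k])
    return {
        "idx": keep,
        "mag_q": [quantize(abs(spectrum[f]), 0.0, mag_max, mag_bits) for f in keep],
        "phase_q": [
            quantize(cmath.phase(spectrum[f]), -math.pi, math.pi, phase_bits) for f in keep
        ],
    }


# Two cosines at bins 3 and 5; the bin-3 tone is twice as strong.
signal = [
    2 * math.cos(2 * math.pi * 3 * t / 16) + math.cos(2 * math.pi * 5 * t / 16)
    for t in range(16)
]
codes = encode_sketch(signal, top_k=2, mag_max=16.0)
print(codes["idx"])  # [3, 5]
```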

Per-frequency-bin scoring is the hot path. SIMD vectorizes the inner loop so a single query bin becomes a packed float32 dot-product over all N corpus entries for that bin. Top-K across bins is a small partial-sort.
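
The scan described above can be sketched in plain Python. This is an illustration under stated assumptions, not the crate's kernel: it assumes a query bin scores a corpus entry as `q_mag * d_mag * cos(q_phase - d_phase)`, a formula this README does not confirm. The grids are nested lists of already-dequantized floats, and `score_query` is a hypothetical name. The Rust code runs the inner loop with AVX2 rather than a Python `for`.

```python
import heapq
import math


def score_query(query_bins, mag_grid, phase_grid, top_k):
    """Accumulate per-bin scores over all N entries, then take the top_k.

    query_bins: list of (bin_index, magnitude, phase) for the query.
    mag_grid / phase_grid: N rows by F columns, one row per corpus entry.
    """
    n = len(mag_grid)
    scores = [0.0] * n
    for f, q_mag, q_phase in query_bins:
        # One column scan per query bin: visit bin f of every corpus entry.
        for doc in range(n):
            d_mag = mag_grid[doc][f]
            if d_mag:  # a zero magnitude means the entry did not keep this bin
                scores[doc] += q_mag * d_mag * math.cos(q_phase - phase_grid[doc][f])
    # Partial sort: only the top_k entries are ordered.
    top = heapq.nlargest(top_k, range(n), key=scores.__getitem__)
    return top, [scores[i] for i in top]


mag_grid = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [1.0, 1.5, 0.0, 0.0]]
phase_grid = [[0.0] * 4 for _ in range(3)]
query = [(0, 1.0, 0.0), (1, 1.0, 0.0)]
print(score_query(query, mag_grid, phase_grid, top_k=2))  # ([2, 1], [2.5, 2.0])
```

The outer loop runs once per query bin and the inner loop touches all N entries, which is why latency grows with both K and N.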

## Development

```bash
# Editable install (rebuilds on Rust changes):
python -m maturin develop --release

# Or build a wheel:
python -m maturin build --release
pip install target/wheels/spectraltm-*.whl
```
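
After either build path, a quick check that the module is visible to the current interpreter can save a confusing debugging session. The sketch below uses only the standard library; the helper name `is_importable` is illustrative.

```python
import importlib.util


def is_importable(module_name):
    """True if the current interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Prints False if the build went into a different virtual environment.
print("spectraltm importable:", is_importable("spectraltm"))
```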

See [HOW_TO_INSTALL.md](HOW_TO_INSTALL.md) for platform-specific setup (Rust toolchain, Python ABI compatibility, NumPy linkage).

## Citation

```
@misc{mckenzie2026spectral,
  title={Sparse Spectral Encoding for Cold-Tier Vector Memory},
  author={Mc Kenzie, Gerald Enrique Nelson},
  year={2026},
  month={6},
  day={28},
  doi={10.5281/zenodo.21005661},
  howpublished={Harmonic Resonance Indexing research project},
  url={https://github.com/lordxmen2k/sparse-spectral-encoding},
  note={Apache License 2.0. Contact: lordxmen2k@gmail.com}
}
```

**Metadata**

- **Author:** Gerald Enrique Nelson Mc Kenzie
- **Date:** 2026-06-28
- **DOI:** [10.5281/zenodo.21005661](https://doi.org/10.5281/zenodo.21005661)
- **Repository:** [github.com/lordxmen2k/sparse-spectral-encoding](https://github.com/lordxmen2k/sparse-spectral-encoding)
- **License:** Apache License 2.0
- **Contact:** lordxmen2k@gmail.com

## License

Apache-2.0 — see [LICENSE](LICENSE).

---

Built as a Rust port of [`spectral_codes`](https://github.com/lordxmen2k/sparse-spectral-encoding) — the two projects share zero source code per the explicit separation rule. The parity test (`tests/test_spectraltm_rust_parity.py`) is the only cross-project coupling and is the regression check that catches SpectraLTM drifting from the reference algorithm.
