Metadata-Version: 2.4
Name: local-llm-embed
Version: 0.1.0
Summary: Sentence embeddings from any local causal LLM, no fine-tuning.
Project-URL: Homepage, https://github.com/youichi-uda/local-llm-embed
Project-URL: Issues, https://github.com/youichi-uda/local-llm-embed/issues
Author: youichi uda
License: MIT
License-File: LICENSE
Keywords: embeddings,llm,ollama,rag,retrieval,sentence-embeddings,transformers,whitening
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24
Requires-Dist: torch>=2.1
Requires-Dist: transformers>=4.45
Provides-Extra: dev
Requires-Dist: huggingface-hub>=0.24; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: hub
Requires-Dist: huggingface-hub>=0.24; extra == 'hub'
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == 'test'
Description-Content-Type: text/markdown

# local-llm-embed

> Get sentence embeddings out of any local causal LLM you already have running. No fine-tuning, no separate encoder model.

If you're running Llama / Qwen / Phi / Mistral / TinyLlama via `transformers`, `Ollama`, `vLLM`, or `llama.cpp` and want embeddings for RAG / retrieval / classification — this library extracts them from the model you already have, in a few lines of code.

## Why

| | dedicated encoder (BGE-M3, MiniLM) | this library |
|---|---|---|
| Need a separate ~500 MB model load | yes | **no** (reuses your LLM) |
| Need fine-tuning | already trained | **none** |
| Works on any HF causal LM | n/a | **yes** |
| STS Spearman vs MiniLM-L6 | 0.867 (baseline) | 0.806 (Phi-3.5) |
| Banking77 accuracy vs MiniLM-L6 | 0.550 (baseline) | **0.554 (wins)** |

The trade-off is honest: dedicated encoders still beat these train-free LLM embeddings on pure semantic similarity (STS) by ~6 points. But on *classification-style* tasks like Banking77, this library matches or slightly beats the baseline, using a model you already have in memory.

## Install

```bash
pip install local-llm-embed
```

Or with the Hugging Face Hub helpers (for downloading pre-fit whiteners):

```bash
pip install "local-llm-embed[hub]"
```

## Quick start

```python
from local_llm_embed import LocalLLMEmbedder

embedder = LocalLLMEmbedder("Qwen/Qwen2.5-0.5B-Instruct")
emb = embedder.encode(["The cat sat on the mat.", "A feline rested."])
print(emb.shape)  # (2, 896)
print(emb @ emb.T)  # cosine similarity matrix
```
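
The dot products above act as cosine similarities, so a minimal retrieval loop
needs nothing beyond numpy (the corpus below is illustrative):

```python
import numpy as np

corpus = [
    "How do I reset my password?",
    "Shipping takes 3-5 business days.",
    "Refunds are available within 30 days.",
]
corpus_emb = embedder.encode(corpus)

query_emb = embedder.encode(["when will my order arrive?"])
scores = (query_emb @ corpus_emb.T)[0]  # cosine similarity per document
print(corpus[int(np.argmax(scores))])   # best match (expect: shipping)
```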

By default this uses `prefix+whiten` (the variation that was strongest across
the board in our benchmarks). The whitener is fit lazily on the first batch
you encode; for better statistics, fit it on a calibration set (see below).

### Use a calibration set for better whitening

```python
calibration = [...]  # ~1000 representative texts from your domain
embedder.fit_whitener(calibration)
embedder.save_whitener("./domain_whiten.npz")

# later, in another session:
embedder = LocalLLMEmbedder("Qwen/Qwen2.5-0.5B-Instruct",
                             whitener_path="./domain_whiten.npz")
```
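
With the `[hub]` extra installed, a pre-fit whitener published on the Hub can
be fetched with `huggingface_hub` and loaded the same way; the repo id and
filename below are placeholders, not real artifacts:

```python
from huggingface_hub import hf_hub_download

# hypothetical repo and filename; point these at a real whitener artifact
path = hf_hub_download(repo_id="your-org/qwen2.5-whitener",
                       filename="domain_whiten.npz")
embedder = LocalLLMEmbedder("Qwen/Qwen2.5-0.5B-Instruct", whitener_path=path)
```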

### Pick a different variation

```python
embedder = LocalLLMEmbedder(
    "microsoft/Phi-3.5-mini-instruct",
    variation="echo+whiten",   # best for STS
    layer="final",
    pooling="weighted_mean",
)
```

## Variations

Three train-free recipes are bundled. They're combinations of well-known
techniques, ranked by how well they performed in our internal benchmark
(STS-B validation, Banking77; see `BENCHMARKS.md`):

- **`prefix+whiten`** *(default)* — feed the text in, take the chosen
  layer + pooling, then center & ZCA-whiten the resulting matrix. Whitening
  removes the anisotropy ("everything-looks-similar") problem that causal
  LMs suffer from, and is worth a consistent +0.12 to +0.22 STS Spearman
  over no whitening across models; see the numpy sketch after this list.
- **`echo+whiten`** — duplicate the text and pool only over the second
  copy, so each pooled token has seen the full sentence (works around the
  causal mask). Best STS combination in our tests; also sketched below.
- **`prefix`** — no transformation. For comparison / debugging.
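
For intuition, here is a minimal numpy sketch of the centering + ZCA-whitening
step (illustrative only, not the library's internal code; `eps` is a small
stabilizer added for the example):

```python
import numpy as np

def fit_zca(X: np.ndarray, eps: float = 1e-5):
    """Fit mean + ZCA whitening matrix on an (n, d) embedding matrix."""
    mu = X.mean(axis=0)
    cov = np.cov(X - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)  # eigendecompose the covariance
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return mu, W

def apply_zca(X: np.ndarray, mu: np.ndarray, W: np.ndarray) -> np.ndarray:
    Z = (X - mu) @ W
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # unit rows for cosine
```

After this transform the embedding dimensions are decorrelated with roughly
unit variance, which is what counteracts the anisotropy of causal-LM hidden
states.

Echo pooling, sketched with plain `transformers` (again illustrative: the
token boundary between the two copies is approximate for BPE tokenizers, and
a careful implementation tracks token offsets):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

text = "The cat sat on the mat."
n_first = tok(text, return_tensors="pt").input_ids.shape[1]
both = tok(text + " " + text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**both).hidden_states[-1][0]  # (seq_len, dim)
emb = hidden[n_first:].mean(dim=0)  # pool over the second copy only
emb = emb / emb.norm()
```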

## Hardware notes

Causal LMs at fp32 are heavy on RAM. We default to **bf16** if your CPU
advertises the `avx512_bf16` flag (AMD Zen 4 and newer, plus recent Intel
server CPUs; most current Intel consumer desktops do not expose AVX-512); on
GPU just pass `device="cuda"` and the same bf16 default applies.
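
To check for the flag on Linux, something like this works (a standalone
helper, not part of the library's API):

```python
def cpu_has_avx512_bf16() -> bool:
    """True if /proc/cpuinfo lists the avx512_bf16 flag (Linux only)."""
    try:
        with open("/proc/cpuinfo") as f:
            return "avx512_bf16" in f.read()
    except OSError:
        # non-Linux or unreadable cpuinfo: assume no native bf16
        return False
```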

## Limitations (be honest)

- For *pure* semantic textual similarity, dedicated contrastive encoders
  (BGE-M3, all-MiniLM-L6-v2) still win by ~6 STS points. This library is
  for the case where you already have a causal LM loaded.
- Whitening requires a calibration set of at least a few hundred texts.
  Without it, self-whitening is used at encode time (fits on the batch
  you're encoding). That's worse than a good calibration set but still
  better than raw probes.
- `bidirectional` inference (LLM2Vec-style attention-mask removal) is
  **not bundled**. We benchmarked it; it consistently hurt without
  fine-tuning and we don't want to ship a footgun.

## Acknowledgements

The technique is a combination of
[BERT-whitening (Su et al. 2021)](https://arxiv.org/abs/2103.15316),
[Echo Embeddings (Springer et al. 2024)](https://arxiv.org/abs/2402.15449),
and [PromptEOL (Jiang et al. 2023)](https://arxiv.org/abs/2307.16645).
This library packages the train-free subset.

## License

MIT.
