Metadata-Version: 2.4
Name: longctx
Version: 0.2.0
Summary: Open long-context inference stack: retrieval + open weights, no closed parts.
Author: TheTom
License: Apache-2.0
Project-URL: Homepage, https://github.com/TheTom/longctx
Project-URL: Repository, https://github.com/TheTom/longctx
Project-URL: Documentation, https://github.com/TheTom/longctx#readme
Keywords: llm,long-context,retrieval,rag,vllm,open-weights
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Requires-Dist: pandas>=2.0
Requires-Dist: requests>=2.31
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: faiss-cpu>=1.7.4
Provides-Extra: serve
Requires-Dist: vllm>=0.7; extra == "serve"
Requires-Dist: torch>=2.4; extra == "serve"
Provides-Extra: eval
Requires-Dist: pyarrow>=15.0; extra == "eval"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Dynamic: license-file

# longctx

Open long-context inference stack. Retrieval + open weights, no closed parts.

A small library that bundles the components needed to reach Anthropic-class long-context retrieval performance on a single accessible GPU using only open weights.

## What it is

`longctx` is a thin wrapper over standard tools:

- **Retrieval**: sentence-transformers (bi-encoder) + faiss
- **Generation**: any OpenAI-compatible LLM endpoint (vLLM, SGLang, llama.cpp server)
- **Defaults tuned for**: Qwen2.5-14B-Instruct-1M, but works with any instruction-following open model
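
Under the hood there is nothing exotic. The sketch below composes the same pieces directly with the underlying libraries (not the `longctx` API); the endpoint URL, port, and model name are placeholders for whatever your OpenAI-compatible server exposes:

```python
# A rough, stand-alone sketch of the stack longctx wraps; not the library's API.
import faiss
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

query = "What was the third response about regulatory compliance?"
candidates = ["Response 1: ...", "Response 2: ...", "Response 3: ..."]

# 1. Retrieval: bi-encoder embeddings + a faiss inner-product index.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cand_emb = np.asarray(embedder.encode(candidates, normalize_embeddings=True), dtype="float32")
query_emb = np.asarray(embedder.encode([query], normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(cand_emb.shape[1])
index.add(cand_emb)
_, top = index.search(query_emb, 2)  # indices of the top-k candidates

# 2. Generation: any OpenAI-compatible chat completions endpoint.
context = "\n\n".join(candidates[i] for i in top[0])
resp = requests.post(
    "http://localhost:5050/v1/chat/completions",  # placeholder endpoint
    json={
        "model": "Qwen/Qwen2.5-14B-Instruct-1M",  # placeholder model name
        "messages": [{"role": "user", "content": f"{context}\n\n{query}"}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```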

## Why

A stack of `longctx` defaults running Qwen2.5-14B-Instruct-1M on a single MI300X scored **0.822 on the MRCR v2 8K bin** (n=82, mass-validated 2026-05-06), beating the headline number that a $29M-funded closed-weight startup published for its custom subquadratic architecture. The architectural-moat narrative wasn't load-bearing for this workload. Retrieval + open weights solve it.

This library exists so the rest of the open ecosystem can reproduce that result with one `pip install`.

## Install

```bash
pip install longctx
```

For local vLLM serving:

```bash
pip install "longctx[serve]"
```
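
The library talks to whatever OpenAI-compatible endpoint you run. Before wiring up the client, a quick sanity check that the server answers (plain `requests`, not a `longctx` call; port 5050 is the quickstart default below, adjust to your setup):

```python
import requests

# Expect an OpenAI-compatible server (vLLM, SGLang, llama.cpp server, ...) here.
resp = requests.get("http://localhost:5050/v1/models", timeout=5)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # model IDs the server exposes
```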

## Quickstart

```python
from longctx import LongCtxClient

# Defaults: sentence-transformers/all-MiniLM-L6-v2 + local vLLM at port 5050
client = LongCtxClient()

# Pass your candidate chunks and a query
result = client.ask(
    query="What was the third response about regulatory compliance?",
    candidates=[
        "Response 1: brief on regulatory compliance...",
        "Response 2: legal analysis of...",
        "Response 3: detailed compliance walkthrough...",
        # ... up to thousands of candidates
    ],
    top_k=8,
)

print(result.content)
print(f"Retrieved indices: {result.retrieved_indices}")
print(f"Prompt tokens: {result.prompt_tokens}")
```
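
`candidates` are whatever units you want retrieval to operate over. If you start from one long document rather than pre-split messages, you have to chunk it yourself; the helper below is a plain-Python illustration (not part of the `longctx` API), with arbitrary example values for chunk size and overlap:

```python
def chunk_text(text: str, chunk_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping character windows.

    Illustrative helper for building the `candidates` list; not a longctx API.
    Chunk size and overlap are arbitrary starting points, tune for your data.
    """
    step = chunk_chars - overlap
    return [text[i:i + chunk_chars] for i in range(0, max(len(text) - overlap, 1), step)]

# document = open("meeting_transcript.txt").read()
# result = client.ask(query="...", candidates=chunk_text(document), top_k=8)
```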

## Custom embedder

```python
from longctx import LongCtxClient, RetrievalPipeline

# Default uses MiniLM-L6 (23M params, CPU-friendly).
# For higher quality at the cost of compute:
pipeline = RetrievalPipeline(embedder_model="BAAI/bge-large-en-v1.5")
client = LongCtxClient(pipeline=pipeline)
```

## Notes on rerankers

`longctx` does not enable cross-encoder reranking by default. Off-the-shelf rerankers (ms-marco-MiniLM, bge-reranker-base) **degraded** retrieval quality on MRCR-style tasks in our 2026-05-06 testing. They are trained for web-search relevance, which doesn't transfer to "find the Nth message of type X" task semantics.

A reranker fine-tuned on MRCR-style retrieval data is on the roadmap. Until then, pure bi-encoder retrieval is the default.
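
If you want to reproduce the reranker comparison on your own data, the sentence-transformers `CrossEncoder` API is all you need; this is a stand-alone sketch, not something `longctx` exposes:

```python
from sentence_transformers import CrossEncoder

query = "What was the third response about regulatory compliance?"
candidates = ["Response 1: ...", "Response 2: ...", "Response 3: ..."]

# Off-the-shelf web-search reranker; scores each (query, candidate) pair.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])

# Order candidates by reranker score, then compare against plain bi-encoder
# retrieval on your own MRCR-style examples.
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```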

## Status

Pre-alpha, v0.2.0. APIs may change.

### Headline numbers (mass-validated)

End-to-end validation 2026-05-06 on an AMD MI300X with vLLM-served Qwen2.5-14B-Instruct-1M, default `LongCtxClient` config (sentence-transformers MiniLM-L6 + faiss, `top_k=8`):

| MRCR v2 8-needle bin | pipeline | n | avg_score | prefix_pass |
| -------------------- | -------- | -- | --------- | ----------- |
| 8K  (16K-32K char)   | RAG          | 82 | **0.822** | 100% |
| 32K (64K-128K char)  | RAG          | 98 | **0.697** |  97% |
| 64K (128K-256K char) | RAG          | 95 | **0.641** |  98% |
| 64K (128K-256K char) | chunked-RAG  | 95 | **0.670** |  98% |

Reference baseline: SubQ Inc.'s published MRCR headline = 0.659 (closed-weight, custom subquadratic architecture, $29M funding).

All three bins clear the closed-weight headline with the right pipeline (the 64K bin needs chunked-RAG; plain RAG scores 0.641 there). Plain RAG over standard attention is competitive with claimed state-of-the-art subquadratic architectures on MRCR-style retrieval workloads at every bin we measured.

### Other tested generators (single-run, n=30, not mass-validated)

- Qwen2.5-7B-Instruct + RAG: 0.567 (2.4× faster, fits 16GB GPU)
- Qwen2.5-32B-Instruct + RAG: 0.237 (vanilla 32K context window; training-data fit, not the pipeline, limits the result)
- Qwen3-Next-80B-A3B + RAG: 0.281 (linear-attention hybrid, MoE)

Single-run scores at n=30 have substantial variance (we observed ±0.05 swings between adjacent runs of the same config). Trust the mass-validated numbers above for headline claims.

Mistral-7B-Instruct-v0.3 and Qwen3-8B failed with the default Qwen2.5-style template (prefix-first instruction). Templates are provided for both: `longctx.templates.MISTRAL_VERBATIM_TEMPLATE` and `longctx.templates.QWEN3_NO_THINK_TEMPLATE`. Validation against MRCR for these templates is on the roadmap.
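
The template constants import as shown below. How a template is handed to the client is not pinned down here; the `template=` keyword in the sketch is an assumption about `LongCtxClient`'s signature, so check the actual constructor before relying on it:

```python
from longctx import LongCtxClient
from longctx.templates import MISTRAL_VERBATIM_TEMPLATE

# `template=` is a hypothetical keyword argument; verify against the real
# LongCtxClient signature before use.
client = LongCtxClient(template=MISTRAL_VERBATIM_TEMPLATE)
```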

### Reproduce

```bash
longctx-bench --data-dir /path/to/mrcr/v2 --model qwen2.5-14b-instruct-1m \
    --bins 8k 32k 64k --n 80 --include-chunked
```

## License

Apache 2.0.
