Metadata-Version: 2.4
Name: apohara-vllm-plugin
Version: 0.1.0
Summary: Apohara ContextForge plugin for vLLM V1 — multi-agent KV-cache coordination, JCR Safety Gate (INV-15), RotateKV INT4 hooks, on AMD Instinct MI300X.
Author-email: "Pablo M. Suarez" <suarezpm@csnat.unt.edu.ar>
Maintainer-email: "Pablo M. Suarez" <suarezpm@csnat.unt.edu.ar>
License: Apache-2.0
Project-URL: Homepage, https://github.com/SuarezPM/Apohara_Context_Forge
Project-URL: Repository, https://github.com/SuarezPM/Apohara_Context_Forge
Project-URL: Documentation, https://github.com/SuarezPM/Apohara_Context_Forge#readme
Project-URL: Changelog, https://github.com/SuarezPM/Apohara_Context_Forge/blob/main/CHANGELOG.md
Project-URL: Audit, https://github.com/SuarezPM/Apohara_Context_Forge/blob/main/AUDIT.md
Project-URL: Paper, https://doi.org/10.5281/zenodo.20114594
Project-URL: Issues, https://github.com/SuarezPM/Apohara_Context_Forge/issues
Keywords: vllm,kv-cache,multi-agent,llm,amd,rocm,mi300x,kv-coordination,speculative-decoding,judge-consistency
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Requires-Dist: apohara-context-forge>=6.1.0
Provides-Extra: vllm
Requires-Dist: vllm>=0.9.0; extra == "vllm"
Provides-Extra: test
Requires-Dist: pytest>=8.3; extra == "test"
Requires-Dist: pytest-asyncio>=0.24; extra == "test"
Requires-Dist: numpy>=1.26; extra == "test"
Dynamic: license-file

# apohara-vllm-plugin

Multi-agent KV-cache coordination as a [vLLM V1](https://vllm.ai)
plugin. Drop it next to vLLM and it self-registers through the
`vllm.general_plugins` entry-point group: no patching, no fork.

```bash
pip install apohara-vllm-plugin
```

Inside vLLM, the plugin is responsible for:

1. **Anchor-aware KV-block routing** via SimHash LSH lookup against the
   ContextForge registry (cross-agent block reuse).
2. **RotateKV pre-RoPE INT4 quantization hooks** (INVARIANT 10:
   pre-RoPE only).
3. **JCR Safety Gate (INV-15) enforcement** — judge / critic agents
   with `JCR risk > 0.7` are forced into dense prefill, bypassing the
   shared cache. See [arXiv:2601.08343](https://arxiv.org/abs/2601.08343).
4. **Honest metrics** — every flag in the hook's return dict reflects
   state (what actually ran), not intent (what the config asked for).
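The SimHash lookup in item 1 can be illustrated with a small, self-contained sketch. This is not the `LSHTokenMatcher` implementation: the function names `simhash` and `hamming`, the 64-bit fingerprint width, and the bag-of-tokens feature choice are all assumptions made for illustration.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed width for this sketch only


def simhash(tokens: list[int]) -> int:
    """Bag-of-tokens SimHash: similar token sets tend to share most bits."""
    counts = [0] * FINGERPRINT_BITS
    for tok in tokens:
        # Stable 64-bit hash per token (Python's built-in hash() is salted).
        digest = hashlib.blake2b(str(tok).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(FINGERPRINT_BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the votes for it are positive.
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


prefix_a = [101, 7, 7, 42, 9001]
prefix_b = [101, 7, 7, 42, 9002]  # differs from prefix_a in one token
print(hamming(simhash(prefix_a), simhash(prefix_b)))
```

A registry built on this idea would treat two blocks as reuse candidates when the Hamming distance between their fingerprints falls under some threshold; the threshold itself is a tuning choice that this sketch does not fix.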

This is the thin published shim over the in-tree implementation at
[`apohara_context_forge.serving.romy_plugin`](https://github.com/SuarezPM/Apohara_Context_Forge/blob/main/apohara_context_forge/serving/romy_plugin.py).

## Quick usage

### Inside vLLM (automatic)

vLLM walks `vllm.general_plugins` at worker startup. No code change is needed:

```bash
pip install vllm apohara-vllm-plugin
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-235B-A22B
```

The vLLM startup log should then contain:

```
ROMY plugin initialised: worker=… deps={…}
```

Cross-worker KV reuse is configured separately through
`--kv-transfer-config` (LMCache), not by this plugin. See
[`LMCACHE.md`](https://github.com/SuarezPM/Apohara_Context_Forge/blob/main/LMCACHE.md).

### Manually (for tests / inspection)

```python
from apohara_vllm_plugin import register

plugin = register()
assert plugin.is_initialized()
print(plugin.get_stats())
```

The plugin is constructible without vLLM installed.

### Wiring real ContextForge dependencies

By default the plugin runs as a no-op telemetry surface (every flag in
the metadata dict reports `False` / `None` honestly). Inject the real
subsystems through `vLLMRomyPlugin(...)`:

```python
from apohara_vllm_plugin import vLLMRomyPlugin, ROMYConfig
from apohara_context_forge.quantization.rotate_kv import (
    RotateKVConfig, RotateKVQuantizer,
)
from apohara_context_forge.dedup.lsh_engine import LSHTokenMatcher
from apohara_context_forge.safety.jcr_gate import JCRSafetyGate
from apohara_context_forge.metrics.collector import MetricsCollector

plugin = vLLMRomyPlugin(
    ROMYConfig(),
    quantizer=RotateKVQuantizer(RotateKVConfig()),
    lsh_matcher=LSHTokenMatcher(),
    jcr_gate=JCRSafetyGate(),
    metrics=MetricsCollector(),
)
plugin.initialize("worker_0", vllm_config={})
```

`pre_attention_hook` / `post_attention_hook` are unit-tested,
importable utilities for inspecting reuse/quantization decisions; they
are NOT wired into the vLLM runtime (no such vLLM platform attention-hook
API exists). The runtime cross-worker KV path is config-driven via
`--kv-transfer-config` (LMCache).

## Honest semantics

V6.1+ flags in the pre-attention hook's return dict:

| Flag                        | True iff                                                                 |
|----------------------------|---------------------------------------------------------------------------|
| `quantization_attempted`    | `enable_quantization=True` *and* a quantizer was wired                    |
| `quantization_applied`      | a quantizer was wired *and* it actually executed without raising         |
| `quantized` *(alias)*       | same as `quantization_applied` — kept for back-compat                    |
| `pre_rope`                  | always `True` — INV-10: this hook never operates on post-RoPE tensors    |
| `anchor_match`              | `None` if no LSH matcher wired; else lookup descriptor                   |
| `jcr_dense`                 | `True` iff JCR Safety Gate fired INV-15 for this call                    |

Returning `True` when nothing happened is the pattern we're explicitly
fixing in V6.1 — see the project root [`AUDIT.md`](https://github.com/SuarezPM/Apohara_Context_Forge/blob/main/AUDIT.md).
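The table's state-over-intent rule can be sketched in a few lines. This is an illustrative stand-in, not the plugin's code: `build_flags` and its parameters are hypothetical, the quantizer and LSH lookup are modelled as plain callables, and the sketch assumes a quantizer only runs when `enable_quantization` is set. The `0.7` threshold is the one stated for the JCR Safety Gate above.

```python
from typing import Any, Callable, Optional

JCR_RISK_THRESHOLD = 0.7  # risk strictly above this forces dense prefill


def build_flags(
    enable_quantization: bool,
    quantizer: Optional[Callable[[], None]] = None,
    anchor_lookup: Optional[Callable[[], Any]] = None,
    jcr_risk: Optional[float] = None,
) -> dict[str, Any]:
    """Report what actually ran, never what the config merely requested."""
    attempted = enable_quantization and quantizer is not None
    applied = False
    if attempted:
        try:
            quantizer()
            applied = True  # set only after the call returned without raising
        except Exception:
            applied = False
    return {
        "quantization_attempted": attempted,
        "quantization_applied": applied,
        "quantized": applied,  # back-compat alias
        "pre_rope": True,  # INV-10: pre-RoPE only
        "anchor_match": anchor_lookup() if anchor_lookup is not None else None,
        "jcr_dense": jcr_risk is not None and jcr_risk > JCR_RISK_THRESHOLD,
    }


def failing_quantizer() -> None:
    raise RuntimeError("simulated quantizer failure")


print(build_flags(True, quantizer=failing_quantizer))
```

With a quantizer that raises, `quantization_attempted` is `True` while `quantization_applied` and `quantized` stay `False`, which is the distinction the table draws.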

## Citation

If this plugin or the underlying mechanisms help your work, please cite:

```bibtex
@misc{contextforge,
  author    = {Suarez, Pablo M.},
  title     = {{ContextForge: A Unified KV-Cache Coordination Layer
                for Multi-Agent LLM Pipelines on AMD Instinct MI300X}},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20114594},
  url       = {https://doi.org/10.5281/zenodo.20114594}
}
```

## License

Apache-2.0.
