Metadata-Version: 2.4
Name: memkv-lmcache
Version: 1.0.1
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: pytest >=7 ; extra == 'test'
Requires-Dist: torch >=2.1 ; extra == 'test'
Provides-Extra: test
Summary: LMCache storage plugin that targets the MemKV context memory store; gives vLLM a MemKV-backed prefix cache via lmcache_connector.
Author: MinIO, Inc.
License: LicenseRef-Proprietary
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/miniohq/memkv
Project-URL: Documentation, https://docs.min.io/memkv/

# memkv-lmcache

LMCache `StoragePluginInterface` backend that persists KV chunks in a
remote MemKV cluster. Loaded as a vendor plugin via LMCache's
`storage_plugins` dynamic loader — no patches to LMCache's tree.

vLLM gets a MemKV-backed prefix KV-state path for free through
`vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py` —
no separate vLLM connector package is required.

## Build

```bash
cd lmcache-plugin
pip install maturin
maturin develop --release      # local dev install
# or
maturin build --release        # wheel under target/wheels/
pip install target/wheels/memkv_lmcache-*.whl
```

The wheel bundles a native PyO3 extension built from the same
`memkv-client` crate the NIXL plugin and the sglang plugin use, so
RDMA/TCP transport selection works the same way across all three.

## Configure the MemKV connection

The plugin reads the standard MemKV config chain: the YAML file named
by `MEMKV_CONFIG` first, then `MEMKV_*` environment variables:

```bash
export MEMKV_SERVERS="10.0.0.10:9900,10.0.0.11:9900"
export MEMKV_RDMA_DEVICES="mlx5_0,mlx5_1"
export MEMKV_AUTH_KEY="<64-hex>"
# optional:
# export MEMKV_TRANSPORT=auto
# export MEMKV_CONFIG=/etc/memkv.yaml
```
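
A pre-flight check can catch malformed values before the engine starts.
The sketch below is not part of the plugin: `parse_servers` and
`check_auth_key` are hypothetical helpers, and the assumption that
`MEMKV_AUTH_KEY` is exactly 64 hexadecimal characters is read off the
`<64-hex>` placeholder above rather than off the MemKV client.

```python
import string


def parse_servers(raw):
    """Split a comma-separated host:port list into (host, port) tuples."""
    servers = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    if not servers:
        raise ValueError("no servers listed")
    return servers


def check_auth_key(key):
    """True when key is exactly 64 hex characters (assumed format)."""
    return len(key) == 64 and all(c in string.hexdigits for c in key)


# Same value as the MEMKV_SERVERS export above.
servers = parse_servers("10.0.0.10:9900,10.0.0.11:9900")
```

In practice the inputs would come from `os.environ`; literal values are
used here so the sketch runs on its own.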

## Configure LMCache

Add the plugin to your LMCache YAML config. `max_local_cpu_size` must
be `> 0` because the plugin uses `LocalCPUBackend`'s allocator to stage
retrieved tensors:

```yaml
chunk_size: 64
local_cpu: true
max_local_cpu_size: 5
storage_plugins: memkv
extra_config:
  storage_plugin.memkv.module_path: memkv_lmcache.backend
  storage_plugin.memkv.class_name: MemKVStorageBackend
```
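
The two `extra_config` keys name a module and a class to import at
startup. As a rough illustration of that kind of dynamic loading (this
is not LMCache's loader code, and `load_plugin_class` is a hypothetical
helper), the lookup might be sketched as:

```python
import importlib


def load_plugin_class(extra_config, plugin_name):
    """Resolve storage_plugin.<name>.module_path / class_name to a class."""
    prefix = f"storage_plugin.{plugin_name}."
    module_path = extra_config[prefix + "module_path"]
    class_name = extra_config[prefix + "class_name"]
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in values so the sketch runs without memkv_lmcache installed.
demo_config = {
    "storage_plugin.demo.module_path": "collections",
    "storage_plugin.demo.class_name": "OrderedDict",
}
plugin_cls = load_plugin_class(demo_config, "demo")
```

A typo in either key surfaces as a `KeyError`, `ImportError`, or
`AttributeError` in this sketch, so running the equivalent import by
hand is a cheap way to validate the config before launching an engine.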

## Launch vLLM with LMCache + MemKV

```bash
# Set the JSON first: a same-line VAR=... prefix is not visible to "$VAR"
# expansion on that command line.
KV_TRANSFER_CONFIG='{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
LMCACHE_CONFIG_FILE=/etc/lmcache.yaml \
vllm serve meta-llama/Llama-3-8B \
    --tensor-parallel-size 1 \
    --kv-transfer-config "$KV_TRANSFER_CONFIG"
```

LMCache's connector picks up the `storage_plugins` entry at startup and
routes prefix KV reads/writes through `MemKVStorageBackend`.

## What's implemented

| Method                    | Status                                                                                                                                                                                                                                                                 |
| ------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `contains`                | yes (in-process `_meta` check — a key only counts as present when its shape/dtype metadata is on file, so a fresh process reports _miss_ for server-resident bytes it cannot reconstruct); `batched_contains` falls through to `StoragePluginInterface`'s default loop |
| `exists_in_put_tasks`     | yes (in-process tracking set)                                                                                                                                                                                                                                          |
| `batched_submit_put_task` | yes (synchronous; returns None)                                                                                                                                                                                                                                        |
| `get_blocking`            | yes (requires prior put in this process — see _Caveats_)                                                                                                                                                                                                               |
| `remove`                  | yes                                                                                                                                                                                                                                                                    |
| `pin` / `unpin`           | yes (presence-check only — wire layer has no per-client retention)                                                                                                                                                                                                     |
| `get_allocator_backend`   | yes (delegates to LocalCPUBackend)                                                                                                                                                                                                                                     |
| `close`                   | yes                                                                                                                                                                                                                                                                    |
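
The metadata-gated behavior of `contains` and `get_blocking` can be
modeled with a toy class. This is an illustrative sketch only:
`ToyMetaGatedStore` and its method names are hypothetical, a plain dict
stands in for the MemKV cluster, and none of it mirrors the real
backend's signatures.

```python
class ToyMetaGatedStore:
    """Toy model: bytes live in `server`, shape/dtype metadata in `_meta`."""

    def __init__(self, server):
        self.server = server  # shared dict standing in for the MemKV cluster
        self._meta = {}       # per-process metadata, lost on restart

    def put(self, key, payload, shape, dtype):
        self.server[key] = payload
        self._meta[key] = (shape, dtype)

    def contains(self, key):
        # Presence requires local metadata, not just server-resident bytes.
        return key in self._meta

    def get(self, key):
        if key not in self._meta:
            return None       # cannot rebuild the tensor without shape/dtype
        payload = self.server.get(key)
        if payload is None:
            return None       # bytes were evicted server-side
        shape, dtype = self._meta[key]
        return payload, shape, dtype


cluster = {}
first = ToyMetaGatedStore(cluster)
first.put("chunk-0", b"\x00" * 8, (2, 2), "float16")
restarted = ToyMetaGatedStore(cluster)  # fresh process, same cluster
```

Here `first.contains("chunk-0")` is true while
`restarted.contains("chunk-0")` is false, even though the bytes are
still in `cluster`.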

## Caveats

- **Cross-restart warm cache is MVP-restricted.** Each engine process
  keeps shape/dtype/fmt in an in-memory dict so `get_blocking` knows
  what `MemoryObj` to allocate. The wire bytes survive in MemKV across
  restarts; the local metadata does not. A fresh process therefore
  starts cold even when MemKV holds the chunks. This matches LMCache's
  LocalDiskBackend behavior. Encoding the shape header on the wire is
  a follow-up.
- **Key length cap.** MemKV's protocol caps keys at 512 bytes;
  `CacheEngineKey.to_string()` strings longer than 480 bytes are
  replaced by a `memkv-h2:<blake2b-256>` digest.
- **Pin/unpin are local-only.** MemKV has no per-client retention
  policy, so the methods are presence checks against the local meta
  dict. Server-side eviction is owned by the MemKV cluster.
- **Reads use the server-driven chunked BatchRead.** `get_blocking`
  calls `batch_get_into`, which streams the full value through the
  server's bounce buffers into per-thread staging with strict
  full-length-or-miss semantics. The single-key zero-copy `get_into`
  (client-driven RDMA READ) remains available but is not the LMCache
  default, for two reasons: under sustained burst load it saturated
  the per-connection RC send CQ tail and tripped vLLM's SPMD broadcast
  timeout, and it faults the value resident server-side.
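
The key length cap described in the list above can be sketched as
follows. The 480-byte threshold and the `memkv-h2:` prefix come from
that caveat; the UTF-8 encoding, the hex rendering of the digest, the
inclusive boundary, and the name `wire_key` are assumptions made for
illustration and may differ from the plugin's actual encoding.

```python
import hashlib

MAX_PLAIN_KEY_BYTES = 480  # threshold quoted in the caveat
DIGEST_PREFIX = "memkv-h2:"


def wire_key(key_string):
    """Return short keys unchanged; replace long ones with a BLAKE2b-256 digest."""
    raw = key_string.encode("utf-8")  # assumed encoding
    if len(raw) <= MAX_PLAIN_KEY_BYTES:
        return key_string
    # 32-byte digest, rendered as 64 hex characters (assumed rendering).
    digest = hashlib.blake2b(raw, digest_size=32).hexdigest()
    return DIGEST_PREFIX + digest
```

With these assumptions a digested key is 73 bytes long, comfortably
under the 512-byte protocol limit, and the mapping is deterministic, so
every process derives the same wire key for the same cache key.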

## Layout

```
lmcache-plugin/
├── Cargo.toml                      # cdylib + pyo3 + memkv-client
├── pyproject.toml                  # maturin
├── src/lib.rs                      # PyO3 wrapper around memkv-client::Engine
└── python/memkv_lmcache/
    ├── __init__.py                 # re-exports Client
    └── backend.py                  # MemKVStorageBackend(StoragePluginInterface)
```

