LMCache is a high-performance key–value (KV) cache management system designed to accelerate large language model (LLM) inference by efficiently storing, transferring, and reusing intermediate attention states. As modern LLM serving is increasingly bottlenecked by memory bandwidth, redundant computation, and cross-device communication, LMCache provides a system-level solution that decouples KV cache storage from the model execution pipeline and enables scalable, low-latency reuse across requests, processes, and even distributed nodes.

At its core, LMCache targets one of the most expensive parts of autoregressive inference: computing the KV cache during the prefill phase. In conventional serving systems, this cache is tightly coupled to a single process or GPU, making it difficult to reuse across requests or share between instances. As a result, repeated prompts or multi-turn conversations often trigger redundant computation, increasing both latency and resource consumption. LMCache addresses this limitation by introducing a unified KV cache abstraction that can be externally managed, retrieved asynchronously, and seamlessly reintegrated into the decoding pipeline.
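To make the abstraction concrete, the sketch below shows one plausible shape for such an externally managed cache, assuming content-addressed keys derived from fixed-size token-prefix chunks. The names (`ExternalKVCache`, `chunk_keys`, `KVChunk`) and the chunk size are illustrative, not LMCache's actual API.

```python
# Hypothetical sketch of an engine-agnostic, prefix-chunked KV cache:
# prompts are split into fixed-size token chunks, each chunk is keyed by a
# hash of the full prefix up to and including it, and KV states are stored
# and fetched against those keys independently of any single engine process.

import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Sequence

CHUNK_SIZE = 256  # tokens per cached chunk (illustrative value)


def chunk_keys(token_ids: Sequence[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Derive one content-addressed key per complete chunk of the prefix."""
    keys = []
    for end in range(chunk_size, len(token_ids) + 1, chunk_size):
        prefix = str(list(token_ids[:end])).encode("utf-8")
        keys.append(hashlib.sha256(prefix).hexdigest())
    return keys


@dataclass
class KVChunk:
    """Placeholder for the per-layer key/value tensors of one token chunk."""
    token_ids: List[int]
    kv_bytes: bytes  # in a real system these would be GPU/CPU tensors


@dataclass
class ExternalKVCache:
    """Engine-agnostic store: any process seeing the same tokens can reuse chunks."""
    _store: Dict[str, KVChunk] = field(default_factory=dict)

    def store(self, token_ids: Sequence[int], chunks: Sequence[KVChunk]) -> None:
        for key, chunk in zip(chunk_keys(token_ids), chunks):
            self._store.setdefault(key, chunk)

    def retrieve(self, token_ids: Sequence[int]) -> List[KVChunk]:
        """Return the longest cached prefix; the engine only prefills the rest."""
        hits: List[KVChunk] = []
        for key in chunk_keys(token_ids):
            chunk = self._store.get(key)
            if chunk is None:
                break  # prefix reuse stops at the first miss
            hits.append(chunk)
        return hits
```

Keying chunks by the hash of the entire preceding prefix is what allows a hit in one process to be safely reinjected in another: identical keys imply identical attention context.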

A key design principle of LMCache is minimizing the impact of cache operations on the critical path of inference. By supporting asynchronous KV retrieval and background prefetching, LMCache allows decoding to proceed without blocking on cache transfers. This is particularly important in disaggregated or multi-process deployments, where different inference engines may collaborate through a shared cache backend. In such settings, LMCache enables one instance to reuse KV states computed by another, significantly reducing time-to-first-token (TTFT) and improving tail latency. Empirical results show substantial gains in multi-turn and stateful workloads, where reuse opportunities are abundant.
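The snippet below illustrates the general pattern under simplified assumptions (the `fetch_kv_from_backend`, `prefetch_kv`, and `serve` functions are hypothetical): retrieval is submitted to a background worker as soon as a request is known, other decoding work continues, and the engine blocks on the handle only at the point where the KV states are actually consumed.

```python
# Simplified sketch of keeping KV transfers off the critical path:
# the fetch runs in a background thread while decoding continues, and the
# result is awaited only when the reused KV states must be injected.

from concurrent.futures import Future, ThreadPoolExecutor
import time

_transfer_pool = ThreadPoolExecutor(max_workers=2)


def fetch_kv_from_backend(request_id: str) -> bytes:
    """Stand-in for pulling KV chunks from a remote or host-memory backend."""
    time.sleep(0.05)  # simulated transfer latency
    return b"kv-states-for-" + request_id.encode()


def prefetch_kv(request_id: str) -> "Future[bytes]":
    """Start the transfer immediately, but do not block the caller."""
    return _transfer_pool.submit(fetch_kv_from_backend, request_id)


def schedule_other_decode_work() -> None:
    time.sleep(0.01)  # placeholder for decode steps of other in-flight requests


def serve(request_id: str) -> None:
    handle = prefetch_kv(request_id)   # transfer proceeds in the background
    schedule_other_decode_work()       # decoding is not blocked on the fetch
    kv = handle.result()               # wait only when the KV is actually needed
    print(f"{request_id}: injected {len(kv)} bytes of reused KV state")


if __name__ == "__main__":
    serve("req-42")
```

The same overlap is what makes cross-instance reuse viable: the transfer cost is hidden behind work the engine would be doing anyway, so the saved prefill computation translates directly into lower TTFT.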

LMCache also provides flexible support for heterogeneous memory and storage backends, including GPU memory, host memory, persistent storage (e.g., SSD or DAX devices), and high-speed interconnects such as RDMA. Through pluggable connectors and configurable transfer policies, it can adapt to diverse deployment environments, from single-node setups to large-scale distributed clusters. This flexibility enables system designers to balance trade-offs between latency, capacity, and cost, while maintaining high throughput.
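As a rough sketch of how such pluggable backends might be expressed (the `KVBackend`, `HostMemoryBackend`, `DiskBackend`, and `TieredCache` names are hypothetical, not LMCache's connector interface), each tier implements the same get/put contract, so a placement policy can combine tiers without the serving engine knowing which one holds a given chunk.

```python
# Illustrative tiered-backend sketch: host memory as the fast tier, local
# files (an SSD in practice) as the capacity tier, combined behind one
# interface by a simple write-through policy.

import os
import tempfile
from typing import Dict, Optional, Protocol


class KVBackend(Protocol):
    def put(self, key: str, blob: bytes) -> None: ...
    def get(self, key: str) -> Optional[bytes]: ...


class HostMemoryBackend:
    """Fast, capacity-limited tier kept in CPU RAM."""
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, blob: bytes) -> None:
        self._data[key] = blob

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class DiskBackend:
    """Slower, high-capacity tier backed by local files."""
    def __init__(self, root: Optional[str] = None) -> None:
        self._root = root or tempfile.mkdtemp(prefix="kv-cache-")

    def _path(self, key: str) -> str:
        return os.path.join(self._root, key)

    def put(self, key: str, blob: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(blob)

    def get(self, key: str) -> Optional[bytes]:
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None


class TieredCache:
    """Write-through policy: memory serves hits, disk acts as the fallback tier."""
    def __init__(self, fast: KVBackend, slow: KVBackend) -> None:
        self._fast, self._slow = fast, slow

    def put(self, key: str, blob: bytes) -> None:
        self._fast.put(key, blob)
        self._slow.put(key, blob)

    def get(self, key: str) -> Optional[bytes]:
        return self._fast.get(key) or self._slow.get(key)
```

A write-through policy like this keeps the slower tier authoritative, so evicting from host memory never loses data; other policies (write-back, latency- or cost-aware placement) could be slotted in behind the same interface to trade capacity against transfer latency.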

Beyond performance optimization, LMCache emphasizes observability and integration. It exposes detailed metrics for cache lookup, retrieval, and storage operations, allowing users to understand system behavior and diagnose bottlenecks. It is designed to integrate seamlessly with popular serving frameworks such as vLLM, requiring minimal changes to existing workflows while unlocking advanced caching capabilities.
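A minimal illustration of how such metrics could be surfaced to a standard monitoring stack is shown below, using prometheus_client; the metric names and the `lookup_and_retrieve` wrapper are assumptions made for the example, not LMCache's actual metric set.

```python
# Illustrative metrics instrumentation: count lookups and hits, and time
# retrievals, so cache effectiveness and transfer latency can be scraped by
# an existing Prometheus-based monitoring setup.

from typing import Optional

from prometheus_client import Counter, Histogram, start_http_server

KV_LOOKUPS = Counter("kv_cache_lookups_total", "Total KV cache lookups")
KV_HITS = Counter("kv_cache_hits_total", "Lookups that found reusable KV chunks")
KV_RETRIEVAL_SECONDS = Histogram(
    "kv_cache_retrieval_seconds", "Time spent fetching KV chunks from the backend"
)


def lookup_and_retrieve(key: str, backend) -> Optional[bytes]:
    """Wrap a backend get() with hit/miss counting and retrieval timing."""
    KV_LOOKUPS.inc()
    with KV_RETRIEVAL_SECONDS.time():
        blob = backend.get(key)
    if blob is not None:
        KV_HITS.inc()
    return blob


if __name__ == "__main__":
    start_http_server(9400)  # scrape endpoint; the port is chosen arbitrarily here
```

Tracking hit rate alongside retrieval latency makes it straightforward to tell whether a workload is limited by reuse opportunities or by the speed of the chosen backend.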

In summary, LMCache rethinks KV cache management as a first-class system component for LLM serving. By enabling efficient reuse, asynchronous data movement, and cross-instance sharing, it addresses fundamental inefficiencies in current inference pipelines and paves the way for more scalable, responsive, and resource-efficient AI systems.