Serving Agentic Workloads at Scale with vLLM x Mooncake
10 min read
TL;DR: Agentic workloads generate massive shared prefixes that are often recomputed across turns. By integrating Mooncake's distributed KV cache store into vLLM, we achieve 3.8x higher throughput,...
