Skip to content

KV-Cache in LLM Inference

The KV-cache is the single most impactful performance optimization in LLM inference. Understanding it is essential for anyone building or scaling AI systems.

Transformers use attention: every output token attends to every previous token. Without caching, generating a 1000-token response requires recomputing attention for tokens 1–999 when generating token 1000.

Without KV-cache:
Token 1: 1 attention computation
Token 2: 2 attention computations
Token N: N attention computations
Total: O(NΒ²) computations ← quadratic!
With KV-cache:
Token 1: 1 computation β†’ store K,V in cache
Token 2: 1 computation β†’ append K,V to cache
Token N: 1 computation β†’ append K,V to cache
Total: O(N) computations ← linear!

For each transformer layer, every token produces three vectors: Query (Q), Key (K), and Value (V). During generation, Q changes every step (new token), but K and V for previous tokens are static β€” they never change.

import torch
def attention_with_kv_cache(query, key_cache, value_cache, new_key, new_value):
"""
query: [1, d_head] ← current token only
key_cache: [seq_len, d_head] ← all previous K vectors
value_cache: [seq_len, d_head] ← all previous V vectors
"""
# Append new K, V to cache
key_cache = torch.cat([key_cache, new_key.unsqueeze(0)], dim=0)
value_cache = torch.cat([value_cache, new_value.unsqueeze(0)], dim=0)
# Attention over full cached sequence
scores = torch.matmul(query, key_cache.T) # [1, seq_len]
scores = scores / (query.shape[-1] ** 0.5)
weights = torch.softmax(scores, dim=-1)
output = torch.matmul(weights, value_cache) # [1, d_head]
return output, key_cache, value_cache

This is where storage becomes critical. KV-cache size for a single request:

def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, dtype_bytes=2):
"""
num_layers: e.g. 32 (Llama-2-7B has 32 layers)
num_heads: e.g. 32
head_dim: e.g. 128
seq_len: max context length, e.g. 4096
dtype_bytes: 2 for float16/bfloat16
"""
per_token = num_layers * num_heads * head_dim * dtype_bytes * 2 # K + V
total = per_token * seq_len
return total
# Llama-2-7B, 4096 token context, bfloat16
size = kv_cache_bytes(32, 32, 128, 4096, 2)
print(f"KV-cache per request: {size / 1e9:.2f} GB")
# Output: KV-cache per request: 1.07 GB

With 40 GB GPU RAM and 1 GB per request, you can serve ~37 concurrent users β€” that’s your throughput ceiling.

vLLM solved the KV-cache memory fragmentation problem by treating GPU memory exactly like OS virtual memory β€” paged, demand-allocated.

Traditional KV-cache: PagedAttention (vLLM):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”
β”‚ Request A (2048 tok)β”‚ β”‚Page 1β”‚ β”‚Page 3β”‚ β”‚Page 5β”‚ ← Request A
β”‚ [pre-allocated] β”‚ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜
β”‚ [wasted if shorter]β”‚ β”Œβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚Page 2β”‚ β”‚Page 4β”‚ ← Request B
β”‚ Request B (512 tok) β”‚ β””β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”˜
β”‚ [padded to max] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Pages allocated on demand β€” no waste

Result: vLLM achieves 2–4x higher throughput than naive implementations on the same hardware.

ConceptDetail
What’s cachedK and V vectors per layer, per token
Memory growsLinearly with sequence length
GPU memory limitDefines max concurrent requests
vLLM innovationPaged KV-cache β€” eliminates fragmentation
Quantization winINT8 KV-cache halves memory β†’ 2x more users

See also: Flash Attention, GPU Memory Management, vLLM Architecture