A three-part investigation into compressing transformer KV caches using keyframe + delta encoding — the same principle as video compression. From first principles to 125× compression at zero quality loss.
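The keyframe + delta principle can be shown in miniature. The sketch below is hypothetical and is not the series' actual codec: it stores an occasional full-precision keyframe and encodes every frame in between as an int8-quantized delta against the most recent keyframe, exactly the structure video codecs use for I-frames and P-frames.

```python
import numpy as np

def encode(frames, keyframe_interval=8):
    """Keyframe + delta encoding (illustrative only): every
    `keyframe_interval` frames, store the frame at full precision;
    encode every other frame as an int8-quantized delta against
    the most recent keyframe."""
    stream, key = [], None
    for i, frame in enumerate(frames):
        if i % keyframe_interval == 0:
            key = frame.astype(np.float32)
            stream.append(("key", key))
        else:
            delta = frame.astype(np.float32) - key
            # Per-frame scale maps the largest delta onto the int8 range.
            scale = max(float(np.abs(delta).max()) / 127.0, 1e-12)
            stream.append(("delta", np.round(delta / scale).astype(np.int8), scale))
    return stream

def decode(stream):
    """Reconstruct frames: keyframes pass through unchanged; delta
    frames are dequantized and added back onto the last keyframe."""
    frames, key = [], None
    for item in stream:
        if item[0] == "key":
            key = item[1]
            frames.append(key.copy())
        else:
            _, quantized, scale = item
            frames.append(key + quantized.astype(np.float32) * scale)
    return frames
```

Delta frames cost 1 byte per element instead of 4, so long runs of near-keyframe states compress well. The 125× figure in this series comes from far more aggressive machinery than this toy, but the keyframe/delta skeleton is the same.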
The field has focused on model weight compression (GPTQ, AWQ, QuIP#). These are valuable — but weights are a one-time fixed cost. The KV cache is the variable cost that grows with context length, conversation history, and concurrent users. It is the actual bottleneck for long-context production inference.
**Weight quantization (GPTQ, AWQ, QuIP#).** Compresses model weights. Requires calibration and retraining. Applied once at deployment. Operates on a fixed storage axis. State of the art for weight quantization.

**TSC (KV cache compression).** Compresses the KV cache at runtime. Requires no model modification and works on any pretrained model. Directly addresses the bottleneck that limits context window size in production.
These compress different things and stack freely. A QuIP#-compressed model running with TSC KV compression gets both savings simultaneously — on independent storage axes with no interference.
At the 125× tier, a LLaMA-3-7B deployment serving 128K-token contexts reduces per-user KV memory from ~67 GB to ~537 MB — allowing ~150 concurrent users on a single A100 where previously only one was possible, at identical generation quality (top-1 match rate = 1.000).
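The memory arithmetic behind those numbers can be checked directly. The sketch below assumes a 32-layer, 4096-hidden LLaMA-7B-class model with fp16 K and V, reads 128K as 128,000 tokens, and uses decimal GB/MB; these shape assumptions are mine, not stated in the article.

```python
# Back-of-envelope KV-cache sizing for the deployment described above.
# Assumed model shape (LLaMA-7B-class): 32 layers, hidden size 4096, fp16.
N_LAYERS, HIDDEN, DTYPE_BYTES = 32, 4096, 2
TOKENS = 128_000          # "128K-token context"
COMPRESSION = 125         # the 125x tier
A100_BYTES = 80e9         # 80 GB A100

# K and V each store `HIDDEN` values per layer per token.
bytes_per_token = 2 * N_LAYERS * HIDDEN * DTYPE_BYTES   # 524,288 bytes
uncompressed = bytes_per_token * TOKENS                 # per-user cache
compressed = uncompressed / COMPRESSION

print(f"uncompressed: {uncompressed / 1e9:.1f} GB")      # ~67.1 GB
print(f"compressed:   {compressed / 1e6:.0f} MB")        # ~537 MB
print(f"users/A100:   {int(A100_BYTES // compressed)}")  # 149, i.e. ~150
```

An uncompressed 67 GB cache fills most of an 80 GB A100 on its own, which is why the baseline supports exactly one user; dividing it by 125 is what turns one user into roughly 150.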