TECHNICAL REPORT SERIES · MARCH 2026

Temporal State Compression
Research Series

A three-part investigation into compressing transformer KV caches using keyframe + delta encoding, the same principle as video compression. From first principles to 125× compression with no measurable quality loss.

125× · best defensible ratio, modern architecture
1.000 · top-1 match rate at 125×
0.057 · KL divergence at 125×
20.8× · over QuIP# (6×)
PART I · March 2026
Temporal State Compression for Transformer KV Caches:
63× Lossy Compression with Zero Measurable Quality Loss
Introduces TSC and the core prefer_append_for_growing insight. Fixes the root bug (max_delta_error calibration) that caused every step to fall back to keyframes. Validates 7.98× lossless and 63× lossy compression on GPT-2 over WikiText-2, with top-1=1.0000 and ppl_delta=−0.03.
7.98× lossless · 63× lossy · top-1 = 1.0000 · ppl_delta = −0.03 · 10.5× over QuIP# · seq 128–1024 validated
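The keyframe + delta idea behind Part I can be sketched in a few lines. This is an illustrative toy, not the papers' implementation: the class and method names (`TSCEncoder`, `encode_step`) and the `MAX_DELTA_ERROR` value are hypothetical, and a real encoder operates per layer and per head.

```python
import numpy as np

# Calibrated tolerance. Part I's root bug was mis-calibrating this value,
# which made every step fail the delta check and fall back to a keyframe.
MAX_DELTA_ERROR = 1e-2

class TSCEncoder:
    """Toy keyframe + delta encoder for one growing KV tensor."""

    def __init__(self):
        self.keyframe = None       # last fully stored cache snapshot
        self.rows_at_keyframe = 0  # sequence length at that snapshot

    def encode_step(self, kv: np.ndarray):
        """kv: (seq_len, head_dim) cache contents after this decode step."""
        if self.keyframe is None:
            self.keyframe = kv.copy()
            self.rows_at_keyframe = kv.shape[0]
            return ("keyframe", kv)

        old = self.rows_at_keyframe
        # prefer_append_for_growing: a decode step mostly appends new rows,
        # so if the shared prefix still matches the keyframe within
        # tolerance, store only the rows added since the keyframe.
        prefix_err = np.abs(kv[:old] - self.keyframe[:old]).max()
        if prefix_err <= MAX_DELTA_ERROR:
            return ("append", kv[old:])

        # Prefix drifted too far: emit a fresh keyframe.
        self.keyframe = kv.copy()
        self.rows_at_keyframe = kv.shape[0]
        return ("keyframe", kv)
```

Compression comes from the append branch: in steady state each step stores only the newly appended rows, while a keyframe is paid only when the prefix drifts past the calibrated tolerance.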
PART II · March 2026
Beyond 84×: A Vector Quantization Quality Framework
for Transformer KV Cache Compression
Investigates stacking VQ on TSC keyframes to reach extreme ratios. Identifies root cause of 1034× being unusable (per-head KV magnitude distributions). Builds a three-stage quality framework: per-head normalization, selective fp16 fallback, and Product Quantization. Establishes 84× (TSC + H2O eviction) and 219× (PQ) as defensible operating points.
84× + eviction · 219× PQ (KL=0.05) · 1034× archival · KV magnitude root cause · 14× over QuIP# · H2O eviction at 75% retention
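Product Quantization, the third stage of Part II's framework, can be sketched as follows. This is a minimal illustration under assumed parameters (4 subspaces, 16 codes), not the papers' pipeline; in particular, the per-head normalization that Part II identifies as the root-cause fix would be applied before this step.

```python
import numpy as np

def train_pq(X, n_sub=4, n_codes=16, iters=10, seed=0):
    """X: (n, d) keyframe vectors. Trains one k-means codebook per subspace."""
    rng = np.random.default_rng(seed)
    d_sub = X.shape[1] // n_sub
    books = []
    for s in range(n_sub):
        Xs = X[:, s * d_sub:(s + 1) * d_sub]
        C = Xs[rng.choice(len(Xs), n_codes, replace=False)]  # init from data
        for _ in range(iters):
            dist = ((Xs[:, None, :] - C[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for k in range(n_codes):
                mask = assign == k
                if mask.any():
                    C[k] = Xs[mask].mean(0)
        books.append(C)
    return books

def pq_encode(X, books):
    """Each vector becomes n_sub uint8 codes (here: 4 bytes per vector)."""
    d_sub = X.shape[1] // len(books)
    codes = np.empty((len(X), len(books)), dtype=np.uint8)
    for s, C in enumerate(books):
        Xs = X[:, s * d_sub:(s + 1) * d_sub]
        codes[:, s] = ((Xs[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
    return codes

def pq_decode(codes, books):
    """Reconstruct by concatenating the selected codeword per subspace."""
    return np.concatenate([books[s][codes[:, s]] for s in range(len(books))], axis=1)
```

With these toy settings a 16-dim fp32 vector (64 bytes) shrinks to 4 code bytes, a 16× reduction before codebook overhead; the extreme ratios in Part II come from larger vectors and more aggressive settings.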
PART III · March 2026 · ✦ New
TSC at Scale: Modern Architecture Validation,
the Cross-Layer Weight Question, and a Definitive
Production Compression Hierarchy
Extends validation to TinyLlama-1.1B (RoPE + RMSNorm + GQA, the Gemma/LLaMA-3/Qwen-2 architecture class) at seq=1024. The selective VQ tier reaches 125× at KL=0.057 with top-1=1.000 on modern architecture. Also closes the weight compression question: a systematic probe finds inter-layer weight delta/weight ratios of 1.25–1.44×, conclusively ruling out cross-layer delta compression. Concludes with the definitive production tier hierarchy and the complete TSC vs QuIP# comparison.
125× at KL=0.057 · top-1 = 1.000 · seq=1024 validated · TinyLlama-1.1B (Gemma class) · 20.8× over QuIP# · Weight delta: negative result · DynamicCache portability
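The weight-delta probe in Part III reduces to one ratio per adjacent layer pair. The sketch below shows the logic on random Gaussian stand-ins rather than real model weights (the 1.25–1.44× figures come from TinyLlama-1.1B); fully independent layers would give a ratio near √2 ≈ 1.41, so observed ratios in that range mean adjacent layers share essentially no exploitable structure.

```python
import numpy as np

def delta_ratios(layers):
    """layers: list of same-shape weight matrices, in layer order.
    Returns ||W[l+1] - W[l]||_F / ||W[l]||_F for each adjacent pair.
    A ratio >= 1 means the delta is no smaller than the weight itself,
    so delta-encoding consecutive layers cannot save space."""
    return [
        float(np.linalg.norm(b - a) / np.linalg.norm(a))
        for a, b in zip(layers, layers[1:])
    ]

# Random stand-ins for per-layer weight matrices (illustration only).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((256, 256)) for _ in range(5)]
ratios = delta_ratios(layers)  # each ratio lands near sqrt(2)
```

Running the same probe over a real checkpoint's per-layer projection matrices is what yields the 1.25–1.44× range reported above: the deltas cost more to store than the weights, which is the negative result.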

Why KV Cache Compression Matters More Than Weight Compression

The field has focused on model weight compression (GPTQ, AWQ, QuIP#). These are valuable — but weights are a one-time fixed cost. The KV cache is the variable cost that grows with context length, conversation history, and concurrent users. It is the actual bottleneck for long-context production inference.

QuIP# (6×)

Compresses model weights. Requires calibration and retraining. Applied once at deployment. Operates on a fixed storage axis. State of the art for weight quantization.

TSC (63–125×)

Compresses KV cache at runtime. No model modification. Works on any pretrained model. Directly addresses the bottleneck that limits context window size in production.

They compose: QuIP# model + TSC KV cache

These compress different things and stack freely. A QuIP#-compressed model running with TSC KV compression gets both savings simultaneously — on independent storage axes with no interference.

At the 125× tier, a LLaMA-3-7B deployment serving 128K-token contexts reduces per-user KV memory from ~67 GB to ~537 MB — allowing ~150 concurrent users on a single A100 where previously only one was possible, at identical generation quality (top-1 match rate = 1.000).
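The deployment figures above can be checked with back-of-envelope arithmetic. The sketch assumes full multi-head attention at fp16 with generic 7B-class constants (32 layers, 4096 hidden dim); the text's ~67 GB and ~150 users presumably use slightly different model constants, but the same calculation lands in the same range.

```python
# KV cache bytes per token: K and V, every layer, full hidden width, fp16.
layers, hidden, bytes_fp16 = 32, 4096, 2   # assumed 7B-class constants
tokens = 128 * 1024                        # 128K-token context

per_token_bytes = 2 * layers * hidden * bytes_fp16   # 524,288 B/token
raw_gb = per_token_bytes * tokens / 1e9              # ~68.7 GB uncompressed
compressed_mb = raw_gb * 1e3 / 125                   # ~550 MB at 125×
users_per_a100 = int(80e3 // compressed_mb)          # ~145 on an 80 GB A100
```

The point of the arithmetic is the shape of the result, not the exact constants: at 125× the per-user cache drops from tens of gigabytes to roughly half a gigabyte, turning a one-user-per-GPU workload into a ~150-user one.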