01. The Algorithm
How TurboQuant compresses KV cache to 3 bits per coordinate with lossless attention scoring.
The KV Cache Problem
Every time a language model generates a token, it caches a key and a value vector for that token across every single layer. The cache grows linearly with sequence length. At 8,000 tokens on a 3-billion parameter model, that's almost 300 MB of KV cache.
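To see where a number like that comes from, here's the back-of-envelope arithmetic. The layer count and KV dimensions below are hypothetical (chosen to resemble a 3B-class model with grouped-query attention); real configurations vary.

```python
# Back-of-envelope KV cache size for a hypothetical 3B-class config:
# 36 layers, 2 KV heads of head dim 128, FP16 storage.
seq_len    = 8_000     # cached tokens
n_layers   = 36
kv_dim     = 2 * 128   # KV heads * head dim
kinds      = 2         # one key + one value vector per token per layer
bytes_fp16 = 2

cache_bytes = seq_len * n_layers * kinds * kv_dim * bytes_fp16
print(f"{cache_bytes / 1e6:.0f} MB")  # -> 295 MB
```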
During the decode phase (the part where the model actually generates its response), the GPU isn't doing much math. It's sitting there waiting for cached data to load from high-bandwidth memory. Over 90% of wall-clock time is spent on memory reads.
Step 1: Random Rotation
Quantization is rounding. You snap each coordinate to the nearest bin. But raw key vectors have outliers: a handful of coordinates with huge magnitudes, the rest clustered near zero. A uniform grid wastes most of its bins covering the outliers. The dense middle gets poor resolution.
TurboQuant fixes this in two steps. First, each vector is normalized to a unit vector (the original magnitude is stored separately as a scalar norm). Then it's multiplied by a random orthogonal matrix Π, which spreads energy evenly across all dimensions without changing the vector's geometry (dot products are preserved since Π is orthogonal). After rotation, every coordinate follows the same predictable Gaussian distribution, and because we know that distribution ahead of time, we can design the perfect quantizer for it.
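A minimal sketch of the rotation step, assuming one simple way to draw a random orthogonal matrix (QR of a Gaussian matrix; this is an illustration, not necessarily the paper's exact construction). The key property to verify is that dot products survive the rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Random orthogonal matrix Pi via QR decomposition of a Gaussian matrix.
A = rng.standard_normal((d, d))
Pi, _ = np.linalg.qr(A)  # Pi @ Pi.T == I

# A key vector with one huge-magnitude outlier coordinate, plus a query.
k = rng.standard_normal(d) * np.array([50.0] + [1.0] * (d - 1))
q = rng.standard_normal(d)

k_unit = k / np.linalg.norm(k)  # normalize; store ||k|| separately as a scalar
k_rot = Pi @ k_unit             # rotate: energy spreads across all coordinates
q_rot = Pi @ q

# Orthogonal rotation preserves the scoring dot product exactly.
assert np.isclose(q @ k_unit, q_rot @ k_rot)
```

In practice structured transforms (rather than a dense matrix multiply) make this step cheap, but the geometric argument is the same.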
Step 2: Lloyd-Max Quantization
Now that every coordinate follows the same Gaussian, we can use Lloyd-Max optimal scalar quantization. It places centroids along the bell curve: more where the data is dense, fewer at the tails. This minimizes mean squared error for the known distribution.
We precompute the codebook once offline with Lloyd's algorithm: assign samples to their nearest centroids, recompute each centroid as the conditional mean of its bin, and repeat until convergence. The result is stored as a small lookup table.
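The offline codebook construction can be sketched in a few lines. This version runs Lloyd's algorithm on Monte Carlo samples from a standard Gaussian (a simplification; the codebook can also be computed directly from the Gaussian density):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.sort(rng.standard_normal(200_000))  # stand-in for the known Gaussian
centroids = np.linspace(-1.5, 1.5, 8)            # 3 bits -> 8 centroids (value path)

for _ in range(100):
    # Assignment step: snap each sample to its nearest centroid.
    idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
    # Update step: each centroid becomes the conditional mean of its bin.
    centroids = np.array([samples[idx == j].mean() for j in range(len(centroids))])

print(np.round(centroids, 2))  # centroids cluster densely near zero, sparsely at the tails
```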
Values
3 bits → 8 centroids. All bits go to Lloyd-Max. Values aren't part of the scoring dot product, so no bias correction needed.
Keys
2 bits → 4 centroids for Lloyd-Max, plus 1 bit reserved for QJL bias correction (see below), for 3 bits per coordinate total.
Step 3: QJL Bias Correction
Here's the catch. Reconstructed vectors don't come back perfectly. There's a small reconstruction loss that systematically shrinks every vector toward zero, which means every attention score ends up a little too small.
Softmax exponentiates those scores, so even a small systematic bias gets amplified into a completely warped attention distribution.
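A toy illustration of that amplification: uniformly shrinking the scores (what a systematic underestimate does) visibly flattens the softmax output, bleeding attention mass away from the tokens that should dominate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

scores = np.array([6.0, 3.0, 0.0])
print(softmax(scores))        # sharp: most mass on the first token
print(softmax(0.8 * scores))  # shrunken scores: noticeably flatter distribution
```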
TurboQuant fixes this with QJL (Quantized Johnson-Lindenstrauss). Take the quantization residual, project it through a random matrix S, and keep just the sign of each coordinate. Positive or negative. One bit per dimension.
At decode time, the stored signs are used to compute a correction term that provably cancels the bias in expectation.
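A minimal sketch of the sign-based inner-product estimator underlying QJL, assuming a Gaussian projection S (the paper's exact construction may differ in details). For a Gaussian row s, E[sign(⟨s, r⟩)·⟨s, q⟩] = sqrt(2/π)·⟨q, r⟩/‖r‖, so rescaling the sign correlation by ‖r‖·sqrt(π/2)/m recovers ⟨q, r⟩ without bias. Averaging over fresh projections shows this empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128        # head dimension
m = d          # one sign bit per coordinate, matching the 1-bit key budget

r = 0.05 * rng.standard_normal(d)   # quantization residual (small)
q = rng.standard_normal(d)          # query vector

ests = []
for _ in range(2000):
    S = rng.standard_normal((m, d))
    signs = np.sign(S @ r)          # stored: m sign bits (+ ||r|| as a scalar)
    # Rescaled sign correlation: an unbiased estimate of <q, r>.
    ests.append(np.linalg.norm(r) * np.sqrt(np.pi / 2) / m * (signs @ (S @ q)))

print(np.mean(ests), q @ r)  # the average tracks the true dot product
```

Any single draw is noisy; the point is that the estimate is unbiased, so the systematic shrinkage in the attention scores is cancelled rather than merely reduced.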
Bit Budget
| Component | Lloyd-Max | QJL | Total | Lloyd-Max centroids |
|---|---|---|---|---|
| Keys | 2 bits | 1 bit | 3 bits | 4 |
| Values | 3 bits | - | 3 bits | 8 |
At 3 bits per coordinate for both keys and values, the total KV cache compresses to roughly 1/5 of its FP16 size, while attention scores remain unbiased in expectation.
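Where "roughly 1/5" comes from, assuming a hypothetical head dimension of 128 and counting the per-vector FP16 norm but ignoring any other small metadata:

```python
d = 128                  # assumed head dimension
compressed = 3 * d + 16  # 3 bits per coordinate + one FP16 scalar norm per vector
baseline   = 16 * d      # FP16: 16 bits per coordinate

print(compressed / baseline)  # -> 0.1953125, i.e. about 1/5
```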