01. The Algorithm
How TurboQuant compresses KV cache to 3 bits per coordinate with lossless attention scoring.
The KV Cache Problem
Every time a language model generates a token, it caches a key and a value vector for that token across every single layer. The cache grows linearly with sequence length. At 8,000 tokens on a 3-billion parameter model, that's almost 300 MB of KV cache.
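To see where a number like that comes from, here's the back-of-envelope arithmetic. The layer count and KV dimensions below are hypothetical (chosen to resemble a 3B-class model with grouped-query attention); real configurations vary.

```python
# Back-of-envelope KV cache size for a hypothetical 3B-class config:
# 36 layers, 2 KV heads of head dim 128, FP16 storage.
seq_len    = 8_000     # cached tokens
n_layers   = 36
kv_dim     = 2 * 128   # KV heads * head dim
kinds      = 2         # one key + one value vector per token per layer
bytes_fp16 = 2

cache_bytes = seq_len * n_layers * kinds * kv_dim * bytes_fp16
print(f"{cache_bytes / 1e6:.0f} MB")  # -> 295 MB
```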
During the decode phase (the part where the model actually generates its response), the GPU isn't doing much math. It's sitting there waiting for cached data to load from high-bandwidth memory. Over 90% of wall-clock time is spent on memory reads.
Step 1: Random Rotation
Quantization is rounding. You snap each coordinate to the nearest bin. But raw key vectors have outliers: a handful of coordinates with huge magnitudes, the rest clustered near zero. A uniform grid wastes most of its bins covering the outliers. The dense middle gets poor resolution.
TurboQuant fixes this in two steps. First, each vector is normalized to a unit vector (the original magnitude is stored separately as a scalar norm). Then it's multiplied by a random orthogonal matrix Π, which spreads energy evenly across all dimensions without changing the vector's geometry (dot products are preserved since Π is orthogonal). After rotation, every coordinate follows the same predictable Gaussian distribution, and because we know that distribution ahead of time, we can design the perfect quantizer for it.
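A minimal sketch of the rotation step, assuming one simple way to draw a random orthogonal matrix (QR of a Gaussian matrix; this is an illustration, not necessarily the paper's exact construction). The key property to verify is that dot products survive the rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Random orthogonal matrix Pi via QR decomposition of a Gaussian matrix.
A = rng.standard_normal((d, d))
Pi, _ = np.linalg.qr(A)  # Pi @ Pi.T == I

# A key vector with one huge-magnitude outlier coordinate, plus a query.
k = rng.standard_normal(d) * np.array([50.0] + [1.0] * (d - 1))
q = rng.standard_normal(d)

k_unit = k / np.linalg.norm(k)  # normalize; store ||k|| separately as a scalar
k_rot = Pi @ k_unit             # rotate: energy spreads across all coordinates
q_rot = Pi @ q

# Orthogonal rotation preserves the scoring dot product exactly.
assert np.isclose(q @ k_unit, q_rot @ k_rot)
```

In practice structured transforms (rather than a dense matrix multiply) make this step cheap, but the geometric argument is the same.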
Step 2: Lloyd-Max Quantization
Now that every coordinate follows the same Gaussian, we can use Lloyd-Max optimal scalar quantization. It places centroids along the bell curve: more where the data is dense, fewer at the tails. This minimizes mean squared error for the known distribution.
We precompute the codebook once offline with Lloyd's algorithm: assign samples to their nearest centroids, recompute each centroid as the conditional mean of its bin, and repeat until convergence. The result is stored as a small lookup table.
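The offline codebook construction can be sketched in a few lines. This version runs Lloyd's algorithm on Monte Carlo samples from a standard Gaussian (a simplification; the codebook can also be computed directly from the Gaussian density):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.sort(rng.standard_normal(200_000))  # stand-in for the known Gaussian
centroids = np.linspace(-1.5, 1.5, 8)            # 3 bits -> 8 centroids (value path)

for _ in range(100):
    # Assignment step: snap each sample to its nearest centroid.
    idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
    # Update step: each centroid becomes the conditional mean of its bin.
    centroids = np.array([samples[idx == j].mean() for j in range(len(centroids))])

print(np.round(centroids, 2))  # centroids cluster densely near zero, sparsely at the tails
```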
Values
3 bits → 8 centroids. All bits go to Lloyd-Max. Values aren't part of the scoring dot product, so no bias correction needed.
Keys
2 bits → 4 centroids for Lloyd-Max, plus 1 bit reserved for QJL bias correction (see below), for 3 bits per coordinate total.
Step 3: QJL Bias Correction
Here's the catch. Reconstructed vectors don't come back perfectly. There's a small reconstruction loss that systematically shrinks every vector toward zero, which means every attention score ends up a little too small.
Softmax exponentiates those scores, so even a small systematic bias gets amplified into a completely warped attention distribution.
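A toy illustration of that amplification: uniformly shrinking the scores (what a systematic underestimate does) visibly flattens the softmax output, bleeding attention mass away from the tokens that should dominate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

scores = np.array([6.0, 3.0, 0.0])
print(softmax(scores))        # sharp: most mass on the first token
print(softmax(0.8 * scores))  # shrunken scores: noticeably flatter distribution
```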
TurboQuant fixes this with QJL (Quantized Johnson-Lindenstrauss). Take the quantization residual, project it through a random matrix S, and keep just the sign of each coordinate. Positive or negative. One bit per dimension.
At decode time, the stored signs are used to compute a correction term that provably cancels the bias in expectation.
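A minimal sketch of the sign-based inner-product estimator underlying QJL, assuming a Gaussian projection S (the paper's exact construction may differ in details). For a Gaussian row s, E[sign(⟨s, r⟩)·⟨s, q⟩] = sqrt(2/π)·⟨q, r⟩/‖r‖, so rescaling the sign correlation by ‖r‖·sqrt(π/2)/m recovers ⟨q, r⟩ without bias. Averaging over fresh projections shows this empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128        # head dimension
m = d          # one sign bit per coordinate, matching the 1-bit key budget

r = 0.05 * rng.standard_normal(d)   # quantization residual (small)
q = rng.standard_normal(d)          # query vector

ests = []
for _ in range(2000):
    S = rng.standard_normal((m, d))
    signs = np.sign(S @ r)          # stored: m sign bits (+ ||r|| as a scalar)
    # Rescaled sign correlation: an unbiased estimate of <q, r>.
    ests.append(np.linalg.norm(r) * np.sqrt(np.pi / 2) / m * (signs @ (S @ q)))

print(np.mean(ests), q @ r)  # the average tracks the true dot product
```

Any single draw is noisy; the point is that the estimate is unbiased, so the systematic shrinkage in the attention scores is cancelled rather than merely reduced.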
Bit Budget
| Component | Lloyd-Max | QJL | Total | Lloyd-Max centroids |
|---|---|---|---|---|
| Keys | 2 bits | 1 bit | 3 bits | 4 |
| Values | 3 bits | - | 3 bits | 8 |
At 3 bits per coordinate for both keys and values, the total KV cache compresses to roughly 1/5 of its FP16 size, while attention scores remain unbiased in expectation.
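Where "roughly 1/5" comes from, assuming a hypothetical head dimension of 128 and counting the per-vector FP16 norm but ignoring any other small metadata:

```python
d = 128                  # assumed head dimension
compressed = 3 * d + 16  # 3 bits per coordinate + one FP16 scalar norm per vector
baseline   = 16 * d      # FP16: 16 bits per coordinate

print(compressed / baseline)  # -> 0.1953125, i.e. about 1/5
```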