03. Results & Benchmarks

All numbers from Qwen 2.5-1.5B running on a single NVIDIA Blackwell B200.

At a Glance

5.02× compression ratio
~0.985 output cosine similarity
144.7 tok/s generation throughput
3 bits per coordinate

Compression

At 3 bits per coordinate for both keys and values, the KV cache compresses from 1,288 KB down to 257 KB, a 5.02× reduction. This holds across all layers and heads.

Compression summary
Per-component byte breakdown
Compression ratio bar chart
FP16 vs TurboQuant memory usage
Detailed compression ratios
Compression ratios across different configurations
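
The 5.02× figure sits a touch below the ideal 16-bit → 3-bit ratio of 5.33×; the gap is the per-coordinate cost of whatever metadata rides along with the packed codes (scales, offsets, or correction terms). A quick back-of-envelope using only the numbers above:

```python
# Ideal ratio from bit widths alone vs. the measured ratio.
fp16_bits, q_bits = 16, 3
ideal_ratio = fp16_bits / q_bits                      # ~5.33x with zero overhead
measured_ratio = 1288 / 257                           # 1,288 KB -> 257 KB
overhead_bits = fp16_bits / measured_ratio - q_bits   # effective extra bits per coordinate
print(f"ideal {ideal_ratio:.2f}x, measured {measured_ratio:.2f}x, "
      f"~{overhead_bits:.2f} extra bits/coord of metadata")
```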

Attention Quality

The fused attention kernel was tested against FP16 reference attention across 5 sampled layers, all KV heads, and 8 query probes per head (a mix of random vectors and actual key vectors).

Attention quality statistics
~0.985 output cosine similarity on layers 7–27. Tested across 80 queries (5 layers × 2 heads × 8 probes).
Quality distribution histograms
Distribution of cosine similarity and max absolute weight error across all probes.
Interpretation: On layers 7 through 27, compressed attention output is virtually indistinguishable from FP16. The QJL correction term is doing its job. Attention scores come out unbiased.
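
For reference, this is roughly how each probe can be scored. A minimal sketch: `fused_turboquant_attention` is a hypothetical stand-in for the kernel's Python binding, and its signature is an assumption.

```python
import torch
import torch.nn.functional as F

# Score one query probe: compressed-cache attention vs. an FP16 reference.
def probe_quality(q, k_fp16, v_fp16, fused_turboquant_attention):
    scale = q.shape[-1] ** -0.5
    ref_weights = F.softmax((q @ k_fp16.transpose(-1, -2)) * scale, dim=-1)
    ref_out = ref_weights @ v_fp16                    # FP16 reference output

    # Compressed path: quantized K/V plus the QJL correction, done in-kernel
    out, weights = fused_turboquant_attention(q, k_fp16, v_fp16)

    cos = F.cosine_similarity(out.flatten(), ref_out.flatten(), dim=0)
    max_weight_err = (weights - ref_weights).abs().max()
    return cos.item(), max_weight_err.item()
```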

Latency Benchmarks

Latency measured on B200 across sequence lengths from 512 to 16K tokens. The fused kernel does score computation, QJL correction, online softmax, and V accumulation in one pass.

Latency benchmark table
FP16 matmul vs fused TurboQuant vs fused + block swizzle
Latency benchmark chart
The fused kernel approaches FP16 matmul speed and beats it at long sequences.
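
A minimal sketch of a CUDA-events timing harness of the kind behind numbers like these (warmup, then averaged events); `run_variant` stands for whichever path is being measured, e.g. the FP16 matmul baseline, the fused TurboQuant kernel, or the block-swizzled version.

```python
import torch

def bench(run_variant, iters=50, warmup=10):
    for _ in range(warmup):                  # warm up: compilation, caches, clocks
        run_variant()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        run_variant()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # mean milliseconds per call
```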

Roofline Analysis

Both standard FP16 attention and TurboQuant's current implementation sit in the memory-bound regime, with arithmetic intensity around 1 FLOP/byte. The fully-fused target changes this: load only compressed data, reconstruct entirely on-chip, and arithmetic intensity jumps past 600 FLOPs/byte. That crosses into compute-bound territory.

Roofline analysis
B200 roofline model. The green triangle is the fully-fused target: arithmetic intensity high enough that compute becomes the bottleneck, not memory.
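
For intuition on the ~1 FLOP/byte figure, a back-of-envelope for single-query (decode-time) FP16 attention, ignoring softmax FLOPs and query/output traffic, which are negligible at long sequence lengths:

```python
# Single query against a cache of N tokens with head dimension d, in FP16.
def fp16_decode_attention_intensity(N, d):
    flops = 2 * N * d + 2 * N * d      # q @ K^T scores, then weights @ V
    bytes_moved = 2 * (N * d) * 2      # stream K and V once, 2 bytes per coordinate
    return flops / bytes_moved

print(fp16_decode_attention_intensity(N=16_384, d=128))  # -> 1.0 FLOP/byte
```

The fully-fused design attacks the denominator: only 3-bit codes cross HBM, and the reconstruction math happens on-chip.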

Memory Scaling

KV cache memory scales linearly with sequence length. TurboQuant's compressed representation stays roughly 5× smaller at every point on the curve.

Memory scaling chart
FP16 vs TurboQuant cache size
Memory scaling table
Numerical breakdown by sequence length
Memory breakdown
Per-component memory breakdown
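
Because per-token cache cost is constant, the curve is a straight line at either precision. A sketch of that scaling using the Qwen 2.5-1.5B attention shape (28 layers, 2 KV heads, head dimension 128); the exact values in the table above may differ slightly.

```python
# KV-cache size vs. sequence length (keys and values both cached).
FP16_BYTES_PER_TOKEN = 28 * 2 * 128 * 2 * 2       # layers * kv_heads * head_dim * (K+V) * 2 bytes
Q3_BYTES_PER_TOKEN = FP16_BYTES_PER_TOKEN / 5.02  # measured ~5x compression

for seq_len in (512, 2048, 8192, 16_384):
    fp16_mb = seq_len * FP16_BYTES_PER_TOKEN / 2**20
    q3_mb = seq_len * Q3_BYTES_PER_TOKEN / 2**20
    print(f"{seq_len:>6} tokens: FP16 {fp16_mb:6.1f} MB | 3-bit {q3_mb:6.1f} MB")
```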

Live Generation

Qwen 2.5-1.5B generating text with its value cache replaced by TurboQuant's 3-bit compressed version. The model produces coherent, on-topic text at 144.7 tok/s.

Generation from compressed cache
60 tokens generated in 0.41s from compressed memory. 1,288 KB → 257 KB.
Side by side text comparison
Standard FP16 vs decompressed-value generation. Outputs are nearly identical.
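
The demo's round trip can be pictured with a much simpler codec than TurboQuant's actual one: quantize the value cache to 3 bits, dequantize before attention, and generate as usual. The per-token absmax scaling below is purely illustrative, not the real scheme.

```python
import torch

def quantize_3bit(v):                                    # v: [seq_len, head_dim]
    scale = (v.abs().amax(dim=-1, keepdim=True) / 3).clamp_min(1e-6)
    q = torch.clamp(torch.round(v / scale), -4, 3).to(torch.int8)  # signed 3-bit range
    return q, scale

def dequantize_3bit(q, scale):
    return q.to(torch.float16) * scale.to(torch.float16)

v = torch.randn(1024, 128, dtype=torch.float16)          # a stand-in value-cache slice
q, s = quantize_3bit(v)
print((v - dequantize_3bit(q, s)).abs().float().mean())  # small reconstruction error
```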

Additional Tests

Needle-in-a-haystack attention retrieval
Long-context stress test

Full Notebook

Every number on this page comes from turboquant_b200_patch1.ipynb, a Jupyter notebook that runs end-to-end on the B200. It covers model loading, KV cache extraction, compression, quality analysis, latency benchmarks, memory scaling, roofline analysis, and live text generation.

Notebook header
The notebook running on B200 with Qwen 2.5-1.5B

Next Steps

Full Autoregressive Generation

These quality numbers are per-layer. Stacking 28 layers of approximate attention is a different challenge since small errors compound. Next milestone: full end-to-end generation with compressed keys and QJL correction at every layer.

Hitting the Fully-Fused Roofline Target

The kernel currently sits in the memory-bound regime. The fully-fused target pushes arithmetic intensity past 600 into compute-bound territory. V-fused decompression was the first step. Tighter shared memory scheduling and less HBM traffic get us the rest of the way.

Larger Models and MoE

The initial implementation targets Qwen 2.5-1.5B, but the memory savings scale with model depth. MoE architectures are notoriously memory-bound because of expert weights; pairing them with a native 3-bit KV cache frees up substantial VRAM headroom for larger batch sizes and longer sequences.
