1,288 KB → 257 KB. Same text out.

TurboQuant on Blackwell

A CUDA-native KV cache compression engine built with cuTile, running on NVIDIA's Blackwell B200.

5.02× Compression · ~0.985 Cosine Similarity · 144.7 tok/s on B200 · 5 Kernel Types

The Memory Wall

During LLM inference, the KV cache is the single biggest memory bottleneck. At 8,000 tokens on a 3B-parameter model, you're looking at nearly 300 MB of cached keys and values. The GPU spends over 90% of wall-clock time just waiting for that data to load from memory.
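The "nearly 300 MB" figure follows from the usual KV cache arithmetic: two tensors (keys and values) per layer, each `kv_heads × head_dim` values per token, stored in fp16. A minimal sketch, using a hypothetical 3B-class GQA configuration (the layer/head counts here are illustrative, not Qwen's exact config):

```python
# KV cache footprint for a hypothetical 3B-class GQA model at 8,000 tokens.
# These dimensions are illustrative assumptions, not a specific model's config.
n_layers, n_kv_heads, head_dim = 36, 2, 128
seq_len, bytes_fp16 = 8000, 2

# 2 accounts for the separate key and value tensors per layer.
cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16
print(f"KV cache: {cache_bytes / 1e6:.1f} MB")  # ≈ 294.9 MB
```

The cache grows linearly with sequence length, which is why long contexts hit the memory wall first.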

TurboQuant (Google, ICLR 2026) compresses the cache down to 3 bits per coordinate. That's 5× less memory and 5× faster loads, with attention scores that stay mathematically unbiased.
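To see why 3 bits per coordinate is enough to keep outputs close, here is a minimal sketch of per-row 3-bit quantization and reconstruction in NumPy. This is deliberately simplified: TurboQuant itself adds a random rotation and a dithered, unbiased quantizer, neither of which is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal((64, 128)).astype(np.float32)  # stand-in value-cache rows

# Per-row affine quantization to 8 levels (3 bits per coordinate).
lo = v.min(axis=1, keepdims=True)
hi = v.max(axis=1, keepdims=True)
scale = (hi - lo) / 7.0                                # 7 intervals between 8 levels
idx = np.clip(np.round((v - lo) / scale), 0, 7).astype(np.uint8)  # 3-bit codes

# Dequantize and measure fidelity of the reconstruction.
v_hat = idx * scale + lo
cos = float((v * v_hat).sum() / (np.linalg.norm(v) * np.linalg.norm(v_hat)))
print(f"cosine similarity: {cos:.4f}")
```

Even this naive round-to-nearest scheme lands well above 0.95 cosine similarity on Gaussian data; TurboQuant's rotation and unbiased rounding are what push fidelity toward the ~0.985 measured here while keeping attention-score estimates unbiased.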

I implemented TurboQuant as a set of custom CUDA kernels using NVIDIA cuTile and ran it end-to-end on a Blackwell B200 with Qwen 2.5-1.5B. The model generates coherent text from a fully compressed KV cache with near-perfect fidelity.

Pipeline

Algorithm (TurboQuant paper) → cuTile Kernels (5 kernel types) → LLM (Qwen 2.5) → B200 GPU

Live Generation

Qwen 2.5-1.5B generating text from a TurboQuant-compressed KV cache on the B200. The value cache is decompressed from 3-bit indices; the model produces coherent output at 144.7 tok/s with a 5.02× compression ratio.

TurboQuant generation from compressed KV cache
60 tokens generated in 0.41s from compressed memory. 1,288 KB → 257 KB.
Anirudh Bharadwaj Vangara
MLE Intern @ Shopify · Computer Engineering @ University of Waterloo · MLH Top 50