Google's TurboQuant — 6x LLM Memory Reduction with Zero Accuracy Loss
How TurboQuant compresses KV caches to 3 bits, the math behind PolarQuant and QJL, benchmark results on H100 GPUs, and what it means for LLM inference.

On March 24, 2026, Google Research published a blog post about TurboQuant, a compression algorithm for LLM key-value caches. Two days later, memory chip stocks dropped, Cloudflare's CEO called it "Google's DeepSeek moment," and AI engineers started building unofficial implementations before Google even released the code.
The headline claim: compress KV caches to 3 bits per value — a 6x memory reduction — with zero measurable accuracy loss and no model retraining. Here's what's actually going on under the hood.
Why KV Caches Are the Bottleneck
Transformer-based LLMs maintain a key-value (KV) cache during inference. Every token the model has processed gets stored as key and value vectors, so the model can attend to previous context when generating the next token.
The problem scales linearly with sequence length. Process a 100K-token document and the KV cache alone can consume tens of gigabytes of GPU memory. For long-context models, the cache often uses more memory than the model weights themselves. This is the wall that prevents you from running large models on smaller hardware or serving more concurrent users on the same GPU.
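To make the scaling concrete, here is a back-of-the-envelope cache-size estimate. The formula is standard; the model shape (32 layers, 8 KV heads of dimension 128, roughly Llama-3.1-8B's configuration) is an assumption for illustration:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value):
    # Both keys and values are cached, hence the factor of 2.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

# 100K-token context: fp16 (2 bytes/value) vs. 3-bit (3/8 byte/value)
fp16_cache = kv_cache_bytes(100_000, 32, 8, 128, 2)
quant_cache = kv_cache_bytes(100_000, 32, 8, 128, 3 / 8)
print(f"fp16: {fp16_cache / 1e9:.1f} GB, 3-bit: {quant_cache / 1e9:.1f} GB")
# → fp16: 13.1 GB, 3-bit: 2.5 GB
```

At this scale the fp16 cache alone rivals the roughly 16 GB of weights in an 8B fp16 model, which is why the cache, not the weights, becomes the long-context bottleneck.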
Prior work like KIVI (ICML 2024) achieved roughly 2.6x KV cache compression; pushing further degraded accuracy. TurboQuant breaks past that limit.
How TurboQuant Works
TurboQuant isn't a single algorithm. It combines two techniques — PolarQuant and QJL (Quantized Johnson-Lindenstrauss) — developed by Amir Zandieh (Research Scientist) and Vahab Mirrokni (VP and Google Fellow) alongside collaborators at Google DeepMind, KAIST, and NYU.
Stage 1: PolarQuant — Eliminating Normalization Overhead
Standard quantization methods split vectors into blocks, then store a normalization constant (a scale, and often a zero-point) per block. Amortized across the block, these constants typically add 1–2 bits of overhead per value. When you're targeting 3-bit precision, that overhead is significant: it eats into your actual compression ratio.
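To see the cost, assume a typical (hypothetical) layout: 32-value blocks, each storing an fp16 scale and an fp16 zero-point:

```python
block_size = 32      # values per quantization block (assumed)
scale_bits = 16      # fp16 scale per block
zero_bits = 16       # fp16 zero-point per block
payload_bits = 3     # target precision per value

overhead_per_value = (scale_bits + zero_bits) / block_size
effective_bits = payload_bits + overhead_per_value
print(effective_bits)  # → 4.0: a third more storage than the 3-bit payload
```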
PolarQuant takes a different approach. Instead of working in Cartesian coordinates (X, Y, Z), it converts vectors to polar coordinates — separating each vector into a radius (magnitude) and a set of angles (direction).
The key insight: after applying a random rotation to the input vectors, each coordinate's distribution converges to a Beta distribution regardless of the original data. The distribution becomes predictable and concentrated.
Predictable distribution means no per-block normalization constants needed. You can map values onto a fixed grid. That alone eliminates the overhead that makes conventional low-bit quantization lossy.
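The pipeline can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the random rotation, the Cartesian-to-polar conversion, and the fixed 3-bit angle grid are the ideas described above, but the real algorithm's grid design, block structure, and radius handling will differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def to_polar(x):
    # Cartesian -> (radius, angles) in d-dimensional spherical coordinates.
    r = np.linalg.norm(x)
    angles = np.empty(len(x) - 1)
    for i in range(len(x) - 1):
        tail = np.linalg.norm(x[i:])
        angles[i] = np.arccos(np.clip(x[i] / tail, -1.0, 1.0)) if tail > 0 else 0.0
    if x[-1] < 0:                        # last angle lives in [0, 2*pi)
        angles[-1] = 2 * np.pi - angles[-1]
    return r, angles

def to_cartesian(r, angles):
    x = np.empty(len(angles) + 1)
    s = r
    for i, a in enumerate(angles):
        x[i] = s * np.cos(a)
        s *= np.sin(a)
    x[-1] = s
    return x

def quantize_angles(angles, bits=3):
    # Fixed uniform grid: no per-block scale or zero-point to store.
    levels = 2 ** bits
    hi = np.full_like(angles, np.pi)
    hi[-1] = 2 * np.pi
    codes = np.round(angles / hi * (levels - 1)).astype(np.uint8)
    return codes, codes / (levels - 1) * hi

d = 16
R = random_rotation(d)
x = rng.normal(size=d)
r, angles = to_polar(R @ x)               # rotate, then go polar
codes, dequant = quantize_angles(angles)  # 3-bit codes on a fixed grid
x_hat = R.T @ to_cartesian(r, dequant)    # reconstruct, rotate back
err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error: {err:.2f}")
```

The coarse 3-bit grid leaves a visible residual, and that residual is exactly what Stage 2 corrects. (The radius r is kept in full precision here: one scalar per vector, negligible next to per-block constants.)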
Stage 2: QJL — 1-Bit Error Correction
PolarQuant gets you most of the way there, but the polar coordinate transformation introduces small residual errors. Transformer attention mechanisms rely on inner product computations, and even small biases in inner product estimates accumulate over long sequences.
QJL handles this by applying a quantized Johnson-Lindenstrauss transform to the residual error: it projects the residual with a random Gaussian matrix and records only the sign bit of each projected coordinate, a single bit apiece. Remarkably, this is enough to produce unbiased inner product estimates, which is what the attention mechanism needs to stay accurate.
The two stages together: PolarQuant compresses efficiently by removing normalization overhead, QJL corrects the remaining error with 1 bit. The result is 3-bit KV cache compression with mathematically guaranteed unbiased estimation.
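A minimal sketch of the sign-bit estimator at the heart of QJL. The underlying identity, E[⟨s, q⟩ · sign(⟨s, k⟩)] = sqrt(2/π) · ⟨q, k⟩ / ‖k‖ for Gaussian s, is standard; applying it to a raw key rather than a PolarQuant residual, and the specific dimensions, are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 128, 4096                # key dimension, number of random projections
S = rng.normal(size=(m, d))     # Gaussian JL projection, shared by all keys

def qjl_encode(k):
    # Store only the sign bit of each projected coordinate, plus ||k||.
    return np.signbit(S @ k), np.linalg.norm(k)

def qjl_inner(q, sign_bits, k_norm):
    # Unbiased estimate of <q, k> from the stored sign bits.
    signs = 1.0 - 2.0 * sign_bits            # signbit True -> -1.0
    return k_norm * np.sqrt(np.pi / 2) * (S @ q) @ signs / m

q, k = rng.normal(size=d), rng.normal(size=d)
sign_bits, k_norm = qjl_encode(k)
print(f"true: {q @ k:+.2f}  estimate: {qjl_inner(q, sign_bits, k_norm):+.2f}")
```

Each individual projection is noisy, but the estimator has no systematic bias, so errors average out rather than accumulating over long contexts, which is the property the attention mechanism needs.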
Benchmark Results
The research team tested on open-source LLMs including Gemma, Mistral, and Llama-3.1-8B-Instruct across standard long-context benchmarks: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval.
Accuracy
| Benchmark | Result |
|---|---|
| LongBench (QA, code gen, summarization) | At 3.5-bit, matched uncompressed cache scores |
| Needle In A Haystack | Perfect scores at 6x compression, up to 104K tokens |
| All benchmarks vs. KIVI baseline | Matched or outperformed at 3-bit precision |
Going from KIVI's 2.6x compression to TurboQuant's 6x while maintaining equal or better accuracy is a generational jump.
Speed and Memory
- Memory: At least 6x reduction in KV cache footprint (16-bit → 3-bit)
- Speed: Up to 8x faster attention logit computation on NVIDIA H100 GPUs at 4-bit precision vs. 32-bit baseline
Important caveat: the 8x speedup applies to the attention computation step only, not end-to-end inference. Attention is a major bottleneck, but not the only one. Full inference wall-clock improvement will be lower.
Vector Search
TurboQuant also works for vector search beyond KV cache compression. On the GloVe dataset, it achieved the highest 1@k recall ratios compared to Product Quantization (PQ) and RaBitQ, despite those methods using larger codebooks and dataset-specific tuning.
Vector indexing time is near-zero (0.0013s for 1536-dimensional vectors), which matters for real-time applications where new data needs to be searchable immediately.
The Data-Oblivious Property
This is arguably TurboQuant's most practical advantage.
Most quantization methods require a calibration step — you run a representative dataset through the model to collect statistics, then tune quantization parameters accordingly. Change the model, redo the calibration.
TurboQuant skips all of that. No training, no fine-tuning, no calibration. The random rotation makes the distribution mathematically predictable regardless of input data, so the same compression scheme works for any model out of the box.
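A quick numerical illustration of the data-oblivious property, using kurtosis of the pooled coordinates as a convenient summary statistic (both the arrays and the statistic are illustrative choices, not from the paper): even worst-case one-hot inputs look statistically like well-behaved data after a single random rotation:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 64, 2000
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # one random rotation, reused

def coord_kurtosis(X):
    # E[c^4] / E[c^2]^2 over all pooled coordinates (about 3 for Gaussian data).
    c = X.ravel()
    return (c ** 4).mean() / (c ** 2).mean() ** 2

spiky = np.zeros((n, d))                       # pathological: one-hot rows
spiky[np.arange(n), np.arange(n) % d] = 1.0
print(f"before rotation: {coord_kurtosis(spiky):.1f}")        # heavy-tailed: 64.0
print(f"after rotation:  {coord_kurtosis(spiky @ Q.T):.1f}")  # near-Gaussian
```

Because the post-rotation distribution is predictable in this way, one fixed quantization grid can serve any model's keys and values without calibration.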
For production deployment, this simplifies the pipeline significantly. New model released? Apply TurboQuant directly. No preprocessing required.
TurboQuant vs. NVIDIA's KVTC
TurboQuant isn't the only KV cache compression method at ICLR 2026. NVIDIA's KVTC is also being presented.
| | TurboQuant | KVTC (NVIDIA) |
|---|---|---|
| Max compression | 6x | 20x |
| Accuracy loss | Zero (measured) | < 1 percentage point |
| Calibration | None (data-oblivious) | One-time per model |
| Tested model sizes | Up to ~8B params | 1.5B – 70B params |
| Approach | Mathematical transform | Data-driven optimization |
KVTC wins on raw compression ratio. TurboQuant wins on zero accuracy loss and zero calibration. Different trade-offs for different use cases — KVTC may be better when you can tolerate minor accuracy loss and want maximum compression, while TurboQuant is better when accuracy is non-negotiable or you need to deploy across many models quickly.
What TurboQuant Doesn't Prove (Yet)
Being precise about limitations matters more than hype.
Model scale. Published benchmarks max out at roughly 8B parameters. Whether "zero accuracy loss" holds on 70B+ models, mixture-of-experts architectures, or models with 1M+ token context windows is undemonstrated.
No production deployment. Google has not announced TurboQuant running in Gemini, Google Search, or any production system. Google Research papers frequently describe techniques that never ship.
No official code. Open-source release is expected around Q2 2026. Independent developers have already built working implementations in Triton, MLX, and llama.cpp. One developer tested on Gemma 3 4B with an RTX 4090 and reported character-identical output to the uncompressed baseline at 2-bit precision — but these are unofficial results.
Papers and Conference Timeline
| Algorithm | arXiv | Conference |
|---|---|---|
| QJL (Quantized Johnson-Lindenstrauss) | — | AAAI 2025 (presented) |
| PolarQuant | arXiv:2502.02617 | AISTATS 2026 |
| TurboQuant | arXiv:2504.19874 | ICLR 2026 (April 23–25) |
The three papers came out sequentially and build on each other. QJL established the error correction foundation. PolarQuant eliminated normalization overhead. TurboQuant combined both into the final system.
Market Reaction and Outlook
Cloudflare CEO Matthew Prince calling this "Google's DeepSeek moment" carries a specific implication: a software algorithm reducing memory requirements by 6x could shift demand projections for HBM (High Bandwidth Memory). Micron and Western Digital stocks dropped after the announcement. Samsung and SK Hynix were also affected in Korean markets.
The counter-argument is worth noting: better memory efficiency means the same hardware can handle longer contexts, which opens use cases that were previously impossible due to memory constraints. The total addressable market for AI infrastructure could grow, potentially increasing overall memory demand even as per-query requirements shrink.
Regardless of the market narrative, TurboQuant represents a meaningful step in LLM inference efficiency. The formal ICLR 2026 presentation and the eventual open-source release will be the next milestones to watch.
This article is based on the Google Research blog, arXiv papers, and reporting from multiple technology publications. It is intended as informational content only and should not be used as the basis for investment decisions or technical purchasing decisions.