Daily Tech Feed: From the Labs

Deep dives into foundational AI and ML research papers

34: Spinning to Zero

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate closes a gap that has been open since Claude Shannon defined the theoretical floor for lossy compression in 1948. For nearly eighty years, practical vector quantization methods fell exponentially short of what rate-distortion theory says is achievable…

Show Notes

Episode 0034: Spinning to Zero

Why it matters. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate closes a gap that has been open since Claude Shannon defined the theoretical floor for lossy compression in 1948. For nearly eighty years, practical vector quantization methods fell exponentially short of what rate-distortion theory says is achievable: they either achieved good distortion bounds only through expensive offline training, or ran online but paid an exponentially growing quality penalty at higher bit depths. TurboQuant comes within a constant factor of roughly 2.7 of the information-theoretic optimum, requires no training, and runs entirely at inference time. That enables LLM KV cache compression to 3.5 bits per channel with zero quality degradation, and near-zero indexing overhead for nearest neighbor search.

Institutions. The paper is a collaboration between Google Research, Google DeepMind, and the NYU Courant Institute of Mathematical Sciences. The full paper is available at arXiv:2504.19874, submitted April 28, 2025. The authors' prior KV cache quantization work, QJL, has code at github.com/amirzandieh/QJL.

The Researchers. Amir Zandieh is a Senior Research Scientist at Google Research working on sublinear algorithms and ML efficiency. Majid Daliri is a PhD student at NYU Courant researching algorithms, LLMs, and sketching. Majid Hadian is at Google DeepMind. Vahab Mirrokni is a Google Fellow and VP at Google Research, leading the algorithms and optimization research groups.

Key Technical Concepts. TurboQuant works by randomly rotating input vectors with a Johnson-Lindenstrauss transform, which induces a concentrated Beta distribution on each coordinate and near-independence across dimensions; this makes optimal scalar quantization per coordinate possible, with no cross-dimensional codebook needed. For inner product tasks (attention, MIPS), a two-stage approach first applies an MSE-optimal quantizer, then a 1-bit QJL quantizer to the residual, which debiases the inner-product estimate. The result is the first data-oblivious, online method to provably match Shannon's rate-distortion bound to within a small constant, improving on prior work such as PolarQuant and classical product quantization methods, which either required offline training or suffered exponentially growing distortion gaps.
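The rotate-then-quantize idea can be sketched in a few lines of numpy. This is a simplified illustration of the concept, not the authors' implementation: it uses a dense random orthogonal matrix where the paper uses a fast Johnson-Lindenstrauss transform, and a plain uniform scalar quantizer where the paper derives distribution-optimal quantization levels. All function names here are invented for the sketch.

```python
# Sketch of rotate-then-scalar-quantize, the core idea behind TurboQuant.
# Illustrative only: names and parameter choices are this sketch's, not the paper's.
import numpy as np

def random_rotation(d, rng):
    # Random orthogonal matrix via QR (sign-corrected so it is Haar-distributed).
    # The paper uses a fast JL-style transform instead of a dense matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def quantize(x, Q, bits=4):
    # Rotate first: coordinates of a rotated vector concentrate, so one
    # scalar quantizer per coordinate works well, with no cross-dim codebook.
    y = Q @ x
    scale = np.abs(y).max()
    levels = 2 ** bits
    codes = np.clip(np.round((y / scale + 1.0) / 2.0 * (levels - 1)),
                    0, levels - 1).astype(np.uint8)
    return codes, scale

def dequantize(codes, scale, Q, bits=4):
    levels = 2 ** bits
    y_hat = (codes / (levels - 1) * 2.0 - 1.0) * scale
    return Q.T @ y_hat  # undo the rotation

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal(d)
x /= np.linalg.norm(x)          # unit vector, as in embedding / KV cache use

Q = random_rotation(d, rng)
codes, scale = quantize(x, Q, bits=4)
x_hat = dequantize(codes, scale, Q, bits=4)
print(f"L2 reconstruction error at 4 bits/coord: {np.linalg.norm(x - x_hat):.4f}")
```

Because the rotation makes the per-coordinate distribution known in advance, a real implementation can fix its quantization levels once, data-obliviously, which is what allows the method to run online with no training pass.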


Daily Tech Feed: From the Labs is available on Apple Podcasts, Spotify, and wherever fine podcasts are distributed. Visit us at pod.c457.org for all our shows. New episodes daily.