My attempt to understand TurboQuant

Apr 04, 2026

Google announced a new quantization algorithm called TurboQuant that enables massive compression of the KV cache in an LLM without compromising quality. My attempt here is to understand, as best as I can, what TurboQuant is, given that I don’t know much about LLMs.

What is vector quantization and why is it needed?

Suppose you have a vector of 32-bit floats with dimension d; each coordinate takes 4 bytes. If d is very high, as is usually the case in LLMs, you face a memory problem. The solution is to shrink the vector’s memory footprint by storing a lossy, compressed representation of it. That is called vector quantization.

The simplest way of doing that is by mapping each coordinate of the vector to a known, prespecified set of values, called a codebook. For example, let’s say our vector is [0.76, -0.28, 0.10, 0.57] and we define the codebook as [-1.0, -0.5, 0.0, 0.5, 1.0]. We can then map each coordinate to the index of the codebook value closest to it, which gives the vector [4, 1, 2, 3]. Because we only need 3 bits to represent a codebook index, we have reduced the memory footprint from 128 bits to 12 bits. You can also see that when we try to get our vector back, we lose information: 0.76 became 1.0, -0.28 became -0.5, and so forth. That kind of compression adds a lot of error, so the challenge is figuring out a codebook that minimizes it. We’re mainly interested here in the mean squared error (MSE) between the original and reconstructed vectors, and in how well inner products are preserved.
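Here’s that toy example as a few lines of Python (my own illustration, not anything from the paper):

```python
import numpy as np

# Toy vector and hand-picked codebook from the example above.
vector = np.array([0.76, -0.28, 0.10, 0.57], dtype=np.float32)
codebook = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)

# Quantize: for each coordinate, store the index of the nearest codebook entry.
indices = np.argmin(np.abs(vector[:, None] - codebook[None, :]), axis=1)
print(indices)  # [4 1 2 3]

# Dequantize: look the indices back up in the codebook.
reconstructed = codebook[indices]
print(reconstructed)  # [ 1.  -0.5  0.   0.5]

# The information we lost shows up as reconstruction error.
print(np.mean((vector - reconstructed) ** 2))  # mean squared error
```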

I am not really familiar with vector quantization algorithms, but it looks like a common approach is to learn the codebook from data (for example, from trained model weights) using the K-means algorithm. Essentially, the codebook entries are the cluster centroids.
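As a rough sketch of that data-driven approach (again my own illustration, not TurboQuant): fit K-means to the values you want to quantize and use the resulting centroids as the codebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Pretend these are flattened model weights we want to quantize.
rng = np.random.default_rng(0)
weights = rng.normal(size=(10_000, 1))

# Learn a 3-bit codebook (2**3 = 8 entries): the centroids are the codebook.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(weights)
codebook = np.sort(kmeans.cluster_centers_.ravel())

# Each weight is stored as the index of its nearest centroid.
indices = np.argmin(np.abs(weights - codebook[None, :]), axis=1)
reconstructed = codebook[indices]
print(np.mean((weights.ravel() - reconstructed) ** 2))
```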

Google’s idea

Instead of trying to figure out the codebook by looking at the data, they randomly transform the input by applying a random matrix. That may seem crazy, but the effect of that transformation is that the coordinates of your data now follow a known distribution (in this case a Beta distribution). Because your data is well-behaved, you can mathematically derive the optimal quantizer instead of learning it.
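Here’s a tiny sketch of what I understand that to mean (my own illustration; I use a random orthogonal matrix from a QR decomposition, whereas the real thing may use a cheaper structured transform):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# A random rotation: an orthogonal matrix obtained from the QR
# decomposition of a Gaussian matrix (an assumption on my part;
# the paper's transform may be a cheaper structured one).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

# Some spiky, badly-behaved input vector.
x = np.zeros(d)
x[:4] = [50.0, -3.0, 0.2, 7.0]

# After rotation the energy is spread across all coordinates, so each
# coordinate follows a predictable, well-concentrated distribution that
# a fixed scalar quantizer can be designed for analytically.
y = Q @ x
print(np.linalg.norm(x), np.linalg.norm(y))  # the norm is preserved
print(np.abs(x).max(), np.abs(y).max())      # but the big spike is gone
```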

Google proved that a b-bit quantizer built this way has an MSE upper bound of $\frac{\sqrt{3}\,\pi}{2} \cdot 4^{-b}$, i.e. the error shrinks by roughly a factor of four for every extra bit. The actual distortion for $b = 1, 2, 3, 4$ comes out to about $0.36$, $0.117$, $0.03$ and $0.009$, respectively, which sits below that bound.
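Just to sanity-check the scaling, plugging small values of $b$ into that bound (my own arithmetic, nothing from the paper):

```python
import math

# Upper bound quoted above: (sqrt(3) * pi / 2) * 4**(-b)
bound = lambda b: (math.sqrt(3) * math.pi / 2) * 4 ** (-b)
for b in range(1, 5):
    print(b, round(bound(b), 3))  # roughly 0.68, 0.17, 0.043, 0.011
```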

MSE-optimized quantizers are biased

The above idea gets you the optimal MSE quantizer. But they found out that those quantizers are biased when it comes to inner product error. Their solution is a two-stage algorithm: first apply the above idea with a bit-width one less than the target budget, then apply a 1-bit quantizer called Quantized Johnson-Lindenstrauss (QJL) to the residual error. I am not familiar with QJL, so I’ll keep it as a black box for now. But the essence is applying two quantization strategies: one with $(b-1)$ bits to get the optimal MSE quantizer, and one with 1 bit to fix the bias produced by the first.
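Here’s a very rough toy sketch of that two-stage structure (my own simplification: I use a uniform codebook for the first stage and a plain per-vector sign quantizer in place of QJL, both assumptions on my part):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 1024, 4

# Stage 0: the random rotation from before, shared by both stages.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
x = Q @ rng.normal(size=d)

# Stage 1: a (b-1)-bit scalar quantizer. Here I just use a uniform
# codebook as a placeholder for the MSE-optimal one.
levels = 2 ** (b - 1)
codebook = np.linspace(x.min(), x.max(), levels)
idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
stage1 = codebook[idx]

# Stage 2: a 1-bit quantizer on the residual (my stand-in for QJL).
residual = x - stage1
signs = np.sign(residual)
scale = np.mean(np.abs(residual))  # one scalar stored per vector
stage2 = signs * scale

# The final estimate combines both stages; the second stage is there to
# correct the bias the first stage introduces into inner-product estimates.
x_hat = stage1 + stage2
print(np.mean((x - x_hat) ** 2), np.mean((x - stage1) ** 2))
```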

#google #llm #quantization #turboquant