Google finds a way to shrink AI memory usage by 4.5x without losing accuracy
Written by Joseph Nordqvist · March 25, 2026 at 1:09 AM UTC
A new compression technique from Google Research could make AI models cheaper to run, faster to respond, and capable of handling much longer conversations.[1]
One of the biggest hidden costs of running a large language model is memory. Every time a chatbot processes a conversation, it keeps a running record of what it has already read and generated, stored in what researchers call the key-value cache. The longer the conversation or document, the larger this record grows, and it has to be loaded from memory for every single token the model produces.[2][3] This is a major reason why serving AI to millions of users simultaneously is so expensive.
Google Research today published TurboQuant, a compression technique that shrinks this memory record to roughly a quarter of its original size while maintaining the same output quality as an uncompressed model.[1][2] The technique will be presented at the ICLR 2026 machine learning conference.[1]
What the technique does
TurboQuant works by representing the numbers stored in that memory record using far fewer bits, a process called quantization. Where a standard model might use 16 bits to store each value, TurboQuant can use as few as 3.5 bits and still produce identical results on standard benchmarks.[2]
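To make the idea concrete, here is a minimal sketch of plain uniform quantization (an illustration of the general concept, not TurboQuant itself). Note the per-block scale and offset that a naive scheme must store alongside the compressed codes:

```python
import numpy as np

def uniform_quantize(x, bits):
    """Map each value to one of 2**bits levels spanning the block's range.
    The (scale, lo) pair must be kept to decode -- per-block bookkeeping."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float32)

codes, scale, lo = uniform_quantize(x, bits=4)   # 4 bits per value instead of 16
x_hat = dequantize(codes, scale, lo)

# Rounding error is bounded by half a quantization step.
print(np.abs(x - x_hat).max() <= scale / 2 + 1e-6)  # True
```

Storing each value with 4 bits instead of 16 is a 4x reduction before accounting for the scale and offset overhead, which is exactly the overhead the next paragraph discusses.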
The challenge with existing compression methods is that they need to measure the range of each small block of data and store that information alongside the compressed values, which eats into the space savings. TurboQuant avoids this by first applying a random rotation to each data vector. This redistributes the vector's values so that every coordinate follows the same known statistical distribution, regardless of the original input. Because that distribution is known in advance, the compression scheme can be designed once ahead of time rather than recalculated and stored for each block of data. The rotation itself is stored once and reversed during decompression, so no information is lost.[2]
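The effect of such a rotation can be demonstrated in a few lines. This sketch uses a dense random orthogonal matrix built via QR decomposition, a standard construction chosen here for clarity (production systems typically favor fast structured rotations, which this example does not reproduce):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 512

# A pathological input: all the energy sits in a single coordinate.
x = np.zeros(d)
x[0] = 100.0

# Random rotation: orthonormalize a Gaussian matrix via QR.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

y = Q @ x  # rotated vector

# The rotation preserves the vector's length (no information lost) while
# spreading the outlier's energy across all coordinates, so each coordinate
# now has a predictable scale determined by the norm alone.
print(round(float(np.linalg.norm(y)), 3))  # 100.0 -- norm preserved
print(bool(np.abs(y).max() < 100.0))       # True  -- outlier spread out
```

Because the rotated coordinates follow a known distribution regardless of the input, the quantizer never needs to measure or store a per-block range.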
TurboQuant is also published alongside two related techniques from the same group. QJL is a 1-bit compression method that TurboQuant uses internally to correct errors.[3] PolarQuant takes a different approach, converting data into a coordinate system (like describing a location by distance and compass direction rather than street grid coordinates) where the values naturally cluster in predictable ways.[4] All three methods share the property of eliminating the bookkeeping overhead that other compression techniques require.
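The polar idea behind PolarQuant can be illustrated on pairs of coordinates. This is a simplified sketch of the coordinate change only, not the paper's full quantization scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

# Treat a vector as consecutive (x, y) pairs and re-express each pair as a
# radius and an angle -- the "distance and compass direction" view.
v = rng.standard_normal(8).reshape(-1, 2)

radius = np.linalg.norm(v, axis=1)      # how far from the origin
angle = np.arctan2(v[:, 1], v[:, 0])    # which direction, in radians

# For Gaussian-like data the radii concentrate in a narrow band and the
# angles are uniform, so both are easy to quantize with fixed codebooks.

# The transform is exactly invertible, so no information is lost.
v_back = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
print(bool(np.allclose(v, v_back)))  # True
```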
Why it matters in practice
The practical implications are straightforward. Reducing cache memory by a factor of 4.5 directly eases one of the main bottlenecks in deploying large language models, particularly for long-context applications where the cache grows large.[2]
On a standard test where a model must find a single hidden sentence buried in a document up to 104,000 tokens long, TurboQuant matched the recall score of an uncompressed model (both scored 0.997 out of 1.0) while using only a quarter of the memory.[2] On a broader battery of tasks including question answering, summarization, and code generation, a TurboQuant-compressed version of Meta's Llama-3.1-8B-Instruct model scored identically to the uncompressed version.[2]
Crucially, TurboQuant requires no retraining. It can be applied to any existing model at inference time, and it works even on newly generated tokens during a conversation, not just the initial input.[2]
The technique also applies beyond chatbots. Vector search, the technology that powers semantic search and recommendation systems, relies on comparing huge databases of numerical representations. TurboQuant compressed these databases while outperforming established methods on retrieval accuracy, and because it needs no preprocessing, it completed indexing in under a thousandth of a second where competing methods required seconds to over an hour.[2]
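To see why search can run directly on compressed vectors, consider this toy example. It uses a crude 1-bit sign compression far simpler than TurboQuant, and the database, query, and item index are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy vector-search setup: a database of unit vectors and one noisy query.
db = rng.standard_normal((1000, 64))
db /= np.linalg.norm(db, axis=1, keepdims=True)
query = db[123] + 0.05 * rng.standard_normal(64)  # perturbed copy of item 123

# Keep only the sign of each coordinate: 1 bit per value, a 16x reduction
# relative to 16-bit storage (much cruder than TurboQuant's scheme).
db_bits = np.sign(db)

# Rank database items by inner product against the compressed vectors.
scores = db_bits @ query
print(int(np.argmax(scores)))  # 123 -- the right item still ranks first
```

The point is that inner products against heavily compressed vectors can still preserve the ranking that retrieval depends on; TurboQuant does this with far better accuracy guarantees than the sign trick shown here.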
What is new here, technically
The key insight is that randomly rotating high-dimensional data causes each value to follow a known statistical distribution, regardless of the original data.[2] This means optimal compression codebooks can be precomputed once and reused, rather than being recalculated for each batch of data. The result is a compression method that works instantly, with no training or calibration step.
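The precompute-once idea can be sketched with a classic Lloyd-Max scalar quantizer fit to the standard normal distribution. This is an illustrative stand-in for whatever codebook construction the paper actually uses; the 3-bit level count is an arbitrary choice for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Because rotated coordinates follow a known (Gaussian) distribution, a
# codebook can be fit ONCE on synthetic samples and reused for all data.
samples = rng.standard_normal(100_000)
levels = 8  # 3-bit codebook

# One-time Lloyd-Max fit: alternate between assigning samples to the
# nearest codeword and moving each codeword to its cluster's mean.
codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
for _ in range(50):
    edges = (codebook[:-1] + codebook[1:]) / 2
    idx = np.searchsorted(edges, samples)
    codebook = np.array([samples[idx == k].mean() for k in range(levels)])

def encode(x):
    """Quantize any rotated vector with the shared codebook -- no per-block
    statistics are computed or stored at encode time."""
    edges = (codebook[:-1] + codebook[1:]) / 2
    return np.searchsorted(edges, x).astype(np.uint8)

x = rng.standard_normal(1024)          # fresh data, never seen by the fit
x_hat = codebook[encode(x)]
print(bool(np.mean((x - x_hat) ** 2) < 0.05))  # True -- low distortion at 3 bits
```

Encoding is just a binary search against fixed thresholds, which is why a scheme like this needs no training or calibration pass over the user's data.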
The authors also prove mathematically that TurboQuant's compression quality is within a factor of 2.72 of the theoretical best any algorithm could achieve, and at very aggressive compression levels the gap narrows further.[2] These are worst-case guarantees, meaning the technique will not silently degrade on unusual inputs.
Written by
Joseph Nordqvist
Joseph founded AI News Home in 2026. He studied marketing and later completed a postgraduate program in AI and machine learning (business applications) at UT Austin’s McCombs School of Business. He is now pursuing an MSc in Computer Science at the University of York.
This article was written by the AI News Home editorial team with the assistance of AI-powered research and drafting tools. All analysis, conclusions, and editorial decisions were made by human editors.
References
1. TurboQuant: Redefining AI efficiency with extreme compression — Amir Zandieh, Vahab Mirrokni. Google Research Blog, March 24, 2026. (Primary)
2. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni. arXiv; ICLR 2026. (Primary)
3. QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead — Amir Zandieh, Majid Daliri, Insu Han. arXiv; AAAI 2025. (Primary)
4. PolarQuant: Quantizing KV Caches with Polar Transformation — Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, Amir Zandieh. arXiv; AISTATS 2026. (Primary)