Google Releases KV Cache Compression Technology, Impacting Storage Demand Expectations, US Storage Stocks Plunge Collectively

Wallstreetcn
2026.03.25 21:13

Google has introduced a new memory compression technology, TurboQuant, which compresses the key-value cache of large language models to 3 bits, achieving a 6x memory reduction and up to 8x acceleration. The release sparked market concerns about the outlook for storage demand, sending storage chip stocks such as SanDisk, Micron, and Western Digital tumbling on Wednesday. Morgan Stanley's analysis argues that the technology affects only the inference stage and does not reduce hardware demand; instead, by lowering deployment costs, it may unlock more AI application scenarios.

US storage chip stocks fell sharply on Wednesday. SanDisk dropped as much as 6.5%, Micron Technology fell 4%, Western Digital slipped over 4%, and Seagate Technology lost over 5%.

Google's release of its new AI memory compression technology, TurboQuant, has sparked market concerns about future storage demand. The technology can reportedly shrink the memory footprint of large language models' key-value caches by at least 6 times without losing accuracy, while delivering up to an 8x speedup, targeting the memory bottleneck in AI inference and vector search.

At the close of trading on Wednesday, the storage chip and hardware supply chain index fell 2.08% to 113.03 points, hitting an intraday low of 109 points. SanDisk and Micron saw declines of over 3.4%, Seagate Technology closed down 2.6%, and Western Digital narrowed its loss to 1.6%.

Google's TurboQuant Impacts Storage Demand

Google's TurboQuant is a memory compression technology designed for large language models and vector search engines, with the core objective of addressing the storage bottleneck in the key-value cache within AI systems.

According to Google's announcement, TurboQuant can compress key-value caches to 3 bits without requiring model training or fine-tuning, achieving a 6x reduction in key-value memory in practical tests on open-source models like Gemma and Mistral. On NVIDIA's H100 GPU accelerator, the algorithm achieved up to an 8x performance improvement compared to unquantized key-value schemes.

The technology achieves compression through a two-step process: first, it applies the PolarQuant method for high-quality compression of data vectors, then uses a quantized Johnson-Lindenstrauss algorithm to eliminate residual errors. Google notes that traditional vector quantization methods add 1 to 2 bits of memory overhead per number, partially offsetting compression gains, an inefficiency TurboQuant is designed to avoid.
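The two-step idea, coarse quantization followed by quantizing the residual error in a randomly rotated basis, can be sketched as follows. This is a conceptual illustration only, not Google's actual algorithm: the uniform scalar quantizer stands in for PolarQuant, and a random orthogonal rotation stands in for the quantized Johnson-Lindenstrauss transform; all function names and parameters here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_quantize(x, bits):
    """Uniform scalar quantizer over the array's range.
    Illustrative stand-in for PolarQuant (details are in the paper)."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo  # dequantized values

def two_stage_compress(v, bits=3):
    """Conceptual two-stage scheme: quantize coarsely, then quantize the
    residual error after a random rotation (JL-style stand-in)."""
    d = v.shape[0]
    stage1 = uniform_quantize(v, bits)
    residual = v - stage1
    # Random orthogonal rotation spreads the residual's energy evenly,
    # so a low-bit quantizer captures it well
    P, _ = np.linalg.qr(rng.normal(size=(d, d)))
    residual_q = uniform_quantize(P @ residual, bits)
    return stage1, P, residual_q

def reconstruct(stage1, P, residual_q):
    # Undo the rotation and add the quantized residual correction
    return stage1 + P.T @ residual_q

v = rng.normal(size=128)
s1, P, rq = two_stage_compress(v)
v_hat = reconstruct(s1, P, rq)
err_one_stage = np.linalg.norm(v - uniform_quantize(v, 3)) / np.linalg.norm(v)
err_two_stage = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

In this toy setup, the second stage shrinks reconstruction error relative to one-shot 3-bit quantization, which is the intuition behind using a residual-correction step; the real method's error guarantees and memory accounting differ.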

TurboQuant is scheduled for publication at ICLR 2026, while PolarQuant is planned for presentation at AISTATS 2026. Google has completed validation across multiple benchmark tests including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, and indicated that the technology is also applicable to vector retrieval scenarios in large-scale search engines.

Jevons Paradox Revisited? TurboQuant May Activate More AI Application Scenarios

Morgan Stanley pointed out that Google's TurboQuant technology only affects the key-value cache during the inference stage, does not impact the high-bandwidth memory (HBM) occupied by model weights, and is unrelated to training tasks.

Therefore, this does not mean a 6x reduction in total storage demand or hardware volume; rather, efficiency gains raise per-GPU throughput: the same hardware can support 4 to 8 times longer contexts, or significantly larger batches, without triggering memory overflows.
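The capacity arithmetic behind that claim can be sketched with a back-of-envelope KV-cache sizing calculation. The model configuration below (layer count, KV heads, head dimension) is a hypothetical Llama-style example, not taken from the article; note the raw 16-bit-to-3-bit ratio is 16/3 ≈ 5.3x, while reported end-to-end figures such as 6x depend on the baseline and on per-number overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV-cache size: 2x covers keys and values; bits -> bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value / 8

# Hypothetical config: 32 layers, 8 KV heads, head_dim 128, 32k context
seq = 32_768
fp16 = kv_cache_bytes(32, 8, 128, seq, 16)  # 16-bit baseline
q3 = kv_cache_bytes(32, 8, 128, seq, 3)     # 3-bit compressed

print(f"fp16: {fp16 / 2**30:.2f} GiB, 3-bit: {q3 / 2**30:.2f} GiB, "
      f"ratio: {fp16 / q3:.1f}x")
# With the same memory budget, the 3-bit cache supports a context
# (or batch) roughly 16/3 times larger than the fp16 baseline.
```

Under these assumptions, the fp16 cache needs 4 GiB at 32k context while the 3-bit cache needs 0.75 GiB, which is the mechanism by which the same GPU can hold several times more context or larger batches.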

Nevertheless, the storage sector has seen significant cumulative gains this year, and valuations were already under pressure. Any technological advancement that could potentially reduce hardware demand is sufficient to trigger a defensive market reaction. Morgan Stanley also warned that since this compression technology can be directly integrated into platform infrastructure, it may pose a marginal downside risk to software.

In its analysis, Morgan Stanley cited Jevons paradox, suggesting that efficiency improvements could paradoxically increase overall demand. The logic is that TurboQuant, by compressing data volume and transmission, significantly lowers the service cost per query, making AI deployment more profitable.

This implies that models originally reliant on cloud clusters could be migrated to run on local hardware, effectively lowering the barrier to large-scale AI deployment and thus activating more application scenarios, leading to increased utilization of existing infrastructure.

Morgan Stanley called TurboQuant a "breakthrough that reshapes the cost curve of AI deployment" and compared its impact to that of DeepSeek: a positive development for cloud service providers and model platforms, with considerable return on investment in long-context inference and retrieval-intensive applications. The bank assessed the long-term impact on computing power and memory hardware as "neutral to positive."