Tech Trends 2026 update: New compression tricks could change how enterprises think about AI infrastructure
New algorithms squeeze AI models dramatically smaller without sacrificing accuracy. Here's why that matters beyond the lab.
The popular story about AI progress has mostly been a story about scale. Bigger models, more data, more powerful chips. That story is still being written. But how AI applications use memory is increasingly shaping how businesses actually experience AI.
Specialized, high-speed GPU memory is extraordinarily expensive and physically limited. That’s one reason some businesses are rethinking their infrastructure decisions. As we described in this year’s Tech Trends report, enterprises scaling AI applications across the organization are discovering that their existing infrastructure strategies aren’t designed for AI’s demands. The big driver is the computational work an AI model does every time it responds to a query, known as inference. Recurring AI workloads mean near-constant inference, and for organizations using cloud-based AI services, that translates into frequent API calls and escalating costs. Some organizations are seeing monthly AI bills climb into the tens of millions of dollars.
Why do AI computations require so much memory? Every time an AI model processes your input, it needs to hold an enormous amount of information in fast-access memory. The component at the center of this is something called the key-value (KV) cache. This essentially functions as the model’s short-term memory. It’s a working record of the conversation so far, the documents you’ve shared, and the context the model needs to make sense of your question. The longer or more complex that context gets, the more GPU memory it demands.
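To get a feel for the scale involved, here is a back-of-the-envelope sketch of how the KV cache grows with context length and concurrency. The model dimensions below are illustrative assumptions, roughly in the range of a large open-weight model, not a measurement of any particular system.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Rough size of the KV cache: keys and values stored per layer, per head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = one key + one value
    return per_token * seq_len * batch_size

# Illustrative dimensions: 80 layers, 8 grouped KV heads, head dimension 128, FP16 values.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                      seq_len=32_000, batch_size=8)
print(f"~{size / 1e9:.0f} GB of GPU memory just for the cache")
```

With those assumptions, eight concurrent 32,000-token conversations consume on the order of 80 GB of GPU memory before a single model weight is loaded, which is why the cache, not the model itself, often becomes the binding constraint.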
The good news is that developers are working on solutions that use existing memory more efficiently to help AI models run faster and improve their capabilities.
New KV cache compression techniques make AI more resource-efficient
Many solutions focus on shrinking the footprint of the KV cache. Known as KV cache compression, this helps optimize memory usage and computational efficiency during inference. A few KV cache compression techniques exist. One of the most prevalent is quantization, the practice of representing data with less numerical precision. For many practical purposes, the less precise representation is good enough, and it takes up far less memory.
Traditional quantization methods have a catch, though. They introduce a kind of bookkeeping overhead, requiring extra data to track how the compression was applied. That overhead can eat up a significant chunk of the memory savings.
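As a rough illustration of where that overhead comes from, the sketch below quantizes one small block of values to 4 bits and stores a scale and zero-point alongside it so the block can be reconstructed. The block size and bit width are arbitrary choices for the example, not a description of any specific production scheme.

```python
import numpy as np

def quantize_block(x, bits=4):
    """Uniform per-block quantization: low-precision codes are stored alongside
    a scale and zero-point so the block can be reconstructed later."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)  # 4-bit codes (held in uint8 here for simplicity)
    return codes, scale, lo                              # scale and zero-point are the "bookkeeping"

def dequantize_block(codes, scale, zero_point):
    return codes.astype(np.float32) * scale + zero_point

block = np.random.randn(32).astype(np.float32)            # one 32-value block of cached activations
codes, scale, zero_point = quantize_block(block)

payload_bits = codes.size * 4                              # the compressed values themselves
overhead_bits = 2 * 16                                     # scale + zero-point stored in FP16
print(f"bookkeeping is {overhead_bits / (payload_bits + overhead_bits):.0%} of the compressed block")
print("max reconstruction error:", float(np.abs(block - dequantize_block(codes, scale, zero_point)).max()))
```

In this toy setup, the scale and zero-point account for roughly a fifth of the compressed block; smaller blocks track the data more accurately but push that fraction even higher.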
A new algorithm reduces that overhead substantially through a two-step process. Developed by Google researchers, TurboQuant first converts data into a more compact mathematical form. It then applies a single “error-correction” bit, using a separate technique to correct small inaccuracies introduced in the first step.
According to Google’s benchmarks, this approach can deliver up to an 8x speedup in certain computations and a 6x or greater reduction in memory usage, while maintaining full accuracy across a range of AI tasks, including question answering, summarization, and code generation.1
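The sketch below is not Google’s TurboQuant algorithm; it is only a toy illustration of the general two-step idea described above, in which a coarse first pass is followed by one extra bit per value that corrects part of the leftover error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)   # stand-in for a slice of cached values

# Step 1: coarse 2-bit quantization (four levels across the observed range).
lo, hi = float(x.min()), float(x.max())
scale = (hi - lo) / 3
coarse = np.round((x - lo) / scale) * scale + lo

# Step 2: spend one extra bit per value (the sign of the leftover error) and
# apply a shared correction magnitude, shrinking the error from step 1.
residual = x - coarse
corrected = coarse + np.sign(residual) * np.abs(residual).mean()

print("mean error, coarse step only:", float(np.abs(x - coarse).mean()))
print("mean error, with 1-bit fixup:", float(np.abs(x - corrected).mean()))
```

Even this crude version roughly halves the average error of the coarse pass, which hints at why a carefully designed second stage can recover most of the accuracy lost to aggressive compression.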
Meanwhile, an algorithm known as ChunkKV reimagines the basic approach to compression. Typically, to save memory, LLMs delete certain tokens in the KV cache that they deem unneeded. Developed by researchers at the Hong Kong University of Science and Technology, ChunkKV instead treats semantic chunks, rather than isolated tokens, as the basic unit of eviction, an approach that preserves linguistic context more faithfully under aggressive compression and improves throughput by 26.5%.2
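A toy sketch of the chunk-level idea follows. The scoring here is a stand-in (random importance values rather than real attention statistics, and not ChunkKV’s actual scoring rule), so it illustrates only the structural difference: whole contiguous chunks survive or are evicted together.

```python
import numpy as np

def keep_top_chunks(token_scores, chunk_size=16, keep_ratio=0.3):
    """Chunk-level KV eviction sketch: score contiguous chunks of tokens and keep
    the highest-scoring chunks, so whole phrases survive compression instead of
    scattered individual tokens."""
    n = len(token_scores)
    n_chunks = (n + chunk_size - 1) // chunk_size
    chunk_scores = [token_scores[i * chunk_size:(i + 1) * chunk_size].mean()
                    for i in range(n_chunks)]
    n_keep = max(1, int(n_chunks * keep_ratio))
    keep = np.argsort(chunk_scores)[-n_keep:]           # indices of chunks to retain
    mask = np.zeros(n, dtype=bool)
    for c in sorted(keep):
        mask[c * chunk_size:(c + 1) * chunk_size] = True  # keep every token in the chunk
    return mask                                           # True = token stays in the cache

scores = np.random.rand(128)                              # stand-in for attention-based importance
mask = keep_top_chunks(scores)
print(f"keeping {mask.sum()} of {mask.size} cached tokens, in contiguous chunks")
```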
Elsewhere, FlashAttention is not a KV cache compression approach at all, but an algorithm, now integrated into popular machine learning libraries, that optimizes how AI models read and write memory during attention computations. The latest version, FlashAttention-4, reports a roughly 20% speedup on recent GPU hardware.3 A separate research line, KVTC, uses a combination of dimensionality reduction, adaptive quantization, and entropy coding to achieve up to 20x compression while maintaining reasoning and long-context accuracy.4
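For readers curious what optimizing memory reads and writes during attention looks like in practice, the NumPy sketch below computes attention for a single query block by block over the keys, keeping only running statistics instead of materializing the full score matrix. This is the online-softmax idea that FlashAttention-style kernels build on; the real implementations are fused GPU kernels that also handle batching, multiple heads, and the backward pass.

```python
import numpy as np

def tiled_attention(q, K, V, block=64):
    """Attention for one query, computed block by block over the keys. Only
    running statistics (max, normalizer, weighted sum) are kept, so the full
    score matrix never has to sit in fast memory at once."""
    d = q.shape[-1]
    running_max = -np.inf
    denom = 0.0                        # running softmax normalizer
    acc = np.zeros(d)                  # running weighted sum of value rows
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)                   # scores for this block only
        new_max = max(running_max, s.max())
        rescale = np.exp(running_max - new_max)      # fold old accumulators into the new scale
        p = np.exp(s - new_max)
        denom = denom * rescale + p.sum()
        acc = acc * rescale + p @ v_blk
        running_max = new_max
    return acc / denom

# Sanity check against the straightforward computation that materializes all scores.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((512, 64))
V = rng.standard_normal((512, 64))
scores = K @ q / np.sqrt(64)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
assert np.allclose(tiled_attention(q, K, V), weights @ V)
```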
The field is moving fast, and the momentum is real.
Practical applications
Tech Trends 2026 highlights a significant shift underway in how enterprises are thinking about where to run their AI workloads. Cloud-first approaches are giving way to more hybrid strategies, partly for cost reasons, but also because of data sovereignty concerns, latency requirements, and the need for resilience. Real-time AI workloads demand proximity to data sources, especially in manufacturing environments and autonomous systems, where network latency prevents real-time decision-making.
The problem with running powerful AI models on local or edge infrastructure has always been that those environments have far less memory and compute to work with. Efficiency techniques change that calculus. A model that previously required a large GPU cluster to run might become viable on more modest hardware once its memory footprint shrinks by a factor of six or more.
This matters in a world where the majority of enterprises’ data still resides on premises, and organizations increasingly prefer bringing AI capabilities to their data rather than moving sensitive information to external AI services. Smaller, faster models are central to making that vision practical.
What this means for AI infrastructure
The compression techniques described above are just single data points in a larger story about the AI industry’s growing focus on efficiency. For the past few years, progress in AI has been all about scaling through bigger models, more data, and more compute. That approach has delivered remarkable results, but it has also run into physical and economic ceilings that are becoming increasingly hard to ignore.
As we noted in Tech Trends, inference costs have plummeted 280-fold over the last two years, yet enterprises are experiencing explosive growth in overall AI spending because usage has dramatically outpaced those efficiency gains. Emerging compression algorithms may drive a similar pattern. When technology advances in efficiency and capability, people often use more of it, not less.
For enterprises wrestling with AI infrastructure decisions right now, the key takeaway isn’t necessarily to wait for emerging compression techniques. It’s that the software side of AI is advancing rapidly alongside the hardware side. Decisions about infrastructure strategy should account for both. A hybrid architecture that looks barely feasible today, constrained by memory and compute requirements, may look quite different once another generation of compression and optimization techniques filters down from research labs into production systems.
Join our community to get emerging tech insights every week.
Have you seen similar trends in your industry? Share your thoughts in the comments below.
If this post resonated, forward it to colleagues who would also find the insights helpful!
Ed Burns | Editor | Office of the CTO
This article contains general information only and Deloitte is not, by means of this article, rendering accounting, business, financial, investment, legal, tax, or other professional advice or services. This article is not a substitute for such professional advice or services, nor should it be used as a basis for any decision or action that may affect your business. Before making any decision or taking any action that may affect your business, you should consult a qualified professional advisor. Deloitte shall not be responsible for any loss sustained by any person who relies on this article.
As used in this document, “Deloitte” means Deloitte Consulting LLP, a subsidiary of Deloitte LLP. Please see www.deloitte.com/us/about for a detailed description of our legal structure. Certain services may not be available to attest clients under the rules and regulations of public accounting.
Copyright © 2026 Deloitte Development LLC. All rights reserved.




