
Big artificial intelligence models are known for using enormous amounts of memory and energy.
But a new study suggests that shrinking part of an AI model’s memory can actually make it perform better, not worse.
Researchers from the University of Edinburgh and NVIDIA have developed a new technique that allows large language models to reason more accurately while using far less memory.
The team found that models whose working memory had been shrunk to one-eighth of its usual size performed better on difficult tasks such as mathematics, science, and computer coding.
Even more surprisingly, the models achieved these gains without taking extra time to think. In some cases, they reasoned faster while delivering higher-quality answers.
Large language models work by generating step-by-step reasoning as text. As the model writes, an internal record of everything it has produced so far is kept in a part of memory known as the KV cache.
This memory helps the model remember what it has already “thought about” while producing an answer.
However, as problems become more complex, the amount of memory needed grows quickly. Retrieving large amounts of stored information slows the model down and increases energy use.
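To see why this memory balloons, here is a rough back-of-the-envelope sketch in Python. The layer, head, and dimension sizes below are invented for illustration only; real models differ, and the cache actually stores numerical attention “keys” and “values” rather than text.

```python
# Toy illustration of KV cache growth: for every token the model generates,
# it stores a key vector and a value vector in each layer and attention head
# so later steps can "look back" at it. All sizes here are made up.
NUM_LAYERS = 32
NUM_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit floating point

def kv_cache_bytes(num_tokens: int) -> int:
    """Memory needed to remember `num_tokens` of reasoning so far."""
    per_token = NUM_LAYERS * NUM_HEADS * HEAD_DIM * 2 * BYTES_PER_VALUE  # keys + values
    return num_tokens * per_token

for tokens in (1_000, 8_000, 32_000):
    print(f"{tokens:>6} tokens -> {kv_cache_bytes(tokens) / 1e9:.1f} GB of KV cache")
```

With these invented numbers, a short answer of a thousand tokens needs about half a gigabyte of cache, while a long chain of reasoning of thirty-two thousand tokens needs nearly seventeen gigabytes, which is why longer thinking quickly runs into memory limits.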
The researchers discovered that the KV cache can become a bottleneck. When the memory grows too large, it takes longer for the AI system to access it, which limits how many ideas or solution paths the model can explore at once. This is where the new approach comes in.
The technique, called Dynamic Memory Sparsification, carefully compresses the model’s memory during reasoning.
Instead of storing every single piece of information, the system decides which parts are most important and removes less useful data. Importantly, the process includes a short delay before deleting information, giving the model time to pass along any valuable insights to the remaining memory. This prevents the loss of key reasoning details.
By freeing up memory in this way, the AI can explore more possible solutions or follow longer chains of thought without needing extra computing power. In simple terms, the model becomes better at focusing on what matters most.
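The paper's method learns, inside the model itself, which memory entries to compress or drop. The toy Python sketch below is not that method; it only imitates the general idea with a made-up importance score and a fixed eviction delay, to show how a cache can stay small without immediately discarding recent information.

```python
class SparsifiedCache:
    """Toy cache that keeps the most useful entries and evicts the rest,
    but only after a short delay. A greatly simplified stand-in for the
    learned compression described in the article, not the real method."""

    def __init__(self, budget: int, delay: int):
        self.budget = budget   # how many entries we may keep long-term
        self.delay = delay     # new entries are protected for this many steps
        self.entries = []      # list of (step_added, importance, data)
        self.step = 0

    def add(self, data, importance: float):
        self.entries.append((self.step, importance, data))
        self.step += 1
        self._evict()

    def _evict(self):
        # Entries younger than `delay` steps are never removed, so recent
        # information has time to pass its contribution along before deletion.
        protected = [e for e in self.entries if self.step - e[0] < self.delay]
        evictable = [e for e in self.entries if self.step - e[0] >= self.delay]
        # Among older entries, keep only the highest-importance ones.
        keep = max(self.budget - len(protected), 0)
        evictable.sort(key=lambda e: e[1], reverse=True)
        self.entries = protected + evictable[:keep]

cache = SparsifiedCache(budget=4, delay=2)
for i in range(10):
    cache.add(data=f"token_{i}", importance=(i % 3) / 2)  # made-up scores
print([e[2] for e in cache.entries])  # only a handful of entries survive
```

The key design choice mirrored here is the grace period: nothing is deleted the moment it is written, so the cache stays compact without throwing away information the model has not yet finished using.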
The researchers tested this method on different versions of popular AI models and compared them with standard models that used full memory. The compressed models matched or exceeded the original performance across a range of difficult benchmarks.
On a demanding mathematics test used to qualify students for the US Mathematical Olympiad, the compressed models scored significantly higher. They also showed strong improvements on advanced science questions written by experts and on tests that measure how well AI systems can write computer code.
Beyond better performance, the method offers major energy savings. With smaller memory requirements, AI systems can handle more user requests at the same time while using less power per task. This could be especially useful for devices with limited hardware, such as smart home gadgets or wearable technology.
The researchers say their findings point toward a future where AI systems are not just more powerful, but also more efficient and sustainable.
The work was presented at a major AI research conference, and the team plans to continue exploring new ways to help AI systems remember and reason more effectively—by using less, not more.


