
Training large artificial intelligence (AI) models takes enormous computing power.
Often, multiple servers must work together to process data and update the model's internal parameters, an approach called distributed deep learning.
But when these servers are in a shared cloud environment, like those used by many businesses and researchers, communication delays can slow everything down.
A team of researchers, led by the University of Michigan, may have found a way to fix this problem by rethinking how servers talk to each other during training.
The system they developed is called OptiReduce.
Unlike traditional methods that wait for every single server to report its data before moving on, OptiReduce sets time limits.
Once that time is up, it continues—even if some servers haven’t responded. This approach can lose a small amount of data, but OptiReduce cleverly estimates the missing information and keeps the AI model learning effectively.
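To make the idea concrete, here is a minimal Python sketch of what bounded-wait averaging could look like. It is an illustration only, not OptiReduce's actual code: the function names, the fixed deadline, and the simple fill-in strategy (treating late workers as if they had sent the average of the on-time gradients) are all assumptions made for the example.

```python
import numpy as np

def bounded_wait_average(worker_gradients, arrival_times_ms, deadline_ms):
    """Average only the gradients that arrive before the deadline.

    Late workers are treated as if they had sent the on-time average,
    so the result is simply the mean of whatever arrived in time.
    """
    on_time = [g for g, t in zip(worker_gradients, arrival_times_ms)
               if t <= deadline_ms]
    if not on_time:
        raise RuntimeError("no gradients arrived before the deadline")
    return np.mean(on_time, axis=0)

# Toy run: 4 workers, one straggler that misses a 10 ms deadline.
rng = np.random.default_rng(0)
grads = [rng.normal(size=8) for _ in range(4)]
times_ms = [2.0, 3.5, 4.1, 25.0]
print(bounded_wait_average(grads, times_ms, deadline_ms=10.0))
```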
The researchers presented their work at the USENIX Symposium on Networked Systems Design and Implementation in Philadelphia. Their idea challenges a long-standing assumption in AI training: that perfect communication between servers is necessary.
According to lead author Muhammad Shahbaz, an assistant professor of computer science at U-M, the approach is inspired by how computing evolved from general-purpose CPUs to specialized GPUs that are much better at handling AI tasks.
OptiReduce brings that same idea to communication by creating a system tailored for deep learning, rather than relying on one-size-fits-all data sharing.
Traditionally, distributed learning systems wait for the slowest server to finish before moving to the next step. This creates bottlenecks—much like a group of hikers constantly waiting for the slowest person to catch up. OptiReduce, instead, allows the group to move ahead after a set time. It adapts these time limits depending on how busy the network is—longer when there’s more congestion, shorter when things are quiet.
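A rough sketch of how such an adaptive deadline could be computed follows. Again, this is an assumed illustration rather than the published system: it simply tracks a running average and spread of recent round times and sets the next deadline slightly above them, so deadlines stretch under congestion and shrink when the network is quiet.

```python
class AdaptiveDeadline:
    """Illustrative adaptive timeout (not OptiReduce's implementation)."""

    def __init__(self, alpha=0.2, slack=2.0, initial_ms=10.0):
        self.alpha = alpha      # smoothing factor for the running estimates
        self.slack = slack      # headroom above the typical round time
        self.mean = initial_ms  # smoothed round duration (ms)
        self.var = 0.0          # smoothed squared deviation

    def update(self, observed_ms):
        """Fold one observed round duration into the running estimates."""
        diff = observed_ms - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)

    def deadline(self):
        """Longer deadline when rounds are slow, shorter when they are quick."""
        return self.mean + self.slack * (self.var ** 0.5)

# Toy usage: a burst of congestion pushes round times up, and the deadline follows.
d = AdaptiveDeadline()
for t in [8, 9, 30, 35, 33, 10, 9]:
    d.update(t)
    print(f"observed {t:5.1f} ms -> next deadline {d.deadline():.1f} ms")
```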
In tests run on both private and public cloud systems, OptiReduce trained AI models up to 70% faster than Gloo and 30% faster than NCCL, two widely used communication libraries. Even with about 5% of the data lost during training, large AI models such as Llama 4, Mistral 7B, and Falcon still reached their target accuracy. Smaller models were more affected by the data loss, but overall, the results showed that perfect communication isn't always necessary.
The team sees this as the first step in building even faster systems. They’re now exploring ways to take the concept further by moving communication improvements from software to hardware, aiming for ultra-fast data transfer speeds.
With support from NVIDIA, VMware Research, and Feldera, this project could help reshape how AI models are trained—making the process faster, cheaper, and more efficient in the cloud.