[DeepMind] Decoupled DiLoCo: A New Frontier in Distributed AI Training

Our new distributed architecture helps train LLMs across distant data centers using less bandwidth and with greater hardware resilience. Traditionally, training a frontier AI model relies on a large, tightly coupled system in which identical chips must stay in near-perfect synchronization. This approach is effective for state-of-the-art models, but maintaining synchronization across thousands of chips becomes a significant logistical challenge as we look toward future generations of scale.

In a new paper, we are excited to introduce an approach called Decoupled DiLoCo (Distributed Low-Communication). By dividing large training runs across decoupled "islands" of compute, with asynchronous data flowing between them, this architecture isolates local disruptions so that other parts of the system can continue learning efficiently. The result is a more resilient and flexible way to train advanced models across globally distributed data centers. Crucially, Decoupled DiLoCo does not suffer from the communication delays that made previous distributed methods, such as conventional data-parallel training, impractical at global scale.

Figure 1: Decoupling training runs into separate "islands" of compute (learner units) allows largely uninterrupted training despite hardware failures, as the effects of those failures are isolated.

Decoupled DiLoCo builds on two earlier advances: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which dramatically reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations. Combining these ideas allows for more flexible AI model training at scale.
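To make the low-communication idea concrete, here is a minimal Python sketch, not the implementation from the paper. It assumes a toy least-squares problem, four hypothetical learners, plain gradient descent for the local steps, and simple averaging for the shared update; these are stand-ins chosen for brevity, and the names `local_steps` and `low_comm_train` are made up for illustration. Each learner trains on its own for `inner_steps` steps and only then exchanges the difference between its weights and the shared weights.

```python
import numpy as np

def local_steps(params, X, y, steps, lr):
    """Run `steps` plain gradient-descent steps on a least-squares loss."""
    w = params.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def low_comm_train(shards, dim, rounds=20, inner_steps=50, inner_lr=0.05, outer_lr=1.0):
    """Each round: every learner trains locally, then one exchange updates the shared weights."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        # Per-learner delta: how far local training moved away from the shared weights.
        deltas = [
            global_w - local_steps(global_w, X, y, inner_steps, inner_lr)
            for X, y in shards
        ]
        # One communication step per round: average the deltas and apply them.
        global_w = global_w - outer_lr * np.mean(deltas, axis=0)
    return global_w

# Toy data (assumption): four learners, each with its own noiseless shard.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
shards = []
for _ in range(4):
    X = rng.normal(size=(200, 3))
    shards.append((X, X @ true_w))

w = low_comm_train(shards, dim=3)
print(np.round(w, 4))
```

With the defaults, the learners exchange weights 20 times in total, where synchronizing after every step would take 1,000 exchanges (20 rounds of 50 steps).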

Built on Pathways, Decoupled DiLoCo enables asynchronous training across separate islands of compute (learner units) so that a chip failure in one area does not interrupt the progress of others. This infrastructure is also self-healing. In testing, we employed a method called "chaos engineering" to introduce artificial hardware failures during training runs. Decoupled DiLoCo continued the training process after the loss of entire learner units and seamlessly reintegrated them when they came back online.
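The failure isolation described here can be pictured with a small simulation. This is a hedged sketch under stated assumptions, not the system's actual mechanism: the model is a single scalar weight, failures are random and independent, and `local_update` and `train_with_chaos` are hypothetical names. In each round, the shared weight is updated from whichever learner units happen to be alive, so a failed unit costs its own contribution but does not stall the rest.

```python
import random

def local_update(w, target, steps=10, lr=0.1):
    """A few gradient steps on the toy loss (w - target)**2."""
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)
    return w

def train_with_chaos(n_units=4, rounds=30, fail_prob=0.3, target=3.0, seed=0):
    """Train a shared scalar weight while learner units randomly drop out."""
    rng = random.Random(seed)
    w_global = 0.0
    stalled_rounds = 0
    for _ in range(rounds):
        # Injected failures (assumption): each unit is down with probability fail_prob.
        alive = [u for u in range(n_units) if rng.random() >= fail_prob]
        if not alive:
            # Only when every unit is down does the round make no progress.
            stalled_rounds += 1
            continue
        # Surviving units train from the current shared weight and report their deltas.
        deltas = [w_global - local_update(w_global, target) for _ in alive]
        w_global -= sum(deltas) / len(deltas)
    return w_global, stalled_rounds

final_w, stalled = train_with_chaos()
print(final_w, stalled)
```

A unit that was down simply starts its next round from the current shared weight, which is a simplified stand-in for the reintegration described above. The real system is asynchronous, while this sketch proceeds in rounds for readability.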

Testing Decoupled DiLoCo with Gemma 4 models demonstrated that when hardware fails, the system maintains greater availability of learning clusters than more traditional training methods while ultimately delivering the same benchmarked level of machine learning (ML) performance.

Figure 2: Left: The Decoupled DiLoCo approach requires orders of magnitude less bandwidth than conventional training methods, making it highly efficient. Middle: With increasing levels of hardware failure, Decoupled DiLoCo continues to deliver a high level of "goodput," or useful training, while that of other approaches nosedives. Right: In real-world experiments, the benchmarked ML performance of Gemma 4 models trained using Decoupled DiLoCo equaled that attained with conventional training approaches.
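A back-of-the-envelope model, not the paper's methodology, shows why goodput can diverge so sharply. Assume each of `n_units` learner units is independently available a fraction `p_up` of the time. In a tightly coupled run, all units must be up for any progress to happen; in a decoupled run, each available unit keeps contributing. Both functions and the independence assumption are illustrative only.

```python
def coupled_goodput(p_up: float, n_units: int) -> float:
    """Fraction of time a lockstep run makes progress: every unit must be up at once."""
    return p_up ** n_units

def decoupled_goodput(p_up: float, n_units: int) -> float:
    """Expected fraction of units doing useful work when each one proceeds on its own."""
    # Expected number of available units divided by the total; n_units cancels out.
    return (n_units * p_up) / n_units

for p_up in (0.99, 0.95, 0.90):
    print(p_up, round(coupled_goodput(p_up, 8), 3), round(decoupled_goodput(p_up, 8), 3))
```

Under these assumptions, with eight units that are each up 95% of the time, the lockstep run makes progress about 66% of the time, while the decoupled run keeps about 95% of its capacity working. The model ignores restart costs and any quality effects of training with fewer units.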

Decoupled DiLoCo is not only more resilient to failures but also practical for executing production-level, fully distributed pre-training. We successfully trained a 12 billion parameter model across four separate U.S. regions using 2-5 Gbps of wide-area networking, a level achievable with existing internet connectivity between data centers, without requiring new custom network infrastructure. Notably, the system achieved this training result more than 20 times faster than conventional synchronization methods. This is because our system overlaps the required communication with longer periods of computation, avoiding the "blocking" bottlenecks where one part of the system must wait for another.
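The arithmetic behind the bandwidth figure can be sketched as follows. The 12 billion parameters and the 2-5 Gbps range come from the text; everything else is an assumption for illustration: 2 bytes per parameter, one full uncompressed copy of the weights per exchange, no protocol overhead, and a hypothetical 300 seconds of computation between exchanges. `transfer_seconds` and `wait_fraction` are illustrative helpers, not part of any library.

```python
def transfer_seconds(n_params: float, bytes_per_param: int, link_gbps: float) -> float:
    """Time to send one full copy of the weights over a link of the given speed."""
    bits = n_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)

def wait_fraction(transfer_s: float, compute_s: float, overlapped: bool) -> float:
    """Fraction of wall-clock time spent waiting on the network."""
    if overlapped:
        # Communication runs in the background; only the excess beyond compute blocks.
        wait = max(0.0, transfer_s - compute_s)
    else:
        # Blocking exchange: training pauses for the whole transfer.
        wait = transfer_s
    return wait / (compute_s + wait)

N_PARAMS = 12e9      # from the text
BYTES_PER_PARAM = 2  # assumption: 16-bit weights
COMPUTE_S = 300.0    # assumption: seconds of local compute between exchanges

for gbps in (2.0, 5.0):
    t = transfer_seconds(N_PARAMS, BYTES_PER_PARAM, gbps)
    print(gbps, t, wait_fraction(t, COMPUTE_S, False), wait_fraction(t, COMPUTE_S, True))
```

Under these assumptions, one exchange takes 96 seconds at 2 Gbps and about 38 seconds at 5 Gbps. If the transfer blocks training, roughly 24% of wall-clock time at 2 Gbps is spent waiting; if it runs in the background of the 300-second compute window, the wait drops to zero.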

At Google, we take a full-stack approach to AI training, spanning hardware, software infrastructure, and research. Increasingly, gains are coming from rethinking how these layers fit together. Decoupled DiLoCo is one example. By enabling training jobs at internet-scale bandwidth, it can tap any unused compute wherever it sits, turning stranded resources into useful capacity.

Beyond efficiency and resilience, this training paradigm also unlocks the ability to mix different hardware generations, such as TPU v6e and TPU v5p, in a single training run. This approach not only extends the useful life of existing hardware but also increases the total compute available for model training. In our experiments, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs, ensuring that even older hardware can meaningfully accelerate AI training. Furthermore, because new generations of hardware don’t arrive everywhere all at once, being able to train across generations can alleviate recurring logistical and capacity bottlenecks.

As we push the frontiers of AI infrastructure today, we’re continuing to explore approaches to building the resilient systems needed to unlock the next generation of AI.

Blogger's Review: The Decoupled DiLoCo method is undoubtedly an innovative solution to the challenges of future AI training. By reducing bandwidth requirements and increasing fault tolerance, it opens up new avenues for distributed training. This flexible architecture not only puts existing resources to effective use but also extends the useful life of older hardware as fleets are upgraded, making more efficient use of compute overall.
