[DeepMind] VaultGemma: The Most Capable Differentially Private LLM

We introduce VaultGemma, the most capable model trained from scratch with differential privacy. As AI becomes more integrated into our lives, building it with privacy at its core is a critical frontier for the field. Differential privacy (DP) offers a mathematically sound solution by adding calibrated noise to prevent memorization. However, applying DP to LLMs introduces trade-offs.
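The idea of "calibrated noise" can be made concrete with a minimal sketch of one DP-SGD-style update, a common way to train with differential privacy. This is an illustration under stated assumptions, not VaultGemma's training code: the function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and gradients are plain Python lists. Each example's gradient is clipped to a fixed norm so no single example can dominate, and Gaussian noise scaled to that norm is added before averaging.

```python
import math
import random


def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return a noisy average gradient from a list of per-example gradients.

    Hypothetical helper for illustration; all names are assumptions.
    """
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        # Clip: rescale the gradient so its L2 norm is at most clip_norm.
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        for i, x in enumerate(grad):
            total[i] += x * scale
    # Noise is calibrated to clip_norm, the most one example can contribute.
    sigma = noise_multiplier * clip_norm
    return [(t + rng.gauss(0.0, sigma)) / batch_size for t in total]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
noisy_grad = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Setting `noise_multiplier` to zero leaves only the clipping, which makes the two stages easy to inspect separately.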

Understanding these trade-offs is crucial. Applying DP noise alters traditional scaling laws (the rules describing performance dynamics) in two ways. It reduces training stability, meaning the model's ability to learn consistently without catastrophic events such as loss spikes or divergence. It also significantly increases the required batch size (the number of training examples sent to the model together for processing) and the computation cost.
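One way to see why DP training favors large batches is a rough sketch that assumes noise is added once per step to the summed, clipped gradients and then divided by the batch size. The function `averaged_noise_std` and all values below are hypothetical, not figures from the VaultGemma work. The sketch also holds the noise multiplier fixed, which a real privacy accountant would not do, since the required noise depends on batch size, step count, and the privacy budget.

```python
def averaged_noise_std(noise_multiplier, clip_norm, batch_size):
    """Per-coordinate std of the Gaussian noise after averaging over the batch.

    Hypothetical helper; assumes noise with std noise_multiplier * clip_norm
    is added to the gradient sum before dividing by batch_size.
    """
    return noise_multiplier * clip_norm / batch_size


# Illustrative batch sizes only: the noise on the averaged gradient
# shrinks in proportion to 1 / batch_size.
for batch_size in (1_024, 16_384, 262_144):
    print(batch_size, averaged_noise_std(1.0, 1.0, batch_size))
```

Under these assumptions, each 16x increase in batch size cuts the noise on the averaged gradient by 16x, at the price of processing 16x more examples per step.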

Our new research, "Scaling Laws for Differentially Private Language Models," conducted in partnership with Google DeepMind, establishes laws that accurately model these intricacies, providing a complete picture of the compute-privacy-utility trade-offs.

VaultGemma is the largest open model (1B parameters) trained from scratch with differential privacy. We are releasing the weights on Hugging Face and Kaggle, alongside a technical report, to advance the development of the next generation of private AI.

Understanding the Scaling Laws

With a carefully thought-out experimental methodology, we aimed to quantify the benefit of increasing model sizes, batch sizes, and iterations in the context of DP training. Our work required making some simplifying assumptions to overcome the exponential number of combinations one might consider trying.

Key Findings: A Powerful Synergy

Before diving into the full scaling laws, it’s useful to understand the dynamics and synergies between the compute budget, privacy budget, and data budget from a privacy accounting perspective. This analysis is significantly cheaper to do as it does not require any model training, yet it yields a number of useful insights.

Applying the Scaling Laws to Build VaultGemma

The Gemma models are designed with responsibility and safety at their core. This makes them a natural foundation for developing a production-quality, DP-trained model like VaultGemma. We used the scaling laws to determine both how much compute we needed to train a compute-optimal 1B parameter Gemma 2-based model with DP, and how to allocate that compute among batch size, iterations, and sequence length to achieve the best utility.

Results

Armed with our new scaling laws and advanced training algorithms, we built VaultGemma, to date the largest open model (1B parameters) fully pre-trained with differential privacy. The final training loss of VaultGemma was remarkably close to what our equations predicted, validating our research and providing the community with a reliable roadmap for future private model development.

Blogger's Review: The release of VaultGemma advances the use of differential privacy in large language models and points to future directions for AI development. As demand for privacy protection grows, the VaultGemma research offers a theoretical foundation and practical guidance for privacy design in other AI systems. Its scaling laws should help optimize training and support further development and application of private AI.
