Google DeepMind has introduced a new language model called VaultGemma, designed with a focus on privacy. It is the largest open model to date trained from scratch with differential privacy, containing 1 billion parameters.
Large language models normally risk memorizing parts of their training data, including sensitive information such as names, addresses, or entire documents. Differential privacy counters this by adding carefully calibrated random noise during training, which mathematically limits how much any single training example can influence the model, so its outputs cannot be traced back to specific records. In theory, even if VaultGemma were trained on confidential documents, those documents could not be reconstructed from the model later.
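The standard way to achieve this in practice is differentially private stochastic gradient descent (DP-SGD): each example's gradient is clipped to a fixed norm and Gaussian noise is added before the model is updated. The sketch below illustrates that core step; the function name, hyperparameters, and toy data are illustrative assumptions, not VaultGemma's actual training setup.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0,
                noise_multiplier=1.0, lr=0.1):
    """One illustrative DP-SGD update: clip per-example gradients, add noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Clipping bounds how much any single example can move the model.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    grad_sum = np.sum(clipped, axis=0)
    # Gaussian noise calibrated to the clipping bound masks each example's
    # contribution; this is what yields the differential-privacy guarantee.
    noise = np.random.normal(0.0, noise_multiplier * clip_norm,
                             size=grad_sum.shape)
    noisy_mean = (grad_sum + noise) / len(per_example_grads)
    return params - lr * noisy_mean

# Toy usage: three per-example gradients for a two-parameter model.
params = np.zeros(2)
grads = [np.array([0.5, -0.2]), np.array([3.0, 1.0]), np.array([-0.1, 0.4])]
params = dp_sgd_step(params, grads)
print(params)
```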
According to Google, early tests show no detectable reproduction of training data. The tradeoff is performance: VaultGemma's capabilities are roughly comparable to those of non-private LLMs released about five years ago.
The model weights are openly available on Hugging Face and Kaggle.