Today, we introduce DiffusionGemma, an experimental open model that explores text diffusion, an exceptionally fast approach to text generation. Released under an Apache 2.0 license, this 26B Mixture of Experts (MoE) model moves beyond the sequential token-by-token processing of typical autoregressive Large Language Models (LLMs). Instead, it generates entire blocks of text simultaneously, delivering up to 4x faster text generation on GPUs.
Built upon the industry-leading intelligence-per-parameter of our Gemma 4 family and cutting-edge Gemini Diffusion research, DiffusionGemma integrates a novel diffusion head designed to maximize generation speed. While autoregressive Gemma 4 models remain the standard for high-quality production outputs, DiffusionGemma is designed for researchers and developers exploring speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures.
Unlocking New Value for Developers
Developers building real-time interactive AI applications often struggle with the latency bottlenecks of local inference. DiffusionGemma targets these bottlenecks directly, with trade-offs described in the section that follows the list:
- Blazing Fast Inference: By shifting the decode bottleneck from memory bandwidth to compute, DiffusionGemma delivers up to 4x faster token generation on dedicated GPUs (1000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on an NVIDIA GeForce RTX 5090).
- Accessible Hardware Footprint: DiffusionGemma is a Mixture of Experts (MoE) model with 26B total parameters, of which only 3.8B are active during inference. When quantized, it fits comfortably within the 18GB VRAM limits of high-end dedicated consumer GPUs.
- Bi-directional Attention: Generating 256 tokens in parallel with each forward pass allows every token to attend to all others, providing significant advantages for non-linear domains such as in-line editing, code infilling, amino acid sequences, or mathematical graphs.
- Intelligent Self-Correction: The model iteratively refines its own output. Because it evaluates the entire text block at once, it can fix mistakes in real time.
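To put the 18GB figure in context, here is a back-of-the-envelope sketch of how much memory the weights alone would need at different precisions. The 26B and 18GB numbers come from the list above; the bit widths are illustrative assumptions, since the announcement does not say which quantization format is used. The estimate assumes all 26B parameters stay resident in VRAM even though only 3.8B are active per token, and it ignores activations, caches, and runtime overhead.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    bytes_total = total_params * bits_per_weight / 8
    return bytes_total / 1e9


TOTAL_PARAMS = 26e9   # total MoE parameters, from the announcement
VRAM_BUDGET_GB = 18   # VRAM figure, from the announcement

# Bit widths below are assumptions for illustration, not published settings.
for bits in (16, 8, 4):
    gb = weight_footprint_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if gb <= VRAM_BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB ({verdict} in {VRAM_BUDGET_GB} GB)")
```

Under these assumptions, 16-bit weights need 52 GB and 8-bit weights need 26 GB, while 4-bit weights need 13 GB. That suggests, but does not confirm, that the 18GB claim relies on quantization at or near 4 bits per weight.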
Experimental Status & Production Recommendations
Because it prioritizes speed and parallel layout generation, DiffusionGemma's overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4. You can improve DiffusionGemma's performance on specific tasks through fine-tuning. For instance, Unsloth fine-tuned DiffusionGemma to play Sudoku, a task that autoregressive models struggle with because each token can depend on tokens that have not been generated yet. DiffusionGemma's bi-directional attention makes this much easier.
Why Diffusion for Text?
While the AI research community has explored diffusion-based text generation for years, applying it to large models has remained a challenge. DiffusionGemma changes this by shifting how models use hardware.
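To make the idea of parallel, iterative refinement concrete, here is a deliberately tiny sketch. It is not DiffusionGemma's algorithm: `toy_denoiser` is a hypothetical stand-in that already knows the target sentence, and the rule that a token resolves once a neighbour is resolved is an invented substitute for learned bi-directional context. What it does show is the control flow: every position in the block is re-predicted on each pass, and early low-confidence guesses are overwritten later.

```python
MASK = "<mask>"
GUESS = "<?>"  # placeholder for a low-confidence guess
TARGET = "the quick brown fox jumps over the lazy dog".split()


def is_resolved(token):
    """A position counts as resolved once it holds a real token."""
    return token not in (MASK, GUESS)


def toy_denoiser(block):
    """Hypothetical stand-in for one forward pass over the whole block.

    Proposes a token for every position at once. A position resolves if it
    is an "easy" anchor (every 4th token) or if a neighbour on either side
    is already resolved. A real model would predict from learned weights.
    """
    proposals = []
    for i in range(len(block)):
        left = i > 0 and is_resolved(block[i - 1])
        right = i + 1 < len(block) and is_resolved(block[i + 1])
        if i % 4 == 0 or left or right:
            proposals.append(TARGET[i])
        else:
            proposals.append(GUESS)
    return proposals


def refine_block(block_size, max_passes=16):
    """Start fully masked, then rewrite every position in parallel each pass."""
    block = [MASK] * block_size
    history = []
    for _ in range(max_passes):
        new_block = toy_denoiser(block)
        if new_block == block:
            break  # nothing changed, so the block has converged
        block = new_block
        history.append(" ".join(block))
    return block, history


final, history = refine_block(len(TARGET))
for n, snapshot in enumerate(history, start=1):
    print(f"pass {n}: {snapshot}")
```

In this toy, the nine-token block settles in three passes, where token-by-token decoding would take nine sequential steps. The pass count is a property of the toy rule only and says nothing about how many refinement steps DiffusionGemma uses.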
Conclusion
DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can already be batched to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs. Developers can download the experimental model weights from Hugging Face and serve them with tools such as MLX, vLLM, and Hugging Face Transformers.
Blogger's Review: DiffusionGemma marks a significant step forward in text generation, especially in speed and parallel processing, and it greatly improves the efficiency of local inference. The ability to fine-tune it for specific tasks also gives developers more options, and its performance in real-world applications is worth watching.