In modern deep learning, AdamW serves as the default optimizer, yet its first- and second-moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares second-moment estimates across parameter blocks and quantizes the first moment using a learned codebook. This reduces AdamW's optimizer memory footprint by ~8x, a saving of about 6.5 GiB per billion parameters, while maintaining the same performance. The method is motivated by a theoretical result indicating that large mixed Hessian entries constrain the ratio of squared gradients towards one, suggesting that Hessian-aligned parameters are natural candidates for sharing second-moment statistics. Because computing Hessians is impractical at scale, Gefen infers block structure from the initial squared gradients, requiring no architecture-specific metadata and no hyperparameters beyond the AdamW defaults. Gefen learns its quantization codebook with an exact histogram-based dynamic programming procedure and reuses the same blocks for first-moment scaling. Across various experiments, Gefen achieves the lowest peak optimizer memory among the compared AdamW-like methods while maintaining AdamW-level performance. In FSDP and DDP training, the reduced memory footprint enables larger microbatches and significantly improves throughput over AdamW. Gefen is thus a practical drop-in replacement that frees memory for larger models or larger batch sizes. The complete Python implementation, including fused CUDA kernels, is available in the Gefen GitHub repository.
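The memory figure quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is not taken from the Gefen code. It assumes both AdamW moment buffers are stored in fp32 (4 bytes per value) and that the ~8x reduction applies to those two buffers only. The function names are hypothetical.

```python
GIB = 2**30  # bytes per GiB


def adamw_state_gib(num_params, bytes_per_value=4, num_buffers=2):
    """Size of AdamW's moment buffers in GiB.

    Assumption: both moments are kept in fp32 (4 bytes per value).
    """
    return num_params * bytes_per_value * num_buffers / GIB


def saved_gib(num_params, reduction_factor=8.0):
    """Memory saved if the optimizer state shrinks by `reduction_factor`."""
    baseline = adamw_state_gib(num_params)
    return baseline - baseline / reduction_factor


baseline = adamw_state_gib(1_000_000_000)
saving = saved_gib(1_000_000_000)
print(f"AdamW states: {baseline:.2f} GiB, saved at 8x: {saving:.2f} GiB")
# AdamW states: 7.45 GiB, saved at 8x: 6.52 GiB
```

Under these assumptions, an 8x reduction of a 7.45 GiB state leaves about 0.93 GiB and saves about 6.5 GiB per billion parameters, which matches the number in the abstract.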
Blogger's Review: The Gefen optimizer is a significant advance in optimizer memory efficiency for deep learning. By sharing second moments across parameter blocks and quantizing first moments, Gefen cuts optimizer memory by roughly 8x while retaining AdamW-level performance, making it a more efficient option for training large-scale models. Its throughput gains in FSDP and DDP training are particularly noteworthy and point to broad practical potential.