[CS.AI] The Importance of Small Initialization for Large Language Models

Abstract

Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data, and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks.

We identify two widely used empirical settings that restrain the advantage of small initialization and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence.

Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than all tokens uniformly. These results motivate a simple $\gamma$-initialization rule: expose initialization range as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.
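As a rough illustration of what "exposing the initialization range as an explicit knob" could look like in practice, the sketch below draws weights from a zero-mean Gaussian whose standard deviation is `fan_in ** (-gamma)`. The abstract does not state the rule's formula, so this parameterization, the exponent name `gamma`, and the helper names `init_std`, `init_weights`, and `empirical_std` are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def init_std(fan_in: int, gamma: float) -> float:
    """Return the assumed initialization scale fan_in ** (-gamma).

    This parameterization is an illustrative assumption; the abstract
    only says that the initialization range should be an explicit knob.
    """
    if fan_in <= 0:
        raise ValueError("fan_in must be positive")
    return fan_in ** (-gamma)


def init_weights(fan_out: int, fan_in: int, gamma: float, seed: int = 0):
    """Sample a fan_out x fan_in weight matrix from N(0, init_std ** 2)."""
    rng = random.Random(seed)  # local generator so results are reproducible
    std = init_std(fan_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def empirical_std(matrix) -> float:
    """Population standard deviation over all entries of the matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


if __name__ == "__main__":
    # Larger gamma means a smaller initialization for the same layer width.
    for gamma in (0.5, 0.8, 1.0):
        weights = init_weights(fan_out=8, fan_in=512, gamma=gamma, seed=0)
        print(gamma, init_std(512, gamma), empirical_std(weights))
```

Under this assumed form, `gamma = 0.5` reproduces the familiar `1 / sqrt(fan_in)` scale, and raising `gamma` shrinks the initialization without adding any training cost, which is the kind of single-parameter control the abstract argues for.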

Blogger's Review: This paper reveals the critical role of parameter initialization in training large language models, highlighting the positive impact of small initialization on reasoning capabilities. A simple adjustment can lead to significant performance improvements, providing new insights and directions for future model designs.
