[CS.AI] Weight Norm Reveals Grokking Delay Law: A Causal Insight

Grokking refers to the delayed onset of generalization in neural networks, occurring long after they fit the training data. The debate over whether weight norm causes this delay remains unresolved: some studies report a critical norm at the transition, while others observe grokking without any fixed norm. We address this by intervening on the norm during training rather than merely observing it.

Under free training with weight decay, networks grok when the weight norm reaches a value $W_c$ that varies little across seeds and learning rates (CV 1% to 2%) and grows with the modular base as a power law. When we clamp the norm at a fixed multiple $\rho$ of $W_c$ and hold it there, the network still groks, but the delay $T_{grok}$ scales as $\exp(\beta \rho)$. A single exponent, $\beta \approx 7.5$, fits this delay across four moduli ($R^2 = 0.996$). Over the swept ranges, the held norm changes the delay by about 19x, whereas the learning rate changes it by only about 2x; holding the norm above $W_c$ slows grokking rather than preventing it. Adding a final LayerNorm removes the dependence, because it decouples weight scale from network function; without it, the exponential law returns. This pinned-norm delay is the exponential counterpart to the logarithmic delay predicted for a freely contracting norm.
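To make the two operations in this summary concrete, here is a minimal sketch of (a) holding a weight vector at a fixed multiple $\rho$ of $W_c$ by rescaling, and (b) recovering $\beta$ from delay measurements by a straight-line fit of $\ln T_{grok}$ against $\rho$. It is not the paper's code. The helper names `clamp_norm` and `fit_log_linear`, the value of `W_C`, the $\rho$ grid, and the prefactor 100 are all assumptions made for illustration, and the delays are synthetic, generated from the stated law rather than taken from any experiment.

```python
import math


def clamp_norm(weights, target_norm):
    """Rescale a flat list of weights so its L2 norm equals target_norm.

    Hypothetical helper: the summary only says the norm is held at rho * W_c;
    rescaling the whole vector is one simple way to do that.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        raise ValueError("cannot rescale an all-zero weight vector")
    scale = target_norm / norm
    return [w * scale for w in weights]


def fit_log_linear(rhos, delays):
    """Ordinary least-squares fit of ln(T) = a + beta * rho.

    Returns (a, beta). If T is proportional to exp(beta * rho), the points
    (rho, ln T) lie on a line whose slope is beta.
    """
    n = len(rhos)
    logs = [math.log(t) for t in delays]
    mean_rho = sum(rhos) / n
    mean_log = sum(logs) / n
    sxx = sum((r - mean_rho) ** 2 for r in rhos)
    sxy = sum((r - mean_rho) * (l - mean_log) for r, l in zip(rhos, logs))
    beta = sxy / sxx
    return mean_log - beta * mean_rho, beta


# (a) Hold the norm at rho * W_c. W_C and RHO are made-up values.
W_C = 40.0
RHO = 1.2
clamped = clamp_norm([3.0, 4.0], RHO * W_C)
clamped_norm = math.sqrt(sum(w * w for w in clamped))

# (b) Synthetic delays generated from T = 100 * exp(7.5 * rho).
# These are NOT measurements; they only exercise the fitting routine.
SYNTH_RHOS = [0.9, 1.0, 1.1, 1.2, 1.3]
SYNTH_DELAYS = [100.0 * math.exp(7.5 * r) for r in SYNTH_RHOS]
intercept, beta_hat = fit_log_linear(SYNTH_RHOS, SYNTH_DELAYS)

print("norm after clamping:", clamped_norm)
print("fitted beta:", beta_hat)
```

In a real training loop the rescaling would presumably be applied to the model parameters after each optimizer step, but the summary does not say how the clamp is implemented, so treat that as an assumption. The log-linear fit is the standard way to test an exponential law: a high $R^2$ on the $(\rho, \ln T_{grok})$ line is what a figure such as the reported 0.996 would refer to.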

Blogger's Review: This paper investigates how weight norm affects grokking by intervening on the norm experimentally, and shows that it plays a central role in when generalization sets in. The work offers new insight into how networks learn and lays groundwork for optimizing training strategies.
