[CS.AI] Rethinking Global Average Pooling: Classifiers as Multi-Instance Learners

Modern image classifiers adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multi-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances.
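The identity described above, that pooling and a linear head commute, can be checked numerically. The following is a minimal sketch in NumPy, not the paper's code: the shapes, random features, and the names `features`, `weight`, and `bias` are all hypothetical stand-ins for a backbone's feature grid and a classification head.

```python
import numpy as np

# Hypothetical sizes: channels, grid height, grid width, number of classes.
C, H, W, K = 8, 4, 4, 3

rng = np.random.default_rng(0)
features = rng.normal(size=(C, H, W))  # stand-in for the feature grid before GAP
weight = rng.normal(size=(K, C))       # stand-in for the linear head's weights
bias = rng.normal(size=K)              # stand-in for the linear head's bias

# Standard path: global average pooling, then the linear head.
pooled = features.mean(axis=(1, 2))    # shape (C,)
image_logits = weight @ pooled + bias  # shape (K,)

# Pointwise path: apply the same head at every grid location, then average.
grid_logits = np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]
averaged_logits = grid_logits.mean(axis=(1, 2))  # shape (K,)

# Because the head is linear (affine), both paths agree up to float error.
logits_match = np.allclose(image_logits, averaged_logits)
print(logits_match)
```

The bias term survives the averaging unchanged because it is added identically at every location, which is why the identity holds for an affine head and not only for a purely linear one.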

Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation suggests that common classifier failures reflect known limitations of mean aggregation.
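To illustrate how a prediction grid can expose evidence that mean aggregation hides, here is a toy sketch with hand-picked numbers. It is not taken from the paper's experiments: the two-class setup, the logit values, the foreground mask, and the helper name `prediction_grid` are all assumptions made for illustration.

```python
import numpy as np

def prediction_grid(grid_logits):
    """Return the per-location argmax class for logits of shape (K, H, W)."""
    return grid_logits.argmax(axis=0)

# Toy setup (hypothetical values): 2 classes on a 4x4 grid.
# Class 0 is the assumed ground truth and occupies a small foreground region.
grid_logits = np.zeros((2, 4, 4))
foreground = np.zeros((4, 4), dtype=bool)
foreground[:2, :2] = True              # 4 of 16 cells are foreground

grid_logits[0][foreground] = 2.0       # foreground cells favor class 0
grid_logits[1][~foreground] = 1.0      # background cells mildly favor class 1

# Image-level logits are the mean of the grid, as with GAP + linear head.
image_logits = grid_logits.mean(axis=(1, 2))
image_prediction = int(image_logits.argmax())

# The prediction grid keeps the per-location decisions that the mean discards.
grid = prediction_grid(grid_logits)
foreground_predictions = grid[foreground]

print(image_logits)            # mean logits per class
print(image_prediction)        # class chosen at image level
print(foreground_predictions)  # classes chosen inside the foreground
```

In this constructed case the many mildly confident background cells outweigh the few strongly confident foreground cells in the mean, so the image-level prediction is class 1 even though every foreground cell predicts class 0. This is the kind of mean-aggregation limitation the abstract alludes to.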

Blogger's Review: This paper examines how global average pooling can obscure spatial class evidence and proposes a multi-instance learning perspective for interpreting classifier behavior. By extracting that spatial evidence post hoc, researchers can gain better insight into how models reach their decisions, which is useful for future image classification research.
