
[CS.AI] Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

Published at: 2026-06-17 22:00
Last updated: 2026-06-20 13:45
#AI #Machine Learning #optimization

Abstract

Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. It requires a deep understanding of visual content along with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches often suffer from limitations such as a lack of rich local feature representations and the high computational cost of quadratic self-attention.

Proposed Model

The proposed model improves computational efficiency by restructuring the vision transformer architecture. The standard self-attention mechanism in Vision Transformers is replaced with a probabilistic mechanism based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all n image patches, the model groups similar patches into a fixed number of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic, O(n^2), to O(nK), where n is the number of patches and K is the number of clusters. Because K is fixed, the cost grows linearly with n. This lower cost is what makes image captioning faster and more efficient.
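To make the complexity argument concrete, here is a minimal NumPy sketch of a GMM-style token mixer. It is an illustration under stated assumptions, not the paper's implementation: the function name `gmm_cluster_mixing`, the isotropic shared variance `var`, the random initialisation, and the fixed number of EM iterations are all hypothetical choices. The point it shows is that every array has shape (n, K), (K, d), or (n, d), so no n-by-n attention matrix is ever formed.

```python
import numpy as np

def gmm_cluster_mixing(x, num_clusters, num_iters=3, var=1.0, seed=0):
    """Mix patch embeddings through K soft clusters (illustrative sketch).

    x: (n, d) patch embeddings.
    Returns (mixed, resp): mixed is (n, d), resp is (n, K) soft assignments.
    Assumption: isotropic Gaussians with one shared variance `var`.
    """
    n, d = x.shape
    rng = np.random.default_rng(seed)
    # Initialise the K means from randomly chosen patches, with uniform weights.
    means = x[rng.choice(n, size=num_clusters, replace=False)].copy()
    log_pi = np.full(num_clusters, -np.log(num_clusters))

    for _ in range(num_iters):
        # E-step: squared distance of every patch to every mean, shape (n, K).
        sq_dist = (
            (x ** 2).sum(axis=1, keepdims=True)
            - 2.0 * x @ means.T
            + (means ** 2).sum(axis=1)
        )
        logits = log_pi - 0.5 * sq_dist / var
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)      # rows sum to 1

        # M-step: re-estimate means and mixing weights from the soft counts.
        mass = resp.sum(axis=0) + 1e-8               # (K,)
        means = (resp.T @ x) / mass[:, None]         # (K, d)
        log_pi = np.log(mass / n)

    # Each patch becomes a responsibility-weighted average of the K means.
    return resp @ means, resp


# Example with hypothetical sizes: 196 patches, 64 dimensions, 8 clusters.
patches = np.random.default_rng(1).normal(size=(196, 64))
mixed, resp = gmm_cluster_mixing(patches, num_clusters=8)
```

With these illustrative sizes, the responsibility matrix holds 196 × 8 = 1,568 entries, compared with 196 × 196 = 38,416 for a full pairwise attention map. Each EM iteration costs O(nKd), so doubling the number of patches doubles the work instead of quadrupling it.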

Blogger's Review: The clustering approach proposed in this paper effectively addresses the computational bottleneck of traditional self-attention in image processing. It shows the potential of vision transformers in practical applications and warrants further exploration.

Original Source: https://arxiv.org/abs/2606.14753
