In the field of image segmentation, traditional methods primarily focus on texture segmentation or semantic segmentation based on visual cues. This paper introduces a new category called sub-semantic image segmentation, which blurs the line between the two. In sub-semantic image segmentation, language is not used to name whole objects; instead, it partitions an image into stable appearance patterns that can be described by language. To achieve this, we couple a general-purpose vision-language model with SAM 3, a promptable segmentation backbone whose native text pathway can ground rich descriptions into masks.
Naive coupling fails, however, for reasons we identify in the paper. We therefore introduce DETECTURE, which addresses three concrete failure modes: language leakage between texture regions, prompt competition inside the segmentation backbone, and semantic distortion at the language-to-mask interface. Because no dataset exists for sub-semantic image segmentation, we also introduce TextureADE, a new dataset derived from ADE20K using a system we designed.
We compare DETECTURE with several baselines and find that it achieves the strongest performance across multiple datasets and metrics. The code is available on GitHub.
Blogger's Review: Sub-semantic image segmentation opens new avenues in image processing, particularly in fine-grained visual understanding. Combining language and vision in this way marks a significant advance in segmentation technology, and we look forward to seeing it applied widely in real-world scenarios.