Inversion-based image editing offers flexible and training-free control but still struggles with inversion accuracy and the trade-off between editing fidelity and background preservation. While recent methods improve inversion formulations or attention interactions, the role of textual conditioning in shaping diffusion dynamics and editing behavior remains underexplored. We show both empirically and theoretically that the precision of textual conditioning influences inversion stability by modulating the geometry of the diffusion velocity field, while also affecting the consistency of cross-branch attention during editing. These effects directly impact background preservation and semantic fidelity. Building on this analysis, we propose SimEdit, a conditioning-aware framework with two complementary components:
- Conditioning Refinement: Constructs conditioning signals with improved semantic precision and structural alignment to facilitate stable inversion and consistent attention manipulation.
- Token-wise Cross-branch Attention Control: Separates edit-relevant and structure-preserving components and modulates them asymmetrically during attention manipulation.
Extensive experiments on PIE-Bench demonstrate that SimEdit consistently improves both inversion reconstruction quality and editing performance over previous attention-manipulation approaches. Our code is available on GitHub.
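To make the idea of token-wise cross-branch attention control more concrete, here is a minimal sketch of one way asymmetric, per-token blending between a source (reconstruction) branch and a target (editing) branch could look. It is not the SimEdit implementation: the function name `blend_cross_attention`, the parameters `keep_weight` and `edit_weight`, and the assumption that edit-relevant tokens are given as a boolean mask are all hypothetical choices made for illustration.

```python
import numpy as np


def blend_cross_attention(src_attn, tgt_attn, edit_mask,
                          keep_weight=1.0, edit_weight=0.0):
    """Blend cross-attention maps from two branches, one weight per token.

    Hypothetical illustration, not the paper's method.

    src_attn, tgt_attn: arrays of shape (num_pixels, num_tokens) holding
        cross-attention maps from the source and target branches.
    edit_mask: boolean array of shape (num_tokens,), True for tokens
        assumed to be edit-relevant.
    keep_weight: share of the source map used for structure-preserving tokens.
    edit_weight: share of the source map used for edit-relevant tokens.
    """
    src_attn = np.asarray(src_attn, dtype=float)
    tgt_attn = np.asarray(tgt_attn, dtype=float)
    edit_mask = np.asarray(edit_mask, dtype=bool)

    if src_attn.shape != tgt_attn.shape:
        raise ValueError("source and target attention maps must match in shape")
    if edit_mask.shape != (src_attn.shape[1],):
        raise ValueError("edit_mask needs one entry per text token")

    # One source-branch weight per token column: the asymmetry lives here.
    src_share = np.where(edit_mask, edit_weight, keep_weight)

    # Broadcasting applies each token's weight down its whole column.
    return src_share * src_attn + (1.0 - src_share) * tgt_attn


# Usage: token 0 preserves structure, token 1 carries the edit.
src = np.array([[0.7, 0.3], [0.2, 0.8]])
tgt = np.array([[0.4, 0.6], [0.5, 0.5]])
blended = blend_cross_attention(src, tgt, edit_mask=[False, True])
```

With the default weights, columns for structure-preserving tokens are copied from the source branch and columns for edit-relevant tokens are left as the target branch produced them. Rows of the blended map no longer necessarily sum to one, so a real pipeline would likely need to decide whether to renormalize.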
Blogger's Review: This study highlights how much textual conditioning matters in diffusion-based image editing. Its SimEdit framework pairs refined conditioning with token-wise attention control, and the authors report gains in both inversion stability and editing fidelity over earlier attention-manipulation methods.