Abstract
This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard $L_p$ metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from final classifications.
We evaluate this vulnerability across state-of-the-art architectures under strict prediction-preserving constraints. By assessing the manipulation cost through domain-specific perceptual audio quality metrics alongside explanation alignment criteria, our framework demonstrates that an adversary can systematically distort automated explanation heatmaps while preserving the predicted deepfake label.
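To make the idea concrete, here is a toy sketch. It is not the paper's method. It assumes a linear scorer with gradient-times-input attributions, and it swaps the psychoacoustic constraint for a plain amplitude bound `eps`, a much cruder stand-in. The names `manipulate`, `lam`, and `target` are hypothetical and chosen only for illustration. The sketch pushes the attribution vector toward an arbitrary target while penalizing any drift in the score, so the predicted label stays put.

```python
def attributions(w, x):
    """Gradient*input attribution for a linear score s(x) = sum(w_i * x_i)."""
    return [wi * xi for wi, xi in zip(w, x)]


def score(w, x):
    """Linear logit; its sign stands in for the real/fake label."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sq_dist(a, b):
    """Squared Euclidean distance between two attribution vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def manipulate(w, x, target, eps=0.05, lam=1.0, lr=0.02, steps=500):
    """Projected gradient descent on a perturbation `delta`.

    Minimizes ||attr(x + delta) - target||^2 + lam * (score drift)^2
    subject to |delta_i| <= eps (assumed stand-in for an inaudibility bound).
    """
    delta = [0.0] * len(x)
    base = score(w, x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        attr = attributions(w, x_adv)
        drift = score(w, x_adv) - base  # how far the logit has moved
        new_delta = []
        for wi, di, ai, ti in zip(w, delta, attr, target):
            # Gradient of the attribution loss plus the score-drift penalty.
            g = 2.0 * wi * (ai - ti) + 2.0 * lam * drift * wi
            # Gradient step, then clip back into the amplitude bound.
            new_delta.append(max(-eps, min(eps, di - lr * g)))
        delta = new_delta
    return delta


# Toy "model" and input; the target is the original attribution map reversed.
w = [1.0, -2.0, 0.5, 1.5]
x = [0.8, -0.3, 0.4, 0.6]
original_attr = attributions(w, x)
target = list(reversed(original_attr))

delta = manipulate(w, x, target)
x_adv = [xi + di for xi, di in zip(x, delta)]

label_before = score(w, x) > 0
label_after = score(w, x_adv) > 0
dist_before = sq_dist(original_attr, target)
dist_after = sq_dist(attributions(w, x_adv), target)
```

In this toy setting the label is unchanged while the attribution vector moves closer to the target. A real detector is nonlinear, and a real attack would replace the clipping step with a perceptual constraint, so read this as the shape of the optimization only.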
Full code available at: Audio-XAI GitHub
Blogger's Review: This paper shows that the explanations produced for audio deepfake detectors can be fragile: an inaudible perturbation can change the attribution heatmap while the predicted label stays the same. Its psychoacoustic framing, which measures manipulation cost with perceptual audio quality metrics rather than standard $L_p$ norms, is an approach worth exploring in other audio tasks. Work like this helps improve model robustness and adds to what we know about the security of audio processing.