Abstract
Multilingual speaker verification remains challenging because language-dependent acoustic variability causes speaker identity to become entangled with linguistic characteristics, degrading generalization across languages. In multilingual training, embeddings often encode language cues alongside speaker identity, so that utterances from the same speaker form language-specific clusters. We propose L-Proto, a language-aware episodic prototypical training strategy that constructs language-consistent episodes. By sampling speakers from a single language per episode, L-Proto reduces language-driven variation during training and encourages embeddings to focus more directly on speaker identity.
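To make the idea of a language-consistent episode concrete, here is a minimal sketch of how such a sampler might look. It is not the authors' implementation: the function names (`build_index`, `sample_language_consistent_episode`), the `(utterance_id, speaker, language)` tuple layout, and the N-way / K-shot / query split are all assumptions made for illustration. The abstract only states that speakers are drawn from a single language per episode; restricting each speaker's utterances to that same language is a further assumption here.

```python
import random
from collections import defaultdict


def build_index(utterances):
    """Group utterance ids as index[language][speaker] -> list of ids.

    `utterances` is an iterable of (utterance_id, speaker, language)
    tuples. This layout is hypothetical, chosen only for the sketch.
    """
    index = defaultdict(lambda: defaultdict(list))
    for utt_id, speaker, language in utterances:
        index[language][speaker].append(utt_id)
    return index


def sample_language_consistent_episode(index, n_way, k_shot, n_query, rng):
    """Draw one episode whose speakers all come from a single language.

    Returns (language, support, query), where support and query map each
    sampled speaker to disjoint lists of utterance ids.
    """
    per_speaker = k_shot + n_query

    # Keep only languages that have at least n_way speakers with enough
    # utterances to fill both the support and the query set.
    eligible = {}
    for language, speakers in index.items():
        usable = [s for s, utts in speakers.items() if len(utts) >= per_speaker]
        if len(usable) >= n_way:
            eligible[language] = usable
    if not eligible:
        raise ValueError("no language has enough speakers for this episode size")

    # Pick one language first, then sample speakers only within it.
    # (Assumption: languages are chosen uniformly; the source does not say.)
    language = rng.choice(sorted(eligible))
    speakers = rng.sample(sorted(eligible[language]), n_way)

    support, query = {}, {}
    for speaker in speakers:
        utts = rng.sample(index[language][speaker], per_speaker)
        support[speaker] = utts[:k_shot]
        query[speaker] = utts[k_shot:]
    return language, support, query
```

In a prototypical setup, the support utterances of each speaker would typically be embedded and averaged into a prototype, and the query utterances classified against those prototypes. A random episodic baseline would differ only in the speaker-selection step, drawing speakers from all languages at once.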
Experimental Results
Experiments on the TidyVoice Challenge benchmark demonstrate consistent performance improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
Blogger's Review: L-Proto directly targets the language-dependency problem in multilingual speaker verification. By constructing language-consistent training episodes, it aims to keep language cues out of the speaker embeddings, and the reported gains over conventional fine-tuning and random episodic sampling hold across multiple backbone architectures. The method looks promising for practical speaker recognition and verification in multilingual environments, and it suggests a useful research direction for related fields.