[CS.AI] Pixel-TTS: Image-Based Text Rendering for Robust Speech Synthesis

Recent advancements in pixel-based text modeling demonstrate that representing text as images allows models to leverage visual cues for language understanding. By grounding text in its visual form, structurally similar characters with different Unicode encodings can produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and necessitating embedding expansion during cross-lingual adaptation.

In response, we propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates the need for embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show that Pixel-TTS achieves competitive performance against strong baselines, faster convergence, and robust zero-shot generalization.
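To make the render-then-convolve idea concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The tiny bitmap font `GLYPHS`, the helper names (`render`, `make_kernels`, `embed`), the random untrained kernels, and the choice of a kernel that spans exactly one glyph with a matching stride are all assumptions made for illustration. A real system would presumably rasterize text with an actual font and learn the convolution weights.

```python
import random

GLYPH_H, GLYPH_W = 5, 4  # assumed glyph size for this toy font

# Hypothetical hand-made bitmap font standing in for a real text renderer.
# Latin "A" and Cyrillic "А" (U+0410) have different code points but the
# same shape, so they share a bitmap here, as they would in most fonts.
GLYPHS = {
    "A": ["0110", "1001", "1111", "1001", "1001"],
    "\u0410": ["0110", "1001", "1111", "1001", "1001"],
    "H": ["1001", "1001", "1111", "1001", "1001"],
    "I": ["1110", "0100", "0100", "0100", "1110"],
}

def render(text):
    """Render text as a 2D grayscale image (list of pixel rows)."""
    rows = [[] for _ in range(GLYPH_H)]
    for ch in text:
        bitmap = GLYPHS[ch]
        for r in range(GLYPH_H):
            rows[r].extend(float(px) for px in bitmap[r])
    return rows

def make_kernels(dim, seed=0):
    """Create `dim` random GLYPH_H x GLYPH_W kernels (untrained stand-ins)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1, 1) for _ in range(GLYPH_W)] for _ in range(GLYPH_H)]
        for _ in range(dim)
    ]

def embed(image, kernels):
    """Strided 2D convolution: one embedding vector per glyph-sized patch."""
    width = len(image[0])
    embeddings = []
    for x0 in range(0, width - GLYPH_W + 1, GLYPH_W):
        vec = [
            sum(
                k[r][c] * image[r][x0 + c]
                for r in range(GLYPH_H)
                for c in range(GLYPH_W)
            )
            for k in kernels
        ]
        embeddings.append(vec)
    return embeddings

kernels = make_kernels(dim=8)
latin = embed(render("A"), kernels)
cyrillic = embed(render("\u0410"), kernels)
other = embed(render("H"), kernels)
```

In this sketch `latin` and `cyrillic` are identical because the embedding depends only on pixels, while `other` differs. A lookup-table embedding keyed on code points would instead need a new row for U+0410, which is the expansion step the pixel-based design avoids.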

Blogger's Review: The key innovation of Pixel-TTS is bringing visual information into speech synthesis, so the model can cope with characters it never saw during training. Beyond the robustness gain, the approach could simplify multilingual adaptation, since no embedding matrix has to be expanded during fine-tuning. It is worth attention and further exploration.
