
[CS.AI] Revolutionary VibE-SVC2: A New Framework for Independent Singing Style Control

Published at: 2026-06-18 22:00 Last updated: 2026-06-20 13:47
#AI #Machine Learning #DeepSeek

Abstract

Singing style is a crucial aspect of a natural and expressive singing voice. Singers use singing styles to convey the feeling or emotion of a song. Several methods have been proposed to control singing style and thereby produce a more expressive singing voice. Recently, VibE-SVC demonstrated vibrato control by predicting the high-frequency F0 contour.

In this paper, we introduce a singing voice conversion framework, called VibE-SVC2, to improve singing style conversion performance and controllability. The model offers control over two types of singing styles: pitch style and timbre style.

Pitch Style

To resolve the pitch-energy entanglement issue left unresolved in our previous work, we introduce a novel Energy Style Converter that handles the style information remaining in the energy contour. In addition, we propose a Zero-shot Pitch Style Converter, which mimics the pitch style of a reference audio. To expand the controllability of the model, we propose vibrato rate scaling, which controls vibrato rate independently of vibrato extent and was unavailable in VibE-SVC.
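To make the idea of separate extent and rate controls concrete, here is a minimal signal-processing sketch. It is not the paper's method: the abstract gives no implementation details, and VibE-SVC2 is a learned model. The sketch assumes the F0 contour is given in cents with no unvoiced gaps, uses a plain moving average to split the contour into a slow melody part and a fast vibrato residual, and the names `split_f0`, `scale_vibrato`, `win`, `extent` and `rate` are all hypothetical.

```python
import numpy as np

def split_f0(f0_cents, win=25):
    """Split an F0 contour (in cents) into a smooth part and a fast residual.

    Assumption: a moving average of `win` frames (odd) stands in for whatever
    low/high-frequency decomposition a real system would use.
    """
    f0_cents = np.asarray(f0_cents, dtype=float)
    kernel = np.ones(win) / win
    padded = np.pad(f0_cents, win // 2, mode="edge")
    low = np.convolve(padded, kernel, mode="valid")  # same length as the input
    return low, f0_cents - low

def scale_vibrato(f0_cents, extent=1.0, rate=1.0, win=25):
    """Scale vibrato depth and speed separately (illustrative only)."""
    low, high = split_f0(f0_cents, win)
    n = len(high)
    t = np.arange(n)
    # Reading the residual at rate * t speeds it up (rate > 1) or slows it
    # down (rate < 1) without changing its amplitude. Wrapping with `period`
    # is a crude way to fill the tail and can leave a seam at the wrap point.
    resampled = np.interp(t * rate, t, high, period=n)
    # Multiplying the residual changes only the depth of the oscillation.
    return low + extent * resampled
```

With `extent=0.0` the vibrato residual is removed entirely, while `rate=1.5` keeps the depth and makes the oscillation one and a half times faster. The point of the sketch is only that the two knobs act on different properties of the same residual, which is why they can be adjusted independently.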

Timbre Style

For the timbre style, we extend the model to handle a variety of phonation styles. However, specific styles such as vocal fry pose a challenge: conventional F0 extraction often fails on their inherent subharmonic characteristics, which degrades the conversion quality. To address this, we propose a novel Subharmonic Correction algorithm to refine the F0 contour for more natural timbre conversion.
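The failure mode described here can be illustrated with a simple heuristic. The sketch below is not the paper's Subharmonic Correction algorithm, whose details are not given in the abstract. It assumes the typical symptom is an F0 tracker locking onto half the true pitch for a few frames, that unvoiced frames are marked with 0, and that a local median is a trustworthy reference. The function name `correct_subharmonics` and the parameters `win` and `tol` are hypothetical.

```python
import numpy as np

def correct_subharmonics(f0_hz, win=11, tol=0.2):
    """Raise frames that sit about one octave below the local median F0.

    Assumptions: unvoiced frames are 0 Hz, and subharmonic runs are shorter
    than half the window, so the local median still reflects the true pitch.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    out = f0.copy()
    half = win // 2
    for i in np.flatnonzero(f0 > 0):
        lo, hi = max(0, i - half), min(len(f0), i + half + 1)
        context = f0[lo:hi]
        ref = np.median(context[context > 0])  # voiced neighbours only
        # Signed distance in octaves between this frame and the reference.
        octaves = np.log2(f0[i] / ref)
        if abs(octaves + 1.0) < tol:  # roughly one octave too low
            out[i] = 2.0 * f0[i]
    return out
```

A short dip from 220 Hz to 110 Hz lasting three frames would be lifted back to 220 Hz, while unvoiced frames are left at 0. Sustained vocal fry would defeat this heuristic, because the median itself drops to the subharmonic, which hints at why a dedicated algorithm is needed.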

Through comprehensive objective and subjective evaluations, we demonstrate that VibE-SVC2 provides fine-grained, independent control over two types of singing styles, outperforming existing methods.

Blogger's Review: VibE-SVC2 is a notable step forward in singing style conversion, particularly in its independent control of pitch and energy, which makes the system more expressive and flexible. The new algorithms also offer practical solutions for complex phonation styles such as vocal fry, making this work worth attention and further research.

Original Source: https://arxiv.org/abs/2606.17126
