SCA 2022: Voice2Face Audio-Driven Facial and Tongue Rig Animations with cVAEs
This research was presented at the Eurographics Symposium on Computer Animation (SCA 2022). Authors: Mónica Villanueva Aylagas, Héctor Anadon Leon, Mattias Teye, and Konrad Tollmar.
Download the full research paper (5.9 MB PDF).
In this paper, we present Voice2Face, a tool that generates facial and tongue animations directly from recorded speech using machine learning.
Our approach consists of two steps: a conditional Variational Autoencoder (cVAE) generates mesh animations from speech, and a separate module then maps those animations to rig controller space. Our contributions include an automated method for speech style control, a method for training a model with data of multiple quality levels, and a method for animating the tongue.
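To make the two-step pipeline concrete, the sketch below shows a minimal conditional VAE that maps an audio feature window to mesh vertex offsets, followed by a separate mesh-to-rig module. This is an illustrative PyTorch sketch under assumed dimensions: the layer sizes, feature dimensions, and conditioning vector (standing in for speech style and data quality labels) are hypothetical and do not reflect the paper's actual architecture or training details.

```python
# Illustrative sketch only; all names and sizes are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class SpeechToMeshCVAE(nn.Module):
    """Step 1: a conditional VAE mapping audio features to mesh vertex offsets."""

    def __init__(self, audio_dim=80, cond_dim=8, latent_dim=32, mesh_dim=3 * 5000):
        super().__init__()
        # Encoder sees audio features, the target mesh, and the condition vector
        # (e.g., speech style and data quality labels) and outputs a latent distribution.
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + mesh_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder reconstructs the mesh from the latent code, audio features, and condition.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, mesh_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, audio, mesh, cond):
        stats = self.encoder(torch.cat([audio, mesh, cond], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        recon = self.decoder(torch.cat([z, audio, cond], dim=-1))
        return recon, mu, logvar

    @torch.no_grad()
    def generate(self, audio, cond):
        # At inference time, sample the latent from the prior instead of the encoder.
        z = torch.randn(audio.shape[0], self.latent_dim)
        return self.decoder(torch.cat([z, audio, cond], dim=-1))


class MeshToRig(nn.Module):
    """Step 2: a separate module mapping generated mesh animation to rig controller values."""

    def __init__(self, mesh_dim=3 * 5000, num_controllers=60):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mesh_dim, 128), nn.ReLU(),
            nn.Linear(128, num_controllers),
        )

    def forward(self, mesh):
        return self.net(mesh)


def cvae_loss(recon, mesh, mu, logvar, kl_weight=1e-3):
    # Standard cVAE objective: reconstruction error plus KL divergence to the prior.
    recon_term = nn.functional.mse_loss(recon, mesh)
    kl_term = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + kl_weight * kl_term
```

Because the condition vector is an input to both encoder and decoder, changing it at inference time is what would allow controlling attributes such as speech style without retraining; the mesh-to-rig step keeps the generative model independent of any particular character rig.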
Unlike previous work, our model generates animations without speaker-dependent characteristics while still allowing control over speech style.
We demonstrate through a user study that Voice2Face significantly outperforms a comparable state-of-the-art model, and our quantitative evaluation suggests that, thanks to our speech style optimization, Voice2Face yields more accurate lip closure for speech containing bilabials. Both evaluations also show that our data quality conditioning scheme outperforms both an unconditioned model and a model trained on a smaller high-quality dataset. Finally, the user study shows a preference for animations that include the tongue.