VOCAL: Vowel and Consonant Layering for Expressive Animator-Centric Singing Animation SIGGRAPH Asia 2022

YIFANG PAN, CHRIS LANDRETH, EUGENE FIUME, KARAN SINGH

JALI Research Inc., University of Toronto

Abstract

Singing and speaking are two fundamental forms of human communication. While both activities have distinct uses, from a modeling perspective, speaking can be seen as a subset of singing. We present VOCAL, a system that automatically generates expressive, animator-centric lower face animation from singing audio input. Articulatory phonetics and voice instruction ascribe different roles to vowels (projecting melody and volume) and consonants (lyrical clarity and rhythmic emphasis). Our approach directly uses these insights to define axes for Melodic accent and Pitch-sensitivity (Ma-Ps), which together with Ja-Li axes for Jaw and Lip contribution, define a 4D space into which a variety of singing styles can be readily embedded. We train a network to map audio features of a sung signal to the dynamic visual contributions of Ma-Ps-Ja-Li. The viseme animation curves are then computed based on aligned lyrics and the 4D vocal space. Vowels are processed first, dilated from their spoken behavior to bleed into each other based on melodic accent (Ma), with pitch sensitivity (Ps) modeling vibrato. Consonant curves are then layered in, weighted inversely with Ma. We evaluate the impact of our algorithmic parameters, compare against prior art on spoken and sung performance, and provide a qualitative comparison to video references for a gallery of singing animations.
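The layering order described in the abstract (vowels first, dilated by Ma; consonants second, weighted inversely with Ma) can be illustrated with a small sketch. This is not the paper's implementation: the `Viseme` type, the `layer_visemes` function, the `max_dilation` parameter, the assumption that Ma lies in [0, 1], the linear dilation rule, and the `1 - Ma` consonant weight are all hypothetical choices made only for illustration. Pitch sensitivity (Ps) and the Ja-Li axes are left out.

```python
from dataclasses import dataclass


@dataclass
class Viseme:
    """Hypothetical aligned viseme: label, start/end in seconds, vowel flag."""
    label: str
    start: float
    end: float
    is_vowel: bool


def layer_visemes(visemes, ma, max_dilation=0.1):
    """Return (label, start, end, weight) tuples, vowels first.

    Assumptions (not from the paper): ma is a scalar in [0, 1]; vowels are
    widened by max_dilation * ma seconds on each side so neighbours overlap;
    consonants keep their timing but get weight 1 - ma.
    """
    if not 0.0 <= ma <= 1.0:
        raise ValueError("ma is assumed to lie in [0, 1]")

    layered = []

    # Pass 1: vowels, dilated in time in proportion to melodic accent.
    pad = max_dilation * ma
    for v in visemes:
        if v.is_vowel:
            layered.append((v.label, v.start - pad, v.end + pad, 1.0))

    # Pass 2: consonants layered on top, weighted inversely with ma.
    for v in visemes:
        if not v.is_vowel:
            layered.append((v.label, v.start, v.end, 1.0 - ma))

    return layered
```

Under these assumptions, with `ma = 0` the output matches the spoken timing with full-strength consonants, while `ma = 1` stretches vowels the most and suppresses consonants entirely.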

Paper (3.38MB)

Ella

Whitney

Supplemental Video