Sample videos for:

Data-Driven Speech Animation using Decision Trees

We trained a model that automatically predicts the configuration of a person's lower face in response to phonetic inputs. Technical details are available in [1]. Below are some demo videos.
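To make the setup concrete, below is a minimal sketch (not the authors' code) of the core idea in [1]: regress lower-face shape parameters from a sliding window of phonetic context with a decision tree. The phone inventory size, window length, parameter count, and random training data are illustrative assumptions; in practice the targets would come from tracked face data aligned to the phonetic transcription.

    # Minimal sketch: map a sliding window of phonetic context to
    # lower-face shape parameters with a decision tree. All sizes and
    # the synthetic data are assumptions, not the paper's configuration.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    N_PHONES = 40        # assumed phone inventory size
    WINDOW = 11          # assumed phonetic context window (frames)
    N_FACE_PARAMS = 16   # assumed lower-face shape parameters per frame

    def phones_to_features(phone_ids, window=WINDOW, n_phones=N_PHONES):
        """One-hot encode each frame's phone, then stack a centered window."""
        T = len(phone_ids)
        onehot = np.zeros((T, n_phones))
        onehot[np.arange(T), phone_ids] = 1.0
        pad = window // 2
        padded = np.vstack([np.repeat(onehot[:1], pad, axis=0),
                            onehot,
                            np.repeat(onehot[-1:], pad, axis=0)])
        return np.stack([padded[t:t + window].ravel() for t in range(T)])

    # Toy training data: random phone sequences with synthetic face
    # parameters standing in for tracked motion-capture labels.
    rng = np.random.default_rng(0)
    train_phones = rng.integers(0, N_PHONES, size=2000)
    train_faces = rng.normal(size=(2000, N_FACE_PARAMS))

    model = DecisionTreeRegressor(max_depth=12)
    model.fit(phones_to_features(train_phones), train_faces)

    # Inference: phonetic input in, one face configuration per frame out.
    test_phones = rng.integers(0, N_PHONES, size=100)
    predicted_faces = model.predict(phones_to_features(test_phones))
    print(predicted_faces.shape)  # (100, 16)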

Retargeted to CG Characters


The video above shows our approach retargeted to CG characters. The lower-face animations are generated fully automatically by our approach and retargeted to the CG rigs; an animation artist manually created the basic head and torso movements. We are currently improving the retargeting technology and will post better animations soon. See the videos below for how our approach performs on the reference face.

Animating Chinese Speech


The video above shows our approach animating Chinese speech. We pre-transcribed the Chinese audio into phonetic labels, after which our approach applies directly.
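Language independence falls out of this interface: the predictor sees only phone labels, never raw audio. The sketch below reuses the hypothetical phones_to_features and model from the sketch above; the phone-to-id mapping and frame alignment are toy assumptions.

    # Continues the sketch above; `phones_to_features` and `model` are
    # defined there. The mapping and alignment below are toy assumptions.
    import numpy as np

    PHONE_IDS = {"n": 5, "i": 12, "h": 20, "a": 1, "o": 9}  # toy mapping

    # Per-frame phone labels for Mandarin "ni hao", e.g. from a forced aligner.
    chinese_frames = ["n"] * 6 + ["i"] * 8 + ["h"] * 5 + ["a"] * 7 + ["o"] * 9
    phone_ids = np.array([PHONE_IDS[p] for p in chinese_frames])

    faces = model.predict(phones_to_features(phone_ids))
    print(faces.shape)  # one lower-face configuration per input frame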

Comparison with Baseline Methods

Each video below compares a pair of prediction models, each predicting a frame-by-frame sequence of lower-face configurations from a sequence of phonetic inputs. In all comparisons, our approach is on the left.


Comparison with SEARN [2]:

Conventional decomposition approaches such as SEARN [2] do not directly model multi-frame temporal curvature, which can lead to jittery animations.



Comparison with DAgger [3]:

Conventional decomposition approaches such as DAgger [3] do not directly model multi-frame temporal curvature, which can lead to jittery animations.
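Both comparisons above illustrate the same failure mode. The toy numerical sketch below (our assumptions, not any of the cited algorithms) shows why independent per-frame predictions jitter, and why predicting overlapping multi-frame windows and blending them does not:

    # Toy illustration: per-frame predictions carry independent noise,
    # so frame-to-frame acceleration (jitter) is high; averaging
    # overlapping multi-frame window predictions smooths it out.
    import numpy as np

    rng = np.random.default_rng(0)
    T = 200
    truth = np.sin(np.linspace(0, 4 * np.pi, T))  # smooth "lip trajectory"

    # Per-frame decomposition: each frame predicted independently.
    per_frame = truth + rng.normal(scale=0.15, size=T)

    # Multi-frame windows: predict whole 15-frame segments, then average
    # the overlapping predictions that cover each frame.
    W = 15
    window_sum = np.zeros(T)
    window_cnt = np.zeros(T)
    for start in range(T - W + 1):
        noisy = truth[start:start + W] + rng.normal(scale=0.15, size=W)
        window_sum[start:start + W] += noisy
        window_cnt[start:start + W] += 1
    blended = window_sum / window_cnt

    def jitter(x):
        """Mean squared second difference: higher means jerkier motion."""
        return np.mean(np.diff(x, n=2) ** 2)

    print(f"per-frame jitter: {jitter(per_frame):.4f}")
    print(f"blended jitter:   {jitter(blended):.4f}")  # markedly lower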



Comparison with HMM Approach [4]:

Existing state-of-the-art visual speech approaches such as HMM-based synthesis [4] make overly strong modeling assumptions that result in over-smoothed animations.



Comparison with Dynamic Visemes [5]:

Existing state-of-the-art visual speech approaches such as Dynamic Visemes [5] stitch together multiple animation subsequences, which can lead to animations that are poorly aligned with the input audio.



Comparison with Ground Truth:

Compared to the ground truth, our approach still appears somewhat under-articulated or over-smoothed.


References:

  • [1] T. Kim, Y. Yue, S. Taylor, and I. Matthews. "A Decision Tree Framework for Spatiotemporal Sequence Prediction". In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2015. [pdf]

  • [2] H. Daumé III, J. Langford, and D. Marcu. "Search-based Structured Prediction". Machine Learning, 75(3):297-325, 2009. [pdf]

  • [3] S. Ross, G. Gordon, and J. A. Bagnell. "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning". In Conference on Artificial Intelligence and Statistics (AISTATS), 2011. [pdf]

  • [4] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. Black, and K. Tokuda. "The HMM-based Speech Synthesis System Version 2.0". In Speech Synthesis Workshop, 2007. [pdf]

  • [5] S. Taylor, M. Mahler, B.-J. Theobald, and I. Matthews. "Dynamic Units of Visual Speech". In ACM/Eurographics Symposium on Computer Animation (SCA), 2012. [pdf]