Final Proposal - University of Sheffield




Fig 2.1. Diagram illustrating the general solution proposed by a number of the reviewed papers; it serves as an outline for Sections 2 and 3.


1. Introduction

This paper presents a proposal for the implementation of an intelligent system able to realise real-time speech-guided upper facial feature animation of a synthetic 3D face, where the audio input is understood to come from the character.

Various pieces of literature have been reviewed (Section 2) and our general approach for the project is described in Section 3. Section 4 holds a summarised view of the proposed approach and Section 5 contains the Gantt chart describing the time plan and milestones for the implementation. Section 6 is the bibliography.

2. Literature Review

Attempts to achieve a more natural feel in animations have led various researchers to seek a solid relationship between facial expressions and speech. In day-to-day conversations, facial expressions and head movements play a key role in developing understanding during our interactions. The idea that, by extracting prosodic data, we can reproduce facial expressions is based on an association between the two domains established by studies validated through experimentation.

In this section, we cover the background research of the problem domain and list multiple solutions to problems related to the project. The section is structured in four subsections: data acquisition, data processing, machine learning and graphical representation.


2.1 Data Capture

The Data Capture section is divided into Video Capture and Audio Capture parts. The Video Capture part discusses problems with scene composition and the simplifications often imposed on it, as well as camera setups and the problems that can arise from them. The Audio Capture part describes the assortment of data capture techniques used by various researchers.

Video Data Capture

Tracking facial features from video in an unconstrained real-world scene is a difficult task. There are multiple problems such as noise, occlusions and variable lighting. Researchers ease the problem by simplifying the scene and constraining its variety. Lighting can be limited to a single colour from a rigid source or sources (Fabrice 2000, Nguyen 2010). Depending on the application, some researchers simplify the composition even further; Martinez-Lazalde (Martinez-Lazalde 2010) constrains the scene so that the only lighting illuminates tracking markers located on the actor's face. This significantly limits the complexity of the scene and the noise in the recorded data, allowing relatively easy extraction of tracking data. Further limitations are commonly imposed on the background of the recorded subject (Nguyen 2010, Gross 2006, Martinez-Lazalde 2010), as this ensures the background will not affect the tracker's performance by introducing additional noise to the scene. Moreover, the actor's position is often assumed to be fixed (Fabrice 2000, Nguyen 2010, Martinez-Lazalde 2010, Barker 2005).

Researchers exert control not only over scene complexity but also over the camera setup. Depending on the application, different kinds of cameras can be used, from a home camera (Barker 2005) to high-speed industrial cameras (Martinez-Lazalde 2010). Many researchers use multiple cameras on a scene to help deal with occlusions, to estimate depth, and to extend the field of view. Depth information can also be obtained from depth recovery systems (Yilmaz 2006, Mittal 2003). Research indicates that using multiple cameras can provide superior tracking video compared to a single-camera setup (Mittal 2003, Dockstader 2001). While filming Avatar, James Cameron used a lightweight camera fixed onto each actor's head and pointing at the face. This removed occlusions caused by inter-object or pose variation, and also removed rigid head movement and field-of-view problems (Cameron 2010).

Frame rate and data compression are other important settings in the camera setup. Usually frame rates of 20-30 fps are employed (Barker 2005, Kapoor 2002); however, as shown in (Martinez-Lazalde 2010, Barker 2005), these are not always sufficient and often cause blurred markers in the recorded data. A frame rate of 100 fps is quoted as the golden mean (Martinez-Lazalde 2010).



Audio Data Capture

When designing a system able to make intelligent decisions, clean data collection is critical (Mitchell 1997). Hence much thought is put into choosing appropriate data to be recorded (Yehia 2002, Mary 2008, Hofer 2007, Graf 2002, Foxton 2009, Flecha-García 2008).

Yehia (Yehia 2002) relies on a more controlled approach where the subject is aware of how the data is going to be used. For data acquisition, he uses recordings of multilingual scripted sentences with repetition, leaving one repetition for testing while using the rest for training. Head movement is addressed through recitation of 59 CID Everyday Sentence Set sentences. Mary and Yegnanarayana (Mary 2008) use a CV-based structure to keep the input "suited both to the production and the perception mechanisms".

Hofer's (Hofer 2007) method for data collection and processing is less controlled, allowing the actor to recite fairy tales from memory. Graf's (Graf 2002) approach takes a wider dataset, recording not only sentences and fairy tales but also speech covering all diphones of English and some excerpts from the Wall Street Journal. Some researchers also accommodate a range of emotion sets (sad, angry, happy, etc.) while recording. Foxton's (Foxton 2009) technique uses this method in a slightly different way: an actress is instructed to repeat a pair of words, each time emphasising a different part and altering the emotion. Flecha-García (Flecha-García 2008) achieves the goal by observing dialogues amongst sets of people.

2.2 Data Processing

The Data Processing section is primarily concerned with taking the captured data and cleaning it up so it is usable by the Machine Learning section. This consists of discretising tracker information over time, discretising audio into prosodic values over time, and associating tracker information with prosodic values.


Facial Feature Tracking

A way of representing facial features in a video stream must be decided on before these can be extracted. There are many ways of representing objects, but for facial features various forms of joint shape-appearance representation are most appropriate, as these provide information about both the shape and the characteristics of a feature. These include probability densities of features, templates and Active Appearance Models. A standard for locating points to track on various facial features is specified by MPEG-4 FAP (Pockaj 1998).

Algorithms utilising probability densities and templates have attracted large interest from researchers because of their relative simplicity and computational efficiency. One such algorithm, CAMSHIFT, utilises a colour probability distribution in a region specified by a simple geometric shape, and thus requires an initialisation stage (Bradsky 1998). While tracking, it maximises appearance similarity by comparing the histogram of the tracked feature to the window around its hypothesised location. At each iteration the similarity of the histogram is increased, and the process continues until the algorithm converges to a new feature location. A system utilising an algorithm similar to CAMSHIFT, but based on feature illumination rather than colour, is presented in (Barker 2005) and is used for tracking physical markers located on the actor's face.
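To make the histogram-based tracking idea concrete, below is a minimal sketch of CAMSHIFT-style colour tracking using OpenCV. The video file name, the initial marker window and the hue-histogram settings are illustrative assumptions rather than values taken from any of the reviewed systems.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("marker_video.avi")          # hypothetical input file
ok, frame = cap.read()

x, y, w, h = 300, 200, 40, 40                        # assumed initial marker window
roi = frame[y:y + h, x:x + w]
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))
hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])   # hue histogram of the marker
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
track_window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)  # colour probability image
    rot_rect, track_window = cv2.CamShift(backproj, track_window, term)
    # track_window now holds the converged (x, y, w, h) of the marker for this frame
```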

Shi (Shi 1993) extends the Lucas-Kanade (Lucas 1981) optical-flow method to incorporate affine transformations of the tracked point. The tracker repeatedly computes the translation of a region centred on the interest point. Once the new location is calculated, the quality of the tracked patch is estimated by computing affine transformations. However, this algorithm has inherent problems of features drifting and/or being lost during tracking, and it is also unreliable under occlusions. Various approaches to solve these problems have been proposed. Fabrice (Fabrice 2000) specialises this algorithm for tracking facial features and recovering points lost between frames, either due to tracking inaccuracy or partial occlusions. This is done by incorporating knowledge of facial topology into the feature point search regions. Wang (Wang 2003) combines the KLT tracker with a model-based method trained to recover mouth features if they become lost. One can also utilise a set of face shape subspaces learned by Probabilistic Principal Component Analysis as a feedback mechanism to the KLT algorithm, which enables reliable tracking of facial features in the presence of continuous partial occlusions and head motion (Nguyen 2010).
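For illustration, the sketch below runs pyramidal Lucas-Kanade point tracking between two frames with OpenCV; the frame file names, initial marker coordinates and window parameters are assumptions chosen only to show the call pattern.

```python
import cv2
import numpy as np

# Pyramidal Lucas-Kanade (KLT-style) point tracking between two consecutive frames.
prev_gray = cv2.cvtColor(cv2.imread("frame_000.png"), cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(cv2.imread("frame_001.png"), cv2.COLOR_BGR2GRAY)

# Hypothetical initial marker locations (e.g. from an initialisation/detection step).
points = np.array([[320, 180], [360, 182]], dtype=np.float32).reshape(-1, 1, 2)

new_points, status, err = cv2.calcOpticalFlowPyrLK(
    prev_gray, next_gray, points, None,
    winSize=(21, 21), maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

# Keep only points the tracker reports as found; lost points would need
# re-initialisation, e.g. using knowledge of facial topology (Fabrice 2000).
tracked = new_points[status.flatten() == 1]
```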

Another issue with such kernel trackers is point jitter. Jitter refers to a change in a marker's centroid location between subsequent frames when no movement is involved. This can be resolved by introducing a Kalman filter (Martinez-Lazalde 2010).
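A minimal sketch of such a smoothing step is shown below, using a constant-velocity Kalman filter over a 2D marker position; the noise covariances are illustrative values that would need tuning to the actual jitter level.

```python
import numpy as np
import cv2

# Constant-velocity Kalman filter smoothing a 2D marker position.
# State: [x, y, vx, vy]; measurement: [x, y]. Noise values are illustrative.
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1.0   # roughly 1 px jitter

def smooth(raw_xy):
    """Feed one jittery tracker measurement, return the filtered position."""
    kf.predict()
    est = kf.correct(np.array(raw_xy, np.float32).reshape(2, 1))
    return float(est[0]), float(est[1])
```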

Representing facial features with trained facial models is another popular technique. A lot of work has focused on Active Appearance Models (Zhou 2010, Cootes 2001). These utilise models of texture and shape variations to represent facial features. Such models are able to realise far more effective and robust tracking of facial features and do not suffer as much from the point drift and loss problems that hinder the reliability of model-free algorithms. Nonetheless, model-based systems require initial training and have problems generalising to unseen images not present in the training set (Yilmaz 2006).



Audio Prosody Extraction

It is important to note that speech is not just a series of produced sounds. Suprasegmental, or prosodic, features add to the meaning of words (Ashby 2005). A speaker can thus signal important information along with the articulated segments of speech.

Phonological analysis of speech provides insight into what is said, but it is prosody that defines how it is said. Prosody is characterised by parameters such as stress, intonation and pauses. A general set of prosodic features includes pitch, jitter, spectral energy distribution and the number of accentuated syllables (Albrecht 2002). This information, carried in the "acoustic packaging" (Albrecht 2002) of words, is termed paralinguistic or prosodic information.

Foxton (Foxton 2008) concludes on the cross-modal nature of speech, considering facial expression as the "visual part of speech" and demonstrating the relation between audio and visual prosody. Cavé et al. (Cavé 1996) have argued that communication is trimodal, requiring the integration of verbal, vocal and gestural channels.

The idea that speaker head and facial movements may be linguistically informative is not new. From the point of view of speech motor control, this is interesting because the motion of the head and that of the eyebrows is integrated with the system generating speech, but is under independent control (Barr 2001). Studies using various languages (French (Foxton 2008), English (Graf 2002, Flecha-García 2008), Japanese (Munhall 2005), Swedish (Samer 2010)) have delivered the same results.

Earlier works on head gestures during speech targeted issues of timing and how the movement is organised (Hadar, Steiner, Grant, & Rose (Hadar 1983, Hadar 1984)). A language-independent relation with prosody was established as early as that. Following these kinematics-based studies, several further works have shown that subjects tend to use facial and head movements to express emphatic stress (Bernstein 1998) and to differentiate amongst various types of statements.

Munhall et al. (Munhall 2004) report that non-verbal gestures carry great weight in perception, but Sargin (Sargin 2002) points out various challenges in developing such a relationship:


What features should be extracted? Most researchers focus on the prosodic features relevant to their specific implementation. Our interest is in co-occurring responses relating muscular movements in the face to variance in speech, rather than conversational or interactive responses. Ekman (Ekman 1979) states that there is neither a particular set of rules correlating gestures with speech nor any universal definition of emotion or facial expression.



What problems are there with synchronisation? While tracking speech-related eyebrow movement, Batons and Underliners (Ekman 1979) are speaker conversational signals which tend to occur with voice stress but differ in the length of speech they cover and depend on the context. Aligning a prosodic feature with the respective facial movement when there is no specific temporal segmentation, as each may span different time intervals, is also a major issue.



Can this become speaker independent? Training data, owing to the variations in prosody and gesture routines, is inconsistent, even for the same individual. There are a number of emotional states to be considered. An expression being voluntary or involuntary adds to the problem, along with the fact that the same movement may hold different meanings for different people (a phenomenon handled by defining Emblems) (Ekman 1999). The conclusion is that the mapping can only be expected to be accurate for a single individual.

Graf (Graf 2002) concludes that, although the system may seem dependent on individual variations, there is a "well-defined" set of rules controlling the intonation of a language. All the prosodic characteristics can be defined by time-varying features as simple as loudness, phoneme duration, rhythm and pitch. These parameters can be classified into three broad categories (Kuratate 1999): quality, timing and pitch contour.

Although most prosodic features are known to characterise emotions, the most expressive one is the rate of vocal-fold cycling, known as pitch. According to Cavé (Cavé 1996), the majority of eyebrow movement while conversing, as much as 71%, is related to the pitch contour. Raising pitch tends to co-occur with raising of the eyebrows, while lowering of pitch and lowering of the eyebrows also occur in sync. 38% of overall eyebrow movements are found to occur during gaps in speech. Flecha-García (Flecha-García 2008) reports that the length of utterances and pitch accents can be directly related to eyebrow raises. The intensity of facial expressions varies directly with the loudness and intensity of speech; these characteristics of speech are seen in the power spectrum.

Paeschke et al. (Paeschke 1999) confirmed the relation between the basic emotion set and characteristic fundamental frequency, and examined F0 (fundamental frequency) contours for syllables and sentences. The analysis may be done using an automatic speech recogniser (ASR) to avoid the complications of applying labelling standards such as ToBI (Tones and Break Indices) to the data.

Blocking and Data Association

This section outlines a step in the process not explicitly covered by most of the research literature but understood to happen. It is concerned with how training data is gathered and fed into the Machine Learning aspect of the project.

In the case of neural networks, associated input and output data must be fed into the network together during training. The training data's input must match the format of the data the neural network will receive in practice; otherwise you cannot expect to reproduce the training data's output (Mitchell 1997). If the neural network's input requires multiple samplings of a variable over time, the network must have inputs for each sampling, the training data must have multiple samples to feed into these connections, and, just as in practice, there must be a mechanism for storing the samples of that variable so that they can be fed in as input all at once (Chen 2001).

Associating input with output demands that you know which two variables you are trying to correlate with a machine learning model, and that you sample them in a way that lets this correlation be identified (Mitchell 1997). In this project, audio prosody must be sampled along with facial movement, and these two pieces of data must be fed into the machine learning model as training data such that, in practice, the model can take the audio prosody as input and output the facial movements (Chen 2001).
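A small sketch of this blocking step is given below, assuming per-frame prosody and marker arrays that are already time-aligned; the array layout and window length are our own illustrative choices, not a format prescribed by the cited papers.

```python
import numpy as np

def build_training_pairs(prosody, markers, window=5):
    """Pair a window of prosody samples with the marker frame at its centre.

    prosody : (T, P) array of per-frame prosodic features (pitch, energy, ...)
    markers : (T, M) array of per-frame tracked marker coordinates
    window  : number of prosody frames on each side of the target frame
              (an assumed value to be tuned experimentally)
    """
    X, Y = [], []
    for t in range(window, len(prosody) - window):
        X.append(prosody[t - window:t + window + 1].ravel())  # flattened prosody context
        Y.append(markers[t])                                   # associated output frame
    return np.asarray(X), np.asarray(Y)
```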

2.3 Machine Learning

Once the various types of data are gathered and processed, researchers often need to establish a relationship between them. Utilising machine learning algorithms is one way of doing that.

Machine Learning is a branch of Artificial Intelligence dedicated to developing systems that evolve behaviour given a set of empirical data and examples, utilising high-level probability distribution mathematics. There is a focus on identifying correlations and patterns in complex data. Often the success of a machine learning model hinges on the data, making data acquisition and cleaning a high priority. The success of an application should be quantified with this definition in mind: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. One class of machine learning of interest is supervised learning, which is concerned with mapping a set of inputs to a set of outputs using a set of examples. This is the class of machine learning we are primarily concerned with in this paper (Mitchell 1997).

The machine learning models used in the papers reviewed include neural networks (Ishikawa 1999, Kuratate 1999, Yehia 2002, Hong 2002, Chen 2001), Hidden Markov Models (Sargin 2008, Busso 2005, Brand 1999) and Gaussian Mixture Models (Deng 2004). The neural network appears to be the most popular approach, and typically a feed-forward neural network is used. Feed-forward neural networks comprise a sequence of layers of neural nodes, or neurons, in which every node takes input from every node in the previous layer and outputs to every node in the next layer. These are likely used because they have been shown to be excellent approximators of complex correlations (Mitchell 1997).
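The sketch below shows the layer structure of such a feed-forward network as a plain NumPy forward pass; the layer sizes and the use of tanh activations are illustrative assumptions only.

```python
import numpy as np

def forward(x, weights, biases):
    """One pass through a fully connected feed-forward network.

    Every node receives input from every node in the previous layer
    (one weight matrix per layer) and passes its activation to the next layer.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ W + b)           # hidden layers with tanh activation
    return a @ weights[-1] + biases[-1]  # linear output layer

# Illustrative sizes: 33 prosody inputs -> 64 -> 32 hidden units -> 8 marker outputs.
rng = np.random.default_rng(0)
sizes = [33, 64, 32, 8]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

y = forward(rng.normal(size=33), weights, biases)   # predicted marker displacements
```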

2.4 Facial Representation

Once the relation between the data has been learned and established, it must be transferred onto a graphical facial model. A number of issues need to be considered, and this section describes the state of the art in this domain.

Facial representation and animation is a widely studied field of computer graphics; our sensitivity to artificial representations of faces makes it a complex field. It is hard to create a face which we cannot distinguish from its real-world counterpart, a so-called photo-realistic representation.

There is a wealth of differing techniques for creating facial expression animation. These can be broken down into five categories: facial animation editing, performance-driven, blend shape interpolation, deformation-based and physics-based.

Facial animation editing is the more traditional approach of manually creating the pose faces for the animations. Animators use a variety of deformation approaches to create the poses; packages like Maya (Autodesk) use this technique. However, the sophisticated mix of modelling tools used to create the poses renders this approach undesirable for automation in real-time graphics.

Performance-driven facial animation uses motion captured from real-world actors to directly drive the motion of the graphical representation. Points are tracked on a face, transferred to the graphics model and applied with a deformation technique to achieve a reconstruction of the actor's performance, a technique heavily used in the film industry. This approach can also give an animator the flexibility to create a model and then apply the motion capture data of an actor to it (Chuang 2002).

Blend shape interpolation draws from a corpus of pre-generated polygonal meshes which correspond to facial expressions. Interpolation between these meshes achieves a desired facial expression (Pighin et al. 1997, Kouadio et al. 1998). (Blanz 1999) uses sliders to alter weights within the linear interpolation calculations to create a blend between separate expression meshes; this technique has been used successfully to animate faces in major motion films such as Star Wars (Lucas 1999). This approach tends to require manual intervention by an animator to achieve a desirable output.
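The underlying calculation is a weighted sum of vertex offsets, as in the minimal sketch below; the mesh sizes and expression targets are toy placeholders.

```python
import numpy as np

def blend_shapes(neutral, targets, weights):
    """Linear blend shape interpolation.

    neutral : (V, 3) vertex positions of the neutral face mesh
    targets : list of (V, 3) expression meshes (e.g. smile, brow raise)
    weights : one slider value per target, typically in [0, 1]
    """
    result = neutral.copy()
    for target, w in zip(targets, weights):
        result += w * (target - neutral)   # add the weighted offset of each shape
    return result

# Illustrative use with a 4-vertex toy mesh and two expression targets.
neutral = np.zeros((4, 3))
smile = neutral + [0.0, 0.1, 0.0]
brow_up = neutral + [0.0, 0.0, 0.05]
blended = blend_shapes(neutral, [smile, brow_up], weights=[0.7, 0.3])
```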

Deformation-based approaches create a new facial expression by manipulating the structure of the mesh to deform it into a new shape which we interpret as an expression. This technique disregards the underlying facial anatomy, reducing computational load, and has the advantage of using existing graphical techniques for polygonal deformation, such as splines, free-form deformations and soft-body physics systems (Sederberg 1986), (Yoshizawa 2002), (Mesit 2007), (Debunne 2000), (Provot 1995), (Rivers 2007), (Terzopoulos et al. 1987).




Fig 3.1. Diagram illustrating our proposed approach


Physics-based muscle modelling of a human face is a more complex approach. The aim is to simulate the muscular layout and tissue behaviour of the face to generate a more realistic representation (Waters 1987). (Choe 2001) models the skin surface using a finite element method to simulate the deformation caused by expression muscles; the majority of their work focuses on the relationship between muscle actuation and facial surface deformation. However, full physical representations of facial muscle actuation can result in high computational cost. (Terzopoulos 1990) outline a hierarchical model of the human face incorporating physical approximations of facial tissue; they reported that their approach was slow due to the high computational costs. Physics-based models rely heavily on the mathematics of the system and do not offer a way to affect the output without changing the muscular system. (Choe 2001) took the split approach of giving artists the ability to sculpt an initial draft of how the system should react, so that they could get, as they put it, 'a more predictable deformation of the face' after training.

The face also possesses extra attributes which combine to create an expression. Human skin has a wrinkling quality when deformed, a feature very prominent in a smiling face. (Viaud 1992) created what they called a 'wrinkle mask', which modelled potentially expressive wrinkle lines that can be manually applied to varying models. A more automated approach is presented by (Wu 1994), who used muscle masks and spring forces to approximate the muscle contractions which generate the wrinkles, akin to the physics-based approach discussed earlier.

The final addition to facial animation is the direct animation of the eyes. Eye movement has a direct impact on the way we interact with each other and is important when creating a realistic facial expression representation (Vertegaal et al. 2001), (Lee et al. 2002).

3. Proposed Approach

In this section we describe how we intend to tackle the problem. The section is divided into Data Capture, Data Processing, Machine Learning and Facial Representation. Figure 3.1 illustrates our proposed approach graphically.


3.1 Data Capture

This section describes our approach towards capturing the data, the constraints to be set and the kind of data to be captured. Where known, the concrete hardware to be used is identified. Since we are using both video and audio data, the section is split into Video Data Capture and Audio Data Capture parts.

Video Data Capture

We intend to record video data of a person in a fixed position in front of the camera setup. For the recording itself, we will use two cameras: a Kinect camera, which will provide accurate (16-bit) depth information as well as 640x480 pixel video recording, and a higher-definition camera that will provide more accurate video data at a higher frame rate, ideally 100 fps. We currently have a Canon XM2 camera which records at 25 fps; it could be used, however we are still looking for another camera with a higher frame rate. To limit the complexity of the scene, we intend to perform the recording in a dark room, using UV lighting at 320-400 nm, light-reflective markers on the actor's face and black material in the background. The actor's position will be fixed, with at most 30-degree pose variations and no inter-object occlusions.

Audio Data Capture

The use of fairy tales (Hofer 2007) seems a clever idea, as it covers a large emotional range and thus highlights audio and visual features. To make processing easier, we are going to use a scripted set of data (Yehia 2002) and a set of spectrum-balanced sentences, which are known to increase intelligibility (Healy 2006).

A close-talking microphone provides noiseless speech data but interferes with the visual data, hence it will not be used. Our audio recording equipment depends on availability. We have fairly high-quality B&K microphones available which, with pre-amplifiers and a MOTU 8pre with its high level of control for talkbacks and listenbacks (MOTU.com 2010), should produce good results in a semi-isolated environment.



3.2 Data Processing

Video and audio data need to be processed so that we can later establish a relationship between the two. This section covers how we intend to extract the locations of facial features from the video data and how prosodic features will be extracted from the audio. As before, the section is split into two parts: Facial Feature Tracking and Audio Prosody Extraction.

Facial Feature Tracking

As mentioned in the Data Capture section, the recorded actor will have reflective physical markers positioned on the eyebrows. The markers will be located according to the MPEG-4 FAP specification. A further two fixed points will be located on a skull cap and will be used as a reference for re-initialisation of other markers if they become occluded or lost. These fixed points could also be used to remove rigid head movement in a way similar to that presented in (Martinez-Lazalde 2010); however, we intend to use a head tracker acquired from Dr Jon Barker, which we hope will provide more accurate data.

A tracking algorithm will be used to track the light-reflective markers. Due to the scene setup, we intend to develop a kernel-based algorithm utilising colour as the feature to track. This is chosen for its simplicity, computational efficiency and relative robustness with our video data. Essentially, a modified version of the algorithm proposed by Dr Jon Barker (Barker 2005) will be developed. The modifications relate to using a colour density distribution (histogram) rather than a luminance measure. Based on the research presented, we are confident that using UV markers, with UV light to illuminate them, will give us very distinct and unique features to track within the scene, and that tracking these with a histogram-based tracker should provide good, accurate data. We expect the tracker's output to be subject to a little (1-2 pixel) jitter; we believe this is acceptable, as we intend to record actors reasonably close to the camera, so the displacement of the tracked features should amount to roughly 5-25 pixels (for a 640x480 pixel camera and a face occupying between 2/3 and the whole of the camera view height). Nonetheless, if after the first integration stage we decide the jitter causes problems for the graphical representation, a Kalman filter will be implemented to smooth out the data.

Initially we will not implement any means of dealing with occlusions other than keeping a tracked marker at its previously known location. In a further iteration of the system we intend to implement an Adaptive Marker Collocation model incorporating online updates, as described by Dr Barker et al. (Barker 2005).

Audio Prosody Extraction

Based on our research, and aiming towards automatic, realistic synthesis of head gestures and eyebrow movement from speech prosody, we focus our analysis on intensity, pitch and the duration of utterances (Ananthakrishnan 2005).

The motion of the facial markers and the head motion tracker can be calculated from the voice signal, given a correlation model provided by the neural network. Linear Predictive (LP) analysis of the data provides us with pitch contours. LP is widely used because it is a fast and simple, yet effective, way of estimating the main parameters of speech signals (Seo 2008). We fit a cubic spline over unvoiced regions. The power spectrum is used to scale the intensity of expressions. Other prosodic features to be measured are the root mean square value (RMS), used to locate pauses, and the signal-to-noise ratio (SNR), used to remove any noise gathered.
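As a rough illustration of this stage, the sketch below computes per-frame RMS energy and a simple autocorrelation pitch estimate; it stands in for the LP-based analysis described above, and the frame length, hop size and voicing threshold are assumptions to be tuned.

```python
import numpy as np

def frame_prosody(signal, sr, frame_len=0.04, hop=0.01, fmin=75, fmax=400):
    """Per-frame RMS energy and autocorrelation pitch estimate (0 marks unvoiced)."""
    n, h = int(frame_len * sr), int(hop * sr)
    feats = []
    for start in range(0, len(signal) - n, h):
        frame = signal[start:start + n] * np.hanning(n)
        rms = np.sqrt(np.mean(frame ** 2))               # intensity, used to locate pauses
        ac = np.correlate(frame, frame, "full")[n - 1:]  # autocorrelation for lags 0..n-1
        lo, hi = int(sr / fmax), int(sr / fmin)          # plausible pitch period range
        lag = lo + np.argmax(ac[lo:hi])
        pitch = sr / lag if ac[lag] > 0.3 * ac[0] else 0.0
        feats.append((rms, pitch))
    return np.asarray(feats)

# Example: one second of a synthetic 150 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
feats = frame_prosody(np.sin(2 * np.pi * 150 * t), sr)
```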

Blocking and Data Association

We will sample the audio prosody data a variable number of times over a variable period of time and associate this with the tracker information for an instant in the middle of that period, as our literature indicates (Chen 2001).

We have no research indicating precisely what sampling density and duration should be associated with a given frame of tracking information. Part of the project will be spent experimenting with these values to find a configuration that produces viable results.

3.3 Machine Learning

We intend to use an artificial neural network to learn the relationship between eyebrow movement and speech data, as the research material covered supports this idea. Chen (Chen 2001) provides a topology for a neural network that is said to map the frequency and energy of formants to MPEG-4 Facial Animation Parameters. We could use a similar topology to map our prosodic speech features to the associated tracked points in the video. As in the cited paper, for each prosodic feature we will have inputs for multiple periods of time sampled both before and after the associated facial marker tracking information, so that the neural network can work out the importance of their change over time.
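A hedged sketch of this training step is shown below using scikit-learn's MLPRegressor; the window size, feature count, marker count and hidden-layer sizes are illustrative and do not reproduce Chen's published topology.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# X: flattened prosody windows (see the pairing sketch in Section 2.2),
# Y: tracked eyebrow marker coordinates for the centre frame of each window.
# Random stand-in data; shapes and layer sizes are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, (2 * 5 + 1) * 3))   # 11 frames x 3 prosodic features
Y = rng.normal(size=(2000, 8))                 # 4 eyebrow markers x (x, y)

net = MLPRegressor(hidden_layer_sizes=(64, 32), activation="tanh",
                   max_iter=1000, random_state=0)
net.fit(X, Y)

markers = net.predict(X[:1])   # predicted marker positions for one prosody window
```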

Another approach, novel to our research, would be to train multiple neural networks for different areas of the face. In this way we attempt to simplify the problem by splitting it into several presumably simpler problems, such that the networks may provide better results, perhaps with simpler topologies. This may include separating the problem into individual eyebrow and cheek areas for each side of the face.

The output of the neural network, given a time segment of audio prosody input, will be in the same format as the facial tracking data. Once these tracking points have been reconstructed, they will be interpreted to animate the facial representation.

3.4 Facial Representation

We propose a two-stage approach to 3D facial representation, the first being a straightforward placement of points in 3D space corresponding to the positions of the tracked points generated by the neural network. The second builds on the first by using these points to position and deform a pseudo-physically based facial model. This approach will not incorporate any skin-wrinkling capabilities or sophisticated eye-direction animation, as these are out of the scope of the project.

Stage 1 - Tracking points in 3D space

The points attained from the trained neural network must be represented in three-dimensional space. The static points mentioned earlier in Sections 3.2 and 3.3 are used to transform the points according to the current head orientation by constructing a transformation matrix.

To ensure that the markers map directly to the correct attributes on the graphical face, an initial calibration stage must be undertaken; this will be the only manual intervention required by the system. The markers will be positioned on the neutral face manually, giving the flexibility of using a graphics head of any shape and size.
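One possible way to build this transformation is a Procrustes (Kabsch) alignment of the fixed reference markers, sketched below; it assumes at least three non-collinear reference points are available, which is stricter than the two skull-cap points described above.

```python
import numpy as np

def rigid_transform(ref_neutral, ref_current):
    """Estimate rotation R and translation t mapping neutral reference points
    to their current positions (Kabsch / orthogonal Procrustes alignment).

    ref_neutral, ref_current : (K, 3) arrays of fixed reference markers.
    At least three non-collinear points are assumed (an illustrative requirement).
    """
    cn, cc = ref_neutral.mean(0), ref_current.mean(0)
    H = (ref_neutral - cn).T @ (ref_current - cc)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cc - R @ cn
    return R, t

def to_head_space(points, R, t):
    """Remove rigid head motion: express tracked points in the neutral head frame."""
    return (points - t) @ R             # inverse of p -> R p + t (R is orthonormal)
```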

Stage 2 - Pseudo-physical model

The second stage builds on the previous one by using the points to drive a three-dimensional face. The approach is similar to that outlined in (Rivers 2007), applied to the face to create a soft outer skin. Facial tissue and muscles are treated as one and are simulated using a mass-spring damping system (Terzopoulos 1990), (Wu 1994).


Fig 3.4.1: Mass-spring damping system visualised in two-dimensional space, incorporating both deformable tissue (pink) and non-deformable bone layers (yellow). Left: lattice in relaxed state; Right: lattice with force applied to one of the masses. (Dark blue indicates non-deformable masses that are part of the bone; the remaining masses deform according to mass-spring physics.)


The expressive face is constructed in four layers: a voxelised lattice governed by mass-spring physical attributes (Figure 3.4.1) and a skull used to force non-deformable areas of the head to remain rigid. As outlined by (Choe 2001), the skeletal structure of the head has very limited degrees of freedom; however, the face possesses a rich collection of expression muscles. In order to model these in our system, we propose another layer of bone points corresponding to the MPEG-4 mark-up system. For example, the eyebrow regions will be governed by a bone modifier consisting of an invisible bone used to move the eyebrow region of the lattice. Eyeballs will be treated separately from the rest of the system, as their animation is far simpler and does not require any deformation.
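A minimal sketch of one integration step of such a mass-spring damping lattice is given below; the stiffness, damping, time step and the explicit Euler integrator are illustrative choices, not the parameters of our final system.

```python
import numpy as np

def step(pos, vel, rest, springs, fixed, k=40.0, c=1.5, dt=1e-3):
    """One explicit Euler step of a mass-spring damping lattice.

    pos, vel : (N, 3) positions and velocities of the lattice masses
    rest     : rest length of each spring
    springs  : (S, 2) index pairs connecting masses
    fixed    : boolean mask of non-deformable 'bone' masses that never move
    k, c, dt : spring stiffness, damping coefficient, time step (illustrative values)
    """
    force = -c * vel                                   # damping force
    for (i, j), r0 in zip(springs, rest):
        d = pos[j] - pos[i]
        length = np.linalg.norm(d)
        f = k * (length - r0) * d / (length + 1e-9)    # Hooke spring force along the edge
        force[i] += f
        force[j] -= f
    vel = vel + dt * force                             # unit masses assumed
    vel[fixed] = 0.0                                   # bone layer stays rigid
    return pos + dt * vel, vel
```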

Performance is a key factor when creating a physics-based system. This has been kept in mind when designing the system and will primarily be tackled by utilising the graphics processing unit for the majority of the computation. Both nVidia's GPU toolkit CUDA and GLSL will be used to achieve this.

3.5 Evaluation

In this section we cover how the project will be evaluated. The section is split into parts to ensure the processes we follow provide the expected results and that the data we use is suitable for the task. The parts cover evaluation of the methods used for collecting and processing the data, evaluation of the machine learning approach, and graphics evaluation.

Data Collection Evaluation

We will evaluate the video and audio recorded by each of the cameras to see whether it is suitable for the tracker developed. The data will be judged by the following criteria: darkness of the room, uniqueness of markers within the scene, background noise level, head pose changes, clarity of the voice and occlusions. Only data passing the above criteria will be used for training the ANN.



Data Treatment Evaluation

The accuracy of the facial feature tracker will be evaluated by manually labelling a 1-minute video sequence from our data set. We will compare the actual locations of markers in the video sequence against the output produced by the algorithm. A sequence of 1 minute should provide us with 1500 labelled frames (for a 25 fps recording), which we believe will provide a reliable estimate of the tracker's accuracy.
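The comparison itself reduces to a mean pixel error per marker, as in the small sketch below; the array shapes are an assumed layout for the labelled and tracked sequences.

```python
import numpy as np

def tracker_error(predicted, ground_truth):
    """Mean Euclidean pixel error per marker over a labelled sequence.

    predicted, ground_truth : (frames, markers, 2) arrays of pixel coordinates.
    """
    err = np.linalg.norm(predicted - ground_truth, axis=-1)   # per frame, per marker
    return err.mean(axis=0)                                    # mean error per marker
```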

Audio prosody feature tracking will be assumed to be accurate so long as noise levels are low and the speaker's voice is clear. We may employ MATLAB to verify that the algorithms we use are correct.

Machine Learning Evaluation

After the neural network has been trained, we will test it with a sample of the recorded data it was not trained with, to see whether it accurately predicts the head and brow movements of the actor, or whether the model produces movements that are found believable.

Graphics Evaluation

Deng (Deng 2008) outlines the 'ultimate goal for research in facial modelling and animation' as being a system which creates realistic animations, operates in real time, is automated as much as possible, and adapts easily to individual faces. Evaluation of the graphics system will be driven by these goals.

In order to isolate the evaluation of the graphics system from the rest of the system, the speech processing part of the project will be bypassed; instead, the face will be driven directly from the video marker data, akin to performance-driven graphics systems.

The graphics simulation is required to perform at real-time speeds; this can be directly measured in the system as frames per second (fps). Anything over 30 fps will be treated as real time. True performance will be determined by increasing the lattice resolution and measuring the fall in fps as the computational load grows rapidly with resolution.

The system will be evaluated on two meshes to determine adaptability. In order to establish whether the generated animations are realistic, they need to be tested against human judgement, as humans are very sensitive to facial animation. This will be undertaken in the form of a categorisation survey in which each participant is asked to categorise output facial images into categories of emotion.

4. Summary

The area of prosody-driven facial animation has been well researched. We have reviewed a number of these works and in doing so have broken the task down into smaller component parts. The background of each of these parts was then further studied to gain a full understanding of the problem. The system we propose is based on this decomposed analysis of the problem.

A facial feature extraction system will be developed in order to pull out the locations of facial features from a video sequence. The system will be developed in two iterations. After the first iteration we will have a tracker able to extract and save the locations of a number of features in the video sequence. The second iteration will extend the tracker to be more reliable under occlusions. If it proves necessary, a filtering system will be employed to smooth the trajectories of facial feature movement. The data extracted by the tracker will then be used to establish the relation between facial feature movement and speech.

Prosodic feature extraction over a controlled data set shall enable us to generate correlated sequences of facial, head and speech characteristics.

A neural network will be constructed that has inputs to accommodate a series of samples of captured speech prosody over a duration of time, and outputs to accommodate the tracking data needed to animate the facial representation.

A graphics system was chosen which takes a pseudo-muscle based approach. It incorporates previous work in the area and, based on the prototypes already created, should produce a highly flexible representation at real-time speeds.



5. Gantt Chart


Fig 5.1. Gantt chart for our proposed approach; we expect to be able to complete the project by year's end.




6. Bibliography

Albrecht, Irene. et al. (2002) "Automatic Generation of Non-Verbal Facial Expressions from Speech"

Ashby, Maidment (2005) "Introducing Phonetic Science" Cambridge University Press

Ananthakrishnan, Narayanan (2005) "An Automatic Prosody Recognizer Using A Coupled Multi-Stream Acoustic Model And A Syntactic-Prosodic Language Model", Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Autodesk. Autodesk Maya [Computer Program] http://usa.autodesk.com (Accessed 4 December 2010)

Barker, Jon. (2005) "Tracking Facial Markers With an Adaptive Marker Collocation Model". Department of Computer Science, University of Sheffield

Barr, D. (2001) "Trouble in Mind: Paralinguistic Indices of Effort and Uncertainty In Communication", Oralité et Gestualité, Actes du Colloque ORAGE, pages 597-600

Bernstein, L.E., Demorest, M.E. et al (1998) "Single-channel vibrotactile supplements to visual perception of intonation and stress", Journal of the Acoustical Society of America

Blanz, V., Vetter, T. (1999) A Morphable Model For The Synthesis of 3D Faces. Proceedings of the 26th annual conference on computer graphics and interactive techniques, pp. 187-194

Bradsky, Gary R. (1998) "Computer Vision Face Tracking for use in a Perceptual User Interface". Microcomputer Research Lab, Santa Clara, CA, Intel Corporation

Brand, Matthew. et al. (1999) "Voice Puppetry". SIGGRAPH '99 Proceedings of the 26th annual conference on Computer graphics and interactive techniques

Busso, Carlos. et al. (2005) "Natural Head Motion Synthesis Driven by Acoustic Prosodic Features". Computer Animation and Virtual Worlds, Volume 16, Issue 3-4, July 2005

Cameron, James. et al. "Interview with James Cameron". Popular Mechanics, www.popularmechanics.com

Cavé, C. et al. (1996) "About The Relationship Between Eyebrow Movements And F0 Variations", In Proc. of the ICSLP, Philadelphia, pp. 2175-2179

Chen, Yiqiang. et al. (2001) "Speech Driven MPEG-4 Based Face Animation via Neural Network". Proceedings of PCM '01, the Second IEEE Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing

Choe, B. et al (2001). Performance-Driven Muscle-Based Facial Animation. The Journal of Visualization and Computer Animation, Vol 12, pp 67-79

Chuang, E., Bregler, C. (2002). "Performance driven facial animation using blendshape interpolation."

Cootes, T.F. (2001) "Active Appearance Models", IEEE PAMI, Vol. 23, No. 6, pp. 681-685

Debunne, G. et al (2000). Adaptive Simulation of Soft Bodies in Real-Time. Computer Animation, pp 133-144

Deng, Zhigang. et al. (2004) "Audio Based Head Motion Synthesis For Avatar Based Telepresence Systems". Proceedings of the 2004 ACM SIGMM workshop on Effective Telepresence

Deng, Zhigang. et al (2008). Data-Driven 3D Facial Animation. Springer-Verlag London Limited.

Dockstader, S. and Tekalp, A. M. (2001). Multiple camera tracking of interacting and occluded human motion. Proceedings of the IEEE 89, 1441-1455

Ekman, P. (1979) "About brows: emotional and conversational signals", In M. v. Cranach, K. Foppa, W. Lepenies, and D. Ploog, editors, Human Ethology: Claims and limits of a new discipline: contributions to the Colloquium, pages 169-248.

Ekman, P. (1999). "Emotional and conversational nonverbal signals". In L. S. Messing & R. Campbell (Eds.), Gesture, speech, and sign (pp. 44-55). Oxford University Press.

Fabrice, Bourel. (2000) "Robust Facial Feature Tracking", School of Computing, Staffordshire University, Stafford.

Flecha-García, María L. (2008) "Eyebrow Raises In Face-To-Face Dialogue As Markers In Discourse Structure, Utterance Function, And Prosody", Speech and Face to Face Communication Workshop in memory of Christian Benoît

Foxton, Jessica M. et al (2009), "Cross-modal facilitation in speech prosody"

Graf, Hans Peter. et al (2002) "Visual Prosody: Facial Movements Accompanying Speech", in Proc. of IEEE Int'l Conf. on Automatic Faces and Gesture Recognition

Gross, R., Matthews, I. and Baker, S. (2006) "Active Appearance Models with Occlusion", in Image and Vision Computing, Vol. 24, No. 6, 2006, pp. 593-604.

Hadar, U. et al (1983) "Head movement correlates of juncture and stress at sentence level", Language and Speech

Hadar, U., Steiner, T.J., Grant, E.C., Rose, F.C. (1984) "The timing of shifts in head posture during conversation", Human Movement Science

Healy, Montgomery (2006) "Consistency of Sentence Intelligibility Across Difficult Listening Situations"

Hofer, Gregor, et al. (2007) "Automatic Head Motion Prediction from Speech Data", In Proc. Interspeech 2007, Antwerp, Belgium

Hong, Pengyu, et al. (2002) "Real-Time Speech-Driven Face Animation with Expressions Using Neural Networks", IEEE Transactions On Neural Networks, Vol. 13, No. 4.



Ishikawa, Takahiro, et al. (1999) "3D Estimation of Facial Muscle Parameter from the 2D Marker Movement Using Neural Network", Lecture Notes in Computer Science, Volume 1352, 671-678

Kapoor, Ashish. (2002) "Real-Time, Fully Automatic Upper Facial Feature Tracking". Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Cambridge.

Kuratate, Takaaki, et al. (1999) "Audio-Visual Synthesis of Talking Faces From Speech Production Correlates", Proceedings of the 6th European Conference on Speech Communication and Technology.

Lee, S. P. et al (2002). Eyes Alive. Proceedings of the 29th annual conference on computer graphics and interactive techniques.

Lucas, B. D., Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence

Lucas, George (1999). Star Wars: Episode 1 - The Phantom Menace. [Film]. Lucasarts

Mary, Leena, B. Yegnanarayana (2008) "Extraction and representation of prosodic features for language and speaker recognition"

Martinez Lazalde, Oscar Manuel. (2010) "Analyzing and Evaluating the Use of Visemes In an Interpolative Synthesizer for Visual Speech". DCS, University of Sheffield, Thesis for the Degree of Doctor of Philosophy.

Mesit, J. et al (2007). 3D Soft Body Simulation Using Mass-Spring System with Internal Pressure Force and Simplified Implicit Integration. Journal of Computers, Vol 2, pp 34-43

Mitchell, Tom M. (1997) "Machine Learning", McGraw Hill.

Mittal, A. and Davis, L. (2003). M2 tracker: A multiview approach to segmenting and tracking people in a cluttered scene. Int. J. Comput. Vision 51, 3, 189-203.

MOTU.com (2010) "MOTU 8pre", http://www.motu.com/products/motuaudio/8pre

Munhall, K.G. et al (2004) "Visual Prosody and Speech Intelligibility"

Nguyen, Tan Dat. (2010) "Tracking Facial Features under Occlusions and Recognizing Facial Expressions in Sign Language". Department of Electrical and Computer Engineering, National University of Singapore.

Paeschke, A. et al (1999) "F0-Contours in Emotional Speech". In Proc. International Congress of Phonetic Sciences '99, pages 929-931

Pighin, F. et al (1997). Realistic Facial Animation Using Image-Based 3D morphing.

Pockaj, Roberto. (1998) "FAPs Specification", http://www.dsp.dist.unige.it/~pok/RESEARCH/MPEG/fapspec.htm

Pighin, F. et al (1998) Synthesizing realistic facial expressions from photographs. SIGGRAPH 98 Conference Proceedings, Annual Conference Series, pp 75-84.

Provot, X. (1995) Deformation Constraints in Mass-Spring Model to Describe Rigid Cloth Behavior. Graphics Interface

Rivers, A. et al (2007). Fast Lattice Shape Matching for Robust Real-Time Deformation. ACM Transactions on Graphics (SIGGRAPH 2007).

Samer, Al Moubayed (2010) "Auditory visual prominence: From intelligibility to behavior", Journal on Multimodal User Interfaces, Vol 3, No 4, pp 299-309.

Sargin, Mehmet Emre, et al. (2002) "Prosody-Driven Head-Gesture Animation"

Sargin, Mehmet Emre. et al. (2008) "Analysis of Head Gesture and Prosody Patterns for Prosody-Driven Head-Gesture Animation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 30, No. 8, pp. 1330-1345

Sederberg, T. et al (1986). Free-Form Deformation of solid geometric models. Proceedings of the 13th annual conference on Computer graphics and interactive techniques, pp 151-160

Seo, N. (2008) "Pitch Detection"

Shi, Jianbo. (1993) "Good Features to Track", Technical Report.

Terzopoulos, D. et al (1990). Physically-Based Facial Modelling, Analysis, and Animation. Journal of Visualization and Computer Animation, 1, pp 73-80

Vertegaal, R. et al (2001). Eye Gaze Patterns in Conversations: There is More to Conversational Agents Than Meets the Eyes. Proceedings of the SIGCHI conference on Human factors in computing systems.

Viaud, M. et al (1992). Facial Animation with Wrinkles.

Wang, Jianyu. (2003) "Facial feature tracking combining model-based and model-free method". ICME '03 Proceedings of the 2003 International Conference on Multimedia and Expo, Volume 3.

Waters, K. (1987). A Muscle Model for Animating Three-Dimensional Facial Expression. Computer Graphics, pp 17-24.

Wu, Y. et al (1994) A Plastic-Visco-Elastic Model For Wrinkles in Facial Animation and Skin Aging. Proceedings of the second Pacific conference on computer graphics and applications, Pacific Graphics '94: Fundamentals of computer graphics, pp 201-214

Yehia, Hani C. et al. (2002) "Linking Facial Animation Head Motion and Speech Acoustics", Journal of Phonetics, Volume 30, Issue 3, Pages 555-568.

Yilmaz, Alper. (2006) "Object Tracking: A Survey". ACM Computing Surveys, Volume 38, Issue 4.

Yoshizawa, S. et al (2002). A simple approach to interactive free-form shape deformations. 10th Pacific Conference on Computer Graphics and Applications.

Zhou, Mingcai. (2010) "AAM based Face Tracking with Temporal Matching and Face Segmentation". Computer Vision and Pattern Recognition.