Nov 17, 2013

Perspectives for Articulatory
Speech Synthesis

Bernd J. Kröger


Department of Phoniatrics, Pedaudiology, and Communication Disorders

University Hospital Aachen and RWTH Aachen, Germany

Bernd J. Kröger

bkroeger@ukaachen.de



Examples: ASS by Peter Birkholz (2003-2007), Univ. of Rostock

application: railway announcement system

application: dialog system

Outline


Introduction


Vocal tract models


Aerodynamic and acoustic models


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

Outline


Introduction: Perspectives


Vocal tract models


Aerodynamic and acoustic models


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

Perspectives for Articulatory Speech Synthesis?


Commercial or technical vs. scientific:

Is high-quality articulatory speech synthesis a realistic goal?

Yes!

If we have it:

Advantage in comparison to current corpus-based synthesis methods:

Variability

Different voices simply by parameter variation (sex, age, voice quality)

→ no need for different corpora

Individual differences in articulation

→ no need for different corpora

e.g.: degree of nasalization; individual sound / syllable realizations

Different languages

→ no need for different corpora

Perspectives for Articulatory Speech Synthesis?


Commercial or technical vs. scientific goals:

audiovisual speech synthesis:

modeling 3D talking heads

towards “the virtual human” (avatars)

towards “humanoid robots”

Engwall, KTH Stockholm (1995-2001)

Need for more natural talking heads

Perspectives for Articulatory Speech Synthesis?


Scientific perspectives:

ASS may help to collect and condense knowledge of speech production:

… of articulation (sound geometries, speech movements, coarticulation)

… of vocal tract acoustics

… of control concepts; different approaches exist:

neural control: self-organization, training algorithms (Kröger et al. 2007)

gestural control: concept for articulatory movements (Birkholz et al. 2006, Kröger 1998)

segmental control (Kröger 1998)

corpus-based control (Birkholz et al., this meeting)


Outline


Introduction: Components of ASS systems


Vocal tract models


Aerodynamic and acoustic models


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

Components of articulatory speech synthesis (block diagram):

control module

→ vocal tract and glottis model

→ area function → tube model

→ aerodynamic-acoustic simulation

→ speech signal


Outline


Introduction


Vocal tract models: types


Aerodynamic and acoustic models


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

Different types of vocal tract models


Statistical models:

Parameters derived on the basis of statistical analysis, e.g., of MR/CT/X-ray image corpora (Maeda 1990, Badin et al. 2003)

Geometrical models:

Vocal tract shape is described by a priori defined parameters

area-function related: Stevens & House 1955

articulator related: Mermelstein 1973, Birkholz 2007

Biomechanical models:

Modeling of articulators using finite element methods (Dang 2004, Engwall 2003, Wilhelms-Tricarico 1997)


1D, 2D, 2D+, 3D models

1D: Stevens & House (1955)

2D: Flanagan et al. (1980)

2D+: Dang et al. (2004)

3D: Engwall (2003)

3D: Badin et al. (2003): gridline system

3D models are preferred for ASS

Geometrical 3D vocal tract model: Birkholz (2007)

[a]



[i]



[schwa]

based on MRI data of one speaker (and CT data of a replica of teeth and hard palate)

Example:

Vocal tract parameters (a priori)

Lips (2 DOF)

Mandible (3 DOF)

Hyoid (2 DOF)

Velum (1 DOF)

Tongue (12 DOF)

Minimal cross-sectional areas (3 DOF)

→ 23 basis parameters

Meshes of the vocal tract model (Birkholz 2005)

Figure of the complete vocal tract model (Birkholz 2005)

Variation of individual parameters

Variation of the lower jaw, leaving all other parameters constant

→ co-movement of dependent articulators: lips and tongue

Outline


Introduction


Vocal tract models


Aerodynamic and acoustic models


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

Aero-acoustic simulation

Four major types of models

Reflection-type line analog: forward and backward traveling partial flow or pressure waves are calculated; time-domain simulation (e.g., Kelly & Lochbaum 1962, Strube et al. 1989, Kröger 1993)

Problem: no variation of vocal tract length; constant tube length

Transmission line circuit analog (preferred): digital simulation of electrical circuit elements; time-domain simulation (e.g., Flanagan 1975, Maeda 1982, Birkholz 2005)

Problem: modeling frequency-dependent losses (radiation, …)

Hybrid time-frequency domain models (Sondhi & Schroeter 1987, Allen & Strong 1985)

Problem: flow calculation → modeling aerodynamics; sound sources

Three-dimensional FE modeling of acoustic wave propagation and aerodynamics (e.g., ElMasri et al. 1996, Matsuzaki & Motoki 2000): helpful for an exact formulation of aero-acoustics in the vicinity of noise sources (glottis, frication)

Problem: complexity and high computational effort; real-time synthesis cannot be achieved
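The reflection-type line analog (the first model type above) is the easiest to illustrate in code. A minimal sketch, not any of the cited implementations: pressure waves scatter at each junction between tube sections of equal length; the areas and the unit pulse below are made-up illustrative values.

```python
def junction_coeff(a_left, a_right):
    """Reflection coefficient for a pressure wave hitting the junction
    between two tube sections (characteristic impedance Z = rho*c/A)."""
    return (a_left - a_right) / (a_left + a_right)

def scatter(f_in, b_in, r):
    """One Kelly-Lochbaum scattering junction: the forward wave arriving
    from the left (f_in) and the backward wave arriving from the right
    (b_in) yield the outgoing backward and forward waves."""
    b_out = r * f_in + (1 - r) * b_in   # reflected back into the left section
    f_out = (1 + r) * f_in - r * b_in   # transmitted into the right section
    return b_out, f_out

# Illustrative junction: the area halves from 4 cm^2 to 2 cm^2.
r = junction_coeff(4.0, 2.0)          # one third of the pulse reflects back
b_out, f_out = scatter(1.0, 0.0, r)   # unit pulse from the left, silence from the right
```

With equal areas r = 0 and the pulse passes unchanged; a full closure (a_right → 0) gives r → 1, i.e., total reflection. Because every section must have the same length, the overall tube length is fixed, which is exactly the restriction noted above.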

Extraction of the area function: Birkholz (2007)

Note: The area function cannot be calculated from 2D vocal tract models: from midsagittal data we cannot deduce cross-sectional data!

cross-sections: perpendicular to airflow; they vary in shape!

cross-sectional area values from glottis to mouth → area function

midline of VT

Calculation of the area function for KL synthesis: Kröger (1993); needs constant VT length

… for a complete sentence: “Das ist mein Haus” (“That’s my house”)

Illustration (begin: VT model; end: discrete area function):

Green: midsagittal view

White: gridline system for calculation of the area function

continuous area function (varying vocal tract length) → discrete area function → defining tube sections for the acoustic model
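The continuous-to-discrete resampling step can be sketched as follows. A minimal version, assuming the continuous area function is given as (distance-from-glottis, area) samples; the toy tract below is illustrative, not Kröger's (1993) data.

```python
def discretize_area_function(distances, areas, n_sections):
    """Resample a continuous area function (cross-sectional area vs.
    distance from the glottis) into n_sections tube sections of equal
    length, taking the linearly interpolated area at each midpoint."""
    total_len = distances[-1]
    seg_len = total_len / n_sections
    result = []
    for i in range(n_sections):
        x = (i + 0.5) * seg_len           # midpoint of section i
        j = 0
        while distances[j + 1] < x:       # find the bracketing samples
            j += 1
        t = (x - distances[j]) / (distances[j + 1] - distances[j])
        result.append(areas[j] + t * (areas[j + 1] - areas[j]))
    return seg_len, result

# Toy area function: 16 cm tract, constricted in the middle.
seg_len, sections = discretize_area_function([0.0, 8.0, 16.0], [4.0, 1.0, 4.0], 4)
```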

From VT over area function to vocal tract tubes

now: for a transmission line circuit model (Birkholz 2005):

mouth opening

trachea / subglottal system

glottis

vocal tract: pharyngeal and oral part

teeth

nasal cavity and sinuses (i.e., indirectly coupled nasal cavities: Dang & Honda 1994)

branch

… vocal tract tubes can vary in length

Second example:

Next step: From tube model to acoustic signal

using the transmission line circuit analog (Birkholz 2005):

Geometrical parameters of an (elliptical) tube section:

length: l

cross-sectional area: A

perimeter: S (elliptic vs. round)

On the basis of mass (inertia), compressibility, and losses of the air column:

Acoustic parameters of a tube section

→ parameters of lumped elements of the electrical transmission line

→ calculation of pressure and flow for each tube section
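The geometry-to-lumped-element step can be sketched for the lossless case using the textbook mapping (cf. Flanagan's transmission line formulation): the air column's inertia becomes an acoustic inductance and its compressibility an acoustic capacitance. Losses and the perimeter-dependent wall terms, which the full model needs, are deliberately left out of this sketch.

```python
RHO = 1.2        # air density [kg/m^3]
C_SOUND = 350.0  # speed of sound [m/s]

def tube_lumped_elements(length, area):
    """Lossless lumped elements of one tube section in the transmission
    line analog: acoustic inductance (inertia of the air column) and
    acoustic capacitance (its compressibility). Friction and wall losses,
    which would involve the perimeter S, are omitted here."""
    l_acoustic = RHO * length / area                      # acoustic mass
    c_acoustic = length * area / (RHO * C_SOUND ** 2)     # acoustic compliance
    return l_acoustic, c_acoustic

# A 1 cm section with a 4 cm^2 cross-section (SI units):
la, ca = tube_lumped_elements(0.01, 4e-4)
```

Chaining one such L/C pair per tube section gives the ladder network from which pressure and flow are computed section by section.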

Calculation of the acoustic speech signal: Kröger (1993)

… for a complete sentence: “Das ist mein Haus” (“That’s my house”)

Illustration:

tube section model (area function): oral part, nasal part (red arrows: insertion of Bernoulli pressure drop and of noise source)

lung pressure, vocal fold tension, glottal aperture for the whole utterance

white: progress of calculation (progress bar)

instantaneous acoustic signal (20 ms window)

time line for complete utterance

Display of air flow and air pressure calculated along the transmission line: Kröger (1993)

… for one glottal cycle within a complete sentence: “Das ist mein Haus” (“That’s my house”)

magenta: pressure; blue: flow; red: glottal mass pair; light blue: force on the mass pair

strong acoustic excitation at the time instant of glottal closure (after the glottal closing phase)

high flow values during glottal opening

tube section model (current area function)

lung pressure, vocal fold tension, glottal aperture for the whole utterance

white: progress of calculation (progress bar)

acoustic signal just calculated (20 ms window)

current pressure values of each tube section

Summarizing: Vocal tract models and acoustic simulation

The area function is the basis for the calculation of the acoustic signal.

Calculation of the area function cannot be done in 2D models

→ this disqualifies 2D VT models for articulatory-acoustic speech synthesis

Parametric VT models should currently be preferred for building up high-quality articulatory speech synthesizers. Advantages:

low computational effort for calculating vocal tract geometries

strong flexibility to reach auditorily satisfying sound targets

In the future these models should be replaced by statistically based models and by biomechanical models.

Summarizing: Vocal tract models and acoustic simulation

Acoustic simulation:

Problems occurring with the different acoustic models:

Variation of the length of tube sections

Modeling frequency-dependent losses

Computational effort

Conclusion:

The transmission line circuit analog (e.g., Birkholz et al. 2007) allows a compromise between quality and computational effort:

Real-time synthesis should be possible in the near future on normal PCs using TLCA.

Outline


Introduction


Vocal tract models


Aero-acoustic simulation of speech sounds


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

Glottis models

Self-oscillating models (preferred) (e.g., Ishizaka & Flanagan 1972 two-mass model and derivatives)

physiological control parameters: vocal fold tension, glottal aperture, …

calculation of the glottal area waveform (lower, upper) and glottal flow

Parametric glottal area models (e.g., Titze 1984 and derivatives)

glottal waveform (opening-closing movement) is given

calculation of glottal flow

Parametric glottal flow models (e.g., LF model 1985)

acoustically relevant control parameters: F0, open quotient, maximum negative peak flow (time derivative of glottal flow), return phase, …

→ direct control of acoustic voice quality
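The parametric-flow idea can be illustrated with a Rosenberg-style pulse, a simpler relative of the LF model named above (not the LF model itself): the open quotient and the skew of the pulse are the acoustically relevant controls, directly shaping voice quality.

```python
import math

def rosenberg_pulse(t, t0, oq=0.6, rq=0.7):
    """One period of a Rosenberg-style parametric glottal flow pulse.
    t0: fundamental period [s]; oq: open quotient (open phase / period);
    rq: fraction of the open phase spent opening (controls pulse skew).
    Returns normalized flow in [0, 1]."""
    te = oq * t0          # end of the open phase
    tp = rq * te          # instant of maximum flow
    t = t % t0
    if t < tp:                                        # opening phase
        return 0.5 * (1.0 - math.cos(math.pi * t / tp))
    if t < te:                                        # closing phase
        return math.cos(0.5 * math.pi * (t - tp) / (te - tp))
    return 0.0                                        # closed phase

# 100 Hz voice (t0 = 10 ms): flow peaks at t = rq*oq*t0 = 4.2 ms.
peak = rosenberg_pulse(0.0042, 0.01)
```

Lowering oq and shortening the closing phase roughly mimics pressed voice; raising oq mimics breathier phonation.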

Different phonation types using a self-oscillating model (Kröger 1997)

Using a self-oscillating model with a chink

able to produce:

normal phonation

loud phonation

breathy phonation

creaky phonation

able to produce:

F0 contours

voiced-voiceless contrast

audio examples: normal, loud, breathy, creaky

the model: extended by a chink (leak)

simply two control parameters:

- vocal fold tension

- glottal aperture

Mechanisms for the generation of noise

Noise is produced at narrow passages within the VT.

Separate:

volume velocity sources (no-obstacle case)

pressure sources (obstacle case, Stevens 1998)

Occurrence of noise sources in the VT: Birkholz (2007)

They occur simultaneously at different places within the VT:

pressure sources: lung section (no noise), epiglottis, at obstacles (e.g., teeth)

volume velocity sources: at the exit of each VT constriction

Controlled by the degree of VT constriction and the amplitude of air flow
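A common heuristic for driving such sources, sketched below, is the classic Reynolds-number rule (a generic textbook device, not necessarily the one used in the cited models): the source is silent until the Reynolds number in the constriction exceeds a critical value, then its power grows with the excess.

```python
import math

def noise_source_gain(flow, area, re_crit=1800.0, nu=1.5e-5):
    """Gain of a noise source at a constriction using the Reynolds-number
    heuristic: silent below a critical Reynolds number, then growing as
    Re^2 - Re_crit^2. flow [m^3/s], area [m^2]; nu is the kinematic
    viscosity of air [m^2/s]. All threshold values are illustrative."""
    diameter = 2.0 * math.sqrt(area / math.pi)   # equivalent circular diameter
    velocity = flow / area                       # mean particle velocity
    reynolds = velocity * diameter / nu
    return max(0.0, reynolds ** 2 - re_crit ** 2)

# A tight 0.1 cm^2 constriction: strong flow produces noise, weak flow none.
loud = noise_source_gain(5e-4, 1e-5)
quiet = noise_source_gain(1e-5, 1e-5)
```

This captures the slide's point that the sources are controlled by the degree of constriction and the amplitude of air flow: the same flow through a wider passage stays below threshold.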

Voiceless excitation of the vocal tract

Noise is produced at narrow passages within the VT.

The mechanisms of noise generation are not completely understood (no satisfying 3D FE models solving the Navier-Stokes equations).

Current solution: parameter optimization

The art of constructing a good noise source model is to

find the right places for the insertion of noise sources

optimize the parameters (spectral shape, strength, …) of the source noise

Noise source parameter optimization: examples

Synthesis examples (real vs. synthetic /aCa/), Birkholz (2005):

/f/ /s/ /sh/ /ch/ /x/

But compare with Mawass, Badin & Bailly (2000):

/f/ /s/ /sh/


Summarizing: glottis models and noise source models

Take self-oscillating vocal fold models; they can be used for high-quality articulatory speech synthesis:

vocal fold tension → mainly determines F0

glottal aperture → voice qualities: pressed / normal / breathy

glottal aperture → segmental changes: glottal stop / voiced / voiceless

Take simple noise models (pressure and velocity sources):

3D acoustic noise source models (solving the Navier-Stokes equations) are currently not satisfying.

Outline


Introduction


Vocal tract models


Aero-acoustic simulation of speech sounds


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

Generation of speech movements

Starting with segmental input:

text → sound chain, phoneme chain (text-to-phoneme conversion)

But: how to convert a chain of segments (phones) into articulatory movements?

Theoretically and practically elegant solution: the concept of articulatory gestures, which have a bivalent character:

discrete phonological units

quantitative units for controlling articulatory movements

phonological plan → motor plan (discrete)

Example: „Kompass“

segments: k O m p a s

discrete gestures, arranged in rows (gesture type + vocal organ, target):

V-row (vocalic gestures): O, a

C-row 1 (consonantal gestures): fc-do, fc-la, fc-la, nc-ap (full/near closures: dorsal, labial, labial, apical)

C-row 2 (glottis, velopharyngeal port): og, ov, og, og

→ motor plan

From discrete to quantitative realisation of a gesture

dorsal full closing gesture: {fcdo}

quantitative gestural parameters: T_on, T_off, T_targ, voc_org, loc_targ

activation interval → time function for articulator movement
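How an activation interval becomes a movement can be sketched with a first-order target approximation; the dynamics and all parameter values here are illustrative assumptions, not the model's actual equations.

```python
def gesture_trajectory(t_on, t_off, target, x0=0.0, tau=0.015,
                       dt=0.001, t_end=0.3):
    """Turn a discrete gesture (activation interval [t_on, t_off),
    target position) into an articulator time function: while the
    gesture is active, the articulator relaxes toward its target with
    time constant tau; outside the interval it holds its position."""
    x = x0
    track = []
    n = int(round(t_end / dt))
    for i in range(n):
        t = i * dt
        if t_on <= t < t_off:
            x += (target - x) * dt / tau   # first-order target approximation
        track.append(x)
    return track

# Closing gesture active from 50 ms to 250 ms, closure target 1.0:
track = gesture_trajectory(t_on=0.05, t_off=0.25, target=1.0)
```

Shifting and overlapping such activation intervals is all that is needed for the reduction modeling shown on the next slide.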

Modeling reduction is easily possible:

Example: “mit dem Boot” (Kröger 1993)

9 steps from not reduced to fully reduced

all gestures still exist in the motor plan!

increase in speech rate → increase in gestural overlap: quantitative

→ segmental changes: qualitative

Connected speech using gestural control:
Examples (1)

„Guten Tag...“

„Der Zug...“

Connected speech using gestural control:
Examples (2)

„Nächster Halt...“

„Nächster Halt...“

Summarizing: control concepts

The gesture-based control concept can be used:

It links phonemes to articulation via the bivalent character of gestures:

discrete phonological units

quantitative units for motor control (activation interval, targets, transition velocities, …)

Gestures quantitatively comprise

the description of the target-directed movement

the definition of the target itself (not incompatible with target concepts)

Gestures model the segmental changes (assimilations, elisions) occurring in reduction by an increase in the temporal overlap of gestures.

Open question: how to deduce rules for the coordination of speech gestures for syllables, words, and complete utterances?

Outline


Introduction


Vocal tract models


Aero-acoustic simulation of speech sounds


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

Note

We have a lot of knowledge concerning the plant:

articulatory geometries

speech acoustics

We have much less knowledge concerning the neural control of speech articulation.

Birkholz et al. (2007)

Note

We have a lot of knowledge concerning the plant (no problem):

articulatory geometries

speech acoustics

We have much less knowledge concerning the neural control of speech articulation (problem).

(image: Homer Simpson)

Idea

Copy or mimic speech acquisition:

Start like a toddler with babbling: i.e., explore your vocal apparatus and combine motor states with the resulting sensory states (auditory, somatosensory)

Imitation: copying the mother's (caretaker's) speech signals is now possible, since auditory-to-motor relations are already trained

Idea: build up a corpus of trained speech items (known as the mental syllabary, postulated by Levelt and Wheeldon 1994)

→ idea: corpus-based neuro-articulatory speech synthesis

This is based purely on acoustic data; articulatory data (EMA, …) are not needed:

Toddlers are able to learn to speak from acoustic stimulation.

Neurophonetic model of speech production (DFG grant KR 1439/13-1, 2007-2010)

Diagram labels, grouped:

frontal lobe (cortical: premotor, primary motor): phonological plan; phonemic map; phonetic map; motor plan; motor planning (from the mental lexicon and syllabification: frequent syllables, infrequent syllables, prosody); primary motor map; motor state; motor execution (control and corrections)

subcortical and peripheral: neuro-muscular processing; cerebellum, basal ganglia, thalamus; muscles and articulators (tongue, lips, jaw, velum, …) → articulatory state → articulatory and acoustic signal; skin, ears, and sensory pathways; auditory and somatosensory receptors and preprocessing

parietal lobe (cortical: primary, high-order): somatosensory map; somatosensory-phonetic processing

temporal lobe (cortical: primary, high-order): auditory map → auditory state; auditory-phonetic processing; to comprehension

external speaker (mother): provides the acoustic signal

Outline


Introduction


Vocal tract models


Aero-acoustic simulation of speech sounds


Glottis models and noise source models


Control models: Generation of speech movements


Towards neural control concepts


Conclusions

What are the perspectives for articulatory speech synthesis?

Practically: ASS could reach high-quality standards over the next decades.

My recommendation: Use

3D geometrical (or statistical) articulatory vocal tract models

simple self-oscillating glottis models (2 masses and a chink)

transmission-line-analog time-domain acoustic models (1D) with optimized simulation of losses

an optimized simple noise source model

a gestural control concept

an acoustic database for generating gestural coordination and prosody (cp. Birkholz et al. 2007, this meeting)

Example for singing using ASS: Dona nobis pacem (Birkholz 2007)

Clinical application of a 2D VT model

A 2D articulatory model, synchronized with natural speech, used in speech therapy (Kröger 2005): visual stimulation technique

Thank you !!

(speech bubbles:) “What do you think about these ideas?” “I like this stuff. It is good for our future!”

References

Badin P, Bailly G, Revéret L, Baciu M, Segebarth C, Savariaux C (2002) Three-dimensional articulatory modeling of tongue, lips and face, based on MRI and video images. Journal of Phonetics 30: 533-553

Birkholz P, Jackèl D, Kröger BJ (2007) Simulation of losses due to turbulence in the time-varying vocal system. IEEE Transactions on Audio, Speech, and Language Processing 15: 1218-1225

Birkholz P (2007) Control of an articulatory speech synthesizer based on dynamic approximation of spatial articulatory targets. Proceedings of Interspeech 2007 - Eurospeech. Antwerp, Belgium

Birkholz P (2005) 3D-Artikulatorische Sprachsynthese [3D articulatory speech synthesis]. Unpublished PhD thesis. University of Rostock

Birkholz P, Jackèl D (2004) Influence of temporal discretization schemes on formant frequencies and bandwidths in time domain simulations of the vocal tract system. Proceedings of Interspeech 2004 - ICSLP. Jeju, Korea, pp. 1125-1128

Birkholz P, Kröger BJ (2006) Vocal tract model adaptation using magnetic resonance imaging. Proceedings of the 7th International Seminar on Speech Production. Belo Horizonte, Brazil, pp. 493-500

Birkholz P, Jackèl D, Kröger BJ (2006) Construction and control of a three-dimensional vocal tract model. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006). Toulouse, France, pp. 873-876

Birkholz P, Steiner I, Breuer S (2007) Control concepts for articulatory speech synthesis. Proceedings of the 6th ISCA Speech Synthesis Research Workshop. Universität Bonn

Browman CP, Goldstein L (1990) Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics 18: 299-320

Browman CP, Goldstein L (1992) Articulatory phonology: An overview. Phonetica 49: 155-180

References

Cranen B, Schroeter J (1995) Modeling a leaky glottis. Journal of Phonetics 23: 165-177

Dang J, Honda K (1994) Morphological and acoustical analysis of the nasal and the paranasal cavities. Journal of the Acoustical Society of America 96: 2088-2100

Engwall O (1999) Modeling of the vocal tract in three dimensions. EUROSPEECH'99: 113-116

Flanagan JL (1965) Speech Analysis, Synthesis and Perception. Springer-Verlag, Berlin

Guenther FH (2006) Cortical interactions underlying the production of speech sounds. Journal of Communication Disorders 39: 350-365

Guenther FH, Ghosh SS, Tourville JA (2006) Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language 96: 280-301

Kohonen T (2001) Self-organizing maps. Springer, Berlin, 3rd edition

Kröger BJ (1998) Ein phonetisches Modell der Sprachproduktion [A phonetic model of speech production]. Niemeyer, Tübingen

Kröger BJ (1993) A gestural production model and its application to reduction in German. Phonetica 50: 213-233

Kröger BJ (2003) Ein visuelles Modell der Artikulation [A visual model of articulation]. Laryngo-Rhino-Otologie 82: 402-407

Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2007) Multidirectional mappings and the concept of a mental syllabary in a neural model of speech production. In: Fortschritte der Akustik: 33. Deutsche Jahrestagung für Akustik, DAGA '07. Stuttgart

Kröger BJ, Birkholz P, Kannampuzha J, Neuschaefer-Rube C (2006) Learning to associate speech-like sensory and motor states during babbling. Proceedings of the 7th International Seminar on Speech Production. Belo Horizonte, Brazil, pp. 67-74

Kröger BJ, Gotto J, Albert S, Neuschaefer-Rube C (2005) A visual articulatory model and its application to therapy of speech disorders: a pilot study. In: Fuchs S, Perrier P, Pompino-Marschall B (eds.) Speech production and perception: Experimental analyses and models. ZASPiL 40: 79-94

References

Mermelstein P (1973) Articulatory model for the study of speech production. Journal of the Acoustical Society of America 53: 1070-1082

Saltzman EL, Munhall KG (1989) A dynamic approach to gestural patterning in speech production. Ecological Psychology 1: 333-382

Titze IR (1984) Parameterization of the glottal area, glottal flow, and vocal fold contact area. Journal of the Acoustical Society of America 75: 570-580

Training set: “silent mouthing”

subset for lips; subset for tongue

Observation: Hannah (0-2): each morning during wake-up

combination of min, (mid,) and max values {0, (0.5,) 1} of all 10 joint parameters (Kröger et al. 2006, DAGA Braunschweig)

double closures and non-physiological articulations are avoided

→ 4608 patterns of training data

Training

Design of the net: one-layer feed-forward

25+18 input neurons (somatosensory), 40 output neurons (motor)

ca. 2000 links

Set of 4608 patterns of training data

→ min-max combination training set; “silent mouthing”

5000 cycles of batch training

→ mean error ca. 10% for the prediction of an articulatory state (Kröger et al. 2006b, ISSP, Ubatuba, Brazil)

Software: Java version of SNNS (Stuttgart Neural Network Simulator), http://www-ra.informatik.uni-tuebingen.de/SNNS/
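A net of this size is small enough to sketch. A toy stand-in in plain Python: delta-rule batch training of a one-layer (linear) feed-forward net on a made-up 3-input / 2-output mapping, since the actual 43-input / 40-output somatosensory-to-motor data set is not available here; SNNS additionally offers non-linear activations, momentum, etc.

```python
def train(patterns, n_in, n_out, epochs=2000, lr=0.1):
    """Batch-train a one-layer feed-forward (linear) net with the delta
    rule: accumulate the error gradient over all patterns, then update."""
    w = [[0.0] * (n_in + 1) for _ in range(n_out)]   # +1 for a bias input
    for _ in range(epochs):
        grad = [[0.0] * (n_in + 1) for _ in range(n_out)]
        for x, y in patterns:
            xb = x + [1.0]                           # append bias input
            for o in range(n_out):
                err = y[o] - sum(wi * xi for wi, xi in zip(w[o], xb))
                for i in range(n_in + 1):
                    grad[o][i] += err * xb[i]
        for o in range(n_out):                       # one batch update per epoch
            for i in range(n_in + 1):
                w[o][i] += lr * grad[o][i] / len(patterns)
    return w

def predict(w, x):
    xb = x + [1.0]
    return [sum(wi * xi for wi, xi in zip(row, xb)) for row in w]

# Toy target mapping: y = (x0 + x1, x0 - x2).
patterns = [([0.0, 0.0, 0.0], [0.0, 0.0]),
            ([1.0, 0.0, 0.0], [1.0, 1.0]),
            ([0.0, 1.0, 0.0], [1.0, 0.0]),
            ([0.0, 0.0, 1.0], [0.0, -1.0]),
            ([1.0, 1.0, 1.0], [2.0, 0.0])]
w = train(patterns, n_in=3, n_out=2)
```

The same loop structure applies to the real 4608-pattern, 43-to-40 set; only the dimensions and the number of cycles change.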

Training results: “motor equivalence”

labial closure, apical closure, dorsal closure

in each column the somatosensory values are the same (except for the jaw parameter)

→ acoustically relevant closures are kept despite strong jaw perturbation (position of the lower jaw: low vs. high)

… despite a prediction error of 10%