F0 contour generation in TTS system for Russian language


A.V.Babkin, O.F.Krivnova


Russia, 119899 Moscow, Vorobyovi Gori, 1-st Building of the Humanities
MSU, Phone: (095) 939-26-01; Fax: (095) 939-55-96

avb@science.park.ru, okri@philol.msu.ru


ABSTRACT

This paper describes the strategy and methods of F0 contour generation in
the TTS system for Russian developed at Lomonosov Moscow State
University. The system is based on two methods: concatenation of
allophone waveforms and prosodic rules that control fundamental
frequency, duration and intensity. The prosodic rules are part of the
speech control module, which serves as an interface bridging the gap
between the output of the linguistic text processing block and the input
of the speech signal generation module. As a result, each segment
(allophone) in a phrase being synthesized is assigned at least two F0
values, at its starting and ending points. Three or more F0 values can be
assigned to a phone if necessary. Signal generation follows the phrase
control file, which describes the phrase as a sequence of allophone code
names with assigned duration, energy and fundamental frequency values.
To transform the base allophones to the required prosodic values we use
procedures close to TD-PSOLA technology. The paper describes all steps in
the development of a prosody modification algorithm based on TD-PSOLA
for a concatenative TTS system and pays additional attention to ways of
increasing the naturalness of synthesized speech.

1. OVERALL ARCHITECTURE OF THE SYSTEM

The overall structure of our system is in line with the functional
organization of a general TTS synthesizer. It consists of several blocks
or modules, each of which has its own tasks and functions. The structure
of the system is shown in Fig. 1.

2. GENERATION OF PITCH CONTOUR

The basic unit for which the pitch contour is generated is an
intonational phrase (IP): a coherent, grammatically organized fragment of
a text to which one intonational model (abstract tune) is assigned. The
type of intonational model for an IP is produced by the accent-intonation
transcriptor and is recorded as an abstract prosodic marker.

Figure 1. Overall structure of TTS system for Russian.
[Block diagram. Block labels: Text Preprocessing; Text Normalization;
Linguistic Analysis (syntactical, morphological parsing etc.); Lexicon;
Automatic Accent-Intonation Transcription; Automatic Phonemic
Transcription; Speech Control Generation; Prosodic Parametrization;
Allophonic Coding; Control File Generation; Digital Signal Processing;
Signal Generation; Allophonic Databases; Speech Synthesis.]

The accent-intonation transcriptor also determines the prominence levels
of words, which is important for generating natural-sounding pitch
contours. We assume that rhythm and accentuation are governed by two
functionally different mechanisms: focus accentuation and rhythmization.
Focus accents (which contrast or emphasize some words) are largely
determined by the speaker's intention or by the overall information
structure of a text. Frequently this structure offers no evident cues
for determining the place and type of an accent. Formalizing focus
accentuation is therefore the most difficult problem for TTS systems.
Our synthesizer can synthesize phrases with different focus accents, but
we have no rules for locating them automatically: this has to be done
manually. If a phrase has words with accent markers, the last of them is
considered the intonational center (nucleus) of the phrase. Otherwise
the last content word of the phrase is taken as its intonational nucleus
by default. This is the most typical situation for Russian narrative
texts, whose construction is based on neutral linear-accent structures
with the intonational center in final position.
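The nucleus rule just described can be sketched in a few lines. This is a minimal illustration, not the system's code: the `Word` record, its field names and the `find_nucleus` function are hypothetical, and the example phrase is invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    is_content: bool                 # content word vs. function word or clitic
    has_accent_marker: bool = False  # manually placed focus-accent marker


def find_nucleus(words: List[Word]) -> Optional[int]:
    """Return the index of the intonational nucleus of one IP, or None."""
    # Rule 1: the last word carrying an accent marker is the nucleus.
    marked = [i for i, w in enumerate(words) if w.has_accent_marker]
    if marked:
        return marked[-1]
    # Rule 2 (default): the last content word of the phrase.
    content = [i for i, w in enumerate(words) if w.is_content]
    return content[-1] if content else None


phrase = [Word("она", False), Word("читала", True), Word("книгу", True)]
default_nucleus = find_nucleus(phrase)   # no markers: last content word
phrase[1].has_accent_marker = True
focused_nucleus = find_nucleus(phrase)   # marker present: marked word wins
```

With no markers the final content word is chosen; once a marker is placed by hand, the marked word takes over.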

As far as rhythmization is concerned, we distinguish three degrees of
vowel prominence within a word (stressed, strong unstressed, weak
unstressed) and four degrees for lexically stressed vowels (1 for full
clitics, 2 for function words, 3 for non-nuclear content words, 4 for the
nuclear content word). It should be noted that in Russian the prominence
markers are very important not only for adequate pitch generation but
also for correctly determining the duration of sounds.

In our system we use 7 abstract intonational models: 1 model of
finality; 1 of non-finality; 3 interrogative models (general, special
and comparative questions); 1 of exclamation (or command). For all
models the possibility of different positions of the intonational center
is taken into account. The F0 contours for concrete phrases within the
same intonational model are formed in separate submodules.

The strategy of pitch generation in each intonational submodule is as
follows. The contour of the synthesized IP is formed by concatenating
two types of tonal objects: tonal accents, the main one of which is the
nuclear accent, and tonal plateaus. Each intonational model is
considered as a cluster of these tonal events with the possibility of
various phonetic realizations determined by the rhythmical and sound
structure of the IP.

Tonal accents are aligned with lexically stressed syllables if their
prominence level is not less than 3 and if they are not considered
atonic in the chosen intonational model. The main control parameters for
pitch accents are the type of pitch movement (tonal figure), the
realization time domain (the part of a phrase to which the accent is
phonetically anchored, including the stressed syllable), and the
location of the accent's pitch target points within the speaker's pitch
range and within the realization time domain. We recognize that in
Russian the pitch movements forming the accent (and their targets) are
very closely correlated with the boundaries of sound segments.


The tonal plateaus are aligned with unstressed and atonic stressed
syllables at the beginning and end of the IP and also in the intervals
between pitch accent realization domains. The controllable parameters in
this case are the pitch values at the margins of intonational phrases
and the interval of pitch change.

The temporal alignment and amplitude of tonal events are controlled by
rules that take into account the intonation model itself, the rhythmical
pattern of the IP and its segmental make-up. To make this possible,
syllables are coded in advance; the coding records such features as the
accent status of a syllable, its prominence level (according to the IP
rhythmical structure), its position in the IP and its sound make-up. All
pitch rules are hand-written and based on phonetic and acoustic analysis
of read-aloud texts.

The calculation of F0 curves is implemented in two steps: first in a
semitone scale relative to the average pitch (reference line) of a
speaker; then these values are transformed into Hz. The calculated curve
lies within a working area of the speaker's voice range, the boundaries
of which are typical for realizations of the chosen intonational model.
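The second step, converting semitone offsets to Hz, uses the standard relation f = f_ref * 2^(st/12). The sketch below assumes the contour is stored as a list of semitone offsets from the reference line; the function name and the 120 Hz reference are illustrative, not values taken from the system.

```python
from typing import List


def semitones_to_hz(offsets_st: List[float], reference_hz: float) -> List[float]:
    """Convert semitone offsets from the speaker's reference line into Hz."""
    # One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    return [reference_hz * 2.0 ** (st / 12.0) for st in offsets_st]


# Hypothetical contour: reference line, one octave up, one octave down,
# and three semitones above the reference.
contour_hz = semitones_to_hz([0.0, 12.0, -12.0, 3.0], reference_hz=120.0)
```

Working in semitones first keeps the rules independent of the speaker: the same offsets can be reused with a different reference pitch.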

3. PROSODY MODIFICATION ALGORITHM FOR RUSSIAN TTS

One approach to building a high-quality TTS system is the concatenative
approach. In this case the synthesized speech signal is formed by
joining acoustic waveform samples, which are called elements of
concatenation. The elements of concatenation are formed from the initial
samples of the speech signal stored in the database by modifying their
prosodic characteristics (such as duration, fundamental frequency and
energy) in accordance with the requirements of the natural language
processing module.

The theoretical foundation of our methods for forming the prosodic
characteristics of the speech signal is the TD-PSOLA approach. The main
idea of TD-PSOLA methods is as follows: the initial allophone is
multiplied by a sequence of time windows synchronized with the
fundamental frequency. The resulting acoustic segments, shifted relative
to each other beforehand, are summed to produce the required modified
allophone. To change the duration of the allophone, some acoustic
segments are repeated or eliminated. In the traditional implementation
of this algorithm, a noticeable increase in the duration of the speech
signal causes some identical segments to be repeated many times, and the
resulting speech is perceived as unnatural. To make the phonation more
natural we have built special algorithms based on random repetition and
on making some changes in the sequence of identical acoustic segments.
These algorithms are implemented in module P2 (Fig. 2).
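To illustrate the difference between plain and randomized repetition, the sketch below maps output slots to source segments in two ways. The jitter strategy (swapping a repeated reference for a random neighbouring segment) is a hypothetical stand-in chosen for illustration; the text does not specify the rules used in module P2.

```python
import random
from typing import List


def plain_mapping(n_src: int, n_out: int) -> List[int]:
    """Classic time scaling: output slot j reuses source segment j*n_src//n_out."""
    return [j * n_src // n_out for j in range(n_out)]


def jittered_mapping(n_src: int, n_out: int, rng: random.Random) -> List[int]:
    """Same mapping, but with runs of identical references broken up.

    When a slot would repeat the previous source segment, a random
    neighbour of that segment is used instead (illustrative assumption).
    """
    result: List[int] = []
    for idx in plain_mapping(n_src, n_out):
        if result and idx == result[-1]:
            candidates = [k for k in (idx - 1, idx + 1) if 0 <= k < n_src]
            if candidates:
                idx = rng.choice(candidates)
        result.append(idx)
    return result


rng = random.Random(0)                 # fixed seed for a repeatable example
stretched = plain_mapping(4, 12)       # each source segment repeated 3 times
varied = jittered_mapping(4, 12, rng)  # same length, no immediate repeats
```

Both mappings stretch four segments to twelve; only the second avoids playing the same segment twice in a row, which is the kind of many-fold repetition described above as sounding unnatural.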

In our Russian speech synthesis system the base elements of
concatenation in the majority of cases are of phonemic size and are thus
allophonic realizations of the traditional phonemes. The structure of
the module that modifies the prosodic characteristics of vocal
allophones is given in Fig. 2. (In this report we do not discuss the
prosody modification algorithm for non-vocal allophones, where only
duration and energy need to be changed, because it is less complicated
than the methods for vocal allophones.)

One of the main requirements that substantially increases the quality of
the synthesized speech is minimizing distortions in the acoustic
characteristics of the transitional parts of the allophone. To meet this
requirement, the fundamental frequency is modified along the whole
length of the initial allophone, whereas the duration of the allophone
is altered only in a specially calculated part called the stationary
section. The stationary sections could be calculated at the stage of
speech database creation, which would increase the speed of the whole
system. In our system, however, the calculation is performed in the
digital signal processing module, because only at this stage of speech
synthesis is it known to what degree the initial allophone has to be
changed, which makes it possible to estimate the length of the
stationary part.
















Figure 2. The structure of prosody modification module.

Let us now discuss all the steps in generating the modified allophone.
The prosody modification module receives the initial allophone with
pitch marks from the database and creates the initial sequence of
acoustic segments (step P1). Each segment has its own number and a
duration, which is defined in the speech database and was calculated
when the database was created. The next step (P2) analyzes the
requirements specified in the control information file and generates the
result sequence of segments. Each element in this sequence has a
reference to an initial element, and a new segment duration is
calculated for it. To avoid unnaturalness, the algorithm implemented in
this step makes some changes whenever a continuous run of elements
refers to the same initial segment.
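The two sequences handled in steps P1 and P2 can be pictured as simple tuples. The sketch assumes that pitch marks are sample indices and that a segment's duration is the distance between neighbouring marks; the function names and numbers are illustrative only.

```python
from typing import List, Tuple


def initial_segments(pitch_marks: List[int]) -> List[Tuple[int, int]]:
    """P1: build (Ni, T0i) pairs, i.e. segment number and period in samples."""
    return [
        (i, pitch_marks[i + 1] - pitch_marks[i])
        for i in range(len(pitch_marks) - 1)
    ]


def result_sequence(source_refs: List[int],
                    new_periods: List[int]) -> List[Tuple[int, int, int]]:
    """P2: build (Nj, T0j, Ni) triples for the output elements.

    Each output element j keeps its new period T0j and a reference Ni
    to the initial segment it is copied from.
    """
    return [(j, t0, ni)
            for j, (ni, t0) in enumerate(zip(source_refs, new_periods))]


marks = [0, 100, 205, 300]                 # hypothetical pitch marks
initial = initial_segments(marks)          # [(0, 100), (1, 105), (2, 95)]
result = result_sequence([0, 1, 1, 2], [90, 90, 90, 90])
```

Here segment 1 is used twice, so the output has four elements built from three initial segments, each with a shortened period of 90 samples.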

[Blocks of Fig. 2: prosody modification module of vocal allophones]

Inputs: initial allophone (pitch marks and stationary section); control
information (prosody parameters and algorithm types).

P1. Generation of the initial sequence of acoustic segments (Ni, T0i)
P2. Generation of the result sequence (Nj, T0j, Ni)
P3. Correction module: modification of the result sequence (improving
quality)
P4. Acoustic synthesis: generation of the final modified allophone
P5. Energy modification of the final allophone

Output: modified allophone.

In forming the melodic contour, each element of the result sequence is
given a duration calculated by linear interpolation between the values
at the ‘start’ and ‘end’ points of the tonal movement. This introduces a
shade of unnaturalness, perceived by the listener as a ‘computer voice’,
because it does not reflect the natural fluctuation of fundamental
frequency. It can be observed when the duration of the allophone is
increased substantially, as for example in the synthesis of a ‘singing
voice’, in which the fundamental frequency stays fixed at a particular
value. In a real speech signal the fundamental frequency varies randomly
within certain limits around the given value.
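The linear assignment of durations can be sketched as follows. The example assumes that the interpolated quantity is the pitch period in samples and that the sampling rate is 16 kHz; both are assumptions made for illustration, as is the function name.

```python
from typing import List


def interpolate_periods(f0_start_hz: float, f0_end_hz: float,
                        n_elements: int, sample_rate: int = 16000) -> List[float]:
    """Pitch period (in samples) of each element, linear from start to end."""
    t_start = sample_rate / f0_start_hz
    t_end = sample_rate / f0_end_hz
    if n_elements == 1:
        return [t_start]
    step = (t_end - t_start) / (n_elements - 1)
    return [t_start + j * step for j in range(n_elements)]


# A rise from 100 Hz to 200 Hz spread over five elements.
periods = interpolate_periods(100.0, 200.0, 5)
```

The resulting periods fall in perfectly equal steps (160, 140, 120, 100, 80 samples), which is exactly the regularity that the text describes as sounding like a 'computer voice'.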

The work of Klatt (Klatt and Klatt 1990) suggests a simple formula that
describes the random fluctuation of fundamental frequency in speech:


(1)

This additional fluctuation of fundamental frequency enhances the
naturalness of the synthesized speech. In our TTS system this formula
was converted to a more complex variant with two parameters:

(2)

where A characterizes the degree of fluctuation of the fundamental
period and takes values between 0 and 100, and K is the degree of
randomness or quasi-periodicity. The fluctuation value (ΔT) is
calculated for each element and added to the pitch period (T) of that
element. This is done in step P3. The transition to this variant of the
formula is motivated first and foremost by the model that we use for
prosody modification. The parameters make it possible to strengthen or
weaken the influence of this formula on the synthesized speech. When A=0
there is no fluctuation. According to our tests, the most ‘natural’
phonation is achieved when:

A = 4, K = 0.00005    (3)

These values are used as the defaults in our system. If parameter A is
increased further, for example to A=40, a “sob” effect is observed,
which can be explained by the strong vibration of the fundamental
frequency.
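As a point of reference, the flutter formula usually quoted from Klatt and Klatt (1990) perturbs F0 by the sum of three slow sinusoids: dF0 = (FL/50) * (F0/100) * [sin(2*pi*12.7*t) + sin(2*pi*7.1*t) + sin(2*pi*4.7*t)] Hz, where FL is a flutter percentage. The sketch below evaluates that form. It is only assumed to correspond to the formula cited from Klatt above, and it does not reproduce the two-parameter variant with A and K, whose exact form is specific to the authors' system. All function names are illustrative.

```python
import math
from typing import List


def klatt_flutter_hz(t: float, f0_hz: float, flutter: float = 25.0) -> float:
    """F0 deviation in Hz at time t (seconds) for the quoted flutter formula."""
    slow_sines = (math.sin(2 * math.pi * 12.7 * t)
                  + math.sin(2 * math.pi * 7.1 * t)
                  + math.sin(2 * math.pi * 4.7 * t))
    return (flutter / 50.0) * (f0_hz / 100.0) * slow_sines


def flutter_periods(f0_hz: float, n_periods: int,
                    flutter: float = 25.0) -> List[float]:
    """Successive pitch periods (seconds) of a flat F0 with flutter added."""
    t = 0.0
    periods: List[float] = []
    for _ in range(n_periods):
        f = f0_hz + klatt_flutter_hz(t, f0_hz, flutter)
        periods.append(1.0 / f)
        t += periods[-1]          # advance to the start of the next period
    return periods


flat = flutter_periods(100.0, 50, flutter=0.0)   # no perturbation
wobbly = flutter_periods(100.0, 50)              # small quasi-random wobble
```

With the flutter setting at zero every period is identical, mirroring the A=0 case described above; with a nonzero setting the periods drift slightly around the nominal value.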

The next step (P4) generates the new modified allophone using the result
information calculated in the previous steps. The modified allophone is
formed from the sequence of result segments using OLA (overlap and add)
technology. In systems based on TD-PSOLA technology, the type and size
of the window function are of special significance: they are chosen to
achieve the closest spectral match between synthesized and real speech.
The position of the window function relative to the period is also of
great significance, which raises the problem of choosing the ‘start
point’ of the period. There are several ways of choosing these
parameters, and since the differences between them are barely noticeable
in perception, we have implemented several of them. They differ in the
window function and in the location of the window within the period. Our
tests showed that it is difficult to pick the best one, so we decided to
keep several of them in the system and let the user switch between them.
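A minimal overlap-and-add sketch is given below. It assumes two-period segments, a Hann window and windows centred on the new pitch marks; these are common textbook choices for TD-PSOLA and only one of the several variants the text says were implemented. Function names are illustrative.

```python
import math
from typing import List


def hann(n: int) -> List[float]:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]


def overlap_add(segments: List[List[float]], new_marks: List[int],
                length: int) -> List[float]:
    """Window each segment, centre it on its new pitch mark and sum."""
    out = [0.0] * length
    for seg, centre in zip(segments, new_marks):
        window = hann(len(seg))
        start = centre - len(seg) // 2
        for k, (sample, weight) in enumerate(zip(seg, window)):
            pos = start + k
            if 0 <= pos < length:      # ignore samples outside the buffer
                out[pos] += sample * weight
    return out


period = 4
segments = [[1.0] * (2 * period) for _ in range(4)]  # constant test signal
marks = [4, 8, 12, 16]                               # one mark per period
signal = overlap_add(segments, marks, 20)
```

With marks one period apart, neighbouring Hann windows overlap by half and sum to one, so the constant input is reproduced wherever two windows overlap. Moving the marks closer together or further apart is what raises or lowers the pitch.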

The last step (P5) is energy modification of the result allophone. After
any PSOLA algorithm is applied, the energy of the resulting acoustic
signal changes and needs to be normalized to some value. The
normalization algorithm is run in this step. In our system the way of
normalization can be chosen: the result allophone can be normalized to
the average energy, or its energy can be increased or reduced to some
value. In a real speech signal the average energy of each period does
not simply follow the given energy contour but varies randomly around
the local average energy value. We may assume that improving the quality
of synthesized speech requires taking this law into consideration and
finding a mathematical formulation of it. We have not yet investigated
this area, but it is known that such additional modification has a
tangible effect on the synthesized speech. For example, applying some
kind of periodic sinusoidal formula with a suitable period produces the
acoustic effect called ‘amplitude vibrato’. The current version of the
synthesizer already reserves a place for this investigation.
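Normalizing to a given energy value can be sketched as a single gain applied to the whole allophone. The example assumes that energy is measured as the root-mean-square (RMS) amplitude; the text does not say which measure the system uses, and the function names are illustrative.

```python
import math
from typing import List


def rms(signal: List[float]) -> float:
    """Root-mean-square amplitude of a signal (0.0 for an empty signal)."""
    if not signal:
        return 0.0
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def normalize_energy(signal: List[float], target_rms: float) -> List[float]:
    """Scale the signal so that its RMS amplitude equals target_rms."""
    current = rms(signal)
    if current == 0.0:
        return list(signal)        # silence cannot be rescaled
    gain = target_rms / current
    return [x * gain for x in signal]


quiet = [0.1, -0.1, 0.1, -0.1]
scaled = normalize_energy(quiet, target_rms=0.5)
```

The same gain could be made to vary slowly from period to period to imitate the random energy variation, or the 'amplitude vibrato', mentioned above; that extension is not shown here.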

All the algorithms and methods mentioned in this report have passed a
special testing program and are implemented as a computer program, which
forms part of the Russian text-to-speech system being developed at MSU.
