Building an Irish Gaelic Synthesizer using the Festival Speech Synthesis System /
Tógáil Sintéiseora Gaeilge Trí Úsáid a bhaint as Córas Sintéise Cainte Festival


R. Hogan, I. Robertson

Department of Computer Science / Roinn na Ríomheolaíochta
Roinn na Nua-Ghaeilge

1. Introduction

Text-To-Speech (TTS) synthesis is a means of automatically pronouncing a text-based input stream by
computer. It is becoming increasingly popular for both commercial and private uses, for example in
automated telephone-answering systems and in readers for the visually impaired. A number of
approaches exist for synthesizer design, but the most popular method currently uses the concatenation of
pre-recorded speech units. This has been favoured because of the greater degree of naturalness it adds to
the synthetic waveform. To date there are a number of commercial products available for a wide range of
languages that operate using concatenative speech synthesis [1]. Furthermore, the research community is
making greater efforts to build speech synthesizers for previously excluded languages. In particular, the
users of the open source Festival speech synthesis system have been to the forefront [2, 3, 4], and the
Festival system now appears to be almost a standard application through which new languages can be
implemented. However, much work remains to be done, as many of the so-called ‘minority languages’
have not yet been implemented in the Festival system. The Irish language has unfortunately been given
this minority status. There has been some previous work done towards building a synthesizer for Irish
Gaelic but it was reported to be incomplete [5]. Thus, this paper offers a realization of a speech synthesizer
for the Irish Gaelic language that was built using the Festival system, drawing on its advantages and
ability. The following sections outline the structure of the Festival speech synthesis system, some of
the unique difficulties associated with synthesizing Irish Gaelic, and the actual implementation of the
synthesizer, and finally explain the experiments currently being carried out to assess its performance and
determine aspects that require further research.


2. The Festival Speech Synthesis System

Festival [6] is a general framework for building speech synthesis systems. Its modularity allows the
various tasks involved to be addressed individually under the umbrella of the overall system. A
concatenative TTS system such as Festival typically consists of the following core modules:


1. Text processing to convert the input text into a series of phonemic representations that are used as
   input to the synthesis module.

2. A synthesis module to render the phonemic representation into actual sound by extracting the
   relevant phonemes from a database of sounds.

3. Prosodic processing to impress relevant prosodic variations on the synthetic waveform to add
   naturalness to the synthetic speech.
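
The three modules listed above can be pictured as a simple pipeline. The following Python sketch is illustrative only: the function names, the one-word toy lexicon, the per-phone durations and the order in which the stages are chained are all assumptions made for this example, and none of it is Festival's own API.

```python
from dataclasses import dataclass

# Toy stand-ins (hypothetical values): a real system would use a full
# lexicon and a database of recorded audio units.
TOY_LEXICON = {"dhuit": ["g", "w", "i", "t"]}
TOY_UNIT_DB = {"g": 60, "w": 50, "i": 90, "t": 70}  # phone -> made-up duration in ms


@dataclass
class Unit:
    phone: str
    duration_ms: float


def text_processing(text):
    """Module 1: convert input text into a flat list of phones."""
    phones = []
    for word in text.lower().split():
        phones.extend(TOY_LEXICON[word])
    return phones


def synthesis(phones):
    """Module 2: fetch the stored unit for each phone from the database."""
    return [Unit(p, TOY_UNIT_DB[p]) for p in phones]


def prosodic_processing(units, rate=1.0):
    """Module 3: impose prosody; here only a uniform speaking-rate change."""
    return [Unit(u.phone, u.duration_ms / rate) for u in units]


def text_to_speech(text, rate=1.0):
    """Chain the three modules into one call."""
    return prosodic_processing(synthesis(text_processing(text)), rate)


utterance = text_to_speech("dhuit", rate=2.0)
print([(u.phone, u.duration_ms) for u in utterance])
```

Keeping each stage behind its own function mirrors the modularity described above: any one stage can be replaced without touching the others.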


3. Speech Synthesis for Irish Gaelic

The Celtic languages are part of the family of Indo-European languages, and Irish Gaelic belongs to the
Goidelic subgroup of Celtic languages. The development of a text-to-speech system for any new language
presents specific problems, and such problems are more pronounced with a language such as Irish Gaelic,
which has fallen from common use and has not been the subject of such intense linguistic study as
languages that are used more widely. Because of this, the linguistic research that has been done has not
been directed towards producing results that could be directly applied by speech technology scientists in
a synthesis context.

To achieve the aim of an Irish Gaelic speech synthesizer, three essential tasks were recognized.
The first was to identify a source of a complete phone set for Irish. In practice, this meant that a phone
set had to be created from first principles. To simplify the task with respect to the resources at hand, it
was assumed that the Irish language shares many phonemes with English. For example, the phone /a/ as
in cat has the same pronunciation as the phone /a/ in leat. Thus, it was only necessary to define the
phonetic features for phones original to Irish Gaelic, such as the phone /ng/ as in Dún na nGall, or for
phones that may already exist in English but are pronounced differently by Irish Gaelic speakers, such
as /ch:/ as in amárach, not as in chip.
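
The approach described above, reusing phones assumed to be shared with English and defining only the additional ones, might be organized as in the following Python sketch. The feature names and values are illustrative assumptions and not the authors' phone set definition; only the example phones /a/, /ng/ and /ch:/ come from the text.

```python
# Hypothetical, heavily reduced inventories for illustration only.
# Phones assumed to be shared between English and Irish.
SHARED_PHONES = {
    "a": {"type": "vowel"},                       # as in "cat" and "leat"
    "t": {"type": "consonant", "manner": "stop"},
}

# Phones that need their own definition for Irish in this sketch.
IRISH_ONLY_PHONES = {
    "ng": {"type": "consonant", "manner": "nasal"},       # as in "Dún na nGall"
    "ch:": {"type": "consonant", "manner": "fricative"},  # as in "amárach"
}


def build_phone_set(shared, language_specific):
    """Merge a borrowed inventory with language-specific definitions.

    Language-specific entries take precedence if a symbol appears in both.
    """
    phone_set = dict(shared)
    phone_set.update(language_specific)
    return phone_set


IRISH_PHONES = build_phone_set(SHARED_PHONES, IRISH_ONLY_PHONES)

# Only these phones required new feature definitions.
newly_defined = sorted(set(IRISH_PHONES) - set(SHARED_PHONES))
print(newly_defined)
```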

The second task was to convert this phone set into a series of recorded diphones (phone-phone
transitions) that would be stored in the database of the Festival system. The recording procedure used by
the Festival system to create any new voice is to generate a list of nonsense words containing the target
diphones, which are then played to the subject, who repeats them as they are being recorded. The most
effective way to achieve this is to map all Irish phones onto phones in an existing language. This
facilitates the automatic generation of the nonsense words, and the prompt waveforms can also be used as
a blueprint for the automatic labeling of the recorded waveforms. The mapping process also assists the
creation of diphthongs unique to Irish, as two pre-existing phones can be mapped onto one diphthong, such
as /i@/ as in síos and /u@/ as in suas. Once the nonsense words were recorded, the next task was to
correctly label the occurrence of each diphone and mark its boundaries in the audio stream. This procedure
is very laborious, but for languages like English it can be simplified by drawing on speech recognition
technology. This technology can also be used if an English phone set was used to generate the prompts.
Thus, the majority of the labels were generated automatically, and any errors found due to pronunciation
variances in the new language were corrected by hand labeling.
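
The prompt-generation step described above can be sketched as follows. This Python example enumerates the phone-to-phone transitions for a tiny phone set, maps each Irish phone onto phones of an existing language, and embeds each diphone in a nonsense carrier word. The three-phone inventory, the mapping table and the carrier template are all assumptions for illustration; they are not the mapping or the carrier scheme actually used with Festival.

```python
from itertools import product

IRISH_PHONES = ["a", "ch:", "i@"]  # tiny illustrative subset

# Hypothetical mapping of Irish phones onto an existing (prompting) phone set.
# A diphthong may map onto a sequence of two existing phones.
PHONE_MAP = {
    "a": ["a"],
    "ch:": ["k"],        # assumed stand-in, used for prompting only
    "i@": ["i", "@"],
}


def diphone_list(phones):
    """Every ordered phone-phone transition, including a phone with itself."""
    return [f"{left}-{right}" for left, right in product(phones, repeat=2)]


def nonsense_prompt(diphone, carrier=("t", "aa")):
    """Wrap the mapped phones of one diphone in a fixed carrier context."""
    left, right = diphone.split("-")
    return list(carrier) + PHONE_MAP[left] + PHONE_MAP[right] + list(carrier[::-1])


diphones = diphone_list(IRISH_PHONES)
prompts = {d: nonsense_prompt(d) for d in diphones}
print(len(diphones), prompts["a-i@"])
```

Because the number of ordered pairs grows with the square of the phone set size, generating the prompt list automatically rather than by hand matters even for a modest inventory.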

Once the labeling has been completed, the Festival software analyses the waveforms to extract
pitch marks and the mean power for each vowel in each diphone. The waveforms are then coded and stored
in the database.
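
As a rough illustration of the power measurement mentioned above, the following Python sketch computes the mean power of the samples inside one labelled vowel segment. The sample values, the sampling rate and the label times are invented for the example, and this is not Festival's own analysis code.

```python
def mean_power(samples):
    """Mean of the squared sample values in a segment."""
    if not samples:
        raise ValueError("segment is empty")
    return sum(s * s for s in samples) / len(samples)


def segment(samples, sample_rate, start_s, end_s):
    """Cut out the samples between two label times given in seconds."""
    start = round(start_s * sample_rate)
    end = round(end_s * sample_rate)
    return samples[start:end]


# Invented waveform: 10 samples at a 10 Hz "sampling rate" to keep the numbers readable.
waveform = [0.0, 0.0, 1.0, -1.0, 2.0, -2.0, 1.0, -1.0, 0.0, 0.0]

# Hypothetical vowel label spanning 0.2 s to 0.8 s.
vowel = segment(waveform, sample_rate=10, start_s=0.2, end_s=0.8)
print(mean_power(vowel))
```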

The next step is to enable the system to deal with text inputs, whether they are entered directly on
the command line of the user interface or read from an external file. This required the provision of a
lexicon file containing a set of text-to-phoneme conversion rules for the input words. A machine-readable,
Festival-compatible lexicon for the Irish language is currently under development. To facilitate testing of
the system, text-to-phoneme conversions for various words were added to a lexical addenda file, which
allows the system to find the pronunciation before defaulting to letter-to-sound (LTS) rules. Below is an
example of a lexical addenda entry:


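; The part-of-speech field (nil) and the stress mark (1) are assumed here;
; Festival's entry format expects both.
(lex.add.entry
 '("dhuit" nil (((g w i t) 1))))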


Once the words to be synthesized are contained within the lexicon (or addenda) they will be synthesized
correctly when input to the system.
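
The lookup order described in this section (addenda first, then the main lexicon, then letter-to-sound rules as a fallback) can be sketched in Python as follows. The function names and the one-phone-per-letter table are hypothetical and are not real Irish LTS rules; only the "dhuit" pronunciation comes from the addenda example above.

```python
LEXICON = {}                                # main lexicon (left empty in this sketch)
ADDENDA = {"dhuit": ["g", "w", "i", "t"]}   # pronunciation from the addenda example

# Toy letter-to-sound table: one phone per letter (hypothetical).
TOY_LTS = {"s": "s", "u": "u", "a": "a"}


def letter_to_sound(word):
    """Fallback: spell the word out phone by phone, skipping unknown letters."""
    return [TOY_LTS[ch] for ch in word if ch in TOY_LTS]


def pronounce(word):
    """Return (phones, source), trying addenda, then lexicon, then LTS rules."""
    word = word.lower()
    if word in ADDENDA:
        return ADDENDA[word], "addenda"
    if word in LEXICON:
        return LEXICON[word], "lexicon"
    return letter_to_sound(word), "lts"


print(pronounce("dhuit"))
print(pronounce("suas"))
```

Returning the source alongside the phones makes it easy to list which test words fell through to the fallback rules and are therefore candidates for new addenda entries.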


4. Assessment of the Synthesizer

To assess the operation of the synthesizer, listening and comprehension tests have been devised and are
currently being carried out. The first set of tests will determine the perceived quality of the synthesizer
output. Three different pieces of Irish text will be used, taken from the following sources: (1) a newspaper,
(2) a poem and (3) a piece of contemporary Irish fiction. The subjects will be instructed to rate the quality
of the synthesis on a five-point scale. Other categories that will be examined include the perceived level
of coarticulation errors. To examine the intelligibility of the synthesizer, the subjects will be played five
unknown texts taken from a variety of sources and asked to transcribe them. This will give some insight
into the ability of users to comprehend the synthetic output. The results from both these tests will be
available in time for the full paper.
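
One common way to score a transcription test of this kind is the word error rate between the reference text and the listener's transcription. The paper does not state how the transcriptions will be scored, so the following Python sketch is only an assumption about one possible scoring method, and the example sentence is chosen purely for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative reference and a transcription with one substitution and one deletion.
print(word_error_rate("dia dhuit a chara", "dia duit chara"))
```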



5. References


[1] Pogue, D., ‘Hearing Text, Not Tunes, on Your MP3 Player’, The New York Times, 2 May 2002.

[2] Williams, B., ‘Text-to-speech synthesis for Welsh and Welsh English’, Proceedings of the European
Conference on Speech Communication and Technology (EUROSPEECH 95), Madrid, Spain, 18-21
September 1995.

[3] Fitt, S. and Isard, S., ‘Synthesis of regional English using a keyword lexicon’, Proceedings of the
European Conference on Speech Communication and Technology (EUROSPEECH 99), Budapest,
Hungary, 5-9 September 1999.

[4] Wolters, M., ‘A Diphone-Based text-to-speech system for Scottish Gaelic’, Diploma Thesis, Institut für
Kommunikationsforschung und Phonetik / Institut für Informatik III, Universität Bonn, 1997.

[5] O’Neill, G., Charonnat, L. and Mercier, G., ‘An Irish Speech Synthesizer’, Third ESCA/COCOSDA
Workshop on Speech Synthesis, Jenolan Caves, Blue Mountains, Australia, 26-29 November 1998.

[6] Black, A. and Taylor, P., ‘The Festival Speech Synthesis System’, System documentation, Edition 1.4,
for Festival Version 1.4.0 (http://www.cstr.ed.ac.uk/projects/festival/manual/festival_toc.html), 17
June 1999.