A Machine Learning Approach to Grapheme-to-Phoneme Conversion in Text-to-Speech Synthesis for Swedish


Computational Linguistics

January 2000

Department of Linguistics

Stockholm University



















A Machine Learning Approach to Grapheme-to-Phoneme Conversion in Text-to-Speech Synthesis for Swedish

Hanna Lindgren


This study focuses on Grapheme-to-Phoneme Conversion (GPC) in Text-to-Speech systems for Swedish. It is performed at the request of Telia Promotor AB to see whether the Machine Learning algorithms in the Tilburg Memory-Based Learner (TiMBL) may manage GPC better than their rule-based system in Infovox. The results for TiMBL show higher accuracies than Infovox, and although there are some difficulties in comparing the results, the Machine Learning approach seems like a better option.









Supplementary Course in Computational Linguistics
C-Level Essay
Advisor: Harald Berthelsen



Table of contents

1 Introduction
2 Background
2.1 Grapheme-to-Phoneme Conversion in Text-to-Speech synthesisers
2.1.1 The Rule-Based Approach
2.1.2 The Data-Oriented Approach
2.2 Infovox
3 Machine Learning
3.1 The Memory-Based Learning Algorithm
3.1.1 Similarity between patterns
3.2 Feature importance
3.3 The Decision Tree Learning Algorithm
3.4 TiMBL
3.4.1 Data format
3.4.2 A training-testing example
4 Data
5 Method
5.1 The form of the lexicon
5.1.1 The phoneme set
5.1.2 TiMBL format
5.1.3 Context and input information organization
5.2 Training and testing
5.2.1 Weighting
5.2.2 Other settings in TiMBL
5.2.3 Training and testing with TiMBL
5.2.4 Testing Rulsys
5.3 Making comparison possible
6 Results
6.1 Grapheme accuracy
6.2 Word accuracy
7 Evaluation
7.1 Rulsys vs. TiMBL
7.2 TiMBL
7.3 Results from similar algorithms
7.4 Conclusion
7.5 Possible difficulties caused by the lexicon
7.6 Directions for further machine learning research on Grapheme-to-Phoneme Conversion for Swedish
Sammanfattning
Writing Conventions
References
Appendix 1: Phoneme set
Appendix 2: Grapheme-Phoneme Alignments
Appendix 3: Weights for the 10-FC tests for test one and test two



1 Introduction

This study is made for Telia Promotor AB, Stockholm, and the purpose is to investigate whether a Machine Learning approach may be a better-performing alternative to the rule-based Grapheme-to-Phoneme Conversion [1] for Swedish in their Text-to-Speech synthesiser Infovox.

2 Background

A Text-to-Speech (TTS) synthesiser is a computer-based system that is able to read written text aloud. This is what makes TTS synthesisers different from other talking machines (e.g. machines that read out pre-recorded sentences), since they manage automatic production of new sentences.

There are numerous TTS applications, e.g. telecommunication, multimedia, man-machine communication, language education, aid to handicapped persons and linguistic research.

2.1 Grapheme-to-Phoneme Conversion in Text-to-Speech synthesisers

Grapheme-to-phoneme conversion (GPC) is one topic out of many in the area called Natural Language Processing (NLP). Others are e.g. translation, speech recognition and, of course, the whole of text-to-speech synthesis.

There are two main approaches to GPC: rule-based and data-oriented (machine learning) strategies.

2.1.1 The Rule-Based Approach

It is traditionally assumed that linguistic knowledge has to be formalised in order to produce accurate information when implemented in a computer program [Daelemans et al. 1993b]. Therefore, hand-made rules have been the basis of many NLP systems [Daelemans et al. 1999b].

Many of the text-to-speech systems of today use rule-based GPC. Infovox is one of them. Others are e.g. ModelTalker [ModelTalker 1997] for English, and DECtalk [DECtalk 1999, Lemmetty 1999], supporting American English, German and Spanish.

The great disadvantages of rule-based methods are the heavy language dependency and the time-consuming work required to write the rules. Where there is a rule, there are also exceptions. Therefore, rule-based TTS systems need to consist not only of rules, but also of lexicons of exceptions.

Example:

George /je:Orj/ (George), where the rules probably would suggest /je:Orje/.




[1] In this study, grapheme-to-phoneme conversion will refer to both pure and stress-included GPC.


2.1.2 The Data-Oriented Approach

Compared to the rule-based systems, a data-oriented Machine Learning (ML) approach to GPC (and NLP in general) requires less language-specific and linguistic knowledge. It would also save some of the time and effort spent in rule writing.

Some ML TTS synthesisers of today are the Bell Labs TTS [Bell Labs TTS 1997, Lemmetty 1999], available for English, French, Spanish, Italian, German, Russian, Romanian, Chinese, and Japanese, the TreeTalk [TreeTalk 1999] demo for Dutch and English, and SKOPE [SKOPE] for Korean.

2.2 Infovox

Infovox is a TTS synthesiser administrated by Telia Promotor AB. Apart from Swedish, Infovox supports English (British and American), German, Dutch, French, Spanish, Italian, Norwegian, Danish, Finnish and Icelandic.

The text-processing system in Infovox is called Rulsys, and the grapheme-to-phoneme conversion is performed through a traditional rule-based system. There are two main rule systems: one at a grapheme level and one at a phoneme level. An overview of Infovox is shown in figure 2.1.





















Figure 2.1. Infovox text-to-speech synthesis. Modified from [Telia Promotor 1999].
[Block diagram. Grapheme level: spelled text passes through Text Normalisation and GPC, supported by a User Lexicon and a System Lexicon. Phoneme level: Sentence Level Rules and Pronunciation Rules feed the Speech Production Model, which outputs speech.]

3 Machine Learning

Most natural language processing can be seen as a classification problem [Daelemans et al. 1997, Daelemans 1996]. In the TTS field, this means that with an input consisting of written text, the task is to give each letter, letter cluster, word, phrase etc. in the text an acoustic correlate. For grapheme-to-phoneme conversion, each letter in a word must be mapped to (a) corresponding phoneme(s).

A machine learning system is a program that learns to make accurate generalisations from a training sample. It is built of two components: a learning component and a classification component. During learning, instances (patterns) are either explicitly stored in memory (lazy learning), or abstracted or reconstructed (greedy learning). An instance consists of a fixed number of feature-values, and the classification (target category) for that particular combination of features. In the classification phase, new examples are classified by means of the stored information. Since ML systems were originally developed for classification tasks, they have proved very useful for NLP.

3.1 The Memory-Based Learning Algorithm

Memory-Based Learning (MBL) is a lazy learning descendant of the machine learning approach to classification tasks called k-Nearest Neighbour (k-NN), which matches feature-values to target categories through similarity-based reasoning (see 3.1.1). The classification component searches for the patterns in the training set that are most similar to those in the testing set, i.e. the nearest neighbours. A schematic MBL system is shown in figure 3.1.



























Figure 3.1. The general architecture of an MBL system. Modified from [Daelemans et al. 1999b].
[Diagram: examples enter the Memory-Based Learning component (storage and computation of metrics), producing a learning structure; input cases are then classified through Similarity-Based Reasoning, yielding the output.]

The MBL algorithm used in this study is called IB1, where the input information storage is tree-based. Instances are stored as paths from root node to leaf, with feature number one at the top and the last feature at the bottom. The leaf node contains the classification for the sequence of features in question, along with the number of times each target category occurs with this feature-value pattern.

3.1.1 Similarity between patterns

After training, IB1 recompiles new examples into similarity cases. Through matching these cases against those in memory, the new examples can be classified. When an exact match occurs, the similarity computation is skipped, and the classification information from the training example is simply copied to the new example. If there is more than one exact match, the most frequent pattern is chosen. When no exact match is available, the MBL algorithm searches for the k (usually one) most similar examples in memory (the nearest neighbours). The example is then classified like the nearest neighbour pattern.

There are two metrics used for determining the similarity between two patterns in MBL: the Overlap Metric and the Modified Value Difference Metric (MVDM). These similarity metrics calculate the distance between patterns represented by a number of features as the sum of the differences between the features.


In the Overlap Metric (figure 3.2), the distance between two features is zero if they are equal, and one if they are different.

Δ(X,Y) = Σ δ(x_i, y_i), summed over i = 1 … n

δ(x_i, y_i) = 0 if x_i = y_i
δ(x_i, y_i) = 1 if x_i ≠ y_i

Figure 3.2. Δ(X,Y) is the distance between the patterns X and Y, represented by n features. δ(x_i, y_i) is the distance per feature.
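As a sketch, the Overlap Metric distance between two patterns can be computed like this (an illustration of the metric itself, not TiMBL's implementation; the function name is ours):

```python
def overlap_distance(x, y):
    """Overlap Metric: count the positions where two equal-length
    feature patterns disagree (0 per matching feature, 1 per mismatch)."""
    assert len(x) == len(y)
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

# The patterns from section 3.4.2: 'banan' vs 'bil', and 'sil' vs 'bil'.
print(overlap_distance(list("000bana"), list("000bil0")))  # → 3
print(overlap_distance(list("000sil0"), list("000bil0")))  # → 1
```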


With MVDM, the difference between two feature-values is based on the difference between their probability distributions over the target categories. If two feature-values often lead to the same target category, they are similar.

The similarities between the features are supplemented with information about feature weighting (3.2).

In the ML program used in the experiments performed in this essay, it is possible to choose the number of nearest neighbours searched for. The default is one, i.e. one nearest neighbour is chosen at a time. A larger number of nearest neighbours often works especially well with MVDM [Daelemans et al. 1999b]. The target category of the most frequent of the k nearest neighbour patterns determines the classification of the new pattern. When the number of nearest neighbours is set to more than one, the shortcut for exact matches can be switched off for better results; processing is generally faster when it is on.


3.2 Feature importance

When ML algorithms were first developed, they treated all features in a pattern as equally important. However, some of the features in a training set may be more important than others. To determine the importance of a feature, there are weighting systems. The weighting systems used in this study are Information Gain (Info Gain, IG) and Gain Ratio (GR).

Information Gain estimates how much information each feature contributes to our knowledge of the correct class label. The difference between knowing and not knowing the value of the feature in question is calculated. In other words, if a test (where all features are regarded as equally important) without knowledge of a specific feature gives results very similar to a test where the feature is known, that particular feature is given a low weighting, i.e. it is not crucial. If test results without a certain feature differ a lot from test results with that feature included, the feature seems to be important for classification, and the weighting is high.

Information Gain often overrates features with a large number of possible values [Daelemans et al. 1999b]. To avoid this, Gain Ratio was created. It is a variation of Information Gain, where the weights are normalised by the possible number of values for each feature. Thus, with an equal number of values for each feature, Information Gain and Gain Ratio will lead to the exact same results.

Apart from these two weighting systems, it is always possible to adjust the weights manually for optimal accuracy.
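Under the standard definitions, these two weightings can be sketched for a single feature as follows (an illustration of the idea; TiMBL's own computation may differ in detail, e.g. in how it handles single-valued features whose split information is zero):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Information Gain of one feature: H(labels) minus the weighted
    entropy of the labels after splitting on the feature's values."""
    n = len(labels)
    split = {}
    for v, c in zip(values, labels):
        split.setdefault(v, []).append(c)
    rest = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - rest

def gain_ratio(values, labels):
    """Gain Ratio: Information Gain divided by the split information
    (the entropy of the feature's own value distribution)."""
    si = entropy(values)
    return info_gain(values, labels) / si if si else 0.0

# Feature 4 (the focus grapheme) from the training set in 3.4.2
# fully determines the class, so its Information Gain is maximal:
print(info_gain(["b", "s"], ["b", "s"]))  # → 1.0
```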

3.3 The Decision Tree Learning Algorithm

Decision Tree Learning (DTL) is a greedy learning approach to ML, where the input information is stored in a decision tree, i.e. the information is compressed.

In the DTL algorithm IGTree, instances are stored as paths in a tree where the highest ranked features are at the top, at the root node, and the lower ranked ones are sorted towards the bottom. The final leaves of the tree contain the target category for each sequence of feature-values.

During testing, the path matching the properties of each pattern in the testing set is followed until a leaf is reached. The testing pattern is classified from the target category information in the leaf. The tree search means that although no information is deleted from the training set, IGTree accesses less of it than IB1 does. For every step towards the leaf in the tree, information from other paths is ignored. This may cause an impairment of the results, especially for features with very similar weightings. On the other hand, in some experiments the contrary occurs as well [Daelemans et al. 1999b].

Due to the tree construction, training takes a little longer with IGTree than with IB1. On the other hand, the classification phase is generally much faster. With large amounts of data, this is a huge advantage [2].

IGTree (Information Gain Tree) was originally developed as an alternative to the IB1-IG combination, but it is applicable with Gain Ratio as well.

N.B. unlike IGTree, IB1 (see 3.1) is still a lazy learner, since all the information is searched through in IB1 but not in IGTree.




[2] The 3-X-3 tests run in this study (see 5.1.3) took more than 24 hours with IB1, and only three hours or so with IGTree.


3.4 TiMBL

The algorithms described above are collected in the software package TiMBL, the Tilburg Memory-Based Learner [Daelemans et al. 1999b].

For the machine learning algorithms used in TiMBL, NLP applications thus far include e.g. hyphenation, grapheme-to-phoneme conversion, syllabification, part-of-speech tagging, morphological analysis and word stress. The Induction of Linguistic Knowledge Group (ILK) at Tilburg University, who developed TiMBL, has performed several experiments treating these subjects. Their papers and reports are available at http://ilk.kub.nl. For GPC, see 7.3.

3.4.1 Data format

TiMBL supports a number of different formats for the data in the training set. For all of them, the input information has to contain one instance per line. The last feature on a line is assumed to be the target category. For GPC, the word bil /b´i:l/ (car) may look like this:

0 0 0 b i l 0 b
0 0 b i l 0 0 ´i:
0 b i l 0 0 0 l

In this case, each line consists of seven features and one target category. The seven features are graphemes, and the target category is a phoneme. The fourth grapheme from the left is the focus grapheme. The other graphemes are context information (see 5.1.3).
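The conversion of an aligned word into such instance lines can be sketched as a sliding window over the padded grapheme string (a hypothetical helper, assuming one transcription unit per grapheme; `to_instances` is our name, not part of TiMBL):

```python
def to_instances(graphemes, phonemes, context=3):
    """Turn an aligned (grapheme, phoneme) word into one TiMBL-style
    instance per letter: left context, focus grapheme, right context,
    then the target phoneme. '0' pads the word boundaries."""
    padded = ["0"] * context + list(graphemes) + ["0"] * context
    lines = []
    for i, target in enumerate(phonemes):
        window = padded[i:i + 2 * context + 1]  # context + focus + context
        lines.append(" ".join(window + [target]))
    return lines

for line in to_instances("bil", ["b", "´i:", "l"]):
    print(line)
# 0 0 0 b i l 0 b
# 0 0 b i l 0 0 ´i:
# 0 b i l 0 0 0 l
```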

3.4.2 A training-testing example

A test (based on very little data) with IB1, the Overlap Metric, Gain Ratio weighting and 3-X-3 context (see 5.1.3) may look like this:

Training set:
0 0 0 b a n a b    the first phoneme in the word banan (banana)
0 0 0 s i l 0 s    the first phoneme in the word sil (strainer)

Testing set:
0 0 0 b i l 0 b    the first phoneme in the word bil (car)

When the information in the training set is stored, the patterns are compared to the pattern in the testing set. Since the first pattern in the training set contains three feature-values that differ from the feature-values in the same places in the testing set (_ _ _ _ a n a), the distance is set to three. The distance for the second pattern is one (one differing feature-value: _ _ _ s _ _ _). Accordingly, the second pattern is the nearest neighbour of the testing pattern. N.B. during testing, the algorithms ignore the key phoneme(s) in the testing set. It is there only to enable comparison between the answer and the output.


Feature   No. of Values   IG   GR
1         1               0    1
2         1               0    1
3         1               0    1
4         2               1    1
5         2               1    1
6         2               1    1
7         2               1    1

Table 3.1. Weighting results.


The weighting results are shown in table 3.1. Since the first three features are equal for all training patterns, they do not contribute at all to our knowledge of the correct class label. Therefore, the Information Gain weighting is zero. The other four features are equally important, and their weighting is set to one.

Now, since the Gain Ratio weighting is equal for all features, the classification of the second training pattern (the one with the larger similarity to the test pattern) is chosen. The resulting accuracy is zero.

Result:
0 0 0 b i l 0 b s

With a larger training set, the weighting will be different. Above all, feature number four will (as my experiments show) be given a much higher ranking than the others. When weighted, the distances will no longer be three and one as above. Thus, with accurate weighting, the distance to the first training pattern will be smaller than to the second, since the fourth feature counts much more than the others, and the resulting phoneme is b. Then, the accuracy is one, i.e. 100% correct.

Result:
0 0 0 b i l 0 b b
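The walkthrough above amounts to 1-NN classification with a weighted Overlap Metric, which can be sketched as follows (a toy illustration of the idea, not TiMBL itself; the weight of 10 on the focus grapheme is an arbitrary stand-in for "much higher ranking"):

```python
def classify(test, training, weights):
    """1-NN with a weighted Overlap Metric: the distance between two
    patterns is the summed weight of the features where they differ;
    the class of the closest training pattern wins."""
    def dist(a, b):
        return sum(w for w, ai, bi in zip(weights, a, b) if ai != bi)
    features, _ = test  # the key phoneme is ignored during testing
    best = min(training, key=lambda inst: dist(features, inst[0]))
    return best[1]

train = [(list("000bana"), "b"),   # first phoneme of 'banan'
         (list("000sil0"), "s")]   # first phoneme of 'sil'
test = (list("000bil0"), "b")      # first phoneme of 'bil'

# Equal weights: 'sil' wins (distance 1 vs 3) and the answer is wrong.
print(classify(test, train, [1] * 7))                 # → s
# A heavy weight on feature four (the focus grapheme) lets 'banan' win.
print(classify(test, train, [1, 1, 1, 10, 1, 1, 1]))  # → b
```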

4 Data

The data used in this paper is the Chalmers Swedish pronunciation lexicon [Hedelin and Jonsson], which consists of just over 110,000 words with transcriptions (including four stress categories). It was compiled by Per Hedelin and Anders Jonsson at the Chalmers University of Technology, Dept. of Information Theory, as an aid in building data-oriented grapheme-to-phoneme conversion. In this paper, the lexicon is referred to as Swelex.

5 Method

Two tests will be performed. The main test (test one) does not include stress, i.e. it treats pure grapheme-to-phoneme conversion. The other test (test two) is a preliminary study in mapping graphemes to both phonemes and stress.

5.1 The form of the lexicon

5.1.1 The phoneme set

First, the phoneme set in the lexicon has to be standardised and rational. I have chosen to use SAMPA (Speech Assessment Methods Phonetic Alphabet [SAMPA 1999]), which is a machine-readable phonetic alphabet. SAMPA basically consists of a mapping of the symbols of the International Phonetic Alphabet (IPA) onto ASCII codes that all computers can read. However, the original transcriptions contain phonetic information not covered by SAMPA, such as secondary stress and an extra phoneme (3 in Churchill /tC3tCIl/). Therefore, SAMPA has been slightly modified. The modified phoneme set is shown in appendix 1.


5.1.2 TiMBL format

The next step is to convert the lexicon into a format supported by TiMBL (see 3.4.1). In a 3-X-3 context (see 5.1.3), each line consists of seven graphemes to the left and a target phoneme to the right. The fourth grapheme from the left is the grapheme being transcribed to the phoneme (the focus grapheme). The other graphemes are context information. The word bil (car) looks like this [3]:

Test one:
0 0 0 b i l 0 b
0 0 b i l 0 0 i:
0 b i l 0 0 0 l

Test two:
0 0 0 b i l 0 b
0 0 b i l 0 0 ´i:
0 b i l 0 0 0 l

To transform the data into the column format, every grapheme in Swelex's word list first needs to be aligned to one or more symbols from the transcriptions. Every grapheme is aligned to one transcription unit consisting of at least one phoneme and zero or one length category (and, in test two, stress as well).

Test one: bil /b i: l/
Test two: bil /b ´i: l/

where the first grapheme is aligned to the first transcription unit, etc.

The alignment is executed with a program that (among other information) contains a list of allowed grapheme-to-phoneme alignments. The list (stress-included) is presented in appendix 2.

The stress categories in Swelex are located before the syllable they apply to. Moving the stress symbols to the main vowel of the syllable makes it easier for TiMBL to generalise stress information. Pre-syllable stress indication may even be impossible to generalise from on a grapheme-phoneme level. It seems likely that vowel-bound stress indication better catches the similarity between e.g. hall /h´al/ (hall) and stall /st´al/ (stable) (cp. /´hal/, /´stal/). Before the alignment, the stress symbols are therefore moved.
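Moving a pre-syllable stress symbol onto the following vowel can be sketched as below (the stress marks and the vowel set shown are illustrative assumptions, not Swelex's full inventory):

```python
def move_stress(tokens, stress=("´", "`"), vowels="aeiouyEO}@"):
    """Move each pre-syllable stress symbol onto the next vowel
    phoneme, e.g. /´hal/ -> /h´al/. 'tokens' is a list of phoneme
    symbols; a token counts as a vowel if its first character is in
    the (illustrative) vowel set."""
    out = []
    pending = None
    for t in tokens:
        if t in stress:
            pending = t                 # hold the mark until a vowel appears
        elif pending and t[0] in vowels:
            out.append(pending + t)     # attach the mark to the vowel
            pending = None
        else:
            out.append(t)
    return out

print(move_stress(["´", "h", "a", "l"]))       # → ['h', '´a', 'l']
print(move_stress(["´", "s", "t", "a", "l"]))  # → ['s', 't', '´a', 'l']
```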


There were four groups of words that didn't agree with my alignment list. For that reason, either Swelex or the list had to be modified.

(i) Incorrect transcriptions. Due to the size of the lexicon, it is quite natural that there are a few transcription errors. Words that did not suit my alignment list and that were obviously not correctly transcribed were removed from Swelex. N.B. not all incorrect transcriptions have been removed (see 7.6).

Examples:
bottenplan /b`Ot@npl:²an/ (ground floor)    the length sign misplaced
stressituation /str`Es²s¹Itu0aS²u:n/ (stress situation)    a misplaced secondary stress
utsålla /`}:ts²OlnIN/ (sort out)    noun instead of verb in transcription

(ii) Abbreviations. The pronunciation of graphemes in abbreviations is often not representative of the pronunciation of these graphemes in general. Abbreviations that did not fit the alignment list were removed from Swelex.

Examples:
BMW /b`e:@mv²e:/ (BMW)    removed
TV /t`e:v²e:/ (TV)    removed
OÄ /`u:²E:/ (short for 'general subjects')    not removed, since it fits the list

[3] Zeros ('0') imply that there is no feature information, i.e. they represent the beginning and end of a word.


(iii) Word clusters and prefixes. Where the pronunciation of a word cluster differs considerably from the ordinary pronunciation of the words, Swelex counts the cluster as one unit. I have decided to work at a word level, so both prefixes and word clusters have been removed.

Examples:
hoppa på /h¹Opap'o:/ (jump on / jump in)
huller om buller /h'u0l@rOmb'u0l@r/ (helter skelter)
geo- /j¹e:o-/ (geo-)
hetero- /h¹etero-/ (hetero-)

(iv) Foreign words. Some of the words in the lexicon are of foreign origin, most of them English, but also French, German, etc. To avoid defining when a word is Swedish and when it is not, these words were generally kept in Swelex. However, foreign words that were considered very infrequent in Swedish or that conflicted with the alignment list were removed. Both the alignment list and the phoneme set (3 in Churchill /tC3tCIl/) had to be modified.

Examples:
embarras /aNbar'A:/ (embarrassment)    removed
deadline /d´edlajn/ (deadline)    not removed, since it is rather common
maitresse /mEtr´Es/ (mistress)    not removed, since it doesn't conflict with the alignment list

When all the words above (about 300, primarily abbreviations and incorrect transcriptions) are removed, Swelex consists of 112,579 words.

5.1.3 Context and input information organization

In grapheme-to-phoneme conversion, it is natural to organise the input information like this: each instance consists of a grapheme surrounded by some left and right neighbour graphemes (the context), plus the target phoneme. [van den Bosch 1997] shows that a 3-X-3 context (i.e. three graphemes to the left and three to the right) gives enough information, although more context refines the results. With large data files, however, the increase in processing time makes a larger context a very impractical option [4]. Therefore, I have chosen to run TiMBL with a maximum of 3-X-3 context. Since 2-X-2 proved to give lower accuracy than 3-X-3, a 1-X-1 context test, which presumably would decrease the results even further, was considered redundant.

5.2 Training and testing

5.2.1 Weighting

Since all features used in this experiment are graphemes, and since the number of possible graphemes is constant, Gain Ratio and Information Gain will give the exact same results (see 3.2). It is therefore not interesting to run the tests with both of them, so I have chosen Gain Ratio, which is the default weighting in TiMBL.

[4] The 3-X-3 tests run in this study took more than 24 hours with IB1, and only three hours or so with IGTree.

Another alternative is to adjust the weights manually. In that case, first, the middle grapheme would be the most important (since it is the grapheme to be interpreted as the phoneme in question) and thus should have the highest weighting. Then, it is reasonable to think that the further away from the middle grapheme, the less important the other graphemes are. Intuitively, before any tests were run, I guessed that the grapheme to the right of the middle grapheme would be a little more important than the one to the left. As a matter of fact, this is what the weighting algorithms find. However, none of my own weighting results came close to the results from Gain Ratio, and thus they will not be reported in 6.

5.2.2 Other settings in TiMBL

As explained in the TiMBL chapter, two machine learning algorithms are used in this experiment: IB1 and IGTree.

With IB1, all contexts are tested both with Weighted Overlap (WO) and the Modified Value Difference Metric (MVDM).

The number of nearest neighbours is one (the default) in all cases except for two MVDM tests. Odd numbers larger than one may improve the results, especially with the MVDM metric [Daelemans et al. 1999b]. Thus, these MVDM tests are run with the number of nearest neighbours set to three. When the number of nearest neighbours is larger than one, it is also suggested that turning off the shortcut for exact matches in IB1 may give better results [Daelemans et al. 1999b]. Therefore, my MVDM tests with three nearest neighbours are run once with the shortcut on and once with it off.

5.2.3 Training and testing with TiMBL

To train and test TiMBL in a statistically adequate way, 10-fold cross-validation (10-FC) is used. The material is divided into ten equally large units. First, TiMBL is trained on units 1 to 9 and tested on unit 10. Then, training goes on with units 2 to 10, and testing with unit 1. This continues until ten training-testing rounds are through. The average accuracy (percentage of correctly converted graphemes) of the ten results is the result of the experiment.

According to [Weiss and Kulikowski 1997], 10-fold cross-validation is a good way to get more accurate results from a small material. On the other hand, [Diettreich 1997] claims that tests with five replications of 2-fold cross-validation give a smaller error rate than 10-fold cross-validation. However, with a material consisting of over 100,000 words, and thus more than 100,000 × L (where L is the average number of graphemes per word) phonetic units (see 5.1.2), the error rate ought to be small enough using the former validation method.

When training and testing are done, the ten resulting files are put together into one file.
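The 10-FC bookkeeping described above can be sketched as follows (the `train_and_test` callback is a placeholder for an actual TiMBL run; the fold-splitting scheme shown is one simple choice, not necessarily the one used in the study):

```python
def ten_fold_cross_validation(instances, train_and_test, k=10):
    """Split the data into k equal parts; each part serves once as the
    test set while the remaining parts form the training set. The
    result is the average accuracy over the k runs."""
    folds = [instances[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, f in enumerate(folds) if j != i for x in f]
        accuracies.append(train_and_test(train_set, test_set))
    return sum(accuracies) / k

# A dummy scorer just to show the bookkeeping: 'accuracy' = test size.
print(ten_fold_cross_validation(list(range(100)), lambda tr, te: len(te)))  # → 10.0
```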

5.2.4 Testing Rulsys

For the comparison, the words in Swelex were transcribed with Rulsys. Unfortunately, only the first-level GPC module could be carried through (see 2.2), since the second level includes several rules that are applied in the sound-based speech production model phase in figure 1.1. The first-level transcription is not a complete one. Some rules for co-articulation were not executed, and neither was the rule for the non-stressed-syllable vowel @ (schwa, see appendix 1).

Just as the Swelex transcriptions, the Rulsys transcriptions contain stress information, but only two categories. Since the Swelex transcriptions contain four types of stress, an accurate comparison between the two systems is not possible. The stress will therefore be removed from the Rulsys transcriptions, and a proper comparison will only be possible for test one.

To facilitate a comparison between Rulsys's and the machine learning grapheme-to-phoneme transcriptions, the former will, just as Swelex, be changed into the modified SAMPA phoneme set.

The transcriptions from Rulsys are now compared to the original transcriptions and the average word-based accuracy is calculated. An error file, containing (1) the words, (2) the Swelex transcriptions and (3) the Rulsys transcriptions for all the incorrectly transcribed words, is created.
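As a rough sketch, this comparison step might look like the following; the dictionaries and their contents are hypothetical stand-ins for the Swelex and Rulsys data.

```python
# Word-based accuracy plus an error list of (word, key, hypothesis) triples,
# mirroring the error file described above. All data here is hypothetical.
def word_accuracy(key, hyp):
    errors = [(w, key[w], hyp.get(w)) for w in key if hyp.get(w) != key[w]]
    return 1.0 - len(errors) / len(key), errors

swelex = {"bil": "/bi:l/", "dal": "/dA:l/", "sol": "/su:l/"}
rulsys = {"bil": "/pi:l/", "dal": "/dA:l/", "sol": "/su:l/"}
accuracy, errors = word_accuracy(swelex, rulsys)
# the errors list could then be written out line by line as the error file
```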

5.3 Making comparison possible

Up to this point, the results from TiMBL (which are several, since a number of different settings are used) are on a grapheme level, and the results from Rulsys are word-based. The accuracies are not comparable.

Example:

    Phoneme-based                Word-based
    0 0 0 b i l 0  p             bil /pi:l/
    0 0 b i l 0 0  i:
    0 b i l 0 0 0  l
    1/3 (33%) phoneme errors     1/1 (100%) word errors

A testing set consisting of one three-letter word where one of the letters is incorrectly transcribed will have a 67% (100% - 33%) grapheme accuracy and a 0% word accuracy (since the only word in the set is incorrectly transcribed).

To compare the tests, the results from TiMBL have to be word-based too. Thus, the grapheme-based results will be transformed into word-based.


Example:

    0 0 0 b i l 0  b
    0 0 b i l 0 0  i:    will become bil /bi:l/ again
    0 b i l 0 0 0  l

Just as for the Rulsys transcriptions (see 5.2.4), the word-based accuracy is calculated, and an error file is created.
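The regrouping can be sketched as below. The instance format (seven window features followed by the predicted phoneme, with "0" as padding) follows the example above; the boundary test on the right-context features is my own assumption about how words can be recovered from the output.

```python
# Rebuild word transcriptions from grapheme-level output lines such as
# "0 0 0 b i l 0 b". A word ends where the whole right context is padding.
def words_from_instances(lines):
    words, current = [], []
    for line in lines:
        fields = line.split()
        current.append(fields[7])                  # predicted phoneme
        if fields[4:7] == ["0", "0", "0"]:         # right context all "0"
            words.append("".join(p for p in current if p != "0"))
            current = []
    return words

predicted = ["0 0 0 b i l 0 b", "0 0 b i l 0 0 i:", "0 b i l 0 0 0 l"]
# gives ["bi:l"], i.e. bil /bi:l/ again
```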


However, there is no point in going through this for all the TiMBL results. It seems reasonable to assume that a poor grapheme-based result will also be a poor word-based result. Hence, only the results from the best performing TiMBL experiments will be transformed into a word-based form.

6 Results

6.1 Grapheme accuracy

For all the tests performed in TiMBL, there are grapheme accuracies as shown in table 6.1. The best results are 97.21% for test one and 94.01% for test two.

    Algorithm   Context   Weighting   Similarity   k-NN   Test 1   Test 2
    IGTree      2-X-2     GR          -            -      96.30    91.48
    IGTree      3-X-3     GR          -            -      96.93    93.19
    IB1         2-X-2     GR          WO           1      96.43*   91.58
    IB1         3-X-3     GR          WO           1      97.21    93.76
    IB1         3-X-3     GR          MVDM         1      97.19    94.01
    IB1         3-X-3     GR          WO           3      96.83*   93.15*
    IB1         3-X-3     GR          MVDM         3      97.08    93.91
    IB1         3-X-3     GR          MVDM         3**    96.43*   92.96*

    * Only 10% of the material has been run, i.e. step one of the 10-fold cross-validation.
    ** Shortcut search off.

Table 6.1. TiMBL results. Grapheme accuracies (%). Best results: 97.21 (test 1) and 94.01 (test 2).
k-NN: number of nearest neighbours.


Overall, the value difference metric performs slightly better than WO. The only exception is the test one experiment with one nearest neighbour, where the difference between the two metrics is negligible (97.21% for the overlap metric vs. 97.19% for MVDM). Increasing the number of nearest neighbours seems to cause a slight impairment of the results, e.g. from 94.01% to 93.91% in test two with the modified value difference metric. The experiments run without the shortcut for exact matches perform significantly worse than when the shortcut is on.

Some of the tests above were run with only 10% of the material. When a few tests were run, I noticed that the accuracy was very stable across folds (as shown in table 6.2 for the best results). For that reason, when the first 10-FC test did not keep on a level with the previous best results, the tests were interrupted.


    10-FC No.   Test 1    Test 2
    1           97.26%    94.14%
    2           97.16%    93.97%
    3           97.21%    94.05%
    4           97.25%    93.97%
    5           97.25%    94.01%
    6           97.16%    93.92%
    7           97.16%    94.05%
    8           97.22%    94.00%
    9           97.21%    94.05%
    10          97.19%    93.97%
    Average     97.21%    94.01%

Table 6.2. Results from the ten 10-fold cross-validation tests. Grapheme accuracies for the best TiMBL results.


Table 6.3 shows the Gain Ratio weighting for the first 10-FC test of the best results. In both cases, feature number four is considered the most important. The further the distance from the focus grapheme, the lower the features are weighted. For both test one and test two, feature number five is more important than number three. In test one, the weighting decreases faster than in test two. For the complete weighting for these tests, see appendix 3.



    Feature number   Weight (Test 1)   Weight (Test 2)   Ranking
    1                0.0386973         0.0699462         6
    2                0.0705866         0.0943413         5
    3                0.17649           0.188165          3
    4                0.884009          0.885378          1
    5                0.197349          0.209649          2
    6                0.0799479         0.109406          4
    7                0.0385612         0.0679605         7

Table 6.3. Weights from step one of the 10-FC for test one and test two.
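The Gain Ratio weights in table 6.3 are computed by TiMBL itself; the measure (information gain normalised by the split information of the feature) can be sketched as below, on toy data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(instances, feature, targets):
    """Gain ratio of one feature column over a list of feature tuples."""
    groups = {}
    for inst, t in zip(instances, targets):
        groups.setdefault(inst[feature], []).append(t)
    n = len(targets)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([inst[feature] for inst in instances])
    gain = entropy(targets) - remainder
    return gain / split_info if split_info else 0.0

# Toy data: feature 0 predicts the target perfectly, feature 1 not at all.
insts = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
targets = ["p", "p", "q", "q"]
# gain_ratio(insts, 0, targets) == 1.0; gain_ratio(insts, 1, targets) == 0.0
```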

6.2 Word accuracy

For both Rulsys and the best of the TiMBL tests, there are word-based accuracies as shown in table 6.4. For error examples, see table 6.9. The results indicate that TiMBL manages GPC much better than Rulsys, but since the schwa was not used in the Rulsys transcriptions (see 5.2.4) this is not a fair comparison. Table 6.5 shows the word-based accuracies where all words with a Swelex transcription containing schwa have been removed. However, even this is not entirely adequate, since some of the words (and probably not an equal number for TiMBL and Rulsys) containing schwa might contain other errors as well. See also 7.1.


    Test      Results   Number of errors
    Rulsys    35.1%     73,062
    Test 1    67.0%     37,104
    Test 2    56.5%     48,956

Table 6.4. TiMBL and Rulsys results. Word accuracies and number of errors.



    Test      Results   Number of errors
    Rulsys    62.2%     42,581
    Test 1    77.6%     25,257
    Test 2    69.7%     34,142

Table 6.5. TiMBL and Rulsys results. Word accuracies and numbers of errors. Schwa words removed.


The short E and e phonemes are debated issues in Sweden. In many dialects, these two are pronounced in the same way. However, since most dialects do distinguish between these two, they generally count as two phonemes [Elert 1995].

Example:

    hetta /h`eta/  heat
    hätta /h`Eta/  hood

When looking at the test results, I discovered that a great deal of the inaccuracies from Rulsys were E-e errors, where Rulsys suggested e when the key transcription was E. There was no reverse case. Rulsys seemed to under-generate E. Therefore, in table 6.6, the accuracies where the short E/e phonemes were unified are compared to the results in table 6.5. Not surprisingly, the accuracy for the Rulsys transcriptions increased more than for test one and two.




                            Rulsys    Test 1    Test 2
    E ≠ e   Accuracy        62.2%     77.6%     69.7%
            No. of errors   42,581    25,257    34,142
    E = e   Accuracy        65.3%     77.8%     69.8%
            No. of errors   39,096    25,011    34,032

Table 6.6. Average word-based accuracies and numbers of errors. Schwa words removed.
E ≠ e: e and E count as different phonemes.
E = e: e and E count as equal phonemes.


To enable length error evaluation, table 6.7 shows the difference between length-included and length-ignored accuracies. Rulsys has great difficulties in determining length.
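The two relaxations used in tables 6.6 and 6.7 amount to simple normalisations of the transcriptions before scoring. A sketch, in my own formulation, using the phoneme symbols from appendix 1:

```python
import re

def unify_short_E(t):
    """Count short E and e as one phoneme; long E: is a separate phoneme
    and is left untouched."""
    return re.sub(r"E(?!:)", "e", t)

def ignore_length(t):
    """Treat long and short vowels alike by dropping the ':' length marker."""
    return t.replace(":", "")

assert unify_short_E("h`Eta") == "h`eta"       # hätta: short E unified with e
assert unify_short_E("s´E:l") == "s´E:l"       # säl: long E: unchanged
assert ignore_length("f`e:tIg") == ignore_length("f`etIg")   # fettig
```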



                                            Rulsys    Test 1    Test 2
    E ≠ e   Length included  Accuracy       62.2%     77.6%     69.7%
                             No. of errors  42,581    25,257    34,142
            Length ignored   Accuracy       76.1%     82.9%     70.6%
                             No. of errors  26,894    19,240    33,119
    E = e   Length included  Accuracy       65.3%     77.8%     69.8%
                             No. of errors  39,096    25,011    34,032
            Length ignored   Accuracy       80.0%     83.1%     70.6%
                             No. of errors  22,527    18,987    33,058

Table 6.7. Average word-based accuracies and numbers of errors. Schwa words removed.
E ≠ e: e and E count as different phonemes.
E = e: e and E count as equal phonemes.
Length included: long vowels count as long, and short vowels as short.
Length ignored: all vowels are of equal length.


Every word in the training set contains at least one stress category, i.e. there is a stress obligation (it lies in the definition of word stress [Elert 1995]). TiMBL does not know this, since the only input information in the tests is the focus and surrounding graphemes. Would the results be improved if this information were added to the test? In table 6.8, stress-ignorance errors are compared to the total number of errors.


    Test             Number of errors   Error rate (%)
    Total            48,956             100
    Stress ignored   2,956              6

Table 6.8. Word-based errors for stress ignorance compared to the total number of errors.
Stress ignored: words with no resulting stress.



Errors divided into categories are exemplified in table 6.9.

    Test    Error type       Error
    Rulsys  @                kabel /kA:b@l/ /kA:bel/  cable
            e vs. E          ebb /Eb/ /eb/  ebb
            Length           abort /abORt/ /abu:Rt/  abortion
            Deletion         kongress /kONgrEs/ /kONres/  congress
            Insertion        George /je:Orj/ /jeOrje/  George
            Others           absolut /absUl}:t/ /absu:lu0t/  absolutely
                             kalkjord /kalkju:Rd/ /kA:lCu:Rd/  limy soil
                             (errors like kj -> C need compound boundary
                             information to be avoided)
    Test 1  @                kadaver /kadA:v@r/ /kadA:ve:r/  carcass
            e vs. E          oavsedd /u:A:vsed/ /u:A:vsEd/  unexpected
            Length           cynism /sYnIsm/ /sy:nIsm/  cynicism
            Deletion         gammelgädda /gam@ljEda/ /gam@lEda/  prize pike
            Insertion        oljud /u:j}:d/ /Oljj}:d/  noise
            Others           magisk /mA:gisk/ /majIsk/  magical
                             opium /u:pIu0m/ /u:pju0m/  opium
    Test 2  @                banddriven /b`andr2i:ven/ /b`andr2i:v@n/  caterpillar (e.g. tractor)
            e vs. E          Beckett /b´ek@t/ /b´Ek´Et/  Beckett
            Length           fettig /f`etIg/ /f`e:tIg/  greasy
            Deletion         deaktivera /d1eaktIv´era/ /daktIv´era/  deactivate
            Insertion        Erna /`{:Rna/ /´{rRna/  Erna
            Stress           falang /fal´aN/ /f1al2aN/  (political) wing
            Stress ignored   Åsa /`o:sa/ /Osa/  Åsa
            Others           halvt /h´alvt/ /h`alft/  half
                             Roine /r´Ojn@/ /r`u:´an/  Roine
                             (ending possibly caused by French pronunciation
                             in e.g. Antoine)

Table 6.9. Examples of errors. word /Swelex transcription/ /resulting transcription/ English translation.


The accuracies when the stress symbols were removed from the results from test two are displayed in table 6.10. (The stress-excluded results from test two are referred to as test three.) This enables comparison between test one and test two. Remarkably, when stress was removed from test two, the results were in some cases significantly better than the test one results. The phoneme-based accuracy for test three (not in the table) is 97.22%, or 30,560 errors, which is just as good as test one. When the words containing schwa are excluded from the errors, the result from test three increases from 67.4% to 86.2%, almost twenty percentage units, while test one goes up only ten. For the succeeding accuracies, the increase is equal for both tests.



                                                             Test 1    Test 3
    Schwa included  E ≠ e  Length included  Accuracy         67.0%     67.4%
                                            No. of errors    37,104    36,668
    Schwa excluded  E ≠ e  Length included  Accuracy         77.6%     86.2%
                                            No. of errors    25,257    15,489
                           Length excluded  Accuracy         82.9%     92.2%
                                            No. of errors    19,240    8,807
                    E = e  Length included  Accuracy         77.8%     86.5%
                                            No. of errors    25,011    15,232
                           Length excluded  Accuracy         83.1%     92.4%
                                            No. of errors    18,987    8,529

Table 6.10. Word-based accuracies for test one and for test two with the stress information removed, called test three.

7 Evaluation

7.1 Rulsys vs. TiMBL

In the GPC experiments performed in this study, the algorithms in TiMBL generally perform better than Rulsys. Before any adjustment of the results is made, even test two, which computes stress information as well as pure GPC, shows a better performance than Rulsys (table 6.4). Surprisingly, when the schwa words are removed (table 6.5) to enable a fairer comparison, Rulsys still produces more errors than both TiMBL tests.

The more kinds of errors ignored, the better the Rulsys results get, compared to the TiMBL tests. When E and e count as equal (table 6.6), Rulsys increases from 62.2% to 65.3% word accuracy, i.e. three percentage units, which is more than ten times as much as the 0.2 and 0.1 unit increases for test one and two respectively. A TTS module with low accuracy for E vs. e may not necessarily lead to poor results in a listening test, since many listeners (even those who distinguish E from e in their own speech) may not notice the generalisation.

Rulsys seems to have severe length problems. In table 6.7, ignoring length increases its word accuracy from 62.2% to 76.1%, while test one and test two go from 77.6% to 82.9% and from 69.7% to 70.6% respectively. The former escalates to 118.3% of the first value, compared to 106.4% in test one. Test two increases only 1.3%. Ignoring length is what takes Rulsys past test two.

With both E = e and ignored length (table 6.7), Rulsys comes up to a result that almost matches test one. Rulsys generates 80.0% word accuracy and test one 83.1%.

However, it is still a fact that TiMBL generally performs better than Rulsys. The best realistic and comparable word-based results are 65.3%, 77.8%, 69.8% and 86.5% for Rulsys, test one, two and three respectively. Realistic in the sense that crucial information, length, is kept, but only one phoneme for e/E is included. Comparable since these results are from the case where schwa is removed (including the schwa words would mean an extremely biased comparison).

7.2 TiMBL

The results from the experiments with different contexts (table 6.1) were expected. It is likely that more input information generally leads to higher output accuracy. In GPC, there may be cases where the third graphemes to the left and right of the focus grapheme obviously affect the pronunciation of the focus. With 2-X-2, these effects will not be detected.

Example:

    självmordstanke /SElvmu:RdRsRtank@/  suicidal thought

where the correct pronunciation of the t demands knowledge of the r that is three graphemes to the left.
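Instance generation with a symmetric grapheme window can be sketched as follows, with "0" marking padding beyond the word edges as in the examples in 5.3.

```python
# Build one instance per grapheme: <left context, focus, right context>.
def instances(word, left=3, right=3):
    padded = ["0"] * left + list(word) + ["0"] * right
    return [padded[i:i + left + 1 + right] for i in range(len(word))]

# With a 3-X-3 window the first instance of "bil" is 0 0 0 b i l 0;
# a 2-X-2 window simply sees one grapheme less on each side.
```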


Just as in most cases with machine learning, IB1 performs better than IGTree. This is caused by the ignorance of neighbour paths in IGTree mentioned in the machine learning chapter. As seen in table 6.3, the weightings for feature numbers five and three are rather similar. Sometimes, as in the example above, the pronunciation of the focus grapheme (feature no. four) may be affected more by the preceding grapheme (no. three) than by the succeeding one (no. five). In such cases, where the feature weighting contradicts how the graphemes in context actually influence the pronunciation, it is likely that IGTree will fail in phoneme prediction. In the självmordstanke example above, the left context is far more important for correct pronunciation than the right.


The reason why the modified value difference metric generally performs better than the overlap metric is that MVDM is able to classify the relative distance between feature values. This means that with WO, the distance between e and ä is the same as the distance between e.g. e and p. However, it is a reasonable assumption to see e and ä as more similar than e and p, and this is exactly what MVDM will learn from a sufficiently large training set. It will find that the former pair more often leads to the same target phoneme than the latter.
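The metric can be sketched as follows: the distance between two values of a feature is the summed difference of their conditional class distributions, estimated from (value, target) co-occurrences in training data. This is a simplified rendering of MVDM on toy data, not TiMBL's implementation.

```python
from collections import Counter, defaultdict

def mvdm(pairs, v1, v2):
    """Distance between two feature values, from (value, target) observations."""
    counts = defaultdict(Counter)
    for value, target in pairs:
        counts[value][target] += 1
    classes = {t for _, t in pairs}
    n1, n2 = sum(counts[v1].values()), sum(counts[v2].values())
    return sum(abs(counts[v1][c] / n1 - counts[v2][c] / n2) for c in classes)

# Toy data: 'e' and 'ä' mostly lead to the same phonemes, 'p' does not,
# so MVDM places 'e' closer to 'ä' than to 'p'.
data = [("e", "e"), ("e", "E"), ("ä", "E"), ("ä", "E"), ("p", "p")]
assert mvdm(data, "e", "ä") < mvdm(data, "e", "p")
```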


When the number of nearest neighbours was increased to three, the results went down a little. This probably means that when the nearest neighbour is found, it is most often the correct one. If more than one nearest neighbour is searched for, say three, the most frequent classification among them is chosen, which may come from the second or third neighbour as well as the first. It seems that in GPC, similarity is more important than frequency.

A similar explanation is probable for the impairment caused by switching off the shortcut for exact matches. When exact matches are found, they tend to contain the correct target phoneme more frequently than the nearest neighbour.


For test two, there seems to be a greater need for large context than in test one. As mentioned in 6, the feature weightings in test one decrease faster than in test two when the distance to the focus grapheme is increased. Test two also shows a greater difference between 2-X-2 and 3-X-3 context. This implies that in stress-included GPC, there is a greater need of extended context information.

The most unexpected result in the study is perhaps the different accuracies in test one and three (table 6.10). When all words containing schwa are removed from the error file, the accuracy for test three is almost ten percentage units higher than for test one, i.e. test three contains more schwa-related errors than test one. Since schwa only occurs in non-stressed syllables, it is highly stress-dependent. Thus, test three, which is trained with stress-included information, might be assumed to result in fewer stress-related errors than test one, and not more. However, this is obviously not the case. This is because each target category is an indivisible unit. To enable utilisation of stress information to predict schwa occurrence, the input information must show the similarity between e.g. /´a/ and /`a/, which it does not in these experiments. Instead, the opposite occurs, i.e. test three generates more schwa errors than test one. One plausible explanation is the frequency of schwa as a target phoneme, compared to the other vowels, where each vowel is computed as up to five separate units (four stress cases and one without stress). See 7.6 for better stress prediction approaches, where stress prediction is computed separately from grapheme-to-phoneme conversion.

7.3 Results from similar algorithms

At the Institute of Language Technology, Dept. of Computational Linguistics, Tilburg University, many grapheme-to-phoneme experiments have been made, a few of them also for stress prediction.

In [Daelemans and van den Bosch 1993a] a memory-based learning algorithm similar to IB1 is used for phoneme prediction in Dutch. It presents a phoneme accuracy of 93.4% at the most. [Daelemans et al. 1998] introduce some cases of 98.5% accuracy for English using a partially reconstructed IGTree approach. The multilingual tests in [Daelemans and van den Bosch 1993b] show 97% for Dutch, 98% for French and 90.1% for English using MBL algorithms.

In [Daelemans et al. 1999a] results from IB1 and IGTree for English GPC are compared. The accuracies are around 93%. The IB1 results are a little better than the results from IGTree.

[Busser 1998] presents an IGTree GPC test for Dutch with grapheme-based accuracies of about 97%. Word-level accuracies are at the most 75%. In a stress prediction experiment, accuracies of up to 97% come up. With the same IGTree algorithm, [Busser 1997] presents an astonishing 98.35% for stress prediction.

None of them treats grapheme-to-phoneme conversion for Swedish. Most of them are for Dutch. Is there a language-dependent difference in the way phonemes can be predicted? English is often seen as a language with very irregular spelling [Crystal 1997], i.e. low grapheme-phoneme correspondence. This may be the reason why the English experiments present the poorest results above. Nevertheless, the results from [Daelemans and van den Bosch 1993b] indicate a language-specific difference. Therefore, the results above may not be fully comparable with the results from this experiment. Still, a comparison is easily carried through, and therefore will be.

For the pure GPC, the IB1 algorithm in this experiment results in about 97.2% (table 6.1) at the most. The IGTree tests give 96.93%. In the examples above, IB1 and IGTree both perform from 93% to 98%. Compared to the experiments referred to above, my study seems to generate average accuracy for the ML algorithms used.

The stress prediction experiments above show much better accuracies than my test two. This is probably because they are two-level experiments ((1) grapheme-to-phoneme and (2) phonemes/syllables to stress) whereas mine are one level only (see 7.6).

7.4 Conclusion

Both the memory-based learning algorithm IB1 and the decision tree IGTree perform comparatively well in grapheme-to-phoneme conversion for Swedish. The accuracy exceeds the results obtained for the rule-based GPC system in the TTS synthesiser Infovox. Therefore, implementing a machine-learning component may improve the GPC in Infovox.

7.5 Possible difficulties caused by the lexicon

One conceivable problem in the machine learning experiments is that Swelex is not consistent. Like the incorrect transcriptions, this may be caused by the human factor. One example is when a voiced plosive preceding a non-voiced consonant becomes non-voiced as well.

Example:

    sagt /sakt/  said

For some reason, Swelex sometimes follows this rule and sometimes it doesn't. If Swelex were modified to follow this rule consistently, the results from TiMBL might improve.
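Such a consistency pass could be sketched as below. This treats transcriptions as sequences of single-character phonemes for simplicity (multi-character phonemes like Rt aside) and is my own illustration of the rule, not an actual Swelex tool.

```python
# Devoice a voiced plosive that directly precedes a voiceless obstruent,
# so that e.g. "sagt" always comes out as /sakt/.
DEVOICE = {"b": "p", "d": "t", "g": "k"}
VOICELESS = set("ptksfSCh")

def apply_devoicing(transcription):
    out = list(transcription)
    for i in range(len(out) - 1):
        if out[i] in DEVOICE and out[i + 1] in VOICELESS:
            out[i] = DEVOICE[out[i]]
    return "".join(out)

assert apply_devoicing("sagt") == "sakt"
assert apply_devoicing("dal") == "dal"        # nothing to devoice
```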


Removing some of the most un-Swedish words from Swelex could also make it easier for TiMBL, since the pronunciation of a grapheme in a given context can differ a lot depending on the nationality of a word. Foreign-word information may impair the results. One example (table 6.9) is the name Roine /r´Ojn@/, which in test two is transcribed /r`u:´an/. The /Ojn@/ -> /u:an/ error is probably caused by French words in the lexicon, where -oine often is pronounced /Uan/ or /u:an/ (with Swedish accent), e.g. Antoine. Removing foreign words is not an easy task though, since it demands some kind of definition of the boundary between a Swedish and a non-Swedish word. Svenska Akademiens Ordlista (SAOL) (The Swedish Academy Word List) could be an appropriate tool for such a definition. A similar problem appears in words that actually count as Swedish, but have a foreign grapheme-phoneme correlation (see also 7.6). This applies to e.g. AIDS /Ejds/ (AIDS), which has an English pronunciation.


An essential, time-consuming and perhaps even more difficult project is to remove all incorrect transcriptions from Swelex⁵. It involves deciding whether a transcription is correct or not. If TiMBL is trained with incorrect information, e.g. stor /sti:r/ (large/big) where the correct transcription would be /stu:r/, it will have great difficulties in pronouncing stor correctly, and probably similar words too. Removing the incorrectly transcribed words will not necessarily affect the accuracy for TiMBL. If it is trained with incorrect information, and answers with the same errors during testing, those errors will not count as errors, since there are 'printer's errors' in the key. However, if there are cases where Rulsys has made a linguistically correct transcription but Swelex has not, the errors in Swelex cause the correct Rulsys transcriptions to be counted as incorrect. The results for Rulsys will decrease. There may also be cases where TiMBL transcriptions count as wrong, although they are linguistically correct. Even though an incorrect training set will not automatically have any influence on the accuracy for the machine learning experiments, it is obvious that it will lead to imperfect pronunciation.

⁵ Some of the incorrect transcriptions have already been removed, since they were not supported by the grapheme-phoneme alignment list (see 5.1.2).

7.6 Directions for further machine learning research on Grapheme-to-Phoneme Conversion for Swedish

Apart from further modification of the lexicon, there are several ways of improving grapheme-to-phoneme conversion in a machine learning system.

A working TTS synthesiser needs to be able to pronounce words of foreign origin as well as domestic ones. Hence, foreign words like the ones removed from the lexicon (see 7.5) need to be transcribed somehow. One way to manage this is to have one grapheme-to-phoneme converter for Swedish words, and another for e.g. English. When the Swedish converter cannot pronounce a word, or when the probability of correct pronunciation is too low, the English converter will be consulted. A complete-word lexicon may be motivated for comparatively common words containing graphemes whose pronunciation digresses from the rest.
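The cascade suggested above might be sketched as follows. All names, the stand-in converters and the confidence threshold are hypothetical; a real system would plug in actual converters returning (transcription, confidence) pairs.

```python
# Exception lexicon first, then the Swedish converter, falling back to an
# English converter when the Swedish confidence is below a threshold.
def cascade_gpc(word, lexicon, swedish, english, threshold=0.8):
    if word in lexicon:                       # complete-word exceptions
        return lexicon[word]
    phonemes, confidence = swedish(word)
    if confidence >= threshold:
        return phonemes
    return english(word)[0]                   # fall back for foreign words

# Toy stand-ins: the "Swedish" converter is only confident on plain
# lowercase words, the "English" one tags its output for visibility.
lexicon = {"AIDS": "Ejds"}
swedish = lambda w: (w.lower(), 0.9 if w.isalpha() and w.islower() else 0.3)
english = lambda w: ("eng:" + w, 0.9)
assert cascade_gpc("AIDS", lexicon, swedish, english) == "Ejds"
assert cascade_gpc("bil", lexicon, swedish, english) == "bil"
```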


It is also possible that some of the length errors in the machine learning experiments can be avoided with additional training/testing on a syllable-based level, stress included. Since non-stressed syllables never contain long vowels, phoneme-based input containing information on whether the syllable in question is stressed should help.

For some words, like kort /k´ORt/ (short) or /k´URt/ (card), the grapheme-to-phoneme converter needs part-of-speech information to determine the pronunciation.


For maximum accuracy in stress-included GPC, aligning graphemes to phonemes and stress at the same time is probably not the ultimate line of action, since the similarity between e.g. /´a/ and /`a/ is completely overlooked. Instead, a two-level approach, where part one manages the GPC and part two the stress prediction, is preferable. Like the GPC experiments performed in this study, part one would be grapheme-based. For the stress prediction in the second part, there are two main options.

It could be phoneme-based, i.e. the phoneme transcription forms the input information, and the target category is the stress label of each phoneme. The other alternative is a syllable-based method, where the input consists of syllables, and each syllable corresponds to a stress category. For both stress prediction approaches, the target stress category is either 0, as in no stress, or one of the four stress types used in Swelex. Information concerning the distance to the last stress value of the same type and, as mentioned in 6.2 (table 6.8), stress obligation may help to improve the results of a stress prediction experiment. It is also possible that, with a syllable-based approach, the stress obligation will automatically be inferred from the input information. A two-level stress-included GPC approach may also increase the accuracy for schwa prediction (see 7.2).
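The syllable-based variant could produce instances like the following sketch, where the target is the syllable's stress category (0 for no stress) and "_" pads the context. The syllabification and data are hypothetical.

```python
# One instance per syllable: <left syllables, focus syllable, right
# syllables> paired with the focus syllable's stress category.
def stress_instances(syllables, stresses, context=1):
    padded = ["_"] * context + syllables + ["_"] * context
    return [
        (tuple(padded[i:i + 2 * context + 1]), stresses[i])
        for i in range(len(syllables))
    ]

# pojken /p`Ojk@n/: grave-accent stress on the first syllable only.
inst = stress_instances(["pOj", "k@n"], ["`", "0"])
# inst[0] == (("_", "pOj", "k@n"), "`")
```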



Sammanfattning (Summary, translated from Swedish)

This essay is a study of grapheme-to-phoneme conversion (GPC) for Swedish. Telia Promotor's speech synthesiser Infovox, in which GPC is performed by hand-written rules, is compared with a machine learning program called TiMBL [Daelemans et al. 1999b].

A machine learning program is a program that learns to make correct judgements from training examples. It first goes through a learning phase, in which it stores the information, and then a classification phase, in which new examples are judged/classified.

In this case, the classification consists of interpreting a letter (a grapheme) as a phoneme. The training examples come from a lexicon of just over 110,000 words with transcriptions, Chalmers Svenska Uttalslexikon (the Chalmers Swedish pronunciation lexicon) [Hedelin and Jonsson]. For each letter to be classified, a context is also given, i.e. a certain number of letters surrounding the focus letter. 2-X-2 and 3-X-3 contexts (two and three letters to the left and right, respectively) are used. Each letter together with its context is called an instance.

Two machine learning algorithms are used: one memory-based and one based on decision trees. In the memory-based algorithm, learning essentially consists of storing the information from the training examples as it is (the memory-based algorithm is a so-called lazy learning algorithm that does not abstract from the information in the training examples). New instances are classified by the program determining the similarity between new and learned instances. The new instance receives the same classification as the most similar instance in the learned material. In the program used here, there are two different ways of determining similarity between instances.

Each element of an instance is called a feature. From the training examples, the machine learning program computes which features contribute the most information for the classification; each feature receives a weight. The memory-based algorithm combines similarity and weighting to classify an instance. Just as for similarity, there are two ways of computing the weighting.

The decision tree algorithm restructures the information before it is stored, using the weighting. The most important features end up at the top of a decision tree, the less important ones further down. At the bottom are the classifications of the feature combinations, so that for each classified instance in the training examples there is a path down through the tree, and in this way new instances are classified.

Several tests were made with different contexts, settings and algorithms. All experiments were made once with plain GPC only, and once with stress-included GPC. The results from the best tests were then assembled to word level, to enable comparison with the results from Infovox's grapheme-to-phoneme converter, which is word-based.

The experiments with stress information cannot be fairly compared with the rule system, since the stress marking differs between the systems. The stress experiment has instead been compared with the result from the rule-based system without stress. Even though the machine learning experiment has more information to keep track of, it shows a higher result than the rule-based system without stress. The experiments without stress also give higher results than the Infovox rule system. There are, however, certain obstacles in the comparison that may give the machine learning experiments an advantage. When these obstacles are disregarded (where possible), the rule system still shows poorer results. This shows that machine learning may be a better alternative than the already existing rule-based system in Infovox.

Writing Conventions

(i) Examples (or emphasis) in the text are in italic.

(ii) Phonetic transcriptions are within /slashes/.

(iii) Word example translations are in italic type (within parentheses).

References

Bell Labs TTS. Bell Labs Innovations (1997). URL: http://www.bell-labs.com/project/tts/

Busser, B. (1998). TreeTalk-D: a machine learning approach to Dutch word pronunciation. P. Sojka, V. Matousek, K. Pala, and I. Kopecek (Eds.), Proceedings TSD Conference, Masaryk University, Czech Republic.

Busser, G. J. (1997). TreeTalk-D: A Machine Learning Approach to Dutch Grapheme-to-Phoneme Conversion. Institute of Language Technology, Computational Linguistics, Tilburg University.

Crystal, D. (1997). The Cambridge Encyclopedia of Language. Second edition. Cambridge University Press.

Daelemans, W. and A. van den Bosch (1993a). Data-Oriented Methods for Grapheme-to-Phoneme Conversion. Proceedings of the Sixth Conference of the European Chapter of the ACL.

Daelemans, W. and A. van den Bosch (1993b). Tabtalk: Reusability in Data-Oriented Grapheme-to-Phoneme Conversion. Proceedings of Eurospeech, Berlin.

Daelemans, W. (1996). Abstraction Considered Harmful: Lazy Learning of Language Processing. van den Herik, J. and T. Weijters (eds.), Benelearn-96. Proceedings of the 6th Belgian-Dutch Conference on Machine Learning. MATRIKS: Maastricht, The Netherlands.

Daelemans, W., A. van den Bosch and T. Weijters (1997). Empirical Learning of Natural Language Processing Tasks. M. van Someren and G. Widmer (eds.), Machine Learning: ECML-97, Lecture Notes in Artificial Intelligence, Berlin: Springer.

Daelemans, W., A. van den Bosch and T. Weijters (1998). Modularity in Inductively-Learned Word Pronunciation Systems. D. M. W. Powers (Ed.), Proceedings of NeMLaP3/CoNLL98, Sydney, Australia.

Daelemans, W., B. Busser and A. van den Bosch (1999a). Machine learning of word pronunciation: the case against abstraction. Proceedings of the Sixth European Conference on Speech Communication and Technology, Eurospeech99, Budapest, Hungary, Sept 5-10.

Daelemans, W., A. van den Bosch, J. Zavrel and K. van der Sloot (1999b). TiMBL: Tilburg Memory Based Learner version 2.0. Reference Guide. ILK Technical Report 98-03. Tilburg University.

DECtalk. SMART Modular Technologies (1999). URL: http://www.smartmodulartech.com/systems/

Dietterich, T. G. (1997). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Dept. of Computer Science, Oregon State University.

Elert, C-C. (1995). Allmän och svensk fonetik. Nordstedts förlag. Värnamo.

Hedelin, P. and A. Jonsson. Svenskt uttalslexikon. Dept. of Information Theory, Chalmers University of Technology. (Informationsteori, Chalmers Tekniska Högskola)

Lemmetty, S. (1999). Review of Speech Synthesis Technology. Dept. of Electrical and Communications Engineering, Helsinki University of Technology.

ModelTalker (1997). University of Delaware (ASEL). URL: http://www.asel.udel.edu./speech/Dsynterf.html

SAMPA (1999). URL: http://www.phon.ucl.ac.uk/home/sampa/home.htm

SKOPE. NLP Laboratory in Pohang, Korea. URL: http://nlp.postech.ac.kr/

Telia Promotor (1999). Infovox 330 information sheet. URL: http://www.infovox.se

TreeTalk. ILK Research Group, Tilburg University (1999). URL: http://ilk.kub.nl/

van den Bosch, A. (1997). Learning to pronounce written words. A study in inductive language learning. Ph.D. Thesis, Universiteit Maastricht, The Netherlands. Cadier en Keer: Phidippides.

Weiss, S. and C. Kulikowski (1997). Computer systems that learn. Morgan Kaufmann Publishers, Inc.

Appendix 1: Phoneme set

Consonants
p    pil        p´i:l
b    bil        b´i:l
t    tal        t´A:l
d    dal        d´A:l
k    kal        k´A:l
g    gås        g´o:s
f    fil        f´i:l
v    vår        v´o:r
s    sil        s´i:l
S    sjuk       S´}:k
h    hal        h´A:l
C    tjock      C´Ok
m    mil        m´i:l
n    nål        n´o:l
N    ring       r´IN
r    ris        r´i:s
l    lös        l´2:s
j    jag        j´A:g
Rt   hjort      j´URt
Rd   bord       b´u:Rd
Rn   barn       b´A:Rn
Rs   fors       f´ORs
Rl   sorl       s´o:Rl

Vowels
i:   vit        v´i:t
e:   vet        v´e:t
E:   säl        s´E:l
y:   syl        s´y:l
}:   hus        h´}:s
2:   föl        f´2:l
u:   sol        s´u:l
o:   hål        h´o:l
A:   hal        h´A:l
I    vitt       v´It
e    vett       v´et
E    rätt       r´Et
Y    bytt       b´Yt
u0   buss       b´u0s
2    föll       f´2l
U    bott       b´Ut
O    håll       h´Ol
a    hall       h´al
@    pojken     p`Ojk@n     schwa (unstressed syllables)
9:   för        f´9:r       2: before r
9    förr       f´9r        2 before r
{:   här        h´{:r       E: before r
{    herr       h´{r        E before r
{    abstract   ´{bstr{kt
3    Churchill  tC´3tCIl

Stress
´    läser, lek    l´E:ser, l´e:k   primary stress in words with acute accent
`    pratar        pr`A:tar         primary stress in words with grave accent
1    Kristineberg  krIst1In@b´{rj   secondary stress 1 (before stress)
2    abborre       `ab2Or@          secondary stress 2 (after stress)

Length
:    sol        s´u:l

Appendix 2: Grapheme-Phoneme Alignments

- is the 'null' phoneme, e.g. in barn /b A: - Rn/ 'child'

Consonants
b    b, p, v
c    C, g, k, Rs, s, S, tC, tS, ts
d    d, g, Rd, Rt, t
f    f, v
g    C, d, dC, dj, g, I, j, k, N, Rs, S, s, tC
h    h, S
j    C, dj, I, j, Rs, S
k    C, g, j, k, S
l    @l, l, lj, Rl
m    m
n    m, n, N, nj, Nj, Rn
p    b, f, p
q    k
r    @r, r
s    Rs, s, S
t    C, d, ds, Rt, RtRs, Rts, RtS, S, s, t, ts, tS
v    f, v
w    U, v
x    gs, ks, kS, s
z    Rs, Rts, s, ts

Vowels
(phonemes marked * also occur with the stress variants ', `, ¹ and ²)

a    A:*, E:*, o:*, {:*, e:*, E:j*, a*, E*, O*, {*, Ej*, ej*, au0*, u0, I, @
à    a, 'a
e    e:*, E:*, A:*, 9:*, i:*, {:*, }:*, j}:*, 2:*, ju:*, jU, e*, E*, 9*, I*, {*, a*, O*, 2*, 3*, @*, @:*, j
é    e:*, e*, E*, {*, @
i    i:*, 9:*, E:*, A:*, I*, 9*, E*, j, Ij*, @, a*, aj*
o    u:*, o:*, 2:*, U*, OU*, O*, a*, @, u0, au0*
u    }:*, j}:*, ju:*, u:*, y:*, 9:*, 3*, 2:*, @:*, u0*, O*, @*, Y*, jU, v, U*, 9*, 2*, j9, a*, f, p, j
ü    Y*, y:*
y    y:*, i:*, }:*, Y*, 9*, aj*, I, 'I, j
ä    E:*, e:*, {:*, E*, {*, e*, @
å    o:*, O*, U, @
ö    2:*, 9:*, 2*, 9*, @
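The alignments above turn each word into one training instance per letter. Since Appendix 3 lists weights for seven features, the windows presumably span three letters of left context, the focus letter, and three letters of right context. A minimal sketch of this kind of windowing (the function name and the '_' padding symbol are illustrative, not taken from TiMBL):

```python
def make_instances(graphemes, phonemes, context=3):
    """Build fixed-width letter windows from a word aligned one phoneme
    per letter: `context` letters of left context, the focus letter, and
    `context` letters of right context -> 2*context+1 features."""
    padded = ["_"] * context + list(graphemes) + ["_"] * context
    instances = []
    for i, target in enumerate(phonemes):
        window = tuple(padded[i:i + 2 * context + 1])
        instances.append((window, target))
    return instances

# barn /b A: - Rn/, with '-' as the null phoneme (see Appendix 2)
for window, target in make_instances("barn", ["b", "A:", "-", "Rn"]):
    print(" ".join(window), "->", target)
```

Each instance pairs a seven-letter window with the phoneme of its middle letter, e.g. `_ _ _ b a r n -> b` for the first letter of barn.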

Appendix 3: Weights for the 10-FC tests for test one and test two

10-FC No.   Feature No.   Test 1      Test 2
1           1             0.0386973   0.0699462
            2             0.0705866   0.0943413
            3             0.17649     0.188165
            4             0.884009    0.885378
            5             0.197349    0.209649
            6             0.0799479   0.109406
            7             0.0385612   0.0679605
2           1             0.0386809   0.0698629
            2             0.0706289   0.0943362
            3             0.176579    0.188197
            4             0.884016    0.885382
            5             0.19723     0.209466
            6             0.0800135   0.109371
            7             0.0385759   0.0680091
3           1             0.0386612   0.0698729
            2             0.0704208   0.094133
            3             0.176375    0.188041
            4             0.884019    0.885382
            5             0.197384    0.209705
            6             0.0799155   0.109294
            7             0.0385259   0.068033
4           1             0.0386875   0.0698645
            2             0.0703911   0.0941452
            3             0.176533    0.188157
            4             0.883937    0.885289
            5             0.197287    0.209546
            6             0.0799688   0.109299
            7             0.0386878   0.0680768
5           1             0.0385952   0.0698461
            2             0.0704425   0.0942321
            3             0.17636     0.18802
            4             0.883947    0.885338
            5             0.197284    0.209581
            6             0.0800335   0.109462
            7             0.0385945   0.0680154
6           1             0.0387341   0.0699983
            2             0.0704951   0.0942467
            3             0.176559    0.188169
            4             0.883879    0.885256
            5             0.197362    0.209611
            6             0.0801244   0.109473
            7             0.0387153   0.068205
7           1             0.0387031   0.0699535
            2             0.07059     0.0943654
            3             0.176481    0.188162
            4             0.883881    0.885255
            5             0.197373    0.209691
            6             0.0802456   0.109608
            7             0.0386435   0.0680955
8           1             0.038786    0.0699715
            2             0.0706076   0.0943365
            3             0.176576    0.188197
            4             0.883836    0.88521
            5             0.197481    0.209734
            6             0.0801152   0.109478
            7             0.0385565   0.0679621
9           1             0.0387808   0.0700399
            2             0.0706148   0.094406
            3             0.176592    0.188235
            4             0.883978    0.885361
            5             0.197433    0.209697
            6             0.0801076   0.10947
            7             0.0386346   0.0679617
10          1             0.0386218   0.069772
            2             0.0706354   0.0944017
            3             0.176547    0.188237
            4             0.883902    0.885269
            5             0.19737     0.209643
            6             0.0800043   0.109359
            7             0.0385785   0.067969
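In every fold, feature 4 (the focus letter) receives by far the highest weight, around 0.88 against roughly 0.04-0.21 for the context letters. The TiMBL reference guide describes weighting features by information gain; a minimal sketch of that computation under this assumption (the toy data and function names are illustrative, not TiMBL's actual code or the thesis data):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, feature_index):
    """IG of one feature: H(C) - sum_v P(v) * H(C | feature = v)."""
    groups = defaultdict(list)
    for inst, lab in zip(instances, labels):
        groups[inst[feature_index]].append(lab)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy 3-letter windows (left, focus, right) -> phoneme class
X = [("_", "b", "a"), ("b", "a", "r"), ("a", "r", "n"), ("r", "n", "_"),
     ("_", "b", "i"), ("b", "i", "l"), ("i", "l", "_")]
y = ["b", "A:", "-", "Rn", "b", "i:", "l"]

weights = [information_gain(X, y, i) for i in range(3)]
# The focus letter predicts the phoneme best, so its weight is highest,
# mirroring feature 4's dominance in the table above.
print(weights)
```

In a memory-based learner these weights then scale the per-feature contribution to the distance between a new instance and the stored instances, so a mismatch on the focus letter counts far more than a mismatch in the outer context.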