CLASS: Child Language Acquisition and Speech Synthesis


PROJECT PROPOSAL

COM6780 Darwin Research Project

Christopher Northwood, Jennifer Pandian, Zhan Peng, David Rhodes, Xingyi Song, Xu Zhai, Yang Zhao

Supervisors: Robin Hofe, Guy Aimetti, Phil Green






ABSTRACT

CLASS (Child Language Acquisition and Speech Synthesis) is a project aiming to mimic how children learn the ability to speak. It builds on the work of ACORNS (Boves, ten Bosch, & Moore, 2007), a wider project which aimed to model child acquisition of verbal language.

The CLASS system described in this proposal will extend the DP-ngrams approach of Aimetti (2009), developed as part of the ACORNS project, into an active system consisting of a feedback loop connecting a speech synthesiser to a modified DP-ngrams scorer. The scorer guides a learning controller, which varies the synthesiser's input parameters to produce as close a match as possible to how the CLASS system believes the word should sound.

A number of algorithms are proposed for the learning controller: simulated annealing, genetic algorithms and neural networks. Each of these algorithms will be tested and evaluated in order to discover an optimal mechanism for learning how to speak.

The system will be tested first against simple sounds generated by a synthesiser, and then simple sounds generated by a human speaker, before moving on to more complex monosyllabic utterances by both machine and human.

The final goal is to produce a system that can learn how to reproduce the sounds that it has heard and subsequently remembered.



TABLE OF CONTENTS

Abstract
Table Of Contents
1  Introduction
2  Project Background
   2.1  ACORNS
   2.2  Child Language Acquisition
   2.3  Speech Synthesis
   2.4  Speech Recognition
   2.5  Machine Learning
3  Work Plan
   3.1  System Architecture
   3.2  Speech Synthesis
   3.3  Recognition And Scoring
   3.4  Learning Controllers
   3.5  System Integration
   3.6  System Evaluation
   3.7  Timescale
4  Bibliography


1 INTRODUCTION

CLASS is a system that aims to model how children learn speech. It builds significantly on the ACORNS project (section 2.1), which is a model of how children acquire human communicative language ability.

The ACORNS system is a passive system in that it only learns meaning. One implementation of ACORNS of interest here is the system developed by Aimetti (2009). CLASS aims to add an active aspect to ACORNS, combining a speech synthesiser with the model in order to form a system that actively learns how to speak. The aim of CLASS is to develop this system so that it can generate simple words and syllables based on the words discovered as part of the ACORNS system, requiring no other a priori knowledge.

Unsurprisingly, CLASS is not the first project to tackle this problem (Howard & Huckvale, 2005), but it differs from previous projects in its approach and in its integration into the greater ACORNS project, such as for the discovery of the "phonological targets".

CLASS aims to build a baseline system as described in section 3, alongside different "learning controllers" which control how speech is generated over time, to try to derive an optimal strategy for speech acquisition. CLASS is intended to be a short project, lasting under four months, and the work planned bears this short timeframe in mind.

CLASS is an interdisciplinary project drawing from a number of fields including computer science, signal processing, linguistics and psychology, as well as building specifically on the ACORNS project. A brief background of the relevant fields and topics is included in section 2 below.





2 PROJECT BACKGROUND

2.1 ACORNS

ACORNS (Acquisition of Communication and Recognition Skills) was a project in the Information Society Technologies 6th Framework Future and Emerging Technologies programme of the European Union. It lasted for three years, starting on the 1st December 2006 and running until the 30th November 2009.

ACORNS aimed to build and evaluate mathematical models and computational mechanisms to build an artificial intelligence agent with the ability to acquire human verbal communication behaviour.

What distinguishes the ACORNS project from other artificial intelligence agents currently available within the field of language acquisition is that this unconventional speech recogniser system models how real-life human toddlers acquire communication skills. This was achieved by developing a system that recognises acoustic signals, guided by the agent's intentions to fulfil basic needs. ACORNS is a cross-modal system, incorporating the raw acoustic signal and a pseudo-visual modality. Another key distinguishing factor between ACORNS and other systems is that the learning agent does not require a priori assumptions about the phonological units.

ACORNS defined the following areas of development for the agent:

1. "representations of acoustic signals in multiple parallel temporal and spectral resolutions,
2. methods for patterning signals into coherent structures that correspond to potentially meaningful acoustic events,
3. methods for building and maintaining dynamic emergent patterns in memory,
4. methods for searching in an associative memory, and
5. methods for handling natural interaction between an artificial agent and a human, including verbal and non-verbal (affective) information." (Os, 2009)

2.2 CHILD LANGUAGE ACQUISITION

Interest in child language acquisition dates back to the times of Plato, yet it is only recently that scientific interest in the area has exploded. There are two main competing theories of child language acquisition: the nativist view and the empiricist view.

2.2.1 NATIVISM

The nativist view is dominated by the research of Noam Chomsky, stemming from his seminal 1959 paper (Chomsky, 1959), in which he attacked the behaviourist, or empiricist, view of child language acquisition, demanding a more scientific approach to understanding language acquisition.



The basic premise of the nativist approach is that children are equipped with some language acquisition device with an embedded Universal Grammar, with which the child discovers the rules of the language in order to build a restricted version of the Universal Grammar that corresponds to the language they are attempting to acquire (Pinker, 1994).

2.2.2 EMPIRICIST

The empiricist view dates back a lot further than the nativist approach. Locke's influential 1690 work "An Essay Concerning Human Understanding" states that the mind is born as a "tabula rasa", with powerful learning devices in use to fill the mind with the various faculties it needs, language among them; this forms the basic premise of the empiricist view of language acquisition (Stanford Encyclopedia of Philosophy, 2007).

More recently, work started to emerge in the 1980s criticising the prevalent view of nativism, stating that language learning is not itself innate but arises from the general learning mechanisms of children (Bates, Elman, Johnson, Karmiloff-Smith, Parisi, & Plunkett, 1998).

ACORNS took the empiricist view in developing their system, with specific attention being paid to a statistical learning model. The statistical learning model also puts together a strong argument for the discovery of phonemes and syllables (in particular, repeating patterns), which is important for our project. As speech is continuous, a child must discern where the boundaries between words lie, and statistical models have been presented which adequately model this (Saffran, Aslin, & Newport, 1996).

2.2.3 ANALYSIS

Much remains to be discovered within the realm of child language acquisition, and as of yet there is no single dominant theory, with these two main approaches both presenting strong arguments. Considering that ACORNS based itself in the empiricist view, and that statistical learning lends itself easily to computer modelling, it appears to make sense to use this model in developing CLASS.

2.3 SPEECH SYNTHESIS

Speech synthesis is the field concerned with the artificial production of human speech. Technologies for generating synthetic speech waveforms include concatenative synthesis, based on the principle of concatenating speech fragments, and formant synthesis, which uses an acoustic model rather than human speech samples to generate the speech. A third method is articulatory synthesis, based on models of the human vocal tract. A number of implementations are available for each technology (Holmes & Holmes, 2002).

Concatenative synthesis is clearly unsuitable for CLASS as it does not model any known behaviour of children: children do not simply record and play back what they have heard (indeed, they have no capability to do so) but instead attempt to mimic it.

2.3.1 HOLMES SYNTHESISER

The Holmes synthesiser (Holmes & Holmes, 2002) is a well-known parallel formant synthesiser that takes a parametric description of a sound in terms of 10 ms frames to form a complete utterance. However, because the Holmes system is based on a source-filter model, it can produce sounds that are impossible for the human vocal tract to produce.

2.3.2 PRAAT

Praat is a popular speech analysis program that contains an articulatory speech synthesis model. The advantages of an articulatory model in CLASS are clear: it models the human vocal tract, and mimicking human behaviour is a core aim of this project. Although Praat is less flexible in producing sound than the Holmes synthesiser, these constraints limit Praat to sounds that can be physically produced by the human vocal tract, and as such it is more realistic. Constraining our search space to realistic sounds in this way also makes the job of our learning controllers easier.

One of the goals of CLASS is to model how children acquire language. As articulatory synthesisers such as Praat model most closely how humans produce sound, this seems to be the most appropriate type of synthesiser to use in CLASS.

2.4 SPEECH RECOGNITION

2.4.1 MEL FREQUENCY CEPSTRUM COEFFICIENTS (MFCCS)

Human speech is produced by causing resonance in the air expelled by the lungs, creating turbulence or impeding the vibrating air. The vocal tract shapes the sound, causing it to resonate at different frequencies, the most important of which are called formants. The frequencies of the formants, and how they change with time, are a crucial part of determining meaning in the sound (Holmes & Holmes, 2002).

In order to better analyse a sound signal, it is useful to split it into frames, each containing a few milliseconds of the sound. Performing a discrete cosine transformation (DCT) on the log magnitude of a Fourier transformation of the waveform in each frame produces a cepstrum, representing the quefrency domain of the sound. A frame of 20-25 ms will resolve the important frequencies produced by the vocal tract as well as the excitation sounds from the vocal cords.

When humans process sound, some frequency bands are weighted more than others, increasing the relative value of the formants that are important in determining meaning in speech. To mimic this behaviour, sound can be passed through a Mel filter bank (a set of triangular filters that weight the sound based on frequency). Applying the Mel filter bank within the cepstral analysis produces Mel Frequency Cepstrum Coefficients (MFCCs): a set of quefrency bands for each frame. MFCCs have the advantage of normalising the vocal tract length, and the features in an MFCC are statistically independent, making them easier to use in statistical methods.



2.4.2 COMPARING SOUNDS

An easy way to compare two sounds is to extract the important features in MFCC format and treat each frame as a point in multidimensional space, where the number of dimensions is the number of MFCC features. One frame can then be compared to another by working out the Euclidean distance between the two points, with smaller distances indicating better matching sounds. Other measures of distance can be used, which will scale differently as the distance between the two frames increases.

As speech rate varies, both globally and in sections of words, from one utterance to the next, a method of determining which frame to compare to another is needed. For the purposes of this research project, Dynamic Time Warping (DTW) will be used to bring the two sounds into the best alignment according to whichever scoring method is chosen.

2.4.3 DYNAMIC PROGRAMMING

Dynamic programming (DP) is a technique for solving difficult problems by breaking them down into simpler subproblems. DTW is a technique that uses DP to align two sequences globally (as used for isolated word template recognition; Sakoe & Chiba, 1978). A modified version of DTW allows for local alignment, in which well-matching sections are aligned together (Aimetti, 2009; Park & Glass, 2008). DTW uses the assumption that, at every point where it is operating, the best alignment has already been found for the sequence up to that point; DP then compares the next two objects in the sequence and decides whether they should be aligned together or a gap placed in one of the two sequences. This is achieved by generating a grid of scores for alignments (either matches or gaps) together with a set of pointers which show the best path through the grid; by working backwards through the pointers, the best alignment is determined. For global alignments the path is tracked from the final corner of the grid back to the start; for local alignments, maximal scores in a path are tracked back to their origin (a zero-scoring point), which allows multiple alignments of varying length to be determined in one scoring grid.

For example, matching the words "PPAATTEERRNN" and "PATTERN" may end with an alignment as follows:

    long word:          PPAATTEERRNN
    aligned short word: P-A-TTE-R-N-

As can be seen, the gaps placed in the shorter sequence allow the best matching letters to align, giving the best overall alignment. The warping effect can be across a whole word or just a small section of it; if one utterance has a faster beginning but an elongated end compared to another, then gaps can be placed in both sounds until they align with the best score.

For example, matching the words "PPAATTERN" and "PATTEERRNN" may end with an alignment as follows:

    aligned first word:  PPAATTE-R-N-
    aligned second word: P-A-TTEERRNN



Dynamic time warping is so named because it has the effect of stretching out faster sounds to match a slower sound by adding gaps. The alignment derived depends on how alignments are scored and how much of a penalty is given to placing gaps. Differing scoring techniques may prefer one alignment over another, or determine the start and end points of an alignment differently; for example, using the squared Euclidean distance rather than the Euclidean distance will result in scores dropping off more rapidly as the sequences become less alike, and may result in smaller alignments. The gap penalty can be adjusted to punish time warping in different ways: if small gaps are to be expected due to natural variation in speech speed, it may be desirable to have a low penalty for very small gaps but increasing penalties for gaps that would normally be too long for the variation in normal speech. "Generally speaking, if the slope constraint is too severe, then time-normalization would not work effectively. If the slope constraint is too lax, then discrimination between speech patterns in different categories is degraded." (Sakoe & Chiba, 1978)

In ACORNS, local alignments are identified in sentences of speech to find the matching word that relates to the flag supplied with the input. CLASS has similar requirements for determining the similarity of one sound relative to another, so using DTW in a modified version of the implementation of Aimetti (2009) would allow the matching and scoring of individual phonemes in the word.
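The global DTW grid described above can be sketched in a few lines of Python. This is an illustrative minimum, not Aimetti's local-alignment variant: frames are compared by Euclidean distance, the grid of cumulative costs is filled, and the alignment path is recovered by working backwards through the grid; gap penalties and slope constraints are omitted.

```python
import numpy as np

def dtw(a, b):
    """Globally align two sequences of feature frames (lists of vectors).

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs mapping frames of `a` onto frames of `b`.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(a[i - 1], float) - np.asarray(b[j - 1], float))
            cost[i, j] = d + min(cost[i - 1, j],       # stretch a (gap in b)
                                 cost[i, j - 1],       # stretch b (gap in a)
                                 cost[i - 1, j - 1])   # align the two frames
    # Backtrace: follow the cheapest predecessor from the final corner.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]
```

Aligning an utterance with a time-stretched copy of itself yields zero cost, with the repeated frames of the slower utterance mapped onto a single frame of the faster one.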

2.5 MACHINE LEARNING

2.5.1 HILL CLIMBING ALGORITHMS

Simulated annealing is a common hill climbing algorithm that probabilistically accepts non-improving moves in order to avoid getting stuck at local optima. It takes this idea from annealing, a process of cooling metal, and as such starts with a high temperature which cools down over time. The temperature is used as a factor in deciding when a worsening move should be accepted, and how much of a worsening is acceptable. Improving moves are always accepted, and the algorithm eventually reduces to hill climbing (Clark & Jacob, n.d.).

2.5.2 GENETIC ALGORITHMS

Genetic algorithms (GAs) are a search technique used in optimisation and search problems to find approximate or exact results, and are classified as a global search heuristic. They form a class of evolutionary algorithms inspired by Darwin's theory of evolutionary biology. A random population of "chromosomes" (members of the solution domain given some "genetic representation") is generated, and the fitness of each individual in the population is computed by some fitness function, where a higher value indicates a higher chance to "reproduce". Chromosomes are then randomly selected based on their fitness, modified (using the ideas of recombination and mutation from evolutionary biology), and used in the next iteration of the algorithm. A GA aims to produce new generations of solutions that are better than the previous generations, until some optimum of the fitness landscape is reached (Goldberg, 1989; Mitchell M., 1998).



2.5.3 NEURAL NETWORKS

Artificial neural networks (ANNs) are somewhat of a current buzzword in computing and the wider world, due to the attraction of modelling how the brain works, yet the term is vague and difficult to define precisely. ANNs are considered to be a good model for representing the human brain, a complex dynamic system. One common definition is that an ANN is a network containing many simple processing elements, which can be tied together to become powerful and capable of describing complicated behaviours.

Mitchell gives a more precise definition: "Artificial neural networks provide a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples." ANN learning is robust to errors in the training data and has been successfully applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies (Mitchell T., 1997).


3 WORK PLAN

3.1 SYSTEM ARCHITECTURE

The system we propose to build for CLASS will be based on a modified version of an ACORNS implementation. The ACORNS project includes a sensory store that will extract feature vectors from an input (Chatterjee, Koniaris, & Kleijn, 2009), an alignment and scoring system (Aimetti, 2009), plus some memory architecture (Boves, ten Bosch, & Moore, 2007). All of these features can be re-used for the task of speech learning with little modification, with the addition of a speech synthesiser and a controller. The controller will feature each of the learning algorithms we wish to test, and these will be enabled in turn to learn the production of an utterance.


















[Figure 1 shows the following components and connections: the Learning Controller sends parameters to the Speech Synthesiser; the synthesiser's speech enters the Sensory Store; processed inputs pass to Scoring, whose match score returns to the Learning Controller; Working Memory, Long Term Memory and Very Long Term Memory are linked by storing and retrieval; a Carer provides iconic feedback.]

FIGURE 1 - HIGH LEVEL OVERVIEW OF SYSTEM

As figure 1 shows, our system includes a carer. One important aspect of language acquisition is feedback and social cues from a child's caregiver. By incorporating a caregiver into our system in this way, we model this behaviour by giving the scorer cues based on how a human perceives the quality of the sound that is produced.

Our project aims to compare the performance of the different machine learning algorithms within CLASS to discover which learning controller is optimal. We must therefore implement a different learning controller module for each of the algorithms to be evaluated.

Our first task will be to build a baseline version of the essential modules. This has been split into work units and expanded upon below.

3.2 SPEECH SYNTHESIS

3.2.1 SYNTHESISER INTEGRATION

Although Praat contains the built-in speech synthesis capabilities we will be using, the synthesiser will require an interface between it and the modified ACORNS system implemented in MATLAB. The interface must be able to receive variables for the word to be generated from the learning controller and output the word produced to the sensory store for processing.
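One way such an interface might drive Praat is sketched below in Python (the real interface will sit in MATLAB; this is purely illustrative). It assumes Praat's command-line `--run` invocation and a hypothetical synthesis script `synth.praat` that accepts numeric parameters and writes a WAV file; the script name and its parameters are placeholders, not part of this proposal.

```python
import subprocess

def build_praat_command(script_path, params, praat_bin="praat"):
    """Build the command line that asks Praat to run a synthesis script.

    `script_path` and the parameter list are placeholders here; the real
    script and its arguments depend on the articulatory model used.
    """
    return [praat_bin, "--run", script_path] + [str(p) for p in params]

def synthesise(script_path, params, praat_bin="praat"):
    """Hand the command to Praat; the script is expected to write a WAV file."""
    subprocess.run(build_praat_command(script_path, params, praat_bin), check=True)
```

The learning controller would then only need to supply a parameter vector, and the resulting WAV file would be fed to the sensory store.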

3.2.2 DATASET CREATION

Part of the synthesiser work unit involves the creation of the evaluation dataset. As discussed below, the system will be tested on monosyllabic synthesised and "real" voices, and these need to be created.

The synthesised sounds will be created using Praat, to allow for closer alignments between the output sounds and the input ones in the first stage of testing. A dataset consisting of monosyllabic Praat-generated utterances will have to be created and learnt by the initial CLASS recogniser for the initial experiments. This dataset will consist of utterances of different complexities, ranging from simple vowel sounds to more complex consonant structures and simple consonant-vowel syllables.

A second set of files will incorporate human speech using the same complexity classes as before. This could be produced either by manually recording new sounds or by extracting individual sounds from an existing dataset, such as the ACORNS dataset.

3.2.3 FUNCTIONAL TESTING

A simple test will be to check whether a sound created using the interface is similar to one generated using the same input parameters in Praat directly. Judgement of whether the sounds are identical could be made subjectively by ear, or by checking that they align perfectly using the scorer.

3.2.4 WORKLOAD

Task Name | Estimated Workload | Assigned Person
Creation of interface for speech synthesiser and integration into overall system | 2 weeks | Jennifer Pandian, Xingyi Song
Functional testing and bug fix | 1 week | Jennifer Pandian
Dataset creation - synthesised | 1 week | Jennifer Pandian
Dataset creation - human | 1 week | Jennifer Pandian

3.3 RECOGNITION AND SCORING

3.3.1 OVERVIEW

The alignment system of ACORNS will be modified to output alignment scores to the learning controller, which can then evaluate the score and decide how to generate the next word. The long term memory will need to store a representation and score for the current best match for a generated word, which can be updated as and when the learning process produces a better version of the word to be learnt. The best matching word should be generated and output to the carer when requested, for a subjective analysis of how well the sound matches the training word.

Initial scoring will be based on either the percentage of the target word that aligns with the synthesised word (how much of the target word has been produced), the average Euclidean distance between the frames of the alignment (how well the parts that align match the target word), or preferably a combination of both.
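The combined metric suggested above might look as follows. The equal weighting and the squashing of the distance term into [0, 1] are illustrative assumptions, not the project's chosen scoring; `path` is assumed to be the list of (target frame, synthesised frame) index pairs produced by the aligner.

```python
import numpy as np

def alignment_score(path, target_len, target, synth, weight=0.5):
    """Combine alignment coverage with average frame distance.

    `path` is a list of (target_frame, synth_frame) index pairs from the
    aligner; frames are MFCC vectors. Both terms lie in [0, 1]; the 50/50
    weighting is an illustrative choice.
    """
    covered = len({i for i, _ in path}) / target_len      # fraction of target aligned
    dists = [np.linalg.norm(np.asarray(target[i], float) - np.asarray(synth[j], float))
             for i, j in path]
    closeness = 1.0 / (1.0 + np.mean(dists))              # 1 when aligned frames match exactly
    return weight * covered + (1 - weight) * closeness
```

A perfect reproduction (every target frame aligned, all distances zero) scores 1.0, and the score falls as coverage drops or the aligned frames drift apart.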

3.3.2 FUNCTIONAL TESTING

To test the recogniser and scoring system, a sound will be produced in Praat as the word to be learnt, and that sound plus several other random sounds will be supplied in random order to the scoring system. The alignment system should correctly match the identical sound as the best match and store it in memory.

3.3.3 WORKLOAD

Task Name | Estimated Workload | Assigned Person
Modification of alignment scoring system | 3 weeks | David Rhodes
Alignment functionality testing and bug fix | 1 week | David Rhodes

3.4 LEARNING CONTROLLERS

3.4.1 OVERVIEW

The learning controller must be able to initiate the production of a word by sending appropriate input values to the synthesiser, and formulate new inputs when the alignment score is received from working memory.

An initial learning controller will be produced to provide a baseline for the other learning controllers to be compared against: that is, one that randomly generates the outputs to the synthesiser module with disregard to the changing input over time.

The three individual learning controllers we have chosen to perform our experiments with are a simulated annealing hill climbing approach, genetic algorithms and an artificial neural network system. These learning controllers should respond to the input from the scoring/recogniser module and vary their output to the speech synthesiser as appropriate, thus converging to some optimum score.
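The baseline controller described above, which samples parameters at random and ignores past scores when choosing the next attempt while remembering the best utterance so far, can be sketched as follows (`synthesise` and `score_fn` are stand-ins for the real synthesiser interface and scorer):

```python
import random

def random_baseline(synthesise, score_fn, param_ranges, iterations=100):
    """Baseline controller: sample synthesiser parameters uniformly at
    random, disregarding past scores, and remember the best utterance."""
    best_params, best_score = None, float("-inf")
    for _ in range(iterations):
        params = [random.uniform(lo, hi) for lo, hi in param_ranges]
        s = score_fn(synthesise(params))
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score
```

Any learning controller worth keeping should beat this random search for the same number of synthesis attempts.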



3.4.2 FUNCTIONAL TESTING

For the initial controller, one test to make sure the learning loop functions correctly will be to produce sounds in a systematic fashion, iterating through values for the different inputs to Praat and testing each sound. The alignment score received from the scorer would be the trigger to generate the next sound, but would not be evaluated by the learning controller.

3.4.3 WORKLOAD

Task Name | Estimated Workload | Assigned Person
Initial baseline learning controller | 1.5 weeks | Zhan Peng, Chris Northwood, Xingyi Song, Xu Zhai, Yang Zhao
Baseline functionality testing | 0.5 weeks | Zhan Peng, Chris Northwood, Xingyi Song, Xu Zhai, Yang Zhao
Neural network learning controller | 3 weeks + 1 week testing/bug fix | Yang Zhao, Zhan Peng
Genetic algorithm learning controller | 3 weeks + 1 week testing/bug fix | Xingyi Song, Xu Zhai
Simulated annealing learning controller | 3 weeks + 1 week testing/bug fix | Chris Northwood

3.5 SYSTEM INTEGRATION

In order for the new modules to function, the ACORNS system will need to be modified to accommodate the learning loop. Once these features are implemented, it should be possible to initiate a learning loop for a word, changing the generated word on each loop and storing the best scoring alignment in long term memory for retrieval whenever the carer requests it. Testing of the system would then involve using different learning methods, and possibly different scoring systems, to find how each performs.

3.5.1 WORKLOAD

Task Name | Estimated Workload | Assigned Person
Integration | 1 week | David Rhodes

3.6 SYSTEM EVALUATION

3.6.1 EVALUATION METHODS

The first set of experiments will try to learn a monosyllabic sound previously generated by Praat, as this should be the easiest to replicate and match. This will test the functionality of the different learning algorithms and allow the scoring system to be varied, to see if one produces better or faster matches than the others. Provided that these simple words can be replicated to a reasonable degree, further experiments will then proceed with more difficult words.



Once the experiments involving synthesised monosyllabic words are complete, the difficulty of the words to be learnt will be increased to analyse the performance of each learning algorithm. The next difficulty step will be to learn human-produced monosyllabic words; the full details of this dataset are discussed above.

It is likely that the time constraints of this research will not allow us to perform any more complex tests; however, if possible, the next stages of testing would increase word difficulty further, first with multi-syllable machine-produced words and finally with multi-syllable human-produced words.

3.6.2 EVALUATION METRICS

There are two main methods for evaluating how accurately a learnt word reproduces the training word. The first is to use the scoring system already implemented as part of the learning process. The learnt word will have the highest score of all the inputs generated by the learner, but may not be an exact match. By comparing its score to the score of aligning the training word with itself, an objective measure of how similar the sounds are can be produced. The drawback of this is that the scoring system may attribute high scores based on features in the sound that are not of phonetic importance to the word being produced.
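The self-alignment comparison can be expressed as a simple ratio. This assumes a similarity-style score in which higher is better and a sound scores highest against itself; for a distance-style score (where self-alignment costs zero) the normalisation would need a different form.

```python
def normalised_similarity(similarity, target, learnt):
    """Scale the learnt word's score by the target's self-alignment score.

    `similarity` is any function where higher means more alike and a
    sound compared with itself scores highest, so the result is 1.0 for
    a perfect reproduction and falls toward 0 as the match worsens.
    """
    return similarity(target, learnt) / similarity(target, target)
```

This gives the objective, listener-independent measure described above while keeping the raw scorer unchanged.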

The second method of evaluation is to play back the training and learnt words to a human listener for evaluation. Human listening tests will score similarity based on how well the learnt word replicates the meaning of the training word, without valuing parts of the sound that are of no phonetic importance (that is, only those parts which convey meaning are considered). The drawback of this measure is that it is subjective, and different listeners may score the same sounds differently.
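One way to handle that subjectivity is to average ratings over several listeners and report the spread alongside the mean. A minimal sketch, assuming each listener rates similarity on a fixed numeric scale (e.g. 1 to 5):

```python
from statistics import mean, stdev

def summarise_ratings(ratings):
    """Aggregate per-listener similarity ratings for one learnt word.

    The mean gives a single comparable score; the standard deviation
    surfaces the subjectivity problem directly, since a large spread
    means listeners disagreed about the same pair of sounds.
    """
    return {
        "mean": mean(ratings),
        "spread": stdev(ratings) if len(ratings) > 1 else 0.0,
    }
```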

The last consideration is that of efficiency: the time and memory requirements of the system when learning a word. This is not of major importance, as the system does not need to run in real time or on devices with very limited memory. However, if the time or space complexity becomes so high that carrying out the tests required for this research is infeasible, then the learner will either need to be adjusted or removed from the research programme.
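Measuring those requirements need not be elaborate; the Python standard library is enough to wrap one learning run with wall-clock and peak-memory measurement. The `learn` callable here is hypothetical, standing in for whatever drives a single learning run:

```python
import time
import tracemalloc

def profile_learning_run(learn, *args):
    """Run `learn(*args)` and report elapsed time and peak memory.

    Returns (result, elapsed_seconds, peak_bytes). Runs whose cost
    grows infeasibly large can then be spotted before they threaten
    the experimental schedule.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = learn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # peak allocation in bytes
    tracemalloc.stop()
    return result, elapsed, peak
```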

3.6.3  WORKLOAD

Task Name                Estimated Workload   Assigned Person
Evaluation               2 weeks              All
Report creation          2 weeks              All
Poster creation          2 days               All
Journal paper creation   1 week               All





3.7  TIMESCALE

The above timescale shows the dependencies and computed timescale of the various work plans and workloads set out above, along with the persons responsible for each area. The grey area shows the University's Easter break.


KEY

CN   Christopher Northwood
JP   Jennifer Pandian
ZP   Zhan Peng
DR   David Rhodes
XS   Xingyi Song
XZ   Xu Zhai
YZ   Yang Zhao



4  BIBLIOGRAPHY

Aimetti, G. (2009). Modelling Early Language Acquisition Skills: Towards a General Statistical Learning Mechanism. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics.

Bates, E., Elman, J., Johnson, M., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1998). Innateness and Emergentism. In W. Bechtel, & G. Graham, Companion to Cognitive Science (pp. 590-601). Oxford: Basil Blackwell.

Boves, L., ten Bosch, L., & Moore, R. (2007). ACORNS -- towards computational modeling of communication and recognition skills. The 6th IEEE International Conference on Cognitive Informatics.

Chatterjee, S., Koniaris, C., & Kleijn, W. B. (2009). Auditory Model Based Optimization of MFCCs Improves Automatic Speech Recognition Performance. 10th Annual Conference of the International Speech Communication Association. Brighton.

Chomsky, N. (1959). A Review of B. F. Skinner's Verbal Behavior. Language, 35(1), 26-58.

Clark, J. A., & Jacob, J. L. (n.d.). Two-Stage Optimisation in the Design of Boolean Functions. Retrieved from http://www-users.cs.york.ac.uk/~jac/PublishedPapers/Presentations/Aust1.ppt

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning, 1st edition. Addison-Wesley Longman Publishing Co., Inc.

Holmes, J., & Holmes, W. (2002). Speech Synthesis and Recognition, 2nd Edition. Taylor & Francis, Inc.

Howard, I. S., & Huckvale, M. A. (2005). Learning to Control an Articulator Synthesizer by Imitating Real Speech. ZAS Papers in Linguistics, 40, 63-78.

Mitchell, M. (1998). An Introduction to Genetic Algorithms. The MIT Press.

Mitchell, T. (1997). Machine Learning. McGraw Hill.

Os, E. d. (2009, January 6). Project Description. Retrieved December 15, 2009, from Acquisition of Communication and Recognition Skills -- ACORNS: http://www.acorns-project.org/about.html

Park, A. S., & Glass, J. R. (2008). Unsupervised Pattern Discovery in Speech. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 186-197.

Pinker, S. (1994). The language instinct: the new science of language and mind. London: The Penguin Press.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996, December 13). Statistical Learning by 8-Month-Old Infants. Science, 1926-1928.

Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26, 43-49.

Stanford Encyclopedia of Philosophy. (2007, May 5). John Locke. Retrieved December 15, 2009, from Stanford Encyclopedia of Philosophy: http://plato.stanford.edu/entries/locke/