Learning of Word Boundaries in
Continuous Speech using Time Delay
Neural Networks

Colin Tan

School of Computing,

National University of Singapore.

ctank@comp.nus.edu.sg


Motivations


Humans are able to automatically segment
words and sounds in speech with little
difficulty.


The ability to automatically segment words and phonemes is also useful in training speech recognition engines.


Principle


Time-Delay Neural Network



Input nodes have shift registers that allow the TDNN to generalize not only between discrete input-output pairs, but also over time.


Ability to learn true word boundaries given
reasonably good initial estimations.


We make use of this property for our work.

Why TDNN?


Representational simplicity


Intuitively easy to understand what the TDNN's inputs and outputs represent.


Ability to generalize over time


Hidden Markov Models have been left out
of this work for now.

Time Delay Neural Networks


Diagram shows a 2-input TDNN node.


Constrained weights
allow generalization
over time.
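
The constrained-weight idea can be illustrated with a short sketch: the same weight block is applied to every time-shifted window of the input, so the node responds to a pattern regardless of when it occurs. This is a minimal illustration with hypothetical names, not the authors' implementation; the 2-input, 3-delay geometry is assumed for the example.

```python
# Sketch of a single TDNN hidden node with constrained (tied) weights.
# Illustrative only: variable names and the 2-input, 3-delay geometry are assumptions.
import numpy as np

def tdnn_node(inputs, weights, bias, num_delays):
    """inputs: (num_features, num_frames); weights: (num_features, num_delays).
    The same weight block is reused at every time shift, which is what lets
    the node generalize over time."""
    num_features, num_frames = inputs.shape
    outputs = []
    for t in range(num_frames - num_delays + 1):
        window = inputs[:, t:t + num_delays]          # time-shifted input window
        activation = np.sum(weights * window) + bias  # tied weights applied to every window
        outputs.append(np.tanh(activation))
    return np.array(outputs)

# Example: a 2-input node (as in the slide's diagram) with 3 delays.
x = np.random.randn(2, 10)       # 2 features over 10 frames
w = np.random.randn(2, 3) * 0.1
print(tdnn_node(x, w, bias=0.0, num_delays=3).shape)  # (8,)
```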


Boundary Shift Algorithm


Initially:


The TDNN is trained on a small manually segmented set of data.


Given the expected number of words in a new, unseen utterance, the cepstral frames in the utterance are distributed evenly over all the words.


For example, if there are 2,000 frames and 10 expected words, each word is
allocated 200 frames.


Convex-Hull and Spectral Variation Function methods may be used to estimate the number of words in the utterance.


For our experiments we manually counted the number of words in each
utterance.
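
The initial even allocation can be sketched as follows; the helper name and the handling of leftover frames are assumptions, since the slides only give the 2,000-frame, 10-word example.

```python
# Sketch of the initial even allocation of frames to words.
# Hypothetical helper; remainders are handled by giving earlier words one extra frame.
def initial_boundaries(num_frames, num_words):
    """Return (start, end) frame indices for each word, end exclusive."""
    base, extra = divmod(num_frames, num_words)
    boundaries, start = [], 0
    for w in range(num_words):
        length = base + (1 if w < extra else 0)
        boundaries.append((start, start + length))
        start += length
    return boundaries

# The slides' example: 2,000 frames and 10 expected words -> 200 frames per word.
print(initial_boundaries(2000, 10)[:3])  # [(0, 200), (200, 400), (400, 600)]
```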




Boundary Shift Algorithm

1.
The minimally trained TDNN is retrained using both its
original data and the new unseen data.

2.
After retraining, a variable-sized window is placed around each boundary.


Window is initially +/- 16 frames.

3.
A search is made within the window for the highest
scoring frame. The boundary is shifted to that frame.


The search is allowed to extend past boundaries into neighboring words.

4.
TDNN is retrained using new boundaries.

Boundary Shift Algorithm

5.
Windows are adjusted by +/- 2 frames (i.e. reduced by a total of 4 frames), and steps 3 to 5 are repeated.

6.
Algorithm ends when boundary shifts are negligible, or
windows shrink to 0 frames.
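
A compact sketch of steps 2 to 6 is given below. Here `boundary_scores` stands in for the TDNN's per-frame boundary output and `retrain_tdnn` for a retraining pass; both are hypothetical placeholders rather than the authors' code, and the stopping tolerance is an assumption.

```python
# Sketch of the boundary-shift loop (steps 2-6).
def boundary_shift(boundaries, boundary_scores, retrain_tdnn,
                   init_window=16, shrink=2, min_shift=1):
    window = init_window
    while window > 0:
        total_shift = 0
        new_boundaries = []
        for b in boundaries:
            lo = max(0, b - window)
            hi = min(len(boundary_scores), b + window + 1)
            # Search (possibly past neighboring boundaries) for the highest-scoring frame.
            best = lo + max(range(hi - lo), key=lambda i: boundary_scores[lo + i])
            total_shift += abs(best - b)
            new_boundaries.append(best)
        boundaries = sorted(new_boundaries)
        boundary_scores = retrain_tdnn(boundaries)  # step 4: retrain on the new boundaries
        if total_shift < min_shift:                 # step 6: boundary shifts are negligible
            break
        window -= shrink                            # step 5: shrink window by +/- 2 frames
    return boundaries
```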

Network Pruning


Limited training data lead to the problem of over-fitting.


Three parameters are used to decide which TDNN
nodes to prune.


The significance of node j, which measures how much a particular node contributes to the final answer. A node with a small significance value contributes little to the final answer and can be pruned.

Network Pruning


Three parameters are used to prune the TDNN:


The variance of node j, which measures how much a particular node's output changes over all the inputs. A node that changes very little over all the inputs is not contributing to the learning, and can be removed.


The pairwise node distance between nodes j and i, which measures how one node's output changes with respect to another's. A node that follows another node closely in value is redundant and can be removed.


Network Pruning


Thresholds are set for each parameter. Nodes with
parameters falling below these thresholds are pruned.


Selection of thresholds is critical.


Pruning is performed after the TDNN has been trained on
the initial set for about 200 cycles.
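
The slides name the three pruning statistics but not their exact formulas, so the sketch below is one plausible formulation computed over a matrix of hidden-node activations; the threshold values are placeholders and would have to be tuned, since threshold selection is noted above as critical.

```python
# One plausible formulation of the three pruning statistics (assumed, not from the slides).
import numpy as np

def prune_candidates(activations, out_weights,
                     sig_thresh=1e-3, var_thresh=1e-3, dist_thresh=1e-2):
    """activations: (num_examples, num_hidden); out_weights: (num_hidden, num_outputs)."""
    # Significance: average magnitude of a node's contribution to the output layer.
    significance = np.mean(np.abs(activations), axis=0) * np.sum(np.abs(out_weights), axis=1)
    # Variance: how much each node's output changes over all the inputs.
    variance = np.var(activations, axis=0)
    # Pairwise distance: mean absolute difference between two nodes' outputs;
    # a node that closely follows another is redundant.
    diffs = activations[:, :, None] - activations[:, None, :]
    distance = np.mean(np.abs(diffs), axis=0)
    np.fill_diagonal(distance, np.inf)               # ignore self-distance
    prune = ((significance < sig_thresh) | (variance < var_thresh)
             | (distance.min(axis=1) < dist_thresh))
    return np.where(prune)[0]                        # indices of nodes to remove
```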

Experiments


TDNN Architecture


27 Inputs


13 delta cepstral (dcep) coefficients, 13 delta-delta cepstral (ddcep) coefficients, and power.


5 input delays


96 Hidden Nodes


Arbitrarily chosen, to be pruned later.


2 Binary Output Nodes


Represent word start and end boundaries.
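
A minimal sketch wiring up these dimensions is shown below. The activation function and the use of a single-frame output layer are assumptions, as the slides do not specify them.

```python
# Minimal sketch of the stated architecture: 27 inputs, 5 input delays,
# 96 hidden nodes, 2 output nodes.  Activations and output-layer delays are assumed.
import numpy as np

def tdnn_layer(x, weights, bias):
    """x: (in_features, frames); weights: (out_features, in_features, delays).
    Tied weights slide over time, giving (out_features, frames - delays + 1)."""
    out_f, in_f, delays = weights.shape
    frames = x.shape[1] - delays + 1
    out = np.empty((out_f, frames))
    for t in range(frames):
        window = x[:, t:t + delays]
        out[:, t] = np.tensordot(weights, window, axes=([1, 2], [0, 1])) + bias
    return np.tanh(out)

rng = np.random.default_rng(0)
frames = 100                                          # e.g. 100 cepstral frames
x = rng.standard_normal((27, frames))                 # 13 dcep + 13 ddcep + power per frame
w_hidden = rng.standard_normal((96, 27, 5)) * 0.01    # 5 input delays
w_out = rng.standard_normal((2, 96, 1)) * 0.01        # word start/end boundary nodes
h = tdnn_layer(x, w_hidden, bias=0.0)
y = tdnn_layer(h, w_out, bias=0.0)
print(h.shape, y.shape)   # (96, 96) (2, 96)
```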

Experiments


Data gathered from 6 speakers


3 male, 3 female.


Solving task similar to CISD Trains Scheduling
Problem (Ferguson 96).


About 20-30 minutes of speech used to train the TDNN.


20 utterances, previously unseen, chosen to
evaluate performance.



Experiment Results

Performance Before Pruning


Results shown relative to hand-labeled samples.

Inside Test:
Precision: 66.88%
Recall: 67.33%
F-Number: 67.07%

Outside Test:
Precision: 56.22%
Recall: 76.69%
F-Number: 64.88%
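
Assuming F-Number denotes the harmonic mean of precision and recall (the F1 score), the figures can be related as in this small check; the function name is illustrative.

```python
# Assuming F-Number is the harmonic mean of precision and recall (F1 score).
def f_number(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_number(56.22, 76.69), 2))  # 64.88, matching the outside-test value above
```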


Experiment Results

Performance After Pruning

Inside Test:
Precision: 66.03%
Recall: 61.41%
F-Number: 63.61%

Outside Test:
Precision: 57.10%
Recall: 72.16%
F-Number: 61.71%


Example Utterances

Subject: CK

Utterance: Ok thanks, now I need to find out how long does it need to travel from Elmira to Corning

Segmentation: (okay) (th-) (-anks) (now) (i need) (to) (find) (out how) (long) (does it need) (to) (travel) (f-) (-om) (emira) (to c-) (orning)

Example Utterances

Subject: CT

Utterance: May I know how long it takes to travel from Elmira to Corning?

Segmentation: (may i) (know how) (long) (does it) (take) (to tr-) (-avel) (from) (el-) (-mira) (to) (corn-) (-ning)

Deletion Errors


Most prominent in places framed by plosives.


The algorithm is able to detect boundaries at the ends of the phrase but not in the middle, due to the presence of 'd' plosives at the ends.

Insertion Errors


Most prominent in places where a vowel is stretched.

Recommendations for Further
Work


The results presented here are early research results, and they are promising.


Future work will combine the TDNN with other statistical methods such as Expectation Maximization.