Human Speech Communication


[Diagram: the speech communication chain]

message → linguistic code (< 50 b/s) → motor control → speech production →
SPEECH SIGNAL (> 50 kb/s) → auditory processing → speech perception
processes → linguistic code (< 50 b/s) → message

Human Speech Communication

[Diagram: speaker and listener in dialogue; interaction, self-control,
adaptation, and shared knowledge link the two sides]

[Diagram: on both the speaker's and the listener's side, the high-bit-rate
signal (the scattered, overlapping sounds of "hello world") maps to a
low-bit-rate linguistic code and, combined with the speaker's or listener's
knowledge, to a very-low-bit-rate message]

Machine recognition of speech

[Diagram: machine recognition of speech maps the high-bit-rate signal (the
scattered sounds of "hello world") to a low-bit-rate sequence of words
("word", "another word", …) and on to the message]

[Diagram: the sounds of "hello world" overlap in time: COARTICULATION]

coarticulation + talker idiosyncrasies + environmental variability = a big mess

Two dominant sources of variability in speech

1. FEATURE VARIABILITY
   different people sound different, communication environments differ,
   coarticulation effects, …

2. TEMPORAL VARIABILITY
   people can say the same thing at different speeds

"Doubly stochastic" process (Hidden Markov Model)

Speech as a sequence of hidden states (phonemes); the task is to recover
the sequence.

1. We never know for sure which data will be generated from a given state.
2. We never know for sure in which state we are.
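The doubly stochastic character can be made concrete with a toy generator:
even when all parameters are known, both the state path and the data emitted
from each state are random. A minimal sketch; all numbers are illustrative,
not taken from the slides.

```python
import random

random.seed(0)

# Toy HMM: two hidden states with self-loop probabilities and Gaussian
# emission parameters (mean, std). All values are illustrative.
states = ["A", "B"]
self_loop = {"A": 0.8, "B": 0.6}
emission = {"A": (120.0, 10.0), "B": (200.0, 15.0)}

def generate(n, start="A"):
    """Sample a state path and an observation sequence from the model."""
    path, obs, s = [], [], start
    for _ in range(n):
        path.append(s)
        mu, sigma = emission[s]
        obs.append(random.gauss(mu, sigma))   # randomness #1: data given state
        if random.random() > self_loop[s]:    # randomness #2: state evolution
            s = "B" if s == "A" else "A"
    return path, obs

path, obs = generate(10)
print(path)
print([round(x) for x in obs])
```

A recognizer sees only `obs` and must infer `path`, which is exactly the
"recover the hidden state sequence" problem.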


[Figure: a wall, a fire, activity, shadows, echoes. Inferring hidden
activity from its shadows and echoes: already the ancient Greeks (Plato's
cave allegory).]

f0 = 195 125 140 120 185 130 145 190 245 155 130 Hz

[Figure: eleven people walk past in a parade, each saying "hi"; only the
pitch f0 of each "hi" is observed]

Know:
- What are the typical ranges of boys' and girls' voices?
- How likely is it that a boy walks first?
- How many boys and girls typically go together?
- How many more boys are there typically?

Want to know:
- Where are the boys (girls)?

The model

[Diagram: a two-state Markov chain with states m (male) and f (female);
self-transition probabilities p_m and p_f, cross-transitions 1 - p_m and
1 - p_f; each state emits f0 values according to P(sound | gender), and the
state priors are P(gender)]

Given this knowledge, generate all possible sequences of boys and girls and
find which among them could most likely have generated the observed
sequence.

f0 = 140 120 190 125 155 130 145 160 245 165 150 Hz

[Figure: the f0 sequence aligned to alternating runs of boys and girls]

Iterate:
- compute distributions of parameters for each state
- find the best alignment of states given the parameters
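Rather than literally generating all possible sequences, the best state
sequence can be found efficiently with the Viterbi recursion. A sketch for
the boy/girl pitch model; the means, variances, priors and transition
probabilities below are illustrative assumptions, not values from the
slides.

```python
import math

def log_gauss(x, mu, sigma):
    """Log-density of a Gaussian emission."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def viterbi(obs, states, log_prior, log_trans, emis):
    """Most likely state sequence for a Gaussian-emission HMM (log domain)."""
    delta = [{s: log_prior[s] + log_gauss(obs[0], *emis[s]) for s in states}]
    back = []
    for x in obs[1:]:
        prev, scores, ptr = delta[-1], {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            scores[s] = prev[best] + log_trans[best][s] + log_gauss(x, *emis[s])
            ptr[s] = best
        delta.append(scores)
        back.append(ptr)
    # Trace back from the best final state.
    s = max(states, key=lambda r: delta[-1][r])
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    return path[::-1]

# Illustrative pitch model: boys' f0 around 130 Hz, girls' around 190 Hz.
f0 = [140, 120, 190, 125, 155, 130, 145, 160, 245, 165, 150]
states = ["boy", "girl"]
log_prior = {"boy": math.log(0.5), "girl": math.log(0.5)}
log_trans = {"boy": {"boy": math.log(0.7), "girl": math.log(0.3)},
             "girl": {"boy": math.log(0.4), "girl": math.log(0.6)}}
emis = {"boy": (130.0, 20.0), "girl": (190.0, 35.0)}

print(viterbi(f0, states, log_prior, log_trans, emis))
```

The recursion keeps only the best-scoring path into each state at each
step, so the cost is linear in the sequence length instead of exponential.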

Getting the parameters (training of the model)

[Figure: eleven "hi" tokens with known gender labels]

"Forced alignment" of the model with the data

Machine recognition of speech

A more complex model architecture, but the same idea:

  finding boys and girls                      speech recognition
  ----------------------                      ------------------
  people's parade                             speech utterance
  gender groups                               speech sounds
  voice pitch                                 vector of features derived from the signal
  prior probabilities of gender occurrence    language model

- How to find w (efficiently)?
- What is the form of the model M(w_i)?
- What is the data x?


Data x? The speech signal

- describes changes in acoustic pressure
- its original purpose is reconstruction of speech
- rather high bit-rate
- additional processing is necessary to suppress the irrelevant information

Machine for recognition of speech

speech signal → pre-processing → acoustic processing → decoding (search) →
best matching utterance

(decoding draws on prior knowledge and on acoustic training data)

[Figure: three time-frequency panels of the same utterance, labeled
/j/ /u/ /ar/ /j/ /o/ /j/ /o/]

Short-term Fourier analysis

Every 10-20 ms, get the spectral components (the short-term spectrum).
Stacking these spectra over time gives the spectrogram, a 2-D
(time-frequency) representation of sound.
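The short-term analysis above can be sketched directly: window the signal
every 10-20 ms and take the magnitude of the Fourier transform of each
frame; stacking the frames gives the spectrogram. A naive, illustrative
implementation (a real system would use an FFT); the test signal and frame
sizes are assumptions.

```python
import cmath
import math

def stft_magnitude(signal, frame_len, hop):
    """Magnitude spectrogram via windowed, frame-by-frame DFT.

    Uses a naive O(N^2) DFT per frame: illustrative, not efficient.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]                     # Hamming window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):                  # positive freqs
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames   # frames[t][k]: magnitude at time frame t, frequency bin k

# 100 Hz tone sampled at 1 kHz; 20 ms frames (20 samples), 10 ms hop.
fs = 1000
x = [math.sin(2 * math.pi * 100 * n / fs) for n in range(200)]
S = stft_magnitude(x, frame_len=20, hop=10)
peak_bin = max(range(len(S[0])), key=lambda k: S[0][k])
print(peak_bin * fs / 20)   # frequency of the strongest bin
```

With a 20-sample frame at 1 kHz the bin spacing is 50 Hz, so a 100 Hz tone
should dominate bin 2.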

[Plot: log gain versus frequency, 0 to π rad/s]

Spectral resolution of hearing

Spectral resolution of hearing decreases with frequency
(critical bands of hearing, perception of pitch, …).

[Plot: critical bandwidth [Hz] against frequency [Hz], both on logarithmic
axes spanning roughly 50 Hz to 10 kHz]

[Figure: the spectrum is summarized by energies in "critical bands" along
the frequency axis]

Sensitivity of hearing depends on frequency

loudness [Sones] = intensity^0.33, where intensity ≈ signal² [W/m²]

[Figure: the intensity (power spectrum) is compressed by |.|^0.33 into
loudness]
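Both auditory facts, integration into critical bands whose width grows with
frequency and cube-root compression of intensity to loudness, are simple to
emulate on a power spectrum. The band edges below are an illustrative,
roughly Bark-like assumption, not values from the slides.

```python
# Emulate two properties of hearing on a power spectrum:
#   1) integrate energy in critical bands that widen with frequency
#   2) compress intensity to loudness: loudness ~ intensity**0.33
# Band edges are illustrative: narrow below ~500 Hz, wider above.
band_edges_hz = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4000]

def auditory_spectrum(power, fs):
    """power[k] is the power at frequency k*fs/(2*(len(power)-1))."""
    nbins = len(power)
    hz_per_bin = (fs / 2) / (nbins - 1)
    out = []
    for lo, hi in zip(band_edges_hz, band_edges_hz[1:]):
        e = sum(p for k, p in enumerate(power)
                if lo <= k * hz_per_bin < hi)    # energy inside the band
        out.append(e ** 0.33)                    # loudness-style compression
    return out

# Flat unit power spectrum, 129 linear bins up to 4 kHz.
spec = auditory_spectrum([1.0] * 129, fs=8000)
print(len(spec), [round(v, 2) for v in spec[:4]])
```

Under a flat spectrum the higher bands collect more energy simply because
they are wider, which is exactly the resolution loss the slide describes.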

Not all spectral details are important

a) compute the Fourier transform of the logarithmic auditory spectrum and
   truncate it (Mel cepstrum)

b) approximate the auditory spectrum by an autoregressive model
   (Perceptual Linear Prediction, PLP)

[Figure: a 6th-order and a 14th-order AR model fit to the auditory spectrum
(power/loudness versus frequency/tonality); the higher order follows more
detail]

Current state-of-the-art speech recognizers typically use high model order
PLP.
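The autoregressive approximation in PLP can be sketched as follows:
autocorrelations are obtained as the inverse transform of the (auditory)
power spectrum, and the Levinson-Durbin recursion solves the normal
equations for the all-pole coefficients. The model order and spectrum
values below are illustrative assumptions.

```python
import math

def autocorr_from_power(power):
    """Inverse DFT of a (symmetric) power spectrum gives autocorrelations.

    power holds bins 0..N for frequencies 0..pi; the rest is mirrored.
    """
    n = len(power) - 1
    return [sum((1 if k in (0, n) else 2) * power[k] * math.cos(math.pi * k * i / n)
                for k in range(n + 1)) / (2 * n)
            for i in range(len(power))]

def levinson_durbin(r, order):
    """Solve for AR coefficients a[1..order] of A(z) = 1 + sum a_i z^-i."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                     # reflection coefficient
        new = a[:]
        for i in range(1, m):
            new[i] = a[i] + k * a[m - i]
        new[m] = k
        a = new
        err *= (1 - k * k)                 # prediction error shrinks
    return a, err

# Smooth an 'auditory spectrum' with a low-order AR model (values illustrative).
power = [1.0, 2.0, 8.0, 3.0, 1.0, 0.5, 0.3, 0.2, 0.1]
r = autocorr_from_power(power)
a, err = levinson_durbin(r, order=4)
print([round(c, 3) for c in a], round(err, 4))
```

A low order keeps only the broad spectral shape; raising the order (as in
the 6th- versus 14th-order fits above) lets the model track finer detail.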

It's about time (to talk about TIME)

Masking in time suggests a ~200 ms buffer (critical interval) in the
auditory system.

[Figure: a masker preceding the signal raises the detection threshold; a
stronger masker raises it more, and the effect dies out roughly 200 ms
after the masker]

→ look at the time trajectories of the spectrum in critical bands of
hearing, filtered with a time constant > 200 ms (temporal buffer > 200 ms)

[Figure: spectrogram (short-term Fourier spectrum) against time [s]]

RASTA-PLP

[Figure: Perceptual Linear Prediction (PLP, 12th-order model) compared with
RASTA-PLP, in which each spectral trajectory is band-pass filtered;
spectrogram versus the spectrum from RASTA-PLP]
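The RASTA idea, band-pass filtering the time trajectory of each spectral
component so that stationary (channel-like) components drop out, can be
sketched as a simple difference equation. The coefficients follow the
commonly published RASTA filter, though published variants differ slightly
(pole values around 0.94-0.98 appear in the literature).

```python
import math

def rasta_filter(traj, pole=0.94):
    """Band-pass filter one spectral-energy trajectory (RASTA-style).

    y[n] = pole*y[n-1] + (2*x[n] + x[n-1] - x[n-3] - 2*x[n-4]) / 10
    The ramp-shaped numerator has zero gain at DC, so constant
    (channel-like) components are removed, while the pole passes
    slow modulations around syllable rate.
    """
    hist = [0.0, 0.0, 0.0, 0.0]        # last four inputs: x[n-4] .. x[n-1]
    y, prev = [], 0.0
    for xn in traj:
        num = (2 * xn + hist[3] - hist[1] - 2 * hist[0]) / 10.0
        prev = pole * prev + num
        y.append(prev)
        hist = hist[1:] + [xn]
    return y

# A constant offset (a fixed channel) is suppressed...
flat = rasta_filter([3.0] * 200)
# ...while a 4 Hz modulation (assuming 100 frames/s) passes almost unchanged.
mod = rasta_filter([math.sin(2 * math.pi * 4 * n / 100) for n in range(200)])
print(round(abs(flat[-1]), 6), round(max(abs(v) for v in mod[100:]), 2))
```

Because a fixed channel adds a constant to each log-spectral trajectory,
removing DC this way directly alleviates channel distortion.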

Data-guided feature extraction

data → preprocessing → artificial neural network trained on large amounts
of labeled data → posterior probabilities of speech sounds (/f/ /ay/ /v/)

[Figure: spectrogram versus posteriogram of the same utterance]

Masking in time

What happens outside the critical interval does not affect detection of a
signal within the critical interval.

[Figure: a stronger masker increases the threshold of perception of the
target, but only within roughly 200 ms]

Masking in frequency

What happens outside the critical band does not affect decoding of the
sound in the critical band.

[Figure: the increase in threshold of perception of the target grows with
noise bandwidth only up to the critical bandwidth]

Signal components inside the critical time-frequency window interact.

Emulation of cortical processing: Multi-resolution RASTA (MRASTA)
(Interspeech 05)

[Diagram: peripheral processing (critical-band spectral analysis) produces
N band trajectories; each is filtered by 32 2-D projections with variable
resolutions, giving 16 x 14 bands = 448 projections]

[Figure: spectro-temporal basis formed by outer products of temporal
filters (spanning -500 to 500 ms around the central band) with a frequency
derivative across 3 critical bands]

Bank of 2-D (time-frequency) filters (band-pass in time, high-pass in
frequency):

1. RASTA-like: alleviates stationary components
2. multi-resolution in time
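One way to realize a bank of temporal filters that are band-pass in time
and multi-resolution, in the spirit of MRASTA, is to use derivatives of
Gaussians with several widths: each width examines the trajectory at a
different temporal resolution, and the derivative makes every filter
DC-free (RASTA-like). A sketch; the widths and the test trajectory are
illustrative assumptions.

```python
import math

def gaussian_derivative_taps(sigma, support):
    """First-derivative-of-Gaussian filter on [-support, support] frames."""
    taps = [-t / (sigma ** 2) * math.exp(-t * t / (2.0 * sigma ** 2))
            for t in range(-support, support + 1)]
    norm = sum(abs(v) for v in taps)
    return [v / norm for v in taps]          # normalize the response scale

def filter_trajectory(x, taps):
    """Zero-padded filtering, output aligned with the input ('same' length)."""
    half = len(taps) // 2
    return [sum(taps[j] * x[i + j - half]
                for j in range(len(taps)) if 0 <= i + j - half < len(x))
            for i in range(len(x))]

# One critical-band energy trajectory with an onset and an offset.
trajectory = [0.0] * 30 + [1.0] * 30 + [0.0] * 30

# Filter it at several temporal resolutions (sigma in frames; with 10 ms
# frames these correspond to windows of roughly 100-500 ms).
bank = [filter_trajectory(trajectory, gaussian_derivative_taps(s, 4 * s))
        for s in (2, 4, 8, 12)]
print(len(bank), len(bank[0]))
```

The narrow filters respond sharply at the onset and offset; the wide ones
summarize the same events at a coarser time scale, which is the
multi-resolution idea.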

Spectral dynamics (much) more interesting than spectral shape

[Figure: the old way of getting spectral dynamics: differences of
short-term spectral components around t0. The older way (Spectrograph™):
reading the time trajectory of the spectral components around (t0, f0)
directly]

FDPLP versus PLP

- PLP: critical-band spectrum from all-pole models of the auditory-like
  spectrum (frame by frame, across frequency)
- FDPLP: critical-band spectrum from all-pole models of the temporal
  envelopes of the auditory-like spectrum (along time, per band)


Digit recognition accuracy [%] (ICSI Meeting Room Digit Corpus):

          clean    reverberated
  PLP     99.7     71.6
  FDPLP   99.2     87.0

Improvements on real reverberations are similar (IEEE Signal Proc. Letters
08).

[Figure: reverberant speech and telephone speech, with gain excluded versus
gain included]

Phoneme recognition accuracy [%]:

               TIMIT    HTIMIT
  PLP-MRASTA   67.6     47.8
  FDPLP        68.1     53.5

FDPLP with static and dynamic compression

[Figure: the Hilbert envelope, the FDLP fit to the Hilbert envelope, the
logarithmically compressed FDLP fit, and the FDLP fit compressed by the
PEMO model]

Recognition accuracy [%] on the TIMIT, HTIMIT, CTS and NIST RT05 meeting
tasks:

            PLP     FDPLP
  TIMIT     64.9    65.4
  HTIMIT    34.4    52.7
  CTS       52.3    59.3
  RT05      60.4    64.1

TANDEM (Hermansky et al., ICASSP 2000)

Features for a conventional speech recognizer should be normally
distributed and uncorrelated, so:

posteriors of speech sounds → pre-softmax outputs → principal component
projection → HMM (Gaussian mixture based) classifier

[Figure: histogram of one feature element and the correlation matrix of the
features]
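The decorrelation step can be sketched in two dimensions, where the
principal component rotation has a closed form: center the features and
diagonalize the 2x2 covariance. A real TANDEM front end applies PCA to the
full pre-softmax vector using statistics gathered on training data; the
feature pairs below are purely illustrative.

```python
import math

def pca_2d(data):
    """Principal-component projection for 2-D features (closed form).

    Centers the data and rotates by the angle that diagonalizes the
    2x2 covariance: the role PCA plays in the TANDEM front end.
    """
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    cxx = sum((x - mx) ** 2 for x, _ in data) / n
    cyy = sum((y - my) ** 2 for _, y in data) / n
    cxy = sum((x - mx) * (y - my) for x, y in data) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)   # diagonalizing rotation
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (x - mx) + s * (y - my),
             -s * (x - mx) + c * (y - my)) for x, y in data]

# Strongly correlated 'pre-softmax' pairs (illustrative numbers).
feats = [(1.0, 2.1), (2.0, 4.2), (3.0, 5.9), (4.0, 8.1), (5.0, 9.8)]
rotated = pca_2d(feats)
n = len(rotated)
cov_uv = sum(u * v for u, v in rotated) / n
print(round(cov_uv, 6))   # off-diagonal covariance after projection
```

After the rotation the off-diagonal covariance vanishes, which is what
makes the features palatable to a diagonal-covariance Gaussian mixture.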

Summary

Alternatives to short-term-spectrum-based attributes could be beneficial:

- data-driven, phoneme-posterior-based features
  - extract speech-specific knowledge from large out-of-domain corpora
- larger temporal spans
  - exploit coarticulation patterns of individual speech sounds
- models of temporal trajectories
  - improved modeling of fine temporal details
  - allow partial alleviation of channel distortions and reverberation
    effects


Coarticulation

[Figure: the sounds of "hello world" overlap in time, both in human speech
production (coarticulation) and in human auditory perception]

Hierarchical bottom-up event-based recognition?

high-bit-rate signal
→ pre-processing to emulate known properties of peripheral and cortical
  auditory processes
→ equally distributed posterior probabilities of speech sounds
→ low-bit-rate, unequally distributed identities of individual speech
  sounds (phonemes)

One way of going from phoneme posteriors to phonemes

[Figure: matched filtering applied to the phoneme posterior probability
trajectories turns the posteriogram of /n/ /ay/ /n/ into a sequence of
phoneme events]
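Matched filtering here means correlating the posterior trajectory of a
phoneme with a template of its typical posterior "bump" and picking peaks
of the output as phoneme events. A sketch with a hand-made template and
trajectory (both illustrative assumptions, not data from the slides).

```python
def matched_filter(trajectory, template):
    """Correlate a posterior trajectory with a phoneme template."""
    m = len(template)
    return [sum(template[j] * trajectory[i + j] for j in range(m))
            for i in range(len(trajectory) - m + 1)]

# Posterior trajectory of one phoneme over time: two 'bumps' = two events.
post = [0.05, 0.1, 0.6, 0.9, 0.6, 0.1, 0.05,
        0.05, 0.1, 0.5, 0.8, 0.5, 0.1, 0.05]
# Template: an idealized posterior bump (illustrative shape).
template = [0.1, 0.6, 1.0, 0.6, 0.1]

score = matched_filter(post, template)
# Phoneme events: local maxima of the matched-filter output above a
# threshold, reported at the template center.
events = [i + len(template) // 2 for i in range(1, len(score) - 1)
          if score[i] > score[i - 1] and score[i] >= score[i + 1]
          and score[i] > 0.8]
print(events)   # frame indices of the detected phoneme events
```

The two detected events line up with the centers of the two posterior
bumps, turning a frame-by-frame posteriogram into a sparse event sequence.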

(Some of) the issues

- Perceptual processes involved in decoding the message in speech?
  - where and how?
  - higher (cortical) levels are probably most relevant
  - what are the acoustic "events" for speech?

- Cognitive issues
  - what to "listen for"?
  - roles of the "bottom-up" and "top-down" channels?
  - coding alphabet (phonemes)?
    - category forming
    - invariants
  - when to make a decision?
  - etc.

SPEECH SIGNAL (high bit-rate)
→ auditory perception
→ acoustic "events"
→ ??? cognitive processes ???
→ linguistic code (low bit rate)
→ ??? cognitive processes ???
→ message (even lower bit rate)

Speculation

Improvements in acoustic processing could make domain-independent ASR
feasible.