Time Frequency Analysis and Wavelet Transform Tutorial


Time-Frequency Analysis for Voiceprint (Speaker) Recognition

Wen-Wen Chang (張雯雯)
E-mail: r01942035@ntu.edu.tw
Graduate Institute of Communication Engineering
National Taiwan University, Taipei, Taiwan, ROC


















Contents

Abstract
1. Speaker Recognition
 1.1 Introduction of speaker recognition
 1.2 General structure of speaker recognition system
2. Human Speech Production
 2.1 Vocal Folds
 2.2 Vocal Tract
3. Feature Extraction
 3.1 Short-term spectral features
  3.1.1 Mel-Frequency Cepstral Coefficients (MFCC)
  3.1.2 Linear Predictive Cepstral Coefficients (LPCC)
  3.1.3 Mel-Frequency Discrete Wavelet Coefficients (MFDWC)
 3.2 Voice source features
  3.2.1 Wavelet Octave Coefficients of Residues (WOCOR)
 3.3 Spectro-temporal features
 3.4 Prosodic features
 3.5 High-level features
4. Speaker Modeling
References









Abstract

Speaker recognition aims to identify a speaker from his or her speech samples. By extracting speaker-specific features from the speech samples and using these features to model the speaker, the recognition task can be accomplished. Time-frequency analysis and the wavelet transform are important for feature extraction in speaker recognition, because the frequency content of the speech signal varies with time. In this tutorial, the structure of speaker recognition systems and several feature extraction techniques are introduced.






























1. Speaker Recognition

1.1 Introduction of speaker recognition

The goal of automatic speaker recognition is to identify the speaker by extraction, characterization, and recognition of the speaker-specific information contained in the speech signal.

Speaker recognition generally involves two major tasks: speaker identification and speaker verification. The speaker identification task is to determine who the speaker is among a group of known speakers. The voice sample of the test speaker is compared to all the known speaker models to find the model with the closest match. The speaker verification (authentication) task is to determine whether the speaker is the person he or she claims to be. The voice sample of the test speaker is compared to the target speaker model; if the likelihood is above a threshold, the test speaker is accepted. Figure 1 illustrates these two tasks.

Speaker recognition methods can also be divided into text-dependent and text-independent methods. In a text-dependent system, the recognition system has prior knowledge of the text to be spoken (a user-specific pass-phrase or a system-prompted phrase) and expects the user to be cooperative. The performance of such a system is often better than that of a text-independent system because of the prior knowledge of the text. In a text-independent system, the system does not know the text to be spoken by the user. Text-independent recognition is more difficult but also more flexible. For example, the speaker recognition task can be carried out while the test speaker is conducting other speech interactions (background verification).

Of all the biometrics for the recognition of individuals (DNA, fingerprint, face recognition), voice is a compelling biometric, because speech is a natural signal to produce and does not require a specialized input device. Also, the telephone system provides a ubiquitous, familiar network of sensors for obtaining and delivering the speech signal.



Figure 1


1.2 General structure of speaker recognition system

The basic structures of the speaker identification system and the speaker verification system are shown in Figure 2. The front-end processing module generally includes silence detection, pre-emphasis, and feature extraction. Silence detection is performed to remove non-speech portions from the speech signal. Pre-emphasis is needed because the high-frequency components of the speech signal have small amplitude with respect to the low-frequency components; a high-pass filter is used to emphasize the high-frequency components. The feature extraction process transforms the raw signal into feature vectors in which speaker-specific properties are emphasized and statistical redundancies are suppressed.
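As a small illustration of the pre-emphasis step, the sketch below applies the usual first-order high-pass filter y[n] = x[n] - a*x[n-1] in Python (NumPy assumed); the coefficient value 0.97 is a typical choice, not one specified in this tutorial.

    import numpy as np

    def pre_emphasis(signal, alpha=0.97):
        # y[n] = x[n] - alpha * x[n-1]; boosts high frequencies relative to low ones
        return np.append(signal[0], signal[1:] - alpha * signal[:-1])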

Speaker modeling is based on the feature vectors extracted in the front-end processing. In the enrollment mode, a speaker model is trained using the feature vectors of the target speaker. In the recognition mode, the feature vectors extracted from the test speaker's speech are compared against the models in the system database, and a score is computed for decision making.






Figure 2




2. Human Speech Production

Normal human speech is produced when air is exhaled from the lungs; the oscillation of the vocal folds modulates this air flow into a pulsed air stream, called the glottal pulses. This pulsed wave then passes through the vocal tract, and its frequency content is modified by the resonances of the vocal tract. The vocal folds and the vocal tract are two important parts in speech production.





Figure 3

Figure 4


2.1 Vocal Folds

The vocal folds are the source for speech production in humans. They generate two kinds of speech sounds: voiced and unvoiced. The vocal folds vibrate when creating a voiced sound, while they do not for an unvoiced sound. The frequency of the vocal fold vibration is called the fundamental frequency. The vibration of the vocal folds depends on the tension exerted by the muscles, and on the mass and length of the vocal folds. These characteristics vary between speakers and thus can be utilized for speaker recognition. The features that characterize the vocal fold oscillation are called voice source features.


2.2 Vocal Tract

The vocal tract comprises the speech production organs above the vocal folds, which consist of the oral tract (tongue, pharynx, palate, lips, and jaw) and the nasal tract. When the glottal pulse signal generated by the vibration of the vocal folds passes through the vocal tract, it is modified. The vocal tract works as a filter, and its frequency response depends on the resonances of the vocal tract. The vocal tract shape can be estimated from the spectral shape of the speech signal, such as the formant locations and the spectral tilt.

Vocal tract resonances, also called formants, are the peaks of the spectral envelope. Figure 5 shows a spectral envelope and its formants. The lowest resonance frequency is called the first formant, followed by the second formant frequency, and so on. The resonance frequencies (formants) are inversely proportional to the vocal tract length. Male speakers usually have longer vocal tracts, and thus lower formants; for female speakers and children, the vocal tract is shorter and the formants are higher. Figure 6 demonstrates this phenomenon.

In speaker recognition, features derived from vocal tract characteristics are the most commonly used. These features can be obtained from the spectrogram of the speech signal and are therefore categorized as short-term spectral features.



Figure 5

Figure 6




3. Feature Extraction

The speaker-specific characteristics of speech can be categorized into physical and learned characteristics. The physical characteristics are the shapes and sizes of the speech production organs, such as the vocal folds and the vocal tract. The learned characteristics include rhythm, intonation style, accent, choice of vocabulary, and so on. Figure 7 shows a summary of features from the viewpoint of their physical interpretation.

For speaker recognition, good features should have large between-speaker variability and small within-speaker variability, and they should be robust against noise and distortion. The dimensionality of the features should also be low, because otherwise the computational cost becomes high and statistical models such as the Gaussian mixture model (GMM) cannot handle the high-dimensional data.


The features for speaker recognition can be divided into:

(1) Short-term spectral features
(2) Voice source features
(3) Spectro-temporal features
(4) Prosodic features
(5) High-level features

The short-term spectral features are the simplest and most discriminative, so they are the most commonly used in speaker recognition. State-of-the-art speaker recognition systems often combine several of these features in an attempt to achieve more accurate recognition results.



Figure 7


3.1 Short-Term Spectral Features

The short-term spectral features convey information about the spectral envelope. The spectral envelope contains information about the speaker's vocal tract characteristics, such as the locations and magnitudes of the peaks (formants) in the spectrum, and is therefore commonly used for speaker recognition. Figure 8 shows the spectral envelopes of two different speakers (one male, one female).



Figure 8


In most spectral analyses of the speech signal, short-term spectral analysis is used to obtain the spectrogram. It is quite similar to the Short-Time Fourier Transform (STFT). Short-term spectral analysis is done by framing the speech signal; the frame width is about 20-30 milliseconds, and the frames are shifted by about 10 milliseconds. It is assumed that although the speech signal is non-stationary, it is stationary over a short duration of time. The process of short-term spectral analysis is illustrated in Figure 9.




Figure 9




Framing the Signal

The speech signal is broken down into short frames. The width of the frame is generally about 30 ms, with an overlap of about 20 ms (a 10 ms shift). Each frame contains N sample points of the speech signal.
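A minimal Python sketch of this framing step (NumPy assumed; the 30 ms frame width and 10 ms shift follow the values above):

    import numpy as np

    def frame_signal(x, fs, frame_ms=30, shift_ms=10):
        # Split a 1-D signal x sampled at fs Hz into overlapping frames.
        frame_len = int(fs * frame_ms / 1000)
        shift = int(fs * shift_ms / 1000)
        n_frames = 1 + max(0, (len(x) - frame_len) // shift)
        return np.stack([x[i * shift: i * shift + frame_len] for i in range(n_frames)])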




Windowing

The framed signal is multiplied by a window function. The window function is used to smooth the signal for the computation of the DFT. The DFT computation assumes that the input signal repeats over and over; if there is a discontinuity between the first point and the last point of the signal, artifacts occur in the DFT spectrum. By multiplying the signal by a window function that smoothly attenuates both ends towards zero, these unwanted artifacts can be avoided. The Hamming window, shown in Figure 10, is usually used in speech spectral analysis, because its spectrum falls off rather quickly, so the resulting frequency resolution is better, which is suitable for detecting formants.
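Continuing the sketch, each frame can be windowed with a Hamming window before taking the DFT (NumPy assumed):

    import numpy as np

    def windowed_spectrum(frame):
        # Multiply the frame by a Hamming window, then take the magnitude of the DFT.
        return np.abs(np.fft.rfft(frame * np.hamming(len(frame))))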




Figure 10


Figure 11


There are many short-term spectral features that convey information about the spectral envelope of the speech signal, such as MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Predictive Cepstral Coefficients), and MFDWC (Mel-Frequency Discrete Wavelet Coefficients). Figure 12 shows the estimation of the spectral envelope using cepstral analysis and linear prediction, respectively.



Figure 12


3.1.1 Mel-Frequency Cepstral Coefficients (MFCC)

The Mel-Frequency Cepstral Coefficients (MFCC) are the most commonly used features in speaker recognition. They combine the advantages of cepstral analysis with a perceptual frequency scale based on critical bands. The steps for computing the Mel-Frequency Cepstral Coefficients from the speech signal are as follows, and the algorithm is shown in Figure 13.

1. Framing the Signal
2. Windowing
3. FFT
4. Mel-Frequency Warping
5. Computing the Cepstral Coefficients



Figure 13


After segmenting the speech signal into overlapping frames, the frequency response of each frame is computed by the Discrete Fourier Transform (DFT), and the spectrogram of the speech signal is obtained. Figure 14 illustrates the computation of the spectrogram by short-term spectral analysis.




Figure 14



Mel-Frequency Warping

Mel (melody) is a unit of pitch. The Mel-frequency scale, based on human auditory perception experiments, is approximately linear up to 1000 Hz and then becomes close to logarithmic for higher frequencies. Figure 15 shows the plot of pitch (Mel) versus frequency.

Figure 15

It is observed that the human ear acts as a set of filters that concentrate on only certain frequency components. Thus the human auditory system can be modeled by a set of band-pass filters which are uniformly spaced on the Mel-frequency scale. Since the relationship between the frequency scale and the Mel-frequency scale is nonlinear, these filters are non-uniformly spaced on the frequency scale, with more filters in the low-frequency regions and fewer filters in the high-frequency regions. Figure 16 shows a 24-band Mel-frequency filter bank. Figure 17 shows the power spectrum of a frame passed through the 24-band Mel-frequency filter bank.
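For illustration, the following sketch builds such a bank of triangular filters spaced uniformly on the Mel scale (NumPy assumed; the 2595/700 Mel-scale formula is a common convention, not stated in this tutorial). Applying the bank to a frame's power spectrum amounts to a matrix product, e.g. np.dot(fbank, power_spectrum).

    import numpy as np

    def mel_filterbank(n_filters=24, n_fft=512, fs=16000):
        # Triangular filters uniformly spaced on the Mel scale, applied to an rfft spectrum.
        hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, center):
                fbank[i - 1, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):
                fbank[i - 1, k] = (right - k) / max(right - center, 1)
        return fbank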



Figure 16


Figure 17




Computing the Cepstral Coefficients

Our goal is to obtain the spectral envelope, because it conveys the information about the formants. Cepstrum analysis can be used to extract the spectral envelope from the spectrum. The cepstrum can be considered as the spectrum of the log spectrum. The spectrum X[k] equals the spectral envelope H[k] multiplied by the spectral details E[k]:

    X[k] = H[k] E[k]

In order to separate the spectral envelope and the spectral details from the spectrum, take the logarithm of both sides:

    log X[k] = log H[k] + log E[k]

Then the log spectrum is the sum of a smooth signal (the spectral envelope) and a fast-varying signal (the spectral details). Thus the spectral envelope can be obtained from the low-frequency components of the spectrum of the log spectrum, i.e. the low-frequency cepstrum coefficients. The concept of cepstrum analysis is illustrated in Figure 18.




Mel-Frequency Cepstral Coefficients (MFCC)

The MFCC features are obtained by taking the logarithm of the outputs of the Mel-frequency filter bank and then applying the Discrete Cosine Transform (DCT). The final MFCC feature vector is obtained by retaining about 12-15 of the lowest DCT coefficients.
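A minimal sketch of this last step, assuming SciPy for the DCT and the Mel filter-bank outputs from the previous steps; the number of retained coefficients (13) is one common choice within the 12-15 range mentioned above.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_from_filterbank(filterbank_energies, n_coeffs=13):
        # Log of the Mel filter-bank outputs, then DCT; keep the lowest coefficients.
        return dct(np.log(filterbank_energies + 1e-10), norm='ortho')[:n_coeffs]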



Figure 18


3.1.2 Linear Predictive Cepstral Coefficients (LPCC)

Linear Predictive Coding (LPC) is an alternative method for spectral envelope estimation. This method is also known as the all-pole model or the autoregressive (AR) model. It has a good intuitive interpretation both in the time domain (adjacent samples are correlated) and in the frequency domain (the all-pole spectrum corresponds to the resonance structure).


The signal s[n] is predicted by a linear combination of its past values. The predictor equation is defined as

    ŝ[n] = Σ_{k=1}^{p} a_k s[n-k]

Here s[n] is the signal, a_k are the predictor coefficients, and ŝ[n] is the predicted signal. The prediction error signal, or residual, is defined as

    e[n] = s[n] - ŝ[n]

The coefficients a_k are determined by minimizing the residual energy E[e[n]^2], using the Levinson-Durbin algorithm.
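As an illustration, the autocorrelation-method normal equations can be solved directly; the sketch below uses SciPy's Toeplitz solver in place of an explicit Levinson-Durbin recursion (a simplification assumed here for brevity).

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def lpc_coefficients(frame, order=12):
        # Autocorrelation-method LPC: solve the normal equations R a = r for a_1 .. a_p.
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])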


As shown in Figure 19, s[n] is the speech signal, e[n] is the voice source (glottal pulses), and H(z) is the response of the vocal tract filter.

Figure 19

    e[n] = s[n] - ŝ[n] = s[n] - Σ_{k=1}^{p} a_k s[n-k]

    E(z) = S(z) [1 - Σ_{k=1}^{p} a_k z^{-k}]

    H(z) = S(z) / E(z)

Thus the spectral model representing the vocal tract is

    H(z) = 1 / (1 - Σ_{k=1}^{p} a_k z^{-k})






The predictor coefficients a_k are rarely used as features directly; instead they are transformed into the more robust Linear Predictive Cepstral Coefficients (LPCC). A recursive algorithm proposed by Rabiner and Juang can be used to compute the cepstral coefficients from the LPC coefficients.

However, unlike the MFCC, the LPCC are not based on a perceptual frequency scale such as the Mel-frequency scale. This led to the development of the Perceptual Linear Predictive (PLP) analysis.


3.1.3 Mel-Frequency Discrete Wavelet Coefficients (MFDWC)

Mel-Frequency Discrete Wavelet Coefficients are computed in a similar way to the MFCC features. The only difference is that a Discrete Wavelet Transform (DWT) is used to replace the DCT in the last step. Figure 20 shows the algorithms for MFCC and MFDWC.


Figure 20
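A sketch of the MFDWC idea, assuming the PyWavelets (pywt) library; the choice of wavelet ('db4') and the simple truncation of the concatenated coefficients are illustrative assumptions, not details given in this tutorial.

    import numpy as np
    import pywt

    def mfdwc_from_filterbank(log_mel_energies, wavelet='db4', n_coeffs=13):
        # Same front end as MFCC, but a DWT replaces the DCT in the last step.
        coeffs = np.concatenate(pywt.wavedec(log_mel_energies, wavelet))
        return coeffs[:n_coeffs]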


MFDWCs have been used in speaker verification, and it has been shown that they give better performance than MFCCs in noisy environments. An explanation for this improvement is that the DWT allows good localization in both the time and frequency domains.



3.2 Voice source features

Voice source features characterize the voice source (the glottal pulse signal), such as the glottal pulse shape and the fundamental frequency. These features cannot be directly measured from the speech signal, because the voice source signal is modified when passing through the vocal tract. The voice source signal is extracted from the speech signal by assuming that the voice source and the vocal tract are independent of each other. The vocal tract filter can first be estimated using the linear prediction model described in subsection 3.1.2; the voice source signal, which is the residual of the linear prediction model, can then be estimated by inverse filtering the speech signal. Here S(z) is the speech signal, E(z) is the voice source signal, and H(z) is the response of the vocal tract filter:

    E(z) = S(z) / H(z)

Figure 21 shows the voice source signal extracted from the speech signal using linear prediction inverse filtering.
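A minimal sketch of this inverse filtering, reusing the autocorrelation LPC estimate from subsection 3.1.2 (NumPy and SciPy assumed; the order of 12 is an illustrative choice):

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lp_residual(frame, order=12):
        # Estimate the LP coefficients, then inverse-filter with A(z) = 1 - sum a_k z^-k.
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)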

The voice source features are extracted from the voice source signal. These features depend on the source of the speech, namely the excitation generated by the vocal folds, so they are less sensitive to the content of speech than short-term spectral features such as the MFCCs. The voice source features are not as discriminative as vocal tract features, but fusing these two complementary feature sets (short-term spectral features and voice source features) can improve recognition accuracy.


Figure 21


3.2.1 Wavelet Octave Coefficients of Residues (WOCOR)

The algorithm for computing the Wavelet Octave Coefficients of Residues (WOCOR) features is shown in Figure 22. First, the pitch of the speech signal is estimated using cepstrum analysis. The speech signal is divided into overlapping frames, as in short-term spectral analysis. Linear prediction is then performed on each frame, and the voice source signal is obtained by inverse filtering. With pitch-synchronous analysis, the wavelet transform is applied to every two pitch cycles of the linear prediction residual signal to obtain the WOCOR features.
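As an illustration of the pitch estimation step, a minimal cepstrum-based pitch estimator is sketched below (NumPy assumed; the 50-400 Hz search range is an assumed typical range for speech, not a value given in this tutorial):

    import numpy as np

    def estimate_pitch(frame, fs, f_min=50.0, f_max=400.0):
        # Peak of the real cepstrum within the expected pitch-period (quefrency) range.
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
        q_min, q_max = int(fs / f_max), int(fs / f_min)
        return fs / (q_min + np.argmax(cepstrum[q_min:q_max]))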


Figure 22


3.3 Spectro-temporal features

Spectro-temporal features refer to features extracted from the frequency content of the subbands of the speech signal spectrogram; an example is shown in Figure 23. Spectro-temporal signal details contain useful speaker-specific information such as formant transitions and energy modulations. Recently, some research has proposed modulation features, which are extracted by representing the non-stationary speech signal as a sum of amplitude modulated (AM) and frequency modulated (FM) signals.

Figure 23


A common way to incorporate temporal information into short-term spectral features is to use the first- and second-order differences of the feature vectors. The first- and second-order differences of the MFCCs are called delta and delta-delta cepstral coefficients. These coefficients are usually appended to the original MFCC coefficients at the frame level (e.g., 12 MFCCs with delta and delta-delta coefficients, giving 36 features per frame). Figure 24 shows the MFCCs and their first- and second-order differences.
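A sketch of the common regression formula for delta coefficients (NumPy assumed; the +/-2 frame window is a typical choice, not one specified here). Delta-delta features are obtained by applying the same function to the deltas.

    import numpy as np

    def delta(features, N=2):
        # features: (n_frames, n_coeffs) array; returns the first-order differences per frame.
        padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
        denom = 2.0 * sum(n * n for n in range(1, N + 1))
        return np.array([
            sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
            for t in range(len(features))
        ])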



Figure 24

3.4 Prosodic features

In linguistics, prosody refers to syllable stress, intonation patterns, speaking rate, and the rhythm of speech. Prosody may convey various characteristics of the speaker, such as differences in speaking style, language background, sentence type, and emotion, to mention a few. In order to obtain long-term information (syllable stress, intonation patterns, speaking rate, etc.) from the speech signal, the prosodic features span long segments such as syllables, words, and utterances, unlike the short-term spectral features.


3.5 High-level features

High-level features attempt to capture conversation-level characteristics of speakers, such as a speaker's characteristic vocabulary, i.e., the kinds of words the speaker tends to use in conversation, called the idiolect. For example, the phrases frequently used by a speaker, like "uh-huh", "you know", and "oh yeah", can be used for recognition. The idea in high-level modeling is to convert each utterance into a sequence of tokens, where the co-occurrence patterns of tokens characterize speaker differences.





4. Speaker Modeling

During enrollment, speech from a speaker is passed through the front-end processing steps described above, and the feature vectors are used to create a speaker model. A brief description of some of the most prevalent speaker modeling techniques follows.





Hidden Markov Models (HMM)

For text-dependent applications, whole phrases or phonemes may be modeled using multi-state left-to-right HMMs, while for text-independent applications, single-state HMMs, also known as Gaussian Mixture Models (GMMs), are used.
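A minimal sketch of GMM-based speaker modeling and scoring, assuming scikit-learn; the number of mixture components and the diagonal covariance are illustrative choices, not values given in this tutorial.

    from sklearn.mixture import GaussianMixture

    def train_speaker_gmm(feature_vectors, n_components=16):
        # Fit a GMM to a speaker's training feature vectors (e.g. MFCC frames).
        return GaussianMixture(n_components=n_components, covariance_type='diag').fit(feature_vectors)

    def verification_score(target_gmm, background_gmm, test_vectors):
        # Average log-likelihood ratio; compare against a threshold to accept or reject.
        return target_gmm.score(test_vectors) - background_gmm.score(test_vectors)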




Neural Networks (NN)

A potential advantage of neural networks is that feature extraction and speaker modeling can be combined into a single network, enabling joint optimization of the (speaker-dependent) feature extractor and the speaker model.




Support Vector Machine (SVM)

The support vector machine (SVM), as illustrated in Figure 25, is a binary classifier which models the decision boundary between two classes; it can be used as a classifier in speaker verification. In speaker verification, one class consists of the target speaker's training feature vectors (labeled +1), and the other class consists of training feature vectors from an impostor (background) population (labeled -1). Using the labeled training feature vectors, the SVM finds a boundary that maximizes the margin of separation between these two classes.
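A minimal scikit-learn sketch of this setup (the linear kernel is an illustrative assumption):

    import numpy as np
    from sklearn.svm import SVC

    def train_verification_svm(target_vectors, impostor_vectors):
        # Target speaker frames labeled +1, impostor (background) frames labeled -1.
        X = np.vstack([target_vectors, impostor_vectors])
        y = np.concatenate([np.ones(len(target_vectors)), -np.ones(len(impostor_vectors))])
        return SVC(kernel='linear').fit(X, y)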


Figure 25


Conclusion

The fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling, are briefly introduced in this tutorial. Actual speaker recognition systems are much more complicated; factors such as noise and channel effects also need to be considered. Research on speaker recognition methods and techniques has been undertaken for over four decades, and it continues to be an active area.










References

[1] Beigi, H. (2011). Fundamentals of Speaker Recognition, Springer, New York. ISBN: 978-0-387-77591-3.
[2] Tomi Kinnunen, Haizhou Li, "An Overview of Text-Independent Speaker Recognition: From Features to Supervectors," Speech Communication, Volume 52, Issue 1, January 2010, Pages 12-40, ISSN 0167-6393.
[3] Reynolds, Douglas A., "An overview of automatic speaker recognition technology," Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 4, pp. IV-4072 - IV-4075, 13-17 May 2002.
[4] Campbell, J.P., Jr., "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, Sep 1997.
[5] 李琳山, "Digital Speech Processing" course slides [Online]. Available: http://speech.ee.ntu.edu.tw/
[6] N. Zheng, P.C. Ching, and T. Lee, "Time-frequency analysis of vocal source signal for speaker recognition," in Proc. INTERSPEECH, 2004.