Speaker Recognition Systems:


Center for Speech and Language Technologies, Tsinghua University

Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Plan 2012-2013

Speaker Recognition Systems:

Paradigms and Challenges

Thomas Fang Zheng


In collaboration with: Linlin Wang and Xiaojun Wu


<Date>, <Venue>


About APSIPA


Asia-Pacific Signal & Information Processing Association



An emerging association to promote a broad spectrum of research and education activities in SIP



Mission: a non-profit organization with the following objectives:


Providing education, research and development exchange platforms for both
academia and industry;


Organizing common-interest activities for researchers and practitioners;


Facilitating collaboration with region-specific focuses and promoting leadership for worldwide events;


Disseminating research results and educational material via publications,
presentations, and electronic media;


Offering personal and professional career opportunities with development
information and networking



Established on October 5, 2009, officially registered in Hong Kong


APSIPA ASC (Annual Summit and Conference) starting from 2009


APSIPA Transactions on Signal & Information Processing


APSIPA Distinguished Lecture Program starting from Jan. 2012



http://www.apsipa.org


Outline


Introduction



Creation of Time-varying Voiceprint Database



The Discrimination-emphasized Mel-frequency-warping Method



Experimental Results



Conclusions & Future Work


Biometric Recognition


Technologies for measuring and analyzing a
person's physiological or behavioral
characteristics. These can be used to verify or
identify a person.



The term "biometrics" is derived from the Greek words bio (life) and metric (to measure).


Examples of Biometrics


Face


Fingerprint


Palmprint


Hand Geometry


Iris


Retina Scan


DNA


Signatures


Gait


Keystroke


Voiceprint


Rich Information Contained in Speech

Language Recognition

What language was spoken?

Accent Recognition

Where is he/she from?

Speech Recognition

What was spoken?

Gender Recognition

Male or Female?

Emotion Recognition

Positive? Negative?

Happy? Sad?

Speaker Recognition

Who spoke?



Speaker recognition (or voiceprint recognition) is the process of automatically identifying or verifying the identity of a person from his/her voice, using the characteristic vocal information included in speech. It enables access control of various services by voice. [Kunzel 94][Furui 97]


Various applications:


Access control (e.g., security control for confidential information, remote access to computers, information and reservation services);


Transaction authentication (e.g., telephone banking, telephone shopping);


Security and forensic applications (e.g., public security, criminal verification);


Rich transcription for conference meetings (e.g., "Who Spoke When" and "Who Spoke What" speaker diarization);


etc.

Speaker Recognition / Voiceprint Recognition



Speaker Identification


Determining which identity in a specified speaker set is
speaking during a given speech segment.


Closed-Set / Open-Set


Speaker Verification


Determining whether a claimed identity is speaking
during a speech segment. It is a binary decision task.


Speaker Detection


Determining whether a specified target speaker is
speaking during a given speech segment.


Speaker Tracking (Speaker Diarization = Who Spoke When)


Performing speaker detection as a function of time,
giving the timing index of the specified speaker.

Speaker Recognition Categories



Detection Error Trade-off (DET) Curve


A plot of error rates for binary classification systems,
plotting false rejection rate (FRR) vs. false acceptance rate
(FAR).



Equal Error Rate (EER)


The error rate corresponding to the location on a DET
curve where FAR and FRR are equal.



Minimum Detection Cost Function (MinDCF)


$C_{\text{det}} = C_{\text{Miss}} \times P_{\text{Miss}} \times P_{\text{Target}} + C_{\text{FalseAlarm}} \times P_{\text{FalseAlarm}} \times (1 - P_{\text{Target}})$


Performance Evaluation (for verification and open-set identification)
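To make these metrics concrete, the sketch below computes the FRR/FAR trade-off, the EER, and the minimum DCF from two score lists. The names target_scores and impostor_scores, the toy score distributions, and the cost/prior settings are illustrative assumptions rather than part of the lecture material.

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    """FRR and FAR at every candidate threshold (all observed scores)."""
    target = np.sort(np.asarray(target_scores))
    impostor = np.sort(np.asarray(impostor_scores))
    thresholds = np.unique(np.concatenate([target, impostor]))
    # FRR: fraction of target trials scoring below the threshold (rejected).
    frr = np.searchsorted(target, thresholds, side="left") / len(target)
    # FAR: fraction of impostor trials scoring at or above the threshold (accepted).
    far = 1.0 - np.searchsorted(impostor, thresholds, side="left") / len(impostor)
    return thresholds, frr, far

def eer(target_scores, impostor_scores):
    """Equal Error Rate: the point on the DET curve where FRR and FAR cross."""
    _, frr, far = det_points(target_scores, impostor_scores)
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0

def min_dcf(target_scores, impostor_scores, c_miss=1.0, c_fa=1.0, p_target=0.01):
    """Minimum of C_det = C_miss*P_miss*P_target + C_fa*P_fa*(1 - P_target)."""
    _, frr, far = det_points(target_scores, impostor_scores)
    dcf = c_miss * frr * p_target + c_fa * far * (1.0 - p_target)
    return dcf.min()

# Toy example (illustrative scores only).
rng = np.random.default_rng(0)
tgt = rng.normal(1.5, 1.0, 1000)   # genuine-trial scores
imp = rng.normal(0.0, 1.0, 10000)  # impostor-trial scores
print(f"EER = {eer(tgt, imp):.3%}, minDCF = {min_dcf(tgt, imp):.4f}")
```

With a real system, the same functions can be applied to the per-trial scores produced by any verifier.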


Open Issues for Speaker Recognition Research

[Furui 1997]


1. How can human beings correctly recognize speakers?


2. Is it useful to study the mechanism of speaker recognition by human beings?


3. Is it useful to study the physiological mechanism of speech production to get
new ideas for speaker recognition?


4. What feature parameters are appropriate for speaker recognition?


5. How can we fully exploit the clearly evident encoding of identity in prosody and other supra-segmental features of speech?


6. Is there any feature that can separate speakers whose voices sound identical,
such as twins or imitators?


7. How do we deal with long term variability in people's voices (ageing)?


8. How do we deal with short term alteration due to illness, emotion, fatigue, …?


9. What are the conditions that speaker recognition must satisfy to be practical?


10. What about combining speech and speaker recognition?


Furui, S., "Recent Advances in Speaker Recognition," Pattern Recognition Letters 18 (1997) 859-872


Performance Factors for Speaker Recognition


Factors affecting the speaker recognition system
performance:


The quality of the speech signal


The length of the training speech signal


The length of the testing speech signal


The size of the population tested by the system


The phonetic content of the speech signal


Key Issues for Robust Speaker Recognition


Cross Channel



Multiple Speakers



Background Noise



Emotions



Short Utterance



Time-Varying (or Ageing)


Time-Varying (or Ageing) Issue


In all these typical situations, training and testing
are usually separated by some period of time,
which poses a possible threat to speaker
recognition systems.

TIME GAP



"Ever-newer waters flow on those who step into the same rivers." -- Heraclitus


Open Questions



"Does the voice of an adult change significantly with time? If so, how?" [Kersta 1962]




"How to deal with the long-term variability in people's voices? Whether there was any systematic long-term variation that helped update speaker models to cope with the gradual changes in people's voices?" [Furui 1997]




"Voice changes over time, either in the short-term (at different times of day), the medium-term (times of the year), or in the long-term (with age)." [Bonastre et al. 2003]


Observations


Performance degradation in the presence of time intervals between training and testing


The longer the separation between the training and the testing recordings, the worse the performance. [Soong et al. 1985]


A significant loss in accuracy (4~5% in EER) between two sessions separated by 3 months was reported [Kato & Shimizu 2003], and ageing was considered to be the cause [Hebert 2008].



Few researchers have identified the exact reasons behind this time-varying phenomenon.


More enrollment data -- a solution?


Using training data with a larger time span [Markel 1979]


Performance can be improved.


The enrollment is quite time-consuming!


In some situations, it is impractical to obtain such data!



Augmenting previous enrollment data with accepted testing/recognition speech segments to retrain the speaker model [Beigi 2009, Beigi 2010]


Performance can be improved.


Initial training data should be kept for later use (storage-consuming)!


Ageing-dependent decision boundary -- a solution?


Using an ageing-dependent decision boundary in the score domain [Kelly 2011, Kelly 2012]


Performance can be improved.


How to determine the time lapse practically?



Model-updating (adaptation) -- a solution?


A simple and straightforward way [Lamel 2000, Beigi 2009, Beigi 2010]:


to update speaker models from time to time



It is effective to maintain representativeness.



However, it is costly, user-unfriendly, and sometimes perhaps unrealistic.



And the choice of features matters.



Efforts in the frequency domain …


The most essential way to stabilize performance is to extract acoustic features that are speaker-specific and, further, stable across sessions.



For a long time, this has been more of a dream than a reality!



Instead, some findings can be incorporated into existing techniques…


NUFCC [Lu & Dang 2007]: assigns frequency bands different resolutions according to their discrimination sensitivity for speaker-specific information.


The idea of mel-frequency-warping!


To emphasize frequency bands that are more sensitive to speaker-specific information, yet not so sensitive to time-related session-specific information.



Identify frequency bands that reveal high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for session-specific information.



Once these frequency bands are identified, more features
can be extracted within them by means of frequency
warping.



The Discrimination-emphasized Mel-frequency-warping method.


Outline


Introduction



Creation of Time-varying Voiceprint Database



The Discrimination-emphasized Mel-frequency-warping Method



Experimental Results



Conclusions & Future Work


MARP Corpus


A proper longitudinal database is necessary.


Time-related variability is the only focus.


The MARP corpus has been the only one published so far [Lawson 2009], though it involves other variabilities as well.



The MARP corpus


32 participants, 672 sessions from June 2005 to March 2008


10 minutes of free-flowing conversations for each session



"While the impact on speaker recognition accuracy between any two sessions is considerable, the long-term trend is statistically quite small."

"The detrimental impact is clearly not a function of ageing or of the voice changing within this timeframe."



In free-flowing conversations, speech contents are not fixed and a speaker's emotion, speaking style, or engagement can be easily influenced by his/her partner.



Hence, the creation of a voiceprint database that specifically focuses on the time-varying effect in speaker recognition is imperative for both research and practical applications.


Database Design Principles


The time-varying effect is the only focus; therefore, other factors should be kept as constant as possible throughout all recording sessions.


recording equipment, software, conditions, environment, and so on



In the database design, two major factors were
well considered:


prompt texts design, and


time intervals design.


Fixed Prompt Texts


Speakers were asked to read fixed prompt texts instead of holding free-style conversations.



Prompt texts were designed to remain unchanged
throughout all recording sessions.


To avoid or at least reduce the impact of speech
contents on speaker recognition accuracy.


In the form of sentences and isolated words.



100 Chinese sentences and 10 isolated Chinese words



The length of each sentence ranges from 8 to 30 Chinese
characters with an average of 15.



Each isolated Chinese word contained 2 to 5 Chinese
characters and was read five times in each session.


Of the 10 isolated words, 5 were unchanged throughout all
sessions just like the sentences, while


the other 5 changed from session to session and were reserved for future research of other purposes.


                Number covered in prompt texts    Total number    Percentage (%)
Initials        23                                23              100
Finals          38                                38              100
di-IFs          1,183                             1,523           78

Table 1. Acoustic coverage of prompt texts


Gradient Time Intervals


Gradient time intervals were used.


There is no precedent reference for time-interval design.


It would be costly and perhaps unnecessary to record at a fixed-length time interval more than 10 times just to obtain a possible trend.



Initial sessions can have shorter time intervals, while subsequent sessions have longer and longer time intervals.


The impacts of different time intervals can then be easily analyzed.



16 sessions from January 2010 to 2012


Five different time intervals are used: one week, one
month, two months, four months and half a year, as
illustrated in the figure below.


The time intervals were designed to avoid recordings during summer or winter vacations.


In actual recording, it is unrealistic to have all speakers record on exactly one specific day, so each session day is relaxed to a session interval.

Figure 1. Illustration of different time intervals and session days (sessions plotted along the time axis)


Speakers


60 first-year students: 30 male and 30 female.



Born between 1989 and 1993, with a majority born in 1990.



From various departments


such as computer science, biology, English,
humanities, and journalism



All of them speak standard Chinese well.


Recording conditions


Recording took place in an ordinary room in the laboratory.


no burst noise, only low-level environmental noise.


Speakers were asked to read the prompt texts at a normal speaking rate, while the recording volume could be controlled by the recording software.


Most speakers could smoothly complete a session in about 25 minutes.


Speech signals were digitized at 8 kHz and 16 kHz sampling rates simultaneously, with 16-bit precision.



10 recording sessions have been finished so far.


Database evaluation -- a first and quick look


Experimental setup


1024-mixture GMM-UBM system with 32-dim MFCCs (a simplified sketch of this paradigm follows Figure 2)



Experimental results


The system performs best when training and testing
utterances are taken from the same session.


However, performance gets worse and worse as the recording date difference between training and testing gets bigger.

Figure 2. EER curves when using different sessions for model training
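For readers who want to reproduce a baseline of this kind, here is a minimal sketch of the GMM-UBM paradigm: UBM training, MAP adaptation of the means towards a speaker's enrollment frames, and average log-likelihood-ratio scoring. It uses scikit-learn's GaussianMixture with a small mixture count and random placeholder "features"; it is not the authors' exact system, which used 1024 mixtures and 32-dimensional MFCCs.

```python
import copy
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=64):
    """Train the Universal Background Model on pooled background frames."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=200, random_state=0)
    ubm.fit(background_features)
    return ubm

def map_adapt_means(ubm, enroll_features, relevance=16.0):
    """MAP-adapt only the UBM means towards one speaker's enrollment frames."""
    resp = ubm.predict_proba(enroll_features)        # (T, M) responsibilities
    n_m = resp.sum(axis=0) + 1e-10                   # soft counts per mixture
    e_x = resp.T @ enroll_features / n_m[:, None]    # per-mixture data means
    alpha = (n_m / (n_m + relevance))[:, None]       # adaptation coefficients
    speaker_gmm = copy.deepcopy(ubm)
    speaker_gmm.means_ = alpha * e_x + (1.0 - alpha) * ubm.means_
    return speaker_gmm

def llr_score(speaker_gmm, ubm, test_features):
    """Average per-frame log-likelihood ratio used as the verification score."""
    return float(np.mean(speaker_gmm.score_samples(test_features)
                         - ubm.score_samples(test_features)))

# Toy usage with random stand-in "MFCC" frames (replace with real features).
rng = np.random.default_rng(0)
ubm_frames = rng.normal(size=(5000, 32))
enroll_frames = rng.normal(loc=0.3, size=(1000, 32))
test_frames = rng.normal(loc=0.3, size=(300, 32))

ubm = train_ubm(ubm_frames)
spk = map_adapt_means(ubm, enroll_frames)
print("LLR score:", llr_score(spk, ubm, test_frames))
```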


Outline


Introduction



Creation of Time-varying Voiceprint Database



The Discrimination-emphasized Mel-frequency-warping Method



Experimental Results



Conclusions & Future Work


How to find IMPORTANT frequency bands?


The proposed solution is to highlight, in feature extraction, the frequency bands that reveal high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for session-specific information.



How to determine the discrimination sensitivity of each
frequency band?


The F-ratio serves as a criterion to produce the discrimination scores.


How to perform frequency warping to highlight target
frequency bands?


Frequency warping on the basis of the mel scale



F-ratio [Wolf 1972]


The ratio of the between-group variance to the within-group variance.



A higher F-ratio value means better feature selection for the target grouping.



That is to say, a feature with a higher F-ratio possesses higher discrimination sensitivity with respect to the target grouping.
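As a minimal sketch of this criterion (assuming groups is a list of 1-D arrays holding one feature's values per group, a hypothetical layout rather than anything from the lecture):

```python
import numpy as np

def f_ratio(groups):
    """F-ratio for one feature: between-group variance / within-group variance.

    groups: list of 1-D arrays, each holding the feature's values for one group
    (e.g., one speaker's band-k log energies within a session).
    """
    group_means = np.array([g.mean() for g in groups])
    between = group_means.var()                    # variance of the group means
    within = np.mean([g.var() for g in groups])    # average in-group variance
    return between / within

# Hypothetical example: three groups drawn around different means.
rng = np.random.default_rng(0)
groups = [rng.normal(mu, 1.0, 200) for mu in (0.0, 1.0, 2.0)]
print("F-ratio:", f_ratio(groups))
```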



F-ratio in time-varying speaker recognition tasks



There exist two kinds of grouping: by speakers for each
session and by sessions for each speaker.


The whole frequency range is divided into K frequency bands uniformly.


Linear-frequency-scale triangular filters are used to process the power spectrum of utterances.



Two F-ratio values are obtained for each frequency band.


Figure 3. An illustration of the two kinds of grouping (by speakers within each session, and by sessions within each speaker)

For frequency band $k$, let $x^{(k)}_{i,s,j}$ be the band-$k$ feature of frame $j$ from speaker $i$ ($i=1,\dots,M$) in session $s$ ($s=1,\dots,S$), $N_{i,s}$ the number of such frames, $\mu^{(k)}_{i,s}$ their mean, $\bar{\mu}^{(k)}_{s}$ the mean of $\mu^{(k)}_{i,s}$ over speakers in session $s$, and $\bar{\mu}^{(k)}_{i}$ the mean of $\mu^{(k)}_{i,s}$ over sessions for speaker $i$. Then

$$\text{F-ratio-spk}^{(k)}_{s} = \frac{\frac{1}{M}\sum_{i=1}^{M}\left(\mu^{(k)}_{i,s}-\bar{\mu}^{(k)}_{s}\right)^{2}}{\frac{1}{M}\sum_{i=1}^{M}\frac{1}{N_{i,s}}\sum_{j=1}^{N_{i,s}}\left(x^{(k)}_{i,s,j}-\mu^{(k)}_{i,s}\right)^{2}},$$

$$\text{F-ratio-ssn}^{(k)}_{i} = \frac{\frac{1}{S}\sum_{s=1}^{S}\left(\mu^{(k)}_{i,s}-\bar{\mu}^{(k)}_{i}\right)^{2}}{\frac{1}{S}\sum_{s=1}^{S}\frac{1}{N_{i,s}}\sum_{j=1}^{N_{i,s}}\left(x^{(k)}_{i,s,j}-\mu^{(k)}_{i,s}\right)^{2}},$$

and the per-band values are averaged over sessions and over speakers, respectively:

$$\text{F-ratio-spk}^{(k)} = \frac{1}{S}\sum_{s=1}^{S}\text{F-ratio-spk}^{(k)}_{s}, \qquad \text{F-ratio-ssn}^{(k)} = \frac{1}{M}\sum_{i=1}^{M}\text{F-ratio-ssn}^{(k)}_{i}.$$


For each frequency band k, a discrimination score is defined as:






Target frequency bands with higher discrimination scores should be assigned a proper warping factor to increase their frequency resolution: neither too small (they would not be emphasized) nor too big.

$$\text{discrim\_score}^{(k)} = \frac{\text{F-ratio-spk}^{(k)}}{\text{F-ratio-ssn}^{(k)}} \qquad (1)$$
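Combining the two groupings with Eq. (1), the sketch below computes per-band discrimination scores from a hypothetical nested layout band_energies[speaker][session] of (frames x bands) arrays; the layout and the toy data are assumptions for illustration only.

```python
import numpy as np

def f_ratio(groups):
    """Between-group variance of group means over average within-group variance."""
    means = np.array([g.mean(axis=0) for g in groups])        # (n_groups, K)
    between = means.var(axis=0)
    within = np.mean([g.var(axis=0) for g in groups], axis=0)
    return between / within                                    # per-band F-ratio

def discrimination_scores(band_energies):
    """band_energies[i][s]: (N_frames, K) band features of speaker i, session s.

    Returns per-band discrim_score = F-ratio-spk / F-ratio-ssn, as in Eq. (1).
    """
    n_speakers = len(band_energies)
    n_sessions = len(band_energies[0])
    # Grouping by speakers within each session, averaged over sessions.
    f_spk = np.mean([f_ratio([band_energies[i][s] for i in range(n_speakers)])
                     for s in range(n_sessions)], axis=0)
    # Grouping by sessions within each speaker, averaged over speakers.
    f_ssn = np.mean([f_ratio([band_energies[i][s] for s in range(n_sessions)])
                     for i in range(n_speakers)], axis=0)
    return f_spk / f_ssn

# Toy data: 4 speakers x 3 sessions x 200 frames x K=8 bands.
rng = np.random.default_rng(0)
data = [[rng.normal(loc=i, size=(200, 8)) for _ in range(3)] for i in range(4)]
print(discrimination_scores(data))
```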


How to EMPHASIZE? Mel frequency warping (MFW)!


Warping strategies:


Uniform warping of the target frequency bands with discrimination scores above a threshold (a sketch follows Figure 4 below).








Non-uniform warping of the whole frequency range according to the bands' discrimination scores.

Figure 4. The relationship between Hz, Mel scale, and MFW scale
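Below is a rough sketch of the uniform-warping strategy under simplifying assumptions: bands flagged in is_target are stretched by warp_factor on top of the standard mel scale, so that filter centre frequencies spaced uniformly on the warped (MFW) axis become denser inside the emphasized bands. All names and values are illustrative and not the authors' implementation.

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def warped_scale(f_hz, band_edges_hz, is_target, warp_factor=3.0):
    """Map frequency (Hz) to a mel-frequency-warped (MFW) scale.

    Each of the K uniform bands contributes its mel width, multiplied by
    warp_factor if the band is a target band, accumulated up to f_hz.
    """
    f_hz = np.atleast_1d(f_hz).astype(float)
    out = np.zeros_like(f_hz)
    for lo, hi, tgt in zip(band_edges_hz[:-1], band_edges_hz[1:], is_target):
        seg = np.clip(f_hz, lo, hi) - lo                  # portion of f inside band
        mel_width = hz_to_mel(lo + seg) - hz_to_mel(lo)   # mel width covered
        out += mel_width * (warp_factor if tgt else 1.0)
    return out

# Example: 8 uniform bands up to 4 kHz, bands 3 and 4 emphasized.
edges = np.linspace(0, 4000, 9)
targets = [False, False, True, True, False, False, False, False]
# Place 24 filter centre frequencies uniformly on the warped axis.
warped_max = warped_scale(4000.0, edges, targets)[0]
centres_warped = np.linspace(0, warped_max, 26)[1:-1]
# Invert numerically to get centre frequencies in Hz (denser in target bands).
grid_hz = np.linspace(0, 4000, 4001)
grid_warped = warped_scale(grid_hz, edges, targets)
centres_hz = np.interp(centres_warped, grid_warped, grid_hz)
print(np.round(centres_hz))
```

The resulting centre frequencies could then drive a triangular filterbank in place of the standard mel filterbank when computing WMFCC-style features.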


Figure 5. A comparison of MFCC and WMFCC extraction procedures


Outline


Introduction



Creation of Time-varying Voiceprint Database



The Discrimination-emphasized Mel-frequency-warping Method



Experimental Results



Conclusions & Future Work


The discrimination scores for different frequency bands …

Figure 7. Discrimination scores of frequency bands

Warping factor    1        2       3       4       5
EER (%)           10.06    8.69    8.14    8.22    8.36

Table 3. Performance comparison of WMFCC with different warping factors in average EER


Comparison

Figure 7. Performance comparison between MFCC and WMFCC in EER (EER per training session, 1st to 10th, plus the average)

                      2nd-session EER (%)   Average EER (%)   Degradation Degree (%)   Standard Deviation
MFCC                  6.45                  10.06             55.97                    1.83
WMFCC                 5.38                  8.14              51.30                    1.32
Reduction Rate (%)    16.6                  19.1              8.9                      27.9

Table 3. Performance comparison between MFCC and WMFCC in degradation degree


Outline


Introduction



Creation of Time-varying Voiceprint Database



The Discrimination-emphasized Mel-frequency-warping Method



Experimental Results



Conclusions & Future Work



A Discrimination-emphasized Mel-frequency-warping method is proposed for time-varying speaker recognition.



Experimental results on the time-varying voiceprint database show that this method can not only improve speaker recognition performance, with a 19.1% reduction in average EER, but also alleviate the performance degradation caused by the time-varying effect, with an 8.9% reduction in degradation degree. [WANG 2011, APSIPA ASC 2011 Excellent Student Paper Award]



Future work


Further experiments are needed to test the data-dependency by using other databases.


More investigation and experimentation are required to determine whether the discrimination-emphasized idea can be applied to other speech features and, further, to speaker modeling techniques.


Thanks!


http://cslt.riit.tsinghua.edu.cn

http://www.apsipa.org

fzheng@tsinghua.edu.cn


Update ... Telephone banking application

2009: d-Ear Technologies (得意音通) and Tsinghua University jointly undertook the voiceprint identity authentication system project for China Construction Bank's 95533 telephone banking service.

2010: The project passed acceptance.

2011: In November 2011, China Construction Bank confirmed that the system had been running normally for a full year. China Construction Bank thus became the first bank in China's financial sector to apply voiceprint identity authentication.

China Merchants Bank -- the next one