Speech and Audio Research Laboratory
School of Engineering Systems
IMPROVING AUTOMATIC SPEAKER
VERIFICATION USING SVM TECHNIQUES
Mitchell McLaren
B.CompSysEng(Hons)
SUBMITTED AS A REQUIREMENT OF
THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT THE
QUEENSLAND UNIVERSITY OF TECHNOLOGY
BRISBANE, QUEENSLAND
OCTOBER 2009
Keywords
Speaker recognition, speaker verification, support vector machines, Gaussian mixture models, session variation, unsupervised adaptation, data selection.
Abstract
Automatic recognition of people is an active field of research with important forensic and security applications. In these applications, it is not always possible for the subject to be in close proximity to the system. Voice represents a human behavioural trait which can be used to recognise people in such situations. Automatic Speaker Verification (ASV) is the process of verifying a person's identity through the analysis of their speech, and enables recognition of a subject at a distance over a telephone channel, whether wired or wireless.
A significant amount of research has focussed on the application of Gaussian mixture model (GMM) techniques to speaker verification systems, providing state-of-the-art performance. GMMs are a type of generative classifier trained to model the probability distribution of the features used to represent a speaker.
Recently introduced to the field of ASV research is the support vector machine (SVM). An SVM is a discriminative classifier requiring examples from both positive and negative classes to train a speaker model. The SVM is based on margin maximisation, whereby a hyperplane attempts to separate classes in a high-dimensional space. SVMs applied to the task of speaker verification have shown high potential, particularly when used to complement current GMM-based techniques in hybrid systems.
This work aims to improve the performance of ASV systems using novel and innovative SVM-based techniques. Research was divided into three main themes: session variability compensation for SVMs; unsupervised model adaptation; and impostor dataset selection.
The first theme investigated the differences between the GMM and SVM domains for the modelling of session variability, an aspect crucial for robust
speaker verication.Techniques developed to improve the robustness of GMM-
based classication were shown to bring about similar benets to discriminative
SVMclassication through their integration in the hybrid GMMmean supervec-
tor SVM classier.Further,the domains for the modelling of session variation
were contrasted to nd a number of common factors,however,the SVM-domain
consistently provided marginally better session variation compensation.Mini-
mal complementary information was found between the techniques due to the
similarities in how they achieved their objectives.
The second theme saw the proposal of a novel model for the purpose of session
variation compensation in ASV systems.Continuous progressive model adapta-
tion attempts to improve speaker models by retraining them after exploiting all
encountered test utterances during normal use of the system.The introduction
of the weight-based factor analysis model provided signicant performance im-
provements of over 60% in an unsupervised scenario.SVM-based classication
was then integrated into the progressive system providing further benets in per-
formance over the GMM counterpart.Analysis demonstrated that SVMs also
hold several benecial characteristics to the task of unsupervised model adapta-
tion prompting further research in the area.
In pursuing the nal theme,an innovative background dataset selection tech-
nique was developed.This technique selects the most appropriate subset of ex-
amples from a large and diverse set of candidate impostor observations for use
as the SVM background by exploiting the SVM training process.This selection
was performed on a per-observation basis so as to overcome the shortcoming of
the traditional heuristic-based approach to dataset selection.Results demon-
strate the approach to provide performance improvements over both the use of
the complete candidate dataset and the best heuristically-selected dataset whilst
being only a fraction of the size.The rened dataset was also shown to gener-
alise well to unseen corpora and be highly applicable to the selection of impostor
cohorts required in alternate techniques for speaker verication.
Contents
List of Tables
List of Figures
List of Abbreviations
Authorship
Acknowledgements
1 Introduction
1.1 The Voice as a Biometric
1.1.1 Speaker Verification
1.2 Aims and Scope
1.3 Thesis Structure
1.4 Original Contributions
1.4.1 Session Variability Compensation for SVMs
1.4.2 Improving Continuous Unsupervised Speaker Model Adaptation
1.4.3 Impostor Dataset Selection
1.5 Publications
2 An Overview of Speaker Verification Technology
2.1 Introduction
2.2 Evaluation of Speaker Verification
2.2.1 The Switchboard Series of Corpora
2.2.2 The Fisher English Speech Corpus
2.2.3 The Mixer Corpora
2.2.4 NIST Speaker Recognition Evaluation Protocols
2.2.5 Performance Measures
2.3 Audio Speech Processing
2.3.1 The Short-time Cepstrum
2.3.2 Robust Acoustic Feature Extraction
2.3.3 Speech Activity Detection
2.3.4 High-Level (Long-term) Features
2.4 Gaussian Mixture Speaker Modelling
2.4.1 Maximum Likelihood Estimation
2.4.2 Maximum A Posteriori Estimation
2.4.3 The GMM-UBM Verification System
2.5 Normalisation Techniques
2.5.1 Model-based Normalisation
2.5.2 Score-based Normalisation
2.6 Session Variability Compensation in GMMs
2.6.1 Inter-Session Variability Modelling
2.6.2 The Factor Analysis Model
2.7 Baseline GMM-UBM Speaker Verification System
2.8 Summary
3 Support Vector Machines for Speaker Verification
3.1 Introduction
3.2 Support Vector Machines
3.2.1 The Linearly Separable Case
3.2.2 Hyperplane-based Classification
3.2.3 Non-Separable Data
3.2.4 The SVM Kernel
3.3 Speaker Modelling with SVMs
3.3.1 The Background Dataset
3.3.2 Frame-Based Kernels
3.3.3 Sequence Kernels
3.4 Normalisation Techniques
3.4.1 Score-based Normalisation
3.4.2 Kernel-based Normalisation
3.5 Session Variability Compensation in SVMs
3.5.1 Nuisance Attribute Projection
3.5.2 Scatter-Difference NAP
3.5.3 Within-Class Covariance Normalisation
3.6 Fusion of Speaker Verification Systems
3.6.1 Score-level Fusion
3.6.2 System-level Fusion
3.6.3 Kernel-based Fusion
3.7 Baseline SVM Speaker Verification System
3.8 Summary
4 Comparing Session Variation Modelling Domains for GMM Mean Supervector SVMs
4.1 Introduction
4.2 Speaker Modelling in the GMM Mean Supervector Space
4.2.1 Gaussian Mixture Modelling
4.2.2 The GMM Mean Supervector
4.2.3 SVM-based Classification using GMM Mean Supervectors
4.2.4 The Background Data Scaling Kernel
4.3 Modelling Session Variation
4.3.1 Inter-session Variability Modelling
4.3.2 Nuisance Attribute Projection
4.3.3 Relationship between ISV modelling and NAP
4.4 Experiments
4.4.1 Protocol
4.4.2 Modelling Session Variation in the GMM Space
4.4.3 Comparison of the GMM and SVM Space
4.4.4 Combined System and Score Fusion
4.5 Summary
5 Improving Continuous Unsupervised Speaker Model Adaptation
5.1 Introduction
5.2 Relation to Previous Work
5.2.1 Progressive Model Adaptation
5.2.2 Confidence-based Unsupervised Model Adaptation
5.2.3 Session Variability Compensation
5.2.4 Score Shift
5.3 Continuous Progressive Speaker Adaptation in GMMs
5.3.1 Confidence Measure Estimation
5.3.2 Speaker Model Adaptation
5.3.3 System Architecture
5.4 Reducing Score Shift
5.5 The Weight-based Factor Analysis Model
5.6 Continuous Progressive SVM Classification using GMM Supervectors
5.6.1 Weighted Data in SVMs
5.6.2 Single Target Supervector
5.6.3 Multiple Target Supervector
5.7 Experiments
5.7.1 Protocol
5.7.2 Evaluation of Weight-based Factor Analysis Model
5.7.3 SVM-based Classification
5.7.4 Comparison of Score-shift in Classifiers
5.7.5 Robust Confidence Score Estimation
5.7.6 Unsupervised Adaptation in the NIST 2008 SRE
5.7.7 Discussion
5.8 Summary
6 The Importance of the SVM Background Dataset
6.1 Introduction
6.2 Experimental Protocol
6.2.1 GMM Mean Supervector SVM Configuration
6.2.2 Datasets and Evaluation Corpora
6.2.3 Score Normalisation
6.3 Heuristic Background Dataset Selection
6.3.1 Development Evaluations
6.3.2 Shortcomings of Heuristic Dataset Selection
6.4 Data-Driven Background Dataset Selection
6.4.1 Support Vector Frequency
6.4.2 Background Dataset Refinement
6.5 Experiments
6.5.1 Performance of Refinement Efficiency
6.5.2 Generalisation of Background Dataset Refinement
6.5.3 Common Impostor Selection Between 2006 and 2008 Development Data
6.5.4 Language Dependence
6.5.5 Refinement of Small Candidate Datasets
6.5.6 Iterative Dataset Refinement
6.5.7 Ranking via Support Vector Coefficients
6.6 Background Dataset Refinement Characteristics
6.6.1 Ranking by Support Vector Frequency
6.6.2 Impostor Dispersion in the SVM Feature Space
6.6.3 Database Contribution to Refined Background
6.6.4 Ideal Impostor Audio Characteristics
6.7 Summary
7 Experiments in Data-Driven Impostor Selection for Speaker Verification Systems
7.1 Introduction
7.2 Rened SVM T-norm Score Normalisation Cohort.........167
7.2.1 Test Score Normalisation...................168
7.2.2 Experimental Protocol....................169
7.2.3 Intersecting Background and T-norm Datasets.......170
7.2.4 Disjoint Background and T-norm Datasets.........172
7.2.5 Discussion...........................175
7.3 Rened GMM Score Normalisation Cohorts............176
7.3.1 Score Normalisation Techniques...............177
7.3.2 Experimental Protocol....................178
7.3.3 Development Evaluations...................180
7.3.4 Generalisation of Rened Impostor Datasets........183
7.3.5 Discussion...........................183
7.4 Renement of Multiple SVM-based Feature Sets..........184
7.4.1 SVM-based Feature Sets...................186
7.4.2 Experimental Protocol....................188
7.4.3 Development Evaluations...................190
7.4.4 Generalisation of Rened Datasets.............192
7.4.5 Score-level Fusion.......................195
7.4.6 Exploiting Inter-Feature Impostor Suitability Metrics...196
7.4.7 Discussion...........................198
7.5 Summary...............................199
8 Conclusions and Future Directions 201
8.1 Introduction..............................201
8.2 Session Variability Compensation for SVMs............201
8.2.1 Original Contributions....................202
8.2.2 Future Directions.......................203
8.3 Improving Continuous Unsupervised Speaker Model Adaptation.203
8.3.1 Original Contributions....................204
8.3.2 Future Directions.......................205
8.4 The Importance of the SVM Background..............206
8.4.1 Data-Driven Background Dataset Selection
8.4.2 Experiments in Impostor Dataset Selection
8.4.3 Conclusions
8.5 Summary
Bibliography
List of Tables
2.1 Standard detection cost function (DCF) parameter values
3.1 Common frame-based SVM kernels suited to speaker verification
4.1 Unnormalised and T-normalised performance for 1-sided NIST 2005 SRE for GMM-UBM and GMM mean supervector SVM systems
4.2 Unnormalised and T-normalised performance for 1-sided NIST 2006 SRE for GMM-UBM and GMM mean supervector SVM systems
4.3 T-normalised minimum DCF and EER results for 1-sided GMM mean supervector SVM trials evaluated using the NIST 2005 and 2006 SRE
5.1 Non-adaptive SVM-based trials on a subset of SRE'05 speakers comparing the modelling of speakers using single and multiple target GMM mean supervectors
5.2 GMM-based progressive NIST 2005 trials with and without the weight-based factor analysis model
5.3 GMM and SVM-based results for progressive adaptation of the 1-sided, male trials from the NIST 2005 SRE using the weight-based factor analysis model
5.4 Lower-thresholding the WMAP confidence measurement in the male SRE'05 trials using the progressive STS-SVM system with the weight-based factor analysis model
5.5 TZ-normalised results for the STS-SVM configuration employing the weight-based FA model on the unsupervised adaptation mode of the SRE'08
6.1 Candidate impostor examples available from each data source
6.2 T-normed results from heuristic evaluation of candidate background datasets in the all-language and English-only NIST 2006 SRE
6.3 Performance statistics obtained from 1-sided, all-language NIST 2006 and 2008 SREs when using the complete, heuristically chosen and refined background datasets for SVM training
6.4 Performance statistics obtained from 1-sided, English-only NIST 2006 and 2008 SREs when using the complete, heuristically chosen and refined background datasets for SVM training
6.5 Performance on NIST 2008 and 2006 SREs using the complete background dataset and refined dataset of 750 examples selected using SRE'08 as development data
6.6 Performance with matched and unmatched language conditions between development dataset and test conditions
6.7 Performance obtained on 1-sided, all-language NIST 2006 and 2008 SREs when using complete and refined Switchboard 2 background datasets for SVM training
6.8 Performance obtained on the 1-sided, all-language NIST 2006 and 2008 SREs using the complete, and refined background datasets ranked using cumulative support vector coefficients
6.9 Mean and standard deviation of active speech length detected in the 100 highest and 100 lowest-ranking impostor observations
6.10 T-normed results from all-language and English-only SRE'06 and SRE'08 when compiling the background dataset based on active speech duration
6.11 Percentage (%) of utterances from native English, non-native English and non-English speakers in refined and reverse-refined background datasets
6.12 T-normed results from all-language and English-only SRE'06 and SRE'08 when dividing the candidate dataset by language
7.1 Candidate impostor examples available from each data source
7.2 Performance on NIST 2006 SRE when using full dataset B and best refined intersecting T-norm and background datasets
7.3 Performance from NIST 2008 SRE using full and best refined T-norm and background datasets selected based on NIST 2006 evaluations
7.4 Performance on NIST 2006 and NIST 2008 SRE using full and best refined disjoint T-norm and background datasets
7.5 Performance obtained on 1-sided, English SRE'06 when using full, heuristically selected and refined impostor datasets for score normalisation
7.6 Performance obtained on 1-sided, English SRE'08 when using full, heuristically selected and refined impostor datasets for score normalisation
7.7 Candidate impostor examples available from each data source
7.8 T-normalised minimum DCF and EER obtained from 1-sided, English-only NIST 2006 evaluations when using the complete and refined background datasets in different SVM configurations
7.9 Min. DCF and EER obtained from 1-sided, English-only NIST 2008 evaluations when using the complete and refined background datasets in different SVM configurations
7.10 Score-level fusion of GMM-svec, GLDS, N-gram and MLLR SVM-based configurations when evaluated using complete candidate datasets and SVCoef refined datasets on the English-only SRE'06 and SRE'08
7.11 Number of impostor examples that appear in the top 2000 from each SVM feature set. Results for male subset
7.12 GLDS performance obtained when combining the impostor suitability metrics from alternate systems for the refinement process
List of Figures
2.1 Data flow through a typical automatic speaker verification (ASV) system
2.2 Decision Cost Function (DCF) plot of system operating characteristics
2.3 Feature extraction process in the baseline GMM-UBM speaker verification system
3.1 An example of SVM components trained using (a) linearly separable data and (b) non-separable data
3.2 Data flow through the baseline GMM mean supervector SVM speaker verification system
4.1 DET plots comparing GMM-UBM and GMM mean supervector SVM systems, with and without ISV modelling, on the (a) NIST 2005 SRE and (b) NIST 2006 SRE
5.1 WMAP confidence measures calculated from TZ-normalised LLRs
5.2 Data flow in the GMM-based continuous progressive model adaptation system as trial h is encountered
5.3 Data flow of the Single Target Supervector (STS) SVM-based configuration for continuous progressive model adaptation
5.4 Data flow of the Multiple Target Supervector (MTS) SVM-based configuration for continuous progressive model adaptation
6.1 T-normed performance for 1-sided, all-language SRE'06 as the complete background dataset was refined compared to reverse-refinement
6.2 Unnormalised and T-normed minimum DCF in 1-sided, all-language NIST 2006 and 2008 SRE as the candidate dataset was refined
6.3 Percentage of examples common to background datasets refined using NIST 2006 and NIST 2008 observations as development data
6.4 Performance offered through iteratively removing the lowest-ranking examples during refinement compared to the standard refinement approach on the 1-sided, all-language (a) development and (b) unseen corpora
6.5 Iteratively retaining the highest-ranking examples during refinement compared to the standard refinement approach on the SRE'06 and SRE'08
6.6 Comparison of performance offered by refined datasets selected via different impostor suitability metrics on the 1-sided, all-language evaluation of the (a) development and (b) unseen corpora
6.7 Support vector frequencies of ranked examples from the complete background datasets
6.8 Average inter-example distance in the background dataset as it is both refined and reverse-refined
6.9 Average hypervolume radii to encompass a given proportion of observations from each impostor example in the background dataset as it is refined
6.10 Average inter-example distance in the background dataset as it is both refined and reverse-refined using the candidate examples as development data (self-refinement)
6.11 Comparison of SRE'06 performance between standard refinement using SRE'06 as development data and self-refinement using the candidate impostor examples as development data
6.12 Contribution (%) of data sources to the complete and refined background datasets
7.1 Min. DCF and EER on NIST 2006 SRE when performing data-driven impostor selection of intersecting background and T-norm datasets
7.2 Min. DCF on NIST 2008 SRE when using refined intersecting background and T-norm datasets ranked using NIST 2006 data
7.3 EER of GMM-based ZT-normalised scores on NIST 2006 SRE when varying the size of the refined Z and T-norm datasets
7.4 The effect of refinement using support vector frequencies (SVFreq) and cumulative support vector coefficients (SVCoef) on the minimum DCF and EER of the GLDS system in the development 2006 SRE and unseen 2008 SRE
List of Abbreviations
ASV Automatic Speaker Verification
CMS Cepstral Mean Subtraction
DARPA Defense Advanced Research Projects Agency
DCF Detection Cost Function
DCT Discrete Cosine Transform
DET Detection Error Trade-off plot
EARS Effective, Affordable, Reusable Speech-to-text project
EDT Extended Data Task
EER Equal Error Rate
ELLR Expected Log-Likelihood Ratio
E-M Expectation-Maximisation algorithm
FA Factor Analysis
GLDS Generalised Linear Discriminant Sequence
GMM Gaussian Mixture Model
GMM-UBM The GMM with UBM verification structure
HMM Hidden Markov Model
ISV Inter-session Variation
KL Kullback-Leibler
KPCA Kernel Principal Component Analysis
LDC Linguistic Data Consortium
LFCC Linear Frequency Cepstral Coefficients
LLR Log-Likelihood Ratio
LP Linear Predictor
LPCC Linear Predictive Cepstral Coefficients
MAP Maximum A Posteriori
MFCC Mel-Frequency Cepstral Coefficients
ML Maximum Likelihood
MLLR Maximum Likelihood Linear Regression
MMSE Minimum Mean Squared Error
MTS Multiple Target Supervector
NAP Nuisance Attribute Projection
NERF Non-uniform Extraction Region Features
NIST National Institute of Standards and Technology
PCA Principal Component Analysis
PFS Parametric Feature Set
PLP Perceptual Linear Predictive coefficients
PPRLM Parallel Phonetic Recognition with Language model
QUT Queensland University of Technology
RASTA RelAtive SpecTrA
RBF Radial Basis Function
ROC Receiver Operating Characteristic
SAIVT Speech,Audio,Image and Video Technology
SMS Speaker Model Synthesis
SRE NIST Speaker Recognition Evaluation
STS Single Target Supervector
SVM Support Vector Machine
UBM Universal Background Model
VQ Vector Quantisation
WCCN Within-Class Covariance Normalisation
WMAP World Maximum A Posteriori
Authorship
The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher education institution. The work in Chapter 5 was a joint effort with the Laboratoire Informatique d'Avignon at the Université d'Avignon. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.
Signed:
Date:
Acknowledgements
First and foremost, I'd like to thank my Lord and personal saviour, Jesus Christ, for giving me the family, friends and abilities needed to achieve this accomplishment in my life. My wife Luana: thank you so much for your love and support throughout the past few years; I couldn't have done it without you by my side. Thanks to my family (both immediate and in-law families) for all your encouragement along the way. My friends: thanks for giving me a much needed break in the form of a social life during the writing up of the thesis.
To Sridha, thanks for providing me with the opportunity to study as one of your students and also for supplying sufficient research necessities including equipment, IT support and funding for conference travel. To my assistant supervisors, thank you for providing direction and answering any questions whenever my focus was a little fuzzy. In particular, thanks Robbie Vogt for trying to teach me how to write a paper as well as yourself (though I don't think I ever achieved that goal!) and for answering the thousands of questions I threw your way even when you didn't feel like it. To all the members of the SAIVT lab, the many discussions we have had (work related or not) and the lunchtime card games have, in some way or another, helped me to complete the work in this thesis; thank you.
Lastly, I'd like to thank Jean-François Bonastre for providing the opportunity for my wife and me to live in beautiful Avignon, France for six months while I studied at the Laboratoire Informatique d'Avignon (LIA). It was a pleasure to work with you and Driss to produce the work presented in Chapter 5 of this dissertation.
Chapter 1
Introduction
1.1 The Voice as a Biometric
The constant progress of today's technologies makes it vital that relevant security measures are continually developed to ensure that items of importance are not left vulnerable to unauthorised access. Such security measures are necessary not only to protect digitally stored personal information, but also to safeguard countries and their people against threats of terrorism.
The increasing emphasis being placed on security has led to a focus on the development of biometric authentication technologies. Biometrics are the physical and behavioural traits that belong to an individual. A subject can be identified using any number of biometric traits such as face and iris recognition, fingerprint and hand geometry, walking gait, and voice analysis. Recognition through each of these traits poses different requirements on the person being identified and results in differing levels of accuracy. Consequently, the choice of authentication method is dependent on its intended application.
Speech as a biometric, often referred to as speaker recognition, has many desirable characteristics that make it useful for a wide variety of applications. Being one of the most natural human actions, speech can be easily acquired using non-invasive techniques while demanding little or no requirements of the person being identified. This is particularly important in forensics-oriented applications where, for example, an audio recording of a voiced threat may be compared to
the speech of a number of suspects. Regarding border security and national defense, the location of a known terrorist can be tracked through the analysis of speech from localised audio recordings. While other biometric technologies may require expensive, specialised equipment or direct interaction with the person to be authenticated (such as allowing their face or fingerprint to be scanned), speech can be readily captured despite the physical distance separating the user from the authentication system using telephony-based technologies, both wired and wireless.
The use of telephony-based speech for authentication is of particular interest to the private business sector. Banking corporations are one example of such businesses actively seeking out better methods of providing remote yet secure services to their clients. Speaker verification is a particularly appealing method of ensuring remote security by authenticating the identity of a client prior to allowing transactions, credit card payments and other security-oriented processes. The need to verify a client's identity remotely via telephone has increased the demand for the development and deployment of appropriate speaker verification technologies.
1.1.1 Speaker Verification
Speaker verification is the process of determining whether or not a given sample of speech originated from the target speaker in question.
The task of speaker verification is often confused with that of speaker identification; however, there is a subtle difference between the two. In the case of speaker identification, the identity of the person that provided a speech sample is to be determined from among a closed group of possible speakers. In contrast, verification involves producing a binary decision as to whether a given speech segment originated from a pre-selected speaker. The open-set nature of the verification process makes this an inherently more difficult task.
Speaker verification can be divided into two categories: text-dependent and text-independent. In a text-dependent context, the system expects a pre-defined phrase to be spoken by the user. This approach allows very high accuracy to be
achieved through the analysis of particular phrase and intonation characteristics of the speech over time. However, increased interaction between the user and the system is required, as clients may need to produce a particular set of keywords or be prompted with a required phrase for the verification process. The text-independent case, on the other hand, allows the speaker to use unrestricted speech for the verification process. This is an inherently difficult task and is most applicable to forensic-based applications in which speaker-unaware verification is to be performed.
In the context of speaker verification, speech or audio recordings must first be processed to present only relevant data to the system for classification. Both the physical and behavioural traits of an individual speaker are represented in speech, and it is the objective of speech processing techniques to transform speech signals into a discrete set of features that clearly represent these speaker-discriminative traits.
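To make this front-end concrete, the following is a minimal sketch of short-time cepstral feature extraction of the kind detailed in Chapter 2. It assumes the open-source librosa library; the sampling rate, frame length and number of coefficients are illustrative values, not the configuration used in this thesis.

```python
import librosa
import numpy as np

def extract_features(wav_path, n_mfcc=13):
    """Illustrative short-time cepstral front-end (parameters are assumptions)."""
    # Load telephone-bandwidth audio at 8 kHz.
    signal, sr = librosa.load(wav_path, sr=8000)
    # Mel-frequency cepstral coefficients: one column per ~25 ms frame,
    # advanced by 10 ms (n_fft=200 and hop_length=80 samples at 8 kHz).
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=200, hop_length=80)
    # Delta (first-derivative) coefficients capture temporal dynamics.
    delta = librosa.feature.delta(mfcc)
    # Return a (frames, features) matrix for the modelling stage.
    return np.vstack([mfcc, delta]).T
```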
While text-independent speaker verification can produce accurate results when used under ideal conditions, such conditions are near impossible to encounter in practice, allowing numerous factors to significantly degrade classification performance. The majority of classification errors are caused by the differences, or mismatch, between the acquired training and testing speech. Mismatch can occur due to the use of different telephone handsets or the presence of differing acoustic environments. The difficulty associated with compensating for these differences presents a very active research topic for the speaker verification field, one which also extends to other speech-related technologies.
1.2 Aims and Scope
The aim of this research programme was to improve the classification performance and practicality of speaker verification systems through the use of support vector machine (SVM) techniques.
The recent introduction of SVMs to the task of speaker verification has resulted in performance comparable to state-of-the-art probabilistic-based approaches
to speaker modelling. While traditional generative modelling approaches represent a speaker's characteristics in a probabilistic manner, SVMs utilise a discriminative training process that actively seeks to distinguish positive training examples from the negative. The discriminative nature of SVMs lends itself particularly well to the verification task, in which the voice characteristics of client speakers are to be distinguished from impostors.
The scope of this programme is restricted to the underlying pattern recognition algorithms of speaker verification technology. The development of interfacing software and the deployment of the associated technology are beyond the scope of this programme.
Specically,this research programme focuses on text-independent speaker ver-
ication using telephony-based speech.The constraint of text-independent clas-
sication was imposed due to the signicant challenges yet to be addressed in
the domain.Research developments in this scenario,however,are also likely
to aid classication performance in a text-dependent scenario.This work fo-
cusses on telephony-based speech due to the practical and wide-spread use of
telephony-based communications and second,to ensure that research outcomes
are robust to situations in which quality of speech can not be guaranteed.The
challenges associated with developing robust speaker verication techniques using
telephony-based speech are also likely to be applicable to systems beneting from
higher-quality audio acquisition methods.
Three main avenues for the development and improvement of speaker verification technologies using SVM-based techniques are pursued.
Session variability compensation for SVMs: The biggest contributor to the degradation of automatic speaker verification (ASV) performance is the presence of session variations between training and testing conditions. This topic is regarded as a significant issue in the research field as it is perhaps the most prominent problem restricting the wide-spread deployment of ASV systems. While session variability modelling has recently received significant attention in the context of Gaussian mixture modelling (GMM), solutions tailored toward SVMs are still emerging in the research field.
Improving continuous progressive speaker model adaptation: Progressive speaker model adaptation exploits speech acquired through normal system use to progressively increase the amount of speaker model training data in an attempt to improve overall system performance. Continuous progressive model adaptation is one of the most promising approaches to this task; however, it does not attempt to counteract the adverse effects of session variation. Further, the benefits of discriminative SVM-based classification are currently not exploited in this system.
Impostor dataset selection: The SVM relies on the background dataset to provide discriminatory information against client data during speaker model training. The selection of a background dataset is often based on the broad characteristics expected in the impostor trials encountered by the system, such as gender, language and the method of audio acquisition. Although good performance can be obtained using this heuristic-based approach to selection, it is not a systematic process, basing impostor selection on the performance of an entire set rather than analysing how much potential each impostor example offers to the background dataset.
1.3 Thesis Structure
The remaining chapters of this thesis are composed as follows:
Chapter 2 provides an overview of current speaker verification technologies. Significant focus is given to the thorough research efforts made toward the state-of-the-art GMM-based speaker verification configuration that utilises a universal background model (UBM).
Chapter 3 presents the support vector machine from a pattern recognition perspective. Successful methods that have emerged in the research field for SVM-based speaker verification are also detailed in this chapter, along with the baseline SVM system used throughout this dissertation.
Chapter 4 investigates the differences between modelling session variation in the GMM and SVM domains. For this task, inter-session variation (ISV) modelling and nuisance attribute projection (NAP) techniques, implemented in the GMM and SVM domains respectively, are analysed.
Chapter 5 focusses on, firstly, employing session compensation in the GMM domain for continuous progressive speaker model adaptation systems and, secondly, the integration of SVM-based classification in this system to further improve classification performance. This work was the result of a six month internship with the Laboratoire Informatique d'Avignon at the Université d'Avignon, France.
Chapter 6 investigates the characteristics that are desired of the background dataset for the purpose of SVM-based classification. Through the exploitation of the SVM training and support vector selection process, an automated approach to the selection of impostor datasets is developed.
Chapter 7 applies the background dataset refinement technique to a range of speaker verification techniques that are based on impostor cohort selection, providing improved performance over traditional dataset selection techniques.
Chapter 8 concludes the dissertation with a summary of the contributions of this research and suggests further directions for continuing research in robust SVM-based speaker verification.
1.4 Original Contributions
This research programme has resulted in contributions to the field of speaker verification in all of the research themes identified above.
1.4.1 Session Variability Compensation for SVMs
Recently introduced hybrid GMM-SVM speaker verification configurations exploit both the generative modelling domain and the discriminative SVM domain. The success of this classifier has brought into question the domain in which session variability compensation should be employed. This question was addressed in this work through the comparison of session variation modelling in the GMM domain and the removal of session variation directly in the SVM kernel space.
Investigations then focussed on the combination of the GMM and SVM-based session variability modelling approaches through system and score-level fusion in order to determine whether complementary information could be exploited. This was of particular interest due to the fundamental differences in the speaker modelling processes between the domains.
In addressing the questions outlined above, the following contributions were made to the research community.
 Employing SVM-based session compensation was found to be marginally more robust than the use of similar techniques in the GMM domain.
 Experiments demonstrated that employing robust modelling techniques during GMM training can improve speaker discrimination in the SVM kernel space.
 The modelling of session variability in the GMM and SVM domains was found to be largely non-complementary.
1.4.2 Improving Continuous Unsupervised Speaker
Model Adaptation
The continuous approach to unsupervised speaker model adaptation is unable to employ current session variability modelling approaches due to the use of weighted speaker training data. This work modifies the common factor analysis model for session compensation to account for this weighted speaker training data.
The use of SVMs in unsupervised model adaptation scenarios has not yet been investigated despite their successful application to alternate speaker verification tasks. The application of SVM-based classification to the continuous system was, therefore, investigated in this theme to determine the suitability of SVMs to unsupervised model adaptation.
The following key contributions arose from this work.
 A novel weight-based factor analysis model was proposed specifically for the task of session compensation in continuous unsupervised model adaptation systems, bringing about significant performance improvements.
 SVM-based classification was found to improve system performance in an unsupervised scenario over the current GMM-based implementation.
 SVMs demonstrated characteristics that are desirable for the task of unsupervised model adaptation. Specifically, they offered increased robustness to the inclusion of weighted impostor training data and to the adverse effects of score shift over the GMM-based approach.
 The importance of including low-scoring target trial segments in the model adaptation process was highlighted, together with how their benefits can only be exploited through continuous model adaptation.
1.4.3 Impostor Dataset Selection
The major shortcoming of the traditional, heuristic-based approach to background dataset selection is that it fails to seek out the potential that each candidate example offers to the background dataset.
This work investigates a novel data-driven method of analysing the suitability of impostor examples for the purpose of representing the background population. Specifically, information from the SVM training algorithm is exploited in the development of an impostor suitability metric. The impostor suitability measure subsequently provides a means of selecting the background dataset on a per-observation basis to overcome the shortcoming of the heuristic-based approach.
Little is known about the characteristics desired of examples that represent the SVM impostor population. This work utilises the proposed impostor suitability measure to address this issue through the analysis of characteristics observed in the highest and lowest-ranking examples from a candidate impostor dataset.
Similar to the SVM background dataset, impostor datasets used in common score normalisation techniques must be reliably selected in order to realise their full potential. Heuristic-based selection approaches are typically employed for this task. In this theme, the versatility of the proposed data-driven approach is investigated by extending it to the selection of score normalisation cohorts for both SVM and GMM-based speaker verification.
A number of key contributions were made to the research field while pursuing the objectives outlined above.
 An impostor suitability metric was developed for the ranking of a set of candidate impostor examples based on the frequency with which examples were selected as support vectors in the training of a set of client SVMs (a sketch of this ranking appears after this list).
 An automated, data-driven approach was proposed for the selection of a refined set of the most suitable impostor examples from a large and diverse candidate dataset for use as the background dataset in SVM-based speaker verification systems. This provided significant performance improvements over the traditional heuristic-based approach using a relatively small background dataset.
 The characteristics desired of the impostor utterances were found to include lengthy training data, even dispersion in the feature space, and diversity of language characteristics.
 SVM-based classification performance was shown to be more sensitive to the selection of a suitable T-norm dataset than to the selection of the background dataset, through their independent selection using the proposed data-driven refinement approach.
 The SVM-based data-driven refinement technique was successfully applied
to the selection of score normalisation cohorts for GMM-based speaker verification, providing improved performance over datasets selected using traditional, heuristic-based approaches.
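As a concrete illustration of the support vector frequency metric described in the first contribution above, the sketch below ranks candidate impostor examples by how often they are retained as support vectors across a set of client SVMs. The scikit-learn classifier, the linear kernel and the regularisation constant are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np
from sklearn.svm import SVC

def rank_impostors_by_sv_frequency(client_examples, candidate_impostors):
    """Return candidate impostor indices sorted by support vector frequency.

    client_examples: list of (1, dim) arrays, one positive example per client.
    candidate_impostors: (n_candidates, dim) array of impostor examples.
    """
    sv_counts = np.zeros(len(candidate_impostors))
    for client in client_examples:
        # One client SVM: the client example against all candidate impostors.
        X = np.vstack([client, candidate_impostors])
        y = np.concatenate([[1], np.zeros(len(candidate_impostors))])
        model = SVC(kernel="linear", C=1.0).fit(X, y)
        # model.support_ holds row indices of the support vectors in X;
        # index 0 is the client, so the remainder are impostor examples.
        impostor_svs = model.support_[model.support_ > 0] - 1
        sv_counts[impostor_svs] += 1
    # Impostors selected most often as support vectors rank as most suitable.
    return np.argsort(sv_counts)[::-1]
```

A refined background dataset would then keep only the top-ranked examples from this ordering.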
1.5 Publications
Listed below are the peer-reviewed publications and under-review submissions resulting from this research programme. Two of these publications were recognised for their contribution to the research field at major international conferences: one was awarded the "Best Student Paper Award" at Interspeech 2008 held in Brisbane, Australia, and another received the prestigious "IEEE 2009 Spoken Language Processing Student Grant" as an award for the best student paper at the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing held in Taipei, Taiwan.
Peer-reviewed International Journals
 M. McLaren, R. Vogt, B. Baker, and S. Sridharan, "Data-Driven Background Dataset Selection for SVM-based Speaker Verification", in print, IEEE Transactions on Audio, Speech and Language Processing (accepted August 2009).
 M. McLaren, D. Matrouf, R. Vogt, and J.-F. Bonastre, "Applying SVMs and Weight-based Factor Analysis to Unsupervised Adaptation for Speaker Verification", in print, Computer Speech and Language (accepted January 2010).
Peer-reviewed International Conferences
 M. McLaren, B. Baker, R. Vogt, and S. Sridharan, "Exploiting Multiple Feature Sets in Data-Driven Impostor Dataset Selection for Speaker Verification", to be published in IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
 M. McLaren, R. Vogt, and S. Sridharan, "Improved GMM-based speaker verification using SVM-driven impostor dataset selection", in Interspeech, pp. 1267-1270, 2009.
 M. McLaren, B. Baker, R. Vogt, and S. Sridharan, "Data-driven impostor selection for T-norm score normalisation and the background dataset in SVM-based speaker verification", in International Conference on Biometrics, pp. 474-483, 2009.
 M. McLaren, B. Baker, R. Vogt, and S. Sridharan, "Improved SVM speaker verification through data-driven background dataset selection", in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4041-4044, 2009. (Awarded the IEEE 2009 Spoken Language Processing Student Grant.)
 M. McLaren, D. Matrouf, R. Vogt, and J.-F. Bonastre, "Combining continuous progressive model adaptation and factor analysis for speaker verification", in Interspeech, pp. 857-860, 2008. (Awarded the Best Student Paper Award.)
 M. McLaren, R. Vogt, B. Baker, and S. Sridharan, "A comparison of session variability compensation techniques for SVM-based speaker recognition," in Interspeech, pp. 790-793, 2007.
 M. McLaren, R. Vogt, and S. Sridharan, "SVM speaker verification using session variability modelling and GMM supervectors," in International Conference on Biometrics, pp. 1077-1085, 2007.
 B. Baker, R. Vogt, M. McLaren, and S. Sridharan, "Scatter Difference NAP for SVM Speaker Recognition," in International Conference on Biometrics, pp. 464-473, 2009.
Chapter 2
An Overview of Speaker Verification Technology
2.1 Introduction
Automatic speaker verication (ASV) is the process of determining to a specied
level of condence if a person is who he or she claims to be through the analysis
of their speech.Being a mature eld of research,a large collection of literature
exists regarding advances in ASV.
Speaker verication can typically be broken down into three components:
feature extraction,speaker modelling,and classication.Figure 2.1 shows the
common approach to speaker model training and testing in a speaker verication
system.Feature extraction involves processing the raw speech data in order to
obtain a set of speaker-discriminate features representing the characteristics of
the speaker.A speaker model is then trained using this set of features.During
the verication process,feature extraction is once again performed to obtain a
set of features to be compared to the model of the target speaker using relevant
pattern matching algorithms.The likelihood of the test speech originating from
the target speaker is represented by a classication score that is thresholded to
obtain the verication decision.
This chapter, which overviews advances in speaker verification technology over the past few decades, is structured as follows.
Figure 2.1: Data flow through a typical automatic speaker verification (ASV) system.
Section 2.2 discusses methods of evaluating the performance of an ASV system along with the tools and datasets available for this purpose. Focus is given to the annual National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) corpora, as they are used predominantly for the evaluation of system performance in this research programme.
Section 2.3 provides an overview of the common approaches to feature extraction and the techniques employed to ensure robustness during this process. Speaker modelling and classification techniques using Gaussian mixture models (GMM) are detailed in Section 2.4, including the recent technological advances that have arisen due to the state-of-the-art performance offered by this classifier.
Normalisation techniques used to improve ASV performance and robustness to adverse conditions are described in Section 2.5. Section 2.6 then details one of the most important technological advances for ASV systems: session variability compensation. The reference GMM-UBM system configuration used throughout this research programme is finally presented in Section 2.7.
2.2 Evaluation of Speaker Verification
This section details a number of corpora available for the purpose of evaluating ASV system performance, along with definitions of the most common
performance metrics used to analyse the resulting classification scores. The corpora that have largely replaced early corpora such as the YOHO speech corpus [25, 26] and King [65] are described in this section. As stated previously in Section 1.2, the scope of this research programme is restricted to telephony-based speaker verification and, likewise, the following datasets are also restricted to this genre.
The evaluation of a speaker verification system is a critical part of system development. Development evaluations provide a means of tuning system parameters to ensure classification performance is maximised for the expected conditions of audio acquisition.
Several components are required to analyse the performance of a given ASV system. Firstly, a large corpus of labelled speech audio is needed to provide both training and testing data for the evaluation. Second, an appropriate evaluation protocol for the corpus specifying the trials to be conducted is also required, where a trial involves classifying whether a test segment originated from a given target speaker. Lastly, a suitable metric must be specified to evaluate the classification performance of the system based on a set of classification scores and a key indicating which scores correspond to which target trials.
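As an illustration, the sketch below computes the two metrics reported throughout this thesis, the equal error rate (EER) and the minimum detection cost (DCF), from sets of target and impostor trial scores. The DCF parameter values shown (C_miss = 10, C_fa = 1, P_target = 0.01) are those commonly used in the NIST SREs of this period (see Table 2.1); this is a simple reference implementation, not the official NIST scoring tool.

```python
import numpy as np

def error_rates(target_scores, impostor_scores):
    # Sweep a threshold over all observed scores, recording the miss and
    # false-alarm probabilities at each operating point (accept if score >= t).
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([np.mean(target_scores < t) for t in thresholds])
    p_fa = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    return p_miss, p_fa

def eer(target_scores, impostor_scores):
    # Equal error rate: the operating point where P_miss equals P_fa.
    p_miss, p_fa = error_rates(target_scores, impostor_scores)
    idx = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[idx] + p_fa[idx]) / 2

def min_dcf(target_scores, impostor_scores,
            c_miss=10.0, c_fa=1.0, p_target=0.01):
    # Detection cost: C_det = C_miss * P_miss * P_tgt + C_fa * P_fa * (1 - P_tgt),
    # minimised over all thresholds.
    p_miss, p_fa = error_rates(target_scores, impostor_scores)
    return np.min(c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target))
```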
2.2.1 The Switchboard Series of Corpora
The Switchboard series of corpora was collected by the Linguistic Data Consortium (LDC) as part of the Effective, Affordable, Reusable Speech-to-text (EARS) project, sponsored by the Defense Advanced Research Projects Agency (DARPA). The name "switchboard" comes from the method with which subjects were connected via telephone. A caller would call the switchboard, which would then randomly call another subject from a database of registered participants. The system would then prompt discussion on a randomly chosen topic and record the following 6 minutes of conversation, from which the first minute is typically discarded to remove any introductory and off-topic conversation. This conversational style of recording generally provided around 2.5 minutes of active or useable speech from each conversation side.
Switchboard I [50, 51] consisted of landline-based telephony speech from both
electret and carbon-button handset types. This corpus was collected from 543 U.S. participants, with a total of 4,800 conversation sides or speech segments, and was made available in 1997. Speech recognition research has benefitted from the full speech-to-text transcriptions of this corpus being made available.
Switchboard II consisted of three separate phases differing in demographic region: the Mid-Atlantic, Midwest and Southern regions, respectively. The majority of participants were sourced from local universities. As a result, the corpus of speech represents a considerably younger generation than the previous Switchboard I dataset. While calls were sourced only from landline telephony connections, participants were encouraged to make calls from a variety of handsets to bring about differing channel and handset conditions within the corpora.
Switchboard-2 Phase I [52] consisted of 3,638 five-minute telephone conversations from 657 participants. Phase II [54] consisted of 4,472 conversations involving 679 participants. Phase III [53] consisted of 5,456 sides from 640 participants under varied environmental conditions. These three phases were recorded in the period 1996-1998 and released in 1998, 1999 and 2002 respectively.
The Switchboard Cellular series of corpora [55] was released in two parts in 2001 and 2004 respectively. Part 1 focussed primarily on GSM cellular phone technology, with a total of 2,618 sides (1,957 from GSM cell phones) from 254 participants roughly balanced by gender. Part 2 focussed on cellular phone technology from a variety of service types, with CDMA technology being most dominant due to its popularity at the time of collection. A total of 4,040 sides (2,950 cellular) from 419 participants were recorded under a variety of environmental conditions.
2.2.2 The Fisher English Speech Corpus
The Fisher corpus of English speech [33, 34] was collected and transcribed to text to address the critical needs of developers attempting to build robust automatic speech recognition systems. However, the Fisher dataset is highly applicable to the needs of ASV system developers because of its size and the large number of unique participants present in the corpus.
The Fisher corpus was released in two parts between 2004 and 2005 and consisted
of a total of 11,699 recorded telephone conversations, with each subject typically contributing to between one and three calls. Conversations were held between two participants, who were typically unknown to each other, on an assigned topic and lasted around 10 minutes. While this increased the formality of conversations, it was intended to maximise inter-speaker variation and the range of vocabulary within the corpus.
Participants in the collection of the Fisher corpus represented a wide variety of demographics including gender, age, dialect region and English language fluency. Further, the Fisher subjects were sourced from a number of regions to provide a variety of pronunciations, including U.S. regional pronunciations, non-U.S. varieties of English and foreign-accented English. These linguistic variations provide the opportunity to improve the robustness of speaker verification systems when encountering such variations in English speech.
2.2.3 The Mixer Corpora
The Mixer corpora [32] were collected in three phases using a similar method to the Fisher corpus [34], where the majority of calls were initiated by the platform, while subjects were also able to initiate calls. Subjects participated in up to 30 calls of at least 6 minutes in duration, where a subset of these calls was collected from unique handsets, multi-channel recording devices and both landline and cellular services. This allowed a large variety of session characteristics to be represented, making the corpora highly appropriate for the development of robust techniques in ASV systems.
In contrast to the corpora previously described in this section, the Mixer corpora collected speech spoken in a number of languages as well as non-native English speech. The languages present in the Mixer corpora include Arabic, Mandarin, Russian, Spanish and English (represented in around 84% of the speech). Bilingual speakers completed a minimum of four calls in languages other than English as well as additional calls in English. This allowed ASV research to analyse the effects of the same speaker enrolling and testing with different languages.
2.2.4 NIST Speaker Recognition Evaluation Protocols
The U.S.National Institute of Standards and Technology (NIST) Speaker Recog-
nition Evaluations (SRE) have been held annually since 1996 (with the exception
of 2007) with the purpose of driving\the technology forward,to measure the
state-of-the-art,and to nd the most promising algorithmic approaches"[93] to
text-independent speaker recognition.The NIST SREs have become common
place amongst the leaders of the speaker verication research eld where the
state-of-the-art technology continues to be re-dened.
Evaluations are performed by releasing a large corpus of speech data and an
evaluation protocol to SRE participants. Groups are then required to evaluate
the corpus using an ASV system developed using withheld data, typically sourced
from the previous SRE, and submit their results to NIST for analysis. NIST
then provides performance statistics for each participant using a known speaker-trial
key. This evaluation key is later released to participants to allow further
development of their systems in time for the following SRE.
The datasets released for each SRE often target specific genres or classification
scenarios to motivate research into the most pressing areas of ASV technology.
Prior to 2004, the NIST SRE corpora were sourced from the Switchboard series
of English speech corpora, after which the Mixer database was used.
Years 1996 to 2003
The rst evaluations (1996-1998) investigated the dierences between the source
of training data of roughly 2 minutes and the length of testing data (3,10 and
30 seconds).The number of subjects used in the SREs increased from 20 to 250
speakers per gender in these early evaluations.
In 1999, the training data of 2 minutes was sourced from a single conversation
with the test data ranging in length between 15-45 seconds. This evaluation
focussed on mismatch between the training and testing conditions with results
divided into the categories matched, somewhat mismatched and very mismatched
depending on the telephone number and handset used in the training and testing
data. From the year 2000 onwards, all trials originated from different phone
numbers to ensure a moderate amount of mismatch in the SRE conditions.
The years 2001 to 2003 saw the core evaluation conditions focused on cellular
data sourced from the Switchboard Cellular corpus [55]. This introduced a
number of challenging conditions into the SREs including factors such as voice
compression and rapidly changing environmental conditions due to the nature of
mobile communication.
As little development data was available for these cellular-based SREs, an
extended data task (EDT) was also defined in the evaluation protocol in which
speech over landline telephones was used to investigate the effects of increased
training data (up to 16 conversation sides per speaker) on classification
performance. The EDT also provided speech-to-text transcripts of the telephony
speech data to motivate research into the exploitation of high-level features
(see Section 2.3.4) in the speaker verification task.
Years 2004 to 2006
In 2004, the NIST SRE presented a new evaluation protocol in which the previous
core condition and extended evaluation tasks were combined [89, 90, 91]. In
this way, evaluations could include training data from 10 seconds through to 16
training sides, with test conditions using speech segments of 10 seconds to a full
conversation side. All participating sites, however, were required to perform the
compulsory "one side train, one side test" condition of the evaluation, with all
other conditions being optional.
In contrast to the Switchboard corpora used for previous evaluations, the
introduction of Mixer data presented more challenging conditions for SRE
participants to contend with. Such challenges were introduced due to the use of
speech from both landline services and cellular transmissions and the variety of
languages represented in the corpus.
Unsupervised adaptation was introduced in these evaluations as an additional
mode in which each test condition could be evaluated. This motivated research
into exploiting test segments in the model training process to improve system
performance. Further information on the process of unsupervised adaptation is
presented in Chapter 5.
The NIST 2006 SRE reused a proportion of the data from the 2005 SRE which,
consequently, presented difficulties during the system development process due
to the potential overlap in speakers between the development and unseen data.
The difficulties associated with diverse and substantial data collection resulted in
subsequent NIST SREs being held every two years.
The NIST 2008 SRE
The NIST 2008 SRE [92] saw the introduction of several challenging tasks. Most
prominent of these was the use of conversational speech data recorded using a
microphone in an interview-style scenario, taken from the Mixer 5 Interview
speech corpus [92]. Additionally, conversational telephone speech was recorded
over a microphone channel to introduce a new test condition. The use of interview-style
data allowed longer speech segments (approximately 15 minutes) to be used
for training and testing. A proportion of target speakers from the NIST 2006
SRE were also present in the 2008 SRE; however, there was no overlap in speech
segments between the corpora.
The introduction of interview speech to the evaluation protocol motivated
participants to tailor their systems to be robust to the differences in channel
characteristics between the telephony and microphone-based evaluation conditions.
This was, however, a somewhat difficult task due to the lack of microphone-recorded
development data available at the time of the evaluation.
2.2.5 Performance Measures
Speaker verication performance is typically measured using the equal error rate
(EER) and minimum decision cost function (DCF) [82].These measures rep-
resent dierent performance characteristics of a system,however,their accurate
estimation relies on a sucient number of trials to be evaluated in order to ro-
bustly calculate the relevant statistics.System performance can also be repre-
sented graphically to assist in the direct comparison of systems.For such a task,
2.2.Evaluation of Speaker Verication 21
the detection error trade-o (DET) plots [81] or receiver operating characteristic
(ROC) curves are utilised.
The performance of a speaker verification system can be represented by two
specific types of errors: false alarms and missed detections. A false alarm (false
positive) occurs when a speech segment from an impostor speaker is incorrectly
identified as originating from the target speaker. On the other hand, a missed
detection (false negative) refers to the rejection of the target speaker by the
system when the speech segment did in fact originate from the target. There
exists a trade-off between these two types of error such that a system can be
tuned to a specific application-dependent operating point. For example, a
threshold can be applied to classification scores such that false alarms are
minimised; however, this would result in an increased number of missed detections.
The ability to tune the errors encountered by a system allows the decision
threshold to be adjusted to target a specific operating point with the expectation
of achieving a similar proportion of errors in the evaluation of unseen data. With
regard to performance metrics, the EER provides a measure of missed detections
and false alarms when defining the decision threshold to cause an equal proportion
of errors to occur. In contrast, the DCF assigns a cost to each of these errors
and takes into account the prior probability of a target trial. The decision cost
function is defined as
C_{\mathrm{DET}} = C_{\mathrm{Miss}} \, P_{\mathrm{Miss|Target}} \, P_{\mathrm{Target}} + C_{\mathrm{FalseAlarm}} \, P_{\mathrm{FalseAlarm|NonTarget}} \, P_{\mathrm{NonTarget}} \qquad (2.1)
where the cost of a missed detection and false alarm are given by C_{\mathrm{Miss}} and
C_{\mathrm{FalseAlarm}}, respectively; P_{\mathrm{Target}} and P_{\mathrm{NonTarget}} represent the prior probabilities
of encountering target and non-target trials, respectively; and P_{\mathrm{Miss|Target}} and
P_{\mathrm{FalseAlarm|NonTarget}} are the system-dependent missed detection and false alarm
rates, respectively. The decision threshold of a system can then be selected to
minimise the cost function. The ability to adjust the parameters of the decision
cost function makes the minimum DCF metric suitable for the evaluation of a
variety of application-specific systems. The DCF parameter values used in NIST
SREs and throughout this dissertation are detailed in Table 2.1.
Parameter                        Value
C_{\mathrm{Miss}}                10
C_{\mathrm{FalseAlarm}}           1
P_{\mathrm{Target}}               0.01
P_{\mathrm{NonTarget}}            0.99

Table 2.1: Standard detection cost function (DCF) parameter values.
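To make these metrics concrete, the following minimal Python sketch computes the EER and minimum DCF of Equation 2.1 from a set of target and impostor trial scores, using the Table 2.1 values as default costs and priors. The function and variable names are illustrative only, not part of any NIST tooling.

    import numpy as np

    def eer_and_min_dcf(target_scores, impostor_scores,
                        c_miss=10.0, c_fa=1.0, p_target=0.01):
        # Sweep the decision threshold over every observed score.
        thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
        # Miss rate: target trials scoring below the threshold.
        p_miss = np.array([(target_scores < t).mean() for t in thresholds])
        # False alarm rate: impostor trials scoring at or above it.
        p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
        # EER: the operating point where the two error rates are (nearly) equal.
        i = np.argmin(np.abs(p_miss - p_fa))
        eer = (p_miss[i] + p_fa[i]) / 2.0
        # Minimum of Equation 2.1 over all candidate thresholds.
        dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        return eer, dcf.min()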
Figure 2.2: Decision cost function (DCF) plot of system operating characteristics.
With regard to the graphical interpretation of system performance, both DET
and ROC plots depict the rate of missed detections as a function of the false alarm
rate to represent a range of system operating points. In contrast to the linear
scale used to represent error rates in the ROC plot, the DET plot utilises
a normal deviate scale to provide a more meaningful interpretation of ASV
operating characteristics. In using this scale, when the target and impostor score
distributions are Gaussian, the DET plot of the system is a straight line. An
example of a DET plot is given in Figure 2.2, where the performance of System
2 far exceeds that of System 1.
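As a rough illustration of the normal deviate scale, the sketch below plots a DET curve by passing both error rates through the inverse of the standard normal cumulative distribution (scipy's norm.ppf). The clipping constant and function name are incidental choices of this sketch.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    def plot_det(target_scores, impostor_scores, label):
        thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
        p_miss = np.array([(target_scores < t).mean() for t in thresholds])
        p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
        # Clip away exact 0/1 rates, which map to infinity on this scale.
        eps = 1e-6
        plt.plot(norm.ppf(np.clip(p_fa, eps, 1 - eps)),
                 norm.ppf(np.clip(p_miss, eps, 1 - eps)), label=label)
        plt.xlabel("False alarm rate (normal deviate scale)")
        plt.ylabel("Miss rate (normal deviate scale)")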
For practical purposes, factors such as computational efficiency may be relevant
in reporting system performance. This is particularly important in the
development of efficient underlying algorithms prior to the deployment of ASV
systems; however, it is not a central concern of this dissertation. Instead,
the traditional EER and minimum DCF statistics will be presented as these are
more readily accepted in the research literature.
2.3 Audio Speech Processing
In the context of automatic speaker verification, speech processing refers to those
operations applied to the raw auditory speech signal to produce a set of features
suitable for use in a classifier. This feature extraction process is used to produce
a vector of elements holding the speaker-specific information from the audio
frames. It is these features which are used to train speaker models and to perform
classification. The selection of a feature set is critical for ASV as it influences
factors such as performance and computation time [138]. Ideally, the feature set
will maintain high inter-speaker variability and low intra-speaker variability.
A robust feature set should be invariant to additive noise, amplification, and
channel and handset variations (also referred to as session variations) [103]. The
most commonly employed feature extraction techniques aim to achieve this
objective through a three-stage process: the processing of the auditory speech
signal to produce an initial feature set (Section 2.3.1); normalisation and removal
of unwanted noise and acoustic characteristics from the features (Section 2.3.2);
and lastly, filtering of the feature set to retain only features that include speech
activity (Section 2.3.3).
While the majority of feature sets have focussed on the short-time cepstrum,
recent developments have investigated the potential benefits that the high-level
characteristics of speech (such as word and phone-level information) can offer to
the speaker verification task. These advances are discussed in Section 2.3.4.
2.3.1 The Short-time Cepstrum
To date, cepstral features have received the most attention in the speaker
verification literature due to their ability to extract speaker-discriminative
information whilst also retaining information regarding the linguistic content of
the audio recording [80, 102, 103, 107]. A significant benefit of analysing speech
in the cepstral domain is that linear time-invariant channel effects can be
conveniently represented as mean offsets of the cepstral coefficients [18].
The development of the short-time feature extraction process has typically been
based on the fact that the human vocal tract produces a combination of unique
frequencies from a limited range during speech. These methods extract features
from relatively stationary moments of speech via "windowing" and are, therefore,
based on the lower-order structure of speech.
The sliding-window method [11] allows a speech audio signal to be analysed
on a frame-by-frame basis using a 10-30 ms window where it is common to have
an overlap of 10 ms between consecutive frames. This approach is common in
ASV systems due to its effectiveness in capturing vocal tract information while
separating the true speech signal from the background audio. The raised sinusoid
Hamming window is typically employed in the literature, as is the case in this work,
as it provides a robust estimation of the frequencies of the speech captured in the
frame.
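A minimal sketch of this framing process is given below, assuming a 20 ms Hamming window advanced by 10 ms so that consecutive frames overlap by 10 ms; the parameter names and defaults are illustrative.

    import numpy as np

    def frame_signal(signal, sample_rate, frame_ms=20, shift_ms=10):
        frame_len = int(sample_rate * frame_ms / 1000)
        shift = int(sample_rate * shift_ms / 1000)
        assert len(signal) >= frame_len, "signal shorter than one frame"
        window = np.hamming(frame_len)  # raised sinusoid window
        n_frames = 1 + (len(signal) - frame_len) // shift
        frames = np.stack([signal[i * shift:i * shift + frame_len]
                           for i in range(n_frames)])
        return frames * window  # one windowed frame per row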
Two categories of cepstral feature extraction are common in speaker verification
systems today, differing in the manner in which the log-magnitude spectrum is
represented. Filterbank analysis captures the energy of the magnitude spectrum
using a set of bandpass filters while linear predictive analysis involves
approximating the magnitude spectrum using an all-pole filter.
Filterbank Analysis
Filterbank analysis was one of the first methods developed for the purpose of
speech processing and remains one of the most effective techniques in the literature
today [80]. Filterbank analysis represents the short-time spectrum of a speech
signal as a set of filterbank outputs, restricted to the frequency range of speech [80].
Approximately 20 partially overlapping filterbanks are used in this process to
produce a compact set of cepstral coefficients to represent the spectrum. Spacing
filterbanks according to a linear scale allows linear frequency cepstral coefficients
(LFCC) to be extracted; however, a more common approach is to use a mel-scale
for the filterbanks.
Mel-frequency cepstral coefficients (MFCC) are produced through filterbank
analysis by spacing the filterbanks according to the mel-scale. The mel-scale
transforms the physical frequency scale to represent the way in which frequencies
are perceived by humans [120]. In this way, each filterbank output holds a
quantitative measure of information relative to the other filterbanks. The
logarithmic mel-scale is estimated by
f_{\mathrm{mel}} = 2595 \log_{10}\left(1 + \frac{f_{\mathrm{Hz}}}{700}\right). \qquad (2.2)
For reasons of computational efficiency, MFCCs are calculated in the frequency
domain using the fast Fourier transform. Cepstral coefficients are derived by
transforming the log-energies of the filterbank outputs using a discrete cosine
transform (DCT). By using the DCT in this process, the correlation between the
energy outputs of adjacent filterbanks is minimised, thus allowing simpler
modelling of the MFCCs.
Delta coecients are generally appended to each feature to capture the dy-
namic properties of the speech signal.These coecients approximate the in-
stantaneous derivative of each of the cepstral coecients by nding the slope
coecient when performing a least-squares linear regression over a window of
consecutive frames with a window length of 3-7 seconds.
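In practice, the whole MFCC-plus-delta pipeline is available in standard toolkits. The sketch below uses the librosa library; the file name is illustrative, and the 13-coefficient configuration is simply a common choice rather than the configuration used in this work.

    import numpy as np
    import librosa

    y, sr = librosa.load("utterance.wav", sr=8000)  # telephone-band audio

    # Mel-spaced filterbank analysis, log-energies and the DCT are
    # applied internally to produce the cepstral coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Deltas: slope of a least-squares regression over nearby frames.
    delta = librosa.feature.delta(mfcc)

    features = np.vstack([mfcc, delta])  # (coefficients x frames)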
Linear Predictive Analysis
Linear predictive analysis is based on a speech model that incorporates a glottal
excitation signal filtered through the vocal tract and nasal cavity. Accordingly,
the linear predictor (LP) model attempts to describe a speech signal s_n at time n
using a linear combination of P past signal values and a weighted input excitation
g_n as

s_n = G g_n - \sum_{p=1}^{P} a_p s_{n-p} \qquad (2.3)

where G is the excitation weight and a_p are the predictor coefficients.
ecients essentially dene an all-pole lter that captures the speech information
26 Chapter 2.An Overview of Speaker Verication Technology
from the frequency spectrum.Their estimation is performed using a minimum
mean squared error (MMSE) criterion where the residual error is assumed to be
the excitation signal g
n
.While this excitation signal can be useful in the esti-
mation of information such as the pitch of the voiced speech,it is the predictor
coecients that provide the majority of speaker discriminative information as a
feature set.The coecients produced by the LP model form the fundamental
feature set used in ASV systems,however,they are generally transformed into a
more suitable representation for the purpose of speaker modelling and classica-
tion.
Linear prediction cepstral coefficients (LPCC) [47] have seen significant
recognition in the literature for the task of speaker verification [19, 21, 28, 43, 115].
LPCCs are calculated via a Fourier or cosine transform of the log-magnitude
spectrum that is estimated through the frequency response of the all-pole filter
defined by the predictor coefficients. As an extension to the base features, delta
features are often considered part of a feature set as they capture the dynamic
nature of the speech signal [80].
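One such transformation is sketched below: predictor coefficients estimated for a single frame are converted to LP cepstral coefficients via the standard recursion. The sketch assumes librosa's LPC routine; sign conventions for the predictor coefficients vary between implementations, so the negation step is specific to that assumption.

    import numpy as np
    import librosa

    def lpcc(frame, order=12, n_ceps=12):
        # librosa.lpc returns [1, a_1, ..., a_P] for the all-pole filter;
        # negate to obtain predictor coefficients a_p as in Equation 2.3.
        a = -librosa.lpc(frame.astype(float), order=order)[1:]
        c = np.zeros(n_ceps)
        for m in range(1, n_ceps + 1):
            acc = a[m - 1] if m <= order else 0.0
            # Recursion: c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k}
            for k in range(1, m):
                if m - k <= order:
                    acc += (k / m) * c[k - 1] * a[m - k - 1]
            c[m - 1] = acc
        return c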
Perceptually-weighted linear prediction (PLP) coefficients [62] attempt to
represent speech based on human perception by applying several perceptually
relevant transformations to the speech signal prior to modelling the signal using
the linear predictor. A Bark-scale transformation is applied to the power spectrum
in order to equalise the information content of the signal. Additionally, the
differences in perceived loudness and power levels are normalised.
2.3.2 Robust Acoustic Feature Extraction
As stated previously, the objective of the feature extraction process is to produce
a set of features from an auditory speech signal that maximises inter-speaker
variation while minimising intra-speaker variation. In order to accomplish these
goals, extraction techniques that are robust to the potential adverse effects arising
from changes in acoustic and channel conditions must be employed.
Methods such as cepstral mean subtraction [47], RASTA processing [63], feature
warping [99] and feature mapping [107] have been successfully employed to
improve the robustness of the feature extraction process and are briefly described
below.
Cepstral Mean Subtraction
A common method of improving the robustness of a feature set is cepstral mean
subtraction (CMS) [47]. This process reduces the effects of channel distortion by
removing the mean from the cepstral coefficients [11]. Essentially, CMS can be
viewed as a high-pass filter applied to a set of feature vectors. Although the
technique is effective at reducing the effects of channel distortion, it has been
shown to also remove beneficial speaker-specific information from the ASV
system [140]. In order to alleviate the effect of additive channel noise, cepstral
mean and variance normalisation (CMVN) [128] was proposed as an extension
to CMS.
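Both operations reduce to simple per-utterance statistics, as in the sketch below, where features is assumed to be a (frames x coefficients) matrix; dropping the division by the standard deviation recovers plain CMS.

    import numpy as np

    def cmvn(features):
        mean = features.mean(axis=0)           # CMS removes only this
        std = features.std(axis=0) + 1e-8      # guard against zero variance
        return (features - mean) / std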
RASTA Processing
RelAtive SpecTrA or RASTA processing of speech was introduced by Hermansky
and Morgan [63] with the purpose of suppressing very slowly or very quickly
varying components in the filterbanks during feature extraction. The technique
was based on the fact that human hearing is relatively insensitive to these
components. RASTA filtering is essentially a band-pass filter applied to the time
trajectories of feature vectors extracted from speech. It has been shown, however,
that this method also removes speaker-specific information in the low frequency
bands [99].
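A minimal sketch of the filtering step is given below, applying an IIR band-pass filter along the time trajectory of each coefficient. The filter coefficients follow those commonly quoted for RASTA (a five-point regression numerator with a single pole, with pole values in roughly the 0.94-0.98 range reported across implementations); this is an assumption of the sketch rather than the exact configuration of [63].

    import numpy as np
    from scipy.signal import lfilter

    def rasta_filter(cepstra):
        # cepstra: (frames x coefficients) matrix of features.
        b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])  # band-pass numerator
        a = np.array([1.0, -0.94])                 # single low-pass pole
        return lfilter(b, a, cepstra, axis=0)      # filter along time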
Feature Warping
A more recent technique, termed feature warping, aims to compensate for the
nonlinear distortions that are introduced to the distribution of log-energy-based
cepstral features when additive noise and channel distortion exist in a speech
signal. This is accomplished by constructing a new feature vector conforming to
a Gaussian distribution [99].
Feature warping warps a cepstral feature by firstly estimating the original,
distorted distribution using the group of N previous features. The warped values
of the cepstral feature are determined by ranking these N features and locating the
relative position of the original feature values in the target Gaussian distribution.

Feature warping provides robustness to additive noise and linear channel
mismatch while retaining the speaker-specific information that can be lost when
using CMS and RASTA processing. Such desirable traits have seen the use of
feature warping become commonplace in many ASV configurations submitted to
the annual NIST SREs.
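The rank-based warping can be sketched as follows. For simplicity this version uses a window centred on the current frame rather than the N previous features, and the 300-frame window (roughly 3 seconds) is a common choice in the literature rather than a requirement.

    import numpy as np
    from scipy.stats import norm

    def feature_warp(features, window=300):
        n_frames, _ = features.shape
        warped = np.empty_like(features, dtype=float)
        half = window // 2
        for t in range(n_frames):
            lo, hi = max(0, t - half), min(n_frames, t + half + 1)
            block = features[lo:hi]
            # Rank of the current value within the window, kept off 0 and 1.
            rank = (block < features[t]).sum(axis=0) + 0.5
            # Map the rank position onto a standard Gaussian.
            warped[t] = norm.ppf(rank / block.shape[0])
        return warped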
Feature Mapping
Feature mapping transforms feature vectors from a specific context space to a
neutral space [107]. This approach employs a channel-independent model from
which channel-dependent models are adapted to explicitly model the effects of
handset differences.

Feature mapping has, in particular, shown great success in alleviating handset
and channel mismatch in speaker verification scenarios [129]. Its application
to the task of speech recognition has also been found to be advantageous to
performance [107].
2.3.3 Speech Activity Detection
The segmentation of the audio signal during the feature extraction process assists
in the process of speech activity detection (SAD). Speech activity detection aims
to filter out of the feature set those features that do not contain speech activity
and, subsequently, provide no beneficial information to the speaker modelling
process.
SAD is often accomplished by removing features that were extracted from
frames of the audio signal in which the total energy was below a pre-determined
threshold. In the case of the audio signal being subject to high levels of noise,
however, false detections are likely to occur if a simplistic threshold approach
is employed. Methods have been proposed based on the detection of periodic
components or cepstral features extracted from the audio signal to overcome this
shortcoming [57, 126].
Commonly employed in the recent literature is a bi-Gaussian approach in which
the distributions of both high-energy and low-energy frames are modelled [117].
Frames belonging to the higher-energy Gaussian are retained while the remainder
are removed from the feature set. In using this approach, speech activity detection
can operate successfully on audio with a relatively low signal-to-noise ratio (SNR)
compared to alternative approaches.

An extension to this bi-Gaussian approach to SAD involves restricting periods
of detected speech to a minimum duration (approximately 0.5 seconds) [117].
In this way, high-energy bursts due to background or channel noise are not
incorrectly detected as speech.
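A minimal sketch of the bi-Gaussian selection is given below, fitting a two-component mixture to the per-frame log-energies and retaining frames assigned to the higher-energy component; the minimum-duration constraint described above is omitted for brevity, and the function name is illustrative.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def energy_sad(frames):
        # frames: (n_frames x samples) matrix of windowed audio frames.
        log_e = np.log((frames ** 2).sum(axis=1) + 1e-10).reshape(-1, 1)
        gmm = GaussianMixture(n_components=2).fit(log_e)
        speech = int(np.argmax(gmm.means_.ravel()))  # higher-energy Gaussian
        return gmm.predict(log_e) == speech          # boolean frame mask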
2.3.4 High-Level (Long-term) Features
The majority of ASV systems exploit acoustic-based feature sets due to the
significant amount of speaker information they possess. However, the performance
offered by these feature sets is limited by the acoustic conditions in which the
speech audio is acquired. Consequently, substantial differences in classification
performance are common between favourable and adverse conditions.

In order to overcome the shortcomings of acoustic-based features, research
has turned toward other sources of speaker-discriminative information in a
recorded audio signal. Recent developments have focused on the high-order
structure of speech in which a speaker's choice of phones, words and phrases can
be extracted and exploited for speaker modelling and classification [121].
High-level information that can be captured in an auditory speech signal
includes a speaker's idiolect, pronunciation idiosyncrasies and speaking
style [2, 40, 44], along with the dynamics of a speaker's pitch, fundamental
frequency and energy [1].
These high-level features achieve a relatively low level of performance compared
to acoustic-based features. This is due to the difficulties associated with detecting
the relevant speech components and the inherently high intra-speaker variation
of the approach. High-level features were designed, however, with the intention
of providing complementary information to the cepstral-based modelling and
classification process. Accordingly, the combination of information from
high-level features with that offered by cepstral-based features provides added
robustness to ASV systems and increased performance due to the fusion of
complementary discriminative information between the feature sets [60].
Word-Level Language Modelling
Developments in ASV technology have endeavoured to extract information
regarding a speaker's unique language speech patterns or idiolect from a speech
sample [40]. Such information is found in the speech-to-text transcripts of a
speech corpus such as those in the Switchboard corpora. Doddington et al. [40]
successfully modelled these idiolects to exploit the speaker-dependent information
in an ASV scenario.
Speaker idiolects can be modelled using a "bag-of-N-grams" classifier. This
classifier models the probabilities of N-word sequences from the transcripts of
a speaker's utterances. A trial's classification score is typically calculated as the
expected log-likelihood ratio of the N-gram tokens that occur in the speech
segment using

\Lambda_s = \frac{1}{N} \sum_i \left[ \log p_s(x_i) - \log p_0(x_i) \right] \qquad (2.4)

where N is the number of tokens in the sequence, x_i is the ith token, and p_s and
p_0 are the speaker and background probabilities, respectively.
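Equation 2.4 amounts to averaging per-token log-likelihood ratios, as in the sketch below, where the probability dictionaries and the floor value for unseen tokens are illustrative assumptions of this sketch.

    import math

    def ngram_score(tokens, p_speaker, p_background, floor=1e-6):
        # tokens: the N-gram tokens observed in the test segment.
        total = sum(math.log(p_speaker.get(x, floor))
                    - math.log(p_background.get(x, floor))
                    for x in tokens)
        return total / len(tokens)  # expected log-likelihood ratio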
A sucient number of N-word sequences need to exist in the training data
to adequately model a speaker's characteristics.For this reason,a number of
utterances or long speech segments are often utilised to increase the classication
performance.Further,sequences are often restricted to the use of uni-grams or
bi-grams to ensure adequate training data is available for modelling and classi-
cation.
In modelling word-level features, a suitable speech-to-text transcription
technique is required for the task. Consequently, speaker verification
classification performance is dependent on the robustness and accuracy of this
transcription process. For this reason, the development of word-level speaker
verification is performed using manually transcribed speech segments to eliminate
this additional dependency.
Recent studies proposed an approach in which a world model was firstly
trained on a significant amount of speech to reflect the language characteristics
of the speaker population [5]. Speaker models were then adapted from this world
model using the maximum a-posteriori (MAP) algorithm, analogous to GMM-based
speaker verification (see Section 2.4.2). This increased the robustness of the
system across all training lengths such that it was found to provide complementary
information to cepstral-based features even when trained on a single utterance.
Phone-Level Speaker Information
Similar to the word-level system described above, phone-level speaker verification
utilises a bag-of-N-grams classifier to model the probabilities of phone sequences
in a speech segment. The way in which a speaker pronounces phones is expected
to possess a significant degree of information; however, using phone sequences for
speaker verification is not a trivial task. This is because the phone being realised
by a speaker is not always identifiable, leading to numerous ways in which the
differences between speakers can be modelled.
One successful approach to modelling phoneme idiosyncrasies was proposed by
Andrews et al. [2]. For this task, a parallel phonetic recognition with language
model (PPRLM) configuration was employed in which six different languages
were modelled. The phone recognisers produced six streams of recognised phones
which were then modelled in the bag-of-N-grams classifier. It was anticipated
that modelling phones from several languages would allow the speaker-specific
pronunciation characteristics to be captured.
Recent advances have demonstrated that N-gram features are particularly
suited to discriminative, SVM-based modelling and classification [23].
Prosodic Information
Prosodic information captures the dynamic nature of a speaker's pitch,
fundamental frequency and energy. These speech characteristics possess a certain
degree of speaker-specific information useful to ASV systems.
The use of the fundamental frequency of a speech segment to distinguish between
speakers received attention early in the speaker verification literature [3]. This
information has since been incorporated into short-term feature sets; however,
its use in providing additional, complementary information to the speaker
verification task continues to be researched [1, 71].
The dynamics of prosodic information were exploited in [1], in which bi-gram
models were used to model the frequency and energy trajectories of speech to
more robustly capture speaker-discriminative information. Non-uniform extraction
region features (NERFs) [71] were defined as a sliding temporal region in
which feature extraction was performed. The region boundaries in this approach
were based on either a fixed window length or the values of other features. These