Speaker Recognition

An Intro to Speaker Recognition

Nikki Mirghafori



Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial,
and A. Stolcke.

Nikki Mirghafori, 4/23/12, EECS 225D -- Verification

Today’s class


Interactive


Measures of success for today:


You talk at least as much as I do


You learn and remember the basics


You feel you can do this stuff


We all have fun with the material!



A 10-minute “Project Design”


You are experts with different backgrounds. Your previous startup companies
were wildly successful. A large VC firm in the valley wants to fund YOUR next
creation, as long as the project is in speaker recognition.


The VC funding is yours, if you come up with some kind of a coherent plan/list
of issues:


What is your proposed application?

What will be the sources of error and variability, i.e., technology challenges?

What types of features will you use?

What sorts of statistical modeling tools/techniques?

What will be your data needs?

Any other issues you can think of along your path?



Extracting Information from Speech

[Diagram: from a speech signal, Speech Recognition extracts words (“How are you?”), Language Recognition the language name (English), and Speaker Recognition the speaker name (James Wilson).]

Goal: Automatically extract information transmitted in the speech signal.

What’s noise? What’s signal?

Orthogonal in many ways.

Use many of the same models and tools.


Speaker Recognition Applications


Access control


Physical facilities


Data and data networks


Transaction authentication


Telephone credit card purchases


Bank wire transfers


Fraud detection



Monitoring


Remote time and attendance logging


Home parole verification


Information retrieval


Customer information for call centers


Audio indexing (speech skimming device)


Personalization



Forensics


Voice sample matching


Tasks



Identification vs. verification


Closed set vs. open set identification


Also, segmentation, clustering, tracking...


Speaker Model Database

Test Speech

Identification

Whose voice is it?

Closed-set Speaker Identification


Speaker Model Database

Test Speech

Identification

Whose voice is it?

Open-set Speaker Identification

None of the above


Speaker Model Database

Test Speech

Verification/Authentication/Detection

Does the voice match?

Yes/No

Verification requires claimant ID

“It’s me!”


Speech Modalities


Text-dependent recognition


Recognition system knows text spoken by person


Examples: fixed phrase, prompted phrase


Used for applications with strong control over user input


Knowledge of spoken text can improve system performance


Text-independent recognition


Recognition system does not know text spoken by person


Examples: User selected phrase, conversational speech


Used for applications with less control over user input


More flexible system but also more difficult problem


Speech recognition can provide knowledge of spoken text


Text-constrained recognition: exercise for the reader.




Text-constrained Recognition


Basic idea: build speaker models for words
rich in speaker information


Example:


“What time did you say? um... okay, I_think that’s a good plan.”


Text-dependent strategy in a text-independent context


Voice as a biometric

Biometric: a human-generated signal or attribute for authenticating a person’s identity.

Voice is a popular biometric:

natural signal to produce

does not require a specialized input device

ubiquitous: telephones and microphone-equipped PCs

Voice biometric combines with other forms of security:

Something you have, e.g., badge

Something you know, e.g., password

Something you are, e.g., voice

Combining all three (have, know, are) gives the strongest security.


How to build a system?


Feature choices:


low level (MFCC, PLP, LPC, F0, ...) and high level (words,
phones, prosody, ...)


Types of models:


HMM, GMM, Support Vector Machines (SVM), DTW, Nearest
Neighbor, Neural Nets


Making decisions: Log Likelihood Thresholds, threshold setting
for desired operating point


Other issues: normalization (znorm, tnorm), optimal data selection
to match expected conditions, channel variability, noise, etc.


Verification Performance


There are many factors to consider in the design of an evaluation of a speaker verification system:

Speech quality


Channel and microphone characteristics


Noise level and type


Variability between enrollment and
verification speech

Speech modality


Fixed/prompted/user-selected phrases


Free text

Speech duration


Duration and number of sessions of
enrollment and verification speech

Speaker population


Size and composition


Most importantly:
The evaluation data and design should match the target
application domain of interest


Verification Performance

[DET plot: probability of false reject (%) vs. probability of false accept (%) for four conditions:]

Text-dependent (Combinations): clean data, single microphone, large amount of train/test speech

Text-dependent (Digit strings): telephone data, multiple microphones, small amount of training data

Text-independent (Conversational): telephone data, multiple microphones, moderate amount of training data

Text-independent (Read sentences): military radio data, multiple radios & microphones, moderate amount of training data


Verification Performance

[Example DET curve: probability of false reject (%) vs. probability of false accept (%), with Equal Error Rate (EER) = 1%.]

The application operating point depends on the relative costs of the two error types, from high security through balance to high convenience:

Wire transfer (high security): false acceptance is very costly; users may tolerate rejections for security.

Customization (high convenience): false rejections alienate customers; any customization is beneficial.
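The error trade-off above can be quantified directly from system scores. A minimal sketch with synthetic scores and a simple threshold sweep (a real DET analysis would interpolate between operating points):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Equal Error Rate: sweep thresholds and return the point where the
    false-reject and false-accept rates are (nearly) equal.
    Assumes higher scores are more target-like."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, best_eer = 2.0, 1.0
    for t in thresholds:
        p_fr = np.mean(target_scores < t)     # false rejects at threshold t
        p_fa = np.mean(impostor_scores >= t)  # false accepts at threshold t
        gap = abs(p_fr - p_fa)
        if gap < best_gap:
            best_gap, best_eer = gap, (p_fr + p_fa) / 2
    return best_eer

targets = np.array([2.0, 3.0, 4.0, 5.0])
impostors = np.array([-5.0, -4.0, -3.0, 1.0])
print(eer(targets, impostors))  # 0.0 for these perfectly separable scores
```

Moving the decision threshold along the same score sets trades false accepts against false rejects, which is exactly the high-security vs. high-convenience choice on the slide.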


Human vs. Machine

Motivation for comparing human to machine: evaluating speech coders and potential forensic applications.

Schmidt-Nielsen and Crystal used NIST evaluation data (DSP Journal, January 2000):

Same amount of training data

Matched handset-type tests

Mismatched handset-type tests

Used 3-sec conversational utterances from telephone speech

Error rates: humans were 44% better than the machine on mismatched handset-type tests, but 15% worse on matched tests.


Features

Desirable attributes of features for an automatic system (Wolf ’72):

Occur naturally and frequently in speech

Easily measurable

Not change over time or be affected by speaker’s health

Not be affected by reasonable background noise nor depend on specific transmission characteristics

Not be subject to mimicry

(In short: practical, robust, secure.)

No feature has all these attributes.


Training & Test Phases

Enrollment phase: training speech for each speaker passes through feature extraction and model training, producing a model for each speaker.

Recognition phase (e.g., verification): the claimant (“It’s me!”) provides test speech, which passes through feature extraction and a verification decision, and is accepted or rejected.


Decision making

Verification decision approaches have roots in signal detection theory


2-class hypothesis test:


H0:

the speaker is an impostor

H1:

the speaker is indeed the claimed speaker.


Statistic computed on test utterance S as a log likelihood ratio:

L = log [ Likelihood(S came from the speaker model) / Likelihood(S did not come from the speaker model) ]

If L > q, accept; if L < q, reject.

[Diagram: S passes through feature extraction, is scored against the speaker model (+) and the impostor model (-), and the resulting L drives the decision.]
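This decision rule can be sketched with Gaussian mixture models standing in for the speaker and impostor models (all data below is synthetic and illustrative; real systems would use cepstral features and a background model trained on many speakers):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
speaker_frames = rng.normal(1.0, 1.0, size=(500, 12))     # enrollment frames
impostor_frames = rng.normal(-1.0, 1.0, size=(2000, 12))  # background frames

# H1: the claimed-speaker model; H0: the impostor model
spk_model = GaussianMixture(n_components=4, random_state=0).fit(speaker_frames)
imp_model = GaussianMixture(n_components=4, random_state=0).fit(impostor_frames)

def log_likelihood_ratio(utterance):
    """L = log p(S|speaker) - log p(S|impostor), averaged over frames
    (GaussianMixture.score returns the mean per-frame log-likelihood)."""
    return spk_model.score(utterance) - imp_model.score(utterance)

theta = 0.0  # decision threshold q
test_utterance = rng.normal(1.0, 1.0, size=(200, 12))
decision = "accept" if log_likelihood_ratio(test_utterance) > theta else "reject"
```

The same scoring structure reappears in the GMM-UBM system later in the deck, where the impostor model is the universal background model.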


Decision making



Identification: pick model (of N) with best score


Verification: usual approach is via likelihood ratio tests,
hypothesis testing, i.e.:


By Bayes:

P(target|x)/P(nontarget|x) = P(x|target)P(target) / [P(x|nontarget)P(nontarget)]

accept if > threshold, reject otherwise

Can’t sum over all non-target talkers -- the “world” -- for SV!

Use “cohorts” (collection of impostors)

Train a “universal”/“world”/“background” model (speaker independent, trained on many speakers)





Spectral Based Approach

Traditional speaker recognition systems use:

Cepstral features

Gaussian Mixture Models (GMMs)

D.A. Reynolds, T.F. Quatieri, R.B. Dunn, “Speaker Verification using Adapted Gaussian Mixture Models,” Digital Signal Processing, 10(1-3), January/April/July 2000.

[Diagram: feature extraction applies a sliding window, Fourier transform, magnitude, log, and cosine transform; the speaker model is adapted from a background model, and the score is a log likelihood ratio.]
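The feature-extraction chain in that diagram can be sketched in a few lines. This is a simplified version: real MFCC front ends insert a mel filterbank before the log, and the frame sizes here are illustrative:

```python
import numpy as np
from scipy.fft import dct

def cepstral_features(signal, frame_len=400, hop=160, n_ceps=13):
    """Sliding window -> Fourier transform -> magnitude -> log -> cosine
    transform, yielding one cepstral vector per frame."""
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))   # magnitude spectrum
        log_mag = np.log(mag + 1e-10)      # log compression
        feats.append(dct(log_mag, norm='ortho')[:n_ceps])  # cosine transform
    return np.array(feats)

audio = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
print(cepstral_features(audio).shape)  # (98, 13): 98 frames of 13 cepstra
```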


Features: Levels of Information

Hierarchy of perceptual cues, from high-level cues (learned behaviors) down to low-level cues (physical characteristics):

Semantic

Dialogic

Idiolectal

Phonetic

Prosodic

Spectral

High-level cues are learned behaviors: semantics, idiolects, pronunciations, and idiosyncrasies, shaped by socio-economic status, education, place of birth, personality type, and parental influence. Prosodic cues cover prosody, rhythm, speed, intonation, and volume modulation. Low-level cues are physical: acoustic aspects of speech (nasal, deep, breathy, rough), arising from the anatomical structure of the vocal apparatus.


Low level features

Speech production model: source-filter interaction.

Anatomical structure (vocal tract/glottis) conveyed in the speech spectrum.

[Diagram: glottal pulses (source) filtered by the vocal tract produce the speech signal.]


Word N-gram Features

Idea (Doddington 2001):

Word usage can be idiosyncratic to a speaker

Model speakers by relative frequencies of word N-grams

Reflects vocabulary AND grammar

Cf. similar approaches for authorship and plagiarism detection on text documents

First (unpublished) use in speaker recognition: Heck et al. (1998)

Implementation:

Get 1-best word recognition output

Extract N-gram frequencies (e.g., I_shall 0.002, I_think 0.025, I_would 0.012)

Model likelihood ratio OR model frequency vectors by SVM
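Extracting those relative-frequency features from 1-best recognizer output is straightforward. A small sketch (the word list is hypothetical; “_” joins bigram words as on the slide):

```python
from collections import Counter

def bigram_freqs(words):
    """Relative frequencies of word bigrams from a 1-best word string."""
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    counts = Counter(bigrams)
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()}

hyp = "i think i think i would".split()
print(bigram_freqs(hyp))  # {'i_think': 0.4, 'think_i': 0.4, 'i_would': 0.2}
```

The resulting dictionary can feed either a likelihood-ratio model or, as a fixed-dimensional vector over a chosen bigram inventory, an SVM.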






Phone N-gram features

Model the pattern of phone usage or “short term pronunciation” for a speaker.

[Pipeline: open-loop phone recognition produces phone N-gram frequency vectors (e.g., [0.0254 0.0068 0.0198]) that are scored by a Support Vector Machine (SVM).]


MLLR transform vectors as features

MLLR Transforms = Features

[Diagram: for each phone class (A, B), an MLLR transform maps the speaker-independent model to a speaker-dependent model; the transform coefficients are used as features.]


Models

HMMs:

text dep (could use whole word/phone model)

prompted (phone models)

text ind’t (use LVCSR) -- or GMMs!

templates: DTW (if text-dependent system)

nearest neighbor: frame level, training data as “model”, non-parametric

neural nets: train explicitly discriminating models

SVMs



Speaker Models -- HMM

Speaker models (voiceprints) represent the voice biometric in compact and generalizable form (e.g., the phone sequence h-a-d).

Modern speaker verification systems use Hidden Markov Models (HMMs):

HMMs are statistical models of how a speaker produces sounds

HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states

Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties


Speaker Models -- HMM/GMM

The form of the HMM depends on the application:

Fixed phrase (“Open sesame”): word/phrase models

Prompted phrases/passwords: phoneme models (/s/ /i/ /x/)

General speech, text-independent: single state HMM (a GMM)


Word N-gram Modeling: Likelihood Ratios

Model the N-gram token log likelihood ratio:

Score = (1/N) Σ_j log [ L_Speaker(j) / L_Background(j) ]

Numerator: speaker language model estimated from enrollment data

Denominator: background language model estimated from a large speaker population

Normalize by the token count N

Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both
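The scoring formula translates directly into code. In this sketch, spk_lm and bg_lm are hypothetical dicts mapping N-gram tokens to probabilities, with a probability floor standing in for proper language-model smoothing:

```python
import math

def ngram_llr_score(tokens, spk_lm, bg_lm, floor=1e-6):
    """Per-token-normalized log likelihood ratio:
    Score = (1/N) * sum_j log(L_Speaker(j) / L_Background(j))."""
    llrs = [math.log(spk_lm.get(t, floor) / bg_lm.get(t, floor))
            for t in tokens]
    return sum(llrs) / len(llrs)

spk_lm = {"i_think": 0.025, "you_know": 0.010}  # from enrollment data
bg_lm = {"i_think": 0.005, "you_know": 0.010}   # from a large population
score = ngram_llr_score(["i_think", "you_know"], spk_lm, bg_lm)
# (log 5 + log 1) / 2: a positive score, since this speaker overuses "i_think"
```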


Speaker Recognition with SVMs

Each speech sample (training or test) generates a point in a derived feature space

The SVM is trained to separate the target sample from the impostor (= UBM) samples

Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point

SVM training is biased against misclassifying positive examples (typically very few, often just 1)

[Plot: background samples and the target sample in the derived feature space, with a test sample scored against the separating hyperplane.]
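A minimal sketch of this setup with scikit-learn. The feature-space points are synthetic, and the class weighting implements the bias toward the few positive examples:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
background = rng.normal(0.0, 1.0, size=(200, 5))  # impostor (UBM) samples
target = rng.normal(3.0, 0.5, size=(2, 5))        # very few target samples

X = np.vstack([background, target])
y = np.array([0] * len(background) + [1] * len(target))

# Heavy positive weight biases training against misclassifying targets
svm = LinearSVC(class_weight={0: 1.0, 1: 100.0}).fit(X, y)

test_point = rng.normal(3.0, 0.5, size=(1, 5))
# Signed distance from the decision hyperplane (Euclidean after dividing
# by the weight norm), used as the verification score
score = svm.decision_function(test_point)[0] / np.linalg.norm(svm.coef_)
```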


Feature Transforms for SVMs

SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research:

they allow great flexibility in the choice of features

However, we need a “sequence kernel”:

Dominant approach: transform the variable-length feature stream into a fixed, finite-dimensional feature space

Then use a linear kernel

All the action is in the feature transform!


Combination of Systems

Systems work best in combination, especially ones using “higher level” features

Need to estimate the optimal combination weights, e.g., using a neural network

Combination weights are trained on a held-out development dataset

[Diagram: GMM, MLLR, WordHMM, and PhoneNgram subsystem scores feed a neural network combiner.]
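A minimal stand-in for the combiner, fusing four subsystem scores with logistic regression instead of a neural network. All scores here are synthetic, and a held-out development set supplies the trial labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_trials = 500
labels = rng.integers(0, 2, n_trials)  # 1 = target trial, 0 = impostor trial

# Columns: scores from GMM, MLLR, WordHMM, PhoneNgram subsystems
dev_scores = rng.normal(0.0, 1.0, (n_trials, 4)) + 1.5 * labels[:, None]

# Combiner trained on the held-out development data
fuser = LogisticRegression().fit(dev_scores, labels)
fused = fuser.decision_function(dev_scores)  # one combined score per trial
```

The learned weights reflect how much each subsystem contributes; complementary “higher level” systems earn weight even when weaker in isolation.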


Variability: The Achilles Heel...

Variability (extrinsic & intrinsic) in the spectrum can cause error

Data of focus has mainly been extrinsic

“Channel” mismatch:

Microphone: carbon-button, hands-free, ...

Acoustic environment: office, car, airport, ...

Transmission channel: landline, cellular, VoIP, ...

Three compensation approaches: feature-based, model-based, score-based. Compensation techniques help reduce error.


NIST Speaker Verification Evaluations

Annual NIST evaluations of speaker verification technology (since 1996)

Aim: Provide a common paradigm for comparing technologies

Focus: Conversational telephone speech (text-independent)

[Diagram: NIST (evaluation coordinator), the Linguistic Data Consortium (data provider), and technology developers in an evaluate-improve loop, comparing technologies on a common task.]


The NIST Evaluation Task

Conversational telephone speech, interview

Landline, cellular, hands-free, multiple mics in room

5 min of conversations between two speakers

Various conditions, e.g.,

Training: 8, 1, or other number of conversation sides

Test: 1 conversation side, 30 secs, etc.

Evaluation:

Equal Error Rate (EER)

Decision Cost Function (DCF):
DCF = C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target),
with (C_miss, C_fa, P_target) = (10, 1, 0.01)
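The DCF with those parameters is easy to compute from a set of trial scores. A small sketch with synthetic scores, where labels mark target vs. impostor trials:

```python
import numpy as np

def dcf(scores, labels, threshold, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """NIST Decision Cost Function at a fixed threshold:
    DCF = C_miss*P_miss*P_target + C_fa*P_fa*(1 - P_target)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    p_miss = np.mean(scores[labels == 1] < threshold)   # false reject rate
    p_fa = np.mean(scores[labels == 0] >= threshold)    # false accept rate
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)

scores = [2.0, 3.0, -1.0, -2.0]
labels = [1, 1, 0, 0]
print(dcf(scores, labels, threshold=0.0))  # 0.0: no errors at this threshold
print(dcf(scores, labels, threshold=2.5))  # cost of missing one target trial
```

Note how the low P_target and high C_miss weight the two error types very differently from a raw EER.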



The End

What’s one interesting thing you learned today that you may share with a friend over dinner conversation?


Backup slides



Word Conditional Models -- example

Boakye et al. (2004)

19 words and bi-grams:

Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}

Filled pauses: {um, uh}

Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}

Trained whole-word HMMs, instead of GMMs, to model the evolution of speech in time

Combines well with the low-level (i.e., cepstral GMM) system, especially with more training data


Phone N-Grams -- example

Idea (Hatch et al., ’05): model the pattern of phone usage or “short term pronunciation” for a speaker

Use open-loop phone recognition to obtain phone hypotheses

Create models of relative frequencies of phone n-grams of the speaker vs. “others”

Use SVM for modeling

Combines well, esp. with increased data

Works across languages