Speaker Recognition



Speech course, 9 March 2005

Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet


1. Introduction, history, application domains
2. Cues to identity in speech
3. Speaker verification
    1. Decision theory
    2. Text-dependent / text-independent
4. Voice imposture
5. Audio-visual identity verification
6. Evaluations
7. Conclusions



Why should a computer recognize who is speaking?

Protection of individual property (home, bank account, personal data, messages, mobile phone, PDA, ...)
Limited access (secured areas, databases)
Personalization (respond only to its master's voice)
Locating a particular person in an audio-visual document (information retrieval)
Who is speaking in a meeting?
Is a suspect the criminal? (forensic applications)

Tasks in Automatic Speaker Recognition

Speaker verification (voice biometrics)
    Are you really who you claim to be?
Identification (speaker ID):
    Is this speech segment coming from a known speaker?
    How large is the set of speakers (population of the world)?
Speaker detection, segmentation, indexing, retrieval, tracking:
    Looking for recordings of a particular speaker
Combining speech and speaker recognition
    Adaptation to a new speaker, speaker typology
    Personalization in dialogue systems




Applications

Access control
    Physical facilities, computer networks, websites
Transaction authentication
    Telephone banking, e-commerce
Speech data management
    Voice messaging, search engines
Law enforcement
    Forensics, home incarceration


Voice Biometric

Advantages
    Often the only modality available over the telephone
    Low cost (microphone, A/D converter), ubiquity
    Possible integration on a smart (SIM) card
    Natural bimodal fusion: speaking face
Disadvantages
    Lack of discretion
    Possibility of imitation and electronic imposture
    Lack of robustness to noise, distortion, ...
    Temporal drift

Speaker Identity in Speech

Differences in
    Vocal tract shapes and muscular control
    Fundamental frequency (typical values)
        100 Hz (male), 200 Hz (female), 300 Hz (child)
    Glottal waveform
    Phonotactics
    Lexical usage
The difference between the voices of twins is a limiting case
Voices can also be imitated or disguised

Speaker Identity

[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes: frequency f, amplitude A]

Segmental factors (~30 ms)
    Glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
    Vocal tract: characterized by its transfer function and represented by MFCCs (Mel-frequency cepstral coefficients)
Suprasegmental factors
    Speaking rate (timing and rhythm of speech units)
    Intonation patterns
    Dialect, accent, pronunciation habits

What are the sources of difficulty?

Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, ...)
Recording conditions (filtering, noise, ...)
Channel mismatch between enrolment and testing
Temporal drift
Intentional imposture
Voice disguise

Acoustic features

Short-term spectral analysis
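A minimal sketch of short-term spectral analysis, assuming the signal is a plain list of samples: the signal is cut into overlapping frames, each frame is weighted by a Hamming window, and its log power spectrum is computed with a naive DFT. The function names, frame length and hop size are illustrative assumptions, not taken from the course, and a real front-end would use an FFT and go on to cepstral coefficients.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames of frame_len samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def log_power_spectrum(frame):
    """Log power spectrum of one windowed frame (naive DFT, bins 0..N/2)."""
    n = len(frame)
    x = [a * b for a, b in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        c = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(math.log(abs(c) ** 2 + 1e-12))  # small floor avoids log(0)
    return spectrum

# Toy usage: a 1 kHz sine sampled at 8 kHz, with short 32-sample frames for speed.
fs = 8000
sig = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(256)]
spec = [log_power_spectrum(f) for f in frames(sig, 32, 16)]
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

With these toy settings each DFT bin spans fs / 32 = 250 Hz, so the 1 kHz tone falls on bin 4.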


Intra- and inter-speaker variability

Speaker Verification

Typology of approaches (EAGLES Handbook)
    Text dependent
        Public password
        Private password
        Customized password
    Text prompted
    Text independent
Incremental enrolment
Evaluation


History of Speaker Recognition

Current approaches

Dynamic Time Warping (DTW)

$$D(X,Y) = \sum_{\text{best path}} d^{2}(x_i, y_j)$$

[Figure: the utterance “Bonjour” from test speaker Y is aligned against the reference “Bonjour” templates of speaker 1, speaker 2, ..., speaker n; X denotes the reference speaker being compared]

DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
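The DTW comparison on this slide can be sketched as follows: a minimal illustration, assuming each utterance is a list of feature vectors and that the local distance is the squared Euclidean distance. The step pattern (horizontal, vertical, diagonal) and the names `dtw_distance` and `identify` are assumptions made for this example; the systems cited above use their own constraints and normalizations.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def dtw_distance(X, Y):
    """Accumulated squared distance along the best alignment path of X and Y."""
    n, m = len(X), len(Y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sq_dist(X[i - 1], Y[j - 1])
            # Best predecessor: vertical, horizontal or diagonal step.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def identify(test, references):
    """Return the speaker whose reference template is closest to the test."""
    return min(references, key=lambda spk: dtw_distance(references[spk], test))

# Toy one-dimensional "utterances".
refs = {"speaker 1": [[0.0], [1.0], [2.0]], "speaker 2": [[5.0], [6.0], [7.0]]}
test_utt = [[0.0], [0.0], [1.0], [2.0]]
closest = identify(test_utt, refs)
```

For verification rather than identification, the distance to the claimed speaker's template would be compared with a threshold instead of taking the minimum over all speakers.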

Vector Quantization (VQ)

$$D(X,Y) = \sum_{\text{best quant.}} d^{2}(C_i^{X}, y_j)$$

[Figure: the utterance “Bonjour” from test speaker Y is quantized with the codebooks of speaker 1, speaker 2, ..., speaker n; X denotes the speaker whose codebook is being compared]

SOONG, ROSENBERG 1987
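A minimal sketch of the VQ distortion score, assuming each speaker's codebook is already trained (for instance by k-means, which is not shown) and is given as a list of codeword vectors. The names `vq_distortion` and `identify` and the toy codebooks are illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def vq_distortion(codebook, Y):
    """Sum over test frames of the squared distance to the nearest codeword."""
    return sum(min(sq_dist(c, y) for c in codebook) for y in Y)

def identify(Y, codebooks):
    """Return the speaker whose codebook quantizes Y with the least distortion."""
    return min(codebooks, key=lambda spk: vq_distortion(codebooks[spk], Y))

# Toy two-dimensional codebooks and test frames.
codebooks = {
    "speaker 1": [[0.0, 0.0], [1.0, 1.0]],
    "speaker 2": [[4.0, 4.0], [5.0, 5.0]],
}
test_frames = [[0.1, 0.0], [0.9, 1.0], [1.0, 1.2]]
closest = identify(test_frames, codebooks)
```

Unlike DTW, this score ignores the order of the frames, which is why a codebook can also serve for text-independent recognition.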

Hidden Markov Models (HMM)

$$D(X,Y) = \sum_{\text{best path}} \log P(y_j \mid S_i^{X})$$

[Figure: the utterance “Bonjour” from test speaker Y is scored against the “Bonjour” HMMs of speaker 1, speaker 2, ..., speaker n; X denotes the speaker whose model is being compared]

ROSENBERG 1990, TSENG 1992
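The best-path score can be sketched with a small Viterbi recursion. This is only an illustration under stated assumptions: one-dimensional Gaussian emissions per state, a path that starts in the first state and ends in the last one, and transition log-probabilities added along the path (the slide's formula shows only the emission term). All names and values are hypothetical.

```python
import math

def log_gauss(y, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def viterbi_log_score(Y, means, variances, log_trans):
    """Log-likelihood of Y along the best state path of the HMM.

    The path starts in state 0 and must end in the last state.
    """
    n_states = len(means)
    score = [-math.inf] * n_states
    score[0] = log_gauss(Y[0], means[0], variances[0])
    for y in Y[1:]:
        new = []
        for s in range(n_states):
            best_prev = max(score[p] + log_trans[p][s] for p in range(n_states))
            new.append(best_prev + log_gauss(y, means[s], variances[s]))
        score = new
    return score[-1]

# Two-state left-to-right models for two toy speakers.
LOG_HALF = math.log(0.5)
trans = [[LOG_HALF, LOG_HALF], [-math.inf, 0.0]]
models = {
    "speaker 1": ([0.0, 1.0], [1.0, 1.0]),
    "speaker 2": ([3.0, 4.0], [1.0, 1.0]),
}
Y = [0.1, -0.1, 0.9, 1.1]
scores = {spk: viterbi_log_score(Y, m, v, trans) for spk, (m, v) in models.items()}
best = max(scores, key=scores.get)
```

An ergodic HMM would use the same recursion with a fully connected transition matrix and no constraint on the final state.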

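Ergodic HMM

$$D(X,Y) = \sum_{\text{best path}} \log P(y_j \mid S_i^{X})$$

[Figure: the utterance “Bonjour” from test speaker Y is scored against the ergodic HMMs of speaker 1, speaker 2, ..., speaker n; X denotes the speaker whose model is being compared]

PORITZ 1982, SAVIC 1990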

Gaussian Mixture Models (GMM)

REYNOLDS 1995

HMM structure depends on the application

Some issues in Text-dependent Speaker Verification Systems: The CAVE and PICASSO projects

Sequences of digits
    Speaker-independent HMM of each digit
    Adaptation of these HMMs to the client's voice (during enrolment and incremental enrolment)
    An equal error rate (EER) of less than 1% can be achieved
Customized password
    The client chooses his password using some feedback from the system
Deliberate imposture



Gaussian Mixture Model

Parametric representation of the probability distribution of observations:
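For reference, the usual form of a GMM density is written out here. The notation is assumed for this note rather than taken from the slide: M components with weights w_m, mean vectors μ_m, covariance matrices Σ_m, and feature dimension D.

```latex
p(x \mid \lambda) = \sum_{m=1}^{M} w_m \, \mathcal{N}(x; \mu_m, \Sigma_m),
\qquad \sum_{m=1}^{M} w_m = 1,

\mathcal{N}(x; \mu, \Sigma) =
\frac{1}{(2\pi)^{D/2} \, |\Sigma|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
```

The model λ is the set of all weights, means and covariances; diagonal covariance matrices are a common simplification.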

Gaussian Mixture Models

8 Gaussians per mixture

GMM speaker modeling

[Figure: two block diagrams. World model: front-end → GMM modeling → world GMM model. Target model: front-end → GMM model adaptation → target GMM model.]

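Baseline GMM method

[Figure: block diagram. The front-end output x is scored against the hypothesized target GMM model, giving P(x | λ_target), and against the world GMM model, giving P(x | λ_world).]

$$\Lambda(x) = \log\left[\frac{P(x \mid \lambda_{\text{target}})}{P(x \mid \lambda_{\text{world}})}\right]$$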
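A minimal sketch of this log-likelihood ratio score, assuming diagonal-covariance GMMs stored as lists of (weight, mean, variance) triples and a score averaged over the frames of the test utterance. The averaging, the function names and the toy models are assumptions for illustration; training and adaptation of the models are not shown.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, gmm):
    """log p(x | gmm), with gmm a list of (weight, mean, var) components."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    # Log-sum-exp keeps the sum numerically stable.
    return top + math.log(sum(math.exp(t - top) for t in terms))

def llr_score(frames, target_gmm, world_gmm):
    """Average per-frame log-likelihood ratio, target model against world model."""
    total = sum(gmm_log_likelihood(x, target_gmm) - gmm_log_likelihood(x, world_gmm)
                for x in frames)
    return total / len(frames)

# Toy models: a broad world model and a narrower target model.
world = [(0.5, [0.0, 0.0], [4.0, 4.0]), (0.5, [2.0, 2.0], [4.0, 4.0])]
target = [(0.5, [1.0, 1.0], [0.5, 0.5]), (0.5, [1.5, 1.5], [0.5, 0.5])]
client_frames = [[1.1, 0.9], [1.4, 1.6]]
impostor_frames = [[-2.0, -2.5], [4.0, 3.5]]
client_score = llr_score(client_frames, target, world)
impostor_score = llr_score(impostor_frames, target, world)
```

Frames close to the target model give a positive score, and frames better explained by the world model give a negative one; the decision compares the score with a threshold.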

Decision theory for identity verification

Two types of errors:
    False rejection (a client is rejected)
    False acceptance (an impostor is accepted)
Decision theory: given an observation O and a claimed identity
    H0 hypothesis: it comes from an impostor
    H1 hypothesis: it comes from our client
    H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as

$$\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}$$
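In the log domain the rule becomes a comparison of the log-likelihood ratio with a threshold set by the priors. A minimal sketch, assuming only the prior probability of a genuine client is known (the names `bayes_threshold` and `accept` are hypothetical, and misclassification costs are ignored):

```python
import math

def bayes_threshold(prior_client):
    """Log of P(H0)/P(H1): the threshold on the log-likelihood ratio."""
    return math.log((1.0 - prior_client) / prior_client)

def accept(log_likelihood_ratio, prior_client=0.5):
    """Choose H1 (client) iff log[P(O|H1)/P(O|H0)] > log[P(H0)/P(H1)]."""
    return log_likelihood_ratio > bayes_threshold(prior_client)
```

With equal priors the threshold is 0; when clients are rare, for example a prior of 0.1, the same observation needs a log-likelihood ratio above log 9 to be accepted.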

Signal detection theory

Decision

Distribution of scores

Detection Error Tradeoff (DET) Curve

Evaluation

Decision cost (FA, FR, priors, costs, ...)
Receiver Operating Characteristic (ROC) curve
Reference systems (open software)
Evaluations (algorithms, field trials, ergonomics, ...)
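The error measures above can be sketched from two lists of scores. This is an illustration under stated assumptions: a trial is accepted when its score exceeds the threshold, the equal error rate is approximated by sweeping the observed scores, and the decision cost is the usual weighted sum of the two error rates. The default prior and costs are example values chosen for this sketch, not figures from the course.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """False rejection and false acceptance rates at a given threshold."""
    fr = sum(s <= threshold for s in client_scores) / len(client_scores)
    fa = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return fr, fa

def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER: the operating point where FR and FA are closest."""
    best = None
    for t in sorted(set(client_scores) | set(impostor_scores)):
        fr, fa = error_rates(client_scores, impostor_scores, t)
        if best is None or abs(fr - fa) < abs(best[0] - best[1]):
            best = (fr, fa)
    return (best[0] + best[1]) / 2

def detection_cost(fr, fa, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Decision cost: C_miss * FR * P_target + C_fa * FA * (1 - P_target)."""
    return c_miss * fr * p_target + c_fa * fa * (1.0 - p_target)

# Toy scores for five client trials and five impostor trials.
clients = [2.1, 1.7, 0.9, 2.5, 0.2]
impostors = [-1.0, 0.4, -0.3, 1.0, -2.2]
fr, fa = error_rates(clients, impostors, 0.5)
eer = equal_error_rate(clients, impostors)
cost = detection_cost(fr, fa)
```

Plotting FR against FA over all thresholds gives the ROC curve, or the DET curve when both axes are warped to a normal-deviate scale.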

NIST Speaker Verification Evaluations

A reference standard to compare algorithms and stimulate new developments
Distribution (via LDC) of development and test databases with:
    Increasing difficulty (from landline to mobile)
    Several hundred speakers (2 min of training data per client)
    Several thousand test accesses (5 to 50 s per access)
Participation of 15-20 labs every year (MIT, IBM, Nuance, Queensland Univ., ELISA consortium, ...)
Annual workshop, special issues in journals, ...



National Institute of Standards & Technology (NIST) Speaker Verification Evaluations

Annual evaluation since 1995
Common paradigm for comparing technologies

Speaker Verification (text independent)

The ELISA consortium
    ENST, LIA, IRISA, ...
    http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
BECARS: Balamand-ENST CEDRE Automatic Recognition of Speakers
NIST evaluations
    http://www.nist.gov/speech/tests/spk/index.htm

NIST evaluations: Results

ENST 2003

Evaluations: NIST 2004

Combining Speech Recognition and Speaker Verification

Speaker-independent phone HMMs
Selection of segments or segment classes which are speaker specific
Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

ALISP: Automatic Language Independent Speech Processing

Data-driven speech segmentation

Searching in client and world speech dictionaries for speaker verification purposes

Fusion

Fusion results

Voice Transformations and Forgery (occasional, dedicated)

Isolated individuals with few resources or “professional impostors” with a dedicated budget can threaten the security of speaker recognition systems
Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now available
Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures
Prevention by predicting many different forgery scenarios

Voice Forgery using ALISP

[Figure: impostor speech (the same words or not) → transformation → client speech (the same words or not)]

A modification of a source speaker's speech to imitate a target speaker


Conversion system: ALISP encoder

[Figure: block diagram of the encoder. Blocks: speech input, MFCC analysis (MFCC + delta), HMM recognition with HMM models, HNM analysis (harmonic envelope, noise envelope), database of HNM representatives, choice of the best representative unit. Outputs: symbol index, representative index, DTW path, prosody (energy + pitch).]

Conversion system: ALISP decoder

[Figure: block diagram of the decoder. Inputs: symbol index, representative index, DTW path, prosody (pitch, energy, timing). Blocks: concatenation of HNM parameters for each representative, HNM synthesis. Output: speech signal.]

Preliminary results: DET curves

False acceptance (FA) before forgery: 16 ± 2.0% (1700 files)
False acceptance (FA) after forgery: 26 ± 2.0% (1700 files)







Preliminary results

True distributions

Multimodal Identity Verification

M2VTS (face and speech)
    front view and profile
    pseudo-3D with coherent light
BIOMET: (face, speech, fingerprint, signature, hand shape)
    data collection
    reuse of the M2VTS and DAVID databases
    experiments on the fusion of modalities

Speaking Faces: Motivations

In many situations a video sequence is acquired
Fusion of face and speech increases robustness
Forgery is more difficult




Talking Face Recognition (hybrid verification)

Lip features

Tracking lip movements

A talking face model

Using Hidden Markov Models (HMMs)
[Figure labels: acoustic parameters, visual parameters]

Imposture Model

Cloning

Conclusions, Perspectives

Deliberate imposture is a challenge for speech-only systems
Verification of identity based on features extracted from talking faces should be developed
Common databases and evaluation protocols are necessary
Free access to reference systems will facilitate future developments



BioSecure Residential Workshop

Aug. 1st - 26th, 2005, at ENST, Paris
Reference systems for speech, face, talking face, fingerprint, iris, hand, signature, ...
Comparative evaluations on large databases (BIOMET, BANCA, FVC, ...)
Fusion of modalities
http://www.biosecure.info