Lecture 16 Speaker Recognition


17 Nov 2013


Information College, Shandong University @ Weihai



Definition


Method of recognizing a person from
his/her voice.


Depends on speaker-specific
characteristics


To determine whether a specified speaker
is speaking in a given segment of speech


This task is the one closest to biometric
identification using speech

Voice is a popular Biometric


Voice Biometric:


Natural signal to produce


Does not require a specialized input device


Can be used on site or remotely


Telephone banking, Voice mail browsing, ….


Security


Keys, card, ...


Passwords, PIN, ...


Fingerprint, voiceprint, iris-print, …

Similar Tasks


Speaker Verification



Extract information from the stream of speech.


Verifies that a person is who she/he claims to be.


One-to-one comparison.


Speaker Recognition


Extract information from the stream of speech.


Assigns an identity to the voice of an unknown
person.


One-to-many comparison.


Speech Recognition



Extracts information from the stream of speech.


Figures out what a person is saying.

Today's Topics


Speech Recognition


History


Scheme


Speaker Features


Methods

Recognition Milestones


1920, first electromechanical toy: “Rex”
(Elmwood Co.)



Late 1940s, US Defense, Automatic Translation Machine


Project failed, but sparked research at MIT, CMU, and commercial
institutions.



During the 1950s, the first system capable of recognizing digits
spoken over the telephone was developed by Bell Labs.



1962, “Shoebox” from IBM


In the early 1970s, the HARPY system, capable of recognizing
sentences with a limited grammar, was developed at Carnegie
Mellon University.


HARPY required as much computing power as 50 contemporary
computers.


Moreover, the system recognized discrete speech, where words
are separated by longer pauses than usual.

Recognition Milestones


In the 1980s, significant progress in speech recognition technology:


Word error rates continued to drop by a factor of 2 every two years.


IBM in 1985: real-time recognition of isolated words from a set of
20,000 after 20-minute training, with error rate < 5%.


AT&T: call routing system with speaker-independent word-spotting
technology for a few key phrases.


Several very large vocabulary dictation systems:


Required speakers to pause between words.


Better for a specific domain.


In the 1990s:


VoiceBroker deployed by Charles Schwab, stock brokerage, in 1996.


ViaVoice by IBM, first distributed with the now almost forgotten operating
system OS/2 in 1996.


1997, Dragon introduced Naturally Speaking, the first continuous
speech recognition package


Today:


Airline reservations with British Airways,


Train reservation for Amtrak,


Weather forecasts & telephone directory information

Terminology of Speech Recognition


Speaker Dependent Recognition


The recognition system is designed to work
with just one or a small number of
individual speakers


Speaker Independent Recognition


These systems are designed to work with
all the speakers from a given linguistic
community


Large Vocabulary Recognition


Examples are domain-specific recognition systems
such as those used by medical consultants for
dictating notes on their ward rounds


Very difficult to make accurate large-vocabulary,
speaker-independent systems


Small Vocabulary Recognition


Typically recognition of a few keywords such as
digits or a set of commands.


Example: voice operated telephone number dialing

Terminology of Speech Recognition


Isolated Word Recognition:


Systems which can only recognize individual words
which are preceded and followed by relatively long
period of silence


Connected Word Recognition:


Systems which can recognize a limited sequence of
words spoken in succession (e.g. “Ninety-eight
thirty-five four thousand”)


Continuous Word Recognition:


These systems can recognize speech as it occurs
and recognize it in real time. Such systems usually
work with a large vocabulary, but with moderate
accuracy.

Speech Recognition Scheme


Three steps are performed in ANY speech
recognition system:



Feature Extraction



Measurement of similarity



Decision making
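As a toy illustration of the three stages (all function names and the log-energy "features" below are simplifications invented for this sketch, not a real recognizer), the pipeline can be lined up in a few lines of Python:

```python
import numpy as np

def extract_features(signal, frame_len=160):
    """Feature extraction: split the waveform into frames and
    take each frame's log energy (a toy stand-in for real features)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def similarity(test_pattern, reference_pattern):
    """Measurement of similarity: negative Euclidean distance
    between equal-length feature sequences."""
    n = min(len(test_pattern), len(reference_pattern))
    return -np.linalg.norm(test_pattern[:n] - reference_pattern[:n])

def decide(test_pattern, references):
    """Decision making: pick the reference word with the
    greatest similarity to the input."""
    return max(references, key=lambda w: similarity(test_pattern, references[w]))

# Toy usage with synthetic "recordings" of two words
rng = np.random.default_rng(0)
yes_ref = rng.normal(0, 1.0, 1600)
no_ref = rng.normal(0, 0.3, 1600)
references = {"yes": extract_features(yes_ref),
              "no": extract_features(no_ref)}
test = extract_features(yes_ref + rng.normal(0, 0.05, 1600))
print(decide(test, references))  # matches the closer reference pattern
```

Real systems replace each stage with something far richer (MFCC features, DTW or HMM matching, statistical decision rules), but the division of labor is the same.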

Recognition Systems

[Figure: block diagram of a recognition system: speech → feature
extraction → test pattern → pattern matching (against stored reference
patterns) → decision rule → accept/reject. The feature vectors are
cepstral coefficients c0(t), c1(t), …, cM(t) together with their first
and second time derivatives.]

Feature extraction: derive a compact representation of the speech
waveform.

Pattern matching is constrained in many ways, e.g. by the rules of
language (grammar), spelling and possible pronunciations.

Decision rule: find the word with the greatest similarity to the
input speech.

Speech Model & Features

Speaker Recognition Features


The features are low-level speech signal
representation parameters that convey
complete information about the signal.


High-level characteristics like accent,
intonation, etc. are encoded within the
representation in a very complex and cryptic
manner.


The features contain speaker-dependent
components.


Uniqueness and permanence of the features are
problematic.
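A minimal sketch of such low-level features, assuming Hamming-windowed overlapping frames and a log-magnitude spectrum (real systems usually continue on to MFCCs; the function name and parameter values here are illustrative):

```python
import numpy as np

def short_term_log_spectrum(signal, frame_len=256, hop=128):
    """Return one log-magnitude spectrum per overlapping frame:
    a low-level representation computed every `hop` samples."""
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        feats.append(np.log(spectrum + 1e-10))
    return np.array(feats)

# Each row is one feature vector covering ~16 ms of speech at 16 kHz
sig = np.sin(2 * np.pi * 200 * np.arange(4000) / 16000.0)
F = short_term_log_spectrum(sig)
print(F.shape)  # (n_frames, frame_len // 2 + 1)
```

Higher-level characteristics such as accent or intonation are only present implicitly, spread across many such frames.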

Questions


Do the features that uniquely characterize
people exist?


Uniqueness and permanence of most of the
features used in biometric systems have not been
proven.



Is the human’s ability to identify a person
a limit that no automatic system can
overcome?


Automated systems might be able to identify
people better than the average person can. In
practice, expert systems do not perform the task
better than the experts who built them.

Questions


How important are the algorithms versus
the knowledge of features and their
relationships to achieve high identification
accuracy?


Knowledge of features and their relationships is
fundamental for accurate biometric systems. The
algorithms play an important, yet secondary, role
in the process, as no algorithm can compensate for
the lack of adequate features.

Speaker models


Used to represent the speaker-specific
information conveyed in the feature vectors


Several different modeling techniques have
been applied:


Template Matching


Nearest Neighbor


Neural Networks


Hidden Markov Models


State-of-the-art speaker recognition
algorithms are based on statistical models of
short-term acoustic measurements on the
input speech signal


Using long-term averages of acoustic features
(spectrum, pitch, …) was the first and earliest idea:


To average out the factors influencing
intra-speaker variation, leaving only the
speaker-dependent component.


Drawback: requires a long speech utterance (> 20 s)
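The long-term-average idea can be sketched as follows; the toy "speakers" (pure tones plus noise), the ~25 s duration matching the drawback above, and all names are invented for illustration:

```python
import numpy as np

def long_term_average_spectrum(signal, frame_len=256):
    """Average short-term magnitude spectra over the whole utterance,
    so phonetic (intra-speaker) variation averages out and mostly the
    speaker-dependent spectral shape remains."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    spectra = np.abs(np.fft.rfft(frames * np.hamming(frame_len), axis=1))
    return spectra.mean(axis=0)  # one vector per utterance

rng = np.random.default_rng(1)
t = np.arange(16000 * 25) / 16000.0  # ~25 s of audio at 16 kHz
# Two toy "speakers" with different dominant frequencies
speaker_a = np.sin(2 * np.pi * 120 * t) + 0.2 * rng.normal(size=t.size)
speaker_b = np.sin(2 * np.pi * 220 * t) + 0.2 * rng.normal(size=t.size)

ltas_a = long_term_average_spectrum(speaker_a)
ltas_b = long_term_average_spectrum(speaker_b)
unknown = long_term_average_spectrum(np.sin(2 * np.pi * 120 * t))

closer_to_a = np.linalg.norm(unknown - ltas_a) < np.linalg.norm(unknown - ltas_b)
print(closer_to_a)  # the unknown utterance matches speaker A's average
```

The need to average over many frames is exactly why this method demands long utterances.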


Training a speaker-dependent (SD) model for each speaker:


Explicit segmentation: HMM


Implicit segmentation: VQ, GMM

Speaker models


HMM:


Advantage: Text-independent


Drawback: A significant increase in computational
complexity


VQ:


Advantage: Unsupervised clustering


Drawback: Text-dependent


GMM:


Advantage: Text-independent, probabilistic
framework (robust), computationally efficient,
easy to implement.
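A sketch of the probabilistic GMM framework with diagonal covariances, as is common for speaker models. In a real system the weights, means and variances would be trained per speaker with EM; here they are fixed by hand to keep the example short, and all names are illustrative:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of feature frames X under a
    diagonal-covariance Gaussian mixture model."""
    X = np.atleast_2d(X)
    d = X.shape[1]
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var)))
        log_exp = -0.5 * np.sum((X - mu) ** 2 / var, axis=1)
        log_probs.append(np.log(w) + log_norm + log_exp)
    # log-sum-exp over mixture components, then average over frames
    log_probs = np.array(log_probs)
    m = log_probs.max(axis=0)
    return np.mean(m + np.log(np.exp(log_probs - m).sum(axis=0)))

# Two hand-built 2-component speaker models in a 2-D feature space
speaker_a = (np.array([0.5, 0.5]),
             np.array([[0.0, 0.0], [2.0, 2.0]]),
             np.array([[1.0, 1.0], [1.0, 1.0]]))
speaker_b = (np.array([0.5, 0.5]),
             np.array([[5.0, 5.0], [7.0, 7.0]]),
             np.array([[1.0, 1.0], [1.0, 1.0]]))

rng = np.random.default_rng(2)
frames = rng.normal([0.0, 0.0], 1.0, size=(200, 2))  # utterance near A

score_a = gmm_log_likelihood(frames, *speaker_a)
score_b = gmm_log_likelihood(frames, *speaker_b)
print(score_a > score_b)  # the utterance scores higher under speaker A
```

Because the likelihood is a simple sum over independent frames, the model is text-independent: frame order does not matter.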

Speaker models


Discriminative Neural Network


Models the decision function that best
discriminates speakers


Advantage:


Fewer parameters and higher performance
compared to the VQ model.


Drawback:


The network must be retrained when a new
speaker is added to the system.

Speaker models


Progressing

[Figure: “State of the Art: Speech Recognition”: word error rate
(0–40%) from 1993 to 2003 for speaker-independent dictation (easy),
broadcast news, and telephone conversations (hard); timeline of
speaker-modeling methods: VQ and NN in use by 1985, joined by HMM
and GMM by 1995.]

VQ Example

[Figure: codebooks for Speaker A and Speaker B in acoustic spaces
1 and 2; the test sample has less distortion for A than for B.]
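The figure's decision rule can be sketched as follows. The codebooks below are hand-made for illustration; real ones would come from k-means/LBG training on each speaker's enrollment speech:

```python
import numpy as np

def vq_distortion(frames, codebook):
    """Average distance from each feature frame to its nearest
    codeword: the VQ distortion of the utterance for this speaker."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Hand-made codebooks in a 2-D acoustic space
codebook_a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
codebook_b = np.array([[5.0, 5.0], [6.0, 4.0], [7.0, 6.0]])

rng = np.random.default_rng(3)
sample = rng.normal([1.0, 0.5], 0.5, size=(100, 2))  # frames near A's region

d_a = vq_distortion(sample, codebook_a)
d_b = vq_distortion(sample, codebook_b)
print(d_a < d_b)  # the sample has less distortion for A than for B
```

The utterance is assigned to whichever speaker's codebook yields the smallest average distortion.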

HMM Example


Two models of “tomato”:

[Figure: pronunciation network
[t] → { [ah] (0.2) | [ow] (0.8) } → [m] → { [aa] (0.5) | [ey] (0.5) } → [t] → [ow] ]

A word in the vocabulary is represented with phonemes.


Each phoneme is viewed as an HMM


A word model is constructed by combining HMMs for the phonemes
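The pronunciation network can be scored as in the sketch below. Note that the pairing of the branch probabilities (0.2/0.8 and 0.5/0.5) with particular vowels is an assumption here, since the original figure does not make it explicit; in a full system each phoneme symbol would itself expand into an HMM.

```python
# Branch probabilities at the two choice points of the "tomato" network
first = {"ow": 0.8, "ah": 0.2}    # vowel after the initial [t] (assumed pairing)
second = {"ey": 0.5, "aa": 0.5}   # vowel after [m] (assumed pairing)

# Enumerate every pronunciation path and its probability
pronunciations = {
    ("t", v1, "m", v2, "t", "ow"): first[v1] * second[v2]
    for v1 in first for v2 in second
}

best = max(pronunciations, key=pronunciations.get)
total = sum(pronunciations.values())
print(best)   # the highest-probability pronunciation path
print(total)  # path probabilities over the network sum to 1
```

During recognition, the acoustic likelihood of each path (computed by the concatenated phoneme HMMs) is weighted by these branch probabilities.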

Gaussian Mixture Model (GMM)

[Figure: GMM at the state level: in speech recognition a GMM models
the output distribution of each HMM state; in speaker recognition a
single GMM, with mixture weights p1, p2, …, pi and component
Gaussians, models speaker k.]

Limits


The best-performing algorithms for text-independent
speaker verification use Gaussian Mixture Models
(GMMs) (single-state HMMs)


The linguistic structure of the speech signal is not
taken into account and all sounds are represented
using a unique model


The sequential information is ignored


There is a recent trend of using high-level features


Large Vocabulary Continuous Speech Recognition System


Good results for a small set of languages


Need huge amounts of annotated speech databases (an enormous
amount of time and human effort)


Language and task dependent