Voice Biometrics

standingtopΤεχνίτη Νοημοσύνη και Ρομποτική

17 Νοε 2013 (πριν από 3 χρόνια και 10 μήνες)

79 εμφανίσεις

Voice Biometrics

General Description


Each individual has individual voice
components called
phonemes
.


Each phoneme has a
pitch
,
cadence
, and
inflection


These three give each one of us a unique voice
sound.


The similarity in voice comes from cultural and
regional influences in the form of accents.


General Description


According to the National Center of Voice and Speech, as one phonate, the
vocal folds and produces a complex sound spectrum made up of a range of
frequencies and overtones. As the spectrum travels through the various
-
sized
areas in the vocal track, some of the frequencies resonate more than others.


Larger spaces resonate at a lower frequencies


Smaller at higher frequencies


The two largest spaces in the vocal track and, the throat, and the mouth,
produce the two lowest resonant frequencies or formants.


Certain inflections and pitches we learn from family members.


Voice physiological and behavior biometric are influenced by our body,
environment, and age.


It is possible that our voice does not always sound the same.


So is voice a good biometric?


Formants

are the resonant frequencies of the vocal
tract when vowels are pronounced. While vowels are
attributed to this periodic resonance, consonants are
not periodic. They are produced by restriction of air
flow with the mouth, tongue, and jaw.


Linguists classify each type of speech sound (called
phenomes) into different categories. In order to identify
each phenome, it is oftentimes useful to look at its
spectrogram or frequency response where one can find
the characteristic formants

General Description


Although all phenomes have their own formants, vowel
sound formants are usually the easiest to identify


All formants have the trait of waxing and waning in
energy in all frequencies, which is caused by the
repeated closing and opening of the human vocal tract.
On average, this repeated closing and opening occurs at a rate of
125 times per second in an adult male and 250 times per second
in an adult female.



This rate gives the sensation of pitch (higher
frequencies result in higher pitches).


Formant values
can vary widely from person to person,
but the spectrogram reader learns to recognize patterns
which are independent of particular frequencies and
which identify the various phonemes with a high degree
of reliability.

Vowel “I”

Vowel “A”


Formants can be seen very clearly in a wideband
spectrogram, where they are displayed as dark bands.
The darker a formant is reproduced in the
spectrogram, the stronger it is (the more energy there
is there, or the more audible it is):


But there is a difference between oral vowels on one
hand, and consonants and nasal vowels on the other.


Nasal consonants and nasal vowels can exhibit
additional formants, nasal formants, arising from
resonance within the nasal branch.


Consequently, nasal vowels may show one or more
additional formants due to nasal resonance, while one
or more oral formants may be weakened or missing due
to nasal antiresonance.


Oral formants are numbered consecutively upwards
from the lowest frequency. In the example, fragment
from the previous wideband spectrogram shows the
sequence [ins] from the beginning. Five formants are
visible in this [i], labeled F1
-
F5. Four are visible in this
[n] (F1
-
F4) and there is a hint of the fifth. There are
four more formants between 5000Hz and 8000Hz in [i]
and [n] but they are too weak to show up on the
spectrogram, and mostly they are also too weak to be
heard.


The situation is reversed in this [s], where F4
-
F9 show
very strongly, but there is little to be seen below F4.

Individual Differences in Vowel
Production


There are differences in individual formant
frequencies attributable to: size, age, gender,
environment, and speech.


The acoustic differences that allow us to
differentiate between various vowel productions
are usually explained by a
source
-
filter theory
.


The source is the sound spectrum created by
airflow through the glottis which varies as vocal
folds vibrate. The filter is the vocal track itself
-

its shape is controlled by the speaker.



The three figures below (taken from Miller)
illustrate how different configurations of the
vocal tract selective pass certain frequencies and
not others. The first shows the configuration of
the vocal tract while articulating the phoneme [i]
as in the word "beet," the second the phoneme
[a], as in "father," and the third [u] as in "boot."
Note how each configuration uniquely affects
the acoustic spectrum
--
i.e., the frequencies that
are passed

Voice Capture


Voice can be captured in two ways:


Dedicated resource like a microphone


Existing infrastructure like a telephone


Captured voice is influenced by two factors:


Quality of the recording device


The recording environment


In wireless communication, voice travels through open
air and then through terrestrial lines, it therefore,
suffers from great interference.


Algorithms for Voice Interpretation


Algorithms used to capture, enroll and match
voice fall into the following categories:


Fixed phase verification


Fixed vocabulary verification


Flexible vocabulary verification


Text
-
independent verification.

Voice Verification


Voice biometrics works by digitizing a profile of a
person's speech to produce a stored model voice print,
or template.


Biometric technology reduces each spoken word to
segments composed of several dominant frequencies
called formants.


Each segment has several tones that can be captured in
a digital format.


The tones collectively identify the speaker's unique
voice print.


Voice prints are stored in databases in a manner similar
to the storing of fingerprints or other biometric data.

Application of Voice Technology


Voice technology is applicable in a variety of areas but
for us, those used in biometric technology include:


Voice Verification


Internet/intranet security:


on
-
line banking


on
-
line security trading


access to corporate databases


on
-
line information services


PC access restriction software


Parental control


Business software as a DSP solution at check points where smart
cards or PIN

used entrance / exit control points



Voice Recognition


hands free devices, for example car mobile hands free sets


electronic devices, for example telephone, PC, or ATM cash
dispenser


software applications, for example games, educational or office
software


industrial areas, warehouses, etc.


spoken multiple choice in interactive voice response systems,
for example in telephony


applications for people with disabilities


Voice verification systems are different from voice recognition
systems although the two are often confused.


Voice recognition is used to translate the spoken word into a
specific response. The goal of voice recognition systems is
simply to understand the spoken word, not to establish the
identity of the speaker. A good familiar example of voice
recognition systems is that of an automated call center asking a
user to “press the number one on his phone keypad or say the
word ‘one’.” In this case, the system is not verifying the identity
of the person who says the word “one”; it is merely checking
that the word “one” was said instead of another option.


Voice verification verifies the vocal characteristics against those
associated with the enrolled user.


The US PORTPASS Program, deployed at remote locations
along the U.S.

Canadian border, recognizes voices of enrolled
local residents speaking into a handset. This system enables
enrollees to cross the border when the port is unstaffed.

How is voice recognition performed?


Voice recognition can be divided into two classes:


template matching
-

template matching is the simplest technique and has
the highest accuracy when used properly, but it also suffers from the most
limitations.


feature analysis


The first step is for the user to speak a word or phrase into a
microphone.


The electrical signal from the microphone is digitized by an
"analog
-
to
-
digital (A/D) converter", and is stored in memory.


To determine the "meaning" of this voice input, the computer
attempts to match the input with a digitized voice sample, or
template, that has a known meaning.


This technique is a close analogy to the traditional command
inputs from a keyboard. The program contains the input
template, and attempts to match this template with the actual
input using a simple conditional statement.

The two stages of a biometric system

Software


Open Source Speech Software from Carnegie Mellon
University


Hephaestus
: Open Source activities at Carnegie Mellon



CMU Sphinx

recognition engines
--

Sphinx 2, Sphinx 3, Sphinx 4, and
SphinxTrain.


PocketSphinx

Sphinx for embedded platforms.


Festvox Project

speech synthesis engines, voices and tools


CMU Statistical Language Modeling Toolkit

(CMU SLM)


CMUdict

--

pronunciation dictionary


OpenVXI

--

VoiceXML

browser


SALT browser

-

finally online!


Audio Databases

--

AN4, Microphone array, etc


RavenClaw
-
Olympus

Dialog system development toolkit.


We will try CMU Sphinx Group Open Source Speech
Recognition
http://cmusphinx.sourceforge.net/html/cmusphinx.php