
Grażyna DEMENKO*

INTELLIGENT VOICE-BASED MACHINE CONTROL

This paper presents the results of a pilot survey of the acoustic models obtained from the database containing different types of texts, created for the needs of the first LVCSR system for Polish. Additionally, background information about the design of the database is presented, along with a description of the applied methods of corpus construction and current statistics of the database contents. The applications of voice actuation for intelligent machine control are discussed.

1. INTRODUCTION

Robots and automation equipment can "listen" to voice commands and perform several tasks, approaching human behavior and improving human-machine interfaces. Speech and speaker recognition technology promises to change the way we interact with machines (robots, computers, etc.) in the future. This technology is maturing day by day and scientists are still working hard to overcome the remaining limitations. Nowadays it is being introduced into many important areas, e.g. in industry, to control doors, lifts, lights, cameras, pumps and equipment by simple voice commands. Current speech recognition systems rely heavily on databases whose size and structure depend more or less on their particular application. An excellent overview of current speech recognition systems is given e.g. in [7], [10].

2. SPEECH RECOGNITION TECHNOLOGY

2.1. THE PRINCIPLES OF STATISTICAL SPEECH RECOGNITION

At present, LVSR systems are firmly based on the principles of statistical pattern recognition. The basic methods of applying these principles to the problem of speech recognition were pioneered by Jelinek at IBM in the 1970s, and little has changed since.

__________
* Phonetic Department, SpeechLab, Adam Mickiewicz University in Poznań, www.speechlabs.pl


An unknown speech waveform is converted by a front-end signal processor into a sequence of acoustic vectors Y = y_1, y_2, ..., y_T. Each of these vectors is a compact representation of the short-time speech spectrum covering a period of typically 10 msec. Thus an average ten-word utterance might have a duration of around 3 seconds and would be represented by a sequence of 300 acoustic vectors. The utterance consists of a sequence of words W = w_1, w_2, ..., w_n, and it is the job of the LVSR system to determine the most probable word sequence Ŵ given the observed acoustic signal Y. To do this, Bayes' rule is used to decompose the required probability P(W|Y) into two components, that is,

    Ŵ = argmax_W P(W|Y) = argmax_W P(W) P(Y|W) / P(Y)


This equation indicates that to find the most likely word sequence Ŵ, the word sequence which maximizes the product of P(W) and P(Y|W) must be found. The first of these terms represents the a priori probability of observing W independently of the observed signal, and this probability is determined by a language model. The second term represents the probability of observing the vector sequence Y given some specified word sequence W, and this probability is determined by an acoustic model (e.g. [5]).
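To make this decomposition concrete, the short sketch below (not from the original paper; the two candidate word sequences and all scores are made-up illustrations) ranks candidate word sequences by the sum of a language-model log-probability log P(W) and an acoustic-model log-probability log P(Y|W); P(Y) is the same for every candidate and therefore drops out of the argmax.

```python
import math

# Made-up log-probabilities for two candidate word sequences. In a real LVCSR
# system log P(W) comes from the language model and log P(Y|W) from the
# concatenated phone HMMs (cf. Fig. 1 and Fig. 2).
CANDIDATES = {
    ("this", "is", "speech"): {"lm": math.log(1e-4), "am": -120.0},
    ("this", "is", "peach"): {"lm": math.log(1e-6), "am": -118.0},
}

def recognise(candidates):
    """Return the word sequence W maximising log P(W) + log P(Y|W).

    This mirrors W^ = argmax_W P(W) P(Y|W); P(Y) is omitted because it does
    not depend on W, and log-probabilities are used to avoid underflow."""
    return max(candidates, key=lambda w: candidates[w]["lm"] + candidates[w]["am"])

print(recognise(CANDIDATES))  # -> ('this', 'is', 'speech')
```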



Fig. 1. Overview of Statistical Speech Recognition

Fig. 2. HMM-based Phone Model

Figures 1 and 2 show how these relationships might be computed. A word sequence W = "This is speech" is postulated and the language model computes its probability P(W). Each word is then converted into a sequence of basic sounds or phones using a pronouncing dictionary. For each phone there is a corresponding statistical model called a hidden Markov model (HMM). The sequence of HMMs needed to represent the postulated utterance is concatenated to form a single composite model and the probability of that model generating the observed sequence Y is calculated. This is the required probability P(Y|W). In principle, this process can be repeated for all possible word sequences and the most likely sequence selected as the recogniser output [5], [10].
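As a sketch of this dictionary-lookup and concatenation step (with a made-up three-word mini-dictionary and placeholder HMM objects standing in for trained models, so purely illustrative), the word sequence is first expanded into phones and the corresponding phone HMMs are then chained into one composite model:

```python
from dataclasses import dataclass

# Minimal, hypothetical pronouncing dictionary (symbols are illustrative only).
PRONOUNCING_DICT = {
    "this":   ["dh", "ih", "s"],
    "is":     ["ih", "z"],
    "speech": ["s", "p", "iy", "ch"],
}

@dataclass
class PhoneHMM:
    """Placeholder for a trained phone HMM (transition matrix, GMM states, ...)."""
    phone: str

def words_to_phones(words):
    """Look each word up in the pronouncing dictionary."""
    return [p for w in words for p in PRONOUNCING_DICT[w]]

def build_composite_model(words, phone_models):
    """Concatenate the phone HMMs needed to represent the postulated utterance."""
    return [phone_models[p] for p in words_to_phones(words)]

phone_models = {p: PhoneHMM(p) for prons in PRONOUNCING_DICT.values() for p in prons}
composite = build_composite_model(["this", "is", "speech"], phone_models)
print([m.phone for m in composite])  # phone chain scored against Y to obtain P(Y|W)
```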

2.2. LVSR DATABASE STRUCTURE FOR POLISH

The Large Vocabulary speech recognition (LVSR) database for Polish is intended to provide material for both training and testing of commands and speech dictation of common texts, including isolated word systems, word-spotting systems and vocabulary-independent systems which use either whole-word or subword modeling approaches. The common specification is a mixture of semi-spontaneous (controlled dictation) and read/dictated speech based on the general language features and peculiarities of Polish at different linguistic as well as phonetic levels. This results in a recording session duration of approx. 60 minutes. The variable part of the database includes speech delivered by 1000 speakers. The session recorded for each speaker consists of approximately 20-40 min of semi-spontaneous speech and approximately 30 min of read speech (about 170 sentences). The corpora have been defined to cover the vocabularies needed for the 3 main applications:

A. Semi-Spontaneous Speech. This sub-corpus contains formal speech (dictation on various application topics).

B. Read Speech. Grammatically and Phonetically Controlled Structure. The overall statistics of triphone coverage are as follows: triphones within words: 10593; triphones containing an accented vowel: 8492; unaccented triphones: 10650; triphones in phrase-final position: 4495 (see the sketch after this list).

C. Read Speech. Semantically Controlled Structure. General-purpose words and phrases. Application-specific short texts for users' needs.
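For illustration, the sketch below shows one way such coverage statistics could be computed; the phone-level transcriptions, the symbol set and the restriction to triphones fully contained within a word are assumptions for the example, not the project's actual tooling.

```python
from collections import Counter

# Hypothetical phone-level transcriptions of two sentences; each inner list is
# one word, already transcribed (symbols are illustrative, not the project's set).
sentences = [
    [["t", "o"], ["n", "j", "e"], ["j", "e", "s", "t"]],
    [["t", "a", "l", "a"], ["v", "a", "t", "f", "a"]],
]

def within_word_triphones(word):
    """All (left, centre, right) phone triples fully contained in one word."""
    return [tuple(word[i:i + 3]) for i in range(len(word) - 2)]

coverage = Counter()
for sentence in sentences:
    for word in sentence:
        coverage.update(within_word_triphones(word))

print(f"distinct within-word triphones: {len(coverage)}")
print(coverage.most_common(3))
```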

For the purpose of the present project, an office environment was assumed to be the target environment. A standard office is a relatively quiet area where the stationary background noise characteristics are close to white noise. It was decided to obtain stereo recordings from two microphone positions: a "close distance" and a "medium distance" position, using a headset microphone and a "table" microphone. Both of them are microphones with cardioid characteristics. The headset microphone is mounted close to the speaker's mouth, so the acquired recordings are expected to be clean, i.e. with a good signal-to-noise ratio and very low reverberation. Two types of microphones were used: Sennheiser ME-3 for the 'close distance' position and AKG C-1000S for the 'middle distance'.

To enable easy management of the large number of speakers' data and the recorded utterances, the QuestionRecorder program was created using JAVA as the programming language. The Setup Window appears after the program launches and requires setting of all necessary data concerning the recorded person: sampling rate, ID number of the scenario (50 recording scenarios are available) and the directory for the recorded waveforms. The names of files were created automatically during the recording session. For each utterance two files are stored: a wave file and a text file describing recording conditions (SAM label file, e.g. [2], [3], [6]).
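A minimal sketch of that bookkeeping step is given below (illustrative only: the file-name pattern, directory layout and label fields are assumptions, and the original tool was written in Java rather than Python); it stores one wave file and one accompanying text file describing the recording conditions for each utterance.

```python
from pathlib import Path

def store_utterance(out_dir, speaker_id, scenario_id, utt_no, wav_bytes, conditions):
    """Write <speaker>_<scenario>_<utterance>.wav plus a small text file with the
    recording conditions next to it (hypothetical naming scheme)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = f"{speaker_id}_{scenario_id:02d}_{utt_no:04d}"
    (out_dir / f"{stem}.wav").write_bytes(wav_bytes)
    label_lines = [f"{key}: {value}" for key, value in conditions.items()]
    (out_dir / f"{stem}.txt").write_text("\n".join(label_lines), encoding="utf-8")
    return stem

store_utterance(
    "recordings/spk0001",                  # output directory (illustrative layout)
    "spk0001", scenario_id=7, utt_no=1,
    wav_bytes=b"",                         # empty payload, demo only
    conditions={"sampling_rate_hz": 16000, "microphone": "headset", "environment": "office"},
)
```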

In the first stage the recordings were labeled by a group of 30 trained students of the Institute of Linguistics in Poznań, whose work was supervised by a phonetician. The lexicon created for the needs of the project consists of three parts. The CW lexicon (78,150 entries) covers a broad range of vocabulary extracted from an especially designed newspaper corpus (177.64634 words). For the SAP lexicon (5,177 entries) we decided on a different application area. The PN lexicon consists of 46,200 first/last names, organization and place names. Moreover, a frequency lexicon (450,000 words) was designed to complete the coverage of the vocabulary occurring in the speech corpora.
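As a rough illustration of how a frequency lexicon can be built from a text corpus and used to check vocabulary coverage, consider the sketch below; the tokenisation, the toy sentence and the example lexicon are assumptions, not the project's procedure.

```python
import re
from collections import Counter

def frequency_lexicon(corpus_text):
    """Count word forms in a text corpus (naive regex tokenisation)."""
    return Counter(re.findall(r"\w+", corpus_text.lower(), flags=re.UNICODE))

def coverage(lexicon_entries, corpus_counts):
    """Fraction of corpus tokens that are covered by the lexicon."""
    total = sum(corpus_counts.values())
    covered = sum(n for word, n in corpus_counts.items() if word in lexicon_entries)
    return covered / total if total else 0.0

counts = frequency_lexicon("to nie jest wcale łatwe, to nie jest trudne")
print(coverage({"to", "nie", "jest"}, counts))  # 6 of 9 tokens covered -> 0.666...
```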

For the purpose of the annotation of the recorded speech data, special software was designed based on a client-server architecture using MSDE 2000 and Windows 2003 Server; the client applications were programmed in C#. The program enables the import of the recordings made with QuestionRecorder and the respective text files into the Annotation Database 2 (e.g. [3]). Non-speech acoustic events are divided into four categories: filled pause, speaker noise, stationary noise or intermittent noise (Fig. 3).


Fig. 3. Spectrogram and waveform of the acoustically segmented utterance /to n'e jest f tale watfe/

2.3. PRELIMINARY ACOUSTIC MODELING AND EXPERIMENTS FOR POLISH

The present results were obtained using only the close-talk microphone recordings of 116 h of speech, produced by 321 speakers. The database was divided into three sets: utterances with a low, moderate and high speech rate, respectively. Speech rate was determined based on automatic alignment and estimated as the inverse mean duration of a vowel. The speakers were split into 5 cross-validation sets in such a way that the number of speakers, the number of slow sentences and the number of fast sentences were approximately equal across the different cross-validation sets. This guaranteed that the same speaker was never used for both training and testing (even with a different speech rate). A list of words was generated from the orthographic annotations, from which a dictionary was automatically generated using grapheme-to-phoneme transcription rules. The dictionary contained over 32000 different words.
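A compact sketch of the speech-rate measure described above follows (assumptions throughout: the alignment values, the vowel symbol set and the two class thresholds are illustrative, not the values used in the project); it estimates speech rate as the inverse of the mean vowel duration obtained from a forced alignment.

```python
# Hypothetical phone alignment: (phone, start_s, end_s) triples from forced alignment.
alignment = [("t", 0.00, 0.06), ("o", 0.06, 0.14), ("n", 0.14, 0.20),
             ("e", 0.20, 0.31), ("j", 0.31, 0.36), ("e", 0.36, 0.47)]

VOWELS = {"a", "e", "i", "o", "u", "y"}  # illustrative vowel symbol set

def speech_rate(alignment):
    """Inverse of the mean vowel duration (1/s), the speech-rate estimate used above."""
    durations = [end - start for phone, start, end in alignment if phone in VOWELS]
    return len(durations) / sum(durations) if durations else 0.0

def rate_class(rate, slow=9.0, fast=12.0):
    """Bucket an utterance into low/moderate/high speech rate (thresholds are made up)."""
    return "low" if rate < slow else "high" if rate > fast else "moderate"

r = speech_rate(alignment)
print(round(r, 2), rate_class(r))  # -> 10.0 moderate
```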


The stochastic acoustic speech models for Polish were trained using HTK [12]. Prior to the modeling, the corpus was segmented by forced alignment using models based on a different 5 h database. The standard training procedure including HInit, HRest, HERest and HHEd for triphone CDHMMs was generally used: a list of ca. 60 contextual 'questions' served for state clustering; the average number of Gaussian mixtures in each state was set to 12. It can be observed that: a) the high speech-rate test sentences were consistently harder to recognize than the moderate ones, even for high speech-rate models; b) save for the above rule, the best recognition rates were obtained for models of the same speech-rate class as the test set. At this preliminary stage we focus on the problem of speech rate, as it is one of the sources of recognition rate degradation, because of poor acoustic matching of some phones in the case of fast speech [12].
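To make the acoustic-model side of this setup concrete, the sketch below shows how a single HMM state with a diagonal-covariance Gaussian mixture (12 components in the trained models; two tiny made-up components here, with made-up parameters) assigns a log-likelihood to one acoustic vector. It illustrates the standard CDHMM output density, not code from the project.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def state_log_likelihood(x, mixture):
    """log b_j(x) = log sum_m c_m N(x; mu_m, Sigma_m), computed stably in log space."""
    logs = [math.log(c) + log_gauss_diag(x, mean, var) for c, mean, var in mixture]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

# Two made-up mixture components over a 3-dimensional acoustic vector.
mixture = [
    (0.6, [0.0, 1.0, -0.5], [1.0, 0.5, 2.0]),
    (0.4, [1.5, 0.0, 0.0], [2.0, 1.0, 1.0]),
]
print(state_log_likelihood([0.2, 0.8, -0.3], mixture))
```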

3. VOICE ACTUATION FOR MACHINE CONTROL

3.1. VOICE-BASED DOOR ACCESS CONTROL

Access control for buildings represents an important tool for protecting both building occupants and the structure itself. Door access control is a physical security measure that assures the security of a room or building by limiting access to that room or building to specific people and by keeping records of such accesses.

Fig. 4. Voice-based door access control system [13]

The ability to verify the identity of a speaker by analyzing speech, or speaker verification, is an attractive and relatively unobtrusive means of providing security for admission into an important or secured place. In the exemplary system (e.g. [13]), access may be authorized simply by an enrolled user speaking into a microphone attached to the system.

Figure 4 shows the schematic diagram of the proposed intelligent voice-based door access control. The proposed system basically consists of three main components, namely a voice sensor, a speaker verification system and the door access control. A low-cost microphone commonly used in computer systems serves as the voice sensor to record the person's voice. The recorded voice is then sent to the voice-based verification system, which verifies the authenticity of the person based on his/her voice. It can also be used by handicapped people and anybody who can speak and seeks an easier way to operate a door and other electronic devices.
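A minimal control-flow sketch of such a system is given below; the verification function, the score threshold and the actuator call are placeholders, not the ANFIS-based system of [13].

```python
def verification_score(enrolled_model, recorded_voice):
    """Placeholder for the speaker verification back end: higher means a better
    match between the recording and the enrolled user's voice model."""
    return 0.0  # supplied by the real verification system

def open_door():
    """Placeholder for the door actuator."""
    print("door unlocked")

def handle_access_request(enrolled_model, recorded_voice, threshold=0.8):
    """Voice sensor -> speaker verification -> door access control."""
    if verification_score(enrolled_model, recorded_voice) >= threshold:
        open_door()
        return True
    return False
```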

3.2. A VOICE-CONTROLLED LIFT

Commanding/controlling of the computer and applications, and using speech for handling the environment (smart house), are very promising fields, especially for disabled people. Most frequently, the executive devices are electric drives, and the collection, processing and forming of the control algorithm is conducted by electronic-digital means in such systems. A lift is a characteristic electromechanical device controlled in an electronic-digital way, containing controlled electric drives, differently realized positioning systems and quite a complex logic control system, and having high requirements for reliability and safety. An exemplary model can be constructed by using an average-powered controller with voice recognition, a programmable terminal and a logical lift program, which connects them all (e.g. [1], [4]).
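As a sketch of how recognized voice commands might be turned into lift-control actions, consider the outline below; the command vocabulary, floor range and drive interface are made-up placeholders, not the controller of [1].

```python
# Hypothetical mapping from recognized command words to target floors.
COMMAND_TO_FLOOR = {"ground": 0, "first": 1, "second": 2, "third": 3}

class LiftController:
    """Toy logic layer between the speech recognizer and the drive/positioning system."""

    def __init__(self, drive):
        self.drive = drive          # object exposing move_to(floor)
        self.current_floor = 0

    def on_command(self, word):
        """Dispatch one recognized command word; ignore out-of-vocabulary words."""
        target = COMMAND_TO_FLOOR.get(word)
        if target is None or target == self.current_floor:
            return False
        self.drive.move_to(target)  # positioning system handles the actual motion
        self.current_floor = target
        return True

class PrintDrive:
    def move_to(self, floor):
        print(f"moving to floor {floor}")

lift = LiftController(PrintDrive())
lift.on_command("second")
```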

3.3. ROBOTIC WHEELCHAIR

Robotic wheelchairs are a very special type of vehicle dedicated to motor-disabled users. However, in order to deploy these vehicles in real applications, it is necessary to develop human-robot interfaces to command the wheelchair. Indeed, from the end-user perspective, this interface has a decisive impact on the comfort and performance of the navigation task. The robot's basic mapping and navigation functions provided include: detection and avoidance of negative obstacles, robust point-to-point planning and navigation, and shared control between the autonomous controller and the human user (e.g. [8], [9], [11]).

4. CONCLUSIONS



The current plans assume including the rest of the corpus of over 1500 speakers, resulting in ca. 1000 h of speech. The present results should be regarded as a preliminary verification of the development of acoustic models for a Polish speech recognition system. They are encouraging; however, further experiments are indispensable to improve the obtained acoustic models and provide an outcome practically useful in the designed speech recognition system. Firstly, we intend to take the following steps: tuning modeling and testing parameters (e.g. number of mixtures, word insertion penalty) and including distant-microphone recordings in training. We expect that integrating linguistic and acoustic models will improve the overall performance of our system.

REFERENCES

[1] CERNYS P., KUBILIUS V., MACERAUSKAS V., RATKEVICIUS K., Intelligent Control of the Lift Model, IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 8-10 September 2003, Lviv.

[2] DEMENKO G., GROCHOLEWSKI S., KLESSA K., OGÓRKIEWICZ J., WAGNER A., LANGE M., ŚLEDZIŃSKI D., CYLWIK N., JURISDIC - Polish Speech Database for taking dictation of legal texts (submitted for LREC 2008).

[3] ELRA: European Language Resources Association homepage: http://www.elra.info/

[4] JOHNSON J., Working with Stepper Motors, 1998, Electronic Inventory Online, http://eio.com/jasstep.htm

[5] YOUNG S., Large Vocabulary Continuous Speech Recognition, IEEE Signal Processing Magazine, 13(5), 1996, 45-57.

[6] http://www.speechlabs.pl

[7] LOOF J., GOLLAN CH., HAHN S., HEIGOLD G., HOFFMEISTER B., PLAHL D., RYBACH R., SCHLUTER R., NEY H., The RWTH 2007 TC-STAR Evaluation System for European English and Spanish, Interspeech 2007, 2145-1249.

[8] PIRES G., NUNES U., A wheelchair steered through voice commands and assisted by a reactive fuzzy-logic controller, Journal of Intelligent and Robotic Systems, 2002, 34(3), 301-314.

[9] SIMPSON R., LOPRESTI E., HAYASHI S., NOURBAKHSH I., MILLER D., The Smart Wheelchair Component System, Journal of Rehabilitation Research & Development, Vol. 41, No. 3B, 2004, 429-442.

[10] O'SHAUGHNESSY D., Interacting with computers by voice: automatic speech recognition and synthesis, Proceedings of the IEEE, Vol. 91, Issue 9, Sept. 2003, 1272-1305.

[11] ENGDAHL T., How to Use Disk Drive Stepper-Motors, 2004, Communications Ltd., http://www.epanorama.net/circuits/diskstepper.html

[12] SZYMAŃSKI M., OGÓRKIEWICZ J., LANGE M., KLESSA K., GROCHOLEWSKI S., DEMENKO G., First evaluation of Polish LVCSR acoustic models obtained from the JURISDIC database, submitted for Interspeech 2008.

[13] WAHYUDI M., ASTUTI W., SYAZILAWATI M., Intelligent Voice-Based Door Access Control System Using Adaptive-Network-based Fuzzy Inference Systems for Building Security, Journal of Computer Science 3, 2007.





The project is supported by the Polish Scientific Committee (Project ID: R00 035 02).