NORWEGIAN UNIVERSITY OF SCIENCE AND TECHNOLOGY




DESIGN, IMPLEMENTATION AND EVALUATION OF A VOICE CONTROLLED INFORMATION PLATFORM APPLIED IN SHIPS INSPECTION

THESIS IN NAUTICAL ENGINEERING



by

TOR-ØYVIND BJØRKLI





DEPARTMENT OF ENGINEERING CYBERNETICS

FACULTY OF ELECTRICAL ENGINEERING AND TELECOMMUNICATION




ABSTRACT

This thesis describes the set-up of a speech recognition platform in connection with ship inspection. Ship inspections involve recording of damage, which is traditionally done by taking notes on a piece of paper. It is assumed that the considerable double work of manually re-entering these data into the ship database can be avoided by introducing a speech recogniser. The thesis explains the challenges and requirements such a system must meet when used on board. Its individual system components are described in detail and discussed with respect to their performance. Various backup solutions, in case the speech recogniser fails, are presented and considered. A list of selected relevant commercially available products (microphones, speech recognisers and backup solutions) is given, including an evaluation of their suitability for the intended use. Based on published literature and on experience gained from a speech demonstrator having essentially the same interface as the corresponding part of the DNV ship database, it is concluded that considerable improvement in microphone and speech recognition technology is needed before they are applicable in challenging environments. The thesis ends with a future outlook and some general recommendations about promising solutions.






















PREFACE
The Norwegian University of Science and Technology (NTNU) offers a Nautical Engineering studies programme; the required entrance qualifications are graduation from a Naval Academy or Maritime College, along with practical maritime experience as an officer. In combination with the studies at NTNU, a post-graduate thesis is to be completed.
I embarked on a career at sea when I commenced my education at Vestfold Maritime College and later joined the Navy's Officers' training school. With my background from the RNoCG (Royal Norwegian Coast Guard) and the RNoN (Royal Norwegian Navy), as well as the merchant fleet, I am naturally interested in shipping and safety at sea.
The thesis is a result of my maritime experience along with the theoretical basis acquired at NTNU; results from the semester project "The surveyor in the information age", written in the spring of 1999, have also contributed to the outcome.
When forming the structure of this project, the NTNU project template has been used as the main guideline.
The thesis is the result of a literature study as well as of information given by former and present surveyors within DNV. This thesis in nautical engineering has been accomplished under the supervision of Professor Tor Onshus at the Department of Engineering Cybernetics (ITK 1), NTNU.



ACKNOWLEDGEMENTS

This thesis in nautical engineering has been carried out during the autumn of 1999 at the Department for Strategic Research in DNV. During this time, I was privileged to work on DNV's Mobile Worker project, which provided an inspiring setting at the junction between research and product testing. I would like to sincerely thank the project members for fruitful collaboration during my stay.
The assistance and guidance given by a number of individuals have been essential for the accomplishment of the thesis.
I would like to thank my head supervisor, Professor Tor Onshus, for his guidance during this project.
I wish to thank all the researchers at DTP 343, Department for Strategic Research at DNV's head office at Høvik, and especially Dr. Scient., Dipl.-Ing. (FH) Thomas Mestl, for allowing me to carry out this project as a part of the Mobile Worker research programme and for his support during the thesis.
Finally, I would also like to thank Thomas Jacobsen, a fellow student, for his assistance in checking and commenting on my work.

Oslo Wednesday, 15 December 1999


Tor-Øyvind Bjørkli

1 Department of Engineering Cybernetics
Table of contents
ABSTRACT
PREFACE
1 INTRODUCTION
2 SYSTEM DESIGN
2.1 Constraints, requirements and potentials
2.2 System set-up
2.3 Backup solutions
3 EVALUATION OF COMMERCIALLY AVAILABLE SYSTEM COMPONENTS
3.1 Microphones
3.1.1 Physical principles
3.1.2 Noise reducing measures
3.1.3 Body placement
3.1.4 Commercially available microphones
3.2 Speech recognition software
3.2.1 Principles
3.2.2 Recognition enhancing measures
3.2.3 Commercially available products
3.2.4 General conclusions
4 SPEECH DEMONSTRATOR
4.1 Set-up
4.2 Experiences gained from the speech demonstrator
5 RECOMMENDATIONS AND FUTURE OUTLOOK
6 REFERENCES
Appendices

1 INTRODUCTION
Speech recognition itself is nothing new; in fact, everybody is doing it every day. However, a machine that recognises the spoken word is a technological challenge, and only recently have such machines become available. Dictation systems for specific professions, such as radiology, have been around for years and carry five-figure price tags. Less expensive general-purpose systems require discrete speech, which is a tedious method of dictation with a pause after each word. Two years ago, Dragon Systems achieved a new milestone with the release of NaturallySpeaking, the first general-purpose speech recognition system that allows dictating in a conversational manner. IBM quickly followed with ViaVoice, costing hundreds of dollars less than the first version of NaturallySpeaking.
A major factor driving the development of these speech-enabled applications is the steady increase in computing power. Speech recognition systems demand a lot of processing power and disk space.
The timeline below gives the history of speech recognition systems (PC Magazine, 10 March 1998):

Speech Technology Timeline
Late 1950s: Speech recognition research begins.
1964: IBM demonstrates Shoebox for spoken digits at New York World's Fair.
1968: The HAL-9000 computer in the movie 2001: A Space Odyssey introduces the world to
speech recognition.
1978: Texas Instruments introduces the first single-chip speech synthesiser and the Speak and
Spell toy.
1993: IBM launches the first packaged speech recognition product, the IBM Personal Dictation
System for OS/2.
1993: Apple ships PlainTalk, a series of speech recognition and speech synthesis extensions for
the Macintosh.
1994: Dragon Systems' DragonDictate for Windows 1.0 is the first software-only PC-based
dictation product.
1996: IBM introduces MedSpeak/Radiology, the first real-time continuous-speech recognition
product.
1996: OS/2 Warp 4 becomes the first operating system to include built-in speech navigation and
recognition.
June 1997: Dragon ships NaturallySpeaking, the first general-purpose continuous-speech
recognition product.
August 1997: IBM ships ViaVoice.
Fall 1997: Microsoft CEO Bill Gates identifies speech recognition as a key technological
advance.
Future: The next generation of speech based interfaces will enable people to communicate with
computers in the same way they communicate with other people (Scientific American, August
1999)


In fact, the dream that machines could understand human speech has existed for centuries, as Leonhard Euler expressed already in 1761:
"It would be a considerable invention indeed, that of a machine able to mimic our speech, with its sounds and articulations. I think it is not impossible."
As functioning speech recognition systems appear on the market, they are being tried out in a large variety of everyday applications, ranging from cars, toys, personal computers and mobile phones to telephone call centres. It has taken more than four decades for speech recognition technology to become mature enough for these practical applications. Moreover, some computer industry visionaries have predicted that speech will be the main input modality in future user interfaces. It is nevertheless important to note that the current speech recognition and application boom is not only due to advanced speech recognition algorithms developed during the last few years, but may be mainly due to the huge improvements in the processing power of current
microprocessors. In fact, the core speech recognition technology, on which current applications mainly rely, was already developed in the late 1980s and early 1990s.
The trend in IT development goes towards miniaturisation of components. As the components become smaller and smaller, they will reach a size that easily fits into e.g. clothing, jewellery and helmets. The ideal situation would be that the individual components are so small that the user would not notice wearing them.
Already today, various equipment exists that could allow entering digital information into the reporting system. These devices are, however, not designed for environments such as those found on ships; they are mainly usable only in office surroundings.
Assume an inspector is about to inspect a tank section in a ship: it is hot, noisy, dirty, and humid. To get access to the areas of importance, a ladder has been arranged. The traditional inspection tools for tank coating and structure (rust, cracks, etc.) are a hammer and a flashlight. As you climb the ladder, both hands are needed, one for securing yourself and the other for the inspection tool. A crack in the tank structure must of course be reported. Today this is done by scribbling a note on a piece of paper, which is then stowed away in a pocket.
Vision (Potential Applications of the Technology)
Imagine you have access to all the information, recordings, and equipment needed, by means of a not yet invented device. The device consists of a miniature digital camera integrated in your helmet, a very small PC unit, and a microphone. Assume further that this device is fully voice controlled, allowing you to navigate within an information database and take notes even when both hands are occupied. A written report including pictures could then be completed "on the job" with the help of a speech recognition system, securing the information needed as well as saving time.

Figure 1: Ideally, any equipment shall support the surveyor in his work in such a way that he can fully concentrate on his primary work, the detection of defects; a speech recognition system would be a desired "secretary".
This thesis addresses the design, implementation, and evaluation of a voice recognition system in connection with DNV's ship inspection. The thesis will mainly reflect on problems and tasks concerning voice recognition. The motivation for this theme is to make a survey more effective (save time and money), increase the quality of work, allow immediate updating of class status upon completion of a survey, and contribute to improving the surveyor's working conditions.
Speech entry for entering comments and findings would therefore be of great benefit to today's surveyors, since it would eliminate double work and sources of errors, and speed up the necessary "paper work".
Because of the tremendous potential new IT technology offers, DNV has initiated a project called Mobile Worker, which focuses on the utilisation of new technology and cordless communication to help the surveyors in their everyday work.
What can be done to make the best of these developments? And how can the tasks and processes be arranged in order to work faster and better?
The traditional inspection tasks are varied, and the surveyors solve them in many different ways using various conventional aids. Inspectors look at the condition of what is being inspected, make notes, and fill in standard forms. They have rulebooks, instructions, and other necessary documentation to supplement their own knowledge and memory. Many DNV inspectors have also used mobile phones and portable PCs for a long time to gain access to information and communicate with customers and colleagues. However, at some point the mobility will cease and the equipment will have to remain in a cabin, a suitcase or at the office. The inspectors therefore often need information in situations where it is not available, such as when they want to compare observed conditions and damage with the regulations and standard examples. What is the point of having stored all kinds of information in a ship or loss database if you cannot use it when you have to make a decision? Alternatives that are even more mobile are now starting to appear on the scene, and the possible mobile tasks and specifications of mobile solutions will be described.
The human aspect may be an even greater challenge. Unless the users accept the new opportunities as a natural part of their job, the new equipment may lower their job satisfaction, with the result that they may quickly stop using it. Portable technology can easily create the feeling that the tool is taking control over the work situation instead of supporting it. An important part of every new solution will be to adapt tasks such that the technology will motivate and make work simpler (Andersen, 1999).
Surveyors may have to leave the work site to find a required manual or a list of approved suppliers and approved equipment, e.g. the right pumps from the right vendor. If they fail to refer to the manual, they may attempt lengthy or complex procedures from memory and produce errors. Once the manual is retrieved, the surveyor may have to find room within a cramped workspace to put a large manual or drawing. Also, any attempt to climb onto equipment while holding a large manual might jeopardise safety.
Speech recognition systems are error-prone and not very robust to real-world disturbances, such as ambient background noise (including speech from other surrounding speakers), communication channel distortions, pronunciation variations, speaker stress, or the effects of spontaneous speech. Current speech recognition systems can be categorised into two groups according to their robustness level. Applications falling into the first category typically have large vocabularies and have been designed to recognise continuous speech; successful use of these systems requires efficient minimisation of all possible interference sources. High quality audio equipment, including a close-talk microphone, a noise-free operating environment and, in particular, a co-operative and motivated user are needed in order to achieve high recognition performance. Dictation systems for continuous speech are typical examples of speech recognition systems belonging to this application category. Truly robust speech recognition applications form the second group. Robust systems can cope with distorted speech input (to a certain extent) and still provide high recognition accuracy, even with inexperienced novice users. These systems can usually recognise only discrete words, and their vocabulary size is limited to some tens of words. A good example of a robust system is a speaker-dependent name recognition application in voice dialling. In the name dialling system, the user has trained voice-tags, i.e. names that have phone numbers attached to them. By speaking a certain voice-tag, a phone call to the attached number is then made. Because of this apparent simplicity, name dialling is very useful for example in a car environment where the user's hands and eyes are busy. It is important to note
that speech recognition alone does not have any particular value. To use speech as an input
modality, there must always be some practical advantages. Furthermore, one cannot overestimate
the importance of a good user interface; it is essential that speech recognition applications are
extensively tested with real users under realistic operating environments.

Organisation of the Thesis
In the next chapter, a possible design of a speech recognition system is presented. The
subsequent chapters discuss the individual components in more detail. Chapter 4 describes a
speech recognition demonstrator and chapter 5 ends the thesis with some discussion and future
outlook.




2 SYSTEM DESIGN
This chapter describes a possible system design. Constraints and requirements, the surveyor's work process and a potential backup system are addressed.

2.1 Constraints, requirements and potentials
DNV's constraints can be divided into two groups, namely DNV's database (NAUTICUS) and the surveyor's work process, including the effect of a speech based reporting system on it.
Det Norske Veritas has developed the NAUTICUS database system, which contains all information about a ship over its entire life (Lyng, 1999).
NAUTICUS is based on a product model, allowing in principle unlimited information to be attached to each element of the ship. For example, a geometrical representation of the hull, as well as mathematical analyses of structure and machinery, are contained in it.
The system also permits analysis of the ship's structural strength and behaviour under any sea state and loading conditions, with visual feedback. Several analysis options are available, including fully integrated finite-element capabilities and direct wave-load calculations. The same model also allows a full set of machinery calculations to be performed. With only one model serving all functions, repetitive data input is eliminated.
The accumulated information regarding a specific NAUTICUS class vessel is available to DNV at any time; a surveyor is able to retrieve updated data on matters relating to ship status and conditions, survey feedback, new-building and certificate status, and component and system information.



Figure 2: A surveyor may retrieve information by accessing the product model on his laptop PC.
Figure 3: The NAUTICUS product model is intended to be a mirror of the real world, and any information such as user guides, hints, warnings or restrictions is attached to the model.
Ambient conditions: a worker in a challenging environment (temperature, water, shielded, confined spaces, noise, etc.) with no support available (power supply, Internet or phone connection).
[Figure 4 diagram labels: From NAUTICUS; Planning of survey, in office; Execution of survey, onboard the ship; Into NAUTICUS; Checklist; Survey report; Reporting the survey, at the office]
NAUTICUS allows a 3D representation of the ship's structure and allows the user to record, store, and retrieve information. The product model will at any instant of the ship's lifetime hold its original description as well as its present and historical condition, both for the structure and the equipment. Throughout a ship's life, reports, drawings, sketches and engineering calculations will be stored in the NAUTICUS product model. By simply "clicking" on a part, system, or compartment in a 3D view or a tree structure, the information can be accessed.
The Classification and Statutory Certificates are the main "deliverables" of a classification society. The main deliverable of a surveyor is, however, the inspection report. Introduction of NAUTICUS to DNV surveyors in the field may enable them to issue full-term certificates while still onboard the vessel.











Figure 4: The survey work process: through NAUTICUS, the surveyors will have ready access to the ship information needed for the planning of surveys. When onboard, the surveyor may wish to retrieve information about e.g. the minimum allowable steel thickness; however, this information can only be retrieved from NAUTICUS. The surveyor must therefore either be connected to the headquarters or to a local copy of NAUTICUS. At the completion of a survey, and after verification of the survey data, the results are recorded straight into the ship database and will then be accessible to other surveyors immediately.

[Figure 5 screen labels: Renewed plates; Fully integrated sketcher tool; Crack description]
Figure 5: Screen dump of NAUTICUS in operation and some of the options available. An integrated sketcher tool for drawings, pictures and annotation of the inspected item is scheduled for future implementation in NAUTICUS.
2.2 System set-up
The speech recognition system can be compared with a chain usually consisting of four independent units (see Figure 6):
• User
• Microphone
• Speech recognition software
• Computer hardware
It may be possible to insert more components into the chain to improve the recognition accuracy; examples of such components are:
• Noise reducing software
• Noise reducing hardware

Depending on the degree of integration, these "extra" components may increase the overall weight and size of the equipment, and they may decrease the user's ability to move around. Extra equipment may also be power consuming, which means that the operation time is reduced.
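To make the chain picture concrete, the sketch below models the system as a sequence of processing stages through which the spoken input passes. It is an illustration only: the stage functions are empty placeholders invented for the example and do not correspond to any specific product discussed in this thesis.

# Illustrative sketch: the speech recognition "chain" as a list of stages.
# A weak stage (user, microphone, recogniser) limits the whole chain.

def microphone(acoustic_input):
    """Convert sound waves to an electrical/digital signal (placeholder)."""
    return acoustic_input

def noise_reduction(signal):
    """Optional extra link: attenuate background noise (placeholder)."""
    return signal

def recogniser(signal):
    """Speech recognition software: map the signal to text (placeholder)."""
    return "recognised text"

def run_chain(stages, spoken_input):
    result = spoken_input
    for stage in stages:
        result = stage(result)
    return result

# The optional noise-reducing link can be inserted or left out:
basic_chain = [microphone, recogniser]
extended_chain = [microphone, noise_reduction, recogniser]
print(run_chain(extended_chain, "raw speech"))

The point of the sketch is simply that each added link brings its own cost (weight, power) and its own failure modes, which is why the quality of the weakest link governs the overall result.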


Figure 6: The speech recognition system can be compared with a chain consisting of a user, microphone, noise reducing hardware, noise reducing software, speech recognition software, and computer hardware.

The chain consists of a number of independent links, and each component of the speech recognition system has its individual strength. As indicated in Figure 6, the user represents the system's first link. In today's systems the recognition rate depends principally on the user's skill (dictating, etc.) in handling speech recognition systems. If the user is not familiar with speech as an input option, the recognition fails completely; the user is therefore defined as a weak link. The microphone as input device is discussed in chapter 3.1. The microphone is the second component in any speech recording or transmission system; its function is to convert acoustic sound waves into an equivalent electrical signal. The commercially available microphones of today are not constructed to operate in ambient noise, and therefore backup solutions for input may be necessary, as discussed in chapter 2.3. The microphone is identified as a weak link. Noise reducing hardware may increase the quality of the speech signal and is considered a strong link. Furthermore, noise reducing software is a strong link, since powerful algorithms are achievable. Chapter 3.2 discusses speech recognition software, which is considered a weak link; when it comes to dealing with non-native speakers, the level of recognition is unsatisfactory. Today's computer hardware is not the limiting factor in a speech recognition system.


2.3 Backup solutions
If something fails, a backup system is needed. An example of such a backup system is a hand-held keyboard for text input or a trackball for navigation, as shown in Figure 7. A trackball or touchpad does not allow immediate textual input, which reduces its range of use. Furthermore, none of the backup solutions presented here allows input without the use of the hands. A backup device may also be space consuming and thus interfere with the user's ability to move, e.g. in confined spaces. However, some backup systems may be preferable when navigating in a two-dimensional space, which is very difficult to achieve with voice. The error frequency when using a keyboard is small compared to the error frequency when using voice.

Figure 7: A keyboard or a trackball may function as a backup system if the speech recognition fails. Furthermore, a keyboard may be used in vacant periods of the inspection for text editing. One may consider different backup systems such as a keyboard, hand-scanner or mouse. These backup systems may primarily serve as a supplement to the existing voice input.









Figure 8: From left to right, potential backup systems: a trackball, which additionally offers mouse functions; a data glove, which in addition can also type letters or numbers into the computer; the Twiddler 2, which allows the user full keyboard access using one hand; and, in the right picture, a hand-held Dictaphone for audio input.

If the speech recognition system is used on a mobile platform (palmtop, wearable computer), then the standard desktop computer input devices are inadequate. A conventional keyboard, for example, is not a practical input device, since it was designed to be used while sitting down. A major factor in the development of input devices concerns the placement of the devices. Keyboards require the user to have the fingers free for typing; thus, the keyboard must be held in place by a means other than the user having to grasp it. The advantages and disadvantages of various backup solutions are summarised in Table 1 and Table 2.
As mentioned, full size keyboards are cumbersome. Chorded keyboards use fewer keys to input text: combinations of keys are used to indicate particular letters, and some can be strapped to one hand or a wrist. A keyboard allows a full range of textual input, but in mobile work the keyboard has to be worn and positioned for input. This conflict has given rise to alternative keyboard devices.


2 http://www.handykey.com/

(a) Full size keyboards: A normal keyboard is unattractive in a wearable context; for this reason, a variety of small text-input devices have been developed.

(b) Miniaturised keyboards: Miniaturised keyboards with the options of a full size keyboard are available on the commercial market. A chording keyboard is one where combinations of keys are punched to indicate particular letters. If used solely as a backup, the keyboard could be stored in a jacket and be pulled out when needed. If used more frequently, it could be strapped to the wrist, the belt or elsewhere on the body.
The advantage is that the user has all the features of a full size keyboard; it demands a minimum of space, and little or no training is required. It is inexpensive, requires low power and low bandwidth, and is compatible with existing software.
A shortcoming is that it is cumbersome to use for navigational purposes and annoying for the user because of its placement (on the arm); using a miniaturised keyboard in confined spaces may also require back lighting. Further, no feedback (click or beep) is given on whether the pushing of a button was successful or not. There is no pointing capability inherent in the device.

Commercially available miniaturised keyboards
→ The QWERTY 3 keyboard from L3 Systems 4 is designed for wrist mounting. This keyboard is totally sealed and has optional adjustable back lighting, with a choice of PS/2 or USB 5 interface. Features:
• Optional wrist strap to provide the capability of attaching it to your wrist
• Back lighting
• PS/2 or USB interface
→ The PGI micro keyboard from Phoenix Group, Inc. 6 is rugged and sealed to protect it from the elements. Weighing less than 160 grams, it is designed to be "arm mounted" and offers PC compatibility with 59 keys and 99 functions in a package about the size of a dollar bill. It is supplied with a PS/2 compatible mini-DIN connector. Features:
• Optional wrist strap to provide the capability of attaching it to your wrist
• PS/2 or USB interface
• Back lighting

(c) Virtual keyboards
→ The AUDIT 7 is a text editor and computer control program that is particularly prepared for acoustic communication and remote computer operation 8 9 (a keyboard interface devised for remote computer control). The keyboard interface may be used as an ordinary text editor at any office or personal computer terminal with a graphic screen and normal keyboard, or as a remote text editor with a 12-button keypad, for instance on a push-button telephone/cellular phone or a calculator.

3 The name derives from the first six characters on the top alphabetic line of the keyboard
4 www.l3sys.com
5 Universal Serial Bus
6 www.ivpgi.com/
7 Audible Editor
8 www.dnv.com/ocean/nbt/audit/docs/report.htm
9 www.dnv.com/ocean/nbt/audit/zframe.htm

A virtual keyboard could be based on a modified Braille alphabet (Gran, 1992).

Figure 9: Screen dump of the AUDIT virtual keyboard.

(d) Chording keyboards: These are smaller, with fewer keys, and can be strapped to one hand or a wrist. These input devices typically provide one button for each finger and/or the thumbs, and each button controls multiple key combinations. Instead of the usual one-at-a-time key presses, chording requires simultaneous key presses for each character typed, similar to playing a musical chord on a piano (see the illustrative sketch after this item).
The advantages are that it requires fewer keys than a conventional keyboard. With fewer keys and the fingers never leaving the keys, finger strain is minimised. The user can place the keyboard wherever it is convenient, which helps alleviate unnatural typing positions.
The disadvantages are that the one-handed requirement for input means that it cannot be used for applications where the user must have both hands totally free at all times. It requires at least 10 to 15 hours of training to operate, it is only suitable for textual input, and it usually slows data entry considerably. There is no pointing capability inherent in the device.
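As a purely hypothetical illustration of the chording principle, the sketch below maps sets of simultaneously pressed keys to characters. The particular chord assignments are invented for the example and do not correspond to the BAT, the Twiddler or any other commercial device.

# Hypothetical chord map: a set of simultaneously pressed keys -> one character.
# The assignments below are invented for illustration only.
CHORD_MAP = {
    frozenset({"index"}): "a",
    frozenset({"middle"}): "b",
    frozenset({"index", "middle"}): "c",
    frozenset({"index", "thumb"}): "d",
}

def decode_chord(pressed_keys):
    """Return the character for a chord, or None if the chord is undefined."""
    return CHORD_MAP.get(frozenset(pressed_keys))

print(decode_chord(["index", "middle"]))  # -> "c"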
Commercially available chording keyboards
→ The BAT personal keyboard from Infogrip, Inc. 10 is a one-handed, compact input device that replicates all the functions of a full-size keyboard, but with greater efficiency and convenience.
Letters, numbers, commands and macros are simple key combinations, "chords", that can be mastered after some training. The BAT's ergonomic design reduces hand strain and fatigue.
The BAT is also a typing solution for persons with physical or visual impairments and can increase productivity when used with graphics or desktop publishing software. Features:
• Left or right hand configuration
• Dual keyboard option includes both left and right-hand units
→ The Twiddler from Handykey Corporation 11 is a pocket-sized mouse pointer plus a full-function keyboard in one unit that fits in

10 www.infogrip.com/
11 www.handykey.com/

either the right or left hand. It plugs into both keyboard and serial ports on IBM-compatible PCs and works with DOS, Microsoft Windows 3.x/95/NT, Unix, and Palm Pilot operating systems. The Twiddler's mouse pointer is based on a sensor sealed inside the unit and immune to dust and dirt.
The Twiddler incorporates an ergonomic keypad designed for "chord" keying, i.e. pressing one or more keys at a time. Each key combination generates a character or command. With 12 finger keys and 6 thumb keys, the Twiddler can emulate the 101 keys of a standard keyboard.

Table 1: Evaluation of commercially available keyboards as a backup system.
Advantages | Disadvantages
Enables textual input | Cannot be used if the task requires two hands
Reasonable speed (50 wpm) is achievable | Training required for proficiency
Inexpensive, low power and low bandwidth requirements | No pointing capability inherent in the device
Can be made waterproof |

(e) Pointing devices: Pointing devices may be necessary even if the speech recognition works perfectly. Joysticks, touchpads and trackballs are defined as pointing devices by virtue of their ability to move the cursor on the screen. A pointing device does not enable hands-free navigation and requires training and a free surface. The ability to point to a position on a screen is important for all direct manipulation interfaces and for all applications where there is a figure or a map to annotate.
The advantages are that pointing devices are intuitive, allow random access and positional input, and are compatible with desktop interfaces (Krauss & Zuhlke, 1998). They are widely available and could provide a virtual keyboard by having a representation of a keyboard on the screen and pointing to the various keys desired.
The disadvantages are that the interfaces that currently utilise pointing devices are resource intensive. They are inexact for precise co-ordinate specification and they are slow when used to provide a virtual keyboard.

(f) Trackballs: This stationary device, lodged in the keyboard or found as a stand-alone product, lets users control the cursor with a rotating ball (rather than a conventional mouse). Trackballs have been around for years and have been continually refined for better performance.
The advantages are that a trackball requires little space and gives the ability to point at and "enter" a pushbutton, scroll menu or text field in a program.
The disadvantages are that there are no textual input options, the placement of the trackball is cumbersome, it requires training, and it is not possible to operate it hands-free.

(g) Touchpads: Touchpads, most commonly seen on notebook computers, are made of flexible material similar to a laptop screen. Users control cursor movement by running their fingertips or a stylus along the touch-sensitive surface. The advantage of a touchpad is that it allows continuous positioning of the cursor. Many users find that it offers a more natural motion than a TrackPoint button.
The disadvantage of touchpads is that they are extremely sensitive to moisture contamination. They are also energy consuming, demanding a constant power supply and offering no battery saving "sleep mode" capability. In today's
sub-notebooks, palmtops, cordless keyboards, and handheld remote controls, battery life is a
major concern.
Capacitance-based touchpads have another operational downside. They are insensitive to
pressure directed downward on the pad, and will not operate using a common stylus. They,
therefore, have no potential for 3D input options such as pressure-sensitive scrolling,
signature capture, character recognition, etc.



Commercially available pointing devices

Trackball: The RAT-TRAK™ trackball from Industrial Computer Source (ICS) 12 features user-defined keys, instant speed control and an ergonomic design. The product comes with a PS/2 connection and is Microsoft mouse compatible. Dimensions (L x W x H): 201.4 mm x 102.7 mm x 57.2 mm. The price is approximately $40.

Joystick: The MicroPoint™ from Varatouch Technology, Inc. (VIT) 13 is a small, fully functional joystick with a base diameter of 10 mm. MicroPoint is a variable resistance electronic analogue device that uses resistive rubber. It can be used with a variety of analogue-to-digital converters. The price is approximately $60.

Touchpad: The Smart Cat from Cirque 14 measures about 10 cm square and has a touch surface that measures 3 by 7.62 cm. The Smart Cat allows single- or double-click by tapping on the surface: tap in the left corner for a left button click, and tap on the right side for the right button. The device also comes with standard left and right buttons. It also scrolls both horizontally and vertically through applications that support scrolling. The price is approximately $49.

Table 2: Evaluation of commercially available pointing devices as backup systems.
Advantages | Disadvantages
Can indicate a point in two-dimensional space (map) | Current devices are resource intensive and usually require a surface for positioning
Are intuitive | Inexact for precise co-ordinate specification
Faster than typing | Slow when used to provide a virtual keyboard
Can be made waterproof |

(h) Dictaphone: Findings could also be described and recorded via a Dictaphone. In an earlier DNV project, analogue Dictaphones were tried out on surveyors with unsatisfactory results. The reason could be found in the handling of the Dictaphones and the discipline of the user. However, in some situations and for some persons this may be a suitable backup solution.

12 www.labyrinth.net.au/~ieci/products/input/html/rat-trak.html
13 www.varatouch.com
14 www.cirque.com


Commercially available Dictaphones
→ The Olympus D1000 Dictaphone 15. According to the producer, it is possible to dictate 140 words/min. The machine further allows:
• indexing - automatic date and time recording, automatic dictation numbering
• two recording modes, Standard (15 min) and Long (33 min), which determine the recording capacity of the available 2 MB of flash memory.
A special feature of the Olympus D1000 is that voice recorded in standard mode can be transcribed into editable text by the IBM voice recognition software ViaVoice. For this task, the recorded voice must be transferred to the PC either by cable or by a flash card. ViaVoice Transcription uses approx. 30,000 words in its basic vocabulary, extendable up to 64,000 words by the user. In addition, ViaVoice Transcription has a dictionary with approx. 320,000 words and a back-up vocabulary that includes spell checking and pronunciation.

Table 3: Evaluation of the Dictaphone as a backup system.
Advantages | Disadvantages
Are intuitive | Only possible to use in quiet surroundings
Faster than typing | Requires correct handling for good results
Can be made waterproof | Too many and too small buttons


Summary and recommendations on input devices
The choice of backup input devices will depend on the application, the work surroundings and the user's experience with these devices; the user should be allowed to choose his preferred input device.

RECOMMENDATIONS
As seen from the discussion in this chapter, the skull- and throat-mounted microphones seem to be the most attractive solutions when it comes to hands-free speech input in a noisy environment. However, these microphones are not yet available for speech recognition purposes. Another positive feature is the non-obstructing nature of a skull-mounted microphone, since it can simply be hidden in the helmet.
Real user friendliness requires a wireless connection between the system components. It would allow the user to move freely, but the aspect of safety must also be taken into consideration.





15 www.olympus-europa.com/voice_processing/index.htm

























3 EVALUATION OF COMMERCIALLY AVAILABLE SYSTEM COMPONENTS
3.1 Microphones
The use of voice applications (speech based technologies) is becoming increasingly common on personal computers. Audio applications like Internet telephony, computer telephony, videoconferencing and speech recognition are transforming the PC into the desired communications appliance. High quality microphones are required to enable these voice applications. However, many applications simply presuppose an ideal microphone, with the result that a non-optimal microphone is selected, leading to poor acoustic input into the voice recognition software. Severe performance degradation can result when the microphone is not viewed as a critical performance element in the speech recognition chain. By selecting the proper microphone element (skull mounted, noise cancelling, etc.) and implementing it correctly, the performance of the voice application can be dramatically improved.
The primary barrier to a successful introduction and user acceptance of voice recognition software has been noise that contaminates the speech signal and degrades the performance and quality of speech recognition. The current commercial remedies, such as noise cancellation software and noise cancelling hardware, have shown themselves to be inadequate for dealing with real-world situations. Certain unwanted signals, e.g. background talk, are very similar to the actual voice signal of interest and thus indistinguishable from it, very often degrading the recognition.
3.1.1 Physical principles
The voice is the user's "keyboard", and the recognition result of the voice input depends on the sound characteristics of the microphone.
Although there are different models of microphones, they all do the same job: they transform acoustical movements (the vibrations of air created by the sound waves) into electrical signals. This conversion is relatively direct, and the electrical signal can then be amplified, recorded, or transmitted.
Definition 16: a microphone is a generic term that refers to any element that transforms acoustic energy (sound) into electrical energy (an audio signal).

BASIC MICROPHONE THEORY
Microphones can be classified with respect to several operating principles:
• Current induction in a coil
• Voltage change in a capacitor
• Accelerometer
• Change in resistance
The most common types are dynamic and condenser microphones. Dynamic microphones are dependable, rugged and reliable (compared to condenser microphones) and are used where physical durability is important. They are also reasonably insensitive to environmental factors, and thus find extensive use in outdoor applications. Figure 10 shows the construction of a coil (dynamic) microphone and Figure 11 that of a condenser microphone.

Dynamic microphone: The characteristic of a dynamic microphone is that a flexibly mounted diaphragm is coupled to a coil of fine wire. The coil is placed in the air gap of a magnet such that it is free to move back and forth within the gap. When sound strikes the diaphragm, the diaphragm surface vibrates in response. The motion of the diaphragm couples directly to the coil, which moves back and forth in the field of the magnet. As the coil moves in the magnetic field, a small electrical current is induced in the wire. The magnitude and direction of that current are directly related to the motion of the coil, and the current is thus an electrical representation of the sound wave. Its main characteristics are:
• Moving coil with magnet (like a speaker)
• Requires no electrical power
• Generally more rugged than condenser microphones
• Generally not as sensitive as condenser microphones

Figure 10: In a dynamic microphone a coil is mounted in the field of a permanent magnet. Sound waves move the coil back and forth, thereby inducing an electrical current that is equivalent to the sound intensity.

Condenser (electret) microphones: This type of microphone transforms sound waves into electrical signals by changing the distance between capacitor plates. The electret condenser microphone is the dominant choice for microphones used with computers because of its superior price/performance ratio and its small size. Sound waves cause the top plate to vibrate, which in turn alters the capacitance, resulting in a voltage. The electrical signal varies correspondingly with the frequency and amplitude of the sound waves. An external power supply is needed to measure the capacitance changes and to pre-amplify the signal. Some condenser microphones have a battery attachment that is either part of the microphone housing or on the end of the cable, as part of the connector.

16 www.acronymfinder.com

Condenser microphones are, as mentioned, also less durable than dynamic microphones. Their main characteristics are:
• Moving diaphragm only
• More sensitive than dynamic microphones
• Requires power: traditional condenser microphones require a high-voltage power supply, while modern electret microphones require a battery
• Not as rugged as dynamic microphones
















Figure 11: A condenser microphone measures the change in capacitance caused by changing the distance between two thin metal plates. This type of microphone requires a power source, such as a battery.
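As a simplified illustration of the condenser principle described above (assuming an ideal parallel-plate capacitor operated at approximately constant charge Q, with plate area A, gap d and permittivity ε0; the symbols are introduced here for the example only):

$$C = \frac{\varepsilon_0 A}{d}, \qquad V = \frac{Q}{C} = \frac{Q\,d}{\varepsilon_0 A}$$

so a vibration of the diaphragm that changes the gap d produces a proportional change in the output voltage V, which is the electrical image of the sound wave.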

Microphone response
The construction of a microphone determines its behaviour, or response, with respect to the physical properties of a sound wave. A pressure microphone has a response that is proportional to the pressure in a sound wave, whereas a gradient microphone has a response that corresponds to the difference in pressure across some distance in a sound wave. The pressure microphone is a fine reproducer of sound, while the response of a gradient microphone is typically greatest in a certain direction, thereby rejecting undesired background sounds; gradient microphones are therefore direction sensitive.

Figure 12: The basic microphone design, independent of the physical measurement principle (condenser or dynamic).
Most voice-based applications require that background noise is cancelled or attenuated and that the microphone captures the voice input clearly and with high fidelity.
The main task of the microphone is to transform the sound wave into an electrical signal that ideally contains only the desired signal. The microphone must therefore deliver high quality
signals to the computer even in noisy surroundings (up to 100 dB), humid areas and high temperature zones, and should have EX 17 features. In addition, it should be easy to use. Some commercially available products fulfil these requirements, but they are usually not made for use with a PC. These products have mainly been developed in connection with VHF and UHF radios, to suit professions like smoke divers, police officers, etc. The conversion of such a VHF/UHF microphone to a PC microphone is in principle straightforward. However, its perceived performance will be considerably lower, because in VHF/UHF usage the brain is able to detect spoken words buried in noise that is even much higher than the signal level. This means that the brain is capable of reconstructing a sentence based on just a few fractions. A computer does not have this capability yet; its performance depends highly on the signal-to-noise ratio. Other equally important characteristics that are relevant to speech recognition are:
• Frequency bandwidth
• Distortions
• Echo and echo delay
• Noise type (interference, reverberation, background stationary noise)
• Signal-to-noise ratio (defined below)
• Accelerations and movements
• Positioning of the microphone on the body
• Other characteristics, such as the mechanical effects that may occur when using a press-to-talk microphone
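For reference, the signal-to-noise ratio (SNR) referred to in the list above is normally expressed in decibels:

$$\mathrm{SNR}_{\mathrm{dB}} = 10\,\log_{10}\!\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right)$$

so the 30-40 dB quoted later in this chapter for a quiet office corresponds to a speech power roughly 1,000 to 10,000 times the noise power.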

3.1.2 Noise reducing measures
Noise or other undesired signals can be reduced in a variety of different ways.







Figure 13: Different approaches to noise reduction (A: shielding; B: microphone construction; C: hardware signal processing; D: software signal processing).

A. Shielding: Old socks, etc., i.e. material that absorbs certain frequencies of a sound. Shielding usually works better for high frequency noise than for low frequency noise.
B. Microphone construction: There exist two different types of noise cancelling microphones, namely:
• Acoustic Noise Cancelling Microphone (ANCM) (passive)
• Electronic Noise Cancelling Microphone (ENCM) (active)


→ Acoustic Noise Cancelling Microphone construction (ANCM)
Both sides of an ANCM diaphragm are equally open to arriving sound waves, see Figure 14. The two port openings are a distance "D" apart. Because of this distance, the magnitude of the sound pressure is greater in front of the diaphragm than behind it, and slightly delayed in time. These two effects create a net pressure difference (P_net = P_front - P_rear) across the diaphragm that

17 The component is certified for explosive areas
causes it to move. This system is less contaminated with noise than that of a microphone with
only one opening. Noise cancelling is therefore achieved by the design of the microphone.



Figure 14: Acoustic noise cancelling microphone.


→ Electronic Noise Cancelling Microphone (active)
In active noise cancellation, a secondary noise source is introduced that destructively interferes with the unwanted noise. In general, active noise cancellation microphones rely on multiple sensors to measure the unwanted noise field and the effect of the cancellation. The noise field is modelled as a stochastic process, and an algorithm is used to adaptively estimate the parameters of the process. Based on these parameter estimates, a cancelling signal is generated. The challenge of this approach is that future values of the noise field must be predicted.
The electronic (active) noise-cancelling microphone is built on the same idea as the acoustic noise-cancelling microphone, in that it measures the net pressure difference of a sound wave between two points in space. The characteristic of the active electronic noise cancelling microphone is that it utilises an array of two "pressure" microphones arranged in opposing directions, with a spacing between the microphones that equals the port distance "D", as illustrated in Figure 15. A typical pressure microphone in such an array has the rear diaphragm port sealed to the acoustic wave front while the front is open. As a result, the diaphragm movement represents the absolute magnitude of the compression and rarefaction of the incoming sound wave and not a pressure difference between two points. An array of two pressure microphones achieves noise-cancelling characteristics because the output signal of each microphone is electrically subtracted from the other by an operational amplifier. The operational amplifier output signal gives the "cleaned" sound signal (Oppenheim, A.V., Weinstein, E., Zangi, K., Feder, M. and Gauger, D., 1994).

Figure 15: Electronic (active) noise-cancelling microphone.
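The text above does not name a specific estimation algorithm; one common choice for this kind of adaptive cancellation is a least-mean-squares (LMS) filter. The following is a minimal, illustrative NumPy sketch under that assumption - the function name, parameter values and synthetic signals are invented for the example and do not describe any of the commercial products discussed in this chapter.

import numpy as np

def lms_noise_canceller(primary, reference, n_taps=16, mu=0.01):
    """Minimal LMS adaptive noise canceller (illustrative sketch).

    primary   -- signal from the main microphone (speech + noise)
    reference -- signal from a second sensor that picks up noise only
    Returns the residual signal, i.e. the estimate of the clean speech.
    """
    w = np.zeros(n_taps)                      # adaptive filter weights
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]     # most recent reference samples
        noise_estimate = np.dot(w, x)         # predicted noise in the primary channel
        e = primary[n] - noise_estimate       # residual = cleaned sample
        w += 2 * mu * e * x                   # LMS weight update
        out[n] = e
    return out

# Tiny synthetic example: a tone buried in noise that also reaches the reference sensor.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
speech_like = 0.5 * np.sin(2 * np.pi * 440 * t)
noise = rng.normal(0, 1, t.size)
cleaned = lms_noise_canceller(speech_like + noise, noise)

The filter adapts its weights so that the noise picked up by the reference sensor is subtracted from the primary channel; predicting the noise accurately in time is exactly the challenge pointed out above.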


Comparison of noise cancelling microphone technologies 18
Table 4 compares a passive and an active noise cancellation microphone, the ANC-100 (Andrea Electronics 19) and the Nomad (Telex 20). The active noise-cancelling microphone is more susceptible to electronic interference and fluctuating temperatures; it also costs more because of the "extra" electronics packed inside (two opposing omni-directional microphones).

Table 4: Comparison of boom microphone technology.
 | Passive noise cancellation | Active noise cancellation
Number of microphones | One | Two
Microphone element type | Bi-directional | Omni-directional
Frequency response pattern | Same | Same
Product facts | Single element has inherent balance over all frequencies, temperature and time | Dual elements and electronics susceptible to system imbalances over changing frequencies, temperature and time
Noise cancellation approach | Acoustic | Electronic
Noise cancellation performance | Good in office surroundings | Bad in office surroundings
Susceptibility to electronic noise | None | Moderate
Voice recognition performance | Good in office surroundings | Bad in office surroundings
Cost | Approx. $30 | Approx. $60

C. Noise filtering hardware: Hardware filters are often used in connection with stationary noise, e.g. removal of 50 Hz hum or constant engine noise. Hardware filters can be designed such that they filter out noise above, below or around (band-pass) some specified frequency. They are usually built up of operational amplifiers. The time-continuous signal is smoothed by the filter, resulting in a (still) time-continuous signal (Kuo, F., 1966).





COMMERCIALLY AVAILABLE PRODUCTS



18 Test from Speech Technology Magazine, January/February 1998
19 www.andreaelectronics.com
20 www.computeraudio.telex.com

[Diagram: time-continuous microphone signal → hardware filter → time-continuous filtered signal]

→ ClearSpeech-Microphone 21: ClearSpeech-Mic is digital noise reduction hardware that significantly removes background noise.
Noise cancellation characteristics:
• 300 Hz to 3,400 Hz voice bandwidth
• Single tone noise reduction: > 70 dB
• White noise reduction: > 12 dB
Power supply:
• 9 to 24 V DC
• 0.5 W power consumption
Price: $129.00.

D. Noise filtering software: Powerful software algorithms can be written that perform similar tasks as hardware filters (low-pass, high-pass and band-pass filters). These filters require, however, that the time-continuous sound signal is digitised first, e.g. by an A/D converter. Directional microphones reduce both continuous and discrete noise events from "off-axis" locations. Processing the digitised microphone signal in software reduces continuous noise from all sources and directions, including any internal noise from the microphone and sound card circuitry. Noise reducing software and hardware usually assume a stationary noise source. If the noise characteristics change over time (non-stationary), more adaptive software is needed (Oppenheim, A., Schafer, R., 1975).
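As a concrete illustration of such a software filter (a sketch only, not a description of any product mentioned in this chapter), the following applies a digital Butterworth band-pass filter over the 300-3,400 Hz voice band to a digitised microphone signal, using SciPy; the sample rate, filter order and synthetic test signal are assumptions chosen for the example.

import numpy as np
from scipy.signal import butter, lfilter

def bandpass_voice(signal, fs, low_hz=300.0, high_hz=3400.0, order=4):
    """Band-pass filter a digitised microphone signal to the voice band (sketch)."""
    nyquist = fs / 2.0
    b, a = butter(order, [low_hz / nyquist, high_hz / nyquist], btype="band")
    return lfilter(b, a, signal)

# Example: attenuate a 50 Hz hum mixed into a synthetic "speech" tone.
fs = 16000                                   # assumed sample rate after A/D conversion
t = np.arange(fs) / fs
voice = np.sin(2 * np.pi * 1000 * t)         # stand-in for the speech component
hum = 0.8 * np.sin(2 * np.pi * 50 * t)       # stationary low-frequency noise
filtered = bandpass_voice(voice + hum, fs)

A fixed filter like this handles stationary noise only; as stated above, non-stationary noise calls for adaptive processing.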

COMMERCIALLY AVAILABLE PRODUCTS



→ ClearSpeech technology 22: ClearSpeech is an algorithm designed to remove background noise from speech and other transmitted digital signals. ClearSpeech improves communication through devices such as telephones and radios, and can be used to increase the performance of speech recognition programs. ClearSpeech can be implemented in real time on embedded chips or can be run on a PC under Windows. The algorithm is designed to remove stationary or near-stationary noise from an input signal containing both noise and speech. Stationary noise is noise with constant signal statistics. The ClearSpeech algorithm adapts as the background noise changes.
Speech picked up by a PC's microphone can be impacted by background noise. NCT's ClearSpeech-PC/COM removes ambient noise from speech while it is being recorded, thereby dramatically improving receive-side intelligibility. ClearSpeech-PC/COM can be used in a variety of PC based voice applications, including voice recognition, voice mail, Internet voice communications, real-time processing of voice from noisy environments and post-processing of previously recorded voice.
The characteristics are:
• Continuous and adaptive removal of background noise from speech
• Up to 20 dB signal-to-noise improvement
• Programmable noise reduction parameters
• Includes application software to invoke ClearSpeech-PC/COM while recording and to process previously recorded audio files
• Integrated with the Windows® 95 or NT Audio Compression Manager

Noise cancellation characteristics:
• 300 Hz to 3.85 kHz voice bandwidth
• Single tone reduction > 70 dB
• White noise reduction > 12 dB

21 www.nct-active.com/csmicr.htm
22 www.nct-active.com/

In the following, a test performed by Defence Group Incorporated (DGI) is quoted (Grover & Makovoz, 1999).

Figure 16 23: Noisy speech recorded simultaneously with noise cancelling microphones 24 from different vendors, before (red) and after (black) software noise reduction. In both cases the software was able to satisfactorily remove the noise recorded by the microphones. Each microphone had quite different output, with a distorted speech input signal; the software, however, caused no added distortion in either case.

Speech-to-Text
Dictation requires a very low error rate in order to be accepted. Even with a headset microphone and in a “quiet” office, the background noise limits the achievable error rates of large vocabulary dictation systems. This was shown by testing the VoiceType dictation system from IBM, both with and without software noise processing. Tests were done in an office with no background speakers, using an Andrea ANC-600 headset microphone. Two different recordings were made: one for enrolment, another for testing. The speech-to-noise ratios (SNRs) were 30-40 dB. One copy of VoiceType was trained and tested with no software noise reduction, and another was trained and tested with the software noise processing. Training and testing without the software noise reduction gave 76 errors in 1009 words of spoken text; training and testing with added software filtering gave 22 errors in the same 1009 words. The results are summarised in Table 5 below.

Table 5: Test results of voice filtering.
IBM VoiceType                   No filtering    Filtering
Error rate in quiet office      7.5 %           2.2 %
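The error rates in Table 5 follow directly from the reported error counts over the 1009-word test text, as the short check below shows:

errors_no_filter, errors_filter, words = 76, 22, 1009
print(f"without filtering: {errors_no_filter / words:.1%}")   # -> 7.5%
print(f"with filtering:    {errors_filter / words:.1%}")      # -> 2.2%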

By using more restricted vocabularies and grammars, voice command systems tend to be more noise resistant than large vocabulary dictation systems. Yet they must often tolerate very high noise levels in automotive, industrial, military and other applications. The gain from combining noise reduction software with a commercial voice command system is clear.
Data was recorded for 100 speech commands in a “quiet” environment [25]. An extended test set was then prepared by mixing the recordings with a range of noise levels and types from various vehicle and industrial environments. Enrolment using only the “quiet” data was done both with and without software noise reduction included in the enrolment process.

[23] Pictures taken from http://www.ca.defgrp.com/n_test.html#microphones
[24] (A) The Andrea ANC-600, and (B) the Shure 10A.
[25] Test performed by Defence Group Incorporated (DGI) Signal and Image Processing Group: http://www.ca.defgrp.com/
Testing was then done, again both with and without filtering, using the extended test data set with added noise. For voice command applications, a false positive response is a critical failure, actually worse than no response at all. Table 6 summarises both correct responses and false positives versus the input speech-to-noise ratio, with and without noise reduction by software processing.

Table 6: Results with and without noise filtering.
Voice command     No noise filtering              With noise filtering
SNR (dB)          Correct     False positives     Correct     False positives
35                100 %       0 %                 100 %       0 %
20                88 %        4 %                 99 %        1 %
10                6 %         4 %                 77 %        0 %

Table 7 shows results when the processing was used only for testing, not for enrolment. Enrolment here (as above) used the “quiet” enrolment data but no filtering. Testing was again done on the noisy test data, both with and without noise removal filtering. The performance gains from using noise removal filtering only for testing are not as good, since there was inevitably some noise in the basic training data, where the processing was not used, while the corresponding (and larger) noise perturbations were removed only during testing. Even so, the processing still provides appreciable benefits in a noisy environment, even if not used in the initial enrolment.

Table 7: Results with and without noise filtering in testing.
Voice command     No noise filtering              With noise filtering
                  (training and testing)          (final testing only)
SNR (dB)          Correct     False positives     Correct     False positives
35                100 %       0 %                 100 %       0 %
20                88 %        4 %                 99 %        1 %
10                6 %         4 %                 58 %        1 %
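The exact mixing procedure used by DGI is not described here; the sketch below shows one common way of producing such test conditions, scaling a noise recording so that the mix reaches a prescribed speech-to-noise ratio (the function and variable names are illustrative):

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio of the mix equals `snr_db`."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# e.g. mix_at_snr(clean_command, engine_noise, 10) would reproduce a 10 dB condition.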

3.1.3 Body placement
The location of the microphone during recording plays an important role. For instance, a microphone placed directly under the nose or mouth will capture a lot of breathing sounds and thus contaminate the signal unnecessarily. There are several options for placing the microphone on the body; see Figure 17.

Figure 17: Different options for placing the microphone on or near the body. The best placement is one that does not obstruct the user.

A skull-mounted microphone appears very appealing for several reasons: it does not obstruct the user, it is small, it weighs as little as 40-50 g and it has a high signal-to-noise ratio. The skull microphone can be placed inside the helmet, so that when the user puts on his helmet he “puts on” his microphone as well. The headphone is automatically connected with the microphone and is also positioned inside the helmet.
This communication system is based on the principle of bone transmission and does therefore not obstruct the face. Due to its placement inside the helmet, the influence of outside noise decreases considerably. Clear and easy communication is possible even while wearing a mask or a breathing apparatus. Its use is simple and comfortable.
Further, some users might not be comfortable talking to a machine and may not accept a headset microphone easily. A microphone attached to the helmet may therefore be preferable.

3.1.4 Commercially available microphones
(a) Active noise cancelling boom microphones

EMKAY [26]: Offers an RF wireless headset, a single-channel, full-duplex system with a transmit range greater than 10 m. The headset can operate for up to 10 hours between recharges. Its lightweight construction and ear-loop design ensure a comfortable fit. The headset has been designed for use in PC voice recognition, computer telephony and Internet telephony.

ANDREA [27]: Andrea Electronics Corporation offers an Active Noise Reduction (ANR) earphone, an Active Noise Cancellation (ANC) near-field microphone, and the patented Digital Super Directional Array (DSDA™) and Directional Finding and Tracking Array (DFTA™) far-field microphones.


[26] www.emkayproducts.com
[27] www.andreaelectronics.com



TELEX [28]: Offers a USB [29] digital microphone for speech dictation applications. This microphone delivers pure digital signals to the speech recogniser and eliminates the performance variations inherent in analogue sound cards. The headset also includes acoustic noise cancellation technology to cancel background noise that can degrade speech recognition performance.
Speech recognition software performance is greatly dependent on the quality of the audio signal. Software developers have had a difficult challenge dealing with the wide variations in performance and quality of analogue sound cards. With a USB interface, the voice signal bypasses the sound card and is fed as direct digital input to the USB bus. Table 8 below is an independent judgement of an active noise cancelling microphone (ANCM).

Table 8: Pros and cons of active noise cancelling microphones.
Pros                                          Cons
Commercially available for connection         Not certified for Ex environments
with a PC                                     Does not function in extreme areas
                                              Does not stand rough use
                                              Not user friendly
                                              Consists of many components

(b) Skull-mounted microphone
Two manufacturers of skull microphones are CGF Gallet [30], a French producer, and Ceotronics [31], a German producer. Their microphone characteristics are similar:
• Measuring principle: accelerometer with a sensitivity of 1 mV/mG. Bandwidth: from 20 Hz to 20 kHz.
• Amplifier: bandwidth from 300 Hz to 3 kHz at -3 dB.
• Weight of head equipment: 55 ± 2 g.
• Electrical tightness: IP 54 cover.
• Has EX (explosion safety) features.

Table 9: Skull-mounted microphone.
Pros                               Cons
Certified for Ex environments
Light weight                       Requires use of a helmet
Functions in extreme [32] areas    Not commercially available for connection with a PC
Stands rough use                   Rather expensive (approx. 4000 NKr)
User friendly                      The microphone from Ceotronics needs an amplifier

[28] www.computeraudio.telex.com
[29] Universal Serial Bus
[30] http://www.gallet.fr
[31] http://www.ceotronics.de


(c) Throat microphone
The throat microphone uses an “indirect air-borne vibration” technique, picking up the vibration energy generated on the skin near the vocal cords. It features a high isolation capability, not only against environmental noise but also against the frictional vibration sound generated by the microphone head. Throat microphones can even be made water, dust and corrosion resistant.
These microphones provide clear communication when wearing breathing apparatus or in very high noise environments. A dual-slope band-pass filter circuit rejects the unwanted low-frequency body resonance.





Table 10: Throat microphone.
Pros                                            Cons
Light weight                                    Not certified for Ex environments
Noise cancelling                                Not “user friendly”, since the throat microphone
                                                must be taken on and off
Functions in extreme [33] areas                 Not commercially available for connection with a PC
Stands rough use                                Consists of many components
Ideal for wearing under protective or
hazardous-material clothing
Total hands-free operation (VOX) or PTT
activation
Provides clear audio in high noise
environments



[32] Areas with noise, humidity and high temperatures.
[33] Areas with noise, humidity and high temperatures.
Table 11: The different microphone products compared with each other.
Considerations    Noise cancelling boom      Skull-mounted              Throat microphone
                  microphone                 microphone
Performance       Bad                        Good                       Good
Price             Cheap                      Medium                     Medium
Easy to use       No                         Yes                        No
Other             Available for use with     Not available for use      Not available for use
                  speech recognition         with speech recognition    with speech recognition
                  applications               applications               applications

Table 12: Evaluation of commercially available microphones as an input device for speech recognition.

Advantages:
• Faster than a keyboard and mouse
• Does not require use of hands
• Can improve performance in hands-busy (maintenance) and eyes-busy (inspection) tasks

Disadvantages:
• When the user is working with co-workers, a cue system is needed to let the computer know when an utterance is intended for the computer rather than the co-worker
• May need a press-to-talk switch, requiring a hand
• Use of “bracket words” requires some training
• High background noise levels can cause inaccurate word recognition and false inputs, so a backup input device may be required
• Behavioural states (e.g. anxiety, stress) and task loading can affect human voice characteristics and degrade interactive speech system performance
• Prompts are required when assistance is needed to recall the appropriate procedures in a given situation
• Interfaces must be developed to prompt users when the vocabulary accepted by the system is beyond the user’s recall
• Feedback must be presented to the user when spoken words are not understood (this takes up valuable display space)
• Recognition rates of 95 % mean one error in 20 words; easy, quick ways to correct errors must be developed
• Specifying a position in a two-dimensional space is difficult; a pointing device is needed for such tasks
• When using a speech synthesis system, it can be difficult (or impossible) to interrupt the system while it is “speaking” during the recognition process
• Annoyance when the recognition is not correct






3.2 Speech recognition software

Speech recognition systems are error-prone and not very robust to real world disturbances, such as
ambient background noise including speech spoken by other surrounding speakers, communication
channel distortions, pronunciation variations, speaker stress, or the effects of spontaneous speech. Current
speech recognition systems can be categorised into two groups according to their robustness level. Applications in the first category typically have large vocabularies and have been designed to recognise continuous speech. Successful use of these systems requires efficient minimisation of all possible interference sources. High quality audio equipment, including a close-talk microphone, a noise free operating environment and, in particular, a co-operative and motivated user are needed in order to achieve high recognition performance. Dictation systems for continuous speech are typical examples of this
category. Truly robust speech recognition applications form the second group. Robust systems can cope
with distorted speech input (to a certain extent) and still provide high recognition accuracy even with
inexperienced novice users. These systems can usually recognise only discrete words and their
vocabulary size is limited to some tens of words. A good example of a robust system is a speaker-
dependent name recognition application in voice dialling. In the name dialling system, the user has
trained voice-tags, i.e. names that have phone numbers attached to them. By speaking a certain voice-tag, a
phone call to the attached number is then made. Because of this apparent simplicity, name dialling is very
useful for example in a car environment where the user’s hands and eyes are busy. It is important to note
that speech recognition alone does not have any particular value. To use speech as an input modality,
there must always be some practical advantages. Furthermore, one cannot overestimate the importance of
a good user interface; it is essential that speech recognition applications are extensively tested with real
users under realistic operating environments.
All modern speech recognition systems follow roughly the same basic architecture, as shown in Figure 18. The task of a speech recognition system is to transform the digital speech signal into a discrete, editable word or a sequence of words. This transformation process consists of several steps (Viiki, 1999).
1. First, the digitised microphone signal is converted into a sequence of discrete acoustic observations (a minimal front-end sketch follows this list).
2. Then the actual recognition process makes use of three different knowledge sources: the acoustic models, the lexicon and the recognition algorithm. The algorithm extracts individual blocks that with high probability represent single words.
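A minimal sketch of step 1 is given below. It is an assumption about a typical front end (the frame length, hop and band count are illustrative values), not the implementation of any particular recogniser: the signal is cut into short overlapping frames and each frame is reduced to a small feature vector, yielding the sequence of discrete acoustic observations.

import numpy as np

def acoustic_observations(x, fs=16000, frame_ms=25, hop_ms=10, n_bands=20):
    """Turn a digitised signal x into a sequence of log band-energy vectors."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    win = np.hamming(frame)
    obs = []
    for i in range(0, len(x) - frame, hop):
        power = np.abs(np.fft.rfft(x[i:i + frame] * win)) ** 2
        bands = np.array_split(power, n_bands)            # crude band grouping
        obs.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.array(obs)                                   # shape: (frames, bands)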

[Block diagram residue: Speech Database → Training → Acoustic Models → Recognition; see Figure 18 below.]

Figure 18: A block diagram of a speech recognition system.
The actual “speech recognition algorithm” is the module dealing with individual words. Designers usually focus on three major elements: the vocabulary (complexity, syntax and size), the environment (bandwidth, noise level and distortion type) and the speakers (stressed/relaxed, trained/untrained).
Language specific models (typical sentence construction, phrases, grammar, etc.) are used to identify different word types (substantives, adjectives, verbs, etc.), and each recorded word is compared with the word signatures in the corresponding vocabulary databases.
A number of voice recognition systems are available on the market. The most powerful can recognise thousands of words. However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent.
Many systems also require that the speaker speak slowly and distinctly and separate each word
with a short pause. These systems are called discrete speech systems. Recently, great strides
have been made in continuous speech systems (voice recognition systems that allow you to
speak naturally). There are now several continuous-speech systems available for personal
computers.
Because of their limitations and high cost, voice recognition systems have traditionally been
used only in a few specialised situations. For example, such systems are useful in instances when
the user is unable to use a keyboard to enter data because his or her hands are occupied or
disabled. Instead of typing commands, the user can simply speak into a headset. Increasingly,
however, as the cost decreases and performance
improves, speech recognition systems are entering
the mainstream.

Important characteristics of a speech
recognition program
The bottom line for speech recognition software is
speed and accuracy. If the software can't decipher
what you said correctly, it is not usable. Likewise, if the recognition process takes too long, nobody will use it. Speech recognition programs can, however, do more than take basic dictation, and there are many other features that can make these packages productivity-enhancing tools. Further important characteristics are:

→ Set-up and training
Wizards that assist in setting up the system and help the user get started are considered important. The figure to the right shows the Dragon NaturallySpeaking 3.0 user wizard, used to adjust the volume, measure sound quality and train the program. To improve accuracy further, documents containing common words can be imported.

→ Editing and formatting
Getting the words onto the screen is only half the job. How well the speech recognition software handles editing and formatting is also critical. The way modeless operation and natural-language support are achieved influences user friendliness. For example, when NaturallySpeaking stumbles on a homonym [34], one can simply repeat the word and the program will select the alternative.

→ Application integration
In the future, speech recognition will take place in the background and speech will become simply another way of interacting with the PC. As a precursor to that, vendors have been developing tight links between their speech recognition programs and the applications commonly used every day, especially word processors. In this example, dictation was done directly into Word using L&H Voice Xpress Plus. The Command Browser shows all of the variations that can be used to insert a table into Word.

→ Command-and-control
Although continuous speech dictation is a relatively recent development, command-and-control applications have been around for years. A “What Can I Say” command should list the commands that are available anywhere in, for example, Windows.











[34] Homonym: two words are homonyms if they are pronounced or spelled the same way but have different meanings.



3.2.1 Principles
Speech recognition and Natural Language Processing (NLP) systems are complex pieces of
software (Raghavan, 1998). A variety of algorithms are used in the implementation of speech
recognition systems. Speech recognition works by disassembling sound into small units and then
piecing them back together, while NLP translates words into ideas by examining context, patterns,
phrases, etc.
Speech recognition works by breaking down the sounds the hardware “hears” into smaller, non-divisible sounds called phonemes. Phonemes are distinct units of sound. For example, the word “those” is made up of three phonemes: the first is the “th” sound, the second the hard “o” sound, and the final phoneme the “s” sound. A series of phonemes makes up syllables, syllables make
up words, and words make up sentences, which in turn represent ideas and commands.
Generally, phonemes can be thought of as the sound made by one or more letters in sequence
with other letters. When the Speech Recognition software has broken sounds into phonemes and
syllables, a “best guess” algorithm is used to map the phonemes and syllables to actual words.
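As a toy illustration of such a “best guess” mapping (the pronunciation lexicon and phoneme symbols below are hypothetical, not taken from any real recogniser), a phoneme sequence can be matched against dictionary entries and the closest entry selected:

from difflib import SequenceMatcher

LEXICON = {                       # hypothetical pronunciation dictionary
    "those": ["th", "o", "z"],
    "dose":  ["d", "ow", "s"],
    "to":    ["t", "uw"],
    "two":   ["t", "uw"],
}

def best_guess(phonemes):
    """Return the lexicon word whose phoneme sequence is closest to the input."""
    return max(LEXICON,
               key=lambda w: SequenceMatcher(None, phonemes, LEXICON[w]).ratio())

print(best_guess(["th", "o", "s"]))   # -> "those"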
Once the Speech Recognition software translates sound into words, Natural Language processing
software takes over. The Natural Language Processing software parses strings of words into
logical units based on context, speech patterns, and more “best guess” algorithms. These logical
units of speech are then parsed and analysed, and finally translated into actual commands the
computer can understand based on the same principles used to generate logical units.
Optimally, Speech Recognition and NLP software can work with each other non-linearly in order
to facilitate better comprehension of what the user says and means. For example, a Speech
Recognition package could ask a NLP package if it thinks the “tue” sound means “to”, “two”,
“too”, or if it is part of a larger word such as “particularly”. The NLP system could make a
suggestion to the Speech Recognition system by analysing what seems to make the most sense
given the context of what the user has previously said. Speech Recognition systems may
determine which sounds or words were emphasised by analysing the volume, tone, and speed of
the phonemes spoken by the user and report that information back to the NLP system.
3.2.2 Recognition enhancing measures
Using speech recognition systems in real working surroundings reveals that they do not perform as stated by the sales agent. Several factors determine the recognition rate (Allen, 1992).
The major requirements relate to:
• Vocabulary, speech and language modelling.
• Training material (if needed), the data collection platform and pre-processing procedures.
• Speaker dependency and speaking modes.
• Environment conditions (ambient conditions).
In the following, I will discuss the main factors influencing the recognition process. A speech recogniser is based on some speech modelling using various paradigms. The best known are Dynamic Time Warping (DTW), Hidden Markov Modelling (HMM) and Artificial Neural Networks (ANNs); a minimal DTW sketch is given after the list below. Most of the approaches distinguish two phases: a training phase and an exploitation phase. The first phase is devoted to learning speech characteristics from data:
1) Acoustic wave forms.
2) Phonetic/linguistic descriptions.
3) Specific features, etc.
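As a small illustration of the first of these paradigms (a sketch under simplifying assumptions, not the recogniser evaluated in this thesis), the classic DTW distance between a stored word template and an incoming utterance can be computed as follows; the feature vectors may be, for example, the band-energy observations sketched earlier:

import numpy as np

def dtw_distance(template, utterance):
    """Classic DTW alignment cost between two feature sequences; lower is better."""
    n, m = len(template), len(utterance)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(template[i - 1] - utterance[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A discrete-word recogniser picks the stored template (word) with the
# smallest DTW distance to the incoming utterance.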
Speaker dependency
The system may either be tailor-made for a particular speaker (speaker dependent) or designed to tolerate a large variety of speaker variability (Stern et al. 1996). Other systems may be tuned to the voice of a particular single speaker or to a set of speakers (multi-speaker system). They may also be tuned to trained rather than general untrained speakers.
In order to achieve a higher recognition rate, a training phase on a pre-specified set of words/sentences is usually needed. The characteristics of this required set are described in terms of:
1. Type of data.
2. Speech acoustic wave forms.
3. Acoustic data with phonetic labelling.
4. Acoustic data with the corresponding orthographic forms.
5. Acoustic data with the corresponding phonetic transcription.
6. Acoustic data with the corresponding recognition-units transcription, etc.
7. Size of data (how many hours/minutes of speech).
8. Number of speakers and how they are selected.
9. Other characteristics (sex, age, physical/psychological state, experience, attitude, accent, etc.).
10. Acquisition channels (single microphone, set of microphones, similar telephone handset or as many handsets as possible).
11. Environment conditions (noisy, quiet, all conditions, etc.) and constraints derived from the operating condition.
A speaker-adapted system learns the characteristics of the current speaker and thereby continuously improves its performance. At the beginning the system may be used in a degraded mode (either speaker-independent or speaker-dependent), ending up as an optimised speaker-adapted system. The person who is going to use the system should do the adaptation, in order to tune it in on his specific voice signature. Usually two approaches are used:

→ Static adaptation: The static adaptation process starts with off-line learning from pre-recorded data and a training phase before the system is used. The system references are adapted to the new speaker once and for all. The duration of this process is important: it can be real-time or even last for hours. The speech data needed can be acoustic data without any manual labelling or manual pre-processing, or it may have to be labelled (orthographically plus phonetically). The speech corpus may range from a few minutes of speech to a few hours.

→ Dynamic adaptation: The system learns the current speaker’s characteristics while the speaker is using the system. This may be done by manually correcting errors during the adaptation, or the system may automatically take into consideration the speech data uttered by the present speaker.
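A highly simplified sketch of the dynamic-adaptation idea is given below (an illustration only; real systems adapt the acoustic model parameters themselves, and the class and rate parameter are hypothetical): the recogniser keeps a running estimate of the current speaker’s feature mean and normalises incoming observations against it while the system is in use.

import numpy as np

class OnlineSpeakerAdapter:
    """Running feature-mean normalisation as a stand-in for dynamic adaptation."""
    def __init__(self, dim, rate=0.01):
        self.mean = np.zeros(dim)    # running estimate of the speaker/channel mean
        self.rate = rate             # adaptation speed

    def adapt_and_normalise(self, frame):
        self.mean = (1 - self.rate) * self.mean + self.rate * frame
        return frame - self.mean     # normalised observation fed to the recogniser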
Three speaking modes can be distinguished, each characterised by a different recognition performance:
• Isolated: the words are pronounced in isolation with pauses between two successive words; this gives the highest recognition rate.
• Connected: usually used when spelling names or giving phone numbers digit by digit; a lower recognition rate is achieved.
• Continuous: fluent speech, the mode with the lowest recognition rate.
The speaking rate varies from one speaker to another and depends on various factors such as stress, culture and emotion. It can be slow, normal or fast, and a measure for it is the average number of speech frames within a given set of sentences.
Non-speech sounds: sounds like coughing, sneezing or clearing one’s throat may represent a challenge for the software. These non-linguistic utterances must be considered as part of the speech modelling.
The lexicon size: The size of the vocabulary is one of the main characteristics of automated speech