An Android Mobile Phone Platform as a Communications Aid for the Speech/Hearing Impaired

Ross Duignan

B.E. Electronic & Computer Engineering Project Report
4BP1

Project Supervisor - Dr. Edward Jones
Co-Assessor - Prof. Gearoid O Laighin

30th March 2012






Abstract

The almost ubiquitous nature of mobile phone technology allows for the possibility of using this technology in many ways to assist those who may suffer from some form of health problem or who may have some form of disability. This project will investigate a number of ways in which smart phone technology can be used to assist with communication by speech- or hearing-impaired users.

The basic framework for the application is based on the automatic speech recognition (ASR) and text-to-speech (TTS) functionality provided on some modern smart phones. This functionality is leveraged in a number of ways, i.e. the app translates incoming speech from everyday spontaneous conversations or phone calls into text on the screen for the user to read. Using digital signal processing, the app also captures the "emotion" underlying spoken language, which aims to provide additional contextual information to a hearing-impaired user. On the other hand, for a person with a speech impediment, the application allows the user to type in text, which is then "synthesised" into speech using TTS.

The application also uses the accelerometer integrated into most smart phones for detecting gestures of common sign language phrases to be synthesised by TTS.

The results of this project demonstrate the feasibility of using a smart phone platform as an effective aid for hearing/speech impaired users. The project was carried out with advice from DeafHear.









Acknowledgements

First of all, I would like to acknowledge the help and support I received throughout my project from my project supervisor, Dr. Edward Jones.

I would also like to express my gratitude to Martin Hynes for his patience and support whenever I needed a helping hand.

Tony Dolan from DeafHear Ireland also helped me when I needed a person with a lot of experience with the deaf community to make some design decisions, and I would like to extend my gratitude to him.

This project would not have succeeded without the debugging help and conversation I received from my fellow 4th ECE/EE students.

Finally, I would like to thank my family and my housemates for their continued support over the last 4 years.



















Declaration of Originality

I declare that this thesis is my original work except where stated.

Date: ___________________________________

Signature: ___________________________________
























Contents

Abstract
Acknowledgements
Declaration of Originality
Glossary
1 Introduction
2 Project Outline
3 About Android
4 Background
   4.1 Activity
   4.2 Layout
   4.3 Services
   4.4 Shared preferences
   4.5 BroadcastReceiver
5 Description of Work Carried Out
   5.1 Implementation of basic ASR and TTS functionality
      5.1.1 Android Speech Recognition
      5.1.2 Text-to-Speech
      5.1.3 Integrating ASR and TTS
      5.1.4 Technical Difficulties Encountered
   5.2 Intercepting Phone Call
      5.2.1 Recording Messages
      5.2.2 Technical Difficulties Encountered
   5.3 Motion Detection for Sign Language Phrases and Simple Gestures
      5.3.1 Android's Sensors
      5.3.2 Sign Language Motion Detection
      5.3.3 "Excuse me" phrase detection
      5.3.4 Simple Motion Recognition
      5.3.5 Technical Difficulties Encountered
   5.4 Simple Emotion Recognition
      5.4.1 Emotion Characteristics
         5.4.1.1 Intensity
         5.4.1.2 Pitch
         5.4.1.3 Zero Crossing Rate (ZCR)
         5.4.1.4 Spectral Slope
      5.4.2 Extracting Speech Characteristics from Audio File
      5.4.3 Technical Difficulties Encountered
   5.5 Power Consumption/Computational Complexity of Application
6 Recommendations and Conclusions
7 Appendices
   7.1 Appendix 1
8 References










Table of Figures

Figure 4.1.1 - Survey results on U.S. smartphone subscribers by comScore
Figure 5.1.1 - Application's main screen UI
Figure 5.1.2 - UI for basic ASR
Figure 5.1.3 - Code - ASR implementation with continuous translation
Figure 5.1.4 - Code - Ensuring the four most recent sentences are displayed on screen
Figure 5.1.5 - Basic TTS UI
Figure 5.1.6 - UI for integrated ASR and TTS
Figure 5.1.7 - Code - Set phrases to memory using shared preferences
Figure 5.2.1 - Code - Initializes app when call is received
Figure 5.2.2 - Menu shown during call
Figure 5.2.3 - UI during message being recorded
Figure 5.2.4 - Code - Creates MobileInterpreter folder if it doesn't already exist
Figure 5.2.5 - Code - Exit app when call is over
Figure 5.2.6 - Messages screen showing user time of call and caller's number
Figure 5.2.7 - Code - Translate pre-recorded message
Figure 5.2.8 - Voice message translated
Figure 5.3.1 - The value changes with each respective movement
Figure 5.3.2 - Flowchart - How the phrase "excuse me" is detected
Figure 5.3.3 - Code - Shows how the "goodbye" motion is detected
Figure 5.3.4 - Coffee shop scenario UI
Figure 5.3.5 - Edit phrases that can be outputted in "Scenarios" option
Figure 5.3.6 - Code - Saves message to be outputted and launch the new activity
Figure 5.4.1 - Linear approximation of magnitude spectrum
Figure 5.4.2 - Code - Using audioInputStream to extract bytes from audio file
Figure 5.4.3 - Code - Loops through bytes and makes samples
Figure 5.4.4 - Graph of FFT results















Glossary

TTS - Text to Speech
ASR - Automatic Speech Recognition
FFT - Fast Fourier Transform
GUI - Graphical User Interface






1 Introduction

According to the European Union of the Deaf [1], there are roughly 5,000 Irish Sign Language (ISL) users and only 40 registered interpreters (see the email from the Sign Language Interpreting Service, Ireland, in Appendix 1). Of the 40 registered interpreters, 26 are based in Dublin [2], which leaves the rest of the country extremely understaffed. When you also take into consideration the cost of hiring an interpreter, which can be between €170 and €340 for a full day [2], the service becomes very expensive for a person who needs interpreting on a regular basis. This project aims to provide an alternative to this necessary but costly service.

Technology has provided support for people with all types of day-to-day issues, and there is a definite gap in the market when it comes to communication aids for people with hearing/speech impairments. The majority of deaf people communicate through sign language, which is a great alternative to speaking until a conversation needs to take place with a person who is incapable of interpreting it. Using TTS and ASR, which come as standard with most smart phones, the purpose of this project is to create an intuitive and simple application to allow that conversation to be possible.

Modern smart phones have an array of options that allow the user to enter a command, e.g. voice commands, motion detection etc., and this project aims to take advantage of these to overcome users' limitations in certain social situations. Using ASR and TTS, there is a "General Conversation" function which translates incoming speech and also synthesises text, which has been inputted to the phone, into speech. The application also intercepts phone calls so that people with hearing/speaking impairments can hold conversations over the phone, and it also records messages which can be translated at a later time. Despite the desire to make the application as multi-functional as possible, it still had to have a shallow learning curve if it was going to be successful. For this reason, the application has "sign language motion detection" capabilities, where a certain number of sign language phrases can be recognized by the phone and the speech for the respective phrase is synthesised by the phone. There is also a simpler motion detection functionality implemented, where simple motions are registered with the phone and phrases that the user can input are synthesised to speech.

Movies and TV often use italics and symbols etc. in subtitles to provide the viewer with certain context of the conversation that's occurring. Possibly the most innovative aspect of this project is the "emotion recognition", because it gives the user context to the conversation that simple voice recognition wouldn't be capable of interpreting. Combining this set of functionalities, this application could be a genuine tool to assist people with hearing/speaking impairments.



2 Project Outline



Pass - Implementation of basic ASR and TTS functionality on an Android platform and verification of performance, including a basic "app" to capture input speech and either carry out ASR in real-time, or else store the speech and carry out ASR in a semi-offline fashion, and display the results to screen. Also, develop a basic app to synthesise typed commands using TTS.

Average - Development of an app to "intercept" a phone call (either in real-time or in a "store and recall" framework) and convert it to text using ASR for display on the phone.

Good - Using the on-board accelerometer, implement a "gesture recognition" system, perhaps based on a small subset of sign language gestures, to map gestures to a set of commonly used phrases, and synthesise these using TTS.

Very Good - Characterisation of the computational complexity of the apps on the Android platform, and analysis of the impact on the battery life of the phone. Also, investigation (in Matlab) of a system for emotion recognition for ASR, with the intention of providing additional contextual information for the user.

Excellent - Integration of emotion recognition into the ASR system on the Android platform, and system evaluation in real environments.



3 About Android

Android is a Linux-based operating system for mobile devices owned by Google. Developers primarily write application code in Java, which is a very common and well supported language. Android is open source, so a developer can download the Android SDK (Software Development Kit) for free, and plugins are available for Java programming platforms such as Eclipse. Code can be debugged using a phone emulator built into the Eclipse plugin, or it can be run on an Android powered phone. The emulator provides the developer with internet and GPS functions, but it cannot simulate every action of a phone, e.g. sensor data or phone calls.

According to data collected in the U.S. by the comScore MobiLens service [3], Android had a market share of 48.6% at the start of this year, which is a considerable amount ahead of its closest competitor, Apple, who hold a 29.5% share. This is due to the fact that there are a number of companies who make Android powered phones (HTC, Samsung etc.), compared to Apple being the only company who makes phones powered by iOS (the iPhone's software). The major difference between coding for Android and iOS is that a developer has to pay to be a developer for iOS and has to develop code on an Apple powered computer, compared to Android's free SDK and the fact that coding for an Android powered phone can be done on any computer.

For these reasons, it was a simple choice to code this project for an Android powered phone.


Figure 4.1.1 - Survey results on U.S. smartphone subscribers by comScore

4 Background

This section introduces some key concepts on Android:


4.1 Activity

An activity represents a single screen with a user interface. Multiple activities work together to form a cohesive user experience.

4.2 Layout

A layout is the architecture for the user interface in an activity. It defines the layout structure and holds all the elements that appear to the user. The Android framework gives the developer the flexibility to set the default layout of an activity in the layout XML, but the layout can then be modified based on events in the Java code.

4.3 Services

A service is a component that runs in the background to perform long-running operations or to perform work for remote processes. A service does not provide a user interface. I will be using a service to constantly monitor the phone's "call state" and intercept calls.

4.4 Shared preferences

Shared preferences are used to store private primitive data in key-value pairs, i.e. a primitive value stored together with a key (usually an int or String) that is used to access that entry.

4.5 BroadcastReceiver

A broadcast receiver is a class which extends BroadcastReceiver and which is registered as a receiver in an Android application via the AndroidManifest.xml file. This class is able to receive intents; in this project it is used to monitor the phone's state.
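As a minimal illustrative sketch (not code taken from the project), a broadcast receiver that listens for phone-state changes might look like the following; the class name is hypothetical, and it is assumed to be registered in AndroidManifest.xml with an intent filter for android.intent.action.PHONE_STATE:

    import android.content.BroadcastReceiver;
    import android.content.Context;
    import android.content.Intent;
    import android.telephony.TelephonyManager;

    // Hypothetical receiver; it must also be declared in AndroidManifest.xml with an
    // intent filter for the broadcasts it should receive (here, the phone state).
    public class PhoneStateReceiver extends BroadcastReceiver {
        @Override
        public void onReceive(Context context, Intent intent) {
            // The current call state arrives as a String extra on the intent.
            String state = intent.getStringExtra(TelephonyManager.EXTRA_STATE);
            if (TelephonyManager.EXTRA_STATE_RINGING.equals(state)) {
                // An incoming call has been detected; react here.
            }
        }
    }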






5 Description of Work Carried Out

5.1 Implementation of basic ASR and TTS functionality

The very first thing that was needed for the project was to develop a simple user interface, which is the foundation of any application. In Android, this is done using XML to design the layout for each screen. Layout design in Android is written in XML, and Eclipse has a graphical editor to create these screens using "drag and drop" to place components. Despite having some experience in Android development, this didn't include any work on user interfaces or layout XML, so an alternative solution would have been in the application's best interest.


After some research, the website http://www.appinventorbeta.com/ seemed to provide the best alternative for UI development. Its main selling point is that it lets novice developers create a basic app through the browser without touching code. It allows the developer to design screens with everything that is offered in Eclipse, but AppInventor uses "blocks to specify the app's behaviour". This means that the user develops the app's "flow" by linking up blocks visually instead of actually writing code, which is a massive advantage for beginner Android developers. This seemed like a viable solution to develop the application's user interface, and it would save time in the long run. Unfortunately, after working with this site for some time, it became apparent that its service had some minor flaws and that it would be easier in the long run to develop the UI using XML as originally intended.

To integrate the ASR and Text-to-Speech functionalities, the initial aim was to first get both working independently to obtain a solid understanding of both functions. This would be vital when the time came to integrate them to create the foundation of my application. Figure 5.1.1 shows the main menu UI design that was settled on.






With each UI screen, an Android "activity" must be created. The activity's "onCreate" method is where the processing necessary to set up the screen is completed. The menu screen's capabilities are very simple: once a button is pressed, a new "Intent" is set up which specifies the new activity that needs to be initialized.
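As a hedged sketch of that pattern (the button id and the activity class names are illustrative, not taken from the project), a button's click handler inside the menu activity might launch the next screen like this:

    // Inside the menu activity's onCreate: launch another screen when a button is pressed.
    Button conversationButton = (Button) findViewById(R.id.conversation_button); // hypothetical id
    conversationButton.setOnClickListener(new View.OnClickListener() {
        @Override
        public void onClick(View v) {
            // The Intent names the activity that should be initialized next.
            Intent intent = new Intent(MenuActivity.this, ConversationActivity.class);
            startActivity(intent);
        }
    });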

5.1.1 Android Speech Recognition

Figure 5.1.2 shows the UI that was decided on for the simple ASR implementation.


Figure 5.1.2 - UI for basic ASR

When the microphone is pressed, a new intent is initialized for speech recognition. Figure 5.1.3 shows a fragment of code that sets the speech recognition intent to record speech for six seconds. A "Timer" object is then used to restart the speech recognition intent so that the app is constantly recording until the user toggles the microphone button to the off position.


Figure 5.1.1 - Application's main screen UI






This is needed because Google's speech recognition only takes in the user's voice until there is a pause; it then compares that with data collected and stored on Google's servers to find a match, which is received by the phone and then written to text. This was an issue when it came to general conversation, because it takes roughly two seconds to receive the match from the servers, so if the person had a brief pause and continued talking, the phone would have missed two seconds of speech. Missing two seconds of speech every time the user takes a slight pause was not efficient enough for the purposes of the application, so the application records for six seconds in a constant loop until the user stops recording, so that nothing is missed. Figure 5.1.3 shows that until the user indicates that the person has stopped talking (i.e. conversationEnded = True), the code will keep translating the incoming speech.


Figure 5.1.3 - Code - ASR implementation with continuous translation
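Since the code in figure 5.1.3 is an image in the original, the following is a minimal sketch of the general pattern being described, i.e. looping a speech recognition intent with a Timer until a conversationEnded flag is set; the request code, delay and helper name are assumptions rather than the project's exact code:

    // Assumed to live inside the conversation Activity.
    private static final int REQUEST_ASR = 1;
    private boolean conversationEnded = false;

    private void startListening() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        startActivityForResult(intent, REQUEST_ASR);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        super.onActivityResult(requestCode, resultCode, data);
        if (requestCode == REQUEST_ASR && resultCode == RESULT_OK) {
            // Matches are ordered from most likely to least likely; take the first.
            ArrayList<String> matches =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            if (matches != null && !matches.isEmpty()) {
                displaySentence(matches.get(0)); // hypothetical helper updating the screen
            }
        }
        if (!conversationEnded) {
            // Restart recognition shortly afterwards so the app keeps listening.
            new Timer().schedule(new TimerTask() {
                @Override
                public void run() {
                    runOnUiThread(new Runnable() {
                        @Override
                        public void run() {
                            startListening();
                        }
                    });
                }
            }, 300);
        }
    }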

An array of possible matches is generated from this speech recognition intent, in which the entries are ordered from the most likely match down to the least likely. My code picks the most likely match to write to the screen and ensures that only the four most recent results/sentences are displayed to the user; this code segment is shown in figure 5.1.4.

Figure 5.1.4 - Code - Ensuring the four most recent sentences are displayed on screen

5.1.2 Text-to-Speech

Figure 5.1.5 shows the UI which was decided on for the simple TTS implementation.


Figure 5.1.5 - Basic TTS UI

The Text-to-Speech functionality is a simpler task and is achieved using a lot less code than ASR. The "onCreate" method initializes the TTS object, which will then be used to output the text entered by the user. Once the speaker button is pressed, a method is called which extracts what has been entered in the text box. The TTS object is then used to output it to speech like so:
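The snippet referred to above is an image in the original; a minimal sketch of the kind of call being described, assuming a TextToSpeech object named tts that was created in onCreate and initialised successfully, would be:

    // Speak whatever the user typed into the text box.
    String typed = textBox.getText().toString();      // textBox is the EditText on screen
    tts.speak(typed, TextToSpeech.QUEUE_FLUSH, null);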


Once the TTS is completed, the TTS object needs to be stopped and shutdown.

5.1.3 Integrating ASR and TTS

Figure 5.1.6 shows the UI for integrated ASR and TTS.


Figure 5.1.6 - UI for integrated ASR and TTS

ASR or TTS is initialised by toggling the "Start Conversation" button. This brings up a menu which offers the user the choice to begin the conversation with ASR or TTS. If the user chooses ASR, a new intent is initialized and the speech recognition code, as seen in figure 5.1.3, starts translating speech. Once the user wants to stop recording speech, the "End Speech Recognition" button should be pressed, which brings up the menu again where the user can choose to continue with ASR or switch to TTS.

When using TTS, it is fundamentally the same as the basic TTS previously described, but it has the extra functionality where the user can save phrases. The "Save Phrase" button gives the user the option to save the text in the textbox as a phrase so that it doesn't need to be typed in repeatedly. To do this, and to save the results so that they can be accessed every time the app is opened, some form of memory on the Android device was needed. After some research, the simplest and most compatible solution was to use Android's "SharedPreferences", which is used to save a group of strings when the app is first used and can then be changed based on the user saving their own strings. Figure 5.1.7 shows the method used to set the phrases.



Figure 5.1.7 - Code - Set phrases to memory using shared preferences
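As the code in figure 5.1.7 is an image, here is a hedged sketch of how a phrase might be saved and read back with SharedPreferences; the preference file name and key scheme are assumptions, not the project's actual ones:

    // Save one of the user's phrases under a numbered key.
    private void savePhrase(int slot, String phrase) {
        SharedPreferences prefs = getSharedPreferences("phrases", MODE_PRIVATE);
        SharedPreferences.Editor editor = prefs.edit();
        editor.putString("phrase_" + slot, phrase);   // key-value pair
        editor.commit();                              // persist so it survives app restarts
    }

    // Read a phrase back, falling back to a default if it has never been saved.
    private String loadPhrase(int slot, String fallback) {
        return getSharedPreferences("phrases", MODE_PRIVATE)
                .getString("phrase_" + slot, fallback);
    }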

5.1.4 Technical Difficulties Encountered

The main issue dealt with in this section of the project was the time limit on Google's standard speech recognition. As explained above, I needed to set it to record for six seconds and constantly loop until the user indicated to stop. To do this I needed to use the following line of code:
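That line is an image in the original; a plausible form of it, assuming the RecognizerIntent extra that carries the Android documentation note quoted below, would be:

    // Ask the recogniser to keep recording for at least six seconds before stopping
    // (the exact extra used in the project is not shown here, so this is an assumption).
    intent.putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_MINIMUM_LENGTH_MILLIS, 6000L);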


This method is not commonly used, and the Android developer website [5] has this note regarding it:

"Note that it is extremely rare you'd want to specify this value in an intent. If you don't have a very good reason to change these, you should leave them as they are. Note also that certain values may cause undesired or unexpected results - use judiciously! Additionally, depending on the recognizer implementation, these values may have no effect."

This was a method the application needed to use for continuous translation, so a solution needed to be found for whatever issue was going to arise. Once this method was included in the application's code, the app crashed every time it was run. Having tried every solution to the problem I could imagine, nothing worked. After consulting with PhD student Martin Hynes, the issue turned out to be that the phone I was using to debug the application was powered by Android 2.2, which didn't support the method in question, but Android 2.3 did. This means that only Android phones powered by 2.3 or higher will be able to run the application, so that was the type of phone that was needed for all future debugging of the application.



5.2 Intercepting Phone Call

To intercept a call on Android, a "service" needed to be set up to constantly monitor the phone's call state, so that when a phone call is made or received the application is initialized in the background. When the call state shows that an incoming call is being received, the fragment of code shown in figure 5.2.1 initializes the app.

Figure 5.2.1 - Code - Initializes app when call is received
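Because figure 5.2.1 is an image, the following is a hedged sketch of one way an app can be started when a call comes in, using a PhoneStateListener registered through the TelephonyManager; the MonitorService and CallMenuActivity names are assumptions, not the project's code:

    // Inside the monitoring service: listen for call-state changes.
    TelephonyManager telephony =
            (TelephonyManager) getSystemService(Context.TELEPHONY_SERVICE);
    telephony.listen(new PhoneStateListener() {
        @Override
        public void onCallStateChanged(int state, String incomingNumber) {
            if (state == TelephonyManager.CALL_STATE_RINGING) {
                // Bring up the app's in-call menu when an incoming call is detected.
                Intent menu = new Intent(MonitorService.this, CallMenuActivity.class);
                menu.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); // required when starting from a service
                startActivity(menu);
            }
        }
    }, PhoneStateListener.LISTEN_CALL_STATE);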

Once the incoming call has been detected, the app initializes the menu shown in figure 5.2.2. If the user wants to use the app for the duration of the call, they choose between using simple ASR, the "general conversation" option (which is the integrated ASR and TTS) or the "record message" option.

Figure 5.2.2 - Menu shown during call

5.2.1 Recording Messages

The "record message" option first plays a pre-recorded message which explains to the person who has made the call that the app user is not available to take the call, and that the app will record a message which the user can later translate to text and reply to. When this option is chosen, the mp3 file which explains that a message is going to be recorded is played, and figure 5.2.3 shows the screen that is displayed. Another "Timer" object is used to delay the recording until after the "explanation" mp3 file has been played.


Figure 5.2.3 - UI during message being recorded

Before messages can be saved, a file location needed to be set up so they can be accessed for future reference. Every time the app is opened, it checks if the "MobileInterpreter" folder exists on the phone's SD card and, if not, creates it; figure 5.2.4 shows how this is done.

Figure 5.2.4 - Code - Creates MobileInterpreter folder if it doesn't already exist

Android has its own sound file extension, which is .3gp. To record a .3gp file, the output file location needs to exist before anything can be written to it. Once the program has decided which file to write to (i.e. the "oldest" file) and the "explanation" file has finished playing, the recording starts.
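A hedged sketch of the two steps just described, creating the folder if it is missing and then recording a .3gp file into it with MediaRecorder, could look like this; the folder name follows the text, while the file name and the choice of MediaRecorder settings are assumptions:

    // Create the MobileInterpreter folder on the SD card if it doesn't exist yet.
    File folder = new File(Environment.getExternalStorageDirectory(), "MobileInterpreter");
    if (!folder.exists()) {
        folder.mkdirs();
    }

    // Record the caller's message into a .3gp file inside that folder.
    File output = new File(folder, "message1.3gp");            // hypothetical file name
    MediaRecorder recorder = new MediaRecorder();
    recorder.setAudioSource(MediaRecorder.AudioSource.MIC);
    recorder.setOutputFormat(MediaRecorder.OutputFormat.THREE_GPP);
    recorder.setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB);
    recorder.setOutputFile(output.getAbsolutePath());
    try {
        recorder.prepare();
        recorder.start();                                      // stopped when the call ends
    } catch (IOException e) {
        e.printStackTrace();
    }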

The same telephonyManager service is used to monitor the call state to know when the call
has ended, which in turn ends the recording.


Figure 5.2.5 - Code - Exit app when call is over

Once the call is over, the recording is accessible for translating. When the user presses the "Translate recorded messages" button on the main screen (figure 5.1.1), the user is presented with a list of recorded messages showing the phone number of the person who called and the time of the call, as shown in figure 5.2.6.


Figure 5.2.6 - Messages screen showing user time of call and caller's number

To translate the message, it needs to be played as a media file loudly on the speakers so the phone can detect it. So an instance of MediaPlayer containing the recorded file needs to be set up and played while the app is translating any speech it detects, as shown in figure 5.2.7. Once the recording is over, the popup window can be displayed with the translated text.

Figure 5.2.7 - Code - Translate pre-recorded message
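Since figure 5.2.7 is an image, here is a minimal sketch of the playback side of this step (the recognition itself would be running at the same time, as in section 5.1.1); the file path and helper name are illustrative:

    try {
        MediaPlayer player = new MediaPlayer();
        player.setDataSource("/sdcard/MobileInterpreter/message1.3gp"); // hypothetical path
        player.setOnCompletionListener(new MediaPlayer.OnCompletionListener() {
            @Override
            public void onCompletion(MediaPlayer mp) {
                mp.release();
                showTranslationPopup();   // hypothetical helper showing the translated text
            }
        });
        player.prepare();
        player.start();                   // played out loud so the recogniser can hear it
    } catch (IOException e) {
        e.printStackTrace();
    }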

Then the user can click on whatever message they want translated and they will be presented with the translated text on screen, as shown below in figure 5.2.8. To make this work effectively, the recorded message needs to be quite clear, as is the case for translating normal speech.


Figure 5.2.8 - Voice message translated

5.2.2 Technical Difficulties Encountered

One of the main difficulties with this section, in a practical sense, is that when the recordings explaining the delays on the phone call (i.e. waiting for the user to set the app up at the start of the call, or recording a voice message) are played to the person calling, they aren't loud enough for that person to hear. This is due to the fact that Android phones will not allow audio files to be played on the phone's speakers when the caller's voice is also being amplified through them. This is an issue that could fix itself, like the problem in the previous section, i.e. a higher level of Android software may be required.

Another issue that I was faced with in this section was figuring out how to stop the recording when the call is over. When the call has ended, the application hits the "phoneState = Idle" method in the TelephoneyReciever class, which could not access the recorder object to stop the recording. This problem could not be solved by saving the object to memory so it could be accessed, because none of the Android memory techniques are capable of storing objects. It also could not be accessed using a public method in the Messages class called from the TelephoneyReciever class, because the object was null once the TelephoneyReciever class was initiated. To get around this problem, the System.exit(0) method is used to exit the application completely, which in turn ends the recording. This method of exiting an application is not recommended, as Android is designed to manage the running applications itself and close them as needed, but in this case it is necessary and causes no further issues.



5.3 Motion Detection for Sign Language Phrases and Simple Gestures

This aspect of the project was to use the accelerometer and other sensors built into Android smartphones to detect the orientation and the motion of the phone. This would be done so that sign language gestures could be replicated while holding the phone, and the phone would then synthesise the speech of that respective phrase. Even before research had been done on this topic, it was clear that the phone would not be able to detect some of the subtle movements used in sign language with this method of detection. Finger movements would not be detectable, and in most cases the phone's distance relative to the user's body would not be picked up by any of the phone's sensors. This would mean a limitation on how many phrases could be directly mapped to motions that could be detected by the phone.

A sign language phrase is made up of a combination of different motions, and even a simple phrase like "excuse me" is three separate motions. This is the reason why a simple motion detection system needed to be implemented, so that phrases like "My name is Joe" could be easily synthesised with a flick of the phone. This creates a simple and fast way to output such a sentence, instead of trying to get the phone to detect a combination of more than ten motions for such a common sentence.

5.3.1 Android's Sensors

Here is a list of sensors that are built into most Android smartphones:



- Accelerometer
- Ambient temperature
- Gravity
- Gyroscope
- Light
- Linear acceleration
- Magnetic field
- Orientation
- Pressure
- Proximity
- Relative humidity
- Rotation vector
- Temperature

Most of these sensors do not apply to motion detection; the sensors with most relevance to this application are the accelerometer, the orientation sensor and the proximity sensor. The accelerometer and the orientation sensor are self-explanatory in how they can be used to detect motions, but the proximity sensor isn't as obvious and is vital for a high percentage of the signs that can be detected. The reasons why this is the case will be described later in this chapter.

5.3.2 Sign Language Motion Detection

As described above, each sign language "phrase" is made up of a combination of different motions, so these motions needed to be mapped in a way that the phone could detect them. Depending on the motions of each "phrase", the phone's orientation and its accelerometer readings are vital to detecting each individual motion.

For code to register changes in sensor readings, the Java class in question needs to implement SensorEventListener. This has a method called onSensorChanged, which is called whenever a sensor value changes. A SensorManager object needs to be initiated to register listeners for each of the sensors that a developer needs to extract data from, and these listeners are registered in the activity's onResume method. Figure 5.3.1 shows how the orientation values change with each rotation on each axis. This diagram is very helpful in order to understand the orientation of each motion detected by the phone.

Figure 5.3.1 - The value changes with each respective movement
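As a hedged sketch of the listener registration just described (a fragment assumed to sit inside an Activity that implements SensorEventListener; the sensor choices follow the text):

    private SensorManager sensorManager;

    @Override
    protected void onResume() {
        super.onResume();
        // Register for the sensors this application cares about; onSensorChanged
        // will then fire whenever any of their values change.
        sensorManager = (SensorManager) getSystemService(Context.SENSOR_SERVICE);
        sensorManager.registerListener(this,
                sensorManager.getDefaultSensor(Sensor.TYPE_ACCELEROMETER),
                SensorManager.SENSOR_DELAY_NORMAL);
        sensorManager.registerListener(this,
                sensorManager.getDefaultSensor(Sensor.TYPE_ORIENTATION),
                SensorManager.SENSOR_DELAY_NORMAL);
        sensorManager.registerListener(this,
                sensorManager.getDefaultSensor(Sensor.TYPE_PROXIMITY),
                SensorManager.SENSOR_DELAY_NORMAL);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        // Inspect event.sensor.getType() and event.values here to track each motion.
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) {
        // Not needed for this sketch.
    }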

5.3.3 "Excuse me" phrase detection

To show the implementation of the motion detection, here is an explanation of how the phrase "excuse me" is detected. To sign "excuse me", the person needs to place the tips of their fingers on their chin, then lower them and place them on their chest. So the aim is to replicate those same motions with the phone in the user's hand in order to detect the phrase.

Regardless of the sign language phrase, the main objective was to break it down into a number of simple motions/positions. In this case, the first position is that the phone is vertical and also in close proximity to the user's chin (proximity value = 9 if in close proximity, as seen in the flowchart). The next motion/position the phone tries to detect is a downward motion of the phone. To detect this, the phone's accelerometer is used to check if the y-axis (vertical plane) value is less than 8.5, i.e. if the phone is moving downwards there is less than gravity forcing the phone to the floor. Once this has been detected, the phone checks again whether the phone is vertical and whether it is also in close proximity to the user's chest. If that is the case, the speech for "excuse me" is synthesised.

The flowchart shown in figure 5.3.2 describes each decision needed to ensure the "excuse me" motion has been detected.




Figure 5.3.2 - Flowchart - How the phrase "excuse me" is detected

Figure 5.3.3 shows a fragment of code illustrating how, after each motion is detected, the code checks which Boolean variables have been set to true, representing motions that have already been detected. If all the variables related to a phrase have been detected, the speech for the respective phrase is synthesised; otherwise a variable is set to true.

Figure 5.3.3 - Code - Shows how the "goodbye" motion is detected
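Figure 5.3.3 is an image; as a hedged sketch of the flag-based pattern the text describes, applied to the "excuse me" sequence above (the threshold values follow the prose, while the method and helper names are assumptions):

    private boolean atChin = false;
    private boolean movedDown = false;

    // Called from onSensorChanged with the latest readings: walk through the
    // "excuse me" motion sequence and speak the phrase once every step is seen.
    private void checkExcuseMe(boolean vertical, float accelY, float proximity) {
        boolean close = (proximity == 9); // per the flowchart, 9 indicates close proximity
        if (!atChin && vertical && close) {
            atChin = true;                         // step 1: phone held at the chin
        } else if (atChin && !movedDown && accelY < 8.5f) {
            movedDown = true;                      // step 2: downward motion detected
        } else if (atChin && movedDown && vertical && close) {
            speakPhrase("Excuse me");              // hypothetical helper that triggers TTS
            atChin = false;                        // reset for the next detection
            movedDown = false;
        }
    }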

5.3.4 Simple Motion Recognition

To give the user an easier and faster way of synthesising phrases, there is a "Scenarios" option. This presents the user with a list of social scenarios that the user can choose from in order to synthesise phrases related to that scenario, as can be seen in the "coffee shop" example in figure 5.3.4.


Figure 5.3.4 - Coffee shop scenario UI

As can be seen in this diagram, all the user needs to do is tilt the phone in the direction of the arrow to output the phrase. When the phrase is outputted, the phone also vibrates and shows the phrase on the screen so that the user knows the phrase has been "said". The user can also create their own scenarios with their own set phrases by clicking the "custom" option on the scenarios menu. This presents the user with a screen where each phrase is clickable and, once clicked, can be edited and saved to memory for future use, as shown in figure 5.3.5.


Figure 5.3.5 - Edit phrases that can be outputted in "Scenarios" option

5.3.5 Technical Difficulties Encountered

The biggest issue that was addressed in this section was the fact that it is not possible for one class to implement both SensorEventListener (necessary for registering changes in sensor data) and OnInitListener (necessary for synthesising text to speech). This meant that there was a need to write two separate activities: one monitors the sensor values and, once a motion has been detected, saves the string of the phrase to shared preferences so the other class can output it to speech. The method seen in figure 5.3.6 is called every time a motion is detected.

Figure 5.3.6 - Code - Saves message to be outputted and launch the new activity

This shows that a new activity needs to be launched every time a motion is detected. However, this new activity does not need a user interface, because all it does is output the text of the phrase to speech and then return to the motion detection class. This means that each such activity needed to call the finish() command, which finishes the activity in question, so there is not a build-up of unused activities that would cause a memory leak and the eventual crashing of the application.












5.4 Simple Emotion Recognition

While researching this section, it became apparent that it was not possible to use the microphone for two different functions at once, i.e. it couldn't translate the incoming speech and also take samples of the speech for some kind of emotion recognition. Taking this into consideration, the decision was made to implement emotion recognition on the voice messages recorded from the voice mail system. This solves the previously mentioned issue, because the speech that needs emotion recognition is already recorded, so samples can be taken from it and it can then be played as an audio file which the phone can take in and translate.

5.4.1 Emotion Characteristics

To obtain the basic emotion in the voice that has been recorded, there are four characteristics that need to be retrieved and examined. The four characteristics are:

5.4.1.1 Intensity

When the phrase "voice intensity" is used, it refers to the energy contained in the speech as it is produced. Voice intensity can differ from person to person due to a number of factors. The vibration of the vocal chords when a person speaks can determine the resulting volume. If there are more vibrations, and if the amplitude of those vibrations is large, the resulting speech will have a larger intensity, and vice versa. This is due to the fact that a vibration of higher amplitude puts more pressure on the glottis, which contains your vocal chords and larynx and is therefore responsible for speech [6]. The formula to calculate this is:
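The formula itself is an image in the original; based on the description in section 5.4.2 (the mean of the squared sample values), it is presumably of the form:

    intensity = (1/N) × Σ x[n]²,  for n = 1 … N

where x[n] are the audio samples and N is the number of samples.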


5.4.1.2 Pitch

The pitch of a person's voice is determined by the rate of vibration of the vocal chords, i.e. the greater the rate of vibration, the higher the pitch. The rate of vibration is determined by the length and thickness of the vocal chords and their movement. Women tend to have shorter vocal chords, and this is the reason why women usually have higher pitched voices than men.

5.4.1.3 Zero Crossing Rate (ZCR)

This is the rate of change of sign across the signal, i.e. going from positive to negative and vice versa. The formula to calculate this is:
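The formula is an image in the original; the standard definition of the zero crossing rate, which matches the description above, is presumably:

    ZCR = (1/(N−1)) × Σ 1{ sign(x[n]) ≠ sign(x[n−1]) },  for n = 2 … N

i.e. the number of consecutive sample pairs with opposite signs, divided by the number of pairs.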

5.4.1.4 Spectral Slope

This describes, in terms of speech, the distribution of energy in the magnitude spectrum, given by the slope of a linear approximation of the magnitude spectrum [7]. An example of this linear approximation can be seen in figure 5.4.1.


Figure 5.4.1 - Linear approximation of magnitude spectrum

5.4.2 Extracting Speech Characteristics from Audio File

The application needs to convert the audio file to bytes, which then need to be converted to samples so the audio wave can be obtained. Once the application has that data, the digital signal processing can be done to extract the emotion characteristics. To convert the audio file to bytes, I need to use audioInputStream, which takes in a byteArrayInputStream and creates an array of bytes, as shown in figure 5.4.2.


Figure 5.4.2 - Code - Using audioInputStream to extract bytes from audio file

Once this byte array has been obtained, the bytes need to be converted to samples, which can be done as shown in figure 5.4.3.


Figure 5.4.3 - Code - Loops through bytes and makes samples


As can be seen in the code above, a pair of bytes makes up a sample. Each byte in Java is made up of 8 bits and a sample is made up of 16 bits. The line "sample = (high<<8)+(low&0x00ff)" means that the high byte is pushed 8 bits to the left, so it resides in bits 16 to 9, while the low byte takes up bits 8 to 1. The "0x00ff" part is needed to remove the sign of the byte value. In Java, all numbers are implicitly signed, which means that each type of integer (byte, short, int or long) and the floats (float, double) can be expressed as positive or negative. For this scenario, unsigned bytes are needed because they can only be expressed as positive. Once an array of samples has been obtained, the intensity of the voice inputted can be calculated by getting the mean of each sample value squared.
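A hedged sketch of the two steps just described, combining each byte pair into a 16-bit sample and then taking the mean of the squared samples as the intensity (assuming 16-bit audio with the low byte first):

    static double intensity(byte[] audio) {
        int n = audio.length / 2;                       // two bytes per 16-bit sample
        double sumSquares = 0;
        for (int i = 0; i < n; i++) {
            byte low = audio[2 * i];
            byte high = audio[2 * i + 1];
            int sample = (high << 8) + (low & 0x00ff);  // combine the byte pair into one sample
            sumSquares += (double) sample * sample;
        }
        return sumSquares / n;                          // mean of the squared sample values
    }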

Say the byte array of samples is of size X; I needed to find the largest power-of-2 value that fits within X in order to compute the Fast Fourier Transform of the signal. This is done by incrementing an index, calculating its power-of-2 value and then taking that value away from X. If the resulting value is negative, then the most appropriate value is the previous power of 2, i.e.:

    // Find the largest power of two that is less than or equal to n (the number of samples).
    for (int i = 0; i <= 30; i++) {
        double twoVal = Math.pow(2, i);
        double currentno = n - twoVal;
        if (currentno < 0) {
            // The previous power of two is the largest one that fits.
            return Math.pow(2, i - 1);
        }
        if (i == 30) {
            return Math.pow(2, 30);
        }
    }

Once this value, call it Y, has been obtained, an array of the same size is created and populated with the first Y samples. It also has to be mentioned that the resulting array has mirrored points after Y/2, so windowing needs to be performed. In this case, that implies that only the points from 0 to Y/2 are taken into consideration.

Then the fast Fourier transform is applied to the array, which returns a real and imaginary part for each entry. The magnitude of each real and imaginary pair is placed in a new array to represent the result of the fast Fourier transform. This is the data the zero crossing rate is extracted from, so every time the data goes from positive to negative and vice versa, the ZCR variable is incremented.

The spectral slope is also obtained from this data. It is extracted by calculating the slope between the highest point (which will be one of the peaks at the start of the graph) and the last point.

The last characteristic needed was pitch and, after some research into which methods could achieve pitch detection, the autocorrelation method is the one that the application uses. This method takes the resulting array from the FFT, loops through each result and, at each position, multiplies the current result with the next result in the array. It calculates this for every result and sums them up to get one piece of data. This happens X times (the number of results in the FFT array) to provide X results. This data is then graphed, and the result can be seen below:



Figure 5.4.4 - Graph of FFT results

From the graph, 3 spikes can be seen at 1341, 2681 and 4021, and these 3 points are equidistant, which means the spikes are periodic. To obtain the pitch from this data, the pitch period needs to be obtained, which is the distance between the spikes, in this case 1340. This represents the pitch period, and the reciprocal of the corresponding period determines the pitch, which in this case is 74.6 Hz.
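As a hedged illustration of autocorrelation-based pitch estimation in general (a textbook-style sketch rather than the project's exact algorithm, which works on the FFT output as described above), the pitch period can be found as the lag with the strongest correlation and converted to a frequency using the sampling rate:

    static double estimatePitch(double[] x, double sampleRate) {
        int bestLag = 0;
        double bestSum = 0;
        // Skip very small lags, which would otherwise always dominate.
        for (int lag = 20; lag < x.length / 2; lag++) {
            double sum = 0;
            for (int i = 0; i + lag < x.length; i++) {
                sum += x[i] * x[i + lag];            // correlate the signal with a shifted copy
            }
            if (sum > bestSum) {
                bestSum = sum;
                bestLag = lag;                       // candidate pitch period in samples
            }
        }
        return bestLag == 0 ? 0 : sampleRate / bestLag;  // period (samples) -> pitch (Hz)
    }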

5.4.3 Technical Difficulties Encountered

The description in this section is all based on code that had been implemented in Java on my laptop and did everything that was required. However, when I tried to port the code to Android, I realized that the class called "audioInputStream" has stopped being supported on Android in the last 6 months. This caused a major issue, because this class is how the bytes were extracted from the audio file.
The code implemented in Java takes in a recording of voice from the microphone and writes the recording to a byteArrayOutputStream. This data is then placed in an audioInputStream, and this is the byte data used for the calculations described above.

To try and solve this issue, I used a class called dataInputStream that is supported by Android, because it has the same methods as audioInputStream, so it was potentially a viable replacement. This wasn't to be the case, as it was not returning the correct values.

The cause of this issue could lie in a couple of different places. The code implemented in Java uses a direct recording to obtain the data that is going to be processed, whereas on Android the aim was to use a .3gp file as the audio source for the emotion recognition. Seeing as the Java code worked, I tried to use the same code to extract the appropriate data from the .3gp file I was trying to use in the Android code; however, this returned incorrect data as well. This indicated that the file format was the problem and that I needed a file format supported in Java to at least get the right data from a file instead of a direct recording. This is when I tried to use a .wav file, which is fully supported by audioInputStream; however, this was still not returning the correct data.

This issue has yet to be solved but is still being investigated by Martin Hynes (a PhD student) and myself to try and come up with a solution. Unfortunately, this problem could be just a simple file format problem, but it could also be something that is not possible on Android without audioInputStream being supported. Either way, the code for the voice characteristic extraction works on both platforms; the only issue is the byte extraction from the files.






5.5 Power Consumption/Computational Complexity of Application

To obtain power consumption details of the application, an application called "PowerTutor" was used. This is the description of the application on its website [8]:

"PowerTutor is an application for Google phones that displays the power consumed by major system components such as CPU, network interface, display, and GPS receiver and different applications. The application allows software developers to see the impact of design changes on power efficiency"

The final design of each function of my application was designed to be as efficient as possible, and the following figures show the power consumption of each function:













These are the graphs that show the power consumption statistics of the following functions:

1. Application starting up (roughly 100 mW for processing)
2. Text to Speech being used (just over 100 mW for processing)
3. Speech Recognition (less than 100 mW for processing, nearly 600 mW for the display)
4. Both TTS and ASR (basically the same statistics as before, but one after another)
5. Sign language detection (roughly 150 mW for processing, nearly 600 mW for 6 seconds for the display)
6. Scenarios motion detection (roughly 150 mW for 4 seconds for processing, nearly 600 mW for 16 seconds for the display)
7. Translation of recording (roughly 50 mW for 5 seconds for processing, nearly 600 mW for about 13 seconds on the display)
8. Translation of recording with emotion recognition (over 400 mW for processing, nearly 600 mW for 12 seconds for the display)
9. Launching the application during a call (roughly 100 mW for 10 seconds for processing, nearly 600 mW for 10 seconds for the display)

The GUI presented to the user for speech recognition shows a consistent rise in the wattage needed for the display whenever it is used in a function. The data also shows that motion detection takes up more of the phone's energy than speech recognition. The function that takes up the most energy is translating a message with emotion recognition, which is down to the necessary complexity of the code needed to obtain each characteristic of the signal.

The average power consumption of the application since I started using PowerTutor, about 4 months ago, is only 5 mW, compared to the stock Facebook application, which has an average of 24 mW. Considering stock batteries in Android phones range from 1230 mAh to 1800 mAh, this application could be run for a large portion of the day without any battery issues.



6 Recommendations and Conclusions

Despite this being a very successful project in terms of the aims being met, there are some changes/improvements that could be implemented in the application to increase its efficiency as a communication aid.

First of all, the biggest thing I would like to change about the project currently is the emotion recognition not functioning on the phone. This could be an issue that genuinely cannot be solved, but it could also be an issue that could be solved with more research. If the issue ends up being about the format of the file used in Android, and the classes used all work well, then it would not take too much time and effort to implement. On the other hand, if the issue is that the algorithm for emotion recognition needs the class that is not supported in Android, it would be a much bigger problem that might not be solvable without that support.

I would also like to research background noise cancellation, so that when a user records a voice message to translate later, the quality of the translation would be consistent regardless of noise in the background. This could be a big ask, but unless the clarity of the voice recording is near perfect, the speech recognition will not be as effective compared to talking near the phone.

On the same note of efficient speech recognition, I would like to implement a certain level of intelligence in the application. I would do this by giving the user an option to replace a word the speech recognition constantly gets wrong with the correct word. This would be very effective if the application was being used with diverse accents, as the application would definitely have trouble getting the right match for each word. So if a person said "red" and the speech recognition constantly picked it up as "spread", the user could input this correction, so that whenever "spread" was recognised it would need to be in the right context for it to be used; if not, it would be replaced by the correction. Deciphering the right context for each word in the corrections database (which could be just a file with each line holding a correction) could be quite difficult, but if I had the time I would like to try and implement it.


Despite these potential changes, this application could be put on the Android Market now and, while not being the perfect solution, it could help countless people in their constant struggle to communicate with people outside the deaf community.



7 Appendices

7.1 Appendix 1

Email sent to SLIS (Sign Language Interpreting Service) Ireland:

Hi,

I'm doing a college project on a communication aid for people with hearing impairments and I was wondering if you could give me the number of how many interpreters are registered with SLIS?

Thanks in advance,
Ross Duignan

Reply:

Hi Ross

Thank you for your email.

In general there are 72 trained interpreters in Ireland with a further 12 due to graduate from college in the coming 2 years. There are currently approximately 40 interpreters working throughout Ireland. Some of these work part-time, some work full-time and some work occasionally.

SLIS no longer operates a booking service and does not employ interpreters but rather we provide a referral service. This means we put Clients in touch with suitably qualified, trained and experienced interpreters to match the requested assignment. All interpreters operate as freelance, self-employed contractors.

I hope this helps you.

Regards
Audrey Campbell

8 References

[1] http://www.eud.eu/Ireland-i-188.html. Last accessed March 29th 2012.

[2] Comhairle (September 2006) "Review of Sign Language Interpretation Services and Service Requirements in Ireland". Available: http://www.citizensinformationboard.ie/downloads/Sign_Language_Report.pdf. Last accessed March 29th 2012.

[3] http://www.comscore.com/Press_Events/Press_Releases/2012/3/comScore_Reports_January_2012_U.S._Mobile_Subscriber_Market_Share. Last accessed March 29th 2012.

[4] http://www.vogella.de/articles/AndroidServices/article.html#broadcastreceiver_definition. Last accessed March 29th 2012.

[5] developer.android.com/. Last accessed March 29th 2012.

[6] http://mathematics.blurtit.com/q7475571.html. Last accessed March 29th 2012.

[7] Vidhyasaharan Sethu (November 2009) "Automatic Emotion Recognition: An Investigation of Acoustic and Prosodic Parameters". Available: http://www.google.ie/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCcQFjAA&url=http%3A%2F%2Funsworks.unsw.edu.au%2Ffapi%2Fdatastream%2Funsworks%3A7917%2FSOURCE01&ei=mslzT4jAH86KhQf8ssimBQ&usg=AFQjCNEpvPptHIpAeKusKf86lPbJuDnmMw&sig2=DvVPuVLqIeUCp3pcrS4m1g. Last accessed March 29th 2012.

[8] http://powertutor.org/. Last accessed March 29th 2012.