Aiding categorization by grounding spoken words


(an infant-inspired approach to concept formation and language acquisition)


Aneesh Chauhan and Luís Seabra Lopes


Transverse Activity on Intelligent Robotics (ATRI),
IEETA/DETI, Universidade de Aveiro,
3810-193 Aveiro, Portugal

{aneesh.chauhan, lsl}@ua.pt


Motivations


There is an increasing need to support language-based communication between robots and their users. The main reasons include:


For some applications (e.g. domestic), robots must be really easy to
command and teach;


Typical users will prefer robots that understand their language;


Spoken language is probably the most powerful communication modality.


Symbol grounding


Meanings of symbols (e.g. words) lie in their association with the
entities of the world they refer to.


Language is a cultural product that is acquired socially.


Learning a human language will require the participation of humans as
language instructors.

Motivations (cont.)


An inspiration from early language acquisition in
children


Most of the early vocabulary in children consists of common nouns
(that name objects such as food items, toys etc.).


From a very young age, infants start showing an attentional bias towards categories that have clearly defined shapes.


Naming facilitates concept building and categorization.



Robotic agent

Experimental setup


Features:


Embodied with a camera and a robotic arm.


There is a linear relation between camera and arm coordinates (see the calibration sketch below).


Cognitive functions are carried out on an attached computer.

SG6-UT educational arm from Crust Crawler Robotics
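
Such a linear mapping can be estimated once from a handful of calibration points. A minimal Python sketch, where the pixel and arm coordinates are purely illustrative (not from the paper):

import numpy as np

# Pixel positions of three markers and the matching arm coordinates
# (metres); three non-collinear points determine a 2-D affine map.
pixels = np.array([[120.0, 80.0], [400.0, 90.0], [260.0, 300.0]])
arm_xy = np.array([[0.10, 0.25], [0.32, 0.24], [0.21, 0.05]])

# Solve [u v 1] @ M = [x y] for the 3x2 matrix M by least squares.
P = np.hstack([pixels, np.ones((3, 1))])
M, *_ = np.linalg.lstsq(P, arm_xy, rcond=None)

def pixel_to_arm(u: float, v: float) -> np.ndarray:
    # Map a camera pixel to arm workspace coordinates.
    return np.array([u, v, 1.0]) @ M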


Human-robot interaction



The interaction supports the following actions:

1. Teaching the category name of a selected object;
2. Asking the category name of a selected object;
3. Providing a correction in the case of an incorrect prediction by the robot;
4. Asking the robot to locate an instance of a category;
5. Manipulation requests (independent as well as in combination with all of the previous actions).


The robot's abilities involve:

1. Linguistic response;
2. Visual response; and
3. Manipulation actions.

Feature extraction

Feature extraction from the speech signal


From the raw speech signal, two sets of features are extracted:


the MFC (Mel-Frequency Cepstrum) set; and


the phoneme set.


MFCCs (Mel-Frequency Cepstral Coefficients) provide a good approximation of the response of the human auditory system to a sound stream.


This set is obtained using the wave2feat tool provided with the Sphinx3 [1] speech recognition engine.


To extract the phoneme set, the allphone mode of Sphinx3 (using acoustic and language models from VoxForge [2]) is used. This mode predicts:


the most probable "sequence of phonemes" (the phoneme set) for a given speech signal; and


the elements of the previously calculated MFC set associated with each predicted phoneme.

[1] An HMM (Hidden Markov Model) based toolkit for speech recognition: http://cmusphinx.sourceforge.net
[2] An open-source speech corpus and acoustic model repository: http://www.voxforge.org


Word representation


Once the sound features have been extracted, a word W is represented as:

W = {<ph_1, m_1>, <ph_2, m_2>, ..., <ph_n, m_n>}

where ph_i is the i-th phone in the phoneme feature set and m_i is the subset of the elements of the MFC feature set associated with the i-th phone; i = 1, 2, ..., n, where n is the number of predicted phones.
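
As a concrete illustration, a word can be held as an ordered list of (phone, MFCC-frames) pairs. A minimal Python sketch, where the class names and the 13-coefficient frame size are assumptions, not the system's actual types:

from dataclasses import dataclass
from typing import List

@dataclass
class PhoneSegment:
    # One predicted phone ph_i together with the subset m_i of MFC
    # features that Sphinx3's allphone mode associated with it
    # (commonly 13 coefficients per frame; an assumption here).
    phone: str                       # e.g. "IH", "Z", "SIL"
    mfcc_frames: List[List[float]]   # m_i: one vector per frame

# W = {<ph_1, m_1>, ..., <ph_n, m_n>}: the ordered phone segments.
Word = List[PhoneSegment]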

Phoneme prediction results

[Figure: predicted phoneme sequences for the spoken words CUP, SCISSOR, STAPLER and TRAIN.]

Word similarity evaluation (a greedy search algorithm)

[Figure: distance matrices used by the greedy search. One compares the predicted phoneme sequences of two utterances (e.g. SIL S V IH Z ER UW N SIL DH SIL) in a narrow band around the alignment, with entries d(ph_i, ph_j) and zeros on exact matches; the other shows the full pairwise distances between the symbols S, Y, M, B, O, L of two words W1 and W2, with zeros on the diagonal.]

Word similarity measure


The final similarity measure between two words is therefore calculated as:

S_word(W1, W2) = 1 / min( D(W1, W2), D(W2, W1) )

where D(W1, W2) is the distance measured using the greedy search algorithm.
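
The slides do not spell the greedy search out in full, so the following is a minimal sketch of one plausible reading, continuing the Word/PhoneSegment sketch above: the search walks a narrow band of the phone-by-phone distance matrix and greedily takes the cheapest match at each step. greedy_distance, the window width, and the phone-distance parameter d are assumptions, not the published algorithm.

import math
from typing import Callable

def greedy_distance(w1: Word, w2: Word,
                    d: Callable[[PhoneSegment, PhoneSegment], float],
                    window: int = 1) -> float:
    # Order-preserving greedy alignment cost D(w1, w2).  d is assumed
    # to return 0 for identical phones and, e.g., a distance between
    # their MFCC frames otherwise.  Both words must be non-empty.
    total, j = 0.0, 0
    for seg in w1:
        lo = max(0, j - window)
        hi = min(len(w2), j + window + 1)
        # Inspect only the band of the distance matrix around the
        # current alignment position (as in the slides' matrices).
        cost, j_best = min((d(seg, w2[k]), k) for k in range(lo, hi))
        total += cost
        j = j_best + 1
    return total

def s_word(w1: Word, w2: Word,
           d: Callable[[PhoneSegment, PhoneSegment], float]) -> float:
    # S_word(W1, W2) = 1 / min(D(W1, W2), D(W2, W1)); identical words
    # (distance 0) are mapped to infinite similarity.
    dist = min(greedy_distance(w1, w2, d), greedy_distance(w2, w1, d))
    return math.inf if dist == 0 else 1.0 / dist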


---------------------------------------



Visual feature extraction and categorization of visual concepts


Seabra Lopes L, Chauhan A (2008) Open-ended category learning for language acquisition. Connection Science, 20(4):277-297.


[Diagram: processing pipeline, from feature extraction through category representation to classification.]

Category representation


Representations of a visual category and its name (speech category) are coupled together such that


each instance describing the category is associated with a word that is part of the set describing the speech category.
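
Concretely, the coupling can be held instance-by-instance in one structure. A minimal sketch, continuing the Python sketches above (the layout and names are illustrative, not the system's actual data structures):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CoupledCategory:
    # Each taught instance pairs a visual description (kept abstract
    # here) with the word heard when the object was named, so the
    # visual category and its speech category grow together.
    instances: List[Tuple[object, Word]] = field(default_factory=list)

    def add(self, visual_features: object, word: Word) -> None:
        # Teaching/correction action: store instance and word together.
        self.instances.append((visual_features, word))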

Word clusters and Keywords


Word clusters


For a given speech category, locate the nearest neighbor (using the S_word metric) of each word representation and cluster the representations that are connected to each other through their nearest neighbors.


Keywords


Keywords are the representations with the highest number of neighbors in a given cluster or speech category description.
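
Read as an algorithm: link every word representation of the category to its nearest neighbor under S_word, take the groups connected through these links as the clusters, and pick as keyword the member of each cluster most often chosen as a neighbor. A minimal sketch continuing the code above; the union-find and the tie-breaking are assumptions:

from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def cluster_speech_category(words: List[Word],
                            d: Callable[[PhoneSegment, PhoneSegment], float]
                            ) -> Tuple[List[List[int]], List[int]]:
    # Returns (clusters, keywords) as lists of word indices.  Assumes
    # at least two word representations in the category.
    n = len(words)
    nn = [max((j for j in range(n) if j != i),
              key=lambda j: s_word(words[i], words[j], d))
          for i in range(n)]
    # Connected components of the nearest-neighbor links (union-find).
    parent = list(range(n))
    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in enumerate(nn):
        parent[find(i)] = find(j)
    comps: Dict[int, List[int]] = defaultdict(list)
    for i in range(n):
        comps[find(i)].append(i)
    clusters = list(comps.values())
    # Keyword: the member most often chosen as a nearest neighbor.
    keywords = [max(c, key=nn.count) for c in clusters]
    return clusters, keywords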

[Diagram: a speech category and its word clusters; legend: instance, word.]

Categorization (Asking action)

[Diagram: a queried instance is matched against the coupled visual/speech categories (V1, S1), (V2, S2), (V3, S3) to produce a name; legend: instance, word.]
Learning (Teaching/Correction action)

[Diagram: the named instance and its word are added to the corresponding coupled pair, here (V3, S3), alongside (V1, S1) and (V2, S2); legend: instance, word.]
Learning (Nearest-neighbor clustering)

[Diagram: the word representations of the updated category (V3, S3) are re-clustered using nearest-neighbor clustering.]
Learning (Conceptual organization)

[Diagram: the resulting conceptual organization over the coupled categories (V1, S1) and (V2, S2); legend: instance, word.]
Learning (cont.)

[Diagram: the conceptual organization after further teaching, with coupled categories (V1, S1) through (V4, S4); legend: instance, word.]

System architecture in summary

[Diagram: overall system architecture, including the feature extraction stage.]

Experimental evaluation

_______________________

introduce Class 0;
n = 1;
repeat
{
    introduce Class n;
    k = 0;
    repeat
    {
        evaluate and correct classifiers;
        k++;
    } until ( (average precision > precision threshold and k >= n)
              or (user sees no improvement in precision) );
    n++;
} until (user sees no improvement in precision).
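
For reference, the protocol translates almost line for line into code. A minimal runnable sketch: introduce and evaluate_and_correct are hypothetical stand-ins for the instructor's actions, the precision is averaged over all recorded iterations, and the "user sees no improvement" condition is approximated by an iteration cap (all assumptions).

from typing import Callable, List

def teaching_protocol(introduce: Callable[[int], None],
                      evaluate_and_correct: Callable[[int], bool],
                      precision_threshold: float = 0.667,
                      max_iterations: int = 500) -> List[int]:
    # introduce(n) presents class n; evaluate_and_correct(n) asks the
    # learner to classify an instance, corrects it if wrong, and
    # returns True when the prediction was correct.
    outcomes: List[int] = []           # 1 = correct, 0 = corrected
    introduce(0)
    n = 1
    while len(outcomes) < max_iterations:
        introduce(n)
        k = 0
        while True:
            outcomes.append(1 if evaluate_and_correct(n) else 0)
            k += 1
            precision = sum(outcomes) / len(outcomes)
            if (precision > precision_threshold and k >= n) \
                    or len(outcomes) >= max_iterations:
                break
        n += 1
    return outcomes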

Teaching protocol for performance evaluation and the performance measures

Classification precision

classification precision = (number of correct predictions / total number of predictions) × 100

Average system precision

The average over all the classification precision values for all the question/correction iterations.
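
Both measures are straightforward to compute from the recorded outcomes; a minimal sketch:

from typing import List

def classification_precision(correct: int, total: int) -> float:
    # Classification precision = correct / total predictions, x 100.
    return 100.0 * correct / total

def average_system_precision(precisions: List[float]) -> float:
    # Mean of the classification-precision values recorded over all
    # question/correction iterations.
    return sum(precisions) / len(precisions)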

[Figure: evolution of classification precision versus number of question/correction iterations in the third experiment; classification precision (%) on the y-axis against question/correction iterations on the x-axis, with two marked recovery phases.]

Summary of experiments

Conclusions I


We presented a physically embodied robot with the ability to:


communicate with its users;


learn throughout its life; and


respond intelligently to external stimuli.


The robot's learning system is able to incrementally
ground the names of new objects as taught by a
human user.


A novel feature set, combining phonemes and the associated MFCCs, was developed for speech signal (word) representation.

A novel similarity measure for comparing two word representations was developed using a greedy search algorithm.

Conclusions II


A learning model was developed where naming
aids in visual concept development.

Thank you for your attention