Learning Models for Object Recognition from Natural Language Descriptions





Presenters:
Sagardeep Mahapatra (108771077)
Keerti Korrapati (108694316)

Goal



Learning models for visual object recognition from natural language descriptions alone


Why learn models from natural language?

- Manually collecting and labeling large image sets is difficult
- A new training set needs to be created for each new category
- Finding images for fine-grained object categories is tough (e.g., species of plants and animals)
- But detailed visual descriptions may be readily available

Outline




- Datasets for training and testing
- Natural Language Processing methods
  - Template Filling
- Extraction of visual attributes from test images
- Scoring an image against the learnt template models
- Results
- Observations

Dataset



- Text descriptions associated with ten species of butterflies, taken from the eNature guide, are used to construct the template models
  - Butterflies were chosen because they have distinctive visual features such as wing colors, spots, etc.
- Images downloaded from Google for each of the ten butterfly categories form the testing set





[Figure: example images of the ten species: Danaus plexippus, Heliconius charitonius, Heliconius erato, Junonia coenia, Lycaena phlaeas, Nymphalis antiopa, Papilio cresphontes, Pieris rapae, Vanessa atalanta, Vanessa cardui]

Natural Language Processing



Goal: Convert unstructured data in descriptions into structured templates



[Diagram: factual but unstructured data in text -> Information Extraction -> filled template slots]

Template Filling






- Text is tokenized into words
- Tokens are tagged with parts of speech (using the C&C tagger)
- Custom transformations are performed to correct known mistakes
  - Required because the eNature guide tends to suppress some information
- Chunks of text matching pre-defined tag sequences are extracted
  - Ex: noun phrases ('wings have blue spots'), adjective phrases ('wings are black')
- Extracted phrases are filtered through a list of colors, patterns and positions to fill the template slots (a code sketch follows the pipeline below)

Pipeline: Tokenization -> Part-of-Speech Tagging -> Custom Transformation -> Chunking -> Template Filling
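The slides show the pipeline only as a diagram. Below is a minimal runnable sketch of the same stages, assuming NLTK as a stand-in for the C&C tagger; the color/pattern/position word lists and the chunk grammar are illustrative, not the authors' actual vocabularies.

```python
# Minimal sketch of the template-filling pipeline. NLTK stands in for the
# C&C tagger used in the paper; the word lists below are illustrative.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

COLORS = {"black", "blue", "orange", "white", "red", "brown", "yellow"}
PATTERNS = {"spot", "spots", "band", "bands", "stripe", "stripes"}
POSITIONS = {"wing", "wings", "forewing", "hindwing", "margin"}

# Chunk grammar: the predicate rule (AP, 'wings are black') runs before the
# attributive rule (NP, 'blue spots') so the former is not split apart.
GRAMMAR = r"""
  AP: {<NN.*>+<VB.*><JJ>+}
  NP: {<JJ>+<NN.*>+}
"""

def fill_template(description):
    tokens = nltk.word_tokenize(description)         # 1) tokenization
    tagged = nltk.pos_tag(tokens)                    # 2) part-of-speech tagging
    # (3) custom transformations to fix known tagging mistakes would go here
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)  # 4) chunking
    template = {"dominant_color": None, "spot_colors": []}
    for chunk in tree.subtrees(lambda t: t.label() in ("AP", "NP")):
        words = [w.lower() for w, _ in chunk.leaves()]
        colors = [w for w in words if w in COLORS]   # 5) filter via word lists
        if colors and any(w in PATTERNS for w in words):
            template["spot_colors"].extend(colors)
        elif colors and any(w in POSITIONS for w in words):
            template["dominant_color"] = colors[0]
    return template

print(fill_template("Wings are black with blue spots near the margin."))
# -> {'dominant_color': 'black', 'spot_colors': ['blue']}
```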

Visual Processing

Performed based on two attributes of butterflies:
- Dominant Wing Color
- Colored Spots

1) Image Segmentation
- Variation in the background can pose challenges during image classification
- Hence, the butterfly was segmented from the background using the 'star shape' graph-cut approach (a stand-in sketch follows)
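The 'star shape' graph cut is not available as a stock routine; as a hedged stand-in that is also graph-cut based, this sketch seeds OpenCV's GrabCut with a central rectangle, assuming the butterfly is roughly centered in the image.

```python
# GrabCut (graph-cut based) as a stand-in for the 'star shape' graph cut
# used in the paper; the central seed rectangle assumes the butterfly is
# roughly centered.
import cv2
import numpy as np

def segment_foreground(image_path):
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    mask = np.zeros((h, w), np.uint8)
    rect = (w // 8, h // 8, 3 * w // 4, 3 * h // 4)  # (x, y, width, height)
    bgd = np.zeros((1, 65), np.float64)  # internal GMM state for background
    fgd = np.zeros((1, 65), np.float64)  # internal GMM state for foreground
    cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
    return img * fg[:, :, None]  # background pixels zeroed out
```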

2) Spot Detection (using a spot classifier)
- Hand-marked butterfly images with no prior class information form the training set for the spot classifier
- Candidate regions likely to be spots are extracted using the Difference-of-Gaussians interest point operator
- Image descriptors (SIFT features) are extracted around each candidate spot to classify it as a spot or non-spot (see the sketch below)
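A sketch of this step, assuming OpenCV's SIFT (whose keypoint detector is a Difference-of-Gaussians operator) and an SVM for the spot/non-spot decision; the slides do not name the classifier, so the SVM is an assumption.

```python
# DoG interest points + SIFT descriptors feeding a binary spot/non-spot
# classifier. OpenCV's SIFT detects keypoints with a Difference-of-Gaussians
# pyramid; the SVM is an assumption (the slides only say "a spot classifier").
import cv2
from sklearn.svm import SVC

def candidate_descriptors(image_path):
    """DoG keypoints and 128-D SIFT descriptors for candidate spot regions."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors  # descriptors: (n_keypoints, 128)

# Training uses descriptors from the hand-marked images with labels
# 1 = spot, 0 = non-spot (X_train, y_train come from those annotations):
#   clf = SVC(kernel="rbf").fit(X_train, y_train)
# At test time, every candidate region in a new image is classified:
#   kps, descs = candidate_descriptors("butterfly.jpg")
#   is_spot = clf.predict(descs)
```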


3) Color Modelling
- Required to connect the color names of dominant wing colors and spot colors in the learnt templates to image observations
- For each color name c_i, a probability distribution p(z | c_i) was learnt from training butterfly images, where z is a pixel color observation in the L*a*b* color space (a sketch follows)
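A minimal sketch of the color model, assuming a single Gaussian per color name in L*a*b* space; the slides do not state the exact form of p(z | c_i), and the class name and synthetic samples below are illustrative.

```python
# One distribution p(z | c_i) per color name, fitted to L*a*b* pixel samples
# from training images. A single Gaussian per name is an assumption; the
# slides do not state the density's exact form.
import numpy as np
from scipy.stats import multivariate_normal

class ColorModel:
    def __init__(self):
        self.dists = {}  # color name -> fitted Gaussian over (L, a, b)

    def fit(self, name, lab_pixels):
        """lab_pixels: (n, 3) array of L*a*b* samples labeled with `name`."""
        mean = lab_pixels.mean(axis=0)
        cov = np.cov(lab_pixels, rowvar=False)
        self.dists[name] = multivariate_normal(mean, cov)

    def likelihood(self, name, z):
        """p(z | c_i) for a single pixel observation z = (L, a, b)."""
        return self.dists[name].pdf(z)

# Toy usage with synthetic 'orange' samples:
rng = np.random.default_rng(0)
model = ColorModel()
model.fit("orange", rng.normal([60.0, 40.0, 60.0], 5.0, size=(500, 3)))
print(model.likelihood("orange", [60.0, 40.0, 60.0]))
```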

Generative Model

- Given an input image I, the probability of the image under a butterfly category B_i is computed as a product over the spot and wing observations (the equation image is reconstructed below)
- The model uses a spot color name prior (equal priors for all spot colors) and a dominant color name prior
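The equation itself did not survive extraction; only its callout labels remain. A plausible factorization consistent with those labels and the learnt likelihoods p(z | c) might look like the sketch below, though the paper's exact formula may differ.

```latex
% Plausible form (assumption): the wing term marginalizes the dominant color
% name under its prior, and each detected spot s marginalizes its spot color
% name under an equal prior.
p(I \mid B_i) =
  \underbrace{\sum_{c} p(c \mid B_i) \prod_{z \in \mathrm{wing}} p(z \mid c)}_{\text{dominant wing color term}}
  \times
  \prod_{s \in \mathrm{spots}}
  \underbrace{\sum_{c'} p(c' \mid \mathrm{spot})\, p(z_s \mid c')}_{\text{spot term (equal priors over } c'\text{)}}
```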

Experimental Results



- Two sets of experiments were performed:
  - Performance of human beings in recognizing butterflies from textual descriptions (this may reasonably be considered an upper bound)
  - Performance of the proposed method

[Charts: human performance and performance of the proposed method]

Observations






- Accuracy of the proposed method was comparable to the accuracy of non-native English speakers
- Accuracy of the proposed method was more than 80 percent for four categories
- Classification of 'Heliconius charitonius' was the toughest for humans, and also with both the ground-truth and the learnt templates
- Performance with ground-truth templates was comparable to that with the learnt templates
  - Errors in templates due to the NLP methods did not have much impact





Thank You