On Visual Recognition


Oct 18, 2013


Computer Vision Group, University of California, Berkeley

On Visual Recognition


Jitendra Malik

UC Berkeley


From Pixels to Perception

[Figure: a tiger scene labeled at several levels: regions (tiger, grass, water, sand), scene attributes (outdoor, wildlife), and parts of the tiger (tail, eye, legs, head, back, shadow, mouth)]


Object Category Recognition


Defining Categories


What is a “visual category”?


Not semantic





Working hypothesis: two instances of the same category must have a "correspondence" (i.e. one can be morphed into the other)


e.g. four-legged animals


Biederman’s estimate of 30,000 basic visual categories



Facts from Biological Vision


Timing


Abstraction/Generalization


Taxonomy and Partonomy




Detection can be very fast


On a task of judging animal vs. no animal, humans can make mostly correct saccades in 150 ms (Kirchner & Thorpe, 2006)



This is comparable to the cumulative synaptic delays along the retina, LGN, V1, V2, V4, IT pathway.


Doesn't rule out feedback, but shows that feed-forward processing alone is very powerful



As Soon as You Know It Is There, You Know What It Is


Grill-Spector & Kanwisher, Psychological Science, 2005



Abstraction/Generalization


Configurations of oriented contours


Considerable tolerance for small deformations




Attneave’s Cat (1954)

Line drawings convey most of the information



Taxonomy and Partonomy


Taxonomy: e.g. cats are in the family Felidae, which in turn is in the class Mammalia


Recognition can be at multiple levels of categorization, or be identification at the level of specific individuals, as in faces.


Partonomy: objects have parts, which have subparts, and so on. The human body contains the head, which in turn contains the eyes.


These notions apply equally well to scenes and to activities.


Psychologists have argued that there is a "basic level" at which categorization is fastest (Eleanor Rosch et al.).


In a partonomy, each level contributes useful information for recognition.




Matching with Exemplars


Use exemplars as templates


Correspond features between query and exemplar


Evaluate similarity score

[Figure: a query image compared against a database of templates]



Best matching template is a helicopter
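As a toy sketch of this exemplar pipeline (the 3-D feature vectors, the similarity function, and the template labels below are invented for illustration; the real system corresponds shape features between query and exemplar):

```python
# Toy exemplar matching: score a query against every stored template and
# return the label of the most similar exemplar.

def similarity(a, b):
    """Placeholder similarity: negative squared Euclidean distance."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def match_exemplars(query, templates):
    """templates: list of (label, feature_vector) exemplars."""
    return max(templates, key=lambda t: similarity(query, t[1]))[0]

templates = [
    ("helicopter", [0.9, 0.1, 0.4]),
    ("airplane",   [0.2, 0.8, 0.5]),
]
print(match_exemplars([0.85, 0.15, 0.45], templates))  # helicopter
```

Any similarity score can be plugged in; the harder part in practice is establishing the feature correspondences that the score is computed over.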


3D objects using multiple 2D views

View-selection algorithm from Belongie, Malik & Puzicha (2001)


Error vs. Number of Views



Three Big Ideas


Correspondence based on local shape/appearance
descriptors


Deformable Template Matching


Machine learning for finding discriminative features







Comparing Pointsets


Shape Context

Count the number of points inside each bin of a log-polar grid, e.g. Count = 4, Count = 10, ...

This gives a compact representation of the distribution of points relative to each point.

(Belongie, Malik & Puzicha, 2001)
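A minimal numpy sketch of one such histogram (the paper uses 5 log-radius by 12 angle bins; the exact bin edges, radius range, and mean-distance normalization below are assumptions for illustration):

```python
import numpy as np

def shape_context(points, index, n_r=5, n_theta=12):
    """Log-polar histogram of the other points, relative to points[index]."""
    p = points[index]
    others = np.delete(points, index, axis=0)
    d = others - p
    r = np.linalg.norm(d, axis=1)
    r = r / r.mean()                      # normalize by mean distance for scale invariance
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    # log-spaced radial bin edges (an assumed range)
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, r) - 1, 0, n_r - 1)
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    hist = np.zeros((n_r, n_theta), dtype=int)
    for rb, tb in zip(r_bin, t_bin):      # each point falls in exactly one bin
        hist[rb, tb] += 1
    return hist

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
h = shape_context(pts, 0)
print(h.sum())  # 4
```

Two shapes are then compared by matching points whose histograms are similar (e.g. via a chi-squared bin distance).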



Geometric Blur (Local Appearance Descriptor)

Pipeline (illustrated on an idealized signal): compute sparse channels from the image, extract a patch in each channel, then apply a spatially varying blur and subsample; the result is the geometric blur descriptor. The descriptor is robust to small affine distortions.

(Berg & Malik, 2001)
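In code, a spatially varying blur of this kind might look like the following sketch (the sampling pattern, the linear blur schedule, and the parameters alpha and beta are illustrative choices, not the paper's exact ones):

```python
import numpy as np

def gaussian_sample(img, x, y, sigma):
    """Gaussian-weighted average of img around (x, y) with std sigma."""
    h, w = img.shape
    rad = max(1, int(3 * sigma))
    ys, xs = np.mgrid[-rad:rad + 1, -rad:rad + 1]
    yy = np.clip(int(round(y)) + ys, 0, h - 1)
    xx = np.clip(int(round(x)) + xs, 0, w - 1)
    wgt = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2))
    return float((img[yy, xx] * wgt).sum() / wgt.sum())

def geometric_blur(img, cx, cy, radii=(2, 4, 8), n_angles=8,
                   alpha=0.5, beta=1.0):
    """Sample img on concentric circles around (cx, cy); the blur grows
    linearly with radius (sigma = alpha * r + beta)."""
    desc = [gaussian_sample(img, cx, cy, beta)]          # center sample
    for r in radii:
        for k in range(n_angles):
            a = 2 * np.pi * k / n_angles
            desc.append(gaussian_sample(img, cx + r * np.cos(a),
                                        cy + r * np.sin(a),
                                        alpha * r + beta))
    return np.array(desc)

img = np.zeros((32, 32)); img[12:20, 12:20] = 1.0        # a toy "edge channel"
d = geometric_blur(img, 16, 16)
print(d.shape)  # (25,)
```

Blurring more heavily far from the center is what buys tolerance to small affine distortions: distant structure moves more under such distortions, so it is summarized more coarsely.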






Modeling shape variation in a category


D'Arcy Thompson, On Growth and Form (1917), studied transformations between shapes of organisms


Matching Example

[Figure: deformable matching from a model shape to a target shape]
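Deformable matching of this kind is often realized with a thin-plate-spline warp (the warping model used in the shape context work); a minimal numpy sketch, assuming exact interpolation with no smoothing term:

```python
import numpy as np

def tps_fit(src, dst):
    """Fit a 2-D thin-plate spline mapping src control points onto dst.
    Uses the kernel U(r) = r^2 log r, written as 0.5 * d2 * log(d2)."""
    n = len(src)
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = np.where(d2 > 0, 0.5 * d2 * np.log(np.where(d2 > 0, d2, 1)), 0)
    P = np.hstack([np.ones((n, 1)), src])            # affine part
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    return src, np.linalg.solve(A, b)                # (n nonlinear + 3 affine) coeffs

def tps_apply(params, pts):
    """Warp arbitrary points with the fitted spline."""
    src, coef = params
    n = len(src)
    d2 = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    U = np.where(d2 > 0, 0.5 * d2 * np.log(np.where(d2 > 0, d2, 1)), 0)
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return U @ coef[:n] + P @ coef[n:]

src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
dst = np.array([[0., 0.], [1.1, 0.], [0., 1.05], [1., 1.2]])
params = tps_fit(src, dst)
warped = tps_apply(params, src)   # control points map exactly onto dst
```

With exact interpolation the warp carries each model control point onto its matched target point, and the kernel interpolates smoothly in between; adding a regularizer trades correspondence accuracy for smoothness.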


Handwritten Digit Recognition (test error rates)

MNIST, 60,000 training examples:
  linear: 12.0%
  40 PCA + quadratic: 3.3%
  1000 RBF + linear: 3.6%
  K-NN: 5%
  K-NN (deskewed): 2.4%
  K-NN (tangent distance): 1.1%
  SVM: 1.1%
  LeNet 5: 0.95%

MNIST, 600,000 (with distortions):
  LeNet 5: 0.8%
  SVM: 0.8%
  Boosted LeNet 4: 0.7%

MNIST, 20,000:
  K-NN, Shape Context matching: 0.63%
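The K-NN entries above all share one skeleton and differ only in the distance function; a toy sketch with a pluggable distance (the 2-D points and Euclidean distance here are stand-ins for MNIST images and the shape context matching cost):

```python
from collections import Counter

def knn_predict(train, query, k=3, dist=None):
    """train: list of (features, label). Returns the majority label of the
    k nearest training examples under dist."""
    if dist is None:
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([0, 0], "0"), ([0, 1], "0"),
         ([9, 9], "1"), ([9, 8], "1"), ([8, 9], "1")]
print(knn_predict(train, [8, 8]))  # 1
```

The table's spread, from 5% for plain K-NN down to 0.63% with shape context matching, comes entirely from swapping in a better `dist`.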


EZ-Gimpy Results

171 of 192 images correctly identified: 92%

[Figure: example EZ-Gimpy images and the recognized words: horse, smile, canvas, spade, join, here]






Discriminative learning (Frome, Singer & Malik, 2006)

Learn weights on patch features in training images; these yield distance functions from training images to any other image, which support browsing, retrieval, and classification.

[Figure: example results, 83/400 and 79/400]


triplets: learn from relative similarity

Given a triplet of images i, j, k, we want image-to-image distances built from feature-to-image distances, and training compares the image-to-image distances within each triplet.
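A toy sketch of learning such a weight vector from triplet constraints by projected subgradient descent on a hinge loss (only the constraint structure, w · x_ijk >= 1 with w >= 0, follows the slides; the synthetic data, step size, and iteration count are invented for illustration):

```python
import numpy as np

# Each row of X plays the role of a triplet difference vector x_ijk:
# (feature-to-image distances to the dissimilar image k) minus
# (feature-to-image distances to the similar image j).
rng = np.random.default_rng(0)
w_true = np.array([2.0, 0.0, 1.0])            # hidden "good" weights
X = rng.normal(size=(200, 3))                 # candidate difference vectors
X = X[X @ w_true > 0.5]                       # keep only satisfiable triplets

w = np.zeros(3)
for _ in range(500):
    margins = X @ w
    grad = w - X[margins < 1].sum(axis=0)     # L2 term + hinge subgradient
    w = np.maximum(w - 0.01 * grad, 0.0)      # gradient step, project onto w >= 0

frac = float((X @ w > 0).mean())              # fraction of triplets ordered correctly
```

After training, most triplets are ordered correctly (dissimilar image farther than similar image), with the positivity projection keeping every learned weight interpretable as a nonnegative feature importance.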


focal image version

[Figure: for a focal image i, each comparison image is represented by a vector of feature-to-image distances; subtracting image j's vector from image k's gives x_ijk, and the corresponding weighted distances are d_ij and d_ik]


large-margin formulation

slack variables, as in the soft-margin SVM

w constrained to be positive

L2 regularization
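Written out, a large-margin objective with these three ingredients is (a reconstruction in soft-margin SVM style; C is the usual trade-off constant):

```latex
\min_{w,\,\xi}\ \tfrac{1}{2}\lVert w\rVert_2^2 \;+\; C\sum_{ijk}\xi_{ijk}
\quad \text{subject to} \quad
w\cdot x_{ijk}\ \ge\ 1-\xi_{ijk},
\qquad \xi_{ijk}\ge 0,
\qquad w\ge 0
```

The slack variables \(\xi_{ijk}\) absorb triplets that cannot be ordered with a full unit margin, and the constraint \(w \ge 0\) keeps the learned combination of elementary distances a valid distance-like score.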


Caltech-101 [Fei-Fei et al. 04]

102 classes, 31-300 images/class














retrieval example

[Figure: a query image and its top retrieval results]



Caltech 101 classification results

(see Manik Varma's talks for the best results yet)


15 training images per class: 63.2% accuracy



Conclusion


Correspondence based on local shape/appearance
descriptors


Deformable Template Matching


Machine learning for finding discriminative features


Integrating Perceptual Organization and Recognition