COMPUTER VISION: SOME CLASSICAL PROBLEMS

ADWAY MITRA

MACHINE LEARNING LABORATORY

COMPUTER SCIENCE AND AUTOMATION

INDIAN INSTITUTE OF SCIENCE

June 24, 2013

WHAT IS COMPUTER VISION and WHY IS IT DIFFICULT?

Computer Vision, obviously, aims to build computers that can see!

In other words, it deals with analyzing/understanding images and videos through computers

The aim of analysis is to find known patterns in images (Detection), or to match images with known patterns (Recognition)

To analyze an image, we first need a representation for it

An image is stored in a computer as a 2- or 3-dimensional matrix, each element of which is a pixel value

A single pixel carries very little, if any, semantic information!
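
As a small illustration of the matrix representation mentioned above, the sketch below stores a grayscale image as a 2-D nested list and a colour image as a 3-D one. The pixel values and the helper name `shape` are made up for this example; real code would normally use an array library.

```python
# A tiny 2x3 grayscale image: a 2-D matrix of intensities (0 = black, 255 = white).
gray = [
    [0, 128, 255],
    [64, 192, 32],
]

# A colour image of the same size: a 3-D matrix, height x width x 3 (R, G, B).
color = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[0, 0, 0], [128, 128, 128], [255, 255, 255]],
]


def shape(matrix):
    """Return the dimensions of a rectangular nested list (hypothetical helper)."""
    dims = []
    while isinstance(matrix, list):
        dims.append(len(matrix))
        matrix = matrix[0]
    return tuple(dims)


print(shape(gray))   # (2, 3)
print(shape(color))  # (2, 3, 3)
print(gray[1][2])    # 32: a single pixel value, which says little on its own
```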

Representation with Features

For most applications of machine learning, the first and foremost step is to find features

Features are used for representation of the data

Features should be such that we can have a metric space for them; usually they are vectors

Very elaborate features (high-dimensional) need to be avoided for computational reasons

[Diagram: Feature Vector (difficult to process) -> Dimensionality Reduction -> Smaller Feature Vector Representation]
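
The slide does not say which dimensionality reduction method is meant. As one possible illustration, the sketch below applies principal component analysis (an assumed choice) to 2-D feature vectors and keeps a single coordinate per point. The function name `pca_2d_to_1d` and the sample points are hypothetical; the closed-form angle only works for the 2-D case.

```python
import math


def pca_2d_to_1d(points):
    """Project 2-D points onto their direction of largest variance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n

    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    direction = (math.cos(theta), math.sin(theta))

    # One coordinate per point instead of two.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected


# Made-up points lying exactly on the line y = 2x.
direction, projected = pca_2d_to_1d([(0, 0), (1, 2), (2, 4)])
print(direction)  # roughly (0.447, 0.894), i.e. along the line y = 2x
print(projected)  # roughly [-2.236, 0.0, 2.236]
```

Because these points lie on a line, the single projected coordinate loses no information; with real features some variance is discarded.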

Features for Computer Vision

Pixel values can serve as features, but are often not very meaningful

Groups of pixels can have more meaning, but how to form such groups?

Groups of pixels/sub-images at a large number of scales and positions

Image gradients/edges

Various filter outputs have also been explored

These are difficult to interpret semantically, but have been found to work well in certain applications

Finding concise, semantically meaningful features is still a major issue in Computer Vision
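
To make the "image gradients/edges" feature concrete, here is a minimal sketch that estimates the gradient magnitude and orientation at each interior pixel with central differences. The function name `gradients` and the toy image are assumptions for illustration; practical systems typically smooth the image first and use an array library.

```python
import math


def gradients(image):
    """Central-difference gradient magnitude and orientation for interior pixels."""
    height, width = len(image), len(image[0])
    magnitude = [[0.0] * width for _ in range(height)]
    orientation = [[0.0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0  # horizontal change
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0  # vertical change
            magnitude[r][c] = math.hypot(gx, gy)
            orientation[r][c] = math.atan2(gy, gx)
    return magnitude, orientation


# A vertical edge: dark on the left, bright on the right.
image = [[0, 0, 100, 100] for _ in range(4)]
mag, ori = gradients(image)
print(mag[1][1], mag[1][2])  # 50.0 50.0
print(ori[1][1])             # 0.0: the intensity increases from left to right
```

Border pixels are left at zero here because they lack a neighbour on one side.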

SIFT Interest Points

A filter is an operator which processes a signal and removes some undesired components

Difference-of-Gaussian filters: a popular filter for images

Positions of local extrema (maxima or minima) of this filter output are the interest points

Some interest points, like those on the edges, are discarded

At each interest point, a feature vector is computed using image gradients and their orientations inside small windows around the interest point

This feature is invariant to orientation and scale of the image

SIFT: Scale-Invariant Feature Transform
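
The difference-of-Gaussians idea can be sketched on a 1-D signal. This is a deliberate simplification: SIFT itself works on 2-D images across many scales, and the parameter values `sigma=1.0` and `ratio=1.6`, like all function names below, are assumptions chosen for illustration.

```python
import math


def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel truncated at 3 sigma."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve with a Gaussian, repeating the border values."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def dog_extrema(signal, sigma=1.0, ratio=1.6):
    """Indices where the difference-of-Gaussians response is a strict local extremum."""
    fine = smooth(signal, sigma)
    coarse = smooth(signal, sigma * ratio)
    dog = [a - b for a, b in zip(fine, coarse)]
    extrema = []
    for i in range(1, len(dog) - 1):
        is_max = dog[i] > dog[i - 1] and dog[i] > dog[i + 1]
        is_min = dog[i] < dog[i - 1] and dog[i] < dog[i + 1]
        if is_max or is_min:
            extrema.append(i)
    return dog, extrema


signal = [0.0] * 10 + [1.0] + [0.0] * 10  # a single bright spot at index 10
dog, extrema = dog_extrema(signal)
print(10 in extrema)  # True: the bright spot gives a local maximum of the response
```

The later steps on the slide (discarding edge responses, building the gradient-orientation descriptor) are not covered by this sketch.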

[Figure: SIFT interest points]

FACE DETECTION - PROBLEM

Given an image, find the faces in it.

Used in many places, like digital cameras and photo-sharing albums, including Facebook

Given a rectangular region in an image, say whether it is a face or not!

Repeat this process for every location and every size of the rectangular region
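
The exhaustive search over locations and sizes can be sketched as a generator of candidate regions. For brevity this sketch uses square windows and a hypothetical function name `sliding_windows`; each yielded region would then be passed to a face/non-face classifier.

```python
def sliding_windows(image_width, image_height, sizes, stride=1):
    """Yield (x, y, size) for every square window that fits inside the image."""
    for size in sizes:
        for y in range(0, image_height - size + 1, stride):
            for x in range(0, image_width - size + 1, stride):
                yield (x, y, size)


windows = list(sliding_windows(6, 4, sizes=[2, 4]))
print(len(windows))  # 18 candidate regions for a 6x4 image
print(windows[0])    # (0, 0, 2)
```

Even this tiny image produces 18 candidates, which hints at why the search is expensive on real images.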

FACE DETECTION - GENERAL APPROACH

Basically a binary classification problem

Requires building a model for faces

Needs training samples, both positive and negative

Positive samples are face images, negative samples are non-face images

A learning algorithm finds the boundary between face and non-face images

[Figure labels: FACE images, NON-FACE images, Candidate]
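
The slides do not name a particular learning algorithm. As a stand-in, the sketch below uses a nearest-centroid classifier, whose decision boundary is the set of points equidistant from the two class means. The 2-D feature vectors are made up; real detectors use much richer features and stronger classifiers.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(faces, non_faces):
    """The 'model' here is just one mean vector per class."""
    return centroid(faces), centroid(non_faces)


def is_face(model, candidate):
    """Classify a candidate by whichever class mean is closer."""
    face_mean, non_face_mean = model
    return (squared_distance(candidate, face_mean)
            < squared_distance(candidate, non_face_mean))


# Made-up 2-D feature vectors, purely for illustration.
faces = [[0.9, 0.8], [0.8, 0.9], [1.0, 1.0]]
non_faces = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
model = train(faces, non_faces)
print(is_face(model, [0.7, 0.9]))  # True
print(is_face(model, [0.2, 0.2]))  # False
```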

FACE DETECTION - BENCHMARK and EVALUATION

Standard face-detection benchmark datasets are available

FDDB: a face detection dataset for unconstrained settings

Performance is usually measured using Precision and Recall

Precision: Of the reported face detections, how many were actually faces?

Recall: Of the faces actually present, how many were detected?

F-score: Harmonic mean of precision and recall
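
The three measures can be computed from counts of correct detections, false alarms and missed faces. The function name and the example counts below are hypothetical.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-score); empty denominators give 0.0."""
    reported = true_positives + false_positives
    present = true_positives + false_negatives
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / present if present else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed faces.
p, r, f = precision_recall_f(8, 2, 4)
print(p)  # 0.8
print(r)  # about 0.667
print(f)  # about 0.727
```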

FACE RECOGNITION - PROBLEM

Consists of a training phase and a testing phase

In the training phase, we are given many face images, each marked with the identity of the person

In the testing phase, we are given a new face image, belonging to one of these persons

The task is to find out the identity of the person

This is a simple Classification problem in Machine Learning

First, suitable features and representations have to be found
FACE RECOGNITION - PROBLEM

One approach is to build a model for each person, using the training images provided for that person

A second approach is to compare the test image to each of the training images, and find the closest match

It may be observed that not every part of a face image helps in recognition: certain things about faces are common to everyone

A good strategy is to find the features that are most distinctive and represent images only by them

Eigenfaces (1991) uses the last two strategies

Recognition accuracy is the obvious evaluation criterion

A good recognition algorithm should work well with a small number of training images
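
The second approach above (compare the test image with every training image and take the closest match) can be sketched as a nearest-neighbour search over feature vectors. The identities and vectors are made up, and the feature extraction step is assumed to have been done already.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognise(training_set, test_vector):
    """Return the identity label of the closest training vector."""
    best_label, _ = min(
        ((label, squared_distance(vector, test_vector))
         for label, vector in training_set),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical identities with made-up feature vectors.
training_set = [
    ("person_a", [0.1, 0.9, 0.3]),
    ("person_a", [0.2, 0.8, 0.3]),
    ("person_b", [0.9, 0.1, 0.7]),
    ("person_b", [0.8, 0.2, 0.6]),
]
print(recognise(training_set, [0.15, 0.85, 0.35]))  # person_a
```

Comparing raw pixels this way is usually fragile, which is one reason for first projecting onto a few distinctive features.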

FACE RECOGNITION - CURRENT STATUS

Face recognition has traditionally been done with well-cropped, focussed face images: a Controlled Environment

In that setting it is considered a solved problem.

Nowadays face recognition is being revisited for semi-controlled or uncontrolled environments.

LFW (Labeled Faces in the Wild): a dataset of face images taken in such settings, and a new benchmark

OBJECT RECOGNITION - PROBLEM

Classification task like face recognition

Practically much more complex

Large number of images given from many object categories

Classify a test image into one of these categories

Problem made very difficult by intra-class variations

OBJECT RECOGNITION - GENERAL APPROACH

Once again the idea is to build models for different objects

No single feature may be enough for classification

Some objects may have a distinctive color, others may have a distinctive shape

Multiple Kernel Learning: a sophisticated machine learning formulation, generally considered the best approach for this problem

Caltech-101: a dataset of 101 object categories

Close to 80% accuracy obtained by Multiple Kernel Learning

Caltech-256: a dataset of 256 object categories; an accuracy of 50% is considered good!

Intra-class variations continue to pose a significant challenge, and even invite scepticism: is it a valid problem at all?
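
The slide does not spell out the Multiple Kernel Learning formulation. In its common form (stated here as background, not as the specific variant behind the quoted accuracies), each feature type, such as color or shape, gets its own kernel $K_m$, and the classifier uses a weighted combination whose non-negative weights $\beta_m$ are learned together with the classifier:

```latex
K(x, x') = \sum_{m=1}^{M} \beta_m \, K_m(x, x'), \qquad \beta_m \ge 0
```

A large $\beta_m$ means the corresponding feature is informative for the categories at hand.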

OBJECT DETECTION

Given an image, find all the birds, trees, and cars in it!

Requires building models for each of these objects

Once again, search the entire image at multiple positions and scales

Part-based models of objects are considered efficient

Instead of modelling the whole object, model different parts separately

Helps to handle occlusion and perhaps intra-class variations

IMAGE SEGMENTATION

Given an image, divide it such that each segment contains an object

Basically a clustering problem

Does not require features and is done purely with pixel values

Has inspired advanced clustering techniques like spectral clustering

Graph-based method: models the image as a graph, with each pixel representing a node and adjacent pixels connected by edges

Each edge is given a weight according to the similarity of the corresponding pixel values

Requires the number of segments to be specified
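
The graph construction described above can be sketched as follows. The Gaussian similarity weight, the value `sigma=10.0`, the 4-neighbour connectivity and the function name `pixel_graph` are all assumptions; the slides only say that weights reflect pixel similarity.

```python
import math


def pixel_graph(image, sigma=10.0):
    """Return weighted edges ((r1, c1), (r2, c2), weight) between 4-connected pixels."""
    height, width = len(image), len(image[0])
    edges = []
    for r in range(height):
        for c in range(width):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                nr, nc = r + dr, c + dc
                if nr < height and nc < width:
                    diff = image[r][c] - image[nr][nc]
                    # Similar pixels get a weight near 1, dissimilar ones near 0.
                    weight = math.exp(-(diff * diff) / (2 * sigma * sigma))
                    edges.append(((r, c), (nr, nc), weight))
    return edges


# A made-up 2x3 image: a dark region on the left, a bright column on the right.
image = [
    [10, 10, 200],
    [10, 10, 200],
]
edges = pixel_graph(image)
print(len(edges))  # 7 edges: 4 horizontal and 3 vertical
print(edges[0])    # ((0, 0), (0, 1), 1.0): identical neighbours
```

A graph-based segmenter would then cut the low-weight edges, here the ones between the dark and bright columns.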

IMAGE SEGMENTATION

Segmentation is evaluated with respect to a gold standard segmentation

Every pair of pixels in the same segment in the gold standard should also be in the same segment in the reported segmentation (and similarly, every pair of pixels in different segments should stay in different segments)
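
This pairwise criterion corresponds to what is commonly called the Rand index: the fraction of pixel pairs on which the two segmentations agree. The sketch below is a direct implementation with made-up labels; it is quadratic in the number of pixels, so it only suits small examples.

```python
from itertools import combinations


def rand_index(gold, predicted):
    """Fraction of pixel pairs on which two labelings agree (same vs. different segment)."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = predicted[i] == predicted[j]
        agree += same_gold == same_pred
        total += 1
    return agree / total if total else 1.0


# Segment label of each pixel, flattened row by row (made-up example).
gold = [0, 0, 1, 1]
predicted = [5, 5, 5, 7]
print(rand_index(gold, predicted))  # 0.5
```

Only the grouping matters, not the label values, so the predicted labels 5 and 7 need not match the gold labels 0 and 1.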

Video Problems

Videos are collections of images taken over an interval of time; successive images are quite similar

Having to handle several images rather than one may make video problems tougher

But the temporal continuity of videos provides a way out

Joint modelling of multiple similar images can, in fact, give better performance than modelling a single image

For video tasks, additional motion-based features like optical flow can be used

The concept of interest points for images is extended to Space-Time Interest Points for videos

Face Recognition, Face Detection, etc. can also be done in videos, often more effectively than in images

OBJECT TRACKING - PROBLEM

Given a video which shows a person/object moving

Need to find it in each frame

Naive approach: reduce it to an object detection problem

If the object is at position (x, y) in frame t, it will be very close to that position in frame (t + 1)

So if we know the position at time t, we need to search only around that same position

Reduces the search space greatly!

The main idea is to build an appearance model for the object

The appearance may change over time due to variations in size, illumination, viewpoint, etc.

The appearance model must be adaptive, and recomputed throughout the video
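
The local-search idea can be sketched with a fixed template as the appearance model and a sum of squared differences as the matching score. Both choices, the search `radius`, and the function names are assumptions for illustration; an adaptive tracker would also update the template from the matched patch.

```python
def patch_distance(frame, template, top, left):
    """Sum of squared differences between the template and a frame patch."""
    return sum(
        (frame[top + r][left + c] - template[r][c]) ** 2
        for r in range(len(template))
        for c in range(len(template[0]))
    )


def track_step(frame, template, prev_top, prev_left, radius=2):
    """Search only within `radius` pixels of the previous position."""
    t_h, t_w = len(template), len(template[0])
    f_h, f_w = len(frame), len(frame[0])
    best = None
    top_range = range(max(0, prev_top - radius),
                      min(f_h - t_h, prev_top + radius) + 1)
    left_range = range(max(0, prev_left - radius),
                       min(f_w - t_w, prev_left + radius) + 1)
    for top in top_range:
        for left in left_range:
            score = patch_distance(frame, template, top, left)
            if best is None or score < best[0]:
                best = (score, top, left)
    return best[1], best[2]


# A bright 2x2 object on a dark 6x6 frame; it was at (1, 1) and has moved to (2, 3).
template = [[9, 9], [9, 9]]
frame = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (3, 4):
        frame[r][c] = 9
print(track_step(frame, template, prev_top=1, prev_left=1))  # (2, 3)
```

Here only 16 positions are scored instead of all 25 that the template could occupy; on real frames the saving is far larger.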

OBJECT TRACKING - BENCHMARK and EVALUATION

Performance is measured with respect to a gold standard, where a bounding box is provided in each frame

The score is the proportion of overlap between the areas of the gold-standard and reported bounding boxes
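
One common way to turn "proportion of overlapping areas" into a number is intersection over union; the slides do not say which exact ratio is meant, so treat this as an assumed definition. Boxes are given as (left, top, right, bottom), and the function name is hypothetical.

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes given as (left, top, right, bottom)."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


gold = (0, 0, 10, 10)
reported = (5, 0, 15, 10)
print(overlap_ratio(gold, reported))  # about 0.333
```

A perfect match scores 1.0 and disjoint boxes score 0.0, so per-frame scores can be averaged over the video.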

OBJECT TRACKING - CURRENT STATUS

Considered a solved problem under controlled illumination and background

Current research aims to handle occlusion of the object, and sudden changes in background and illumination

Tracking multiple objects at the same time is another important problem

Tracking is a real-time application. Efforts are on to process as many frames as possible per second

To adapt or not to adapt remains the fundamental problem in vision.

A single miss can make the whole tracking go wrong.

Detecting and correcting a miss is an important problem to solve

ACTION RECOGNITION IN VIDEOS

Surveillance cameras are nowadays available at many sensitive public locations

The aim is to record activities of people

Requires use of dynamic features, which make use of the motion in videos

Some image-based features can be extended to videos, like space-time interest points

These can be used by viewing the video as a space-time volume

The features can also be in the form of time-series

ACTION RECOGNITION IN VIDEOS

In the presence of a benign background, a static camera and a single actor, the problem is considered solved

Current research aims to handle complex environments, like crowded places, where people frequently get occluded

Multi-person interaction recognition is another recent offshoot of the problem