Representing Videos using Mid-level Discriminative Patches


CVPR2013 Poster



Outline


Introduction


Mining Discriminative Patches


Analyzing Videos


Experimental Evaluation & Conclusion

1. Introduction


Q.1: What does it mean to understand this video?

Q.2: How might we achieve such an understanding?

1. Introduction


Two broad ways to represent a video:

(1) Encode the whole video as a single feature vector.

(2) Decompose the video into its constituent "bits and pieces": semantic actions and objects.

General framework: detect objects and primitive actions, then relate them using Bayesian networks or storyline models.

1. Introduction







Drawback: computational models for identifying semantic entities are not robust enough to serve as a basis for video analysis.

1. Introduction


Represent the video not with a global feature vector or a set of semantic entities, but with discriminative spatio-temporal patches.

Each patch may correspond to a primitive human action, a semantic object, a human-object pair, or simply a random but informative patch.

The patches are mined automatically from training data consisting of hundreds of videos.

1. Introduction


Spatio-temporal patches:

(1) act as a discriminative vocabulary for action classification;

(2) establish strong correspondence between patches in training and test videos.

Using label-transfer techniques, we can align the videos and perform tasks such as object localization and finer-level action detection.


2. Mining Discriminative Patches


Two conditions a discriminative patch must satisfy:

(1) It occurs frequently within a class.

(2) It is distinct from patches in other classes.

Challenges:

(1) The space of potential spatio-temporal patches is extremely large, given that these patches can occur over a range of scales.

(2) The overwhelming majority of video patches are uninteresting.

2. Mining Discriminative Patches


Paradigm: bag-of-words (a sketch of this baseline follows below)

Step 1: Sample a few thousand patches and perform k-means clustering to find representative clusters.

Step 2: Rank these clusters based on membership in different action classes.

Major drawbacks:

(1) High-dimensional distance metric

(2) Partitioning
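A minimal sketch of this bag-of-words baseline, assuming patch descriptors and their per-patch action labels have already been extracted into NumPy arrays; the cluster count and the purity-based ranking are illustrative choices, not the paper's exact settings.

```python
# Bag-of-words baseline (sketch): cluster sampled patch descriptors with
# k-means, then rank the clusters by how class-pure their members are.
# `descriptors` is (num_patches, dim); `labels` holds the action class of
# each patch. The constants here are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def rank_clusters(descriptors, labels, n_clusters=200, seed=0):
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(descriptors)
    scores = []
    for c in range(n_clusters):
        member_labels = labels[km.labels_ == c]
        if member_labels.size == 0:
            scores.append(0.0)
            continue
        # Cluster purity: fraction of members from the dominant class.
        _, counts = np.unique(member_labels, return_counts=True)
        scores.append(counts.max() / member_labels.size)
    order = np.argsort(scores)[::-1]  # most class-pure clusters first
    return km, order
```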

2. Mining Discriminative Patches











(1) High-dimensional distance metric

K-means uses a standard distance metric (e.g., Euclidean distance or normalized cross-correlation), which does not work well in high-dimensional spaces.

We use HOG3D features.

2. Mining Discriminative Patches






(2) Partitioning

Standard clustering algorithms partition the entire feature space: every data point is assigned to one of the clusters during the clustering procedure. However, in many cases, assigning cluster memberships to rare background patches is hard. Due to the forced clustering, they significantly diminish the purity of the good clusters to which they are assigned.
2. Mining Discriminative Patches


Resolving these issues:

Use an Exemplar-SVM (e-SVM) to learn each cluster.

1. Use an exemplar-based clustering approach.

2. Every patch is considered as a possible cluster center.

Drawback: this is computationally infeasible.

Resolution: use motion to restrict the candidate patches (a sketch follows below), and use a nearest-neighbor approach to form clusters.
2. Mining Discriminative Patches







The training videos are split into two partitions.

Validation partition: rank the clusters based on representativeness.

Training partition (forming clusters):

(i) Use a simple nearest-neighbor approach (typically k = 20).

(ii) Score each patch and rank the patches.

(iii) Select a few patches per action class and use the e-SVM to learn a detector for each.

(iv) The e-SVMs are used to form clusters.

(v) Re-rank the clusters.
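A compact sketch of this mining loop under a few assumptions: descriptors and per-patch action labels are pre-computed, the nearest-neighbor score is the fraction of same-class neighbors, and the exemplar-SVM is approximated by a linear SVM with a single positive. The function names and parameters here are illustrative, not taken from the paper.

```python
# Exemplar-based mining (sketch): score each candidate patch by its k
# nearest neighbours, keep the top-ranked candidates per action class,
# then train one exemplar-SVM per kept patch.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

def knn_cluster_score(descriptors, labels, k=20):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(descriptors)
    _, idx = nn.kneighbors(descriptors)      # idx[:, 0] is the patch itself
    neighbour_labels = labels[idx[:, 1:]]
    # Score: fraction of neighbours that share the patch's own class.
    return (neighbour_labels == labels[:, None]).mean(axis=1)

def train_esvm(descriptors, labels, exemplar_idx):
    # One positive (the exemplar) against negatives from other classes.
    y = np.zeros(len(descriptors), dtype=int)
    y[exemplar_idx] = 1
    keep = (labels != labels[exemplar_idx]).copy()
    keep[exemplar_idx] = True
    return LinearSVC(C=0.1).fit(descriptors[keep], y[keep])

def mine_patches(descriptors, labels, per_class=10, k=20):
    scores = knn_cluster_score(descriptors, labels, k)
    detectors = []
    for c in np.unique(labels):
        cls_idx = np.where(labels == c)[0]
        top = cls_idx[np.argsort(scores[cls_idx])[::-1][:per_class]]
        detectors += [train_esvm(descriptors, labels, i) for i in top]
    return detectors
```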

2. Mining Discriminative Patches


Goal: a smaller dictionary (a set of representative patches).

Criteria:

(a) Appearance consistency: a consistency score.

(b) Purity: a tf-idf-style score based on how often a patch fires on the same class versus different classes.

All patches are ranked using a linear combination of the two scores.
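A sketch of such a ranking, assuming the appearance-consistency score is already computed per detector and purity is approximated by a tf-idf-like same-class firing ratio; the mixing weight `alpha` and the exact idf form are illustrative assumptions.

```python
# Ranking detectors (sketch): combine an appearance-consistency score with a
# tf-idf-style purity score and keep the top-ranked detectors as the final
# dictionary. `firings` is a (num_detectors, num_classes) matrix counting
# how often each e-SVM fires on videos of each class; `detector_class`
# holds the class each detector was mined from.
import numpy as np

def rank_detectors(firings, detector_class, consistency, alpha=0.5):
    total = firings.sum(axis=1) + 1e-8
    same_class = firings[np.arange(len(firings)), detector_class]
    # tf-like term: fraction of firings on the detector's own class;
    # idf-like term: penalize detectors that fire on many classes.
    purity = (same_class / total) * np.log(
        firings.shape[1] / (1.0 + (firings > 0).sum(axis=1)))
    score = alpha * consistency + (1 - alpha) * purity
    return np.argsort(score)[::-1]  # best detectors first
```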


3. Analyzing Videos


Action Classification (see the sketch below):

The top n e-SVM detectors are run on the input test video; their detection scores form a feature vector, and an SVM classifier outputs the action class.

Beyond Classification: Explanation via Discriminative Patches

Q. How can we use detections of discriminative patches to establish correspondences between training and test videos?

Q. Which detections should be selected for establishing correspondence?
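A minimal sketch of this classification pipeline, assuming the mined e-SVM detectors expose a decision_function and that each detector's responses over a video are max-pooled into the feature vector; the pooling choice is an assumption, not necessarily what the paper uses.

```python
# Action classification (sketch): the max response of each e-SVM detector
# over all patches of a video becomes one dimension of the video's feature
# vector, and a multi-class linear SVM predicts the action class.
import numpy as np
from sklearn.svm import LinearSVC

def video_feature(patch_descriptors, detectors):
    # patch_descriptors: (num_patches, D) descriptors from one video.
    return np.array([det.decision_function(patch_descriptors).max()
                     for det in detectors])

def train_action_classifier(train_videos, train_labels, detectors):
    X = np.stack([video_feature(v, detectors) for v in train_videos])
    return LinearSVC().fit(X, train_labels)

def classify(video, detectors, clf):
    return clf.predict(video_feature(video, detectors)[None, :])[0]
```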

3. Analyzing Videos

Context-dependent Patch Selection

Vocabulary size: N

Candidate detections: {D_1, D_2, ..., D_N}

x_i: indicates whether or not the detection of e-SVM i is selected.

Appearance term (A_i): the e-SVM score for patch i.

Class consistency term (C_li): promotes the selection of certain e-SVMs over others given the action class l. For example, for the weightlifting class it prefers selecting patches showing a man and a bar with vertical motion. We learn C_l from the training data by counting the number of times an e-SVM fires for each class.
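A small sketch of learning C_l by counting, assuming the counts are normalized per class into firing frequencies (the normalization is an assumption made here, not a stated detail).

```python
# Class-consistency term (sketch): count how often each e-SVM fires on
# training videos of each class and normalize per class, so C[l, i] is
# high when detector i fires frequently for class l.
import numpy as np

def learn_class_consistency(firing_records, num_classes, num_detectors):
    """firing_records: iterable of (class_label, detector_index) firings."""
    C = np.zeros((num_classes, num_detectors))
    for label, det in firing_records:
        C[label, det] += 1
    return C / (C.sum(axis=1, keepdims=True) + 1e-8)
```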

3. Analyzing Videos






Optimization

The resulting integer program is NP-hard; we use the IPFP algorithm, which converges in 5-10 iterations.

Penalty term (P_ij): the penalty for selecting a pair of detections together. It is high when

(1) e-SVMs i and j do not fire frequently together in the training data, or

(2) e-SVMs i and j are trained from different action classes.
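Putting the three terms together, the patch-selection problem takes the form of a quadratic integer program over the binary selection variables. The formulation below is reconstructed from the terms defined above, so the exact weighting in the paper may differ.

```latex
% Reconstructed from the A_i, C_{li}, and P_{ij} terms defined above;
% the paper's exact weighting may differ.
\max_{x \in \{0,1\}^N} \;\; \sum_{i=1}^{N} x_i \left( A_i + C_{l i} \right)
  \;-\; \sum_{i=1}^{N} \sum_{j=1}^{N} x_i \, x_j \, P_{i j}
```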

4. Experimental Evaluation


Datasets: UCF-50, Olympic Sports

Implementation Details:

Our current implementation considers only cuboid patches.

Patches are represented with HOG3D features (4x4x5 cells with 20 discrete orientations).
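Assuming one 20-bin orientation histogram per cell (an assumption about the HOG3D layout rather than a stated detail), the per-patch descriptor size works out to 4 x 4 x 5 cells x 20 orientations = 1600 dimensions.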

Classification Results

Correspondence and Label Transfer


Conclusion

1. A new representation for videos.

2. Automatically mine these patches using an exemplar-based clustering approach.

3. Obtain strong correspondences and align the videos for transferring annotations.

4. Use the patches as a vocabulary to achieve state-of-the-art results for action classification.