Recognizing Human Actions by Attributes

Nov 18, 2013

Daniel Arnett

CAP 4453


Recognizing Human Actions by Attributes

Jingen Liu, Benjamin Kuipers, Silvio Savarese

Dept. of Electrical Engineering and Computer Science

University of Michigan

{liujg, kuipers, silvio}@umich.edu


In this paper the authors explore the idea that attributes can be used to represent human actions from videos, and argue that attributes will improve action recognition over current techniques. They created a framework that uses manually specified attributes that are: i) selected in a discriminative fashion so as to account for intra-class variability; ii) coherently integrated with data-driven attributes to make the attribute set more descriptive. Data-driven attributes are automatically inferred from the training data using an information-theoretic approach. Their framework is built upon a latent SVM formulation where latent variables capture the degree of importance of each attribute for each action class. They also show that their attribute-based action representation can be effectively used to design a recognition procedure for classifying novel action classes for which no training samples are available (zero-shot learning). They test their approach on several publicly available datasets and obtain results that validate their theories.



Experiments and Results


Datasets:

UIUC action dataset:
- 532 videos
- 14 action classes
- 22 manually specified action attributes

MIXED-action dataset: combination of the
- KTH dataset (6 action classes in 2300 videos)
- Weizmann dataset (10 action classes in 100 videos)
- UIUC dataset (14 action classes in 532 videos)




Zero Shot Learning Experimental Results


The first set of experiments was conducted to validate the claim that the first contribution, the concept of zero-shot learning, could be applied to video recognition to recognize novel action classes (classes with no training examples) using only human-specified attributes.


Experiment 1:

They used the leave-two-classes-out cross-validation strategy in experiments on the UIUC dataset. First, the manually specified attributes and actions were listed in the following table.

Specifically, for each run they left two classes out as novel classes (|Z| = 2). So, for instance, in the first iteration all of the videos for hand-clap and crawl would be tested, while all of the other videos would be used for training. The videos of those 2 classes would have no training examples, but the action attributes for each specific action were known, as seen in Figure 1.

All 91 possible configurations of training and testing classes were used. Figure 3 shows the average accuracy of each action over all runs. The majority of classes were recognized with a success rate of over 70%, and 8 of the classes approached 90%.
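As a rough sketch (not from the paper), the leave-two-classes-out protocol above can be enumerated like this; the class names are placeholders standing in for the 14 UIUC actions:

```python
from itertools import combinations

# Hypothetical stand-ins for the 14 UIUC action classes.
classes = [f"action_{i}" for i in range(14)]

# Leave-two-classes-out: every pair of classes becomes the novel (test) set,
# and the remaining 12 classes are used for training.
splits = [(pair, [c for c in classes if c not in pair])
          for pair in combinations(classes, 2)]

print(len(splits))  # C(14, 2) = 91 configurations, matching the paper
```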







Experiment 2:

They used the same UIUC dataset, but this time left 4 classes out for testing, and 10 for training. They listed the average accuracy comparing their zero-shot learning to an alternative algorithm that used one-shot learning (which occurs when each testing class has a single training example). The alternative algorithm used KNN as a classifier.


In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space.
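A minimal k-NN classifier can be sketched in a few lines (a toy illustration, not the paper's implementation; the 2-D points and labels are made up):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda fv_lbl: dist(fv_lbl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D example with two hypothetical action labels.
train = [((0, 0), "walk"), ((1, 0), "walk"), ((5, 5), "run"), ((6, 5), "run")]
print(knn_predict(train, (0.5, 0.2)))  # -> walk
```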


The results are shown below:




For 6 out of the 8 cases, zero-shot learning performed better.




Experiment 3:

Used the MIXED dataset.

Testing classes: all 6 action classes from the KTH dataset
Training classes: action classes from the UIUC and Weizmann datasets that were not included in the testing classes.

The classification results are shown in Fig. 5. For comparison, experiments were conducted with 1, 10 and 20 training examples from each action class. Again, a KNN classifier is used for classification, and actions are represented without using attributes.





The zero-shot "0-shot" approach generally performs much better than "1-shot" and "10-shot", and is competitive with "20-shot". Fig. 5 also shows some results with bag-of-words approaches on KTH. The zero-shot learning results compare well to these approaches, and although the zero-shot overall performance is slightly weaker, its advantage is that the zero-shot results come without any training examples in the six action classes that were used for testing, whereas alternative methods like the bag-of-words approach typically use more than 90% of the data for training.



Results of experiments where attributes boost traditional action recognition:

Create MIXED dataset from KTH, Weizmann, and UIUC datasets:
- 21 actions
- 2910 total videos




Training the attribute classifiers:

40% of the 2910 total videos were used to train the attribute classifiers. So, for each of the 34 attributes defined above, each of the 1164 videos would be labeled as a positive example if it contained that attribute and as a negative example otherwise. For example, looking at the first row in the above table, any of the 1164 videos labeled as containing the actions "bend", "raise-1-hand", "skip", or "wave1" would be labeled as positive examples of the attribute "one-arm motion", and any that did not would be labeled as negative examples.
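The labeling rule above can be sketched directly; the video IDs are made up, and the "one-arm motion" row follows the example given in the text:

```python
# Mapping from one attribute to the action classes that possess it,
# following the "one-arm motion" row described above.
attribute_to_actions = {
    "one-arm motion": {"bend", "raise-1-hand", "skip", "wave1"},
}

# Toy stand-ins for videos: (video_id, action_label) pairs.
videos = [("v1", "bend"), ("v2", "walk"), ("v3", "wave1"), ("v4", "jump")]

def label_for_attribute(videos, attribute):
    """A video is a positive example of an attribute iff its action class has it."""
    positives = attribute_to_actions[attribute]
    return [(vid, 1 if action in positives else 0) for vid, action in videos]

print(label_for_attribute(videos, "one-arm motion"))
# [('v1', 1), ('v2', 0), ('v3', 1), ('v4', 0)]
```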

In the above example (Figure 2), the mapping of the attributes was done manually, so it was manually specified that the action "bend" contains the attributes "one-arm motion", "arm-shape straight", and "torso bend motion".

Binary action classifiers were trained using the raw video alone, manually specified attributes alone, the raw video plus manually specified attributes, data-driven attributes only, and the raw video with all attributes. The results below show the performance comparison between the different combinations of feature types.



Each row corresponds to a diagonal entry of a 21 by 21 confusion table (a confusion table is a specific table layout that allows visualization of the performance of an algorithm). Using raw features alone, their system obtained 53.1% average recognition accuracy, while with human-specified attributes alone, the average accuracy increased to 72.1%. Clearly, correctly specified attributes help traditional recognition significantly. This can be seen especially well for the action classes that do not have enough training examples (e.g., "bend", "jack", and "wave1"). This is because the attributes transfer knowledge from other classes to compensate for fewer available training examples. Combining both raw features and human-specified attributes, the performance improved to 78.1%. By adding data-driven attributes to the human-specified attributes, the performance could be further improved by about 4.5%. It was concluded that the data-driven attributes provide cues that are complementary to the human-specified attributes.


They further demonstrated the correlation between manually specified attributes and data-driven attributes by showing a dissimilarity map in Fig. 8, where colder colors indicate less dissimilarity, namely stronger correlation. This map is constructed from the training data (i.e., the action-to-data-driven-attribute matrix and the action-to-specified-attribute matrix). The dissimilarity between two attributes is computed as the Euclidean distance between their corresponding column vectors. From this map, you can see that some specified attributes (e.g., the human-specified attribute set ā = {1, 8, 9, 10, 11}, columns of Fig. 8 (a)) are more correlated with data-driven attributes. The effect of this correlation can be seen in the following experiments.
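The dissimilarity computation described above is just a column-wise Euclidean distance; as a toy sketch (the tiny attribute matrices here are made up, not the paper's data):

```python
# Each attribute is a column over the action classes; the dissimilarity between
# two attributes is the Euclidean distance between their columns.
specified = {            # specified attribute -> column over 3 toy action classes
    "a1": [1, 0, 1],
    "a2": [0, 1, 1],
}
data_driven = {
    "d1": [1, 0, 1],
    "d2": [0, 0, 0],
}

def dissimilarity(col_a, col_b):
    return sum((x - y) ** 2 for x, y in zip(col_a, col_b)) ** 0.5

dmap = {(s, d): dissimilarity(sc, dc)
        for s, sc in specified.items() for d, dc in data_driven.items()}
print(dmap[("a1", "d1")])  # 0.0 -> identical columns, strongest correlation
```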


As Fig. 8 (b) shows, for recognition using manually specified attributes only, removing ā decreases the performance from 72% (i.e., "specified attribute B" in (b)) to 64%. However, for recognition using both manually specified and data-driven attributes, removing ā doesn't cause an obvious performance decrease (i.e., "Mixed Attributes B" vs. "Mixed Attributes A" in (b)). This shows that data-driven attributes can make up for the information loss caused by removing some human-specified attributes.






Similar experiments were run on the Olympic dataset, which contains 16 action classes and 781 videos. The results of five cases of recognizing novel action testing classes are shown below (using 4 classes for testing and 12 action classes for training):




The results were reasonable, and significantly beat a classifier with 10 training examples. The comparison
results of the average accuracies are shown below:




            Case 1    Case 2    Case 3    Case 4    Case 5
Zero-shot   69.8%     77.6%     59.7%     67.6%     66.3%
10-shot     47.7%     60.1%     52.8%     58.3%     60.0%



The results of an experiment on the Olympic dataset to see if both manually specified and data-driven attributes can improve recognition, as well as a performance comparison with other state-of-the-art approaches, are shown below.






Theory:

Representing actions by a combination of attributes.

Attribute-based action representation:

By introducing the attribute layer between the low-level features and action class labels, the classifier f, which maps x to a class label, is decomposed into

    f(x) = L(S(x))

where S consists of m individual attribute classifiers f_{a_i}, each of which maps x to the corresponding i-th axis (attribute) of A^m, and L maps a point a ∈ A^m to a class label y ∈ Y. The attribute classifiers are learned from a training dataset. Specifically, classifier f_{a_i}(x) is trained by labeling the examples of all action classes whose attribute value a_i = 1 as positive examples and the rest as negative. The mapping L can be defined manually or learned from a training dataset.


Attributes as latent variables.

Given a training set, they wanted to learn a classification model for recognizing an unknown action x. Although the attribute values of each training example are already known (they are inherited from their corresponding classes), intra-class variability may cause the attribute values to be inaccurate for specific members of the training class. Consequently, some action instances may be associated with subtly different sets of attributes even if they belong to the same action class. For example, the "jumping forward" actions from the UIUC action dataset have the "big pendulum-like arm motion" attribute, while the instances from the Weizmann dataset do not. This is a consequence of the inherent intra-class variability and the fact that associating attribute labels is a subjective process. They address this difficulty by treating attributes as latent variables.

Latent variables are variables that are inferred, not directly observed.


They considered each attribute as an abstract "part" of an action, so the location of an attribute in the space A^m is interpreted as a latent variable, a_i ∈ [0, 1]. Larger values of a_i indicate a higher probability that a video possesses this attribute.


Their goal was to learn a classifier f_w: X^d × Y → R, where w is the parameter vector. A video vector x ∈ X^d is paired with each action class y ∈ Y, and f_w(x, y) yields a score. They want to find the max score for each video, to learn which class the video x contains:

    y* = argmax_{y ∈ Y} f_w(x, y)

which they denote as y* (the best action class for the video). That is, for each action class y in the complete set of action classes, they apply the classifier f_w to x and y, f_w(x, y), and the y that yields the maximum score is y*.

Since they are recognizing actions by their attributes, the prediction depends not only on the video x and the action class y, but also on the associated values a in the action attribute space A^m. So they define f_w(x, y) as the score of the best attribute configuration: a weight vector dot-producted with a feature function, maximized over attribute configurations,

    f_w(x, y) = max_{a ∈ A^m} w^T Φ(x, y, a)

Two things are optimized in this function: for each action class y, different attribute configurations are tried out, looking for the best combination that supports that class label; the class whose best-supporting configuration yields the highest score is the predicted class.

Here Φ(x, y, a) is a feature vector depending on:
- the raw video features x
- its class label y
- its associated attributes a

w is the parameter vector providing a weight for each feature.

***************** Still not 100% sure on how they apply the following weights



    w^T Φ(x, y, a) = w_x φ1(x, y) + Σ_{j ∈ A} w_{a_j} φ2(x, a_j) + Σ_{j,k ∈ A} w_{a_j,a_k} φ3(a_j, a_k)

where w is broken down into (w_x, {w_{a_j}}, {w_{a_j,a_k}}) and A is an attribute set.

w_x is the action class template, which is the set of coefficients learned from the raw features x.

w_{a_j} provides the weight for an individual attribute.

w_{a_j,a_k} provides weights for attributes with dependences; for example, if we were looking at object attribute pairs, then a_j and a_k might correspond to "head" and "ear", respectively. Their values are highly correlated, since an object that "has a head" tends to "have an ear" as well.

w_x φ1(x) provides the score measuring how well the raw feature φ1(x) of a video matches the action class template w_x, which is a set of coefficients learned from the raw features x. If the second and third terms above are ignored, w_x can be calculated using a binary linear SVM.


In other words, they first ignore the attributes in the training data and train a binary SVM from {(x(n), y(n))} for n = 1, ..., N. Then they use φ(x; y) to denote the SVM score of assigning x to class y, and use φ(x; y) as the feature vector. In this case, w_x is a scalar used to re-weight the SVM score corresponding to class y. This significantly speeds up the learning algorithm with their model.

********Still not sure how the scalar is used


Binary SVM:

A set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. The standard SVM takes a set of input data and predicts, for each given input, which of two possible classes comprises the input, making the SVM a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
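A minimal sketch of a binary linear SVM trained by sub-gradient descent on the soft-margin objective (a toy illustration with made-up data, not the authors' implementation):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Sub-gradient descent on the soft-margin objective
    lam/2 * ||w||^2 + mean(max(0, 1 - y*(X@w + b))), labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1          # points violating the margin
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data: class +1 around (2, 2), class -1 around (-2, -2).
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # predicted signs match the labels: +1, +1, -1, -1
```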



The potential function w^T_{a_j} φ2(x, a_j) provides the score of an individual attribute, and is used to indicate the presence of an attribute in the video x. The initial value of a_j is inherited from its class label in the training phase, and is given by a pre-trained attribute classifier when testing. Specifically, classifier f_{a_i}(x) is trained by labeling the examples of all action classes whose attribute value a_i = 1 as positive examples and the rest as negative. So, golf swinging would be labeled a positive example for the attribute "torso twist".




The edge potential w^T_{a_j,a_k} φ3(a_j, a_k) captures the co-occurrence of a pair of attributes a_j and a_k. If each attribute a_i has A statuses (e.g., {0, 1}), then each edge has A × A configurations. As a result, the feature vector φ3(a_j, a_k) of an edge is an A × A dimensional indicator for edge configurations. The associated w^T_{a_j,a_k} contains the weights for all configurations. This basically says that the model looks at pairs of attributes to see if there are dependencies, and uses this information to enhance the performance of the algorithm.


NOTE: f_w(x, y) = max_a w^T Φ(x, y, a) is the defined latent SVM. The SVM, which is used for testing and for classifying new data, is then trained on the whole training set using the selected parameters. Here w is a vector of model parameters, a is a set of latent values, and w^T Φ(x, y, a) is the score of placing the model according to the attribute configuration a.


The parameter vector w is learned from a training dataset D by solving the following objective function:

    w* = argmin_w (1/2)||w||^2 + C Σ_n max(0, 1 − y(n) f_w(x(n)))

where the second term implements a soft margin.


A soft margin is a modified maximum-margin idea that allows for mislabeled examples. [2] If there exists no hyperplane that can split the "yes" and "no" examples in an SVM, the soft-margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.


Note that f_w(x, y) as defined above is a maximum of functions, each of which is linear in w. Hence f_w is convex in w. This implies that the hinge loss max(0, 1 − y_i f_w(x_i)) is convex in w when y_i = −1. That is, the loss function is convex in w for negative examples. We call this property of the loss function semi-convexity.



The objective function is semi-convex due to the inner max in f_w.

A semi-convex function is a function such that, if one draws a horizontal line anywhere in the Cartesian plane corresponding to the graph of the function, the set of x such that F(x) is below the line is empty or forms a single interval. Examples of semi-convex functions are shown below.



A local optimum can be obtained by coordinate descent [11], as follows:

Holding w fixed, find the best attribute configuration a* such that w^T Φ(x, y, a) is maximized.

Then, once they have the best attribute configuration, they hold that attribute configuration fixed and search for the best parameters w such that the objective function is minimized:

***********still not sure how they put the attribute configuration into the following function
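The alternation between the two coordinate-descent steps can be sketched on a toy model. Everything here is made up for illustration (the feature map, the squared loss standing in for the paper's hinge-loss minimization, and the numbers); it only shows the hold-one-fix-the-other pattern:

```python
import numpy as np
from itertools import product

# Toy latent-variable scorer: score(w, x, a) = w @ feature(x, a), a in {0,1}^m.
def feature(x, a):
    return np.concatenate([x, a])          # raw features plus attribute indicators

def best_config(w, x, m=2):
    """Step 1: hold w fixed, enumerate attribute configurations, keep the best."""
    return max((np.array(a) for a in product([0, 1], repeat=m)),
               key=lambda a: w @ feature(x, a))

def update_w(w, x, a, target, lr=0.1):
    """Step 2: hold `a` fixed, take one gradient step on a squared loss
    (a stand-in for the objective minimization in the paper)."""
    phi = feature(x, a)
    return w - lr * 2 * (w @ phi - target) * phi

x = np.array([1.0, 0.5])
w = np.array([0.1, 0.1, 0.5, -0.5])
for _ in range(20):
    a = best_config(w, x)                  # step 1
    w = update_w(w, x, a, target=1.0)      # step 2
print(a)  # -> [1 0], the best attribute configuration under the learned w
```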



To find the best attribute configuration a for f_w(x, y) = max_a w^T Φ(x, y, a), they use belief propagation.



Belief propagation is used because finding the best attribute configuration for the above function by exhaustive search would be too computationally expensive; using belief propagation reduces the number of computations considerably.

Belief propagation is a message-passing algorithm for performing inference on graphical models. It calculates the marginal distribution for each unobserved node, conditional on any observed nodes. It can also be described as an iterative process in which neighboring variables "talk" to each other, passing messages such as: "I (variable x3) think that you (variable x2) belong in these states with various likelihoods..." After enough iterations, this series of conversations is likely to converge to a consensus that determines the marginal probabilities of all the variables. Estimated marginal probabilities are called beliefs. BP algorithm: update messages until convergence, then calculate beliefs.
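For finding a best configuration (rather than marginals), the max-product variant of BP is used; on a chain it reduces to a Viterbi-style dynamic program. A toy sketch with made-up unary and pairwise scores over three binary attributes (not the paper's model):

```python
import numpy as np

# unary[i][s] scores attribute i in state s; pair[i][s][t] scores the edge
# between attributes i and i+1 in states (s, t). All numbers are invented.
unary = np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.3]])
pair = np.array([[[0.2, 0.0], [0.0, 0.6]],
                 [[0.4, 0.0], [0.0, 0.1]]])

def best_chain_config(unary, pair):
    """Max-product BP on a chain: forward max-messages, then backtrack."""
    n = len(unary)
    msg = np.zeros((n, 2))                 # msg[i][s]: best prefix score ending in state s
    back = np.zeros((n, 2), dtype=int)
    msg[0] = unary[0]
    for i in range(1, n):
        scores = msg[i - 1][:, None] + pair[i - 1] + unary[i][None, :]
        msg[i] = scores.max(axis=0)
        back[i] = scores.argmax(axis=0)
    config = [int(msg[-1].argmax())]
    for i in range(n - 1, 0, -1):
        config.append(int(back[i][config[-1]]))
    return config[::-1]

print(best_chain_config(unary, pair))  # -> [1, 1, 1]
```

This visits O(n · A²) edge configurations instead of the A^n needed by brute-force enumeration, which is why BP (or its tree/loopy generalizations) makes the inner maximization tractable.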



Learning Data-Driven Attributes

They propose that manually specified attributes can assist action recognition, but point out that the manual specification of attributes is subjective, and potentially useful (discriminative) attributes may be ignored. Since this may significantly affect the performance of classifiers, they proposed a framework that combines manually specified attributes with data-driven attributes (attributes that are automatically learned). They argued that the two have a complementary role in providing a more complete characterization of human actions. They proposed to discover data-driven attributes by clustering low-level features while maximizing the system information gain. The intuition is that attributes may be characterized by a collection of low-level features that tend to co-occur in the training data.


Given two random variables X ∈ X = {x1, x2, ..., xn} and Y ∈ Y = {y1, y2, ..., ym}, where X represents a set of visual words and Y is a set of action videos, the Mutual Information (MI) [4] MI(X; Y) between X and Y expresses how much information from the visual-words variable is contained in the action videos, which provides a good measurement to evaluate the quality of a low-level feature grouping.


Definition of mutual information

Formally, the mutual information of two discrete random variables X and Y can be defined as:

    MI(X; Y) = Σ_{x ∈ X} Σ_{y ∈ Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

where p(x, y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively.
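The definition translates directly into code; a small sketch with toy joint distributions (using log base 2, so MI is in bits):

```python
import math

def mutual_information(joint):
    """MI from a joint distribution given as a dict {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p        # marginal p(x)
        py[y] = py.get(y, 0.0) + p        # marginal p(y)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly dependent variables: MI equals the full 1 bit of entropy.
joint = {("x1", "y1"): 0.5, ("x2", "y2"): 0.5}
print(mutual_information(joint))  # 1.0

# Independent variables carry no mutual information.
indep = {(x, y): 0.25 for x in ("x1", "x2") for y in ("y1", "y2")}
print(round(mutual_information(indep), 10))  # 0.0
```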


It is clear that if two features x_i and x_j are semantically similar, then merging them will not cause significant loss of the information that X and Y share. Given a set of features X, they obtained a set of clusters X̂. The quality of clustering is measured by the loss of MI,

    L_inf = MI(X; Y) − MI(X̂; Y) = Σ_j Σ_{x_i ∈ X̂_j} p(x_i) KL(p(Y|x_i), p(Y|X̂_j))

where X̂ is the partition of the training feature set X, and KL(a, b) is the KL-divergence between two distributions. They treated the distribution p(Y|X̂_j) as the cluster prototype (e.g., the centroid); with the prior p(x_i) uniformly distributed, the loss of MI is the distance from the distribution p(Y|x_i) to the cluster prototype.
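The cluster-assignment step this implies can be sketched as follows; the tiny conditional distributions and cluster prototypes are invented for illustration:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (lists of probabilities)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Each visual word x_i is summarized by its conditional distribution p(Y | x_i)
# over action videos; prototypes p(Y | cluster) play the role of centroids.
word_dists = {
    "w1": [0.9, 0.1],
    "w2": [0.8, 0.2],
    "w3": [0.1, 0.9],
}
prototypes = {"c1": [0.85, 0.15], "c2": [0.15, 0.85]}

# Assign each word to the prototype with the smallest KL divergence; with a
# uniform prior p(x_i), summing the chosen divergences gives the loss of MI
# for this partition (up to the constant prior factor).
assignment = {w: min(prototypes, key=lambda c: kl(d, prototypes[c]))
              for w, d in word_dists.items()}
print(assignment)  # {'w1': 'c1', 'w2': 'c1', 'w3': 'c2'}
```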

They integrated the discovery of data-driven attributes into the framework of latent SVM. Suppose h ∈ H^l (where H^l is the data-driven attribute space) is the data-driven attribute vector associated with x; then their model is extended as follows,

***************** Still not 100% sure on how they apply the following weights




where the added unary terms provide the prediction of a class label by a data-driven attribute h_s, the added pairwise terms measure the dependency between pairwise data-driven attributes, and H is a set of data-driven attributes. Since they considered both human-specified attributes and data-driven attributes to be latent variables, for each example x they searched for the best configuration of a and h such that f_w(x, y) = max_{a,h} w^T Φ(x, y, a, h).


The extended objective is,

***********still not sure how they put the attribute configuration a and h into the following function



where λ and η are tradeoff parameters. The rationale behind this model is that by minimizing the integrated objective function, they can find a set of latent data-driven attributes and the classification model w which i) predicts the data correctly with a large margin and ii) minimizes the loss of mutual information caused by feature merging.

Discovering data-driven attributes while treating both human-specified and data-driven action attributes as latent variables makes the objective function intractable. (Problems that can be solved in theory, e.g., given infinite time, but which in practice take too long for their solutions to be useful, are known as intractable problems. In complexity theory, problems that lack polynomial-time solutions are considered intractable for more than the smallest inputs.)



To simplify this process, they used two separate steps. First they find the best partition X̂ of X by minimizing the loss of MI, L_inf. This process produces the data-driven attributes. Once they have the data-driven attributes, they only need to solve the latent SVM problem. Gradient descent methods can be used to minimize the loss of MI.


A tutorial on energy based learning

4 Latent Variable Architectures

Energy minimization is a convenient way to represent the general process of reasoning and inference. In the usual scenario, the energy is minimized with respect to the variables to be predicted Y, given the observed variables X. During training, the correct value of Y is given for each training sample. However there are numerous applications where it is convenient to use energy functions that depend on a set of hidden variables Z whose correct value is never (or rarely) given to us, even during training. For example, we could imagine training the face detection system depicted in Figure 2(b) with data for which the scale and pose information of the faces is not available. For these architectures, the inference process for a given set of variables X and Y involves minimizing over these unseen variables Z:

    E(Y, X) = min_{Z ∈ Z} E(Z, Y, X).    (42)

Such hidden variables are called latent variables, by analogy with a similar concept in probabilistic modeling. The fact that the evaluation of E(Y, X) involves a minimization over Z does not significantly impact the approach described so far, but the use of latent variables is so ubiquitous that it deserves special treatment. In particular, some insight can be gained by viewing the inference process in the presence of latent variables as a simultaneous minimization over Y and Z:

    Y* = argmin_{Y ∈ Y, Z ∈ Z} E(Z, Y, X).    (43)

Latent variables can be viewed as intermediate results on the way to finding the best output Y. At this point, one could argue that there is no conceptual difference between the Z and Y variables: Z could simply be folded into Y. The distinction arises during training: we are given the correct value of Y for a number of training samples, but we are never given the correct value of Z. Latent variables are very useful in situations where a hidden characteristic of the process being modeled can be inferred from observations, but cannot be predicted directly. One such example is in recognition problems. For example, in face recognition the gender of a person or the orientation of the face could be a latent variable. Knowing these values would make the recognition task much easier. Likewise in invariant object recognition the pose parameters of the object (location, orientation, scale) or the illumination could be latent variables. They play a crucial role in problems where segmentation of the sequential data must be performed simultaneously with the recognition task. A good example is speech recognition, in which the segmentation of sentences into words and words into phonemes must take place simultaneously with recognition, yet the correct segmentation into phonemes is rarely available during training. Similarly, in handwriting recognition, the segmentation of words into characters should take place simultaneously with the recognition. The use of latent variables in face recognition is discussed in this section, and Section 7.3 describes a latent variable architecture for handwriting recognition.



Knowledge Transfer Across Classes

Representing actions with a set of human-specified action attributes makes it possible to recognize a novel action class even when training examples are not available. This is accomplished by transferring knowledge from known classes (with training examples) to a novel class (without training examples), and using this knowledge to recognize instances of the novel class. To formulate the problem, let T be a training set consisting of K training action classes. Given a set of novel classes Z that is disjoint from Y, they seek to obtain a classifier f: X^d → Z. Traditional classification fails to solve this problem since there are no training examples for Z.

As Fig. 2 demonstrates, with human-specified attributes any action class is an m-dimensional vector: say a^y = (a^y_1, ..., a^y_m) ∈ A^m for a training class y, and a^z = (a^z_1, ..., a^z_m) ∈ A^m for a novel class z. An action instance is also represented by a point in A^m. Ideally, the positions of action instances will be close to the positions of their corresponding action classes. The attribute vector of an action class is specified manually, while the attribute vector of an action instance is provided by m attribute classifiers. These attribute classifiers, namely the mapping S: X^d → A^m, are learned from training dataset T in the manner previously described. Given an unknown action x belonging to one of the action classes Z, it is first encoded into the attribute space by S(x) ∈ A^m. Its Euclidean distances to all novel classes Z are then measured, and it is assigned to the nearest class (in these experiments, the K-Nearest Neighbors (KNN) technique was used for classification). Notice that this assignment is possible because they know the mappings A^m → Z (manually specified) and X^d → A^m (learned from the training data), even if no training samples are available for the novel classes Z.
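The whole zero-shot pipeline described above fits in a few lines. A minimal sketch: the attribute vectors, the m = 4 attribute space, and the stand-in for the learned attribute classifiers S are all hypothetical, not the paper's actual values:

```python
import math

# Novel classes described only by manually specified attribute vectors in A^m.
novel_classes = {
    "hand-clap": [1, 1, 0, 0],
    "crawl":     [0, 0, 1, 1],
}

def S(video):
    """Stand-in for the m learned attribute classifiers: in a real system this
    would run m classifiers on the video; here the input is already soft
    attribute scores."""
    return video

def zero_shot_classify(video):
    """Encode the video into A^m, then assign it to the nearest novel class."""
    a = S(video)
    dist = lambda u, v: math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))
    return min(novel_classes, key=lambda z: dist(a, novel_classes[z]))

print(zero_shot_classify([0.9, 0.8, 0.1, 0.2]))  # -> hand-clap
```

No training videos of "hand-clap" or "crawl" are needed; only their attribute descriptions and the attribute classifiers learned from other classes.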