Formulating Semantic Image Annotation
as a Supervised Learning Problem


Gustavo Carneiro and Nuno Vasconcelos

CVPR ‘05

Presentation by:

Douglas Turnbull

CSE Department, UCSD


Topic in Vision and Learning

November 3, 2005

What is Image Annotation?

Given an image, what are the words that describe the image?

What is Image Retrieval?

Given a database of images and a query string (e.g. words),
what are the images that are described by the words?

Query String: “jet”

Problem: Image Annotation & Retrieval

Given the low cost of both digital cameras and hard disk space, billions of consumers have the ability to create and store digital images.


There are already billions of digital images stored on personal computers and in commercial databases.


How do we store images in, and retrieve images from, a large database?


Problem: Image Annotation & Retrieval

In general, people do not spend time labeling, organizing, or annotating their personal image collections.


Label:


Images are often stored with the name that is produced by the digital camera:


“DSC002861.jpg”


When they are labeled, they are given vague names that rarely describe the content of the image:


“GoodTimes.jpg”, “China05.jpg”


Organize:


No standard scheme exists for filing images


Individuals use ad hoc methods: “Christmas2005Photos” and “Sailing_Photos”


It is hard to merge image collections since the taxonomies (e.g. directory
hierarchies) differ from user to user.


Problem: Image Annotation & Retrieval

In general, people do not spend time labeling, organizing, or annotating their personal image collections.


Annotate:


Explicit Annotation:

Rarely do we explicitly annotate our images with captions.


An exception is when we create web galleries


i.e. My wedding photos on
www.KodakGallery.com



Implicit Annotation:
Sometimes we do implicitly annotate images when we embed them in text (as is the case with webpages).


Web-based search engines make use of this implicit annotation when they index images.


i.e. Google Image Search, Picsearch




Problem: Image Annotation & Retrieval

If we can’t depend on human labeling, organization, or annotation, we will have to resort to “content-based image retrieval”:


We will extract feature vectors from each image


Based on these feature vectors, we will use statistical models to characterize the relationship between a query and image features.



How do we specify a meaningful query to be able to navigate this image feature space?



Problem: Image Annotation & Retrieval

Content-Based Image Retrieval: How do we specify a query?


Query-by-sketch: Sketch a picture, extract features from the sketch, and use those features to find similar images in the database.


This requires that:

1. we have a good drawing interface handy

2. everybody is able to draw

3. the quick sketch is able to capture the salient nature of the desired query


Not a very feasible approach.





Problem: Image Annotation & Retrieval

Content-Based Image Retrieval: How do we specify a query?


Query-by-text: Input words into a statistical model that captures the relationship between words and image features.


This requires:

1. A keyboard

2. A statistical model that can relate words to image features

3. Words that capture the salient nature of the desired query.


A number of research systems have been developed that find a relationship between content-based image features and text for the purpose of image annotation and retrieval.



- Mori, Takahashi, Oka (1999)

- Duygulu, Barnard, de Freitas (2002)

- Blei, Jordan (2003)

- Feng, Manmatha, Lavrenko (2004)



Outline

Notation and Problem Statement


Three General Approaches to Image Annotation

1. Supervised One vs. All (OVA) Models

2. Unsupervised Models using Latent Variables

3. Supervised M-ary Model


Estimating P(image features|words)


Experimental Setup and Results


Automatic Music Annotation



Outline

Notation and Problem Statement


Three General Approaches to Image Annotation

1. Supervised One vs. All (OVA) Models

2. Unsupervised Models using Latent Variables

3. Supervised M-ary Model


Estimating P(image features|words)


Experimental Setup and Results


Automatic Music Annotation



Notation and Problem Statement
x_i = vector of image features for one image region

x = {x_1, x_2, …} = vector of feature vectors

w_i = one word

w = {w_1, w_2, …} = vector of words

[Figure: an image and its caption, decomposed into image regions]

Notation and Problem Statement

[Figure: example image decomposed into regions]

Multiple Instance Learning: this region has no visual aspect of “jet”

Weak Labeling: this image depicts sky even though the caption does not contain “sky”

Outline

Notation and Problem Statement


Three General Approaches to Image Annotation

1. Supervised One vs. All (OVA) Models

2. Unsupervised Models using Latent Variables

3. Supervised M-ary Model


Estimating P(image features|words)


Experimental Setup and Results


Automatic Music Annotation



Supervised OVA Models

Early research posed the problem as a supervised learning
problem: train a classifier for each semantic concept.


Binary Classification/Detection Problems:


Holistic Concepts: landscape/cityscape, indoor/outdoor scenes


Object Detection: horses, buildings, trees, etc.



Much of the early work focused on feature design and used existing models developed by the machine learning community (SVM, kNN, etc.) for classification.
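To make the OVA recipe concrete, here is a minimal sketch of training one binary classifier per keyword. It is only an illustration, assuming scikit-learn and hypothetical data shapes; the early systems surveyed here did not necessarily use SVC.

```python
# Minimal one-vs-all (OVA) annotation sketch: one binary classifier
# per vocabulary keyword. Any off-the-shelf classifier (SVM, kNN, ...)
# could be substituted.
import numpy as np
from sklearn.svm import SVC

def train_ova(features, captions, vocabulary):
    """features: (D, d) array, one feature vector per image.
    captions: list of D keyword sets. Returns one SVC per keyword."""
    classifiers = {}
    for word in vocabulary:
        # D_1 = images labeled with the word, D_0 = everything else.
        y = np.array([word in caption for caption in captions], dtype=int)
        classifiers[word] = SVC(probability=True).fit(features, y)
    return classifiers

def annotate_ova(classifiers, x, n_words=5):
    # Rank keywords by each classifier's positive-class probability.
    # Note: probabilities from L independently trained classifiers are
    # not directly comparable, one of the "cons" listed below.
    scores = {w: clf.predict_proba(x[None, :])[0, 1]
              for w, clf in classifiers.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_words]
```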





Supervised OVA Models





Pro:

Easy to implement

Can design features and tune the learning algorithm for each classification task

Notion of optimal performance on each task

Standard data sets (e.g., OCR) provide a basis of comparison

Con:

Doesn’t scale well with a large vocabulary

Requires training and using L classifiers

Hard to compare the posterior probabilities output by the L classifiers

No natural ranking of keywords

Weak labeling is a problem:

Images not labeled with a keyword are placed in the negative set D_0


Unsupervised Models

The goal is to estimate the joint distribution P(x, w) of image features and words.

We introduce a latent (i.e. hidden) variable L that encodes S hidden states of the world.


i.e. “Sky” state, “Jet” state

A state defines a joint distribution of image features and keywords.


i.e. P(x = (blue, white, fuzzy), w = (“sky”, “cloud”, “blue”) | “Sky” state) will be high.


We can sum over the S states to find the joint distribution:

P(x, w) = Σ_{l=1..S} P(x, w | L = l) P(L = l)

Learning is based on expectation maximization (EM):

1) E-step: update the strength of association between each image-caption pair and each state

2) M-step: maximize the likelihood of the joint distribution for each state


Annotation returns the most probable words under the joint distribution model
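A sketch of annotation under such a latent-state model, assuming the per-state densities and word distributions have already been fit by EM, and assuming features and words are independent given the state. The state dictionary structure and names are hypothetical.

```python
import numpy as np

def annotate_latent(x, states, n_words=5):
    """states: list of dicts with keys 'prior' (P(L=l)), 'px' (a
    callable returning P(x|l)), and 'pw' (a (V,) array of P(w|l)).
    Scores each vocabulary word by P(w, x) = sum_l P(x|l) P(w|l) P(l)."""
    # Weight of each state for the query image.
    weights = np.array([s['prior'] * s['px'](x) for s in states])
    # Marginalize the latent state to get a score per vocabulary word.
    word_scores = sum(w * s['pw'] for w, s in zip(weights, states))
    return np.argsort(word_scores)[::-1][:n_words]  # top word indices
```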



Unsupervised Models

Multiple-Bernoulli Relevance Model (MBRM) (Feng, Manmatha, Lavrenko CVPR ’04)

The simplest unsupervised model, and the one that achieves the best results


Each of the D images in the training set is a “not-so-hidden” state


Assume conditional independence between image features and keywords given the state:

P(x, w | L = l) = P(x | L = l) P(w | L = l)

MBRM eliminates the need for EM, since we don’t need to find the strength of association between image-caption pairs and states.

Parameter estimation is straightforward


P_X|L is estimated using a Gaussian kernel

P_W|L reduces to counting


The algorithm becomes essentially “smoothed k-nearest neighbor”.
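A minimal sketch of this smoothed-kNN view, under simplifying assumptions: one (averaged) feature vector per training image rather than MBRM's per-region kernels, an isotropic Gaussian kernel, and an ad hoc Bernoulli smoothing value instead of the paper's estimates. All names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mbrm_annotate(x, train_features, train_captions, vocabulary,
                  bandwidth=1.0, n_words=5):
    """Each training image l is a state. P(x|l) is a Gaussian kernel
    centered on image l's feature vector; P(w|l) is near 1 if word w
    appears in image l's caption, else a small smoothing value."""
    scores = np.zeros(len(vocabulary))
    for feats, caption in zip(train_features, train_captions):
        px = multivariate_normal.pdf(x, mean=feats,
                                     cov=bandwidth * np.eye(len(x)))
        pw = np.array([0.99 if w in caption else 0.01 for w in vocabulary])
        scores += px * pw  # uniform prior over training images
    return [vocabulary[i] for i in np.argsort(scores)[::-1][:n_words]]
```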



Unsupervised Models

Pros:


More scalable than Supervised OVA with respect to vocabulary size


Natural ranking of keywords


Weaker demands on quality of labeling


Robust to a weakly labeled dataset


Cons:


No guarantees of optimality since keywords are not explicitly treated
as classes


Annotation: What is a good annotation?


Retrieval: What are the best images given a query string?

Supervised M-ary Model

Critical Idea: Why introduce latent variables when a keyword directly represents a semantic class?


Define a random variable W taking values in {1, …, L} such that W = i if x is labeled with keyword w_i.

The class-conditional distributions P_X|W(x|i) are estimated using the images that have keyword w_i.

To annotate a new image with features x, the Bayes decision rule is invoked:

i* = argmax_i P_X|W(x|i) P_W(i)
Unlike Supervised OVA, which consists of solving L binary decision problems, we are solving one decision problem with L classes.

The keywords compete to represent the image features.
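A sketch of annotation under the Bayes decision rule, assuming the class-conditional models have already been estimated; the callable-model interface and the i.i.d.-regions assumption are illustrative choices, not details from the paper.

```python
import numpy as np

def annotate_mary(x_regions, class_models, priors, n_words=5):
    """class_models[i] is a callable returning log P(x|W=i) for one
    feature vector; priors[i] is P(W=i). Regions are assumed i.i.d.,
    so log-likelihoods add across the image's feature vectors."""
    log_post = np.array([
        np.sum([m(x) for x in x_regions]) + np.log(p)
        for m, p in zip(class_models, priors)])
    # One decision problem over all L classes, so the posteriors are
    # directly comparable across keywords and yield a natural ranking.
    return np.argsort(log_post)[::-1][:n_words]
```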

Supervised M-ary Model

Pros:


Natural Ranking of Keywords


Similar to unsupervised models


Posterior probabilities are relative to the same classification problem.



Does not require training of non-class models

Non-class models are the Y_i = 0 models in Supervised OVA


Robust to a weakly labeled data set, since images that contain a concept but are not labeled with the keyword do not adversely affect learning.


Non-class models are the computational bottleneck



Learning the density estimates P_X|W(x|i) is computationally equivalent to learning density estimates for each image in the MBRM model.


Relies on the Mixture Hierarchies method (Vasconcelos ’01)



When vocabulary size (L) is smaller than the training set size (D),
annotation is computationally more efficient than the most efficient
unsupervised algorithm.

Outline

Notation and Problem Statement


Three General Approaches to Image Annotation

1. Supervised One vs. All (OVA) Models

2. Unsupervised Models using Latent Variables

3. Supervised M-ary Model


Estimating P(image features|words)


Experimental Setup and Results


Automatic Music Annotation



Density Estimation

For Supervised M-ary learning, we need to find the class-conditional density estimates P_X|W(x|w_i) using a training data set D_i.


All the images in D_i have been labeled with w_i


Two Questions:

1) Given that a number of the image regions from images in D_i will not exhibit visual properties that relate to w_i, can we even estimate these densities?

i.e. An image labeled “jet” will have regions where only sky is present.



2) What is the “best” way to estimate these densities?

“best”: the estimate can be calculated using a computationally efficient algorithm

“best”: the estimate is accurate and general.

Density Estimation

Multiple Instance Learning: a bag of instances receives a label for the entire bag if one or more instances deserves that label.


This makes the data noisy, but with enough averaging we can get a good
density estimate.


For example:

1. Suppose all images have three regions.

2. Every image annotated with “jet” has one region with jet-like features (i.e. mu = 20, sigma = 3).

3. The other two regions have parameters drawn uniformly: mu ~ U(-100, 1000) and sigma ~ U(0.1, 10).

4. If we average 1000 images, the “jet” distribution emerges.
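The claim is easy to check with a quick simulation of the setup above (1-D features, 50 sample points per region; both are arbitrary choices made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, pts_per_region = 1000, 50
samples = []
for _ in range(n_images):
    # One "jet" region per image: mu = 20, sigma = 3.
    samples.append(rng.normal(20, 3, pts_per_region))
    # Two background regions with random means and spreads.
    for _ in range(2):
        mu, sigma = rng.uniform(-100, 1000), rng.uniform(0.1, 10)
        samples.append(rng.normal(mu, sigma, pts_per_region))
data = np.concatenate(samples)

# Histogram of the pooled data: the background mass spreads thinly
# over [-100, 1000] while the "jet" component piles up near 20.
hist, edges = np.histogram(data, bins=200, range=(-100, 1000))
print("densest bin starts near:", edges[np.argmax(hist)])  # ~20
```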

Density Estimation

For word w_i, we have D_i images, each of which is represented by a vector of feature vectors.


The authors discuss four methods of estimating P_X|W(x|i):

1. Direct Estimation

2. Model Averaging

1) Histogram

2) Naïve Averaging

3. Mixture Hierarchies



Density Estimation


1) Direct Estimation

All feature vectors for all images represent a distribution

Need to do some heuristic smoothing (e.g. use a Gaussian kernel)

Does not scale well with training set size or the number of vectors per image

[Figure: smoothed kNN density estimate over a 2-D feature space]
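A minimal sketch of direct estimation, assuming scipy's Gaussian kernel density estimator (the library choice and the random stand-in data are assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Pool ALL region feature vectors from all images labeled w_i into one
# sample and smooth with a Gaussian kernel. Evaluation cost grows with
# the total number of pooled vectors: the scaling problem noted above.
pooled = np.random.randn(2, 5000)    # (d, N) stand-in for real features
density = gaussian_kde(pooled)       # bandwidth chosen automatically
print(density(np.zeros((2, 1))))     # P_X|W(x | w_i) at a query point
```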

Density Estimation

2) Model Averaging

Each image l in D_i represents an individual distribution

We average the image distributions to find one class distribution




The paper mentions two techniques:

1) Histograms: partition the space and count

Data sparsity problems for high-dimensional feature vectors.

2) Naïve Averaging using Mixture Models

Slow annotation time, since there will be KD Gaussians if each image mixture has K components



Feature 2

Feature 1

Feature 2

Feature 1

Histogram

Smoothed kNN

Feature 2

Feature 1

Mixtures
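A sketch of naive averaging, assuming a K-component mixture has already been fit per image (the `image_gmms` structure is hypothetical). It makes the cost problem visible: every density evaluation touches all K*D components.

```python
import numpy as np
from scipy.stats import multivariate_normal

def averaged_density(x, image_gmms):
    """image_gmms: list of D per-image mixtures, each a list of K
    (weight, mean, cov) triples. The class density is the uniform
    average of the D image densities, i.e. one big mixture with
    K*D components, which is what makes annotation slow."""
    total = 0.0
    for gmm in image_gmms:
        total += sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
                     for w, m, c in gmm)
    return total / len(image_gmms)
```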

Density Estimation

3) Mixture Hierarchies (Vasconcelos 2001)


Each image l in D_i represents an individual mixture of K Gaussian distributions


We combine “redundant” mixture components using EM


E-step: compute weights between each of the KD components and the T components

M-step: maximize the parameters of the T components using the weights


The final distribution is one mixture of T Gaussians for each keyword w_i, where T << KD.





Di

l
1

l
3

l
Di

l
2


. . .
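To convey the E/M alternation over components (rather than raw data points), here is a deliberately simplified stand-in: it clusters the K*D child components, represented only by their means and mixture weights, into T spherical parents with a weighted EM. It is NOT the exact hierarchical EM of Vasconcelos '01, which also propagates the child covariances.

```python
import numpy as np

def merge_components(means, weights, T, n_iter=50, seed=0):
    """means: (K*D, d) child component means; weights: (K*D,) child
    mixture masses. Returns T parent weights, means, variances."""
    rng = np.random.default_rng(seed)
    d = means.shape[1]
    centers = means[rng.choice(len(means), T, replace=False)]
    var = np.full(T, np.var(means))
    pi = np.full(T, 1.0 / T)
    for _ in range(n_iter):
        # E-step: responsibility of each parent t for each child j.
        d2 = ((means[:, None, :] - centers[None]) ** 2).sum(-1)
        logp = np.log(pi) - 0.5 * d2 / var - 0.5 * d * np.log(var)
        h = np.exp(logp - logp.max(axis=1, keepdims=True))
        h /= h.sum(axis=1, keepdims=True)
        h *= weights[:, None]          # children count by mixture mass
        # M-step: re-estimate parent weights, means, and variances.
        mass = h.sum(axis=0) + 1e-12
        pi = mass / mass.sum()
        centers = (h.T @ means) / mass[:, None]
        d2 = ((means[:, None, :] - centers[None]) ** 2).sum(-1)
        var = (h * d2).sum(axis=0) / (mass * d) + 1e-8
    return pi, centers, var
```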

Outline

Notation and Problem Statement


Three General Approaches to Image Annotation

1. Supervised One vs. All (OVA) Models

2. Unsupervised Models using Latent Variables

3. Supervised M-ary Model


Estimating P(image features|words)


Experimental Setup and Results


Automatic Music Annotation



Experimental Setup

Corel Stock Photos Data Set


5,000 images


4,500 for training, 500 for testing


Captions of 1-5 words per image from a vocabulary of L = 371 keywords


Image Features


Convert from RGB to YBR color space

Compute the 8 x 8 discrete cosine transform (DCT)

The result is a 3 * 64 = 192-dimensional feature vector for each image region

64 low-frequency features are retained so that …
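A sketch of this feature pipeline. The RGB-to-YCbCr matrix and scipy's DCT are standard, but details such as the non-overlapping block stride are assumptions made for the sketch, not necessarily the paper's exact settings.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(rgb):
    """rgb: (H, W, 3) uint8 image. Returns one 192-dim vector per
    non-overlapping 8x8 block: 64 DCT coefficients for each of the
    Y, Cb, Cr channels."""
    rgb = rgb.astype(np.float64)
    # Standard RGB -> YCbCr ("YBR") conversion.
    m = np.array([[ 0.299,   0.587,   0.114 ],
                  [-0.1687, -0.3313,  0.5   ],
                  [ 0.5,    -0.4187, -0.0813]])
    ycbcr = rgb @ m.T
    feats = []
    H, W, _ = ycbcr.shape
    for i in range(0, H - 7, 8):
        for j in range(0, W - 7, 8):
            block = ycbcr[i:i + 8, j:j + 8, :]
            # 2-D DCT per channel; flatten to 3 * 64 = 192 dims.
            coef = [dctn(block[:, :, c], norm='ortho').ravel()
                    for c in range(3)]
            feats.append(np.concatenate(coef))
    return np.array(feats)
```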

Experimental Setup

Two (simplified) tasks:

Annotation: given a new image, what are the best five words that describe the image?

Retrieval: given a one-word query, what are the images that match the query?




Evaluation Metrics:

|w_H| = number of images that have been annotated with w by humans

|w_A| = number of images that have been automatically annotated with w

|w_C| = number of images that have been automatically annotated with w AND were annotated with w by humans

Recall = |w_C| / |w_H|

Precision = |w_C| / |w_A|

Mean Recall and Mean Precision are averaged over all the words found in the test set.
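These metrics are simple counts, so a small sketch makes them unambiguous (function names and data layout are my own; captions are represented as sets of keywords):

```python
import numpy as np

def word_recall_precision(word, human, auto):
    """human/auto: lists of keyword sets, one per test image.
    Returns (recall, precision) for one word, per the definitions above."""
    wH = sum(word in h for h in human)            # human-annotated
    wA = sum(word in a for a in auto)             # auto-annotated
    wC = sum(word in h and word in a for h, a in zip(human, auto))
    recall = wC / wH if wH else 0.0
    precision = wC / wA if wA else 0.0
    return recall, precision

def mean_metrics(vocabulary, human, auto):
    # Average over all words that actually appear in the test captions.
    pairs = [word_recall_precision(w, human, auto)
             for w in vocabulary if any(w in h for h in human)]
    return tuple(np.mean(pairs, axis=0))  # (mean recall, mean precision)
```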



Other Annotation Systems

1. Co-occurrence (1999)

Mori, Takahashi, Oka

Early work that clusters sub-images (block-based decomposition) and counts word frequencies for each cluster


2. Translation (2002)

Duygulu, Barnard, de Freitas, Forsyth

“Vocabulary of Blobs”

Automatic Segmentation -> Feature Vectors -> Clustering -> Blobs

An image is made up of Blobs; Words are associated with Blobs -> New Caption

“Blobs” are latent states

[Figure: block-based decomposition vs. automatic segmentation]

Other Annotation Systems

3. CRM (2003)

Lavrenko, Manmatha, Jeon

Continuous-space Relevance Model

“smoothed kNN” algorithm



image features are modeled using kernel-based densities

automatic image segmentation

color, shape, texture features

word features are modeled using a multinomial distribution

“Training Images” are the latent states.




4. CRM-rect (2004)

Feng, Manmatha, Lavrenko

Same as CRM but using block-based decomposition rather than segmentation


5. MBRM (2004)

Feng, Manmatha, Lavrenko

Multiple-Bernoulli Relevance Model

Same as CRM-rect but uses a multiple-Bernoulli distribution to model word features

Shifts emphasis to the presence of a word rather than the prominence of a word.

New Annotation Systems

6. CRM-rect-DCT (2005)

Carneiro, Vasconcelos

CRM-rect with DCT features


7. Mix-Hier (2005)

Carneiro, Vasconcelos

Supervised M-ary Learning

Density estimation using Mixture Hierarchies

DCT features


Annotation Results

Examples of Image Annotations:

Annotation Results

Performance of the annotation systems on the Corel test set

500 images, 260 keywords, 5 generated keywords per image

Recall = |w_C| / |w_H|, Precision = |w_C| / |w_A|






Gain of 16% in recall at the same or better level of precision

Gain of 12% in the number of words with positive recall

i.e. a word is found in both the human and automatic annotations at least once.

Annotation Results

Annotation computation time for Mix-Hier scales with vocabulary size rather than training set size.

MBRM is O(TR), where T is the training set size

Mix-Hier is O(CR), where C is the size of the vocabulary

R is the number of image regions per image.

Complexity is measured in seconds to annotate a new image.





Retrieval Results

First five ranked images for “mountain”, “pool”, “blooms”, and “tiger”




Retrieval Results

Mean Average Precision

For each word w_i, find all n_a,i images that have been automatically annotated with word w_i.

Out of the n_a,i images, let n_c,i be the number of images that have also been annotated with w_i by humans.

The precision of w_i is n_c,i / n_a,i.

If we have L words in our vocabulary, the mean average precision is:

MAP = (1/L) Σ_{i=1..L} n_c,i / n_a,i
Mix-Hier does 40% better on words with positive recall.

Outline

Notation and Problem Statement


Three General Approaches to Image Annotation

1. Supervised One vs. All (OVA) Models

2. Unsupervised Models using Latent Variables

3. Supervised M-ary Model


Estimating P(image features|words)


Experimental Setup and Results


Automatic Music Annotation



Automatic Music Annotation

Annotation: Given a song, what are the words that describe the music?


Automatic Music Reviews


Retrieval: Given a text query, what are the songs that are best described by the query?


Song Recommendation, playlist generation, music retrieval


Feature extraction involves applying filters to digital audio signals


Fourier, Wavelet, Gammatone are common filterbank transforms


Music may be “more difficult” to annotate since music is inherently subjective.


- Music evokes different thoughts and feelings in different listeners

- An individual’s experience with music changes all the time

- All music is art, unlike most digital images.



- The Corel data set consists of concrete “object” and “landscape” scenes

- A comparable image dataset might focus on Modern Art (Pollock, Mondrian, Dali)

Automatic Music Annotation

Computer Hearing

(aka Machine Listening, Computer Audition):




Music is one subdomain of sound


Sound Effects, Human Speech, Animal Vocalizations, and Environmental Sounds all represent other subdomains of sound



Annotation is one problem


Query-by-humming, Audio Monitoring, Sound Segmentation, and Speech-to-Text are examples of other Computer Hearing problems



Automatic Music Annotation


Computer Hearing

and
Computer Vision

are closely related:

1. Large public and private databases exist that are rapidly growing in size

2. Digital Medium

Sound is 2D: intensity (amplitude) & time, or frequency & magnitude

Sound is often represented in 3D: magnitude, time, and frequency

Image is 3D: 2 spatial dimensions and an intensity (color)

Video is 4D: 2 spatial dimensions, an intensity, and time

3. Video is comprised of both images and sound

4. Feature extraction techniques are similar

Applying filters to digital media

Works Cited:

Carneiro, Vasconcelos. “Formulating Semantic Image Annotation as a Supervised Learning Problem” (CVPR ’05)

Vasconcelos. “Image Indexing with Mixture Hierarchies” (CVPR ’01)

Feng, Manmatha, Lavrenko. “Multiple Bernoulli Relevance Models for Image and Video Annotation” (CVPR ’04)

Blei, Jordan. “Modeling Annotated Data” (SIGIR ’03)