MACHINE LEARNING FOR MULTIMEDIA

wonderfuldistinctΤεχνίτη Νοημοσύνη και Ρομποτική

16 Οκτ 2013 (πριν από 3 χρόνια και 7 μήνες)

56 εμφανίσεις

MACHINE LEARNING FOR MULTIMEDIA
INFORMATION RETRIEVAL

Zengchang

Qin (zengchang.qin@gmail.com)


Intelligent Computing and Machine Learning

Beihang

University


IHSMC, Hangzhou, Aug. 2013

About Us

ICMLL @ BUAA: icmll.buaa.edu.cn


Outline



Motivation & Introduction


Semantic Gaps


Text Modeling


Content
-
based Image Retrieval


Cross
-
modal Information Retrieval


Future Work


Motivation




Massive explosion of
“content”

on the
web.


Content rich in
multiple modalities



Text, Images, Videos, Music etc.


There is a need for retrieval systems
that are robust to any modalities.


Cross modal text query,
eg
. retrieval of
images from
photoblogs

using textual
query.


Finding
images

to go along with
a text
article


Finding music
to enhance videos.


Position

an image in the text.


Etc.






Motivation




Current

retrieval systems are
predominantly
uni
-
modal
.


The
query
and

retrieved results
are from
the
same modality







Is
Google Image search
cross
-
modal
retrieval?


No
,
text is matched to text metadata

for
the image


The
operation would fail
, in absence of text
modality for the retrieval set.











Text

Images

Music

Videos

Text

Images

Music

Videos

BBC’s Problem



Related Works




Several
multi
-
modal

systems have been proposed
[TRECVID, ImageCLEF,
Iria’09, Wang’09, Escalante’08, Pham’07, Snoek’05, Westerveld’02, etc.]


Given a query consisting of
multiple modalities
, retrieve examples containing the
same multiple modalities
.


Eg.
Combining the modalities
into a single modality,
combining the outputs
of
multiple uni
-
modal systems.




Annotations systems
[TRECVID,
ImageCLEF
, Carneiro’07, Feng’04, Lavrenko’03,
Barnard’03, etc]


Given a query from a
modality
(say
image
), assign
text labels
.


Are
true cross
-
modal systems
.


However
, text modality
is constrained to a few keywords
.


Text

Images

Music

Videos

Text

Images

Music

Videos

Cross
-
modality




Given
query

from
modality A, retrieve
results
from modality B
.


The query and retrieved items
are not required
to share a
common modality.






In this work


I restrict to
text and image modalities


Although
similar ideas can be applied to other modalities
.


Thus,


the
retrieval of text
in response

to a
query image.


And, the
retrieval of images
in

response to a
query text.














Text

Images

Music

Videos

Text

Images

Music

Videos

.

.

.

SEMANTIC GAP

An Example


Search Engine


An Example


Search Engine


An Example


Search Engine


Conceptual Idea



Multimedia Information



Multimedia information is universal. The basic idea is how
to convey the semantic concept hidden in different
modalities.


Semantic Gap




P
opulation of the Capital of
China.


P
opulation, Capital, China

TEXT MODELING

Road Map




D(1)

D(2)

D(3)


… …

D(n
-
1)


D(n)

Query (Q)


Rock


(geology, music)?


Context could help to solve
the polysemy of words.



Similarity:

Latent Semantic Index (Analysis)






A is the original term
-
by
-
document matrix, S is
a

diagonal matrix,
T and D are
o
rthonormal matrices

Ref: K.
Gimpel

(2006), Modeling topics, CMU Technical Report

LSI Results





Generative Models




Topics for generating observable words

Ref: M.
Steyvers
, and T. Griffiths (2004), Finding scientific topics, PNAS, 101 (sup. 1), 5228
-
5235.

Simplex Representation





Topic models
--
hierarchical Bayesian models for language
modeling and document analysis.


A document can be represented by a mixture of latent
topics and each topic is a distribution over the vocabulary.

Topic Model



Ref: D.
Blei
, A. Ng, and M. Jordan (2003), Latent
Dirichlet

allocation, Journal of Machine Learning
Research, 3:993
-
1022 (cited over 3107, from Google scholar)

LDA
-

Latent
Dirichlet

Allocation




Latent Dirichlet Allocation (LDA) is one of the most popular
and commonly used topic models

Ref: D.
Blei
, A. Ng, and M. Jordan (2003), Latent
Dirichlet

allocation, Journal of Machine Learning
Research, 3:993
-
1022 (cited over 3107, from Google scholar)

Semantic Distance










D(1)

D(2)

D(3)


… …

D(n
-
1)


D(n)

Query (Q)

Ref: Z. Qin, M.
Thint

and Z. Huang (2009), IEA/AIE, LNCS 5579, pp. 103
-
112.

KL
-
Distance





Query can be extended using WordNet to

Improve the quality of
t
opic modeling.

Chinese Language





Its basic structural unit is character instead of word.


Chinese words are written without spaces between them




计算机会下象棋





Computer can play chess




(Computing the chance of playing chess)


Character is expressive in meaning though word has more
specified meaning.





” (illness) “

” (person) “
病人
”(patient)


If two words share the same character, they potentially
have sort of semantic relations.






” (game) “
锦标赛
(tournament)”


Ref: T.
Xu

(2001), Fundamental structural principles of Chinese semantic syntax in terms of
Chinese characters, Applied Linguistics, 1, 3
-
13.


Test Data




NEWS1W:


This corpus is a news archive for classification provided by the
Sogou

laboratory.





We selected 10000 documents with 1000 documents per class from
the corpus as our experiment data which is named as NEWS1W



BIL3200:


A Chinese
-
English bilingual corpus collected from the Yeeyan.com.
We select 3200 documents for experiments and named the corpus as
BIL3200

Chinese Characters and Words



Bilingual Experiments




Train LDA model based on Chinese character, Chinese word and English respectively.


Extracted topics on bilingual corpus BIL3200

Top 20 terms of 3
topics extracted
independently from 3
models.


These 3 topics are
semantically close
which are talking
about internet service
or technology

Word
-
Character Relations




Former comparison experiments showed topic models based on Chinese
character can also capture the semantic meaning of text.



Chinese words “


”(study) and “
学生
”(student) are literally related by
sharing the same character “

”.



And these two words are semantically related and may have a high
probability to occur in the same context.



New words are created constantly while the characters constitute the new
word probably already appeared in the history docs



We proposed Chinese character
-
word topic model (CWTM) to consider
the character
-
word relation by placing an asymmetric prior on the topic
-
word distribution of the standard LDA.



Word
-
Character Relation Model



The document
-
topic (left) part of the model is exactly the same as LDA.


--

k
th

topic over Chinese word

--

k
th

topic over Chinese character

W
d,n
--

n
th

word in the d
th

document.

R
--

Character
-
Word relation



Ref
:
Q. Zhao, Z. Qin and T. Wan (2011), ICONIP.

Discussions





Some of the words relevant to sports, even not appeared in the
training data, were assigned relatively higher probabilities in
topics about sports by CWTM.



For example, the word “
主客

” (home and away), though
never appeared in training documents, its component
characters “

”, “

”, “

”are the components of those words
appeared in training data. Therefore, it obtained a higher
probability in CWTM.



On the other hand, not surprisingly, none of these words were
ranked into top1000 words in the corresponding topic
extracted by standard LDA model.

Topics





Classification Results



Topic
-
based Browser



CONTENT
-
BASED IMAGE RETRIEVAL

Bag
-
of
-
Features






Ref: X. Yuan, J. Yu, Z. Qin and T. Wan, A SIFT
-
LBP image retrieval model based on a bag of
features,

ICIP 2011

SIFT
-
LBP Feature



Our Framework

Retrieval Results

Ref: Jing Yu,
Zengchang

Qin, Tao Wan and Xi Zhang
, Feature Integration Analysis of Bag
-
of
-
Features Model for Image Retrieval
,
Neurocomputing
., 2013.

What Color is an Object?



Single Label and Multi
-
labels



CROSS
-
MODAL INFORMATION RETRIEVAL

Basic Idea




No natural correspondence
between representations of different
modalities.


For example, we use
Bag
-
of
-
Features

and
Topic Model
to represent



images and texts, respectively.


Images: vectors over
visual textures
( )


Text: vectors of
topics
( )











How do we compute
similarity
?












Text Space

Like

most

of

the

UK,

the

Manchester

area

mobilised

extensively

during

World

War

II
.

For

example,

casting

and

machining

expertise

at

Beyer,

Peacock

and

Company's

locomotive

works

in

Gorton

was

switched

to

bomb

making
;

Dunlop's

rubber

works

in

Chorlton
-
on
-
Medlock

made

barrage

balloons
;

Image Space

Martin

Luther

King's

presence

in

Birmingham

was

not

welcomed

by

all

in

the

black

community
.

A

black

attorney

was

quoted

in

''Time''

magazine

as

saying,

"The

new

administration

should

have

been

given

a

chance

to

confer

with

the

various

groups

interested

in

change
.



In

1920
,

at

the

age

of

20
,

Coward

starred

in

his

own

play,

the

light

comedy

''I'll

Leave

It

to

You''
.

After

a

tryout

in

Manchester,

it

opened

in

London

at

the

New

Theatre

(renamed

the

Noël

Coward

Theatre

in

2006
),

his

first

full
-
length

play

in

the

West

End
.
Thaxter,

John
.

British

Theatre

Guide,

2009

Neville

Cardus's

praise

in

''The

Manchester

Guardian''


The

population

of

Turkey

stood

at

71
.
5

million

with

a

growth

rate

of

1
.
31
%

per

annum,

based

on

the

2008

Census
.

It

has

an

average

population

density

of

92

persons

per

km²
.

The

proportion

of

the

population

residing

in

urban

areas

is

70
.
5
%
.

People

within

the

15

64

age

group

constitute

66
.
5
%

of

the

total

population,

the

0

14

age

group

corresponds

26
.
4
%

of

th

SkyBomb

Terrorist

India

Success

Weather

Prime

President

Navy

Army

Sun

Books

Music

Food

Poverty

Iran

America


In

1920
,

at

the

age

of

20
,

Coward

starred

in

his

own

play,

the

light

comedy

''I'll

Leave

It

to

You''
.

After

a

tryout

in

Manchester,

it

opened

in

London

at

the

?

?

Semantic Space




Learn mappings
that

maps different
modalities
into

intermediate spaces .












Given an image query in the Topics of Features Space, the cross
-
modal
retrieval system finds the nearest neighbor of text in the Semantic Space.


Similarly for image query.


The task now is to design these mappings
.





System Overview



Original Text

Original Image

Correlated
Semantic Space

Semantic
Concept 1

Semantic
Concept 2

Semantic
Concept V

Text Classifiers

Image Classifiers

Bag
-
of
-
Features

Topic Model

Image Space

Text Space

Retrieval Procedure

Sample Results



Data Sample



Despite agreeing on most issues regarding the protection of national parks, friction between the NPA
and NPS was seemingly unavoidable. Mather and Yard disagreed on many issues; whereas Mather
was not interested in the protection of wildlife and accepted the Biological Survey's efforts to
exterminate predators within parks, Yard vehemently criticized the program as early as 1924 (Fox, p.
204). Yard was also highly critical of Mather's administration of the parks. Mather advocated plush
accommodations, city comforts and various entertainments to encourage park visitation. These plans
clashed with Yard's ideals, and he considered such urbanization of the nation's parks misguided. While
visiting Yosemite National Park in 1926, he stated that the valley was "lost" after finding crowds,
automobiles, jazz music and even a bear show (Sutter, p. 126). In 1924, the United States Forest
Service initiated a program to set aside "primitive areas" in the national forests that protected
wilderness while opening it to use. (…)


A number of variants were built on the same chassis as the TAM tank. The original program called for
the design of an infantry fighting vehicle, and in 1977 the program finished manufacturing the
prototype of the ''Vehículo de Combate Transporte de Personal'' (Personnel Transport Combat Vehicle),
or VCTP. The VCTP is able to transport a squad of 12 men, including the squad leader and nine riflemen.
The squad leader is situated in the turret of the vehicle; one rifleman sits behind him and another six
are seated in the chassis, the eighth manning the hull machine gun and the ninth situated in the turret
with the gunner. All personnel can fire their weapons from inside the vehicle, and the VCTP's turret is
armed with Rheinmetall's Rh
-
202 20 millimeter (.79 in) autocannon. The VCTP holds 880 rounds for
the autocannon, including subcaliber armor
-
piercing DM63 rounds. It is also armed with a 7.62
millimeter FN MAG 60
-
20 mounted on the turret roof. Infantry can dismount through a door on the rear
of the hull. (…)

Warfare

The population of Turkey stood at 71.5 million with a growth rate of 1.31% per annum, based on the
2008 Census. It has an average population density of 92 persons per km². The proportion of the
population residing in urban areas is 70.5%. People within the 15

64 age group constitute 66.5% of the
total population, the 0

14 age group corresponds 26.4% of the population, while 65 years and higher of
age correspond to 7.1% of the total population. Life expectancy stands at 70.67 years for men and 75.73
years for women, with an overall average of 73.14 years for the populace as a whole. Education is
compulsory and free from ages 6 to 15. The literacy rate is 95.3% for men and 79.6% for women, with
an overall average of 87.4%. The low figures for women are mainly due to the traditional customs of the
Arabs and Kurds who live in the southeastern provinces of the country. Article 66 of the Turkish
Constitution defines a "Turk" as "anyone who is bound to the Turkish state through the bond of
citizenship"; therefore, the legal use of the term "Turkish" as a citizen of Turkey is different from the
ethnic definition. (…)

Geography and Places

Culture and Society

Data for Experiments




Test the model on a dataset
English Wikipedia
built
by UCSD SVCL using
Wikipedia’s featured articles


2700 articles
, selected and reviewed by Wikipedia’s editors
since 2009.


The articles are
accompanied by one or more pictures
from
the Wikimedia Commons


Each
article is split into sections
that may or may not have an
assigned image (sections without images were dropped)


Each
article is categorized
into one of 29 categories (only the
10 most populated categories were chosen)


Each ‘
document’
in the proposed set is a
‘section of
Wikipedia featured article
’ and its ‘
associated image’.





Test the model on a dataset
TVGraz

built by I. Khan
et. al. using
Google image search results.


Containing
2596 image
-
text pairs

coming
from 10 categories
.




TVGraz




TVGraz

, built by I. Khan et. al. using
Google image search results
, contains
2596 image
-
text pairs

coming from
10 categories .




Experiments on the English Datasets

Mean Average Precision (MAP) of the Retrieval Performance

[1]

[1]

[1] N. Rasiwasia, J. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos.


A new approach to cross
-
modal multimediaretrieval[A]. In ACM Multimedia[C], pp. 251
-
260,2010.

Experimental Results

Experiments on the English Datasets

Comparison of retrieval MAP on all the categories on the

En
-
Wikipedia dataset.
(Left) MAP of the query images; (Middle) MAP

of the query texts; (Right)
Average performance for both the query

images an the query texts.

Experimental Results

Experiments on the English Datasets

Comparison of retrieval MAP on all the categories on the

TVGraz dataset. (Left)
MAP of the query images; (Middle) MAP

of the query texts; (Right) Average
performance for both the query

images an the query texts.

Experimental Results

Visually Comparision of TCM for image query task on the

En
-
Wikipedia dataset.

Experiments on the English Datasets

Sample Results

Visually Comparision of TCM for text query task on the

En
-
Wikipedia dataset.

Experiments on the English Datasets

Sample Results

Study on the Chinese Case



Most of the models proposed for cross
-
modal information retrieval
are focused on
English corpus

that can be easily extended to other alphabetic languages as their
basic units are words




However, the
basic units in Chinese are not words, but characters.




Both words and characters
play indispensable roles in building the Chinese
structure.


An example of compound words of using “

” as the center character in Chinese .

包(
bag


钱(
浯湥y






扯潫






lea瑨er



……



钱包(
wa汬et



书包(
sc桯潬扡h



皮包


扲be晣ase



……



Chinese Experiments

Study in the Chinese Case

Semantic illustration of word
-
based and character
-
based Chinese corpus representations.



Text representation
: texts are represented as
the distributions
over topics of words or topics of characters

based on Latent
Dirichlet

Allocation (LDA).

Chinese Experiments


The curves show the performance with increasing topic numbers.


The model trained by
character
-
based

topic model achieved
better

performance
than
word
-
based

one.





Topic Model

Image
Query

Text
Query

Average
Performance

Word
-
Based

0.241

0.281

0.261

Character
-
Based

0.314

0.298

0.306

Chinese Experiments

Visually Comparision of TCM for image query task on the

Ch
-
Wikipedia dataset.

Sample Result in Chinese

Visually Comparision of TCM for text query task on the

Ch
-
Wikipedia dataset.

Sample Result in Chinese

Future Work: Nonparametric
Bayesian
Model for
Crowdsoucing



Method

[9
]
Renjie

liao
,
Zengchang

Qin,
Dual
-
view
Dirichlet

Process Mixture Models for Cross
-
modal Data Analysis. Submitted to
PAMI.


Take the advantages of
crowdsourcing properties

to build the correlation
between texts and images.

Motivation

We propose Dual
-
view
Dirichlet Process Mixture
Models for analyzing cross
-
modal data.

THANK YOU!