Discovering Context for Personal Photos


Introduction

This report addresses the problem of tagging faces in personal photos. Since the advent of social networking websites, personal photographs have become first-class citizens of the web. Applications of photos have ranged from personal experience sharing on Facebook to studying ecological changes with large datasets [1]. Facebook has reported that 300 million photos are uploaded to its site on a daily basis, and this number steadily increases with time. Instagram handles 5 million mobile phone photo uploads per day. A major factor in the popularity of websites like Facebook is photos [6]. People enjoy seeing pictures of themselves and their friends. As of now, this face tagging is done almost entirely by the owners themselves.

In this report, the CueNet framework is presented, which utilizes contextual information to tag people in a user's photos. The people present in such photos could be friends from a social network, friends of friends, attendees of a professional event like a conference, or attendees of a personal event like a wedding. The CueNet framework models smartphone apps and websites like LinkedIn or Facebook as data sources which describe a user's context space. This space contains information about the user, her social connections, the places she visits and the events occurring around her. The CueNet framework encapsulates a Context Discovery Algorithm which identifies relevant parts of the context space for a given photo containing EXIF metadata. Using this context, out-of-the-box face verification techniques are used to verify whether candidate people are present in the photo. When a new person is associated with the photo, its description changes, allowing CueNet to explore for further context. This process continues until all the people in the photo are tagged.

The First-K Discoveries Problem

Given:

1. A photo H with a set of faces P = {p1, p2, …, pm},

2. A set of people L = {l1, l2, l3, …, ln},

3. A real-valued threshold d (0 < d ≤ 1),

4. A constant k (0 < k << |L|).

We define the tagging confidence cij as the confidence with which a person li ∈ L can be associated with a face pj. This is mathematically expressed as:

cij = C(li → pj), where 0 ≤ cij ≤ 1

Here, C is any real-valued function to compute the certainty.

Given these definitions, we state the first-K discoveries problem as follows. For each face pi in H, find a subset S of L where the tagging confidence of each person in S is greater than d. The cardinality of S must not be greater than k.

In Figure 1, a photo H has a face p2. L has 7 people (l1 … l7). A simple way of verifying who is in the photo is to iterate through L until k (= 2) discoveries are made. The tagging confidence for each person in L is listed next to it. If we assume d = 0.5, then the algorithm discards l1, l2, l3 and l4 as possible tags for p2. Since l5 and l6 have confidence values greater than d, they are considered valid tags. At this point, l7 can be ignored, even though it has the highest confidence value. As we will see later, the discovery algorithm uses context to order this list such that the relevant tags are found early in the list.


Figure 1: In a naïve approach to the First-K Discoveries Problem, after finding that l5 and l6 are valid tags, the algorithm need not proceed to l7, as k = 2.
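
To make the naïve baseline concrete, here is a minimal Python sketch; the function and parameter names are our own, as the report gives no implementation:

    def naive_first_k(face, L, C, d, k):
        """Naive approach to the First-K Discoveries Problem: scan L in
        arbitrary order and stop after k tags whose confidence exceeds d."""
        S = []
        for person in L:                 # worst case: verifies every person in L
            if C(person, face) > d:      # one face verification per candidate
                S.append(person)
                if len(S) == k:          # first k discoveries made; stop early
                    break
        return S

With d = 0.5 and k = 2, this loop discards l1 through l4, accepts l5 and l6, and never reaches l7, exactly as in Figure 1.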

Practical Difficulties

The above naïve approach, verifying each and every person in L until k answers are obtained, has certain practical difficulties: it is both expensive and inaccurate. As an extreme example, if we assume that there are 1 million candidates in L, and a single verification takes about 0.5 ms, it would take over 11 days to complete verification. If the tagging confidence threshold is low (allowing a larger dissimilarity between the candidate's photo and the input photo's features), this would result in a large number of false positives.

The culprit here is the ever-increasing size of L. An average user has hundreds of friends on Facebook. There might be a hundred other connections through other data sources. Moreover, new people are always encountered during visits to events like conferences, meetings, weddings or parties. They may never be directly connected to the user through any networking websites.

With the context discovery algorithm, we progressively discover relevant entities that might appear in the picture. This helps us overcome both obstacles. Users can start seeing results more quickly, and since the whole list is verified only in the worst case, the accuracy is higher. The next section explains this procedure.

The CueNet Framework

The context space consists of a large number of events and entities. The CueNet framework models the photograph as an event graph: a directed acyclic graph where the nodes are events, entities and associated literal values, and the edges are relationships among them. The framework constructs and executes multiple queries which associate minimal parts of the context space with this graph. The number of candidates increases with such associations. Upon each update to the graph, the face verification module is invoked to check whether the update provides any useful information in guessing who might be in the picture. Since each update is small relative to the entire list of candidates, it is quickly processed by the verifier. This progressive navigation through the context space keeps the list of candidates extremely small, and at the same time increases the overall knowledge about the photograph.
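
As an illustration of the event graph model, here is a minimal sketch in Python; the class, field and relationship names are our own assumptions rather than CueNet's actual code:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        kind: str                                      # "event" or "entity"
        label: str                                     # e.g. "photo-capture", "Anne"
        literals: dict = field(default_factory=dict)   # e.g. EXIF time and GPS values

    @dataclass
    class EventGraph:
        nodes: list = field(default_factory=list)
        edges: list = field(default_factory=list)      # (source, relation, target) triples

        def add_edge(self, src, relation, dst):
            self.edges.append((src, relation, dst))

    # The photo becomes a photo-capture event carrying EXIF literals,
    # and the owner is connected to it by a participant-in edge.
    g = EventGraph()
    capture = Node("event", "photo-capture", {"time": "14:00", "gps": (33.64, -117.84)})
    anne = Node("entity", "Anne")
    g.nodes += [capture, anne]
    g.add_edge(anne, "participant-in", capture)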

Specifically, using the EXIF metadata and user information of a photograph, CueNet constructs and executes two types of queries. The first uses known event information to query sources about sub-events or participating entities. The second uses known entity information to query sources about all other events those entities participate in. For each query, the sources which can accept the current query predicates and retrieve the required data are identified. By querying these sources, the framework learns of more entities who could be participating in the photo capture event. Each such candidate is verified to be in the photo or not. With each update to the photo information, we can query the data sources with a new set of predicates, which gives a different set of results than before. The above steps are repeated until all the faces have been tagged. The termination criterion of the first-k discoveries problem is finding at least one and at most k tags for each person.
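
Schematically, the two query types can be written as templates over the ontology's properties; the cue: prefix and IRI below are placeholders of our own, and CueNet expresses such queries in a subset of SPARQL [4]:

    PREFIX_DECL = "PREFIX cue: <http://example.org/cuenet#>"

    # Query type 1: from a known event, discover sub-events and participants.
    # In use, ?event is replaced by the IRI of the known event node.
    SUB_EVENTS   = PREFIX_DECL + " SELECT ?sub WHERE { ?sub cue:sub-event-of ?event . }"
    PARTICIPANTS = PREFIX_DECL + " SELECT ?who WHERE { ?who cue:participant-in ?event . }"

    # Query type 2: from a known entity, discover all events it participates in.
    # In use, ?entity is replaced by the IRI of the known entity node.
    EVENTS_OF    = PREFIX_DECL + " SELECT ?event WHERE { ?entity cue:participant-in ?event . }"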

Here is an example to illustrate the discovery process. Anne takes a photo during a professional event. With the EXIF GPS tag, CueNet can estimate the place where the photo was taken. Using Anne's profile at LinkedIn, the framework is able to declare that the photo was taken at her current work location. With the timestamp, it can make a further inference that the photo was taken during work hours. Using this extended knowledge about the photograph, CueNet searches two calendars to gain further context about the event. The first is Anne's personal calendar. If this calendar provides a title like "Meet Bob and Carey" or "Meeting with Database Group", it can infer the event type to be a meeting. Since the calendar information provides two possible participants, the system checks if they are in the photo, and tags them appropriately if found. If the personal calendar does not provide any such context, then CueNet looks at the workplace calendar to search for events of importance to the whole organization. Let's say it finds a conference titled "International Multimedia Conference", whose schedule is also available. Given that Anne (from her LinkedIn profile) is interested in Multimedia Information Retrieval, she might be attending this event. The framework scans this schedule to see which session or talk is going on. Given the N parallel sessions at the time the photo was taken, it now needs additional context to extract the right session and talk. CueNet can verify whether the person in the photo is any of these speakers, or people who have co-authored papers with them, or any of the other conference attendees (with whom Anne maintains regular email communication, who are friends on a social network, or with whom she has authored papers herself).



Figure 2: The conceptual architecture of the CueNet framework.

Figure 2 shows the major components present in CueNet. The Ontological Event Models describe events and entities, and the different relations between them. These declared types are used to define the Data Source Interface, which provides access to different types of data sources. The context discovery algorithm is given a photo with EXIF metadata. The Verification Module consists of a database of people, their profile information and photos containing these people. When this module is presented with a candidate and the input photograph, it compares the features extracted from the candidate's photos and the input photo to compute the tagging confidence. We are currently using face.com [3] for our face comparison work.
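
As a sketch, the verification step can be expressed as follows, where compare stands for any pairwise face comparison backend (such as a call to the face.com API); the function and field names are our own:

    def verify(candidate, photo, compare, d):
        """Compare each of the candidate's stored photos against the input
        photo; accept the tag only if the best confidence exceeds d."""
        confidences = [compare(ref, photo) for ref in candidate.photos]
        return max(confidences, default=0.0) > d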

Figure 3 on the right shows the different components constituting the Data Source Interface. The Source Mappings allow one to declaratively add sources using a LISP dialect built exclusively for defining source attributes and their relations with classes and properties defined in the event models. Queries (written in a subset of SPARQL [4]) from the discovery algorithm are processed by the Query Engine, which analyzes the source descriptions to choose appropriate sources to query, and routes the query to the respective accessors (A1, A2, …, A4). An accessor transforms a given query to a form native to the data source. The results are transformed into triples and sent back to the query engine.

Figure 3: The Data Source Interface

A sample source mapping is presented in the Appendix.

Context Discovery Algorithm

Figure 4 below outlines the tail recursive discovery algorithm. The input to the algorithm is a photo (with EXIF tags) and an associated owner (the user). An event graph is created where each photo is modeled as a photo capture event. Each event and entity is a node in the event graph. Each event is associated with time and space attributes. All relationships are edges in this graph. All EXIF tags are literals, related to the photo with data property edges.


Figure 4: Pseudocode for the Tail Recursive Context Discovery Algorithm
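
As a concrete illustration of this initialization step, here is a minimal sketch building on the EventGraph structure sketched earlier; the EXIF field names are assumptions:

    def initial_event_graph(exif_tags, owner_name):
        """Model the photo as a photo-capture event node whose EXIF tags
        become literal values attached via data property edges, with the
        owner participating in the capture event."""
        g = EventGraph()
        capture = Node("event", "photo-capture",
                       literals={"time": exif_tags.get("DateTime"),
                                 "gps": exif_tags.get("GPSInfo")})
        owner = Node("entity", owner_name)
        g.nodes += [capture, owner]
        g.add_edge(owner, "participant-in", capture)
        return g, capture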

The event graph is traversed to produce a queue of entity and event nodes, which we shall refer to as DQ (the discovery queue). Two main functions, discover(node) and merge(), are then invoked upon each node. The discover method takes three different forms. The first, which takes as input the entire event graph, is shown in the figure above. The second and third accept a single node from this event graph and, depending on whether it is an event node or an entity node, discover context with its attributes. The discover(Event) function consults the ontology to find any known sub-events, and queries all data sources to find all sub-events, properties of this event and participants of the given event. On the other hand, the discover(Entity) function finds all the events the given entity is participating in. Once new events or entities have been found, they enter the merge stage. The merge function is responsible for merging duplicate events or entities (if needed) and joining events with their sub-events. Prune up is responsible for removing entities from an event when its sub-event lists them as a participant. Push down is a verification step taken when the number of entities in the parents of the photo-capture event is small: it tries to verify whether any of these entities is present in the photo. The maximum number of entities for which push down is initiated is around 3-5. On the other hand, if the count is larger, then we initiate the vote and verify method, which selects the candidates most likely to be in the photo. The verification runs only on the top ranked candidates.

The discovery queue is reconstructed and the algorithm is recursively invoked until all candidates are tagged.
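
Putting these pieces together, the recursion can be sketched as below. Here traverse, discover, merge, candidates, vote, verify and untagged are stand-ins for the components named above, bundled into an ops object; their signatures are our assumptions:

    def discover_context(graph, photo, ops, k=2, d=0.5):
        """A sketch of one pass of the tail recursive discovery loop."""
        new_data = False
        for node in ops.traverse(graph):            # the discovery queue DQ
            found = ops.discover(node)              # event or entity form
            new_data |= ops.merge(graph, found)     # dedupe, join sub-events, prune up
        candidates = ops.candidates(graph, photo)   # entities around the capture event
        if len(candidates) <= 5:                    # push down for small counts (3-5)
            tags = [c for c in candidates if ops.verify(c, photo, d)]
        else:                                       # vote and verify: top ranked only
            tags = [c for c in ops.vote(candidates)[:k] if ops.verify(c, photo, d)]
        if ops.untagged(photo) and new_data:        # faces remain and context grew
            return tags + discover_context(graph, photo, ops, k, d)
        return tags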

During the execution of the algorithm, it is possible that certain photos are never associated with any people. More precisely, an iteration of the algorithm produces no new data, and there are still untagged faces in the photo. At this point, the algorithm produces an impulse response. This is the result of voting over all known entities in the event graph. If there are no entities in the event graph, the voting algorithm falls back to the social relationships of the owner. Using the location attribute of the photo, these entities are ordered by their location (the location of an entity could mean the city where they were living at the time of photo capture, or, if available, any check-in information around the time of photo capture).
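
A small sketch of the location ordering used by the impulse response; the Euclidean distance and the entity fields are simplifying assumptions:

    import math

    def impulse_response(entities, photo_gps):
        """Order known entities by the distance between their last known
        location (home city, or a check-in near the capture time) and the
        photo's GPS position, so the nearest candidates are voted first."""
        def distance(entity):
            lat, lon = entity["location"]          # assumed (lat, lon) pair
            return math.hypot(lat - photo_gps[0], lon - photo_gps[1])
        return sorted(entities, key=distance)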

Next Steps

The following are some possible directions which can be undertaken:

1. Source Selection: Currently the framework views all sources equally. During an event discovery, all sources which could provide information about events are queried, whereas during an entity discovery, all sources which provide information about entities are queried. We will sample these sources to make better decisions on which sources to query when.

2. Improved Voting: Currently, voting is done exclusively based on location or event context. Prior occurrences in photos are not considered. [2] outlines a probabilistic approach to voting using social networking relations and prior co-occurrence. We want to apply such techniques to study the gains in accuracy and efficiency of the discovery algorithm.

3. Hierarchical Event Merge: This sub-problem is centered on querying multiple sources and associating their respective results with the "mother" event graph. Prior work on similar data structures and techniques includes [7, 8].

Appendix

Source Mapping

Following is a simplified conference event data source declaration.

(:source conferences
  (:attrs url name time location ltitle stitle)
  (:rel conf type-of conference)
  (:rel time type-of time-interval)
  (:rel loc type-of location)
  (:rel attendee type-of person)
  (:rel attendee participant-in conf)
  (:rel conf occurs-at location)
  (:rel conf occurs-during time)
  (:axioms
    (:map time time)
    (:map loc location)
    (:map conf.title ltitle)
    (:map conf.name stitle)
    (:map conf.url url)
    (:map attendee.name name)))


A declaration comprises a single nested s-expression. We will refer to the first symbol in each expression as a keyword, and the following symbols as operands. The above declaration uses five keywords (source, attrs, rel, axioms, map). The source keyword is the root operator, and declares a unique name for the data source. The source mapper can be queried for finding accessors using this name. The attrs keyword is used to list the attributes of this source. Currently we assume a tuple-based representation, and each operand in the attrs expression maps to an element in the tuple. The rel keyword allows construction of a relationship graph where the nodes are instances of ontology concepts and the edges are the relationships described by this particular source. In the above example, we construct individuals conf, time, loc and attendee, which are instances of the conference, time-interval, location and person classes respectively. We further say that attendee is a participant of the conference, which occurs at location loc and occurs during the interval time. Finally, the mapping axioms are used to map nodes in the relationship graph to attributes of the data source. For example, the first axiom (specified using the map keyword) maps the time node to the time attribute. The third map expression creates a literal called title, and associates it with the conference node. The value of this literal comes from the ltitle attribute of the conference data source.
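
For illustration, a tiny reader is enough to load such declarations into nested lists; this is a sketch, not the framework's actual implementation of the dialect:

    def parse_sexp(text):
        """Parse a nested s-expression, like the conference declaration
        above, into nested Python lists of symbol strings."""
        tokens = text.replace("(", " ( ").replace(")", " ) ").split()
        def read(pos):
            expr = []
            while tokens[pos] != ")":
                if tokens[pos] == "(":
                    sub, pos = read(pos + 1)
                    expr.append(sub)
                else:
                    expr.append(tokens[pos])
                    pos += 1
            return expr, pos + 1
        return read(1)[0]   # skip the outermost "("

For example, parse_sexp("(:rel conf type-of conference)") yields [':rel', 'conf', 'type-of', 'conference'].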

Event models

Table 1 shows screenshots of the Protégé environment with the different event and entity concepts and property declarations created for this project. To create our event models, we used DOLCE-Lite as the domain ontology. Various event classes like conference, dinner, concert and meeting, among others, were added. Object properties like occurs-at, occurs-at-n, occurs-during, occurs-during-n and sub-event-of were added. Their definitions are available in [5].

References

[1] Zhang, H., Korayem, M., Crandall, D., LeBuhn, G. Mining Photo-sharing Websites to Study Ecological Phenomena. International World Wide Web Conference, 2012.

[2] Stone, Z., Zickler, T., Darrell, T. Autotagging Facebook: Social network context improves photo annotation. Computer Vision and Pattern Recognition Workshops, 2008.

[3] Face Verification at Face.com: http://developers.face.com/docs/api/faces-recognize/

[4] SPARQL: http://www.w3.org/TR/rdf-sparql-query/

[5] Gupta, A. and Jain, R. Managing Event Information: Modeling, Retrieval, and Applications. Synthesis Lectures on Data Management, 2011.

[6] http://gigaom.com/2012/04/09/here-is-why-did-facebook-bought-instagram/

[7] Arge, L., De Berg, M., Haverkort, H.J., Yi, K. The Priority R-tree: a practically efficient and worst-case optimal R-tree. ACM SIGMOD, 2004.

[8] Gao, D., Jensen, C.S., Snodgrass, R.T., Soo, M.D. Join operations in temporal databases. VLDB, 2005.



Table 1: The figure on the left shows the different classes added to DOLCE-Lite. The figures on the right list the different object properties and data properties.