Event Identification in Social Media - Columbia University


EVENT IDENTIFICATION IN SOCIAL MEDIA


Hila Becker, Luis Gravano (Columbia University)

Mor Naaman (Rutgers University)

Social Media Sites Host Many “Event” Documents

Photo-sharing: Flickr

Video-sharing: YouTube


Social networking: Facebook




“Event” = something that occurs at a certain time in a certain place [Yang et al. ’99]


Popular, widely known events

Presidential Inauguration, Thanksgiving Day Parade


Smaller events, without traditional news coverage

Local food drive, street fair







Social media documents for “All Points West” festival,
Liberty State Park, New Jersey, 8/8/08

Identifying Events and Associated Social Media Documents


Applications


Event search and browsing


Local search






General approach: group similar documents via clustering

Each cluster corresponds to one event and its associated social media documents



Event Identification: Challenges


Uneven data quality


Missing, short, uninformative text


… but revealing structured context available: tags, date/time, geo-coordinates


Scalability


Dynamic data stream of event information


Unknown number of events


Number of clusters is a required input for many clustering algorithms


Difficult to estimate



Clustering Social Media Documents


Social media document representation


Social media document similarity


Social media document clustering


Clustering task: definition


Ensemble algorithm: combining multiple clustering results


Preliminary evaluation



Social Media Document Representation

Title

Description

Tags

Date/Time

Location

All-Text


Social Media Document Similarity


Text: tf-idf weights, cosine similarity

[Figure: document features Title, Description, Tags, Date/Time, Location, and All-Text; Date/Time and Location each appear both as keywords and as proximity values; points labeled A and B along a time axis]

Location: geo-coordinate proximity

Time: proximity in minutes
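
The three similarity types on this slide could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the one-year time window, and the 100 km distance cutoff are all assumptions chosen for the example, and the idf variant (log of N over document frequency) is one of several common choices.

```python
import math
from collections import Counter

MINUTES_PER_YEAR = 365 * 24 * 60  # assumed normalization window for time proximity


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def time_similarity(minutes_a, minutes_b, window=MINUTES_PER_YEAR):
    """Proximity in minutes, scaled to [0, 1]; 0 beyond the assumed window."""
    return max(0.0, 1.0 - abs(minutes_a - minutes_b) / window)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def location_similarity(a, b, max_km=100.0):
    """Geo-coordinate proximity scaled to [0, 1]; max_km is a hypothetical cutoff."""
    return max(0.0, 1.0 - haversine_km(a[0], a[1], b[0], b[1]) / max_km)
```

Each function returns a score in [0, 1], so the scores can be compared or combined across feature types.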

Social Media Document Clustering Framework

[Diagram: social media documents → document feature representation → consensus function (combine ensemble similarities) → event clusters]

Clustering: Ensemble Algorithm

Clusterers: C_title, C_tags, C_time

Weights: W_title, W_tags, W_time (learned in a training step)

Consensus function f(C, W) combines the clusterers and their weights into the ensemble clustering solution
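
One possible form of f(C, W) is a weighted vote over the clusterers: for a pair of documents, each clusterer contributes its weight if it placed both documents in the same cluster. The sketch below assumes that form; the function name, the dictionary layout, and the example weights are hypothetical and only serve the illustration.

```python
def consensus_score(doc_a, doc_b, assignments, weights):
    """Weighted fraction of clusterers that put doc_a and doc_b together.

    assignments maps clusterer name -> {document id -> cluster id}.
    weights maps clusterer name -> non-negative weight.
    """
    total = sum(weights.values())
    agree = sum(
        w
        for name, w in weights.items()
        if assignments[name][doc_a] == assignments[name][doc_b]
    )
    return agree / total


# Illustrative (made-up) cluster assignments and weights for three photos.
assignments = {
    "title": {"p1": 0, "p2": 0, "p3": 1},
    "tags": {"p1": 0, "p2": 0, "p3": 0},
    "time": {"p1": 0, "p2": 1, "p3": 1},
}
weights = {"title": 0.2, "tags": 0.5, "time": 0.3}

score_p1_p2 = consensus_score("p1", "p2", assignments, weights)
```

A score near 1 means the heavily weighted clusterers agree that the two documents belong to the same event; a final clustering step can then threshold or cluster on these scores.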

Clustering: Measuring Quality


Homogeneous clusters









Complete clusters




Metric: Normalized Mutual Information (NMI)

Shared information between clustering solution and “ground truth”
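
As a rough illustration of the metric, NMI can be computed from two label lists as below. The sketch assumes the normalization 2·I(C;T) / (H(C) + H(T)), which is one of several conventions in use; the function names are chosen for this example only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(pred, truth):
    """Mutual information between two labelings of the same documents."""
    n = len(pred)
    count_pred, count_truth = Counter(pred), Counter(truth)
    joint = Counter(zip(pred, truth))
    return sum(
        (nij / n) * math.log(nij * n / (count_pred[p] * count_truth[t]))
        for (p, t), nij in joint.items()
    )


def nmi(pred, truth):
    """NMI normalized by the mean of the two entropies (assumed convention)."""
    denom = entropy(pred) + entropy(truth)
    if denom == 0.0:
        # Both labelings are a single cluster; treated as perfect agreement here.
        return 1.0
    return 2.0 * mutual_information(pred, truth) / denom
```

The score does not depend on how clusters are named, so a clustering that matches the ground truth up to relabeling scores 1, and one that is statistically independent of it scores 0.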


Experimental Setup


Data: >270K Flickr photos


Event labels from Yahoo!’s “upcoming” event database


Split into 3 parts for training/validation/testing


Clusterers: single pass algorithm with centroid similarity


Weighting scheme: Normalized Mutual Information (NMI) scores on validation set


Consensus function: weighted average of clusterers’ binary predictions


Final prediction step: single pass clustering algorithm
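
A single pass algorithm with centroid similarity might look like the sketch below: each document is compared against the existing cluster centroids and either joins the most similar cluster or starts a new one. The threshold value, the sparse-dict vector format, and the function names are assumptions for illustration; the slides do not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass_cluster(vectors, threshold, similarity=cosine):
    """Assign each vector, in order, to the best-matching centroid or a new cluster."""
    clusters = []  # each entry: {"centroid": dict, "members": [document indices]}
    for index, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No centroid is similar enough: start a new cluster.
            clusters.append({"centroid": dict(vec), "members": [index]})
            continue
        # Update the centroid as the running mean of its member vectors.
        size = len(best["members"])
        terms = set(best["centroid"]) | set(vec)
        best["centroid"] = {
            t: (best["centroid"].get(t, 0.0) * size + vec.get(t, 0.0)) / (size + 1)
            for t in terms
        }
        best["members"].append(index)
    return [cluster["members"] for cluster in clusters]


docs = [{"a": 1.0}, {"a": 1.0, "b": 0.1}, {"c": 1.0}, {"c": 0.9}]
clusters = single_pass_cluster(docs, threshold=0.5)
```

Because every document is seen once and the number of clusters is not fixed in advance, this style of algorithm suits a stream of documents with an unknown number of events, at the cost of sensitivity to document order and to the threshold.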



Preliminary Evaluation Results


Individual clusterer performance


Highest NMI: Tags, All-Text


Lowest NMI: Description, Title


Ensemble performance, compared against all individual clusterers


Highest overall performance in terms of NMI


More homogeneous clusters: each event is spread over fewer clusters



Details in paper

Document similarity metric


Ensemble approach


Weight assignment


Choice of clusterers


Train a classifier to predict document similarity


Features correspond to similarity scores


All-text, title, tags, time, location, etc.


Numeric values in [0,1]


State-of-the-art classifiers: SVM, Logistic Regression, …
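
To make the classifier idea concrete, the sketch below trains a small logistic regression by gradient descent on pairs of documents, where each pair is described by similarity scores in [0, 1] and labeled 1 if both documents belong to the same event. The training pairs, learning rate, epoch count, and function names are made up for illustration; a real system would more likely use an existing library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xj for w, xj in zip(weights, x)) + bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_same_event(weights, bias, x):
    """Predicted probability that a document pair belongs to the same event."""
    return sigmoid(sum(w * xj for w, xj in zip(weights, x)) + bias)


# Hypothetical pairs: [all-text similarity, tag similarity, time similarity].
pairs = [[0.9, 0.8, 0.95], [0.8, 0.9, 0.9], [0.1, 0.2, 0.3], [0.2, 0.1, 0.1]]
same_event = [1, 1, 0, 0]
weights, bias = train_logistic(pairs, same_event)
```

The predicted probability can then serve as a learned document similarity in place of a hand-weighted combination of the individual scores.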







Future Work: Alternative Choices



Final clustering step


Apply graph partitioning algorithms


Requires estimating the number of clusters


Evaluation metrics: beyond NMI


Datasets


Flickr, LastFM, YouTube


Exploit social network connections


Conclusions


Identified events and their corresponding social media documents


Proposed a clustering solution


Leveraged different representations of social media documents


Employed various social media similarity metrics


Developed a weighted ensemble clustering approach


Reported preliminary results of our event identification approach on a large-scale dataset of Flickr photographs

