UMBC ebiquity research group

Data Management

31 January 2013
Understanding RSM: Relief Social Media

William Murnane, Anand Karandikar

15 September 2009

Objective

Build better sensors into emerging social media environments. These environments are increasingly important in Humanitarian Assistance and Disaster Relief (HADR) and Security, Stability, Transition and Reconstruction (SSTR) scenarios, providing real-time situational awareness.

Deliver an analytic toolkit that can be integrated into the Human, Social, Cultural and Behavioral (HSCB) computational infrastructure.


Project Overview

Joint venture by Lockheed Martin Advanced Technology Laboratories (LM ATL) and the University of Maryland, Baltimore County (UMBC).

Team members:

Prof. Finin, Prof. Joshi: Principal Faculty, UMBC, CS Dept

Dr. Brian Dennis: Staff Computer Scientist, LM ATL

William Murnane, Anand Karandikar: Graduate students, UMBC, CS Dept


Project Overview


What It’s Like Today



HADR/SSTR response has focused on highly centralized, tightly
coordinated organization.


Responders


Domestic: FEMA, DHS, National Guard, State and Local,
NGOs


International: Army, Navy, USAID, NGOs


Centralization slows response, throttles critical information,
limits situational awareness


Adapted from Dr. Brian Dennis's slides

Project Overview

What’s Changing


Response at the edge


Affected populace is using Internet/Web for communication


Assuming network availability


Social media tools are being used for communication &
coordination


Example social media platforms:


Twitter, Flickr, YouTube, open blogs


Social visibility + coordination + content


Adapted from Dr. Brian Dennis's slides

Technical Approach


Harvesting of Data

i.
Focus on social media like Twitter, Flickr

ii.
Capture data that has relief contexts


Computational models

i.
Generative model of social connections that can help build
forecasting tools


Building Analytics Toolkits

i.
Capabilities to analyze and mine sentiment

ii.
Automated generation of appropriate confidence levels for
information extracted



Twitter


Lots and lots of data:


Lots and lots of stuff nobody cares about: "omg, when I get
home I am so going to blog about your new haircut." -- Nick Taylor


... but maybe some stuff someone might care about. People
talk about getting sick, wildfires, floods, etc., so maybe we
can track that.


Dataset #1: Twitter


Nicely segmented into tables:
users, locations, statuses.



Referential integrity needs work:



select count(*) from
  (select follower_id from user_relationships
   except
   select id from users) as missing_uids;

 count
-------
 24201



Fairly big: roughly 1.5M users,
150M statuses, 1M locations.
30GB on disk.
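The dangling-follower check above can be reproduced on a toy dataset. A minimal sketch using SQLite, with table and column names taken from the slide and sample rows invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE user_relationships (follower_id INTEGER, followed_id INTEGER)")
cur.executemany("INSERT INTO users VALUES (?)", [(1,), (2,)])
# follower_id 99 has no matching users row -- a dangling reference
cur.executemany("INSERT INTO user_relationships VALUES (?, ?)",
                [(1, 2), (2, 1), (99, 1)])
cur.execute("""
    SELECT count(*) FROM
      (SELECT follower_id FROM user_relationships
       EXCEPT
       SELECT id FROM users) AS missing_uids
""")
missing = cur.fetchone()[0]
print(missing)  # -> 1
```

EXCEPT deduplicates, so the count is the number of distinct follower IDs with no users row, just as in the PostgreSQL query above.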

Current Progress


Dataset loaded into PostgreSQL from MySQL


Fixed corruption problems


Gave full-text indexing on tweets a try in Postgres


Too slow: 72 hours for CREATE INDEX and no progress


May try again on new hardware


Lucene-based app to build and search indices




Current Progress

Status and speed of query


Pretty good performance:


~35k rows/second while creating index on current hardware, quick
queries


Easy to write: 459 LOC counting the GUI, half that without it.


Tweet index design


Index only statuses: that's all we need to search
quickly so far.



Document ID: maps to SQL primary key on
statuses


Text: Analyze for words, do TF-IDF to order results.


UID: Can filter by user at the query level rather
than have to go ask the database. We don't know
if this will be useful, but it doesn't hurt.
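The TF-IDF ordering on the text field can be sketched outside Lucene. A toy Python version, not the actual Lucene scorer (tokenization and weighting are deliberately simplified):

```python
import math
from collections import Counter

def tf_idf_rank(docs, query):
    """Return document indices ordered by summed TF-IDF of the query terms."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # document frequency: how many documents contain each term
    df = Counter(term for toks in tokenized for term in set(toks))

    def score(toks):
        tf = Counter(toks)
        return sum((tf[t] / len(toks)) * math.log(n / df[t])
                   for t in query.lower().split() if df[t])

    return sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)

tweets = ["wildfire near san diego",
          "omg new haircut today",
          "wildfire wildfire evacuation now"]
print(tf_idf_rank(tweets, "wildfire"))  # doc 2 mentions it twice, so it ranks first
```

Lucene's real scoring adds length normalization and field boosts on top of this basic idea.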

Raw data for events of interest

Example chosen here is 'California Wildfires':


1. Twitter tweets for California wildfires

2. Technorati search for California wildfire videos

3. Yahoo! Pipes mashup for California wildfires using Flickr data



Twitter API methods


Search - Returns tweets that match a specified query.


statuses/public_timeline - Returns the 20 most recent statuses from users.


statuses/show - Returns a single status, specified by the id parameter.


Trends - Returns the top ten topics that are currently trending on Twitter.


GeoLocation API expected from Twitter by October 2009
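As a sketch, the Search method could be driven by a plain HTTP GET. The endpoint below is the 2009-era Twitter Search API host (long since retired) and the parameter names are from that era; shown for illustration only:

```python
from urllib.parse import urlencode

# Historical (2009) Twitter Search API endpoint; retired, for illustration only
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"

def build_search_url(query, rpp=20):
    """Build a Search API URL; 'rpp' (results per page) was a parameter of that API."""
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query, "rpp": rpp})

print(build_search_url("california wildfire"))
```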




Facebook API methods


Users.getStandardInfo - Returns the user's current location, timezone, etc.


Stream.get - If a user ID is specified, it can return the last 50 posts from that user's profile stream.


Status.get - Returns the user's current and most recent statuses.



YouTube Data API


To search for videos, submit an HTTP GET request to the following URL:
http://gdata.youtube.com/feeds/api/videos


Example: California Fires


Other parameters like location and location-radius can be added while building the query.
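A hedged sketch of building such a request; `q`, `location`, and `location-radius` are GData-era parameter names (the feed has since been retired), and the coordinates below are invented for illustration:

```python
from urllib.parse import urlencode

# Historical YouTube GData video feed (retired); for illustration only
GDATA_VIDEOS = "http://gdata.youtube.com/feeds/api/videos"

def youtube_search_url(query, location=None, radius=None):
    """Build a GData video-search URL with optional geo restriction."""
    params = {"q": query}
    if location:
        params["location"] = location          # e.g. "34.05,-118.24"
    if radius:
        params["location-radius"] = radius     # e.g. "50km"
    return GDATA_VIDEOS + "?" + urlencode(params)

print(youtube_search_url("california fires", "34.05,-118.24", "50km"))
```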




GeoCoding API


GeoCoding is the process of converting an address like '1000
Hilltop Circle, Baltimore, MD' to geographical coordinates,
which can be used to mark that address on a map.



Google Maps API: via the GClientGeocoder object. Use
GClientGeocoder.getLatLng() to convert a string address into
latitude and longitude.



Yahoo! Maps web service:


Example:
701 First Ave Sunnyvale CA
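A sketch of calling the Yahoo! Maps geocoding web service over HTTP; the endpoint is the circa-2009 REST URL as documented at the time (since retired), and the `appid` is a placeholder:

```python
from urllib.parse import urlencode

# Circa-2009 Yahoo! Maps geocoding endpoint (since retired); appid is a placeholder
YAHOO_GEOCODE = "http://local.yahooapis.com/MapsService/V1/geocode"

def yahoo_geocode_url(address, appid="YOUR_APP_ID"):
    """Build a geocoding request URL for a street address."""
    return YAHOO_GEOCODE + "?" + urlencode({"appid": appid, "location": address})

print(yahoo_geocode_url("701 First Ave Sunnyvale CA"))
```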

Similar Initiatives

AirTwitter

(Started in August 2009)


Designed to harvest user-generated content like tweets,
Delicious bookmarks, Flickr pictures and YouTube videos that
are relevant to air quality (AQ) uses.


Yahoo! Pipes for aggregated feed generation.


When events are identified, the location will be harvested
from contextual information in the feed, such as a place name
or, as development evolves, the IP address of the tweet.


To further automate event identification, Air Twitter feeds will
be archived in order to conduct temporal trend analysis that
can be used to separate the background noise from AQ events
in the social media stream.




Similar Initiatives

Crisis Informatics


ConnectivIT Research Group at the University of Colorado, Boulder


Investigates the evolving role of information and
communication technologies (ICT) in emergency and
disaster situations.


Particular focus on information dissemination and
the implications of ICT-supported public participation
on informal and formal crisis response.

To Do


Index locations, too? Lucene or SQL?


Better analyzer: discard non-English (tricky!) and do stemming (simple!)


Test on new hardware: SSD versus disk, for what parts?


Higher-level abstractions: what tweets are similar? Build an ontology that things fit into, or search for particular things?


Run human classifier for a while, then train machine classifier off that data.


Geo-location in Twitter space
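The stemming item could start as plain suffix stripping; a naive sketch (a real pipeline would use a proper Porter stemmer, e.g. from NLTK):

```python
def simple_stem(word):
    """Naive suffix-stripping stemmer: enough to merge obvious word variants."""
    for suffix in ("ing", "ed", "es", "s"):
        # require a stem of at least 3 characters so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("flooding"), simple_stem("floods"), simple_stem("fire"))
```

This would merge "flooding" and "floods" into one index term while leaving "fire" alone; it over-stems some words, which is the usual trade-off for such a simple rule set.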


Thanks