TEXT MINING AND SOCIAL MEDIA AT CSEE


Nov 15, 2013



Massimo Poesio

University of Essex

THE BIG DATA TORRENT

(McKinsey Report, 2011)

Social media sites, smartphones, and other consumer devices including PCs and laptops have increased this amount of information by allowing billions of individuals around the world to make information about their interests, likes and dislikes publicly available.

In a digitized world, consumers going about their day (communicating, browsing, buying, sharing, searching) create their own enormous trails of data.

E.g., 30 billion pieces of content shared on Facebook every month

Multimedia content (e.g., images, video) played a major role in this growth

RESEARCH ON BIG DATA & TEXT MINING
AT CSEE


EXTRACTING SEMANTIC INFORMATION such as named entities, sentiments, and topics from large amounts of data


DIGITAL LIBRARIES: analysis of QUERY LOGS and SCIENTIFIC PUBLICATIONS

E.g., collaboration with the Bridgeman Art Library in GALATEAS


SOCIAL MEDIA: e.g., sentiment analysis (PEBL)


SUMMARIZATION of newspaper comment threads in multiple languages (SENSEI)


CROWDSOURCING

EXTRACTING NAMED ENTITIES FROM
BRIDGEMAN IMAGE LIBRARY LOGS

Query                    NER TYPE                 D2W
calling of st. matthew   ARTWORK / OBJ, PERSON    http://en.wikipedia.org/wiki/The_Calling_of_St_Matthew_(Caravaggio)
joseph                   PERSON / SAINT           http://en.wikipedia.org/wiki/Saint_Joseph
george dunlop leslie     PERSON / PAINTER         http://en.wikipedia.org/wiki/George_Dunlop_Leslie
the crucible             PLAY / THEATRE           http://en.wikipedia.org/wiki/The_Crucible
vesuvius pompeii         LOC, LOC                 http://en.wikipedia.org/wiki/Pompeii
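The mapping from a query to entity types and a Wikipedia page can be illustrated with a toy lookup. This is only a minimal sketch, assuming a hand-built dictionary seeded with the example rows from this slide; it is not the GALATEAS pipeline, and the names `GAZETTEER`, `normalize`, and `annotate_query` are hypothetical.

```python
# Hypothetical gazetteer seeded with the example rows from the slide.
# A real system would rely on a trained NER model and a Wikipedia-scale
# index rather than exact string matching.
WIKI = "http://en.wikipedia.org/wiki/"

GAZETTEER = {
    "calling of st. matthew": (["ARTWORK / OBJ", "PERSON"],
                               WIKI + "The_Calling_of_St_Matthew_(Caravaggio)"),
    "joseph": (["PERSON / SAINT"], WIKI + "Saint_Joseph"),
    "george dunlop leslie": (["PERSON / PAINTER"], WIKI + "George_Dunlop_Leslie"),
    "the crucible": (["PLAY / THEATRE"], WIKI + "The_Crucible"),
    "vesuvius pompeii": (["LOC", "LOC"], WIKI + "Pompeii"),
}


def normalize(query):
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def annotate_query(query):
    """Return the entity types and Wikipedia link for a known query."""
    key = normalize(query)
    types, url = GAZETTEER.get(key, ([], None))
    return {"query": key, "types": types, "wikipedia": url}


if __name__ == "__main__":
    print(annotate_query("  George  Dunlop Leslie "))
    print(annotate_query("unknown painting"))
```

Exact matching keeps the sketch short; queries outside the dictionary simply come back with no types and no link.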

NAMED ENTITY INFORMATION IN
BUSINESS ANALYTICS TOOLS

ANALYZING TOPIC INFORMATION

EXTRACTING REFERENCES TO
ENTITIES IN SPECIALIZED DOMAINS

LOC / SITE / CULTURE

APPLICATION: SPATIAL / ENTITY
BROWSING

SENTIMENTS

vcurve: I like how Google celebrates little things like this: Google.co.jp honors Confucius Birthday Japan Probe

mattfellows: Hai world. I hate faulty hardware on remote systems where politics prevents you from moving software to less faulty systems.

brroooklyn: I love the sound my iPod makes when I shake to shuffle it. Boo bee boo

MeganWilloughby: Such a Disney buff. Just found out about the new Alice in Wonderland movie. Official trailer: http://bit.ly/131Js0 I love the Cheshire Cat.
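Messages like these carry explicit opinion words ("like", "love", "hate"), which a very simple lexicon-based scorer can pick up. The following is a minimal sketch under that assumption, not the method used in PEBL; the word lists and the function name `polarity` are hypothetical and far smaller than any usable lexicon.

```python
import re

# Tiny illustrative lexicons (assumption): real sentiment lexicons contain
# thousands of entries and handle negation, intensifiers, and sarcasm.
POSITIVE = {"like", "love"}
NEGATIVE = {"hate", "faulty"}


def polarity(text):
    """Classify text as positive, negative, or neutral by counting cue words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for message in [
        "I love the sound my iPod makes when I shake to shuffle it.",
        "I hate faulty hardware on remote systems",
    ]:
        print(polarity(message), "<-", message)
```

Counting cue words ignores context entirely, which is why practical systems usually train a classifier on labelled messages instead.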

SUMMARIZING MULTILINGUAL CONTENT
FROM MULTIPLE DOCUMENTS


Automatically producing SUMMARIES of the content of multiple documents is another way of managing vast amounts of information

Collaboration with JRC (Ispra): summarization of documents from all EU languages

Best performing system at TAC 2009

Arabic summarization: co-organized TAC 2011


Arabic NLP in general
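One common baseline for multi-document summarization is extractive: pool the sentences from all documents, score each by how frequent its words are across the collection, and keep the top few. The sketch below illustrates that baseline only; it is not the system built with JRC or evaluated at TAC, and the stopword list, sentence splitter, and function names are assumptions that work for simple English text rather than for all EU languages or Arabic.

```python
import re
from collections import Counter

# Minimal English stopword list (assumption); a multilingual system would
# need language-specific resources or a language-independent weighting.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "on", "for", "it"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def summarize(documents, max_sentences=2):
    """Pick the sentences whose words are most frequent across all documents."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Present the selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


if __name__ == "__main__":
    docs = [
        "Floods hit the city. The mayor spoke.",
        "Floods damaged the city centre. Weather was mild.",
    ]
    print(summarize(docs, max_sentences=2))
```

Averaging the word frequencies rather than summing them avoids favouring long sentences.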

SENSEI: SUMMARIZING COMMENT
THREADS

CROWDSOURCING


An alternative to automatic content extraction: harnessing the Web to access the expertise of hundreds of thousands of people

By paying them: Amazon Mechanical Turk

By attracting them with the opportunity to collaborate on a scientific project (Galaxy Zoo)

By having them play (Phrase Detectives)
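Whichever way contributors are recruited, several people usually label the same item, and their answers have to be combined. A minimal sketch of one simple option, majority voting, follows; it is not how Phrase Detectives, Galaxy Zoo, or Mechanical Turk requesters actually aggregate judgements, and the item IDs, worker IDs, and labels are made up for illustration.

```python
from collections import Counter


def aggregate(judgements):
    """Combine crowd labels by majority vote.

    judgements: iterable of (item_id, worker_id, label) tuples.
    Returns {item_id: (winning_label, share_of_votes)}.
    """
    labels_by_item = {}
    for item_id, _worker_id, label in judgements:
        labels_by_item.setdefault(item_id, []).append(label)

    result = {}
    for item_id, labels in labels_by_item.items():
        label, votes = Counter(labels).most_common(1)[0]
        result[item_id] = (label, votes / len(labels))
    return result


if __name__ == "__main__":
    # Hypothetical judgements: three workers label item m1, one labels m2.
    votes = [
        ("m1", "w1", "coreferent"),
        ("m1", "w2", "coreferent"),
        ("m1", "w3", "new"),
        ("m2", "w1", "new"),
    ]
    for item_id, (label, share) in aggregate(votes).items():
        print(item_id, label, round(share, 2))
```

The share of votes gives a rough agreement signal: items with a low share are candidates for more judgements or expert review.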

Galaxy Zoo

www.phrasedetectives.org

PHRASE DETECTIVES

CONCLUSIONS


Text mining supports several approaches to making big data manageable



THE UNIVERSITY OF ESSEX MSC IN
BIG DATA AND TEXT ANALYTICS



Provides training in extracting information from vast amounts of unstructured sources, including both text and images


Offered from October