TEXT MINING AND SOCIAL MEDIA AT CSEE
Massimo Poesio
University of Essex
THE BIG DATA TORRENT
(McKinsey Report, 2011)
Social media sites, smartphones, and other consumer devices including PCs and laptops have increased this amount of information by allowing billions of individuals around the world to make information about their interests, likes and dislikes publicly available.
In a digitized world, consumers going about their day (communicating, browsing, buying, sharing, searching) create their own enormous trails of data.
E.g., 30 billion pieces of content shared on Facebook every month
Multimedia content (e.g., images, video) played a major role in this growth
RESEARCH ON BIG DATA & TEXT MINING AT CSEE
EXTRACTING SEMANTIC INFORMATION such as named entities, sentiments, topics from large amounts of data
DIGITAL LIBRARIES: analysis of QUERY LOGS and SCIENTIFIC PUBLICATIONS
E.g., collaboration with the Bridgeman Art Library in GALATEAS
SOCIAL MEDIA: e.g., sentiment analysis (PEBL)
SUMMARIZATION of newspaper comment threads in multiple languages (SENSEI)
CROWDSOURCING
EXTRACTING NAMED ENTITIES FROM BRIDGEMAN IMAGE LIBRARY LOGS
Query                    NER TYPE                 D2W
calling of st. matthew   ARTWORK / OBJ, PERSON    http://en.wikipedia.org/wiki/The_Calling_of_St_Matthew_(Caravaggio)
joseph                   PERSON / SAINT           http://en.wikipedia.org/wiki/Saint_Joseph
george dunlop leslie     PERSON / PAINTER         http://en.wikipedia.org/wiki/George_Dunlop_Leslie
the crucible             PLAY / THEATRE           http://en.wikipedia.org/wiki/The_Crucible
vesuvius pompeii         LOC, LOC                 http://en.wikipedia.org/wiki/Pompeii
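The mapping in the table above (query, entity type, Wikipedia link) can be illustrated with a toy lookup. This is only a sketch under stated assumptions: the slides do not describe how the annotation was actually produced, and the names `GAZETTEER`, `Annotation`, `normalize` and `annotate` are hypothetical. The entries are copied from the table rows and nothing else.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    ner_type: str        # entity type(s), as written in the table
    wikipedia_url: str   # D2W column: link to the matching Wikipedia page


# Toy gazetteer (hypothetical structure), filled only with rows from the table.
GAZETTEER = {
    "calling of st. matthew": Annotation(
        "ARTWORK / OBJ, PERSON",
        "http://en.wikipedia.org/wiki/The_Calling_of_St_Matthew_(Caravaggio)",
    ),
    "joseph": Annotation(
        "PERSON / SAINT", "http://en.wikipedia.org/wiki/Saint_Joseph"
    ),
    "george dunlop leslie": Annotation(
        "PERSON / PAINTER", "http://en.wikipedia.org/wiki/George_Dunlop_Leslie"
    ),
    "the crucible": Annotation(
        "PLAY / THEATRE", "http://en.wikipedia.org/wiki/The_Crucible"
    ),
    "vesuvius pompeii": Annotation(
        "LOC, LOC", "http://en.wikipedia.org/wiki/Pompeii"
    ),
}


def normalize(query: str) -> str:
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def annotate(query: str) -> Optional[Annotation]:
    """Return the annotation for a logged query, or None if it is unknown."""
    return GAZETTEER.get(normalize(query))
```

An exact-match dictionary cannot handle misspellings or ambiguous names; a real system would need a statistical tagger and a disambiguation step, which this sketch does not attempt.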
NAMED ENTITY INFORMATION IN BUSINESS ANALYTICS TOOLS
ANALYZING TOPIC INFORMATION
EXTRACTING REFERENCES TO ENTITIES IN SPECIALIZED DOMAINS
Entity types: LOC, SITE, CULTURE
APPLICATION: SPATIAL / ENTITY BROWSING
SENTIMENTS
vcurve: I like how Google celebrates little things like this: Google.co.jp honors Confucius Birthday - Japan Probe
mattfellows: Hai world. I hate faulty hardware on remote systems where politics prevents you from moving software to less faulty systems.
brroooklyn: I love the sound my iPod makes when I shake to shuffle it. Boo bee boo
MeganWilloughby: Such a Disney buff. Just found out about the new Alice in Wonderland movie. Official trailer: http://bit.ly/131Js0 I love the Cheshire Cat.
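To make the idea of sentiment analysis on such posts concrete, here is a minimal lexicon-based sketch. It is an assumption-laden illustration, not the method used in PEBL, which the slides do not describe. The word lists `POSITIVE` and `NEGATIVE` and the functions `sentiment_score` and `label` are hypothetical, and the lexicon contains only words that occur in the example posts.

```python
import re

# Tiny hand-made lexicon (an assumption for illustration only).
POSITIVE = {"like", "love"}
NEGATIVE = {"hate", "faulty"}


def sentiment_score(text: str) -> int:
    """Count positive words minus negative words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def label(text: str) -> str:
    """Map the score to a coarse label: positive, negative or neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label("I love the Cheshire Cat.")` gives `"positive"`. The sketch also shows why word counting is fragile: in "little things like this" the word "like" is a preposition, yet it is counted as positive.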
SUMMARIZING MULTILINGUAL CONTENT FROM MULTIPLE DOCUMENTS
Automatically producing SUMMARIES of the content of multiple documents is another way of managing vast amounts of information
Collaboration with JRC (Ispra): summarization of documents from all EU languages
Best performing system at TAC 2009
Arabic summarization: co-organized TAC 2011
Arabic NLP in general
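As a rough illustration of multi-document summarization, the sketch below picks the sentences whose words are most frequent across all input documents. This is a generic frequency-based extractive baseline chosen for brevity; it is not the system built with JRC or evaluated at TAC. The stopword list and the names `split_sentences`, `tokenize` and `summarize` are assumptions, and the tokenizer handles unaccented Latin letters only.

```python
import re
from collections import Counter

# Small English stopword list (an assumption; real systems use larger,
# language-specific lists).
STOPWORDS = {"a", "an", "and", "from", "in", "is", "of", "the", "to", "was"}


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def summarize(documents, max_sentences=2):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average corpus frequency of the sentence's content words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]
```

Averaging by sentence length keeps long sentences from winning merely because they contain more words; other normalizations are equally defensible.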
SENSEI: SUMMARIZING COMMENT THREADS
CROWDSOURCING
An alternative to automatic content extraction: harnessing the Web to access the expertise of hundreds of thousands of people
By paying them: Amazon Mechanical Turk
By attracting them with the opportunity to collaborate on a scientific project (Galaxy Zoo)
By having them play (Phrase Detectives)
Galaxy Zoo
www.phrasedetectives.org
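Whichever way contributors are recruited, their judgements usually have to be combined into a single answer per item. The sketch below uses simple majority voting. This is a common baseline and an assumption here; the slides do not say how Phrase Detectives or Galaxy Zoo aggregate responses. The function name `majority_vote` and the `(item_id, annotator_id, label)` tuple layout are hypothetical.

```python
from collections import Counter


def majority_vote(judgements):
    """Aggregate crowd labels per item.

    judgements: iterable of (item_id, annotator_id, label) tuples.
    Returns {item_id: (winning_label, share_of_votes)}.
    Ties go to the label that was seen first for that item.
    """
    by_item = {}
    for item_id, _annotator_id, label in judgements:
        by_item.setdefault(item_id, []).append(label)

    result = {}
    for item_id, labels in by_item.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        result[item_id] = (top_label, top_count / len(labels))
    return result
```

The vote share gives a crude confidence value: an item with a low share could be sent to more contributors before its label is trusted.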
PHRASE DETECTIVES
CONCLUSIONS
Text mining supports several approaches to making big data manageable
THE UNIVERSITY OF ESSEX MSC IN BIG DATA AND TEXT ANALYTICS
Provides training in extracting information from vast amounts of unstructured sources, including both text and images
Offered from October