A Summary and Critique of “TwitterStand: News in Tweets”

CSCE 6350.001 Spring 2010

Amgalan Baatarjav, Jedsada Chartree, and Thiraphat Meesumrarn

February 8, 2010



The following is a review of the article “TwitterStand: News in Tweets” by J. Sankaranarayanan, H. Samet,
B. Teitler, M. Lieberman, and J. Sperling (2009). In this article, J. Sankaranarayanan et al. investigate a
social network website, Twitter, to build a news processing system called TwitterStand. The system classifies
tweets as junk or news. After classifying the tweets, the system clusters news tweets that have a similar
topic. Finally, each cluster is mapped to the geographic location where the breaking news is taking place.

This paper will begin by introducing Twitter. Next, it will state the problem statements, and then the
proposed TwitterStand will be presented, along with its methodologies. Finally, the article will be critiqued.


2. Twitter

Twitter is a free social networking and microblogging website that enables its users to send and read
messages, called tweets. A tweet is a short message that contains a maximum of 140 characters. In Twitter,
User A can receive tweets from other users, who are User A’s friends; User A can have a maximum of 2000
friends. In addition, User A can have followers who can receive and read User A’s tweets. There is no limit on
the number of followers that User A can have. The number of people who use Twitter has increased in recent
years because it offers the following advantages.


First of all, Twitter encourages people to acquire a large number of friends, and a user can have unlimited
followers; users and followers can send and receive tweets on a wide variety of subjects. Users can also link
to objects on the web, such as articles, images, and videos.

The second advantage is that, because Twitter is a social networking website, users can be viewed as part of
other users’ networks or groups; users can see others’ profiles and the tweets of friends and followers.

Furthermore, we can identify the location of the users who send the tweets.

Last but not least, Twitter is useful for tracking news, and users receive reactions and feedback through
tweets faster than through conventional news media.

3. Problem statement

The drawback of a tweet is that it is restricted to only 140 characters, a short message. Therefore, tweets
may be grammatically incorrect and contain spelling errors. Because a tweet contains a short message, users
are likely to send tweets more often. Thus, a very large number of tweets arrive at a high throughput rate.
This research focuses on extracting news from Twitter tweets. The authors try to separate news from noise,
group news tweets into clusters, and determine the relevant location of each cluster. To do these tasks, the
researchers build a system for processing Twitter tweets, called TwitterStand, and test whether this system
works efficiently.

The main difficulty in getting news from tweets is dealing with noisy data; therefore, the questions are
whether TwitterStand can effectively separate news from noise tweets, whether the proposed online clustering
system can quickly mitigate noise, and whether this system can determine the location of clusters.

The main question, therefore, is whether TwitterStand can gather and report breaking news much faster than
conventional news media.

According to the problem statements above, the goals of this research are to build a processing system that
automatically obtains breaking news posted by Twitter users, and to provide a map interface for reading this
news.
4. Methodology

The authors create TwitterStand, the proposed news processing system, which aims to identify breaking news
as quickly and with as high quality as possible.

In the TwitterStand architecture, all data from Twitter tweets are first gathered using Twitter services
including Seeders, GardenHose, Search, and BirdDog. Then this data is classified to separate noise from news.
After that, the news is automatically grouped into clusters. Next is the geotagger, which locates and extracts
geographic content from each cluster. The results are displayed on the screen, which shows the ranked clusters
in the left pane and a map of the geographic regions related to these clusters in the right pane.

In the first step of TwitterStand, the Seeders, the researchers handpick 2,000 Twitter users who publish news
in tweets and who are primarily interested in news. They also use the GardenHose and BirdDog services to
obtain feeds from Twitter.

The next important part of the TwitterStand system is separating news from noise. All tweets are classified
as news or junk; the authors use a Naïve Bayes classifier to distinguish news from junk. The idea of this
classifier is training and testing: a large number of tweets in the training corpus have already been marked
as news or junk, and each incoming user tweet is then classified using Bayes’ theorem. In this step, the
researchers also use a static and a dynamic corpus to identify news among tweets.
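
To illustrate the training-and-testing idea behind this step, the following is a minimal Python sketch of a
Naïve Bayes news/junk classifier. It is not the authors' implementation: the class name
`NaiveBayesTweetClassifier`, the whitespace tokenizer, the add-one smoothing, and the six toy training tweets
are all assumptions made here for illustration.

```python
import math
from collections import Counter


class NaiveBayesTweetClassifier:
    """Toy multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word occurrences
        self.doc_counts = Counter()  # label -> number of training tweets
        self.vocab = set()

    def train(self, labeled_tweets):
        for text, label in labeled_tweets:
            words = text.lower().split()
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log P(label) + sum of log P(word | label), words treated as independent
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocab)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labeled training corpus (not data from the article).
TRAINING = [
    ("earthquake hits southern california", "news"),
    ("president signs new health bill", "news"),
    ("election results announced in iran", "news"),
    ("eating lunch with my friends lol", "junk"),
    ("so bored at home today", "junk"),
    ("watching tv with my cat", "junk"),
]

clf = NaiveBayesTweetClassifier()
clf.train(TRAINING)
print(clf.classify("earthquake reported in iran"))
print(clf.classify("bored at home with my cat"))
```

With such a small corpus the result depends entirely on word overlap with the training tweets, which is one
reason the size and composition of the real training corpus matter.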

After tweets are classified as news, the next step is online clustering, which automatically groups tweets
into sets of tweets; this step deals with dynamic news tweets using an online algorithm called leader-follower
clustering. The system considers only tweets that were published within three days; the authors apply a
variant of the cosine similarity measure and a Gaussian parameter to identify active clusters, which are those
less than three days old.
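
The online clustering step can be sketched roughly as follows. This is a simplified illustration, not the
authors' algorithm: it uses raw term counts instead of their weighted variant of cosine similarity, replaces
the Gaussian parameter with a hard three-day cutoff on the last update time, and the names `OnlineClusterer`,
`threshold`, and `max_age` are hypothetical.

```python
import math
from collections import Counter

THREE_DAYS = 3 * 24 * 60 * 60  # activity window in seconds


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineClusterer:
    """Leader-follower style clustering: join the closest active cluster or start a new one."""

    def __init__(self, threshold=0.3, max_age=THREE_DAYS):
        self.threshold = threshold  # assumed similarity cutoff
        self.max_age = max_age      # clusters idle longer than this are inactive
        self.clusters = []

    def add(self, text, timestamp):
        vector = Counter(text.lower().split())
        active = [c for c in self.clusters
                  if timestamp - c["last_update"] <= self.max_age]
        best = max(active, key=lambda c: cosine(vector, c["centroid"]), default=None)
        if best is not None and cosine(vector, best["centroid"]) >= self.threshold:
            # The centroid is kept as a sum of term counts; cosine ignores its scale.
            best["centroid"].update(vector)
            best["tweets"].append(text)
            best["last_update"] = timestamp
            return best
        cluster = {"centroid": vector, "tweets": [text], "last_update": timestamp}
        self.clusters.append(cluster)
        return cluster


clusterer = OnlineClusterer()
first = clusterer.add("earthquake shakes southern california", timestamp=0)
second = clusterer.add("strong earthquake in southern california", timestamp=60)
third = clusterer.add("iran election results disputed", timestamp=120)
print(len(clusterer.clusters))
```

Each tweet is processed once, in arrival order, which is what makes the approach suitable for a high
throughput stream.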

Moreover, in order to improve the clustering results, the researchers perform additional tasks: dealing with
noise, handling fragmented or duplicate clusters on a single topic, applying weight upper bounds, and dealing
with pairs of words that always occur together.

The next step is the geotagger, in which geographic locations are extracted for each cluster; the researchers
then aggregate per-tweet geographic content to determine the overall geographic focus. To reach this goal,
they use several strategies. The first uses tweet content, performing toponym recognition and toponym
resolution. Toponym recognition combines two Natural Language Processing strategies, Part-of-Speech tagging
and Named-Entity Recognition, to identify the part of speech of each word in a sentence and to classify
entities into categories such as organization or person, respectively. The other part of geotagging, toponym
resolution, deals with determining geographic coordinates, which are stored in a database together with
associated metadata. The database of geographic locations contains almost 7 million entries gathered from
volunteers around the world and is stored in a PostgreSQL database. The second strategy uses tweet metadata:
the researchers collect metadata, in particular source locations, to perform the geotagging process. Finally,
they compute a geographic focus for the topic as a whole, rank the locations in the cluster, and then report
them.
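
The aggregation of per-tweet locations into a cluster-level geographic focus can be illustrated with a small
Python sketch. It is only a stand-in for the pipeline described in the article: the three-entry `GAZETTEER`
with approximate coordinates is hypothetical, plain substring matching replaces POS tagging and NER, and
ranking by mention count is an assumed scoring rule.

```python
from collections import Counter

# Hypothetical miniature gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "tehran": (35.69, 51.39),
    "los angeles": (34.05, -118.24),
    "california": (36.78, -119.42),
}


def find_toponyms(text):
    """Crude toponym recognition: return gazetteer names that occur in the text."""
    lowered = text.lower()
    return [name for name in GAZETTEER if name in lowered]


def geographic_focus(tweets):
    """Rank the places mentioned in a cluster by how many tweets mention them."""
    counts = Counter()
    for tweet in tweets:
        counts.update(find_toponyms(tweet))
    return [(name, GAZETTEER[name], n) for name, n in counts.most_common()]


cluster = [
    "Earthquake felt in Los Angeles",
    "California quake rattles Los Angeles",
    "Shaking across California and Los Angeles",
]
for name, coords, mentions in geographic_focus(cluster):
    print(name, coords, mentions)
```

A real system also has to resolve ambiguous names, which this sketch ignores because every name maps to
exactly one coordinate pair.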

The other step is building the user interface, which provides two panes on the screen. The left pane shows
the clusters ranked in terms of importance, while the right pane shows a map of the geographic regions
associated with the clusters.

Furthermore, TwitterStand also contains two interesting functions: hashtags and friend finder. These services
help us to find important topics and new friends, respectively.



5. Results and Conclusion

The authors develop a processing system, TwitterStand, which automatically obtains breaking news from tweets
in Twitter. They describe the architecture of the proposed system in detail and carry out the experiment
following the system architecture.

The results are satisfactory: TwitterStand is faster than conventional news media. The proposed system can
separate news from noise and cluster the tweets quickly and robustly. The system can also determine the
clusters’ geographic locations. This work is a good basis for future work.

6. Criticisms


The authors state quite directly that the use of Twitter has been increasing and that it is becoming popular
for communication using short messages, tweets. Because tweets are restricted to 140 characters, they are
broadcast more frequently and at a very high rate; most messages are news, events, activities, and opinions.
In other words, tweets tell users and followers what is happening, what people are doing, and what their
opinions are. From this point, the authors are motivated to find out whether news from tweets is faster than
conventional news media. Thus, they propose TwitterStand to separate news from noise, group tweets on the
same topic into clusters, and determine the location of clusters. Therefore, the authors point out the
motivation directly and clearly.

Related works

The authors introduce Twitter clearly and in great detail. They also describe the proposed TwitterStand.
However, they do not discuss related or previous work; they merely refer to some related papers without giving
details. For instance, they refer to their own papers, STEWARD and NewsStand, without details. In fact, no
other previous work is introduced in this article. It is very important for an academic paper to present the
related work on the problems that motivated the authors to develop the experiment. Although many previous or
related works are listed in the references section, the authors do not discuss them in the article. This
affects the reliability and value of the experiment. Although the topic of this article is quite new, the
authors should have paid attention to related work. It would have been better if the authors had related
their findings to previous work on this topic.


The contribution of this work builds on the fact that Twitter changes the way people communicate with each
other. One person can broadcast messages via electronic gadgets such as cell phones and PDAs. Consequently,
millions of Twitter users have become the eyes and ears of the world. Working with Twitter, TwitterStand
offers a way to find related news by clustering data and points to the source of news by using geotagging.
The strong point of this method is that it locates the source of news, so TwitterStand can offer local news
for a specific group of users; examples are the Iran election and the earthquake in California.

On the other hand, Twitter is an informal reporter, so the messages may contain misspellings and Net lingo.
In addition, reports from local areas might use local vocabulary and local languages. It is difficult to
separate news from noise, and some messages might be missed because of unrecognized local languages. To fix
this weak point, TwitterStand should have worked with an enlarged corpus.

Problem Definition

The main purpose of TwitterStand is to extract news, without noise, from the 140 characters of tweet
messages. Because of the limitation on message length, the drawbacks are a high throughput of messages, wrong
spelling, and wrong grammar. Therefore, the authors have to separate news from noise. These problems are
addressed by a classification technique, Naïve Bayes, and a clustering technique based on term
frequency-inverse document frequency weighting.
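
For readers unfamiliar with term frequency-inverse document frequency, the short Python sketch below computes
the textbook weighting for a few toy tweets. The function name `tf_idf` and the sample tweets are assumptions
for illustration, and the article may use a different variant of the weighting.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


tweets = [
    "earthquake in california",
    "election in iran",
    "earthquake aftershock in california",
]
for vector in tf_idf(tweets):
    print(vector)
```

A word such as "in" that appears in every tweet receives a weight of zero, while a word that appears in only
one tweet is weighted most heavily, which is why the weighting helps group tweets by topic.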

However, the difficult points of this research are what counts as news and how far we can trust the
reporters. These points are hard to deal with because news from some places is not important for other
places, or the sources of the news have low trustworthiness. The @breakingnews site, an automated site, for
example, has over 1,000 followers. It had been updated about the swine flu situation because its followers
kept posting swine flu updates every 10 minutes. Nonetheless, most of the messages are rumours, for no one
audits those tweets.

In short, a large number of tweets does not mean that we can trust the news in them.

From the motivation above, the authors clearly state the problem definition of their experiment. The goal of
this work is to identify news in tweets; thus, the authors develop TwitterStand, the processing system, to
separate news from noise, cluster tweets on the same topic, and locate the geographic content of clusters.
Thus, this work deals with the problem of tweets being restricted to 140 characters and the problem of noise
in tweets. The authors point out these problems clearly. However, the authors do not directly define the
efficiency or quality of their work; they just describe how it works. Hence, it would have been better if the
researchers had stated the problem definition more directly.

Key concepts



There are several key concepts in this article. The first important concept is separating news from noise;
this is the main goal of this research. It is necessary to get rid of noisy data in Twitter tweets. The
authors apply the well-known Naïve Bayes algorithm to separate news from noise. Another key concept is
clustering tweets to deal with a dynamic corpus of new topics that are less than three days old, using a
variant of the cosine similarity measure and a Gaussian parameter. The other key concept is mapping the
clusters to a geographic focus using geotagging, which applies Natural Language Processing mechanisms:
Part-of-Speech tagging and Named-Entity Recognition.

These are good key concepts for this work. The good results come from a sound experimental method built on
these key concepts.


The main point of this work is to capture tweets that correspond to late-breaking news. The authors choose
2,000 users related to breaking news and gather the tweets from these users. Then, they separate news from
noise in the tweets using a Naïve Bayes classifier. After that, each tweet is grouped online into clusters,
and a geographic focus is then extracted for each cluster. However, there are some limitations in this
experiment.

First, most tweets could be very noisy and biased because tweets come from users and many followers; thus,
many tweets could be opinions even though the authors handpick 2,000 users. The authors should have specified
the proportions of the different categories of users. If many users come from the same group, the news may be
biased and contain many opinions. The validity and reliability of this work depend on the reliability of the
data, which comes from users. Hence, the authors should have paid more attention to selecting unbiased and
reliable users.

Second, it is difficult to understand how news is separated from noise even though the authors apply the
well-known Naïve Bayes classifier and show its formula; they do not explain in detail how they apply this
classifier, how they train on the training data, or how they test on the testing data. In fact, they do not
clearly describe the training corpus, such as how large it is; this affects the accuracy of classifying
tweets as news or junk. Moreover, it is difficult to understand the online clustering. The authors apply a
Gaussian parameter to define active and inactive clusters; the active clusters are less than three days old,
but the authors do not explain active clusters in more detail: they just describe and show the formulas of
the cosine similarity measure and the Gaussian parameter. Thus, the authors should have focused on how easily
readers can understand the method.

In addition, most readers will wonder about the handling of fragmentation, where several clusters cover the
same topic. The authors do not explain in detail why they mark the older clusters as master clusters, even
though the slave clusters, which hold newer topics, seem to be more important news.

Next, in the section on geographic focus, the authors apply geotagging to indicate geographic locations. They
also apply Part-of-Speech (POS) tagging and Named-Entity Recognition (NER), which belong to the area of
Natural Language Processing (NLP). However, the authors do not describe this step in detail; they just
introduce POS and NER. They also only refer to STEWARD and NewsStand, which are listed in the references. It
would have been better if the authors had given more detail on this function and described how it works.


Key assumptions

There are some key assumptions made by the authors. The first is that TwitterStand can separate news from
noise, that is, from tweets that do not belong to the news domain; tweets come from both users and followers.
The second is that TwitterStand can cluster news tweets into sets of tweets. In this step, the authors assume
that the tweets that come from the trusted seeders are reliable news because tweets of this kind come from
users who publish news from newspapers and television in their tweets. The third is the assumption behind the
Naïve Bayes classifier, namely that the words in tweets are independent.

The weak point of these assumptions is that if most users are not seeders, or there are no seeders, there
would be a huge amount of noise, and the system would slow down because a lot of tweets would have to go
through the classifier. Furthermore, the problem with the Naïve Bayes classifier is that each word is treated
as independent, meaning that it does not know the meaning of words. Therefore, we would see tweets with the
same meaning assigned to different topics; this is the drawback of the Naïve Bayes classifier.

Validation of Methodology

From the method above, the authors show that TwitterStand works, but they do not compare their results to the
findings of previous work; in fact, no related work is discussed in this article. Also, the authors do not
show the results in tables or graphs. They just mention that it works, that it can reduce noise, that it can
group or cluster the tweets, and that it can locate the geographic content, but they do not give more detail.
Therefore, the results cannot be compared to other work. Moreover, we do not know much about the users; we
just know that the authors choose 2,000 people who are related to news. We do not know the proportion of
genders, their ages, their locations, their occupations, and so on. These factors affect the accuracy of the
results. Thus, the reliability and validity of this work are rather low. It would have been more reliable and
valid if the authors had focused on these issues.


In this article, the authors first introduce Twitter in general as a social networking website that allows
people to communicate with others using short messages, called tweets, which are broadcast on a wide range of
topics; then they present the advantages of Twitter. In this research, the authors focus only on breaking
news; thus, they explain Twitter in more detail in Section 2. After that, the authors introduce the key
strategies used to develop their proposed system, TwitterStand, along with its architecture, which is an
important part of the article; they describe this part in detail, including inputs, separating news from
noise, online clustering of tweets, determining a cluster’s geographic focus, the user interface, topic
hashtags, and the friend finder. Along with the architecture, the authors explain what they do and which
techniques or algorithms they use, and they also discuss the results, such as how well the proposed system
works. Finally, the authors give concluding remarks and list references. This seems to be a good
organization; however, the authors do not discuss related or previous work, which is also an important part
of a paper. They just refer to some references in the paper, such as STEWARD [1] and NewsStand [18].
Moreover, it is difficult to understand the methodology because the authors do not separate the methodology
and the results from the section on the architecture of TwitterStand; in fact, the concluding remarks, the
last section, come directly after the architecture section. We also cannot see the results as tables or
graphs. Clearly, it takes time to understand this article. Although this is a new, interesting, and important
paper, the authors could have given more attention to the fact that a well-organized paper helps readers to
understand it easily. It would have been better and more reliable if the authors had related their findings
to those of previous work and presented their main findings in the form of tables or graphs.

7. References

E. Amitay, N. Har’El, R. Sivan, and A. Soffer. Web-a-Where: Geotagging web content. In Proc. of ACM SIGIR,
pages 273-280, Sheffield, UK, July 2004.


http://www.cnn.com/2009/TECH/07/01/celebrity.death.pranks/index.html. Retr. July 1, 2009.


N. Cohen. Twitter on the barricades: Six lessons learned. Pub. June 20, 2009.




Site profile for twitter.com. http://siteanalytics.compete.com/twitter.com/. Retr. July



R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, New York, second
edition, 2000.


GeoNames. GeoNames. Retr. June 17, 2009.


. Retr. July 1, 2009.


B. Heil and M. Piskorski. New Twitter research: Men follow men and nobody tweets.
cs/2009/06/new_twitter_research_men_follo.html. Pub. June 1, 2009.


A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities.
In Proc. of SNA-KDD 2007 Workshop on Web mining and social network analysis, pages 56-65, San Jose,
California, 2007.


D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, NJ, USA,
2000.


M. D. Lieberman, H. Samet, J. Sankaranarayanan, and J. Sperling. STEWARD: architecture of a spatio-textual
search engine. In Proc. of ACM GIS, pages 186-193, Seattle, WA, Nov. 2007.


McGiboney. Twitter’s tweet smell of success. http://blog.nielsen.com/nielsenwire/online_mobile/twitters.
Pub. Mar. 18,



M. Milian. Twitter sees earth-shaking activity during SoCal quake. Pub. July 30, 2008.


T. M. Mitchell. Machine Learning. McGraw-Hill, New York, NY, 1997.


. Retr. July 1, 2009.

[16] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing
and Management, 24(5):513-523, 1988.

M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on
Text Mining, pages 1-20, Boston, MA, Aug. 2000.


B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. NewsStand: A new
view on news. In Proc. of ACM SIGSPATIAL GIS, pages 144-153, Irvine, CA, Nov. 2008.


. Retr. July 1, 2009.


. Retr. July 1, 2009.


Twitter Search.
. Retr. July 1,