A Summary and Critique of “TwitterStand: News in Tweets”
CSCE 6350.001 Spring 2010
The following is a review of the article “TwitterStand: News in Tweets” by J. Sankaranarayanan, H. Samet, B. Teitler, M. Lieberman, and J. Sperling (2009).
In this article, J. Sankaranarayanan et al. use Twitter, a social network website, as the source for a news processing system, called TwitterStand, which classifies tweets as junk or news. After classifying the tweets, the system clusters news tweets that have a similar topic of content. Finally, each cluster pinpoints the geographic location where the breaking news is being reported.
This paper will begin by introducing Twitter. Next, this paper will state the problem statements; then the proposed TwitterStand system will be presented, along with its methodologies.
Twitter is a free social networking and microblogging website that enables its users to send and read messages, called tweets. A tweet is a short message that contains a maximum of 140 characters. In Twitter, User A can receive tweets from other users, who are User A’s friends; User A can have a maximum of 2,000 friends. User A can also have followers, who can receive and read User A’s tweets. There is no limit on the number of followers that User A can have.
Thus, the number of people who use Twitter has increased in recent years.
First of all, Twitter encourages people to acquire a large number of friends, and a user can have unlimited followers; users and followers can send and receive tweets on a wide variety of subjects. Users can also link to objects on the web, such as articles, images, and videos.
The second advantage is that, because Twitter is a social networking website, users can be viewed as members of groups of other users; they can see others’ profiles and the tweets from friends and followers.
Furthermore, we can identify the location of the users who send the tweets.
Last but not least, Twitter is useful for tracking news, and users receive reactions and feedback through tweets faster than through conventional news media.
Problem statement
The drawback of sending a tweet is that it is restricted to only 140 characters: a short message
; most users use
tweets may also be grammatically incorrect and contain spelling errors. Because a tweet contains a short message, users are likely to send tweets more often. Thus, a very large number of tweets arrive at a high throughput rate. This research focuses on obtaining news from Twitter tweets: the researchers try to separate news from noise, group the news tweets into clusters, and determine the relevant location of each cluster. To do these tasks, the researchers create a processing system for Twitter tweets, called TwitterStand, and examine whether it works efficiently or not.
The main challenge in getting news from tweets is dealing with noisy data; therefore, the questions are whether TwitterStand can effectively separate news tweets from noise tweets, whether the proposed online clustering system can quickly mitigate noise, and whether the system can determine the location of clusters. The main question is therefore whether TwitterStand can gather and identify breaking news much faster than conventional news media.
According to the problem statements above, the goals of this research are building a processing system that automatically obtains breaking news posted by Twitter users, and providing a map interface for reading this news.
The authors create TwitterStand, which is the proposed news processing system that can identify breaking news as quickly and with as high quality as possible.
With the architecture of TwitterStand, all data from Twitter tweets are first gathered using Twitter services including Seeders, Search, and BirdDog. Then the data are classified to separate news from noise. After that, the news tweets are automatically grouped into clusters. Next is the geotagger, which locates the geographic content of each cluster. The results are displayed on the screen, which shows the ranked clusters in the left pane and a map of the geographic regions related to these clusters in the right pane.
In the first step of TwitterStand, specifically Seeders, the researchers handpick 2,000 Twitter users who publish news in tweets and who are primarily interested in news. They also use the GardenHose and BirdDog services to obtain feeds of tweets from Twitter.
The next important part of the TwitterStand system is separating news from noise. All tweets are classified as news or junk; the authors use a Naïve Bayes classifier to separate the news from the junk. The idea of this classifier is training and testing: a large number of tweets have already been marked as news or junk in a training corpus of tweets; then each user tweet is tested using Bayes’ theorem. In this step, the researchers also use static and dynamic corpora to identify news among the tweets.
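The training-and-testing idea described above can be illustrated with a minimal sketch of a Naïve Bayes text classifier. This is not the authors' implementation: the tiny training corpus, the whitespace tokenizer, and the use of Laplace (add-one) smoothing are all assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_tweets):
    """Count words per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_tweets:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior under Bayes' theorem."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # log prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical hand-labeled training corpus (not from the article).
training = [
    ("earthquake hits southern california", "news"),
    ("election results announced in iran", "news"),
    ("president signs new health bill", "news"),
    ("eating lunch with my friends lol", "junk"),
    ("so bored at home today", "junk"),
    ("my cat is so cute lol", "junk"),
]
model = train(training)
print(classify("earthquake reported in iran", model))
print(classify("bored eating lunch", model))
```

Each word contributes to the score independently of the others, which is the "naïve" independence assumption the classifier relies on.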
After classifying tweets as news, the next step is online clustering, which automatically groups news tweets into sets of tweets on the same topic; this step deals with dynamic news tweets using an online algorithm, called leader-follower clustering. The system considers only tweets that were published within three days; the authors apply a variant of the cosine similarity measure and a Gaussian algorithm to identify active clusters, which are less than three days old.
In order to improve the clustering results, the researchers perform additional tasks: dealing with noise, handling fragmented clusters on a single topic, applying weight upper bounds, and accounting for pairs of words that always occur together.
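The online clustering step can be sketched roughly as follows. This is a simplified illustration, not the authors' algorithm: it uses plain term-count vectors and ordinary cosine similarity, omits the Gaussian weighting, and the similarity threshold and the dictionary-based cluster representation are hypothetical choices.

```python
import math
from collections import Counter

THRESHOLD = 0.3          # hypothetical similarity cutoff
ACTIVE_WINDOW_DAYS = 3   # clusters not updated within this window are inactive

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def add_tweet(clusters, text, day):
    """Assign a tweet to the most similar active cluster, or start a new one."""
    vec = Counter(text.lower().split())
    active = [c for c in clusters if day - c["last_day"] <= ACTIVE_WINDOW_DAYS]
    best = max(active, key=lambda c: cosine(vec, c["centroid"]), default=None)
    if best is not None and cosine(vec, best["centroid"]) >= THRESHOLD:
        best["centroid"].update(vec)   # summed counts act as the centroid direction
        best["tweets"].append(text)
        best["last_day"] = day
        return best
    new_cluster = {"centroid": vec, "tweets": [text], "last_day": day}
    clusters.append(new_cluster)
    return new_cluster

clusters = []
add_tweet(clusters, "earthquake shakes southern california", day=0)
add_tweet(clusters, "strong earthquake in california", day=0)
add_tweet(clusters, "iran election protests continue", day=1)
add_tweet(clusters, "earthquake shakes california again", day=5)
print(len(clusters), [len(c["tweets"]) for c in clusters])
```

In this example the last tweet starts a new cluster even though it matches the earthquake topic, because the earlier cluster is more than three days old and is therefore treated as inactive.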
The next step is the geotagger, in which a geographic location is extracted for each cluster; the researchers then use the geographic content of the tweets to determine the overall geographic focus. To reach this goal, they use several strategies. The first uses tweet content and performs toponym recognition and toponym resolution. Toponym recognition combines two Natural Language Processing strategies, Part of Speech tagging and Named Entity Recognition, to identify the part of speech of each word in a sentence and to sort entities into categories such as organization or person, respectively. The other part of geotagging, toponym resolution, matches the recognized toponyms to geographic locations, which are stored in a database together with their metadata. This database of geographic locations contains almost 7 million entries gathered from volunteers around the world and is stored in a PostgreSQL database. The second strategy uses tweet metadata, in particular source locations, to perform the geotagging process. Finally, they compute a geographic focus for the topic as a whole by ranking the locations in the cluster, and then report it.
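A rough sketch of the idea of computing a cluster's geographic focus is given below. It is only an illustration under strong assumptions: toponym recognition is replaced by a plain dictionary lookup instead of Part of Speech tagging and Named Entity Recognition, the three-entry gazetteer with approximate coordinates is hypothetical, and locations are ranked simply by how often they are mentioned.

```python
from collections import Counter

# Hypothetical toy gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "california": (36.8, -119.4),
    "tehran": (35.7, 51.4),
    "iran": (32.4, 53.7),
}

def geographic_focus(cluster_tweets):
    """Rank the gazetteer locations mentioned in a cluster by mention count."""
    counts = Counter()
    for text in cluster_tweets:
        for token in text.lower().split():
            token = token.strip(".,!?#@")   # drop surrounding punctuation
            if token in GAZETTEER:          # dictionary lookup stands in for NER
                counts[token] += 1
    return [(name, GAZETTEER[name], n) for name, n in counts.most_common()]

focus = geographic_focus([
    "Earthquake in California!",
    "california quake felt in LA",
    "Protests in Tehran",
])
print(focus[0])
```

The first entry of the ranked list plays the role of the cluster's geographic focus; a real system would also need to resolve ambiguous names, which this sketch ignores.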
The other step is building the user interface, which provides two panes on the screen, a left pane and a right pane. The left pane is designed to show the clusters ranked in terms of importance, while the right pane shows a map of the clusters in the geographic region of interest.
Furthermore, TwitterStand also contains two interesting functions: hashtags and a friend finder. These two services help us to find important topics and new friends, respectively.
The authors develop a processing system, TwitterStand, which automatically obtains breaking news from tweets. They describe the architecture of the proposed system in detail. They carry out the experiment following the system architecture. The results are satisfactory: the system is faster than conventional news media. The proposed system can separate news from noise and cluster the tweets quickly and robustly. This system can also determine the clusters’ geographic locations. This work is a good basis for future work.
The authors state quite directly that the use of Twitter has been increasing and that it is becoming popular for communication using short messages, tweets. Because tweets are restricted to 140 characters, they are broadcast more frequently and at a very high rate; most messages are news, events, activities, and opinions. In other words, tweets tell users and followers what happened, what is going on, what people are doing, and what their opinions are. From this point, the authors’ motivation is to find out whether news from tweets is faster than conventional news media. Thus, they propose TwitterStand to separate news from noise, group tweets on the same topic into a set, and determine the location of clusters. Therefore, the authors point out the motivation directly and clearly.
The authors introduce Twitter clearly and in great detail. They also describe the proposed TwitterStand system. However, they do not discuss related or previous work; they only refer to some related papers without giving details. For instance, they only refer to their own papers, STEWARD and NewsStand, without details. In fact, no other previous work is introduced in this article. It is very important for an academic paper to present the related work that addresses the problem and motivates the authors to develop their experiment. Although many previous or related works are listed in the references section, the authors do not discuss them in the article. This affects the reliability and value of the experiment. Although the topic of this article is quite new, the authors should have paid attention to related work. It would have been better if the authors had related their findings to previous work on this topic.
The contribution of this work is that Twitter changes the way people communicate with each other. One person can broadcast messages via electronic gadgets such as cell phones and PDAs. Consequently, it has made millions of Twitter users the eyes and ears of the world. By cooperating with Twitter, TwitterStand offers a way to find related news by clustering data, and points to the source of news by using geotagging. The strong point of this method is that it locates the source of news, so TwitterStand can offer local news for a specific group of users; for example, the Iran election and the earthquake in California.
On the other hand, Twitter is an informal reporter, so the messages may contain misspellings and Net lingua. In addition, reports from local areas might use local vocabulary and local languages. It is therefore difficult to separate news from noise, and some messages might be missed because of unrecognized local languages. To fix these weak points, TwitterStand should have dealt with them by enlarging its corpus.
The main purpose of TwitterStand is to extract news, without noise, from the 140 characters of tweet messages. Because of the limitation on text length, this drawback leads to a high volume of messages, wrong spelling, and wrong grammar. Therefore, the authors have to separate news from noise. These problems are addressed by Naïve Bayes and by a clustering technique: term frequency-inverse document frequency.
However, the difficult points of this research are what counts as news and how far we can trust the reporters. These points are hard to deal with because news from some places is not important for other places, or the sources of news have low trustworthiness. The @breakingnews site, an automated site, for example, has over 1,000 followers. It was updated about the swine flu situation because its followers kept posting swine flu updates every 10 minutes. Nonetheless, most of the messages are rumours, for no one audits those tweets. In short, the number of tweets does not mean that we can trust the news from tweets.
From the motivation above, the authors clearly state the problem definition of their experiment. The goal of this work is to identify news from tweets; thus, the authors develop TwitterStand, a processing system, to separate news from noise, cluster tweets on the same topic, and locate the geographic content of clusters. Thus, this work deals with the problem of the 140-character restriction on tweets and the problem of noise in tweets. The authors point out these problems clearly. However, the authors do not directly define the efficiency or quality of their work; they just describe how it works. Hence, it would have been better if the researchers had stated the problem definition more directly.
There are several key concepts in this article. The first important concept is separating news from noise; this is the main goal of this research. It is necessary to get rid of noisy data in Twitter tweets. The authors apply the well-known Naïve Bayes algorithm to separate news from noise. Another key concept is clustering tweets to deal with a dynamic corpus, treating topics that are less than three days old as new, using a variant of the cosine similarity measure and a Gaussian algorithm. The other key concept is mapping the clusters to a geographic focus using geotagging, which applies Natural Language Processing mechanisms: Part of Speech tagging and Named Entity Recognition.
These are good key concepts for this work. The good results come from a good experimental method built on these key concepts.
The main point of this work is to capture tweets that correspond to late-breaking news. The authors choose 2,000 users related to breaking news and gather the tweets from these users. Then, they separate news from noise in the tweets using a Naïve Bayes classifier. After that, each tweet is grouped online into clusters; then a geographic focus is extracted for each cluster. However, there are some limitations in this experiment.
First, most tweets could be very noisy and biased because tweets come from users and many followers; many tweets could be opinions even though the authors handpick 2,000 users. The authors should therefore have specified the proportion of different categories of users. If many users come from the same group, the news may be biased and contain many opinions. The validity and reliability of this work depend on the reliability of the data, which comes from users. Hence, the authors should have paid more attention to selecting unbiased and reliable users.
Second, it is difficult to understand how news is separated from noise. Although the authors apply the well-known Naïve Bayes classifier and show its formula, they do not show in detail how they apply this classifier, how they train it on the training data, or how they test it on the testing data. In fact, they do not clearly describe the training corpus, such as how large it is; this affects the accuracy of classifying tweets as news or junk.
Moreover, it is difficult to understand the online clustering. The authors apply a Gaussian parameter to define active and inactive clusters; the active clusters are less than three days old, but the authors do not explain active clusters in more detail: they just describe and show the formulas of the cosine similarity measure and the Gaussian parameter. Thus, the authors should have focused on how easily readers can understand the method.
In addition, most readers would wonder how fragmentation of clusters on the same topic is handled. The authors do not explain in detail why they mark the older clusters as master clusters, even though the slave clusters, which are new topics, seem to be more important news.
Next, in the section on geographic focus, the authors apply geotagging to indicate geographic locations. They also apply Part of Speech tagging (POS) and Named Entity Recognition (NER), which belong to the area of Natural Language Processing (NLP). However, the authors do not describe this step in detail; they just introduce POS and NER. They also only refer to STEWARD and NewsStand, which are listed in the references. It would have been better if the authors had given more detail on this function and described how it works.
There are some key assumptions made by the authors. The first is that TwitterStand can separate news from noise, where a noise tweet does not belong to the news domain; such tweets come from users and followers. The second is that TwitterStand can cluster news tweets into sets of tweets. In this step, the authors assume that the tweets that come from the trusted seeders are reliable news, because this kind of tweet comes from users who publish news from newspapers and television in tweets. The third is the assumption of the Naïve Bayes classifier that the words in tweets are independent of each other.
The weak point of these assumptions is that if most users are not seeders, or there are no seeders, there would be a huge amount of noise, and the system would slow down because a lot of tweets would have to pass through the classifier. Furthermore, the problem with the Naïve Bayes classifier is that it treats each word as independent, meaning that it does not know the meaning of words. Therefore, we would see different topics for tweets with the same meaning; this is the drawback of the Naïve Bayes classifier.
Validation of Methodology
From the method above, the authors show that TwitterStand works, but they do not compare it with the findings of previous work; in fact, no related work is discussed in this article. Also, the authors do not show the results in tables or graphs. They just mention that it works, that it can reduce noise, that it can group or cluster the tweets, and that it can locate the geographic content, but they do not show more detail. Therefore, the results cannot be compared with other work. Moreover, we do not know much about the users; we just know that the authors choose 2,000 people who are related to news. We do not know the proportion of genders, how old they are, where they are, their occupations, and so on. These factors affect the accuracy of the results. Thus, this work is quite weak in reliability and validity. It would have been more reliable and valid if the authors had focused on these points.
In this article, the authors first introduce Twitter in general as a social networking website that allows users to communicate with others using short messages, tweets; these messages are broadcast on a wide variety of topics. Then they reveal the advantages of Twitter. In this research, the authors focus only on breaking news; thus, they explain Twitter in more detail in section 2. After that, the authors introduce the key strategies used to develop their proposed system, TwitterStand, along with its architecture, which is an important part of this article; they describe this part in more detail, including inputs, separating news from noise, online clustering of tweets, determining a cluster’s geographic focus, the user interface, topic hashtags, and the friend finder. Along with the architecture, the authors explain what they do and what techniques or algorithms they use; they also discuss the results, such as how well the proposed system works. Finally, the authors give concluding remarks and list references. This seems to be a good organization; however, the authors do not present related or previous work, which is also an important part of a paper. They just refer to some references in the paper, such as STEWARD and NewsStand. Moreover, it is difficult to understand the methodology because the authors do not separate the methodology and the results from the section on the architecture of TwitterStand; in fact, the concluding remarks, the last section, come directly after the architecture section. We also cannot see the results as tables or graphs. Obviously, it takes time to understand this article. Although this is a new and important paper, the authors could have given more attention to the fact that a well-organized paper helps readers understand it easily. It would have been better and more reliable if the authors had related their findings to previous work and presented their main findings in the form of tables or graphs.
E. Amitay, N. Har’El, R. Sivan, and A. Soffer. Web-a-Where: Geotagging web content. In Proc. of ACM SIGIR, pages 273-280, Sheffield, UK, July 2004.
. Retr. July 1, 2009.
N. Cohen. Twitter on the barricades: Six lessons learned.
. Pub. June 20, 2009.
Site profile for twitter.com.
. Retr. July
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley Interscience, New York, second
. Retr. June 17, 2009.
. Retr. July 1, 2009.
B. Heil and M. Piskorski. New Twitter research: Men follow men and nobody tweets.
. Pub. June 1, 2009.
A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities. In Proc. of SNA-KDD 2007 Workshop on Web mining and social network analysis, pages 56-65, San Jose, California, 2007.
D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, NJ,
M. D. Lieberman, H. Samet, J. Sankaranarayanan, and J. Sperling. STEWARD: architecture of a spatio-textual search engine. In Proc. of ACM GIS, pages 186-193, Seattle, WA, Nov. 2007.
McGiboney. Twitter’s tweet smell of success.
. Pub. Mar. 18,
M. Milian. Twitter sees earth-shaking activity during SoCal quake.
. Pub. July 30, 2008.
T. M. Mitchell. Machine Learning. McGraw-Hill, New York, NY, 1997.
. Retr. July 1, 2009.
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
M. Steinbach, G. Karypis, and V. Kumar. A
comparison of document clustering techniques. In KDD Workshop on Text Mining, pages 1
MA, Aug. 2000.
B. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling. NewsStand: A new view on news. In Proc. of ACM SIGSPATIAL GIS, pages 144-153, Irvine, CA, Nov. 2008.
. Retr. July 1, 2009.
. Retr. July 1, 2009.
. Retr. July 1,