Discovering Geographical Topics in the Twitter Stream
Karthik Kumar Rangineni
Zhi Liu
Introduction
Micro-blogging services have become important communication tools for online users, who use them to spread breaking news, individual opinions, and events. Social networking sites like Twitter, Facebook, and Tumblr have started supporting location services in messages, either explicitly, by letting users choose their places, or implicitly, by enabling geo-tagging, which associates messages with latitudes and longitudes. The main challenge the authors address is discovering topics and identifying user interests from these geo-tagged messages, given the sheer amount of data and the diversity of language variations used on these location-sharing services. The authors show high accuracy in location estimation based on their model. Moreover, the algorithm identifies interesting topics based on location and language.
The authors proposed a model that is flexible enough to embed all reasonable components of content and geographical locations, as well as user preference modeling. Moreover, it scales to real-world datasets with millions of documents and users. It utilizes both statistical topic models and sparse coding techniques to provide a principled method for uncovering different language patterns and common interests shared across the world.
Many factors influence the language used in a tweet with a particular location: the words used in a tweet certainly depend on the author and on the location where the tweet is written. Thus, different geographical regions have different language variations, and topics have different chances of being discussed in those regions.
Related Work
There are two lines of related research. Paper [1] proposed a model based on probabilistic latent semantic indexing. It assumes that each word is drawn either from a universal background topic or from a location- and time-dependent language model. Paper [3] introduced a fully Bayesian generative model to incorporate locations. Rather than working with real latitudes and longitudes, the authors use a fixed number of region labels and assume that each term is associated with a location label. Paper [4] presented a model similar to [3], but the authors replaced the multinomial distribution with two Gaussian distributions for generating latitude and longitude, respectively. Paper [5] proposed a model built on [3], but the authors introduced the notion of global topics and local events, where more general terms are grouped into global topics and terms related to local events go to local topics. Papers [6] and [7] proposed essentially the same model: a latent region generates the terms and the location of a particular document, the location is generated from a region by a normal distribution, and the region is sampled from a multinomial distribution. Wing and Baldridge [8] used an even simpler approach in which documents are assigned to geodesic grid cells, so that a supervised learning method can be used; this essentially amounts to building naïve Bayes classifiers on geodesic grids.
Models
The authors used the Sparse Additive Generative model to take different aspects into account without having to infer a complex indicator variable distinguishing the set of causes. The model in this paper builds on the following intuitions:
Words used in a tweet depend on both the location and the topic of the tweet.
Different geographical regions have different language variations. Topics have different chances of being discussed in different regions.
Users tend to appear in a handful of geographical locations.
Figure 1: Model Framework
In this model, each tweet consists of three parts: w is the word vector for the tweet, following a simple bag-of-words assumption; l is a real-valued pair of latitude and longitude giving where the tweet was written; and u is the user id of the author of the tweet. Figure 1 shows the framework of the model in this paper.
For each tweet, the model generates the location, the topic, and the terms in the tweet consecutively. To generate the region index r, the authors use a multinomial model that combines the global region distribution and the user-dependent region distribution. Once generated, the region index is used in drawing the location l, selecting the topic z, and generating the terms w. Each location l_d is drawn from a region-dependent multivariate normal distribution N(μ, Σ), where μ is the mean location of the latent region and Σ is its covariance matrix.
Once the region and the location are generated, a topic z is selected depending on both the latent region and the author of the tweet. The global topic distribution, the region-dependent topic distribution, and the user-dependent topic distribution are combined to generate it. Each word in the tweet is then generated by drawing from an aggregate distribution that combines the global term distribution, the user-dependent topic distribution, and a topic matrix in which each row is a distribution over terms. These distributions are modeled with Laplace distributions.
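The generative order described above (region, then location, then topic, then words) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. It assumes that "combining" distributions means adding log-scale scores and normalizing, and that the word step adds a region-specific term deviation, following the first intuition listed above. All names (generate_tweet, eta_global, theta_region, phi_topic, and so on) and all parameter values are hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn additive log-scale scores into a probability vector."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def add(*vectors):
    """Element-wise sum of equally long vectors (the 'additive' part)."""
    return [sum(vals) for vals in zip(*vectors)]

def draw(probs, rng):
    """Sample one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def draw_location(mean, cov, rng):
    """Sample (lat, lon) from a bivariate normal via a 2x2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)

def generate_tweet(user, p, n_words, rng):
    """Generate region, location, topic and words for one tweet, in that order."""
    # Region: global region scores plus user-dependent region scores.
    r = draw(softmax(add(p["eta_global"], p["eta_user"][user])), rng)
    # Location: region-dependent normal with mean mu[r] and covariance sigma[r].
    loc = draw_location(p["mu"][r], p["sigma"][r], rng)
    # Topic: global, region-dependent and user-dependent topic scores.
    z = draw(softmax(add(p["theta_global"], p["theta_region"][r],
                         p["theta_user"][user])), rng)
    # Words: assumption of this sketch, a global term, a region term and a topic row.
    word_probs = softmax(add(p["phi_global"], p["phi_region"][r], p["phi_topic"][z]))
    words = [draw(word_probs, rng) for _ in range(n_words)]
    return r, loc, z, words

# Toy parameters (all values made up): 2 users, 2 regions, 2 topics, 4 terms.
params = {
    "eta_global": [0.0, 0.0],
    "eta_user": [[2.0, 0.0], [0.0, 2.0]],
    "mu": [(40.7, -74.0), (34.1, -118.2)],
    "sigma": [[[0.5, 0.1], [0.1, 0.5]], [[0.5, 0.1], [0.1, 0.5]]],
    "theta_global": [0.0, 0.0],
    "theta_region": [[1.0, 0.0], [0.0, 1.0]],
    "theta_user": [[0.5, 0.0], [0.0, 0.5]],
    "phi_global": [0.0, 0.0, 0.0, 0.0],
    "phi_region": [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    "phi_topic": [[0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]],
}

rng = random.Random(0)
region, location, topic, words = generate_tweet(0, params, 5, rng)
print(region, location, topic, words)
```

Working on the log scale keeps the combination step a plain sum, so a component that is mostly zero (as a sparse prior would encourage) simply leaves the other components unchanged.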
The authors then combine the different indices and apply an EM procedure to learn the parameters of the model. To maximize the likelihood, a gradient-based optimization method is used in this process. The process is not very stable because only one sample of the regional assignment is taken for each tweet. The authors therefore provide a geographical location model to sample latent regions effectively.
Experiment
The authors used the algorithm in [6] as the baseline. The baseline learns no user-level preferences. The prediction process can be divided into two steps: choosing the region index that maximizes the tweet likelihood, and using the mean location of that region as the predicted location. The authors combine their models in different ways: using only the topic model, combining the topic and region models, and using the full model. The results are shown in Figure 2. The authors also compared the models with and without the geographical location model.
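The two-step prediction described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the authors' procedure: the tweet likelihood for a region is reduced here to a region prior times a region-level unigram word model, with no topics or user preferences. The function name predict_location and all numbers are hypothetical.

```python
import math

def predict_location(words, region_log_prior, region_word_logprob, region_means):
    """Step 1: pick the region with the highest tweet log-likelihood.
    Step 2: return that region's mean location as the prediction."""
    best_region, best_score = None, -math.inf
    for r, log_prior in enumerate(region_log_prior):
        # Assumed likelihood: log p(r) + sum over words of log p(w | r).
        score = log_prior + sum(region_word_logprob[r][w] for w in words)
        if score > best_score:
            best_region, best_score = r, score
    return best_region, region_means[best_region]

# Toy setup (made-up values): 2 regions, vocabulary of 3 term ids.
region_log_prior = [math.log(0.5), math.log(0.5)]
region_word_logprob = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],  # region 0 favors term 0
    [math.log(0.1), math.log(0.2), math.log(0.7)],  # region 1 favors term 2
]
region_means = [(40.7, -74.0), (34.1, -118.2)]

tweet = [2, 2, 1]  # term ids of one tweet
region, location = predict_location(
    tweet, region_log_prior, region_word_logprob, region_means)
print(region, location)  # region 1, mean (34.1, -118.2)
```

Because the prediction is always a region mean, the achievable accuracy depends on how finely the learned regions cover the map.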
Figure 2: Experimental results for different combinations of models
The authors compare their models with other algorithms in Figure 3; their full model achieved the best results in the comparison.
Figure 3: Comparison between different algorithms
Contribution & Conclusion
The main contributions are as follows:
The authors proposed an additive generative model of content and locations that incorporates multiple facets of micro-blogging environments in an integral fashion.
The sparse coding techniques and Bayesian treatments are smoothly embedded in the modeling, resulting in an effective implementation.
The models in this paper outperform several state-of-the-art algorithms in the task of location prediction, and they demonstrate interesting patterns in real-world datasets.
In this paper, the authors address the problem of modeling geographical topical patterns on Twitter by introducing a novel sparse generative model, which utilizes both statistical topic models and sparse coding techniques to provide a principled method for uncovering different language patterns and common interests shared across the world. For future work, the authors plan to model human mobility explicitly by introducing user-level regional components; temporal factors will also be considered for the task of location prediction.
References
1. Mei, Qiaozhu, Chao Liu, Hang Su, and ChengXiang Zhai. "A probabilistic approach to spatiotemporal theme pattern mining on weblogs." In Proceedings of the 15th international conference on World Wide Web, pp. 533-542. ACM, 2006.
2. Hofmann, Thomas. "Unsupervised learning by probabilistic latent semantic analysis." Machine Learning 42, no. 1 (2001): 177-196.
3. Wang, Chong, Jinggang Wang, Xing Xie, and Wei-Ying Ma. "Mining geographic knowledge using location aware topic model." In Proceedings of the 4th ACM workshop on Geographical information retrieval, pp. 65-70. ACM, 2007.
4. Sizov, Sergej. "GeoFolk: latent spatial semantics in web 2.0 social media." In Proceedings of the third ACM international conference on Web search and data mining, pp. 281-290. ACM, 2010.
5. Hao, Qiang, Rui Cai, Changhu Wang, Rong Xiao, Jiang-Ming Yang, Yanwei Pang, and Lei Zhang. "Equip tourists with knowledge mined from travelogues." In Proceedings of the 19th international conference on World wide web, pp. 401-410. ACM, 2010.
6. Yin, Zhijun, Liangliang Cao, Jiawei Han, Chengxiang Zhai, and Thomas Huang. "Geographical topic discovery and comparison." In Proceedings of the 20th international conference on World wide web, pp. 247-256. ACM, 2011.
7. Eisenstein, Jacob, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. "A latent variable model for geographic lexical variation." In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1277-1287. Association for Computational Linguistics, 2010.
8. Wing, Benjamin, and Jason Baldridge. "Simple supervised document geolocation with geodesic grids." In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 955-964. 2011.