Discovering Geographical Topics in the Twitter
Karthik Kumar Rangineni
Micro-blogging services have become important communication tools for online users for
spreading breaking news, individual opinions, and events. Social networking sites like Twitter,
Facebook, and Tumblr have started supporting location services in messages, either explicitly,
by letting users choose their places, or implicitly, by enabling geo-tagging, which associates
messages with latitudes and longitudes.
The authors' main challenge is to discover topics and identify users' interests from these
geo-tagged messages, which is difficult due to the sheer amount of data and the diversity of
language variations used on these services. The authors show high accuracy in location
estimation based on this model. Moreover, the algorithm identifies interesting topics based on
location and language.
The authors proposed a model that is flexible enough to embed all reasonable components of
content and geographical locations, as well as user preference modeling. Moreover, it scales to
real-world datasets with millions of documents and users. It utilizes both statistical topic
models and sparse coding techniques to provide a principled method for uncovering different
language patterns and common interests shared across the world.
The model associates the language used in a tweet with a particular location. The words used in
a tweet certainly depend on the author and the location where the tweet is written. Thus,
different geographical regions have different language variations, and topics have different
chances of being discussed in those regions.
There are two lines of related research.
Mei et al. [1] proposed a model based on probabilistic latent semantic indexing [2]. It assumes
that each word is drawn either from a universal background topic or from a location- and
time-dependent language model.
Wang et al. [3] introduced a model to incorporate locations. Rather than working with real
latitudes and longitudes, they use a fixed number of region labels and assume that each term is
associated with a location label.
Sizov [4] presented a model similar to [3], but replaced the multinomial distribution with two
Gaussian distributions for generating latitude and longitude.
Hao et al. [5] proposed a model built on [3], but introduced the notion of global topics and
local topics: more general terms are grouped into global topics, and terms related to local
events go to local topics. Papers [6] and [7] proposed similar models, in which a latent region
generates both the terms and the location of a particular document. The location is generated
from a region by a normal distribution, and the region is sampled from a multinomial
distribution. Wing and Baldridge [8] used an even simpler approach: documents are assigned to
cells of a geodesic grid, which turns the task into supervised learning and essentially amounts
to building naïve Bayes classifiers on geodesic grids.
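The grid-based approach in the last sentence can be illustrated with a minimal sketch. This is not the implementation from [8]. The square degree-based cells, the helper names (cell_of, train, predict_cell) and the add-alpha smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def cell_of(lat, lon, size=1.0):
    """Map a coordinate to a square grid cell of `size` degrees (hypothetical helper)."""
    return (math.floor(lat / size), math.floor(lon / size))

def train(docs, size=1.0):
    """docs: iterable of (tokens, lat, lon). Collects per-cell document and word counts."""
    cell_docs = Counter()
    cell_words = defaultdict(Counter)
    vocab = set()
    for tokens, lat, lon in docs:
        c = cell_of(lat, lon, size)
        cell_docs[c] += 1
        cell_words[c].update(tokens)
        vocab.update(tokens)
    return cell_docs, cell_words, vocab

def predict_cell(tokens, model, alpha=1.0):
    """Return the cell with the highest naive Bayes score (add-alpha smoothing assumed)."""
    cell_docs, cell_words, vocab = model
    n_docs = sum(cell_docs.values())
    best, best_score = None, -math.inf
    for c, n in cell_docs.items():
        total = sum(cell_words[c].values())
        score = math.log(n / n_docs)  # log prior of the cell
        for t in tokens:
            score += math.log((cell_words[c][t] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```

Each grid cell plays the role of a class label, so predicting a location reduces to ordinary text classification.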
The authors used the Sparse Additive Generative model to take different aspects into account
without having to infer a complex indicator variable distinguishing the set of causes. The model
in this paper rests on the following intuitions:
Words used in a tweet depend on both the location and the topic of the tweet.
Different geographical regions have different language variations, and topics have different
chances of being discussed in different regions.
Users tend to appear in a handful of geographical locations.
Figure 1: Model Framework
In this model, each tweet consists of three parts: w is the word vector for the tweet, following
a simple bag-of-words assumption; l is a real-valued pair of latitude and longitude where the
tweet is written; and u is the user id of the author of the tweet.
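The three parts of a tweet can be written down as a small record type. This is only a sketch. The class name Tweet, the helper make_tweet and the whitespace tokenization are assumptions, and the location field is called l here.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Tweet:
    w: Counter  # bag-of-words term counts (word order is discarded)
    l: tuple    # (latitude, longitude) where the tweet was written
    u: str      # user id of the author

def make_tweet(text, lat, lon, user_id):
    """Build a Tweet from raw text; lowercased whitespace tokenization is an assumption."""
    tokens = text.lower().split()
    return Tweet(w=Counter(tokens), l=(lat, lon), u=user_id)
```

Under the bag-of-words assumption, two tweets with the same words in a different order have identical w.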
Figure 1 shows the framework of the model in this paper.
For each tweet, the model generates the location, the topic, and the terms in the tweet
consecutively. To generate the region index r, the authors use a multinomial model built from
the global region distribution and the user-dependent distribution. Once the region index is
generated, it is used to draw the location l, the topic z, and the terms w. Each location l_d is
drawn from a region-dependent multivariate normal distribution N(μ, Σ), where μ is the mean
location of a latent region and Σ is the covariance matrix of that region.
Once the region and the location are generated, a topic z is selected depending on both the
latent region and the author of the tweet: the global topic distribution, the region-dependent
topic distribution, and the user-dependent topic distribution are combined to generate it. Each
word in the tweet is then drawn from the aggregate of the global term distribution, the user
dependent topic distribution, and a topic matrix in which each row is a distribution over terms.
The distributions are given Laplace priors, in keeping with the sparse coding approach.
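The generative steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes that components are combined by adding log-weights and normalizing with a softmax (the usual sparse additive construction), that each region has a diagonal covariance, and that word weights use only the global term component plus one row of the topic matrix. The parameter names in params are hypothetical.

```python
import math
import random

def softmax(logits):
    """Turn additive log-weights into a probability vector."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine(*components):
    """Add component vectors element-wise, then normalize (assumed additive form)."""
    return softmax([sum(vals) for vals in zip(*components)])

def draw(probs, rng):
    """Sample one index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_tweet(user, params, n_words, rng):
    # 1. Region: global region weights plus the user's own region weights.
    r = draw(combine(params["region_global"], params["region_user"][user]), rng)
    # 2. Location: Gaussian around the region mean (diagonal covariance assumed).
    loc = tuple(rng.gauss(m, s) for m, s in zip(params["mu"][r], params["sigma"][r]))
    # 3. Topic: global, region-dependent and user-dependent topic weights combined.
    z = draw(combine(params["topic_global"],
                     params["topic_region"][r],
                     params["topic_user"][user]), rng)
    # 4. Words: global term weights plus row z of the topic matrix (simplified).
    word_probs = combine(params["term_global"], params["topic_terms"][z])
    words = [draw(word_probs, rng) for _ in range(n_words)]
    return r, loc, z, words
```

Adding components in log space means a region or user only has to store how it deviates from the global weights, which is where sparsity becomes useful.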
The authors then combined the different indices and applied EM steps to learn the parameters of
the model. To maximize the likelihood, a gradient-based optimization method is used in this
process. The process is not very stable because only one sample of regional assignments is taken
for each tweet. The authors provided a geographical location modeling to effectively sample
latent regions.
The authors used a previously published algorithm as the baseline. In the baseline, no
user-level preferences are learned. The prediction process can be divided into two steps:
choosing the region index that maximizes the tweet likelihood, and using the mean location of
that region as the predicted location. The authors combine their models in different ways: using
only the topic model, combining the topic and region models, and using the full model. The
results are shown in Figure 2. The authors also compared the models with and without the
geographical location modeling.
Figure 2: Experimental results of different combinations of models
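The two-step prediction procedure can be sketched as below. This is a simplification under stated assumptions: each region is reduced to a prior, a mean location and a unigram word distribution, whereas the full model would also account for topics and users. The function names, the dictionary keys and the probability floor for unseen words are hypothetical.

```python
import math

def log_likelihood(tokens, word_probs, floor=1e-9):
    """Log-likelihood of the tokens under one region's word distribution."""
    return sum(math.log(word_probs.get(t, floor)) for t in tokens)

def predict_location(tokens, regions):
    """Step 1: pick the region with the highest score. Step 2: return its mean location."""
    best = max(
        regions,
        key=lambda r: math.log(r["prior"]) + log_likelihood(tokens, r["word_probs"]),
    )
    return best["mean"]
```

Because the prediction is always a region mean, the achievable accuracy depends on how fine-grained the learned regions are.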
The authors compare their models with other algorithms in Figure 3; their full model achieved
the best results in the comparison.
Figure 3: Comparison between different algorithms
Contribution & Conclusion
The main contributions are as follows:
The authors proposed an additive generative model of content and locations that incorporates
multiple facets of micro-blogging environments in an integral fashion.
The sparse coding techniques and Bayesian treatments are smoothly embedded in their modeling,
resulting in an effective implementation.
The models in this paper outperform several state-of-the-art algorithms in the task of location
prediction, and they demonstrate interesting patterns in real-world datasets.
In this paper, the authors address the problem of modeling geographical topical patterns on
Twitter by introducing a novel sparse generative model, which utilizes both statistical topic
models and sparse coding techniques to provide a principled method for uncovering different
language patterns and common interests shared across the world.
For future work, the authors plan to model human mobility explicitly by introducing user-level
regional components. Temporal factors will also be considered for the task of location
prediction.
References
[1] Mei, Qiaozhu, Chao Liu, Hang Su, and ChengXiang Zhai. "A probabilistic approach to
spatiotemporal theme pattern mining on weblogs." In Proceedings of the 15th international
conference on World Wide Web, pp. 533-542. ACM, 2006.
[2] Hofmann, Thomas. "Unsupervised learning by probabilistic latent semantic analysis."
Machine Learning 42, no. 1-2 (2001): 177-196.
[3] Wang, Chong, Jinggang Wang, Xing Xie, and Wei-Ying Ma. "Mining geographic knowledge using
location aware topic model." In Proceedings of the 4th ACM workshop on Geographical information
retrieval, pp. 65-70. ACM, 2007.
[4] Sizov, Sergej. "GeoFolk: latent spatial semantics in web 2.0 social media." In Proceedings
of the third ACM international conference on Web search and data mining, pp. 281-290. ACM, 2010.
[5] Hao, Qiang, Rui Cai, Changhu Wang, Rong Xiao, Jiang-Ming Yang, Yanwei Pang, and Lei Zhang.
"Equip tourists with knowledge mined from travelogues." In Proceedings of the 19th
international conference on World wide web, pp. 401-410. ACM, 2010.
[6] Yin, Zhijun, Liangliang Cao, Jiawei Han, Chengxiang Zhai, and Thomas Huang. "Geographical
topic discovery and comparison." In Proceedings of the 20th international conference on World
wide web, pp. 247-256. ACM, 2011.
[7] Eisenstein, Jacob, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. "A latent variable
model for geographic lexical variation." In Proceedings of the 2010 Conference on Empirical
Methods in Natural Language Processing, pp. 1277-1287. Association for Computational
Linguistics, 2010.
[8] Wing, Benjamin, and Jason Baldridge. "Simple supervised document geolocation with geodesic
grids." In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, vol. 1, pp. 955-964. Association for Computational
Linguistics, 2011.