Discovering Geographical Topics in the Twitter Stream

Karthik Kumar Rangineni

Zhi Liu

Introduction

Micro-blogging services have become important communication tools for online users, who use them
to spread breaking news, individual opinions, and events. Social networking sites such as Twitter,
Facebook, and Tumblr have started supporting location services in messages, either explicitly, by
letting users choose their places, or implicitly, by enabling geo-tagging, which associates
messages with latitudes and longitudes.

The authors' main task is to discover topics and identify user interests from these geo-tagged
messages. This is challenging because of the sheer amount of data and the diversity of language
variations used on these location-sharing services.

The authors show high accuracy in location estimation based on their model. Moreover, the
algorithm identifies interesting topics based on location and language.

The authors propose a model that is flexible enough to embed all reasonable components of content
and geographical locations, as well as user preference modeling. Moreover, it scales to real-world
datasets with millions of documents and users. It utilizes both statistical topic models and
sparse coding techniques to provide a principled method for uncovering different language patterns
and common interests shared across the world.

Many factors influence the language used in a tweet at a particular location. The words used in a
tweet depend on both the author and the location where the tweet is written. Thus, different
geographical regions have different language variations, and topics have different chances of
being discussed in those regions.

Related Work

There are two lines of related research.

Paper [1] proposed a model based on probabilistic latent semantic indexing. It assumes that each
word is drawn either from a universal background topic or from a location- and time-dependent
language model.

Paper [3] introduced a fully Bayesian generative model to incorporate locations. Rather than
working with real latitudes and longitudes, the authors use a fixed number of region labels and
assume that each term is associated with a location label.

Paper [4] presented a model similar to [3], but the authors replace the multinomial distribution
with two Gaussian distributions that generate latitude and longitude respectively. Paper [5]
proposed a model built on [3], but the authors introduced the notion of global topics and local
topics: more general terms are grouped into global topics, and terms related to local events go
to local topics. Papers [6] and [7] proposed the same model, in which a latent region generates
both the terms and the location of a particular document. The location is generated from a region
by a normal distribution, and the region is sampled from a multinomial distribution. Wing and
Baldridge [8] used an even simpler approach: documents are assigned to geodesic grid cells and a
supervised learning method is applied, which essentially amounts to building naïve Bayes
classifiers on geodesic grids.
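
The grid-based approach attributed to [8] can be illustrated with a minimal sketch. It assumes a flat one-degree latitude/longitude grid, add-one smoothing, and toy data; the function names (cell_of, train_grid_nb, predict_cell) are hypothetical and are not taken from [8].

```python
import math
from collections import Counter, defaultdict

def cell_of(lat, lon, size=1.0):
    """Map a coordinate to a grid cell (the cell size in degrees is an assumption)."""
    return (math.floor(lat / size), math.floor(lon / size))

def train_grid_nb(docs, size=1.0):
    """docs: list of (tokens, lat, lon). Collect per-cell word and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for tokens, lat, lon in docs:
        cell = cell_of(lat, lon, size)
        word_counts[cell].update(tokens)
        doc_counts[cell] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict_cell(tokens, model, alpha=1.0):
    """Return the cell with the highest naive Bayes log score for the tokens."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_cell, best_score = None, -math.inf
    for cell in doc_counts:
        total = sum(word_counts[cell].values())
        score = math.log(doc_counts[cell] / n_docs)  # cell prior
        for t in tokens:
            # Smoothed per-cell word likelihood
            score += math.log((word_counts[cell][t] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best_cell, best_score = cell, score
    return best_cell

# Toy training data (invented for illustration only)
docs = [
    (["subway", "broadway", "pizza"], 40.7, -74.0),
    (["broadway", "show"], 40.8, -73.9),
    (["beach", "surf", "tacos"], 34.0, -118.5),
    (["surf", "sunset"], 34.1, -118.4),
]
model = train_grid_nb(docs)
predicted = predict_cell(["surf", "tacos"], model)
```

Each grid cell acts as a class label, so the geolocation problem reduces to ordinary text classification.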

Models

The authors use the Sparse Additive Generative model to take different aspects into account
without having to infer a complex indicator variable distinguishing the set of causes. The model
in this paper is built on the following intuitions:

• Words used in a tweet depend on both the location and the topic of the tweet.

• Different geographical regions have different language variations. Topics have different
chances of being discussed in different regions.

• Users tend to appear in a handful of geographical locations.
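
The additive idea behind these intuitions can be sketched as follows: each facet (region, topic) contributes a mostly-zero deviation vector that is added to a background log-frequency vector, and the sum is normalized with a softmax. This is a minimal illustration under assumed toy numbers; the names additive_distribution, region_dev and topic_dev are hypothetical.

```python
import math

def softmax(scores):
    """Normalize log-space scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def additive_distribution(background, *deviations):
    """Add sparse deviation vectors to background log-frequencies, then normalize."""
    combined = list(background)
    for dev in deviations:
        combined = [c + d for c, d in zip(combined, dev)]
    return softmax(combined)

vocab = ["the", "coffee", "surf", "subway"]
# Background log-frequencies (toy values, assumed for illustration)
background = [math.log(p) for p in (0.7, 0.1, 0.1, 0.1)]
# Sparse deviations: most entries are zero, so each facet only touches a few words
region_dev = [0.0, 0.0, 1.5, 0.0]   # this region talks more about "surf"
topic_dev = [0.0, 1.0, 0.0, 0.0]    # this topic talks more about "coffee"

p = additive_distribution(background, region_dev, topic_dev)
```

Because the facets are summed rather than selected, no switching variable is needed to decide which facet generated a given word.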


Figure 1: Model Framework

In this model, each tweet consists of three parts: w is the word vector for the tweet, following a
simple bag-of-words assumption; l is a real-valued pair of latitude and longitude where the tweet
is written; and u is the user ID of the author of the tweet.

Figure 1 shows the framework of the model in this paper. For each tweet, the model generates the
location, the topic, and the terms in the tweet consecutively. To generate the region index r, the
authors use a multinomial model that combines the global region distribution and a user-dependent
region distribution. Once the region index is generated, it is used to draw the location l, select
the topic z, and generate the terms w. Each location l_d is drawn from a region-dependent
multivariate normal distribution N(μ, Σ), where μ is the mean location of a latent region and Σ is
the covariance matrix of that region. Once the region and the location are generated, a topic z is
selected depending on both the latent region and the author of the tweet: the global topic
distribution, the region-dependent topic distribution, and the user-dependent topic distribution
are combined to generate it. Each word in the tweet is then drawn from an aggregate distribution
that combines the global term distribution, the user-dependent topic distribution, and a topic
matrix in which each row is a distribution over terms. The distributions are modeled with a
Laplace distribution.
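
The generative steps described above can be sketched roughly as follows. This is an illustration, not the authors' implementation: it assumes a diagonal covariance (independent latitude and longitude), adds a region-dependent term deviation in line with the first intuition listed earlier, and uses an invented parameter dictionary whose keys are all hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def add(*vectors):
    """Element-wise sum of equally long score vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def sample_index(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_tweet(params, user, n_words, rng):
    # 1. Region: global region scores plus user-dependent region scores
    r = sample_index(softmax(add(params["region_global"], params["region_user"][user])), rng)
    # 2. Location: Gaussian around the region mean (diagonal covariance is an assumption)
    mean, std = params["region_mean"][r], params["region_std"][r]
    loc = (rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
    # 3. Topic: global, region-dependent and user-dependent topic scores combined
    z = sample_index(softmax(add(params["topic_global"],
                                 params["topic_region"][r],
                                 params["topic_user"][user])), rng)
    # 4. Words: background terms plus region deviation plus the selected topic row
    word_probs = softmax(add(params["term_global"],
                             params["term_region"][r],
                             params["topic_terms"][z]))
    words = [params["vocab"][sample_index(word_probs, rng)] for _ in range(n_words)]
    return r, loc, z, words

# Toy parameters, invented for illustration only
params = {
    "vocab": ["coffee", "surf", "subway"],
    "region_global": [0.0, 0.0],
    "region_user": {"alice": [1.0, 0.0]},
    "region_mean": [(40.7, -74.0), (34.0, -118.4)],
    "region_std": [(0.1, 0.1), (0.1, 0.1)],
    "topic_global": [0.0, 0.0],
    "topic_region": [[0.5, 0.0], [0.0, 0.5]],
    "topic_user": {"alice": [0.0, 0.3]},
    "term_global": [math.log(0.4), math.log(0.3), math.log(0.3)],
    "term_region": [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    "topic_terms": [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
}

r, loc, z, words = generate_tweet(params, "alice", 5, random.Random(0))
```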

The authors then combine the different indices and apply an EM procedure to learn the parameters
of the model. To maximize the likelihood, a gradient-based optimization method is used in this
process. The process is not very stable because only one sample of regional assignments is taken
for each tweet. The authors therefore provide a geographical location model to sample latent
regions effectively.
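
To make the role of the sampled region assignments concrete, here is a heavily simplified sketch of one sampling-based EM iteration. It uses locations only, with a fixed isotropic variance, and ignores words, topics and users entirely, so it is a stand-in for the general idea rather than the paper's procedure. The names region_posterior and stochastic_em_step are hypothetical.

```python
import math
import random

def region_posterior(point, means, priors, sigma=1.0):
    """Posterior over regions for one location (isotropic Gaussian, fixed sigma)."""
    weights = []
    for mean, prior in zip(means, priors):
        sq_dist = (point[0] - mean[0]) ** 2 + (point[1] - mean[1]) ** 2
        weights.append(prior * math.exp(-sq_dist / (2 * sigma ** 2)))
    total = sum(weights)  # assumes at least one region is close enough to avoid underflow
    return [w / total for w in weights]

def stochastic_em_step(points, means, priors, rng, sigma=1.0):
    """Draw ONE region sample per point, then re-estimate the region means."""
    assigned = [[] for _ in means]
    for p in points:
        post = region_posterior(p, means, priors, sigma)
        r = rng.choices(range(len(means)), weights=post, k=1)[0]
        assigned[r].append(p)
    new_means = []
    for old, pts in zip(means, assigned):
        if pts:
            new_means.append((sum(x for x, _ in pts) / len(pts),
                              sum(y for _, y in pts) / len(pts)))
        else:
            new_means.append(old)  # keep the old mean for an empty region
    return new_means

# Two well-separated toy clusters (invented data)
points = [(0.0, 0.0), (0.5, -0.5), (-0.5, 0.5),
          (10.0, 10.0), (10.5, 9.5), (9.5, 10.5)]
means = [(1.0, 1.0), (9.0, 9.0)]
priors = [0.5, 0.5]
new_means = stochastic_em_step(points, means, priors, random.Random(0))
```

With clearly separated regions a single sample per point is harmless; when the posteriors are close to uniform, the single draw becomes noisy, which is one way to read the instability noted above.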

Experiment

The authors use the algorithm in [6] as the baseline. The baseline does not learn any user-level
preferences. The prediction process can be divided into two steps: choosing the region index that
maximizes the tweet likelihood, and using the mean location of that region as the predicted
location. The authors combine their models in different ways: using only the topic model,
combining the topic and region models, and using the full model. The results are shown in
Figure 2. The authors also compare the models with and without the geographical location model.
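
The two-step prediction can be sketched as below, assuming each region is summarized by a prior, a term distribution and a mean location. The region dictionaries, the "unk" fallback probability and the function names are hypothetical and chosen only for illustration.

```python
import math

def tweet_log_likelihood(tokens, region):
    """Log prior of the region plus log probability of each token under it."""
    score = math.log(region["prior"])
    for t in tokens:
        # Unseen words fall back to a small assumed probability
        score += math.log(region["term_probs"].get(t, region["unk"]))
    return score

def predict_location(tokens, regions):
    # Step 1: choose the region that maximizes the tweet likelihood
    best = max(regions, key=lambda reg: tweet_log_likelihood(tokens, reg))
    # Step 2: use the mean location of that region as the prediction
    return best["mean"]

# Toy regions (invented values)
regions = [
    {"prior": 0.5, "mean": (40.7, -74.0), "unk": 0.01,
     "term_probs": {"subway": 0.3, "broadway": 0.3, "coffee": 0.2}},
    {"prior": 0.5, "mean": (34.0, -118.4), "unk": 0.01,
     "term_probs": {"surf": 0.4, "beach": 0.3, "coffee": 0.1}},
]

predicted = predict_location(["surf", "coffee"], regions)
```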


Figure 2: Experimental results of different combinations of models

The authors compare their models with other algorithms in Figure 3; their full model achieved the
best results in the comparison.


Figure 3: Comparison between different algorithms

Contribution & Conclusion

The main contributions are as follows:

• The authors propose an additive generative model of content and locations that incorporates
multiple facets of micro-blogging environments in an integral fashion.

• Sparse coding techniques and Bayesian treatments are smoothly embedded in the model, resulting
in an effective implementation.

• The models in this paper outperform several state-of-the-art algorithms in the task of location
prediction and reveal interesting patterns in real-world datasets.

In this paper, the authors address the problem of modeling geographical topical patterns on
Twitter by introducing a novel sparse generative model, which utilizes both statistical topic
models and sparse coding techniques to provide a principled method for uncovering different
language patterns and common interests shared across the world. For future work, the authors plan
to model human mobility explicitly by introducing user-level regional components; temporal
factors will also be considered for the task of location prediction.


References

1. Mei, Qiaozhu, Chao Liu, Hang Su, and ChengXiang Zhai. "A probabilistic approach to
spatiotemporal theme pattern mining on weblogs." In Proceedings of the 15th international
conference on World Wide Web, pp. 533-542. ACM, 2006.

2. Hofmann, Thomas. "Unsupervised learning by probabilistic latent semantic analysis." Machine
Learning 42, no. 1 (2001): 177-196.

3. Wang, Chong, Jinggang Wang, Xing Xie, and Wei-Ying Ma. "Mining geographic knowledge using
location aware topic model." In Proceedings of the 4th ACM workshop on Geographical information
retrieval, pp. 65-70. ACM, 2007.

4. Sizov, Sergej. "Geofolk: latent spatial semantics in web 2.0 social media." In Proceedings of
the third ACM international conference on Web search and data mining, pp. 281-290. ACM, 2010.

5. Hao, Qiang, Rui Cai, Changhu Wang, Rong Xiao, Jiang-Ming Yang, Yanwei Pang, and Lei Zhang.
"Equip tourists with knowledge mined from travelogues." In Proceedings of the 19th international
conference on World wide web, pp. 401-410. ACM, 2010.

6. Yin, Zhijun, Liangliang Cao, Jiawei Han, Chengxiang Zhai, and Thomas Huang. "Geographical
topic discovery and comparison." In Proceedings of the 20th international conference on World
wide web, pp. 247-256. ACM, 2011.

7. Eisenstein, Jacob, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. "A latent variable model
for geographic lexical variation." In Proceedings of the 2010 Conference on Empirical Methods in
Natural Language Processing, pp. 1277-1287. Association for Computational Linguistics, 2010.

8. Wing, Benjamin, and Jason Baldridge. "Simple supervised document geolocation with geodesic
grids." In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, vol. 1, pp. 955-964. 2011.