Discovering Overlapping Groups in Social Media

Nov 25, 2013


Xufei Wang, Lei Tang, Huiji Gao, and Huan Liu

xufei.wang@asu.edu

Arizona State University

Social Media


Facebook
  500 million active users
  50% of users log on to Facebook every day

Twitter
  100 million users
  300,000 new users every day
  55 million tweets every day

Flickr
  12 million members
  5 billion photos

Activities in Social Media


Connect with others to form “Friends”

Interact with others (comment, discussion, messaging)

Bookmark websites/URLs (StumbleUpon, Delicious)

Join groups where they explicitly exist (Flickr, YouTube)

Write blogs (WordPress, Myspace)

Update status (Twitter, Facebook)

Share content (Flickr, YouTube, Delicious)

Community Structure


Studying behavior
  Individual level? Too many users
  Site level? Loses too much detail
  Community level? Yes: provides information at varying granularity

Overlapping Communities

Figure: overlapping groups labeled Colleagues, Family, and Neighbors.

Related Work


Disjoint Community Detection
  Modularity Maximization
  Based on link structure (how to understand?)

Overlapping Community Detection
  Soft Clustering (clustering is dense)
  CFinder (efficiency and scalability)

Co-clustering
  Disjoint
  Understanding groups by words (tags)

Problem Statement


Given a user-tag subscription matrix M and the number of clusters k, find k overlapping communities which consist of both users and tags.

Figure: example user-tag graph with users u1-u5 and tags t1-t4.

Our Contributions


Extracting overlapping communities that better reflect reality

Clustering on a user-tag graph. Tags are informative in identifying user interests

Understanding groups by looking at tags within each group

Edge-centric View

Cluster edges instead of nodes into disjoint groups

One node can belong to multiple groups

One edge belongs to one group

Figure: the example user-tag graph (users u1-u5, tags t1-t4).

Edge-centric View

In an edge-centric view:

edge  u1  u2  u3  u4  u5  t1  t2  t3  t4
e1     1   0   0   0   0   1   0   0   0
e2     1   0   0   0   0   0   1   0   0
e3     0   1   0   0   0   1   0   0   0
e4     0   1   0   0   0   0   1   0   0
e5     0   0   1   0   0   0   1   0   0
e6     0   0   1   0   0   0   0   1   0
e7     0   0   0   1   0   0   0   1   0
e8     0   0   0   1   0   0   0   0   1
e9     0   0   0   0   1   0   0   1   0
e10    0   0   0   0   1   0   0   0   1
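The edge-by-node table can be generated mechanically from the list of user-tag links. The following is a minimal sketch; the edge list is read off the example table, and the names `EDGES`, `USERS`, `TAGS`, and `edge_features` are illustrative assumptions, not part of the original slides.

```python
# Toy data: the ten user-tag links from the example table (e1..e10).
EDGES = [
    ("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"),
    ("u3", "t2"), ("u3", "t3"), ("u4", "t3"), ("u4", "t4"),
    ("u5", "t3"), ("u5", "t4"),
]
USERS = ["u1", "u2", "u3", "u4", "u5"]
TAGS = ["t1", "t2", "t3", "t4"]


def edge_features(edges, users, tags):
    """Return one 0/1 row per edge, with one column per user and per tag."""
    columns = users + tags
    index = {name: i for i, name in enumerate(columns)}
    rows = []
    for user, tag in edges:
        row = [0] * len(columns)
        row[index[user]] = 1  # mark the user end of the edge
        row[index[tag]] = 1   # mark the tag end of the edge
        rows.append(row)
    return rows


FEATURES = edge_features(EDGES, USERS, TAGS)
for number, row in enumerate(FEATURES, start=1):
    print(f"e{number}", row)
```

Every row has exactly two non-zero entries, which is why this representation stays sparse even for large user and tag sets.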

Clustering Edges


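We can use any clustering algorithm (e.g., k-means) to group similar edges together:

$$\arg\max_{C} \frac{1}{k} \sum_{i=1}^{k} \sum_{x_j \in C_i} S_c(x_j, c_i)$$

Here C = {C_1, ..., C_k} is the partition of the edges, x_j is an edge, and c_i is the centroid of cluster C_i.

Different similarity schemes can be used for S_c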
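As a rough illustration of clustering edges with a k-means-style procedure, the sketch below assigns each edge to its most similar centroid and then lets every node inherit the clusters of its edges. It assumes cosine similarity between an edge vector and a centroid as the similarity scheme and uses hand-picked seed edges; both choices, and all function names, are assumptions made for the example.

```python
import math

# Toy data: the ten user-tag links from the example table (e1..e10).
EDGES = [
    ("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"),
    ("u3", "t2"), ("u3", "t3"), ("u4", "t3"), ("u4", "t4"),
    ("u5", "t3"), ("u5", "t4"),
]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(key, 0.0) for key, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vector in vectors:
        for key, w in vector.items():
            total[key] = total.get(key, 0.0) + w
    return {key: w / len(vectors) for key, w in total.items()}


def cluster_edges(edges, seeds, iterations=10):
    """k-means-style loop: assign edges to the closest centroid, then update."""
    vectors = [{user: 1.0, tag: 1.0} for user, tag in edges]
    centroids = [vectors[s] for s in seeds]  # seed edges are an assumption
    labels = [0] * len(edges)
    for _ in range(iterations):
        labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return labels


def node_groups(edges, labels):
    """A node belongs to every cluster that contains one of its edges."""
    groups = {}
    for (user, tag), label in zip(edges, labels):
        groups.setdefault(user, set()).add(label)
        groups.setdefault(tag, set()).add(label)
    return groups


LABELS = cluster_edges(EDGES, seeds=[0, 9])
GROUPS = node_groups(EDGES, LABELS)
print(LABELS)
print(sorted((node, sorted(g)) for node, g in GROUPS.items()))
```

Each edge ends up in exactly one cluster, while a node whose edges fall into different clusters is reported in several groups, which is how the overlap arises.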
Defining Edge Similarity


The similarity between two edges e = (u_i, t_p) and e' = (u_j, t_q) can be defined by, but is not limited to,

$$S_e(e, e') = \alpha S_u(u_i, u_j) + (1 - \alpha) S_t(t_p, t_q)$$

α is set to 0.5, which gives equal importance to users and tags

Define the user-user similarity S_u and the tag-tag similarity S_t
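The weighted combination of user-user and tag-tag similarity can be sketched as a small function that takes the two component similarities as parameters. The `exact_match` placeholder and the function names are assumptions for illustration; any user-user or tag-tag similarity could be passed in instead.

```python
def exact_match(a, b):
    """Placeholder similarity: 1.0 for identical nodes, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def edge_similarity(e, e_prime, user_sim=exact_match, tag_sim=exact_match,
                    alpha=0.5):
    """Weighted sum of user-user and tag-tag similarity for two edges."""
    (u_i, t_p), (u_j, t_q) = e, e_prime
    return alpha * user_sim(u_i, u_j) + (1.0 - alpha) * tag_sim(t_p, t_q)


# Two edges sharing a user, with users and tags weighted equally.
SAME_USER = edge_similarity(("u1", "t1"), ("u1", "t2"))
# Two edges sharing a tag, with users weighted more heavily than tags.
SAME_TAG = edge_similarity(("u1", "t1"), ("u2", "t1"), alpha=0.8)
print(SAME_USER, SAME_TAG)
```

Passing the component similarities as arguments keeps the combination rule separate from the choice of similarity scheme.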

Independent Learning


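Assume users are independent of each other, and tags are independent of each other

$$S_e(e, e') = \frac{1}{2}\left(\delta(u_i, u_j) + \delta(t_p, t_q)\right), \qquad \delta(m, n) = \begin{cases} 1, & m = n \\ 0, & m \neq n \end{cases}$$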
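Under the independence assumption, the similarity of two edges is simply half the number of end nodes they share. The sketch below also checks, on the example edges, that this equals the cosine similarity of the corresponding 0/1 rows of the edge-by-node table; the helper names are assumptions for illustration.

```python
import math
from itertools import combinations

# Toy data: the ten user-tag links from the example table (e1..e10).
EDGES = [
    ("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"),
    ("u3", "t2"), ("u3", "t3"), ("u4", "t3"), ("u4", "t4"),
    ("u5", "t3"), ("u5", "t4"),
]
COLUMNS = ["u1", "u2", "u3", "u4", "u5", "t1", "t2", "t3", "t4"]


def delta(m, n):
    """Indicator function: 1 if the two nodes are identical, else 0."""
    return 1 if m == n else 0


def independent_similarity(e, e_prime):
    """Edge similarity when users (and tags) are treated as independent."""
    (u_i, t_p), (u_j, t_q) = e, e_prime
    return 0.5 * (delta(u_i, u_j) + delta(t_p, t_q))


def indicator_cosine(e, e_prime, columns=COLUMNS):
    """Cosine similarity of the 0/1 table rows of two edges."""
    a = [1 if name in e else 0 for name in columns]
    b = [1 if name in e_prime else 0 for name in columns]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(a)) * math.sqrt(sum(b)))


# Pairs where the two computations disagree (expected to be none).
MISMATCHES = [
    (e, f) for e, f in combinations(EDGES, 2)
    if abs(independent_similarity(e, f) - indicator_cosine(e, f)) > 1e-12
]
print(len(MISMATCHES))
```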

Normalized Learning


Differentiate nodes with varying degrees by normalizing each node by its degree

$$e(u_i, t_p) = \left(0, \ldots, 0, \frac{1}{d_{u_i}}, 0, \ldots, 0, \frac{1}{d_{t_p}}, 0, \ldots, 0\right)$$

$$S_e(e, e') = \frac{d_{t_p} d_{t_q}\, \delta(u_i, u_j) + d_{u_i} d_{u_j}\, \delta(t_p, t_q)}{\sqrt{d_{u_i}^2 + d_{t_p}^2}\, \sqrt{d_{u_j}^2 + d_{t_q}^2}}$$
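The closed-form normalized similarity is the cosine similarity of two degree-normalized edge vectors. A minimal sketch on the example graph compares the closed form with a direct cosine computation; node degrees are counted from the toy edge list, and all names are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Toy data: the ten user-tag links from the example table (e1..e10).
EDGES = [
    ("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"),
    ("u3", "t2"), ("u3", "t3"), ("u4", "t3"), ("u4", "t4"),
    ("u5", "t3"), ("u5", "t4"),
]

# Degree of every node = number of edges it participates in.
DEGREE = Counter()
for user, tag in EDGES:
    DEGREE[user] += 1
    DEGREE[tag] += 1


def delta(m, n):
    return 1 if m == n else 0


def normalized_similarity(e, e_prime, degree=DEGREE):
    """Closed-form similarity of two degree-normalized edges."""
    (u_i, t_p), (u_j, t_q) = e, e_prime
    numerator = (degree[t_p] * degree[t_q] * delta(u_i, u_j)
                 + degree[u_i] * degree[u_j] * delta(t_p, t_q))
    denominator = (math.sqrt(degree[u_i] ** 2 + degree[t_p] ** 2)
                   * math.sqrt(degree[u_j] ** 2 + degree[t_q] ** 2))
    return numerator / denominator


def vector_cosine(e, e_prime, degree=DEGREE):
    """Cosine of the sparse vectors with entries 1/d at the two end nodes."""
    a = {node: 1.0 / degree[node] for node in e}
    b = {node: 1.0 / degree[node] for node in e_prime}
    dot = sum(w * b.get(node, 0.0) for node, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


MAX_GAP = max(
    abs(normalized_similarity(e, f) - vector_cosine(e, f))
    for e, f in combinations(EDGES, 2)
)
print(MAX_GAP)
```

In this scheme, two edges that share a high-degree node are less similar than two edges that share a low-degree node, all else being equal.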

Correlational Learning


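Tags are semantically close
  The tags cars, automobile, autos, and car reviews are used to describe a blog written by sid0722 on BlogCatalog

Figure: the user-by-tag matrix (u × t) is mapped to a user-by-latent-dimension matrix (u × k).

Compute user-user and tag-tag cosine similarity in the latent space

$$S_e(e, e') = \frac{1}{2}\left(\frac{\tilde{u}_i \cdot \tilde{u}_j}{\|\tilde{u}_i\| \|\tilde{u}_j\|} + \frac{\tilde{t}_p \cdot \tilde{t}_q}{\|\tilde{t}_p\| \|\tilde{t}_q\|}\right)$$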
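One way to obtain a latent space is a truncated singular value decomposition of the user-tag matrix. The sketch below assumes NumPy is available, uses k = 2 latent dimensions, and scales the singular vectors by the singular values; these choices and the function names are assumptions for illustration and may differ from the authors' exact construction.

```python
import numpy as np

# Toy data: the ten user-tag links from the example table (e1..e10).
EDGES = [
    ("u1", "t1"), ("u1", "t2"), ("u2", "t1"), ("u2", "t2"),
    ("u3", "t2"), ("u3", "t3"), ("u4", "t3"), ("u4", "t4"),
    ("u5", "t3"), ("u5", "t4"),
]
USERS = ["u1", "u2", "u3", "u4", "u5"]
TAGS = ["t1", "t2", "t3", "t4"]


def latent_vectors(edges, users, tags, k=2):
    """Map users and tags into a k-dimensional space via truncated SVD."""
    m = np.zeros((len(users), len(tags)))
    for user, tag in edges:
        m[users.index(user), tags.index(tag)] = 1.0
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    user_vecs = u[:, :k] * s[:k]     # one row per user
    tag_vecs = vt[:k, :].T * s[:k]   # one row per tag
    return (
        {name: user_vecs[i] for i, name in enumerate(users)},
        {name: tag_vecs[i] for i, name in enumerate(tags)},
    )


def cosine(a, b):
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denominator) if denominator > 0 else 0.0


def correlational_similarity(e, e_prime, user_vecs, tag_vecs):
    """Average of user-user and tag-tag cosine similarity in latent space."""
    (u_i, t_p), (u_j, t_q) = e, e_prime
    return 0.5 * (cosine(user_vecs[u_i], user_vecs[u_j])
                  + cosine(tag_vecs[t_p], tag_vecs[t_q]))


USER_VECS, TAG_VECS = latent_vectors(EDGES, USERS, TAGS)
# u1 and u2 use exactly the same tags, so they coincide in the latent space.
SIM_TWIN_USERS = correlational_similarity(("u1", "t1"), ("u2", "t1"),
                                          USER_VECS, TAG_VECS)
# t1 and t2 are different tags but are used by the same users.
SIM_RELATED_TAGS = correlational_similarity(("u1", "t1"), ("u1", "t2"),
                                            USER_VECS, TAG_VECS)
print(SIM_TWIN_USERS, SIM_RELATED_TAGS)
```

Unlike the exact-match schemes, two edges with different tags can still score above 0.5 here when the tags are used by similar sets of users.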






Spectral Clustering Perspective


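Graph partitioning (a minimization over z) can be solved as a generalized eigenvalue problem:

$$L z = \lambda W z, \qquad L = \begin{bmatrix} D_1 & -M \\ -M^T & D_2 \end{bmatrix}, \quad W = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}, \quad Z = \begin{bmatrix} U \\ V \end{bmatrix}$$

Here D_1 and D_2 are the diagonal degree matrices of the users and the tags.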

Spectral Clustering Perspective


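Plugging in L, W, and Z, we obtain

$$\begin{bmatrix} D_1 & -M \\ -M^T & D_2 \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix} = \lambda \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix} \begin{bmatrix} U \\ V \end{bmatrix}$$

$$M V = (1 - \lambda) D_1 U, \qquad M^T U = (1 - \lambda) D_2 V$$

U and V are the left and right singular vectors corresponding to the top k largest singular values of the (degree-normalized) user-tag matrix M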
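A short worked step shows why singular vectors appear. It assumes D_1 and D_2 are the diagonal user and tag degree matrices with strictly positive entries, and it introduces the rescaled vectors \hat{U} = D_1^{1/2} U and \hat{V} = D_2^{1/2} V (notation chosen here for illustration):

```latex
\begin{aligned}
M V &= (1-\lambda)\, D_1 U, &
M^T U &= (1-\lambda)\, D_2 V \\
D_1^{-1/2} M D_2^{-1/2}\, \hat{V} &= (1-\lambda)\, \hat{U}, &
D_2^{-1/2} M^T D_1^{-1/2}\, \hat{U} &= (1-\lambda)\, \hat{V}
\end{aligned}
```

The second row is the defining relation of a singular vector pair of D_1^{-1/2} M D_2^{-1/2} with singular value 1 - λ, so the smallest eigenvalues λ correspond to the largest singular values.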

Synthetic Data Sets


Synthetic data sets
  Number of clusters, users, and tags
  Inner-cluster density and inter-cluster density (1% of total user-tag links)

Normalized Mutual Information (NMI)
  Between 0 and 1
  The higher, the better
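For reference, the standard normalized mutual information between two disjoint labelings can be computed as below. This is a minimal sketch of the common definition I(A;B) / sqrt(H(A) H(B)); the slides do not state which NMI variant was used for overlapping clusters, so treat the choice of formula as an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )


def nmi(a, b):
    """Normalized mutual information, between 0 and 1 (higher is better)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


# Same grouping under different label names scores 1.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
# Statistically independent groupings score 0.
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```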

Synthetic Performance


We fix the number of users, tags, and density, but vary the number of clusters

Synthetic Performance


We fix the number of users, tags, and clusters, but vary the inner-cluster density

Social Media Data Sets


BlogCatalog
  Tags describing each blog
  Category predefined by BlogCatalog for each blog

Delicious
  Tags describing each bookmark
  Select the top 10 most frequently used tags for each person

Inferring Personal Interests


Category information reveals personal interests. We view group affiliation as features to infer personal interests via cross-validation

Connectivity Study


The correlation between the number of co-occurrences of two users in different affiliations and their connectivity in real networks

The larger the co-occurrence of two users, the more likely they are connected
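The co-occurrence analysis can be sketched as follows: count how many groups each pair of users shares, then compute the fraction of connected pairs at each co-occurrence level. The affiliations and friendship links below are made-up toy data, and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical group affiliations (user -> set of group ids).
AFFILIATIONS = {"a": {0, 1}, "b": {0, 1}, "c": {1}, "d": {2}}
# Hypothetical friendship links, stored as alphabetically sorted pairs.
LINKS = {("a", "b"), ("b", "c")}


def co_occurrence(affiliations):
    """Number of shared groups for every pair of users."""
    return {
        (a, b): len(affiliations[a] & affiliations[b])
        for a, b in combinations(sorted(affiliations), 2)
    }


def connection_rate(affiliations, links):
    """Fraction of connected pairs at each co-occurrence level."""
    totals, connected = Counter(), Counter()
    for pair, shared in co_occurrence(affiliations).items():
        totals[shared] += 1
        if pair in links:
            connected[shared] += 1
    return {shared: connected[shared] / totals[shared] for shared in totals}


RATES = connection_rate(AFFILIATIONS, LINKS)
for shared in sorted(RATES):
    print(shared, RATES[shared])
```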

Understanding Groups via Tag Cloud

Tag cloud for Category Health

Understanding Groups via Tag Cloud

Tag cloud for Cluster Health

Understanding Groups via Tag Cloud

Tag cloud for Cluster Nutrition

Conclusions and Future Work


Overlapping communities on a user-tag graph

Propose an edge-centric view and define edge similarity
  Independent Learning
  Normalized Learning
  Correlational Learning

Evaluate results on synthetic and real data sets

Many applications (e.g., link prediction); scalability

References


I. S. Dhillon, “Co-clustering documents and words using bipartite spectral graph partitioning,” in KDD ’01, NY, USA.

L. Tang and H. Liu, “Scalable learning of collective behavior based on sparse social dimensions,” in CIKM ’09, NY, USA.

L. Tang and H. Liu, “Community Detection and Mining in Social Media,” Morgan & Claypool Publishers, Synthesis Lectures on Data Mining and Knowledge Discovery, 2010.

G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, p. 814, 2005.

K. Yu, S. Yu, and V. Tresp, “Soft clustering on graphs,” in NIPS, 2005.

U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.

M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, vol. 69, no. 2, p. 026113, Feb 2004.

S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75-174, 2010.

Contact the Authors


Xufei Wang
  xufei.wang@asu.edu
  Arizona State University

Lei Tang
  ltang@yahoo-inc.com
  Yahoo! Labs