Probabilistic Models for Discovering E-Communities
Ding Zhou (1), Eren Manavoglu (1), Jia Li (3), C. Lee Giles (1,2), Hongyuan Zha (1,2)
(1) Computer Science and Engineering, (2) Information Sciences and Technology, (3) Statistics
Pennsylvania State University
University Park, PA, USA 16802
Outline
•
Semantic discovery of communities
•
Graphical models for communities
•
Entropy Filtering Gibbs sampling algorithm
•
Experiments with the Enron dataset
•
Conclusions and new directions
Community discovery using probabilistic graphs
–
Discovery of communities in the digital society and the impact of user communication (e-mail, instant messaging, publications, etc.)
–
Inferred relationships among people, communities, and the information carriers (e.g. email text)
–
Why people communicate with each other: social explanations
Semantic Social Networks
•
Semantic social network or community: a social network of actors in which there is an expressed meaning as to why users (actors) are associated
–
In contrast to observational social networks
•
Semantics from
–
Communication interests.
–
Topic interests.
–
Family interests
–
etc
Other definitions:
•
Friend blogging, Friendster, S Downes, 2004.
•
Applications
–
Collaborative filtering
–
Knowledge discovery
–
Social classification
–
Spam detection/filtering
[Figure: high school romantic and sexual activity social network. Bearman, Moody, Stovel, AJS, 2004.]
Motivation: why study semantic communities
•
For example, what if there are no semantics?
[Figure: left, communication frequency-based community; right, communication semantic-based community]
•
The social network on the left is built from communication frequency alone; the one on the right also considers communication content.
•
The topology is biased (on the left) when only communication frequency is considered: e.g. a spammer can have a high degree, and a group of spammers can form a new clique in the graph.
•
High communication intensity alone, without regard to content (semantics), may misrepresent the topics discovered in a community.
Semantic Social Network Models
•
Social networks: actor graphs or linear algebra
–
Wasserman, Faust, 1999.
•
Semantic social networks
–
Tripartite graphs (heterogeneous), Mika, 2004.
–
Clustering
–
Probabilistic graphs
•
Bayesian (directed)
•
Markov random field (undirected)
Probabilistic Layered Framework Models for Modeling Social Networks
(Possible confusion: two different graphs are discussed, the social graph and the probabilistic model graph)
•
Layered probabilistic graph with two types of layers
–
Observable layer
•
nodes are observable from data
–
Hidden layer
•
Nodes not observable but believed to exist
•
Causal in generative models
•
Observable layers in social networks (this work)
–
Users: social actors
–
Edges are communication relationships
•
Only the communication frequency is used to build a graph of users; the semantics are ignored [Tyler 2003].
•
Derived hidden layer
–
Probabilistic framework for text mining
•
Related work in topic discovery in text [Blei 2003][Griffiths 2004][Steyvers 2004][McCallum 2004]
–
Introduce community layer: groups of users (this work)
•
Proposed in our generative model (CUT1 and CUT2).
Advantages of graphical models
•
Independence relationships in multivariable models are
clarified
•
Computational inference
–
posterior probabilities can be calculated efficiently
–
tree structure: linear in number of variables
–
completely general algorithms for inference exist
•
Interpretation of results
•
Generative model: explanation of results
Introduction to probabilistic graphical models
•
Graphical models definitions
–
Use graphs to describe the probability dependency
among a set of variables.
–
Variables are represented by nodes.
–
Conditional dependency is represented by edges.
•
Types of Graphical models
–
Cyclic vs acyclic (DAGs)
–
Directed (Bayesian network) vs Undirected (Markov
Random Field)
–
Hidden vs observed variables
–
others
Burglary/Earthquake Alarm Model (Russell, Norvig)
[Figure: Bayesian network with joint distribution P(B,E,A,J,M)]
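As a hedged illustration of how such a network encodes a joint distribution: assuming the usual textbook structure, in which Burglary (B) and Earthquake (E) are independent causes of Alarm (A) and the two calls (J, M) depend only on A, the joint factorizes as below. The structure is an assumption here, since the figure itself is not reproduced.

```latex
P(B,E,A,J,M) = P(B)\,P(E)\,P(A \mid B,E)\,P(J \mid A)\,P(M \mid A)
```

Under that assumption, 1 + 1 + 4 + 2 + 2 = 10 conditional probabilities specify the model, instead of the 2^5 - 1 = 31 needed for an unstructured joint over five binary variables.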
Variables
•
Given a set of documents D, each consisting of a sequence of words w_d of size N_d, the generation of each word for a specific document can be conditioned on an author or topic; w is a specific observed word.
•
a_d is the author set that generates a specific document d, with x a particular author in it.
•
The prior (Dirichlet) distributions of topics and words are parameterized by α and β.
•
Each topic z is a multinomial distribution φ over words.
Related graphical models for document generation
•
Three generative models for authors, topics and words
Three variables in the generation of documents are considered: (1) authors; (2) words; and (3) topics.
A graphical model is a natural way to model dependency among variables. The Conditional Random Field (CRF) is a recently proposed feature-based graphical modeling method.
All are generative models, trained by EM or Gibbs sampling:
(1) Author-word model [McCallum 2004]
(2) Topic-word (LDA) model [Blei 2003] [Griffiths 2004]
(3) Author-topic model [Steyvers 2004]
Notation: bold denotes an observed vector; w is an observed word.
For user, topic, community: α, β are prior distribution (Dirichlet) parameters; φ, ϕ are conditional distribution (multinomial) parameters.
Author-Word Model
–
A group of authors, a_d, decides to write the document d
–
For each word w in the document:
•
An author is chosen uniformly at random
•
A word is chosen from a probability distribution over words that is specific to that author
Topic-Word Model (LDA)
–
For each document d, a distribution over topics is sampled from a Dirichlet distribution
–
For each word w in the document:
•
A topic is chosen from that distribution
•
The word is sampled from a probability distribution over words specific to that topic
Author-Topic Model
–
A group of authors, a_d, decides to write the document d
–
For each word w in the document:
•
An author is chosen uniformly at random
•
A topic is chosen from a distribution over topics specific to that author
•
The word is sampled from a probability distribution over words specific to that topic
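The author-topic generative steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the author names, the `author_topic` and `topic_word` tables, and the function name are all hypothetical, and the distributions are fixed by hand rather than drawn from Dirichlet priors.

```python
import random

def generate_author_topic_document(authors, n_words, author_topic, topic_word, rng):
    """Generate n_words tokens as (author, topic, word) triples."""
    n_topics = len(topic_word)
    vocab_size = len(topic_word[0])
    tokens = []
    for _ in range(n_words):
        # Step 1: an author is chosen uniformly at random from the author set a_d.
        x = rng.choice(authors)
        # Step 2: a topic is chosen from that author's distribution over topics.
        z = rng.choices(range(n_topics), weights=author_topic[x])[0]
        # Step 3: the word is sampled from that topic's distribution over words.
        w = rng.choices(range(vocab_size), weights=topic_word[z])[0]
        tokens.append((x, z, w))
    return tokens

# Hypothetical toy parameters: 2 authors, 2 topics, 4-word vocabulary.
author_topic = {"alice": [0.9, 0.1], "bob": [0.1, 0.9]}
topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
tokens = generate_author_topic_document(
    ["alice", "bob"], 20, author_topic, topic_word, random.Random(0)
)
```

Dropping step 1 and replacing `author_topic[x]` with a per-document topic distribution gives the topic-word (LDA) process; dropping step 2 and sampling words directly from a per-author word distribution gives the author-word process.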
Model Details
Graphical models: our approach
•
For document generation due to user communication, the community factor can be considered.
•
How should the community factor play a role?
•
Our model is a generative extension of latent Dirichlet allocation, related to the LDA, Author-word, and Author-Topic-word models.
•
Training uses a modified version of Gibbs
sampling
Different hidden layers based on communities consisting of users and topics
CUT1: community affects user communication only.
•
The community prior distribution affects the distribution of authors
•
Authors are those who communicate based on community
•
The topic distribution of a communication is affected by the author distribution
•
A specific word is generated conditioned on the topic distribution (and, in effect, on all the previous instances of community and authors)
CUT2: community affects topic generation only.
•
The community prior distribution affects the distribution of topics directly
•
The author distribution of a communication is affected by the topic distribution
•
A specific word is generated conditioned on the author distribution (and, in effect, on all the previous instances of community and topics)
[Figure: CUT1 layers: Community → Author (user) → Topic → Word; CUT2 layers: Community → Topic → Author (user) → Word]
Graphical models: our approach
•
What's inferred from the models (e.g. CUT1)?
1. Assuming the previous conditional dependency of variables: what is the probability that a word under topic z is read by author u from community c?
2. The posterior probability P(c, u, z | w).
Issue: because of the denominator, computation is expensive for CUT.
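A hedged sketch of where the denominator comes from, assuming each CUT1 layer depends only on the layer directly above it (community → user → topic → word). The slide's own equation is not reproduced here, so this chain factorization is an assumption based on the CUT1 description, not the authors' stated formula:

```latex
P(c,u,z,w) = P(c)\,P(u \mid c)\,P(z \mid u)\,P(w \mid z), \qquad
P(c,u,z \mid w) = \frac{P(c,u,z,w)}{\sum_{c'}\sum_{u'}\sum_{z'} P(c',u',z',w)}
```

Under this reading, the denominator sums over every (community, user, topic) triple for each word, which is the source of the cost.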
Graphical models for CUT1
•
For a specific community p, user q, and topic r, assuming all the CUT1 conditionals have Dirichlet priors, results in:
Computation becomes tractable:
Based on previous training of graphical models using the Gibbs sampling approach [Griffiths 2004]
where C, T, U are the numbers of communities, topics, and users
with assignment counts:
Graphical models for CUT2
•
The training for CUT2 is the same with just a permutation
of order:
Note here only the matrices to store the assignment counts are changed.
CUT1 and CUT2 can be trained in the same algorithm framework.
EnF-Gibbs sampling algorithms
•
Given the models, CUT1 and CUT2, how do we
estimate the models, i.e. the probabilities on the
Bayesian network edges?
•
Gibbs sampling
–
A well known algorithm to approximate the joint
distribution of multiple variables by sampling
sequentially.
–
Why not expectation-maximization (EM)?
•
Too expensive because there is more than one latent variable to estimate, and EM tends to converge to a local optimum rather than a global one.
•
Propose an entropy filtering Gibbs sampling method (EnF-Gibbs)
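To make "sampling sequentially" concrete, here is a minimal sketch of a plain collapsed Gibbs sampler for the simpler topic-word (LDA) model, in the style of [Griffiths 2004]. It is not CUT1, CUT2, or EnF-Gibbs; the function name, toy corpus, and hyperparameter values are assumptions chosen for illustration. Each token's topic is resampled in turn from its conditional distribution given all other assignments, using only tables of assignment counts.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iters, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # word-topic assignment counts
    n_dt = [[0] * n_topics for _ in docs]               # document-topic assignment counts
    n_t = [0] * n_topics                                # total tokens assigned to each topic
    z = []                                              # current topic of every token

    # Random initial assignment of a topic to every token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            t = rng.randrange(n_topics)
            z[d].append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[d][i]
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalized conditional of each topic given all other assignments.
                weights = [
                    (n_wt[w][k] + beta) / (n_t[k] + vocab_size * beta)
                    * (n_dt[d][k] + alpha)
                    for k in range(n_topics)
                ]
                # Sample a new topic and add it back to the counts.
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt

# Toy corpus (hypothetical): word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 2, 3]]
z, n_wt, n_dt = gibbs_lda(docs, n_topics=2, vocab_size=4, alpha=0.5, beta=0.1, n_iters=50)
```

After enough iterations, the normalized count tables serve as estimates of the edge probabilities; a sampler for a model with more hidden layers would need one count table per edge of the network.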
Gibbs sampling vs EnF-Gibbs sampling
•
Entropy is used to reduce the number of words used.
A. Gibbs sampling solution
B. EnF-Gibbs sampling solution
In B: steps 9-13 do not initially consider entropy; steps 15-19 use entropy to filter uninformative words.
Summary of EnF-Gibbs sampling algorithm
–
Similar algorithm framework to Gibbs
–
Modified by using an entropy-filtering step in estimation of the joint probability
•
Words with large entropy are filtered out when estimating the variable: topic.
–
Advantages:
•
Higher precision in topic discovery
•
Enhanced efficiency in computation: the problem space is reduced during the iterations
–
Experimental reduction of 4 to 5
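The filtering idea can be sketched as follows. This is an illustrative reading, not the EnF-Gibbs algorithm itself: the slides do not give the exact entropy criterion, so the use of word-topic assignment counts, the `max_entropy` threshold, and the function names are assumptions. A word spread evenly over topics has high entropy and says little about any one topic, so it is dropped.

```python
import math

def word_topic_entropy(counts):
    """Shannon entropy (in bits) of one word's distribution over topics."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy

def filter_words(n_wt, max_entropy):
    """Return ids of words whose topic entropy does not exceed max_entropy."""
    return [w for w, counts in enumerate(n_wt) if word_topic_entropy(counts) <= max_entropy]

# Hypothetical word-topic counts for 4 words and 3 topics.
n_wt = [[10, 0, 0], [4, 3, 3], [0, 8, 0], [5, 5, 0]]
kept = filter_words(n_wt, max_entropy=1.2)  # word 1 is spread over all topics and is dropped
```

Words removed this way no longer need to be sampled in later iterations, which is one way the problem space could shrink as training proceeds.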
Enron email dataset testbed
•
Enron
–
Dataset made public by the Federal Energy Regulatory
Commission during its investigations
–
Data manually annotated by William Cohen
•
MySQL version from Shetty et al. 2004.
•
517,431 emails
–
151 users, over 3000 folders.
•
Full dataset for community discovery
•
Subsets of various sizes for measuring training time
–
1%, 10%, 50%, 100%
–
C=6, T=20
–
Smoothing parameters: α, β
Sample probability distribution for a user
•
Probability a specific user is in different communities (C=6)
or topics (T=20) (empirically determined)
A communication or document can be generated over multiple topics.
Generated topics associated with communities
•
4 sample topics out of T=20
•
Sorted by how probable they are
•
Words sorted by their generative probability
•
Legend: abbreviations of organizations (second table)

Topic 3    Topic 5       Topic 12    Topic 14
rate       dynegy        budget      contract
cash       gas           plan        monitor
balance    transmission  chart       litigation
number     energy        deal        agreement
price      transco       project     trade
analysis   calpx         report      cpuc
database   power         group       pressure
deals      california    meeting     utility
letter     reliant       draft       materials
fax        electric      discussion  citizen

Abbreviation  Organization
dynegy        An electricity and natural gas provider
transco       A gas transportation company
calpx         California Power Exchange Corp.
cpuc          California Public Utilities Commission
ferc          Federal Energy Regulatory Commission
epsa          Electric Power Supply Association
naruc         National Association of Regulatory Utility Commissioners
Discovered “semantic” community: CUT1
•
Sample community out of C=6 (set empirically)
•
Shown is a sample topic association for Mike Grigsby (not shown: each user is associated with several topics)
Discovered semantic community: CUT2
•
Sample community out of C=6 (set empirically)
•
Topic 8 and its associated users (not shown: each topic is associated with several users)
Probability distribution over topics for all users.
•
Joint distribution of users and topics
–
Note variability of users and topics
Topic interests for users are not uniformly distributed.
Compare CUT1 vs CUT2 for discovering communities
•
Rand index similarity measure for clusters
•
CUT1 and CUT2 compared to the modularity method [Newman 2002]
•
Mod (modularity) is a graph partition method used in social network analysis (SNA)
CUT1 (authors first) is similar to Mod, but CUT2 (topics first) is less so
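For reference, the Rand index between two clusterings is the fraction of item pairs on which they agree, i.e. pairs placed together in both or apart in both. A minimal sketch follows; the function name and the toy label vectors are illustrative and not taken from the experiments.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs grouped consistently by two clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Toy example: four users assigned to communities by two hypothetical methods.
method_a = [0, 0, 1, 1]
method_b = [0, 0, 1, 2]
score = rand_index(method_a, method_b)  # 5 of the 6 pairs agree
```

A value of 1 means the two methods produce identical partitions, which is why it can be used to compare CUT1 and CUT2 communities with those from the modularity method.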
Efficiency of training
•
Entropy filtering speeds up computation for all cases
(1) Training time vs. training size
(2) Training time per iteration vs. iteration number
Conclusions
•
Proposed two probabilistic graphical models for semantic social network generation and community discovery
–
First use of probabilistic graphical models for social networks?
•
Developed a modified Gibbs sampling-based training algorithm that substantially reduces the number of variables considered
•
New semantic community discovery associated with topics and authors for the Enron dataset.
•
Extensible to other datasets and models
–
Author-topic discovery in CiteSeer
–
User-topic discovery in intelligence
Other graphical models: our approach
•
Future work: what if the community simultaneously affects user communication and topic?
If the community is a joint distribution over topics and users, one can consider the following generative Bayesian network.
The computation for training using EM is much more expensive.
The EnF-Gibbs algorithm does not directly apply.
New work in review
•
Social networks in the publications in CiteSeer.
Are research topics derived from CiteSeer related? In the author social networks in CiteSeer, which authors motivate topic transitions?
New work
•
Basic assumptions:
–
New topics come from old topics.
–
Such transitions are motivated by author social
networks.
–
Author impact on transitions varies and can be ranked accordingly, e.g.:
Ranking on the transition of topics from T2 to T1.
T2: databases; T1: data mining in databases.
Future work
•
Study the hidden social network and its impact on
topic transitions
–
Cluster topics according to their associated hidden
social network.
–
Rank authors according to their impact on topic
transitions.
–
Discover topic transition dependencies, explaining the
emergence of new topics
Future work: open questions
–
How does this model explain the social communities in other e-formats, such as Web blogs?
–
Are we able to evaluate users' contribution and position in such an information society?
–
Can we mediate user interests to maintain a
stable user group (community)?
–
How will these models scale?
“The wealth of information has created a
poverty of attention.”
Herbert Simon
Semantic Social Networks
•
A semantic social network is a social network whose relationships
have meaning.
•
Discovering only a community index (without a semantic explanation of the communities) is not enough.
•
To enrich community discovery with semantics (e.g. topic structures), one cannot look only at the social network topology.
•
Applications:
–
Predicting future communication
–
Discovering latent topics of interests w.r.t. a group of people
–
Collaborative filtering
–
Ranking social actors according to their contribution of new information
(topics)
–
Content-based anti-spam