
Probabilistic Models for Discovering E-Communities

Ding Zhou (1), Eren Manavoglu (1), Jia Li (3), C. Lee Giles (1,2), Hongyuan Zha (1,2)

(1) Computer Science and Engineering, (2) Information Sciences and Technology, (3) Statistics

Pennsylvania State University
University Park, PA, USA 16802

Outline


Semantic discovery of communities


Graphical models for communities


Entropy-filtering (EnF) Gibbs sampling algorithm


Experiments with the Enron email dataset


Conclusions and new directions

Community discovery using probabilistic
graphs


Discovery of communities in the digital society and the impact of user communication (e-mail, instant messaging, publications, etc.)


Inferred relationships among people,
community and the information carriers (e.g.
email text)


Why people communicate with each other


social explanations



Semantic Social Networks


Semantic social network or community: a social network of actors in which there is an expressed meaning as to why users (actors) are associated


In contrast to observational social networks


Semantics from


Communication interests.


Topic interests.


Family interests


etc

Other definitions:


Friend blogging, Friendster (S. Downes, 2004).


Applications


Collaborative filtering


Knowledge discovery


Social classification


Spam detection/filtering

High school romantic and sexual activity social network

Bearman, Moody, Stovel,
AJS
, 2004.

Motivation:
why study semantic communities


For example, what happens if there are no semantics?

Communication frequency-based community

Communication semantics-based community


The social network on the left is built based on communication
frequency; the right is built when considering communication
content.



Semantic topology is biased (on the left) when only communication frequency is considered: e.g., a spammer can have a high degree, and a group of spammers can form a new clique in the graph.



High intensity of communication may, from a content (semantic) point of view, misrepresent topic discovery in a community.

Semantic Social Network Models


Social networks: actor graphs or linear-algebraic representations


Wasserman, Faust, 1999.


Semantic social networks


Tripartite graphs (heterogeneous): Mika, 2004.


Clustering


Probabilistic graphs


Bayesian (directed)


Markov random field (undirected)


Probabilistic Layered Framework Models for Modeling Social Networks

(Possible confusion: two different graphs are discussed: the social graph and the probabilistic model graph)



Layered probabilistic graph with two types of layers


Observable layer


nodes are observable from data


Hidden layer


Nodes not observable but believed to exist


Causal in generative models



Observable layers in social networks (this work)


Users


social actors


Edges are communication relationships


Only the communication frequency is used to build a graph of users; the semantics are ignored [Tyler 2003].



Derived hidden layer


Probabilistic framework for text mining


Related work in topic discovery in text [Blei 2003][Griffiths 2004][Steyvers 2004][McCallum 2004]


Introduce community layer


groups of users (this work)


Proposed in our generative model (CUT1 and CUT2).



Advantages of graphical models


Independence relationships in multivariable models are
clarified


Computational inference


posterior probabilities can be calculated efficiently


tree structure: linear in number of variables


completely general inference algorithms exist


Interpretation of results


Generative model: explanation of results


Introduction to probabilistic graphical models


Graphical models definitions


Use graphs to describe the probabilistic dependencies among a set of variables.


Variables are represented by nodes.


Conditional dependency is represented by edges.


Types of Graphical models


Cyclic graphs vs. DAGs


Directed (Bayesian network) vs Undirected (Markov
Random Field)


Hidden vs observed variables


others

Burglary/Earthquake Alarm Model

Russell, Norvig

P(B,E,A,J,M)
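
For reference, in this classic Russell-Norvig example Burglary (B) and Earthquake (E) are parents of Alarm (A), which is in turn the parent of JohnCalls (J) and MaryCalls (M), so the DAG factors the joint into local conditionals:

P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)

Inference such as P(B | J, M) then needs only these five small conditional tables rather than the full joint table.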


Given a set of documents D, each consisting of a sequence of words w_d of size N_d, the generation of each word for a specific document can be conditioned on an author or topic; w is a specific observed word.


a_d constitutes an author set, with a particular author x that generates a specific document d.


Prior (Dirichlet) distributions of topics and words are parameterized by α and β.


Each topic is a multinomial distribution φ over words, with z a specific topic.

Variables

Related graphical models for document generation


Three generative models for authors, topics and words



Three kinds of variables in the generation of documents are considered: (1) authors, (2) words, and (3) topics.



Graphical models are a natural way to model dependency among variables; the Conditional Random Field (CRF) is a recently proposed feature-based graphical modeling method.



All are generative models, trained by EM/Gibbs sampling

(1) Author-word model [McCallum 2004]

(2) Topic-word (LDA) model [Blei 2003] [Griffiths 2004]

(3) Author-topic model [Steyvers 2004]

Bold: observed vector; w: observed word

For user, topic, community: α, β, γ are the prior distribution (Dirichlet) parameters; θ, φ, ψ are the conditional distribution (multinomial) parameters

Author-Word Model


Group of authors, a_d, decide to write the document d


For each word w in the document:


An author is chosen uniformly at random


A word is chosen from a probability distribution over words that is specific to that author

Topic-Word Model (LDA)


For each document d, a distribution over topics is sampled from a Dirichlet distribution


For each word w in the document:


A topic is chosen from that distribution


The word is sampled from a probability distribution over words specific to that topic

Author-Topic Model


A group of authors, a_d, decide to write the document d


For each word w in the document:


An author is chosen uniformly at random


A topic is chosen from a distribution over topics specific to that author


The word is sampled from a probability distribution over words specific to that topic
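
A compact sketch of the three generative processes just described (author-word, topic-word/LDA, author-topic), using toy dimensions and hypothetical symmetric Dirichlet hyperparameters (alpha, beta); this is an illustrative reconstruction, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, A = 8, 3, 4                                  # toy vocabulary, topic, and author counts
alpha, beta = 0.5, 0.1                             # assumed symmetric Dirichlet hyperparameters

phi_author = rng.dirichlet([beta] * V, size=A)     # author-word model: word distribution per author
phi_topic = rng.dirichlet([beta] * V, size=T)      # word distribution per topic (LDA, author-topic)
theta_author = rng.dirichlet([alpha] * T, size=A)  # topic distribution per author (author-topic)

def author_word(authors, n):
    # (1) Author-word: pick an author uniformly, then a word from that author's distribution.
    return [rng.choice(V, p=phi_author[rng.choice(authors)]) for _ in range(n)]

def topic_word(n):
    # (2) Topic-word (LDA): sample a per-document topic mixture, then a topic, then a word.
    theta_d = rng.dirichlet([alpha] * T)
    return [rng.choice(V, p=phi_topic[rng.choice(T, p=theta_d)]) for _ in range(n)]

def author_topic(authors, n):
    # (3) Author-topic: author uniformly, topic from that author's mixture, word from that topic.
    words = []
    for _ in range(n):
        x = rng.choice(authors)
        z = rng.choice(T, p=theta_author[x])
        words.append(rng.choice(V, p=phi_topic[z]))
    return words

print(author_word([0, 2], 5), topic_word(5), author_topic([0, 2], 5))
```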


Model Details

Graphical models: our approach


For document generation due to user
communication, the community factor can be
considered.


How should the community factor play a role?


Our model is a generative extension of latent Dirichlet allocation (LDA)


related to the LDA, author-word, and author-topic-word models.


Training uses a modified version of Gibbs
sampling


Different hidden layers based on communities
consisting of users and topics

CUT2: When community affects topic generation only.

CUT1 architecture:


Community prior distribution affects the distribution of authors



Authors are those who communicate based on community



Distribution of topics of communication is affected by the author distribution



A specific word is generated conditioned on the topic distribution (and actually all the
previous instances of community and authors)

CUT2:



Community prior distribution affects the distribution of topics directly



Distribution of authors of communication is affected by the topic distribution



A specific word is generated conditioned on the author distribution (and actually all the
previous instances of community and topics)

CUT1 chain: Community -> Author (user) -> Topic -> Word

CUT2 chain: Community -> Topic -> Author (user) -> Word

CUT1: When community affects user communication only.
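
A minimal generative sketch of the two chains above. Toy dimensions, a uniform community prior, and per-word sampling of every latent variable are simplifying assumptions for illustration, not the paper's exact specification:

```python
import numpy as np

rng = np.random.default_rng(0)
C, U, T, V = 2, 5, 3, 8                                   # toy community, user, topic, vocabulary counts
prior = 0.1                                               # assumed symmetric Dirichlet hyperparameter

# Illustrative conditional multinomials along each chain (random, not the trained Enron parameters).
p_user_given_comm = rng.dirichlet([prior] * U, size=C)    # CUT1: community -> user
p_topic_given_user = rng.dirichlet([prior] * T, size=U)   # CUT1: user -> topic
p_word_given_topic = rng.dirichlet([prior] * V, size=T)   # CUT1: topic -> word
p_topic_given_comm = rng.dirichlet([prior] * T, size=C)   # CUT2: community -> topic
p_user_given_topic = rng.dirichlet([prior] * U, size=T)   # CUT2: topic -> user
p_word_given_user = rng.dirichlet([prior] * V, size=U)    # CUT2: user -> word

def generate_cut1(n_words):
    # CUT1 per word: community -> user -> topic -> word.
    tokens = []
    for _ in range(n_words):
        c = rng.choice(C)                                 # uniform community prior (simplification)
        u = rng.choice(U, p=p_user_given_comm[c])
        z = rng.choice(T, p=p_topic_given_user[u])
        w = rng.choice(V, p=p_word_given_topic[z])
        tokens.append((c, u, z, w))
    return tokens

def generate_cut2(n_words):
    # CUT2 per word: community -> topic -> user -> word.
    tokens = []
    for _ in range(n_words):
        c = rng.choice(C)
        z = rng.choice(T, p=p_topic_given_comm[c])
        u = rng.choice(U, p=p_user_given_topic[z])
        w = rng.choice(V, p=p_word_given_user[u])
        tokens.append((c, z, u, w))
    return tokens

print(generate_cut1(3))
print(generate_cut2(3))
```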


1. Assuming the previous conditional dependencies among variables: what is the probability that a word under topic z is read by author u from community c?







2. The posterior probability:




Graphical models: our approach


What’s inferred from the models (e.g. CUT1)?

Issue:

Because of the denominator, computation is expensive for the CUT models
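
Under the CUT1 chain (community -> author -> topic -> word), and with the conditional independencies the graph encodes, the joint asked about in question 1 would factor as follows; this is an illustrative reconstruction of the slides' missing equations in generic notation, not necessarily the paper's exact symbols:

$$P(w, z, u, c) = P(c)\, P(u \mid c)\, P(z \mid u)\, P(w \mid z)$$

and the posterior of question 2 is obtained by normalizing over all latent configurations:

$$P(c, u, z \mid w) = \frac{P(c)\, P(u \mid c)\, P(z \mid u)\, P(w \mid z)}{\sum_{c'} \sum_{u'} \sum_{z'} P(c')\, P(u' \mid c')\, P(z' \mid u')\, P(w \mid z')}$$

The denominator sums over all C x U x T configurations, which is the source of the computational cost noted above.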

Graphical models for CUT1


For a specific community p, user q, and topic r, assuming all the CUT1 conditionals (priors) are Dirichlet, this results in:


Computation becomes tractable:

Based on previous training of graphical models using the Gibbs sampling approach [Griffiths 2004]

where C, T, U are the numbers of communities, topics, and users, with assignment counts:
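
For orientation, a collapsed Gibbs update for CUT1 in the style of [Griffiths 2004] would look roughly like the following, sampling the community and topic assignment of word token i (whose author u_i = q is observed), assuming a uniform prior over communities and Dirichlet hyperparameters α, β, γ; here n^{CU}, n^{UT}, n^{TW} are the community-user, user-topic, and topic-word assignment-count matrices excluding token i, and V is the vocabulary size. This is an illustrative sketch, not the paper's exact equation:

$$P(c_i = p, z_i = r \mid u_i = q, w_i = m, \mathbf{c}_{-i}, \mathbf{z}_{-i}) \;\propto\; \frac{n^{CU}_{pq,-i} + \alpha}{\sum_{q'} n^{CU}_{pq',-i} + U\alpha} \cdot \frac{n^{UT}_{qr,-i} + \beta}{\sum_{r'} n^{UT}_{qr',-i} + T\beta} \cdot \frac{n^{TW}_{rm,-i} + \gamma}{\sum_{m'} n^{TW}_{rm',-i} + V\gamma}$$

Each factor is just a smoothed, normalized count, so one sweep touches only these count matrices; the multinomial parameters are read off from the same smoothed counts after convergence.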

Graphical models for CUT2


The training for CUT2 is the same, with just a permutation of the variable order:


Note that only the matrices storing the assignment counts are changed; CUT1 and CUT2 can be trained in the same algorithmic framework.

EnF-Gibbs sampling algorithms


Given the models CUT1 and CUT2, how do we estimate their parameters, i.e. the probabilities on the Bayesian network edges?


Gibbs sampling


A well-known algorithm that approximates the joint distribution of multiple variables by sampling each variable sequentially from its conditional (a toy example follows this list).


Why not expectation-maximization (EM)?


Too expensive because there is more than one latent variable to estimate, and EM tends to find a locally optimal solution instead of a global one.


Propose an entropy-filtering Gibbs sampling method (EnF-Gibbs)
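
As a toy illustration of sequential conditional sampling (unrelated to the CUT models themselves), a two-variable Gibbs sampler for a correlated bivariate standard normal, where each conditional is itself normal:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the target bivariate standard normal
n_iter = 5000
x, y = 0.0, 0.0                # arbitrary initialization
samples = []

for _ in range(n_iter):
    # Sample each variable in turn from its conditional given the current value of the other:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

samples = np.array(samples[1000:])             # discard burn-in
print(np.corrcoef(samples.T)[0, 1])            # should be close to rho (~0.8)
```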



Gibbs sampling vs EnF-Gibbs sampling


Entropy is used to reduce the number of words considered.

A. Gibbs sampling solution

B. EnF-Gibbs sampling solution

In B: steps 9-13 do not initially consider entropy;
steps 15-19 use entropy to filter uninformative words.

Summary of the EnF-Gibbs sampling algorithm


Similar algorithm framework to Gibbs


Modified by using an entropy-filtering step in the estimation of the joint probability


Words with large entropy are filtered out when estimating the topic variable.


Advantages:


Higher precision in topic discovery


Enhanced efficiency in computation


the problem space is reduced during the iterations


Experimental reduction of 4 to 5
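
A sketch of the entropy-filtering idea just summarized: compute each word's entropy over topics from the current count matrix and drop high-entropy (uninformative) words from subsequent sampling iterations. The threshold and the use of p(topic | word) are assumptions for illustration; the paper's exact criterion may differ.

```python
import numpy as np

def high_entropy_words(topic_word_counts, threshold):
    """Return indices of words whose entropy over topics exceeds the threshold.

    topic_word_counts: (T, V) matrix of current topic-word assignment counts.
    """
    counts = topic_word_counts.astype(float) + 1e-12         # avoid division by zero
    p_topic_given_word = counts / counts.sum(axis=0)          # column-normalize: p(z | w)
    entropy = -(p_topic_given_word * np.log(p_topic_given_word)).sum(axis=0)
    return np.where(entropy > threshold)[0]

# Toy example: word 0 is spread evenly over topics (high entropy, uninformative),
# words 1 and 2 are concentrated on one topic each (low entropy, informative).
counts = np.array([[5, 9, 1],
                   [5, 1, 8],
                   [5, 0, 1]])
print(high_entropy_words(counts, threshold=0.9))              # -> [0]: word 0 gets filtered
```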

Enron email dataset testbed


Enron


Dataset made public by the Federal Energy Regulatory
Commission during its investigations


Data manually annotated by William Cohen


MySQL version from Shetty et al., 2004.


517,431 emails


151 users, over 3000 folders.


Full datasets for community discovery


Various size subsets for measuring training time


1%, 10%, 50%, 100%


C=6, T=20


Smoothing parameters: α, β, γ


Sample probability distribution for a user


Probability a specific user is in different communities (C=6)
or topics (T=20) (empirically determined)

A communication or document can be generated over multiple topics.

Generated topics associated with communities


4 sample topics out of T=20


Sorted by how
probable they are


Words sorted by
their generative
probability





Topic 3      Topic 5         Topic 12     Topic 14
rate         dynegy          budget       contract
cash         gas             plan         monitor
balance      transmission    chart        litigation
number       energy          deal         agreement
price        transco         project      trade
analysis     calpx           report       cpuc
database     power           group        pressure
deals        california      meeting      utility
letter       reliant         draft        materials
fax          electric        discussion   citizen

Legend (organization abbreviations):

dynegy - an electricity and natural gas provider
transco - a gas transportation company
calpx - California Power Exchange Corp.
cpuc - California Public Utilities Commission
ferc - Federal Energy Regulatory Commission
epsa - Electric Power Supply Association
naruc - National Association of Regulatory Utility Commissioners

Discovered “semantic” community: CUT1


Sample community out of the C=6 communities (C set empirically)


Shown is Mike Grigsby's sample topic association (not shown: each user is associated with several topics)


Discovered semantic community: CUT2


Sample community out of the C=6 communities (C set empirically)


Topic 8 associated with users (not shown that each topic is associated
with several users)


Probability distribution over topics for all users.


Joint distribution of users and topics


Note variability of users and topics

Topic interests for
users are not uniformly
distributed.

Compare CUT1 vs CUT2 for discovering communities


Rand index similarity measure for clusters


CUT1 and CUT2 compared to modularity method [
Newman 2002
]


Mod (modularity) is a graph partition method used in SNA

CUT1 (authors first) is similar to Mod, but CUT2 (topic first) is less so
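
For reference, the Rand index between two clusterings is the fraction of item pairs on which they agree (placed together in both, or apart in both). A minimal sketch with toy labelings, not the Enron results:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (same/different cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Toy example: two 6-item clusterings that mostly agree.
print(rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))   # 0.8
```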

Efficiency of training


Entropy filtering speeds up computation for all cases

(1) Training time vs. training size

(2) Training time per iteration vs. iteration number

Conclusions


Propose two graphical probabilistic models for semantic
social network generation and community discovery


First use of probabilistic graphical models for social networks?


Developed a modified Gibbs-sampling-based training algorithm that substantially reduces the variable space


New semantic community discovery associated with topics
and authors for the Enron dataset.


Extensible to other datasets and models


Author-topic discovery in CiteSeer


User-topic discovery in intelligence


Other graphical models: our approach


Future work: what if the community simultaneously affects user communication and topic?

If the community is a joint distribution over topics and users, one can consider the following generative Bayesian network.


The computation for
training using EM is much
more expensive.


EnF-Gibbs algorithm does not directly apply.


New work in review


Social networks in the publications in
CiteSeer.


Are research topics derived from CiteSeer related? In CiteSeer author social networks, which authors motivate topic transitions?

New work


Basic assumptions:


New topics come from old topics.


Such transitions are motivated by author social
networks.


Author impact on transitions varies, and authors can be ranked accordingly, e.g.:


Ranking on the transition of topics from T2 to T1.

T2: databases; T1: data mining in databases.

Future work


Study the hidden social network and its impact on
topic transitions


Cluster topics according to their associated hidden
social network.


Rank authors according to their impact on topic
transitions.


Discover topic transition dependencies, explaining the
emergence of new topics

Future work: open questions


How does this model explain the social communities in other e-formats, such as Web blogs?


Are we able to evaluate users' contribution and position in such an information society?


Can we mediate user interests to maintain a
stable user group (community)?


How will these models scale?


“The wealth of information has created a
poverty of attention.”

Herbert Simon

Semantic Social Networks


A semantic social network is a social network whose relationships
have meaning.



We consider the discovery of community indices alone (without a semantic explanation of the communities) to be insufficient.



To enrich community discovery with semantics (e.g. topic structures), one cannot look only at the social network topology.



Applications:


Predicting future communication


Discovering latent topics of interests w.r.t. a group of people


Collaborative filtering


Ranking social actors according to their contribution of new information
(topics)


Content-based anti-spam