What is the right clustering of this graph?

Uploaded by plantationscarf, 25 Nov 2013 (Artificial Intelligence and Robotics)


Clique Percolation

A community is a collection of adjacent k-cliques.


Questions:

What is a good k?

How to find cliques?

Clique Finding

Find the largest clique in a graph?


NP-complete


Find maximal clique containing a node?


Polynomial

Percolation Algorithm


Find all maximal cliques


Create the clique-clique overlap matrix


Ignore diagonal entries less than k and off-diagonal entries less than k − 1
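The three steps can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the helper names (`build_adj`, `maximal_cliques`, `clique_percolation`) are made up for this example, the clique enumeration is basic Bron-Kerbosch without pivoting, and the overlap matrix is never stored explicitly. It uses the usual thresholds for clique percolation: cliques smaller than k are dropped, and two cliques are linked when they share at least k − 1 nodes.

```python
from itertools import combinations


def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Step 1: enumerate all maximal cliques (Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(adj, k):
    """Return k-clique communities as sorted lists of nodes."""
    # Diagonal threshold: a clique with fewer than k nodes is ignored.
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]

    # Union-find over cliques instead of an explicit overlap matrix.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Off-diagonal threshold: link cliques sharing at least k - 1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
print(clique_percolation(build_adj(edges), k=3))  # [[1, 2, 3, 4], [4, 5, 6]]
```

Node 4 ends up in both communities, which is the overlap that clique percolation is designed to allow.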


Running Time


Maximal clique finding is output-polynomial


Extensively studied


“we note that a complete analysis of a co-authorship network with 127000 links takes less than 2 hours on a PC.”

A More Theoretical Approach

Important features of a community


Internally dense


Externally sparse


Clique percolation ignores external sparsity

Modularity defines it as the edge cut

(α, β)-clustering

A cluster C is an (α, β)-cluster if


Internally Dense: Every vertex in the cluster neighbors at least a β fraction of the cluster


Externally Sparse: Every vertex outside the cluster neighbors at most an α fraction of the cluster
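The definition translates directly into a check. The sketch below is illustrative only: the function name is made up, the graph is a dict of adjacency sets, and it assumes that a vertex counts as its own neighbor (closed neighborhoods), so that a clique is internally dense with β = 1. The slides do not state which convention is intended.

```python
from fractions import Fraction


def is_alpha_beta_cluster(adj, cluster, alpha, beta):
    """Check the two (alpha, beta)-cluster conditions for a vertex set."""
    cluster = set(cluster)
    size = len(cluster)

    def links_into_cluster(v):
        # Assumption: v counts as its own neighbor (closed neighborhood).
        return len((adj[v] | {v}) & cluster)

    internally_dense = all(
        links_into_cluster(v) >= beta * size for v in cluster
    )
    externally_sparse = all(
        links_into_cluster(v) <= alpha * size for v in adj if v not in cluster
    )
    return internally_dense and externally_sparse


# A 5-clique {0..4} with one outside vertex 5 attached to vertex 0.
adj = {v: set(range(5)) - {v} for v in range(5)}
adj[0].add(5)
adj[5] = {0}
print(is_alpha_beta_cluster(adj, range(5), Fraction(1, 5), Fraction(4, 5)))  # True
```

Exact fractions avoid floating-point surprises when a count sits exactly on the α|C| or β|C| boundary.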


[Figure: two example clusters, each labeled (1/5, 4/5)]

First approach: ρ-Champions

[Figure: example network around Wes Anderson with Ben Stiller, Owen Wilson, Bill Murray, Gwyneth Paltrow, Will Ferrell, Vince Vaughn, Anjelica Huston, Steve Martin, Dan Aykroyd, Scarlett Johansson, Jack Black and Elijah Wood]

Algorithm with ρ-Champions


Let c be a ρ-champion


If v is in C, then v and c share at least (2β − 1)|C| neighbors


If v is outside C, then v and c share at most (ρ + α)|C| neighbors
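The first bound follows from inclusion-exclusion: v and c each have at least β|C| neighbors inside C, so they share at least 2β|C| − |C| of them. The sketch below shows how the two bounds could be used to recover a cluster from its champion. It rests on assumptions that the slides do not spell out: a ρ-champion is taken to be a vertex of C with at most ρ|C| neighbors outside C, neighborhoods are closed (a vertex counts as its own neighbor), and the cluster size is supplied as a guess. The function name is hypothetical.

```python
from fractions import Fraction


def candidate_cluster(adj, champion, size, beta):
    """Vertices sharing at least (2*beta - 1)*size neighbors with the champion."""

    def closed(v):
        # Assumption: closed neighborhood, i.e. v is its own neighbor.
        return adj[v] | {v}

    threshold = (2 * beta - 1) * size
    return {v for v in adj if len(closed(v) & closed(champion)) >= threshold}


# A 5-clique {0..4}; outside vertex 5 is attached to vertex 0 only.
adj = {v: set(range(5)) - {v} for v in range(5)}
adj[0].add(5)
adj[5] = {0}

# Vertex 1 has no neighbors outside the clique, so it is a champion for any rho >= 0.
print(sorted(candidate_cluster(adj, 1, size=5, beta=Fraction(4, 5))))  # [0, 1, 2, 3, 4]
```

The threshold only separates inside from outside when (2β − 1) exceeds ρ + α, and a full algorithm would still try each vertex as a champion, try each size, and verify the (α, β) conditions on the candidate set.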



[Figure: champion c with a vertex v inside C and a vertex v outside C; regions labeled β|C|, β|C|, ρ|C|, α|C| and (2β − 1)|C|]

Runs in $O(d^{0.7} n^{1.9} + n^{2+o(1)})$ time, where d is the average degree


Discussion


Pros


Very parallel


Experiments show good results


The clusters are a good feature in recommendation algorithms


Cons


β > 1/2 doesn't seem realistic


The ρ-champion requirement is fairly restrictive


Not based on observed data

Finding Overlapping Communities

Assumptions

1) Community edges are chosen according to the expected affinity (degree) model.

2) Maximality assumption with gap ε.

3) Community membership accounts for a significant portion of each node's edges.


Another Algorithm Style


Grow a community from a set of seed nodes.


Clique finding:

Pick s starting nodes at random

For each starting node v, sample S ⊆ Γ(v)

For each clique in S, grow to maximal clique.

Output if it satisfies your conditions.
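A minimal sketch of this seed-and-grow style follows. It simplifies one step: instead of enumerating every clique inside the sample S, it greedily picks a single clique containing the seed. The function names and the `min_size` acceptance condition are hypothetical stand-ins for "your conditions".

```python
import random


def grow_to_maximal_clique(adj, clique):
    """Extend a clique until no outside vertex is adjacent to all members."""
    clique = set(clique)
    candidates = set.intersection(*(adj[v] for v in clique)) - clique
    while candidates:
        v = min(candidates)  # deterministic tie-break; any choice works
        clique.add(v)
        candidates &= adj[v]
    return clique


def sample_cliques(adj, num_seeds, sample_size, rng, min_size=3):
    """Pick random seeds, sample neighbors, and grow maximal cliques."""
    found = set()
    nodes = sorted(adj)
    for v in rng.sample(nodes, min(num_seeds, len(nodes))):
        neighbors = sorted(adj[v])
        sample = rng.sample(neighbors, min(sample_size, len(neighbors)))
        # Simplification: greedily build one clique from the seed and the sample.
        clique = {v}
        for u in sample:
            if clique <= adj[u]:
                clique.add(u)
        grown = grow_to_maximal_clique(adj, clique)
        if len(grown) >= min_size:  # hypothetical output condition
            found.add(frozenset(grown))
    return found


adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
for clique in sample_cliques(adj, num_seeds=3, sample_size=2, rng=random.Random(0)):
    print(sorted(clique))
```

Because each seed is handled independently, the loop over seeds parallelizes easily.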

Ego Networks

You are the ego. Your friends form the ego network.

Sociology on Ego Networks

Functions Served by Ego Networks


Social support


Sense-making


Social control


Access to resources


Behavioral models


Dunbar Circles

Dunbar number: “the theoretical cognitive limit to the number of people with whom one can maintain stable social relationships,” estimated at between 100 and 250

Community Detection with Egonets

Idea 1


When you remove the ego, the egonet breaks into disconnected components.


Idea 2


It breaks into weakly connected components.
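Idea 1 can be sketched as a connected-components search over the ego's friends with the ego itself left out. The function name and the dict-of-sets graph representation are assumptions made for this example. Idea 2 would need a softer notion of separation than the strict one used here.

```python
def egonet_components(adj, ego):
    """Connected components among the ego's friends after removing the ego."""
    unseen = set(adj[ego])
    components = []
    while unseen:
        start = unseen.pop()
        component, stack = {start}, [start]
        while stack:
            u = stack.pop()
            # Only follow edges that stay inside the egonet.
            for w in adj[u] & unseen:
                unseen.discard(w)
                component.add(w)
                stack.append(w)
        components.append(component)
    return components


# Ego 0 has friends 1..5: 1-2 know each other, 3-4 know each other, 5 knows nobody else.
adj = {0: {1, 2, 3, 4, 5}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}, 5: {0}}
print(sorted(sorted(c) for c in egonet_components(adj, 0)))  # [[1, 2], [3, 4], [5]]
```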

[Figure: an egonet with groups labeled family, eecs, Grade school, college, microsoft, uva, radio, TCS]

Egonet-based Systems: DEMON

DEMON


Apply a community detection algorithm (Label Propagation) to the egonet


Repeat this for every user in the network.

Community Definition: The set of communities is the set of maximal sets that ‘contain’ the egonet communities.

DEMON


Merge the results.


Output:

Set of overlapping communities

Running time: $O(n K^{3-\alpha})$
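The whole pipeline can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the reference DEMON implementation: the function names are invented, label propagation is a basic asynchronous variant with a sweep cap, and the merge step keeps only the maximal sets under strict containment, following the community definition on the previous slide.

```python
import random
from collections import Counter


def label_propagation(adj, nodes, rng, max_sweeps=100):
    """Asynchronous label propagation restricted to the given node set."""
    labels = {v: v for v in nodes}
    order = sorted(nodes)
    for _ in range(max_sweeps):
        rng.shuffle(order)
        changed = False
        for v in order:
            counts = Counter(labels[u] for u in adj[v] if u in labels)
            if not counts:
                continue  # isolated inside this egonet
            best = max(counts.values())
            top = sorted(lab for lab, n in counts.items() if n == best)
            if labels[v] not in top:  # keep the current label on ties
                labels[v] = rng.choice(top)
                changed = True
        if not changed:
            break
    groups = {}
    for v, lab in labels.items():
        groups.setdefault(lab, set()).add(v)
    return list(groups.values())


def demon_sketch(adj, seed=0):
    """Local communities per egonet, merged by keeping the maximal sets."""
    rng = random.Random(seed)
    local = set()
    for ego in adj:
        # Cluster the ego's friends with the ego removed, then add the ego back.
        for group in label_propagation(adj, adj[ego], rng):
            local.add(frozenset(group | {ego}))
    # Merge: drop any community strictly contained in another one.
    return sorted(sorted(c) for c in local if not any(c < d for d in local))


# Two triangles sharing node 0.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}}
print(demon_sketch(adj))  # [[0, 1, 2], [0, 3, 4]]
```

Node 0 belongs to both output communities, so the result is an overlapping cover rather than a partition.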

[Figure panels: ‘Real’ Community, Random Walk, Metis, Infomap, Newman-Modularity, Louvain]


Cornell Study

Slides due to Bruno Abrahao

Community Detection


Community structure is not well defined


different people have different notions of community structure



Traditional strategy


(1) start with an expectation of what a community should look like


e.g., a set of nodes that interact more within the set than with the outside


(2) define an optimization problem


(3) design a heuristic


(4) the solution gives the desired communities


Key questions


A multitude of algorithms


different objective functions


different heuristics



How dissimilar are their outputs?



Communities may differ from the proposed mathematical constructs


e.g., preponderance of links to the outside



Which algorithms extract communities that most closely resemble the structure of real communities?


Obstacles to answering the questions


We don't know what properties communities possess



We can't characterize communities in the absence of negative examples


Look at real communities and determine their structure


do other sets that are not communities have these properties?


every other connected set could be a negative example: intractable


sets that are not annotated could also be communities



We don't know what metrics we should use


modularity, conductance, clustering coefficient...
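To make the metrics concrete, the sketch below computes a few structural measures for one node set: size, internal edge density, and conductance (boundary edges divided by the smaller of the set's volume and the rest of the graph's volume). The function name and the choice of these three measures are illustrative assumptions, not the feature set used in the study.

```python
def structural_features(adj, nodes):
    """Size, internal density and conductance of a node set."""
    nodes = set(nodes)
    n = len(nodes)
    # Each internal edge is seen from both endpoints, hence the halving.
    internal = sum(len(adj[v] & nodes) for v in nodes) // 2
    boundary = sum(len(adj[v] - nodes) for v in nodes)
    volume = 2 * internal + boundary
    total_volume = sum(len(nbrs) for nbrs in adj.values())

    density = internal / (n * (n - 1) / 2) if n > 1 else 0.0
    denom = min(volume, total_volume - volume)
    conductance = boundary / denom if denom else 0.0
    return {"size": n, "internal_density": density, "conductance": conductance}


# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(structural_features(adj, {0, 1, 2}))
```

For the triangle {0, 1, 2} the internal density is 1 and the conductance is 1/7: one boundary edge against a volume of 7 on each side.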


Building structural classes

[Diagram: apply an algorithm to the network to extract community examples]


Building structural classes

[Diagram: Algorithm 1, Algorithm 2, Algorithm 3, Algorithm 4, ..., Algorithm k produce Class 1, Class 2, Class 3, Class 4, ..., Class k]


Building a feature space

[Diagram: Labeled Example → Feature Vector]


Building a feature space

Feature Space


Inter-class separability

Feature Space

Separability = Distinct structures

Class Separability Measure

Are the classes separable?


Large-scale network datasets


Social




Commercial




Biological

+ Rice University

Facebook+Rice with permission of Mislove et al. Other datasets publicly available.


Community detection algorithms


BFS (Random connected subgraphs)


Random-Walk-based (with and without restart)


(α,β)-communities


InfoMap


Markov Clustering


Metis


Louvain


Newman-Clauset-Moore


Link Communities


Annotated communities

+ Rice University

Metadata included in the datasets identifies exemplar communities that form in these domains


To what extent are the classes separable?

Probabilistic k-way classifier (SVM, k-NN)

Algorithm 1

Algorithm 2

Annotated communities

Train


Probabilistic multi-class learners

Probabilistic k-way classifier (SVM, k-NN)

Classify (cross-validation)

Pr(Algorithm 1) = 0.05

Pr(Algorithm 2) = 0.08

...

Pr(Annotated) = 0.48
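One simple way to obtain class probabilities like these is a k-NN classifier that reports the fraction of the k nearest training examples in each class. The sketch below is a toy version under stated assumptions: the function name, the two-dimensional feature vectors and the training points are all invented for illustration, and the study's actual classifiers and features are not reproduced here.

```python
import math
from collections import Counter


def knn_class_probabilities(train, query, k):
    """Class probabilities as vote fractions among the k nearest examples.

    train is a list of (feature_vector, class_label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    classes = sorted({label for _, label in train})
    return {label: votes[label] / len(nearest) for label in classes}


# Made-up feature vectors for two classes of community examples.
train = [
    ((0.0, 0.0), "Algorithm 1"),
    ((0.1, 0.0), "Algorithm 1"),
    ((0.0, 0.1), "Algorithm 1"),
    ((1.0, 1.0), "Annotated"),
    ((0.9, 1.0), "Annotated"),
    ((1.0, 0.9), "Annotated"),
]
print(knn_class_probabilities(train, (0.5, 0.4), k=5))
```

With k = 5 the query picks up three "Algorithm 1" neighbors and two "Annotated" ones, so the reported probabilities are 0.6 and 0.4.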


Cross-validation performance


Matching annotated communities






Which algorithms extract communities that most closely resemble the structure of annotated communities?

Probabilistic multi-class learners

Probabilistic k-way classifier

Algorithm 1

Algorithm 2

Algorithm k

Learn


Probabilistic multi-class learners

Probabilistic k-way classifier

Classify

Pr(Algorithm 1) = 0.02

Pr(Algorithm 2) = 0.19

...

Pr(Algorithm k) = 0.12


Classification of annotated into extracted


Step 1: identifying the most important features

7 features out of 36 retain the discriminative power of the full set


Tendencies of algorithms

with respect to most discriminative features


Summary


Traditional methods are unsupervised


they find a particular type of community


little sensitivity to different purposes, structures of interest and domains of application




Our study suggests a supervised approach to community detection


user specifies what they intend to find through examples (real or synthetic)


algorithm learns from those examples and retrieves similar structures in the network


Experimental Assignment


Goal: Do some data mining research, comparing real networks and the models in class



Due: Email a report by Friday, October 12.