What is the right clustering of this graph?

plantationscarfAI and Robotics

Nov 25, 2013

Clique Percolation

A community is a collection of adjacent k-cliques.


Questions:

What is a good k?

How to find cliques?

Clique Finding

Find the largest clique in a graph? NP-hard (the decision version is NP-complete)

Find a maximal clique containing a node? Polynomial
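The polynomial case can be illustrated with a greedy sketch: start from the node and keep adding any vertex adjacent to everything chosen so far. The adjacency-dict representation, the function name, and the toy graph are assumptions made for illustration, not part of the slides.

```python
def maximal_clique_containing(adj, v):
    """Greedily grow a maximal clique that contains node v.

    adj maps each node to the set of its neighbours (undirected, no self-loops).
    """
    clique = {v}
    candidates = set(adj[v])      # nodes adjacent to every clique member
    while candidates:
        u = min(candidates)       # deterministic pick; any candidate would do
        clique.add(u)
        candidates &= adj[u]      # keep only nodes that are also adjacent to u
    return clique


# Hypothetical toy graph: a triangle {0, 1, 2} plus a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(maximal_clique_containing(adj, 0))
print(maximal_clique_containing(adj, 3))
```

The result is maximal (nothing more can be added) but not necessarily the largest clique containing the node, which is why this stays polynomial.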

Percolation Algorithm

• Find all maximal cliques
• Create the clique-clique overlap matrix
• Ignore entries below the threshold: diagonal entries less than k, off-diagonal entries less than k − 1
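The three steps can be sketched in plain Python. This is a minimal sketch, not a tuned implementation: it assumes the usual formulation in which two k-cliques are adjacent when they share k − 1 nodes, uses basic Bron-Kerbosch without pivoting for the maximal cliques, and replaces the explicit overlap matrix with a union-find over pairs of cliques. All function names and the toy graph are hypothetical.

```python
from itertools import combinations


def build_adj(edges):
    """Adjacency dict (node -> set of neighbours) for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def clique_percolation(adj, k):
    """Return the k-clique communities as a list of node sets."""
    # Diagonal filter: only maximal cliques with at least k nodes matter.
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))   # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Off-diagonal filter: link cliques that overlap in at least k - 1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, c in enumerate(cliques):
        communities.setdefault(find(i), set()).update(c)
    return list(communities.values())


# Hypothetical toy graph: triangles {0,1,2} and {1,2,3} share an edge,
# triangle {3,4,5} touches them only at node 3.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
communities = clique_percolation(build_adj(edges), k=3)
print(sorted(sorted(c) for c in communities))
```

On this graph with k = 3, node 3 ends up in both communities, which is the overlap behaviour the method is designed to allow.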


Running Time

• Maximal clique finding is output-polynomial
  – Extensively studied
• “we note that a complete analysis of a co-authorship network with 127000 links takes less than 2 hours on a PC.”

A More Theoretical Approach

Important features of a community


Internally dense


Externally sparse


Clique percolation ignores the externally sparse requirement

Modularity defines it in terms of the edge cut

(α, β)-clustering

A cluster C is an (α, β)-cluster if

– Internally Dense: Every vertex in the cluster neighbors at least a β fraction of the cluster
– Externally Sparse: Every vertex outside the cluster neighbors at most an α fraction of the cluster
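The two conditions translate directly into a membership check. The sketch below assumes an adjacency dict without self-loops, so a vertex is not counted as its own neighbor (conventions differ on this point); the function name and the toy graph are hypothetical.

```python
def is_alpha_beta_cluster(adj, cluster, alpha, beta):
    """Check the (alpha, beta)-cluster conditions for a candidate node set."""
    cluster = set(cluster)
    size = len(cluster)
    # Internally dense: every member neighbours at least beta * |C| members.
    dense = all(len(adj[v] & cluster) >= beta * size for v in cluster)
    # Externally sparse: every outsider neighbours at most alpha * |C| members.
    sparse = all(len(adj[v] & cluster) <= alpha * size
                 for v in adj if v not in cluster)
    return dense and sparse


# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} and an outsider 4 attached to node 0.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0},
}
print(is_alpha_beta_cluster(adj, {0, 1, 2, 3}, alpha=0.25, beta=0.75))
```

Each member sees 3 of the 4 cluster nodes and the outsider sees 1 of 4, so the set passes with α = 0.25 and β = 0.75 but fails if either bound is tightened.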


[Figure: two example clusters, each labeled (1/5, 4/5)]

First approach: ρ-Champions

[Figure: actor collaboration graph labeled (1/3, 7/9), with nodes Wes Anderson, Ben Stiller, Owen Wilson, Bill Murray, Gwyneth Paltrow, Will Ferrell, Vince Vaughn, Anjelica Huston, Steve Martin, Dan Aykroyd, Scarlett Johansson, Jack Black, Elijah Wood]

Algorithm with ρ-Champions

• Let c be a ρ-champion
• If v is in C, then v and c share at least (2β − 1)|C| neighbors
• If v is outside C, then v and c share at most (ρ + α)|C| neighbors
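Both bounds follow from a short counting argument. The derivation below assumes that a ρ-champion of C is a vertex c in C with at most ρ|C| neighbors outside C (the slides do not spell this out) and writes Γ(x) for the neighborhood of x.

```latex
\begin{align*}
v \in C:\quad
|\Gamma(v) \cap \Gamma(c)|
  &\ge |\Gamma(v) \cap C| + |\Gamma(c) \cap C| - |C| \\
  &\ge \beta|C| + \beta|C| - |C| = (2\beta - 1)|C|, \\[4pt]
v \notin C:\quad
|\Gamma(v) \cap \Gamma(c)|
  &\le |\Gamma(v) \cap C| + |\Gamma(c) \setminus C| \\
  &\le \alpha|C| + \rho|C| = (\rho + \alpha)|C|.
\end{align*}
```

So whenever 2β − 1 > ρ + α, counting the neighbors a vertex shares with c separates members of C from non-members.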



[Figure: neighborhoods of the champion c and of a vertex v inside and outside C, with regions labeled β|C|, ρ|C|, α|C| and (2β − 1)|C|]

Runs in $O(d^{0.7} n^{1.9} + n^{2+o(1)})$ time, where d is the average degree


Discussion

• Pros
  – Highly parallelizable
  – Experiments show good results
  – Clusters are a good feature in recommendation algorithms
• Cons
  – β > ½ doesn’t seem realistic
  – The ρ-champion requirement is fairly restrictive
  – Not based on observed data

Finding Overlapping Communities

Assumptions

1) Community edges are chosen according to the expected affinity (degree) model.
2) Maximality assumption with gap ε.
3) Community membership accounts for a significant portion of each node’s edges.

Another Algorithm Style

• Grow a community from a set of seed nodes.
• Clique finding:
    Pick s starting nodes at random
    For each starting node v, sample S ⊂ Γ(v)
    For each clique in S, grow to maximal clique.
    Output if it satisfies your conditions.
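A minimal sketch of this seed-and-grow style follows. It simplifies "each clique in S" to single sampled neighbors and adjacent pairs of sampled neighbors, since enumerating every clique in S can be exponential. The function names, the `accept` callback, and the toy graph are assumptions made for illustration.

```python
import random
from itertools import combinations


def grow_to_maximal(adj, clique):
    """Greedily extend a clique until no vertex is adjacent to all members."""
    clique = set(clique)
    candidates = set.intersection(*(adj[u] for u in clique))
    while candidates:
        u = min(candidates)
        clique.add(u)
        candidates &= adj[u]
    return frozenset(clique)


def sample_cliques(adj, num_seeds, sample_size, accept, rng):
    """Pick random start nodes, sample their neighbours, grow and filter cliques."""
    results = set()
    nodes = sorted(adj)
    for v in rng.sample(nodes, min(num_seeds, len(nodes))):
        neighbours = sorted(adj[v])
        sample = rng.sample(neighbours, min(sample_size, len(neighbours)))
        # Small cliques in the sample, each extended by the start node v.
        seeds = [{v, a} for a in sample]
        seeds += [{v, a, b} for a, b in combinations(sample, 2) if b in adj[a]]
        for seed in seeds:
            clique = grow_to_maximal(adj, seed)
            if accept(clique):          # "output if it satisfies your conditions"
                results.add(clique)
    return results


# Hypothetical toy graph with three triangles.
adj = {
    0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3},
    3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
found = sample_cliques(adj, num_seeds=6, sample_size=10,
                       accept=lambda c: len(c) >= 3, rng=random.Random(0))
print(sorted(sorted(c) for c in found))
```

With fewer seeds or a smaller sample the output becomes a random subset of the maximal cliques, which is the trade-off this style makes against exhaustive enumeration.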

Ego Networks

You are the ego. Your friends form the ego network.

Sociology on Ego Networks

Functions Served by Ego Networks

• Social support
• Sense-making
• Social control
• Access to resources
• Behavioral models


Dunbar Circles

Dunbar number: “the theoretical cognitive limit to the number of people with whom one can maintain stable social relationships.” Estimates range between 100 and 250.

Community Detection with Egonets

Idea 1 – When you remove the ego, the egonet breaks into disconnected components.

Idea 2 – It breaks into weakly connected components.
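Idea 1 can be sketched as a connected-components search over the ego's friends, using only the edges among those friends. The adjacency-dict representation, the function name, and the toy graph are assumptions; Idea 2 would need a looser grouping criterion than plain connectivity and is not covered here.

```python
def ego_components(adj, ego):
    """Connected components of the ego network after removing the ego itself."""
    friends = set(adj[ego])
    seen, components = set(), []
    for start in sorted(friends):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            # Follow only edges that stay among the ego's friends.
            stack.extend((adj[u] & friends) - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical ego 0 with two friend groups: {1, 2} and {3, 4, 5}.
adj = {
    0: {1, 2, 3, 4, 5},
    1: {0, 2}, 2: {0, 1},
    3: {0, 4}, 4: {0, 3, 5}, 5: {0, 4},
}
print(ego_components(adj, 0))
```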

[Figure: an egonet whose groups are labeled family, eecs, grade school, college, microsoft, uva, radio, TCS]

Egonet-based Systems: DEMON

DEMON

• Apply a community detection algorithm (Label Propagation) to the egonet
• Repeat this for every user in the network.

Community Definition: The set of communities is the set of maximal sets that ‘contain’ the egonet communities.

DEMON

• Merge the results.

Output: Set of overlapping communities

Running time: $O(nK^{3-\alpha})$
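The overall loop (local label propagation per ego, then merging) might look like the sketch below. It is an illustration under stated assumptions rather than the published algorithm: the label propagation is a simple deterministic variant written for this example, and the merge rule (absorb a candidate when at least a `threshold` fraction of the smaller set lies in the other, with 1.0 meaning full containment) is a simplified single-pass version. All names and the toy graph are hypothetical.

```python
from collections import Counter


def label_propagation(adj, nodes, max_rounds=100):
    """Deterministic label propagation restricted to the given node set."""
    label = {v: v for v in nodes}
    for _ in range(max_rounds):
        changed = False
        for v in sorted(nodes):
            counts = Counter(label[u] for u in adj[v] if u in nodes)
            if not counts:
                continue                      # isolated inside the egonet
            top = max(counts.values())
            best = {lab for lab, c in counts.items() if c == top}
            if label[v] not in best:          # keep the current label on ties
                label[v] = min(best)
                changed = True
        if not changed:
            break
    groups = {}
    for v, lab in label.items():
        groups.setdefault(lab, set()).add(v)
    return list(groups.values())


def demon(adj, threshold=1.0):
    """Run local label propagation around every node and merge the results."""
    communities = []
    for ego in sorted(adj):
        friends = set(adj[ego])               # egonet with the ego removed
        for local in label_propagation(adj, friends):
            candidate = local | {ego}         # put the ego back
            for i, existing in enumerate(communities):
                smaller = min(len(existing), len(candidate))
                if len(existing & candidate) >= threshold * smaller:
                    communities[i] = existing | candidate
                    break
            else:
                communities.append(candidate)
    return communities


# Hypothetical toy graph: two triangles sharing node 0.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}}
result = demon(adj)
print(sorted(sorted(c) for c in result))
```

Node 0 belongs to both output communities, showing how per-ego views combine into overlapping communities.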

‘Real’ Community

[Figure panels: Random Walk, Metis, Infomap, Newman-Modularity, Louvain]

Cornell Study

Slides due to Bruno Abrahao

Community Detection

• Community structure is not well defined
  – different people have different notions of community structure

• Traditional strategy
  – (1) start with an expectation of what a community should look like
      • e.g., a set of nodes that interact more within the set than with the outside
  – (2) define an optimization problem
  – (3) design a heuristic
  – (4) the solution gives the desired communities

Key questions

• A multitude of algorithms
  – different objective functions
  – different heuristics

• How dissimilar are their outputs?

• Communities may differ from the proposed mathematical constructs
  – e.g., preponderance of links to the outside

• Which algorithms extract communities that most closely resemble the structure of real communities?

Obstacles to answering the questions

• We don't know what properties communities possess

• We can't characterize communities in the absence of negative examples
  – look at real communities and determine their structure
  – do other sets that are not communities have these properties?
  – every other connected set could be a negative example: intractable
  – sets that are not annotated could also be communities

• We don't know what metrics we should use
  – modularity, conductance, clustering coefficient...
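To make such metrics concrete, the sketch below computes two standard structural measures for a node set: internal edge density and conductance (cut edges divided by the smaller of the two volumes). The function name, the returned dictionary keys, and the toy graph are assumptions made for illustration; these are not the features used in the study.

```python
def community_features(adj, community):
    """Size, internal edge density and conductance of a node set."""
    members = set(community)
    n = len(members)
    internal = sum(len(adj[v] & members) for v in members) // 2   # edges inside
    boundary = sum(len(adj[v] - members) for v in members)        # edges leaving
    volume = 2 * internal + boundary                  # sum of member degrees
    total_volume = sum(len(nbrs) for nbrs in adj.values())
    possible = n * (n - 1) / 2
    density = internal / possible if possible else 0.0
    denom = min(volume, total_volume - volume)
    conductance = boundary / denom if denom else 0.0
    return {"size": n, "internal_density": density, "conductance": conductance}


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
features = community_features(adj, {0, 1, 2})
print(features)
```

A dense, well-separated set scores high on density and low on conductance; which of these (or other) measures matter is exactly the open question raised above.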

Building structural classes

[Figure: an algorithm is applied to a network to extract community examples]

Building structural classes

[Figure: Algorithm 1, Algorithm 2, ..., Algorithm k each yield one class of examples: Class 1, Class 2, ..., Class k]

Building a feature space

[Figure: each labeled example is mapped to a feature vector; the feature vectors populate the feature space]

Inter-class separability

[Figure: classes plotted in the feature space]

Separability = Distinct structures

Class separability measure: are the classes separable?

Large-scale network datasets

• Social
• Commercial
• Biological

+ Rice University

Facebook+Rice with permission of Mislove et al. Other datasets publicly available.

Community detection algorithms

• BFS (Random connected subgraphs)
• Random-Walk-based (with and without restart)
• (α,β)-communities
• InfoMap
• Markov Clustering
• Metis
• Louvain
• Newman-Clauset-Moore
• Link Communities

Annotated communities

+ Rice University

Metadata included in the datasets identifies exemplar communities that form in these domains

To what extent are the classes separable?

[Figure: a probabilistic k-way classifier (SVM, k-NN) is trained on the classes Algorithm 1, Algorithm 2, ..., Annotated communities]

Probabilistic multi-class learners

Probabilistic k-way classifier (SVM, k-NN)

Classify (cross-validation)

Pr(Algorithm 1) = 0.05
Pr(Algorithm 2) = 0.08
...
Pr(Annotated) = 0.48
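As a rough illustration of what a probabilistic k-way classifier returns, the sketch below uses plain k-NN: the probability of each class is the fraction of the k nearest training examples carrying that label. The feature vectors, labels, and function name are hypothetical and have no connection to the study's data; cross-validation would repeat this with each example held out in turn.

```python
import math
from collections import Counter


def knn_class_probabilities(train, query, k=3):
    """Estimate class probabilities from the k nearest labeled examples.

    train is a list of (feature_vector, class_label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


# Hypothetical two-feature examples for two structural classes.
train = [
    ((1.0, 0.1), "Algorithm 1"),
    ((0.9, 0.2), "Algorithm 1"),
    ((0.2, 0.8), "Annotated"),
    ((0.3, 0.9), "Annotated"),
    ((0.25, 0.7), "Annotated"),
]
probabilities = knn_class_probabilities(train, query=(0.3, 0.8), k=4)
print(probabilities)
```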

Cross-validation performance

Matching annotated communities

• Which algorithms extract communities that most closely resemble the structure of annotated communities?

Probabilistic multi-class learners

[Figure: a probabilistic k-way classifier learns from the classes Algorithm 1, Algorithm 2, ..., Algorithm k]

Probabilistic multi-class learners

Probabilistic k-way classifier

Classify

Pr(Algorithm 1) = 0.02
Pr(Algorithm 2) = 0.19
...
Pr(Algorithm k) = 0.12

Classification of annotated into extracted

Step 1: identifying the most important features

7 features out of 36 retain the discriminative power of the full set

Tendencies of algorithms with respect to most discriminative features

Summary

• Traditional methods are unsupervised
  – they find a particular type of community
  – little sensitivity to different purposes, structures of interest and domains of application

• Our approach suggests a supervised approach to community detection
  – user specifies what they intend to find through examples (real or synthetic)
  – algorithm learns from those examples and retrieves similar structures in the network

Experimental Assignment

• Goal: Do some data mining research, comparing real networks and the models in class

• Due: Email a report by Friday, October 12.