# What is the right clustering of this graph?

Artificial Intelligence and Robotics

25 Nov 2013

Clique Percolation

A community is a collection of adjacent k-cliques.

Questions:

What is a good k?

How to find cliques?

Clique Finding

Find the largest clique in a graph?

NP-hard (the decision version is NP-complete)

Find maximal clique containing a node?

Polynomial

Percolation Algorithm

Find all maximal cliques

Create the clique-clique overlap matrix

Ignore entries less than k
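
The three steps above can be sketched in plain Python. This is a minimal sketch, not a reference implementation: `build_adj`, `maximal_cliques`, and `clique_percolation` are illustrative names, the clique enumeration is a basic Bron-Kerbosch recursion, and the thresholds follow the common clique-percolation convention (an assumption here) that cliques smaller than k are dropped and two cliques count as adjacent when they share at least k - 1 nodes.

```python
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with a basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(adj, k):
    """Return communities as unions of adjacent maximal cliques of size >= k."""
    # Diagonal rule: drop maximal cliques with fewer than k nodes.
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    # Clique-clique overlap matrix: entry (i, j) counts shared nodes.
    overlap = [[len(a & b) for b in cliques] for a in cliques]

    # Union-find over cliques whose overlap passes the (assumed) k - 1 threshold.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if overlap[i][j] >= k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=sorted)
```

On a toy graph made of two triangles that share an edge plus a third triangle attached by a single edge, `clique_percolation(adj, 3)` groups the first two triangles into one community and keeps the third separate.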

Running Time

Maximal clique finding is output-polynomial

Extensively studied

“we note that a complete analysis of a co-authorship network with 127000 links takes less than 2 hours on a PC.”

A more theoretical approach

Important features of a community

Internally dense

Externally sparse

Clique percolation ignores external sparsity

Modularity defines it as the edge cut

(α, β) clustering

A cluster C is an (α, β)-cluster if

Internally dense: every vertex in the cluster neighbors at least a β fraction of the cluster

Externally sparse: every vertex outside the cluster neighbors at most an α fraction of the cluster
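
The definition can be checked directly on a small graph. The sketch below is illustrative only: `build_adj` and `is_alpha_beta_cluster` are hypothetical names, and it assumes that a vertex counts as its own neighbor when testing internal density (a self-loop convention, so that a clique qualifies with β = 1).

```python
from fractions import Fraction


def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_alpha_beta_cluster(adj, cluster, alpha, beta):
    """Check the two (alpha, beta) conditions for a candidate cluster."""
    cluster = set(cluster)
    size = len(cluster)
    for v in adj:
        if v in cluster:
            # Assumption: v counts as adjacent to itself (self-loop convention).
            inside = len((adj[v] | {v}) & cluster)
            if inside < beta * size:
                return False  # not internally dense
        else:
            inside = len(adj[v] & cluster)
            if inside > alpha * size:
                return False  # not externally sparse
    return True


# A 4-clique {a, b, c, d} with one outside vertex e attached to a.
adj = build_adj([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
                 ("b", "d"), ("c", "d"), ("a", "e")])
print(is_alpha_beta_cluster(adj, {"a", "b", "c", "d"}, Fraction(1, 4), 1))
```

Passing `Fraction` values for α and β avoids floating-point surprises when a count sits exactly on the threshold.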

[Figure: example clusters labeled (1/5, 4/5).]

First approach: ρ-Champions

[Figure: example graph with nodes Wes Anderson, Ben Stiller, Owen Wilson, Bill Murray, Gwyneth Paltrow, Will Ferrell, Vince Vaughn, Anjelica Huston, Steve Martin, Dan Aykroyd, Scarlett Johansson, Jack Black, Elijah Wood.]

Algorithm with ρ-Champions

Let c be a ρ-champion

If v is in C, then v and c share at least (2β − 1)|C| neighbors

If v is outside C, then v and c share at most (ρ + α)|C| neighbors
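
The two bounds suggest a simple membership test based on shared-neighbor counts. The sketch below is a hedged illustration, not the algorithm from the slides: it assumes that a ρ-champion is a vertex of C with at most ρ|C| neighbors outside C, that each vertex counts as its own neighbor, and that the cluster size is known in advance. The names `is_champion`, `shared`, and `split_by_champion` are hypothetical.

```python
def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_champion(adj, cluster, c, rho):
    """Assumed definition: c is in C and has at most rho*|C| neighbors outside C."""
    return c in cluster and len(adj[c] - cluster) <= rho * len(cluster)


def shared(adj, u, v):
    """Count common neighbors, treating each vertex as its own neighbor."""
    return len((adj[u] | {u}) & (adj[v] | {v}))


def split_by_champion(adj, c, size, beta):
    """Keep the vertices that share at least (2*beta - 1)*size neighbors with c."""
    threshold = (2 * beta - 1) * size
    return {v for v in adj if shared(adj, c, v) >= threshold}


# A 5-clique {0..4}; vertex 5 hangs off vertex 0, and vertex 6 hangs off 5.
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)] + [(0, 5), (5, 6)]
adj = build_adj(edges)
cluster = {0, 1, 2, 3, 4}
print(is_champion(adj, cluster, 1, 0))      # vertex 1 has no outside neighbors
print(split_by_champion(adj, 1, 5, 1))      # vertices passing the shared-neighbor test
```

The test separates members from non-members only when (2β − 1) exceeds (ρ + α); here β = 1, ρ = 0, and α = 1/5.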

[Figure: neighborhoods of c and v, with regions labeled β|C|, ρ|C|, α|C|, and (2β − 1)|C|.]

Runs in O(d^0.7 n^1.9 + n^(2+o(1))) time, where d is the average degree

Discussion

Pros

Very parallel

Experiments show good results

Clusters are a good feature in recommendation algorithms

Cons

β > 1/2 doesn’t seem realistic

The ρ-champion requirement is fairly restrictive

Not based on observed data

Finding Overlapping Communities

Assumptions

1) Community edges are chosen according to the expected affinity (degree) model.

2) Maximality assumption with gap ε.

3) Community membership accounts for a significant portion of each node’s edges.


Another Algorithm Style

Grow a community from a set of seed nodes.

Clique finding:

Pick s starting nodes at random

For each starting node v, sample S ⊆ Γ(v)

For each clique in S, grow to maximal clique.

Output if it satisfies your conditions.
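
One way to read these steps is sketched below. It is a hypothetical interpretation, not the slides' exact procedure: Γ(v) is taken to be the neighbor set of v, the "clique in S" is built greedily from v and the sampled neighbors, and `accept` stands in for whatever output condition is wanted. All function and parameter names are illustrative.

```python
import random


def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def grow_to_maximal(adj, clique):
    """Extend a non-empty clique until no vertex is adjacent to all members."""
    clique = set(clique)
    candidates = set.intersection(*(adj[u] for u in clique)) - clique
    while candidates:
        v = min(candidates)      # arbitrary but deterministic choice
        clique.add(v)
        candidates &= adj[v]     # keep only vertices adjacent to every member
    return clique


def seeded_cliques(adj, s, sample_size, accept, rng=None):
    """Grow maximal cliques from s random seeds and keep the accepted ones."""
    rng = rng or random.Random()
    found = set()
    for v in rng.sample(sorted(adj), min(s, len(adj))):
        nbrs = sorted(adj[v])
        sample = rng.sample(nbrs, min(sample_size, len(nbrs)))  # S, a subset of Γ(v)
        clique = {v}
        for u in sample:
            if clique <= adj[u]:  # u is adjacent to every current member
                clique.add(u)
        clique = grow_to_maximal(adj, clique)
        if accept(clique):
            found.add(frozenset(clique))
    return found
```

For example, `seeded_cliques(adj, s=10, sample_size=5, accept=lambda c: len(c) >= 3)` would report only the maximal cliques with at least three nodes that the random seeds happen to reach.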

Ego Networks

You are the ego. Your friends form the ego network.

Sociology on Ego Networks

Functions Served by Ego Networks

Social support

Sense-making

Social control

Resources

Behavioral models

Dunbar Circles

Dunbar number: “the theoretical cognitive limit to the number of people with whom one can maintain stable social relationships.” Estimates fall between 100 and 250.

Community Detection with Egonets

Idea 1

When you remove the ego, the egonet breaks into disconnected components.

Idea 2

It breaks into weakly connected components.

[Figure: an egonet with groups labeled family, eecs, college, microsoft, uva, TCS.]
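
Idea 1 can be sketched as a connected-components search over the ego's friends once the ego itself is removed. This is a minimal illustration with hypothetical names (`build_adj`, `egonet_components`); it covers only the disconnected-components case, not the weakly connected case of Idea 2.

```python
def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def egonet_components(adj, ego):
    """Connected components of the ego's friends after removing the ego."""
    friends = set(adj[ego])
    seen, components = set(), []
    for start in sorted(friends):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            # Follow only edges between friends; the ego and strangers are excluded.
            stack.extend((adj[u] & friends) - comp)
        seen |= comp
        components.append(comp)
    return components


# Ego 0 has friends 1-4; vertex 5 links 2 and 3 but is not a friend of the ego.
adj = build_adj([(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (3, 4), (2, 5), (3, 5)])
print(egonet_components(adj, 0))
```

In the example, the path through vertex 5 does not join the two groups because vertex 5 is outside the egonet.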

Egonet-based Systems: DEMON

DEMON

Apply a community detection algorithm (Label Propagation) to the egonet

Repeat this for every user in the network.

Community Definition:

The set of communities is the set of maximal sets that ‘contain’ the egonet communities.

DEMON

Merge the results.

Output:

Set of overlapping communities

Running time: O(nK^(3−α))
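
The DEMON loop described above (local detection per egonet, then merging) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the label propagation is a basic asynchronous variant with ties broken by smallest label, the ego is removed before local detection and added back afterwards, and two communities merge when at most an `eps` fraction of the smaller one lies outside the larger one. All names and the `eps` and `seed` parameters are hypothetical.

```python
import random
from collections import Counter


def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def label_propagation(adj, nodes, rng, max_rounds=100):
    """Basic asynchronous label propagation restricted to `nodes`."""
    labels = {v: v for v in nodes}
    order = sorted(nodes)
    for _ in range(max_rounds):
        rng.shuffle(order)
        changed = False
        for v in order:
            counts = Counter(labels[u] for u in adj[v] if u in nodes)
            if not counts:
                continue
            best = max(counts.values())
            # Switch only if the current label is not already among the most frequent.
            if counts.get(labels[v], 0) < best:
                labels[v] = min(l for l, n in counts.items() if n == best)
                changed = True
        if not changed:
            break
    groups = {}
    for v, label in labels.items():
        groups.setdefault(label, set()).add(v)
    return list(groups.values())


def merge(communities, new, eps):
    """Merge `new` into any community it (almost) contains or is contained in."""
    new = set(new)
    merged = True
    while merged:
        merged = False
        for c in communities:
            small, big = sorted((c, new), key=len)
            if len(small - big) <= eps * len(small):
                communities.remove(c)
                new |= c
                merged = True
                break
    communities.append(new)


def demon(adj, eps=0.0, seed=0):
    """Run local detection on every egonet and merge the results."""
    rng = random.Random(seed)
    communities = []
    for ego in sorted(adj):
        friends = set(adj[ego])  # egonet with the ego removed
        for local in label_propagation(adj, friends, rng):
            merge(communities, local | {ego}, eps)
    return communities
```

On a "bowtie" graph of two triangles sharing one vertex, this sketch returns the two triangles as overlapping communities, with the shared vertex in both.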

‘Real’ Community

Random Walk

Metis

Infomap

Newman-Modularity

Louvain

Cornell Study

Slides due to Bruno Abrahao

Community Detection

Community structure is not well defined

different people have different notions of community structure

(1) start with an expectation of what a community should look like

e.g., a set of nodes that interact more within the set than with the outside

(2) define an optimization problem

(3) design a heuristic

(4) the solution gives the desired communities

Key questions

A multitude of algorithms

different objective functions

different heuristics

How dissimilar are their outputs?

Communities may differ from the proposed mathematical constructs

e.g., preponderance of links to the outside

Which algorithms extract communities that most closely resemble the structure of real communities?

Obstacles to answering the questions

We don't know what properties communities possess

We can't characterize communities in the absence of negative examples

Look at real communities and determine their structure

do other sets that are not communities have these properties?

every other connected set could be a negative example: intractable

sets that are not annotated could also be communities

We don't know what metrics we should use

modularity, conductance, clustering coefficient...

Building structural classes

[Diagram: each algorithm is applied to the network to extract community examples. Algorithm 1 through Algorithm k yield Class 1 through Class k.]

Building a feature space

[Diagram: each labeled example is mapped to a feature vector, a point in the feature space.]
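
Mapping a community to a feature vector might look like the sketch below. The three features (size, internal edge density, conductance) are hypothetical picks for illustration; the slides mention 36 features without listing them here. The names `build_adj` and `community_features` are likewise illustrative.

```python
def build_adj(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def community_features(adj, community):
    """Return [size, internal density, conductance] for a node set."""
    members = set(community)
    n = len(members)
    # Each internal edge is seen from both endpoints, hence the division by 2.
    internal = sum(len(adj[v] & members) for v in members) // 2
    boundary = sum(len(adj[v] - members) for v in members)
    density = internal / (n * (n - 1) / 2) if n > 1 else 0.0
    volume = 2 * internal + boundary
    total_volume = sum(len(nbrs) for nbrs in adj.values())
    denom = min(volume, total_volume - volume)
    conductance = boundary / denom if denom else 0.0
    return [n, density, conductance]


# Two triangles joined by a single edge (3, 4).
adj = build_adj([(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)])
print(community_features(adj, {1, 2, 3}))
```

Each labeled example (a community plus the name of the algorithm or annotation that produced it) would then become one such vector with its class label attached.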

Inter-class separability

Feature Space

Separability = Distinct structures

Class Separability Measure

Are the classes separable?

Large-scale network datasets

Social

Commercial

Biological

+ Rice University

Facebook+Rice with permission of Mislove et al. Other datasets publicly available.

Community detection algorithms

BFS (Random connected subgraphs)

Random-Walk-based (with and without restart)

(α,β)-communities

InfoMap

Markov Clustering

Metis

Louvain

Newman-Clauset-Moore

Annotated communities

+ Rice University

Metadata included in the datasets identifies exemplar communities that form in these domains

To what extent are the classes separable?

Probabilistic k-way classifier (SVM, k-NN)

[Diagram: examples from Algorithm 1, Algorithm 2, and the annotated communities are used to train the classifier.]

Probabilistic multi-class learners

Probabilistic k-way classifier (SVM, k-NN)

Classify (cross-validation)

Pr(Algorithm 1) = 0.05

Pr(Algorithm 2) = 0.08

...

Pr(Annotated) = 0.48
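
A probabilistic k-way classifier of the k-NN kind can be sketched in a few lines. This is an illustrative stand-in, not the classifier used in the study: class probabilities are estimated as the fraction of the k nearest training examples carrying each label, with Euclidean distance on the feature vectors. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_class_probabilities(train, query, k):
    """Estimate Pr(class) for `query` from its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


# Toy feature vectors for two classes of communities.
train = [
    ([0.0, 0.0], "Algorithm 1"),
    ([0.0, 1.0], "Algorithm 1"),
    ([5.0, 5.0], "Annotated"),
    ([5.0, 6.0], "Annotated"),
    ([6.0, 5.0], "Annotated"),
]
print(knn_class_probabilities(train, [1.0, 1.0], k=3))
```

Under cross-validation, each held-out example would be scored this way against the remaining examples, giving one probability per class as in the slide.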

Cross-validation performance

Matching annotated communities

Which algorithms extract communities that most closely resemble the structure of annotated communities?

Probabilistic multi-class learners

Probabilistic k-way classifier

[Diagram: the classifier learns from examples extracted by Algorithm 1, Algorithm 2, ..., Algorithm k.]

Probabilistic multi-class learners

Probabilistic k-way classifier

Classify

Pr(Algorithm 1) = 0.02

Pr(Algorithm 2) = 0.19

...

Pr(Algorithm k) = 0.12

Classification of annotated into extracted

Step 1: identifying the most important features

7 features out of 36 retain the discriminative power of the full set

Tendencies of algorithms

with respect to most discriminative features

Summary

Existing algorithms are unsupervised

they find a particular type of community

little sensitivity to different purposes, structures of interest and domains of application

Our approach suggests a supervised approach to community detection

user specifies what they intended to find through examples (real or synthetic)

algorithm learns from those examples and retrieves similar structures in the network

Experimental Assignment

Goal: Do some data mining research, comparing real networks and the models in class

Due: Email a report by Friday, October 12.