Kronecker Graphs - CS 294: Social and Information Networks


Kronecker Graphs

The Kronecker Graph Model (R-MAT)


Start with a parameter (initiator) matrix A

For n vertices, take $\log_2 n$ Kronecker products: $A \otimes A \otimes \cdots \otimes A$

Normalize the entries
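
A minimal sketch of building the full Kronecker matrix with repeated products; the 2×2 initiator values below are illustrative placeholders, not the entries from the slide.

```python
import numpy as np

# Repeated Kronecker products of a small initiator matrix (illustrative values,
# not the parameters from the slide), followed by normalization.
A = np.array([[0.9, 0.5],
              [0.5, 0.1]])

P = A
for _ in range(3):        # 3 more products -> a 16x16 matrix (4 factors for n = 16)
    P = np.kron(P, A)

P = P / P.sum()           # normalize the entries so they sum to 1
print(P.shape)            # (16, 16)
```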

Generating Edges

One method:
Calculate the whole Kronecker matrix
Sample each edge independently according to its entry

Another method:
Treat the parameters as probabilities
Flip $\log_2 n$ weighted coins per edge, one per level of the recursion (see the sketch below)
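
A rough sketch of the second method for a 2×2 initiator: each edge is placed by flipping $\log_2 n$ weighted coins, one per recursion level. The quadrant probabilities a, b, c, d are illustrative, not values given in the lecture.

```python
import random

def rmat_edge(scale, a=0.57, b=0.19, c=0.19, d=0.05):
    """Sample one edge of a 2**scale-vertex graph with log2(n) weighted coin flips."""
    src = dst = 0
    for _ in range(scale):
        r = random.random()
        if r < a:               # top-left quadrant
            bit_s, bit_d = 0, 0
        elif r < a + b:         # top-right quadrant
            bit_s, bit_d = 0, 1
        elif r < a + b + c:     # bottom-left quadrant
            bit_s, bit_d = 1, 0
        else:                   # bottom-right quadrant
            bit_s, bit_d = 1, 1
        src = (src << 1) | bit_s
        dst = (dst << 1) | bit_d
    return src, dst

edges = [rmat_edge(scale=10) for _ in range(5000)]   # ~5000 edges on 1024 vertices
```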

Features

Pro:
Fast to generate: parallel and distributed
Few parameters to fit
Self-similarity

Con:
Doesn't have a power-law degree distribution
Parameters aren't intuitive
May not be connected

Used in the Graph500 benchmark
[Seshadhri, Kolda, Pinar]

Variance of Real Graphs
[Moreno, Kirshner, Neville, Vishwanathan]

Web Search and Ranking

Web Search

Information Retrieval: given a query ("Hugh Laurie"), find all documents that mention those words


Web Ranking Before 1998

Use tf-idf (roughly): term frequency weighted by inverse document frequency

$\mathrm{tf}_{t,d}$ = # of occurrences of term $t$ in document $d$

$\mathrm{idf}_{t,D}$ = inverse of the number of documents in the corpus $D$ that contain $t$

Score a document by $\mathrm{tf}_{t,d} \cdot \mathrm{idf}_{t,D}$, summed over query terms
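
A toy tf-idf scorer, assuming the corpus is a list of tokenized documents and using the common log(N/df) form of idf; the exact weighting used before 1998 varied by system.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document against the query by summed tf * idf."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # document frequency of each term
    scores = []
    for doc in docs:
        tf = Counter(doc)                  # term frequency within this document
        score = sum(tf[t] * math.log(n_docs / doc_freq[t])
                    for t in query_terms if doc_freq[t] > 0)
        scores.append(score)
    return scores

docs = [["hugh", "laurie", "actor"], ["house", "md", "hugh"], ["something", "else"]]
print(tf_idf_scores(["hugh", "laurie"], docs))
```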

Results

It was bad

The best results for a topic may not mention the topic explicitly very often

What are we missing?

Traditional IR only has the text to work with

We have an information network

The hyperlinks are created by intelligent, rational beings!

1998

HITS (J. Kleinberg)

What if we ranked documents by in-links?

The power-law distribution on in-degree will get us every time.

HITS

Idea: Different pages and different links play different roles

Some pages are AUTHORITIES
Some pages are HUBS

Hubs

What is a good hub?
A page is a good hub if it points to many authorities.

Authorities

What is a good authority?
A page is a good authority if many hub pages point to it.

How can we find good hubs and good authorities?

HITS

Everyone starts with a hub-score of 1 and an authority-score of 1.

A-update: For each page p, auth(p) is the sum of the hub-scores of pages that point to p.

H-update: For each page p, hub(p) is the sum of the auth-scores of pages p points to.

Formally

M is the adjacency matrix, h the hub-scores, and a the auth-scores:

$a^{(i)} = M^{T} h^{(i-1)}$
$h^{(i)} = M a^{(i)}$

Calculated on the subgraph that corresponds to the query at hand.

How many iterations should we do?
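
A sketch of the HITS iteration in matrix form, assuming M[p][q] = 1 when page p links to page q; the iteration count of 20 is an arbitrary choice, not one fixed by the lecture.

```python
import numpy as np

def hits(M, iters=20):
    """Alternate A-updates and H-updates, normalizing so scores stay bounded."""
    n = M.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        auth = M.T @ hub                 # A-update: sum hub-scores of in-neighbors
        hub = M @ auth                   # H-update: sum auth-scores of out-neighbors
        auth /= np.linalg.norm(auth)
        hub /= np.linalg.norm(hub)
    return hub, auth

M = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)   # toy query subgraph
print(hits(M))
```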

Where does HITS fail?

Assumes a bipartite clique structure to the web

Doesn't allow more general forms of endorsement

PageRank: Try 1

Instead of h and a scores, just one score.

PR-update(p) = sum of the normalized PR scores (PR score divided by out-degree) of each page that points to p

Where does this fail? Hint: The web graph is directed.

Actual PageRank

Make the graph strongly connected by adding epsilon-weight links between all pages.

Let A be the normalized adjacency matrix:

$P = (1 - \epsilon)\, A P + \epsilon\, \mathbf{1}/n$

Calculating with the Power Method

Start with $P^{(1)} = \mathbf{1}/n$

Calculate $P^{(2)} = A\, P^{(1)}$

Add $\epsilon$ to every entry

Normalize and repeat

Repeat this $1/\epsilon$ times
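
A sketch of the recipe above, assuming A is the column-normalized adjacency matrix (each column sums to 1). The teleport step adds ε/n to each entry so the total added mass is ε; the defaults below are illustrative.

```python
import numpy as np

def pagerank(A, eps=0.15, iters=None):
    n = A.shape[0]
    p = np.ones(n) / n                 # start with P(1) = 1/n
    if iters is None:
        iters = int(1 / eps)           # "repeat this 1/eps times"
    for _ in range(iters):
        p = A @ p                      # follow links
        p = p + eps / n                # add the teleport mass to every entry
        p = p / p.sum()                # normalize back to a distribution
    return p

# Toy web: page 0 -> {1, 2}, page 1 -> {2}, page 2 -> {0}
A = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(pagerank(A))
```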

The Random Surfer Model

What natural process can justify PageRank?

How can we model how people might use the web?

The Random Surfer

Starts at some page on the web

With probability $(1 - \alpha)$, selects a random link on the page and follows it

With probability $\alpha$, gets bored and jumps to some new random web page.

$\Pr[\text{visit } j] = (1 - \alpha) \sum_{i \to j} \frac{\Pr[\text{visit } i]}{\mathrm{outdeg}(i)} + \frac{\alpha}{n}$

The Random Surfer

The PageRank vector is the probability that you will visit each website in this process
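
A Monte Carlo version of the random surfer, as a sketch: simulate many steps and count visits; the empirical visit frequencies approximate the PageRank vector. The out_links map, alpha, and the step count are all illustrative.

```python
import random
from collections import Counter

def random_surfer(out_links, alpha=0.15, steps=100_000):
    pages = list(out_links)
    visits = Counter()
    page = random.choice(pages)                    # start at some page on the web
    for _ in range(steps):
        visits[page] += 1
        if random.random() < alpha or not out_links[page]:
            page = random.choice(pages)            # bored: jump to a random page
        else:
            page = random.choice(out_links[page])  # follow a random out-link
    return {p: visits[p] / steps for p in pages}

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(random_surfer(web))
```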

Random Walks on Graphs

[Figure: a small graph with random-walk transition probabilities 1, 1/3, 1/2, 1/3 marked on its edges]

Stationary Distributions

What does this process converge to?

Connection between eigenvectors and stationary distributions. Why is the top eigenvalue always 1?

$A \pi = \pi$
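
A two-line argument for why 1 is always the top eigenvalue, assuming A is column-stochastic (a standard fact, not stated on the slide):

```latex
\begin{align*}
\mathbf{1}^{T} A &= \mathbf{1}^{T}
  \quad\Rightarrow\quad 1 \text{ is an eigenvalue of } A^{T}\text{, hence of } A, \\
A x = \lambda x
  \quad&\Rightarrow\quad |\lambda|\,\lVert x\rVert_{1} = \lVert A x\rVert_{1} \le \lVert x\rVert_{1}
  \quad\Rightarrow\quad |\lambda| \le 1 .
\end{align*}
```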


Mixing Time

How long does it take to converge?

Why does PageRank converge in $O(1/\epsilon)$ time?

$\tau_{\mathrm{mix}} = \min\Big\{\, t : \max_{x}\, \big\lVert p^{(t)}_{x} - \pi \big\rVert_{TV} < \tfrac{1}{4} \Big\}$





$\tau_{\mathrm{mix}} \le O\!\big(\Phi^{-2}\,\log(1/\pi_{\min})\big)$, where $\Phi$ is the conductance of the graph

Undirected Graphs

The stationary distribution is proportional to the degree:

$\forall\, u, v:\ \frac{\pi(u)}{d(u)} = \frac{\pi(v)}{d(v)}$, i.e. $\pi(v) = \frac{d(v)}{2|E|}$
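
A one-line check that this distribution is stationary, assuming the walk moves to a uniformly random neighbor at each step:

```latex
\begin{align*}
\sum_{u \sim v} \pi(u)\,P(u \to v)
  = \sum_{u \sim v} \frac{d(u)}{2|E|}\cdot\frac{1}{d(u)}
  = \frac{d(v)}{2|E|}
  = \pi(v).
\end{align*}
```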

Spectral Analysis for HITS

$a^{(1)} = M^{T} h^{(0)}$
$h^{(1)} = M a^{(1)} = M M^{T} h^{(0)}$
$a^{(2)} = M^{T} h^{(1)} = M^{T} M M^{T} h^{(0)}$
$h^{(2)} = M a^{(2)} = M M^{T} M M^{T} h^{(0)}$
$\vdots$
$a^{(k)} = (M^{T} M)^{k-1} M^{T} h^{(0)}$
$h^{(k)} = (M M^{T})^{k} h^{(0)}$

APPLICATIONS AND EXTENSIONS

Personalized PageRank

$P = (1 - \epsilon)\, A P + \epsilon\, \mathbf{1}/n$

What if the surfer didn't jump uniformly at random?

$s$ can be any distribution over the pages:

$P(\epsilon, s) = (1 - \epsilon)\, A\, P(\epsilon, s) + \epsilon\, s$
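
The same power iteration as before, with the uniform jump replaced by an arbitrary restart distribution s; the toy matrix and iteration count are illustrative.

```python
import numpy as np

def personalized_pagerank(A, s, eps=0.15, iters=100):
    """Iterate P <- (1 - eps) * A P + eps * s until (approximately) fixed."""
    s = np.asarray(s, dtype=float)
    s = s / s.sum()                       # make sure s is a distribution
    p = s.copy()
    for _ in range(iters):
        p = (1 - eps) * (A @ p) + eps * s
    return p

A = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])           # same toy web as before
print(personalized_pagerank(A, s=[1, 0, 0]))   # personalize to page 0
```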

Uses of Personalized PageRank

Creating personalized search results

Topic-sensitive PageRank

Local community detection

Can you compute it more efficiently than PageRank?

The Intentional Surfer

Click data is collected by:
The Google/Bing toolbars
Cookies from ad websites

Can use this to get better estimates of the click-through rate of each link

Modifies our transition probabilities to improve PageRank

Search Engine Optimization

Designing your page with the ranking function in mind

Co-evolves with search engines

Obvious tricks:
Make a collection of websites that point to you
Buy old webpages
Include text in a font the same color as the background
Pay others to link to you

Link Spam Detection

[Figure: "spam" pages and the web graph]

Connection to HITS

If you link to a lot of spam sites, you are probably also spam. (Hub)

If you are linked to by lots of spam sites, you are probably why that spam collection was built. (Authority)

Start with seed sites with Hub and Authority scores of 1.

Trust Propagation

Given some information (i trusts j) or (i does not trust j), how can we model trust in a network?

Direct Propagation

Transpose Propagation

Co-citation

Trust Coupling

Types of Trust Propagation

Direct propagation (i, j, k): $M^{2}$

Transpose propagation (i, j): $M^{T}$

Co-citation (i, j, k, m): $M^{T} M$

Trust coupling (i, j, m): $M M^{T}$

Distrust Propagation

Trust only: $B = T$

One-step distrust: $B = T$, $P^{(k)} = C_{B}^{k}\,(T - D)$

Propagated distrust: $B = T - D$, $P^{(k)} = C_{B}^{k}$

Here $T$ is the trust matrix, $D$ the distrust matrix, and $C_{B}$ the combined propagation matrix built from $B$.

Propagating Trust and Distrust

Eigenvalue propagation: $F = P^{(K)}$

Weighted linear combination: $F = \sum_{k=1}^{K} \gamma^{k}\, P^{(k)}$

How do you round this matrix to give trust/distrust?
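
A sketch of the weighted-linear-combination scheme, combining the four atomic propagations from the earlier slide; the α weights, γ, K, and the toy matrices are illustrative, not values from the lecture.

```python
import numpy as np

def propagate(T, D, K=3, gamma=0.5, alphas=(0.4, 0.4, 0.1, 0.1),
              propagated_distrust=True):
    B = T - D if propagated_distrust else T          # belief matrix
    a1, a2, a3, a4 = alphas
    # Combined atomic propagation: direct, co-citation, transpose, coupling.
    C = a1 * B + a2 * (B.T @ B) + a3 * B.T + a4 * (B @ B.T)
    F = np.zeros_like(B)
    P = np.eye(B.shape[0])
    for k in range(1, K + 1):
        P = P @ C                                    # P^(k) = C^k
        F += (gamma ** k) * P                        # weighted linear combination
    return F                                         # threshold F to predict trust/distrust

T = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
D = np.array([[0, 0, 0], [0, 0, 0], [1, 0, 0]], dtype=float)
print(np.round(propagate(T, D), 3))
```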

Experiments

Epinions 'web-of-trust'

841,372 edges, labeled + or -

Try all combinations of trust and distrust propagation.

What is the best model?


Project Proposals

Email by 9/26 to:
isabelle@eecs.berkeley.edu
anirban.dasgupta+cs294@gmail.com