# Kronecker Graphs - CS 294: Social and Information Networks


Kronecker Graphs

The Kronecker Graph Model (R-MAT)

Start from a small initiator matrix K1 of probabilities; grow it by repeated Kronecker products: K_k = K1 ⊗ K_{k−1}

For n vertices, take log n Kronecker products

Normalize the entries
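As a minimal sketch (pure Python, with a hypothetical 2×2 initiator — the entries below are illustrative, not the slide's), repeated Kronecker products grow the matrix of edge probabilities:

```python
def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

# Hypothetical 2x2 initiator with entries in [0, 1].
K1 = [[0.9, 0.5],
      [0.5, 0.1]]

# Three Kronecker factors give an 8x8 matrix of edge probabilities.
K = K1
for _ in range(2):
    K = kron(K, K1)
```

Entry (i, j) of K is the product of one initiator entry per recursion level, which is where the model's self-similarity comes from.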

Generating Edges

One Method

Calculate the whole Kronecker matrix

Sample each edge independently according to its entry

Another Method

Treat parameters as probabilities

Flip log n coins for each edge
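The second method can be sketched as follows: one biased coin flip per recursion level picks a quadrant, so each edge costs k = log n flips (parameter values here are the Graph500 defaults; the function name is ours):

```python
import random

def rmat_edge(k, p, rng=random):
    """Sample one edge of a 2^k-vertex R-MAT graph with k coin flips.

    p = (a, b, c, d) are the four quadrant probabilities (summing to 1);
    at each of the k recursion levels we descend into one quadrant.
    """
    a, b, c, d = p
    row = col = 0
    for _ in range(k):
        x = rng.random()
        if x < a:
            q = (0, 0)
        elif x < a + b:
            q = (0, 1)
        elif x < a + b + c:
            q = (1, 0)
        else:
            q = (1, 1)
        row = 2 * row + q[0]
        col = 2 * col + q[1]
    return row, col

# Graph500-style parameters: a=0.57, b=0.19, c=0.19, d=0.05
u, v = rmat_edge(10, (0.57, 0.19, 0.19, 0.05))
```

Because every edge is sampled independently, this loop parallelizes trivially, which is the "fast to generate" point below.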

Features

Pro

Fast to generate: parallel and distributed

Few parameters to fit

Self-similarity

Con

Doesn't have a power-law distribution

Parameters aren't intuitive

May not be connected

Used in Graph500 benchmark [Kolda, Pinar]

Variance of Real Graphs

[Moreno, Kirschner, Neville, Vishwanathan]

Web Search and Ranking

Web Search

Information Retrieval: Given a query "Hugh Laurie", find all documents that mention those words

Web Ranking Before 1998

Use tf-idf (roughly): term frequency × inverse document frequency

tf(t, d) = # of occurrences of t in the document d

idf(t, D) ≈ inverse of the # of occurrences of t in D, the corpus
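A minimal sketch of scoring a document this way (the function name and toy corpus are ours; we use the common log-of-inverse-document-frequency variant rather than the slide's rough raw count):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf-idf score: term frequency times log inverse document frequency.

    doc is a list of tokens; corpus is a list of such documents.
    """
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)  # documents containing term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

docs = [["hugh", "laurie", "actor"], ["actor", "film"], ["hugh", "laurie"]]
score = tf_idf("laurie", docs[0], docs)
```

Terms that appear in every document get weight log(1) = 0, which is why common words contribute nothing to the ranking.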

Results

The best results for a topic may not mention the topic explicitly a lot

What are we missing?

Traditional IR only has the text to work with

We have an information network

The hyperlinks are created by intelligent, rational beings!

1998

HITS (J. Kleinberg)

What if we ranked documents by in-degree?

The power law distribution on in-degree will get us every time.

HITS

Idea: Different pages and different links play different roles

Some pages are AUTHORITIES

Some pages are HUBS

Hubs

What is a good hub?

A page is a good hub if it points to many authorities.

Authorities

What is a good authority?

A page is a good authority if many hub pages point to it.

How can we find good hubs and good authorities?

HITS

Everyone starts with a hub-score of 1 and authority-score of 1

A-update: For each page p, auth(p) is the sum of the hub-scores of pages that point to p.

H-update: For each page p, hub(p) is the sum of the auth-scores of pages p points to.

Formally

M is the adjacency matrix, h the hub-scores and a the auth-scores

h(i) = M a(i−1)

a(i) = M^T h(i)

Calculated on the subgraph that corresponds to the query at hand
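The A-update and H-update above can be sketched directly on an adjacency list (function name and toy graph are ours):

```python
def hits(adj, iters=20):
    """Power-iterate the HITS updates.

    adj[p] is the list of pages p points to. Returns (hub, auth) dicts,
    each normalized to sum to 1.
    """
    pages = list(adj)
    hub = {p: 1.0 for p in pages}   # everyone starts with score 1
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # A-update: auth(p) = sum of hub-scores of pages pointing to p
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in adj[p]:
                auth[q] += hub[p]
        # H-update: hub(p) = sum of auth-scores of pages p points to
        hub = {p: sum(auth[q] for q in adj[p]) for p in pages}
        # normalize so the scores don't blow up across iterations
        for vec in (hub, auth):
            s = sum(vec.values()) or 1.0
            for p in pages:
                vec[p] /= s
    return hub, auth

# tiny example: two hubs x, y each pointing at authority z
hub, auth = hits({"x": ["z"], "y": ["z"], "z": []})
```

On this toy graph z ends up with all the authority score and x, y split the hub score evenly.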

How many iterations should we do?

Where does HITS fail?

Assumes a bipartite clique structure to the web

Doesn't allow more general forms of endorsement

PageRank, try 1

Instead of separate h and a scores, just one score.

PR-update(p) = sum of normalized PR scores of each page that points to p

Where does this fail? Hint: The web graph is directed.

Actual PageRank

Make the graph strongly connected by adding epsilon-weight links between all pages.

Let A be the normalized adjacency matrix

P = (1 − ε) A P + ε 𝟏/n

Calculating with the Power Method

P(1) = 𝟏/n

Calculate P(2) = A P(1)

Add ε to every entry

Normalize and repeat

Repeat this 1/ε times
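The recipe above can be sketched as follows (function name and toy graph are ours; instead of add-ε-then-normalize we fold the ε mixing into each update, which keeps the vector a distribution throughout):

```python
def pagerank(adj, eps=0.15, iters=100):
    """Power-method PageRank on an adjacency list (adj[p] = out-links).

    Start uniform, push (1 - eps) of each score along out-links,
    mix in eps of the uniform distribution, repeat.
    """
    pages = list(adj)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: eps / n for p in pages}   # the epsilon-weight links
        for p in pages:
            if adj[p]:
                share = (1 - eps) * pr[p] / len(adj[p])
                for q in adj[p]:
                    nxt[q] += share
            else:
                # dangling page: spread its score uniformly
                for q in pages:
                    nxt[q] += (1 - eps) * pr[p] / n
        pr = nxt
    return pr

pr = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

The scores stay normalized (they sum to 1) at every step, so the result can be read directly as the random surfer's visit distribution.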

The Random Surfer Model

What natural process can justify PageRank?

How can we model how people might use the web?

The Random Surfer

Starts at some page on the web

With probability (1 − α), picks a link on the page and follows it

With probability α, gets bored and jumps to some new random web page.

Pr[visit j] = (1 − α) Σ_{i : i → j} Pr[visit i] / outdeg(i) + α/n

The Random Surfer

The PageRank vector is the probability that
you will visit each website in this process

Random Walks on Graphs

[Figure: a small graph with transition probabilities 1, 1/3, 1/2, 1/3 labeled on its edges]

Stationary Distributions

What does this process converge to?

Connection between eigenvectors and stationary distributions. Why is the top eigenvalue always 1?

A π = π

Mixing Time

How long does it take to converge?

Why does PageRank converge in O(1/ε) time?

τ_mix = min { t : max_x ‖ P^t(x, ·) − π ‖ < 1/4 }

‖ P^t(x, ·) − π ‖ shrinks at a rate governed by the conductance Φ; roughly, τ_mix scales like 1/Φ²

Undirected Graphs

The stationary distribution is proportional to the degree

For every edge (u, v): π(u) P(u, v) = π(v) P(v, u), so π(v) ∝ deg(v)
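A quick numeric check of this claim on a small undirected graph (the graph and function name are ours; the graph is non-bipartite so the walk actually converges):

```python
def step(dist, adj):
    """One step of the simple random walk: split mass evenly over neighbors."""
    nxt = {v: 0.0 for v in adj}
    for v, mass in dist.items():
        for u in adj[v]:
            nxt[u] += mass / len(adj[v])
    return nxt

# undirected graph: triangle a-b-c plus a pendant vertex d attached to b
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
dist = {v: 1 / len(adj) for v in adj}   # start uniform
for _ in range(200):
    dist = step(dist, adj)

# degrees are a:2, b:3, c:2, d:1 (total 8), so dist -> deg(v)/8
```

After a few hundred steps the distribution matches deg(v)/2m on every vertex, as the detailed-balance equation predicts.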

Spectral Analysis for HITS

a(1) = M^T h(0)

h(1) = M a(1) = M M^T h(0)

a(2) = M^T h(1) = M^T M M^T h(0)

h(2) = M a(2) = M M^T M M^T h(0)

a(k) = (M^T M)^{k−1} M^T h(0)

h(k) = (M M^T)^k h(0)

So a(k) converges to the top eigenvector of M^T M, and h(k) to the top eigenvector of M M^T.

APPLICATIONS AND EXTENSIONS

Personalized PageRank

P = (1 − ε) A P + ε 𝟏/n

What if the surfer didn't jump randomly?

s can be any distribution over the pages

P(ε, s) = (1 − ε) A P(ε, s) + ε s
Uses of Personalized PageRank

Creating personalized search results

Topic-sensitive PageRank

Local community detection

Can you compute it more efficiently than
PageRank?

The Intentional Surfer

Click data is collected by

Can use this to get better estimates for click

Modifies our transition probabilities to
improve PageRank

Search Engine Optimization

Designing your page with the ranking function in mind

Co-evolves with search engines

Obvious Tricks

Make a collection of websites to point to you

Include text in background-color font

Pay others to link to you

Spam Detection

Spam in the web graph

Connection to HITS

If you link to a lot of spam sites, you are probably also spam. (Hub)

If you are linked to by lots of spam sites, you are probably why that spam collection was built. (Authority)

Run the HITS-style updates, starting everyone with scores of 1.

Trust Propagation

Given some information (i trusts j) or (i does not trust j), how can we model trust in a network?

Types of Trust Propagation

Direct propagation (M²): i trusts j and j trusts k, so i trusts k

Transpose propagation (M^T): i trusts j, so j gains some trust in i

Co-citation (M^T M): i trusts j and k, and m trusts j, so m trusts k

Trust coupling (M M^T): i and m both trust j, so trust in i transfers to m

Distrust Propagation

Trust only: C = T

1-step distrust: C = T, P(k) = C^k (T − D)

Propagated distrust: C = T − D, P(k) = C^k

Propagating Trust and Distrust

Eigenvalue propagation: F = P(K)

Weighted linear combination: F = Σ_{k=1}^{K} γ^k P(k)

How do you round this matrix to give trust/distrust?
Experiments

Epinions 'web-of-trust'

841,372 edges labeled + or −.

Try all combinations of trust and distrust propagation.

What is the best model?

Project Proposals

Email by 9/26 to:

isabelle@eecs.berkeley.edu

anirban.dasgupta+cs294@gmail.com