Random Walking on the World-Wide-Web


Project in the Technion Electrical Engineering Software Lab






Random Walking on the World-Wide-Web

Final Report





Team members:

Levin Boris 304082506
Laserson Itamar 060403359

Instructor Name:

Gurevich Maxim



Table of Contents:

Part 1 - Introduction
Part 2 - Project Description
Part 3 - Algorithm Description
Part 4 - Design and System Architecture
Part 5 - Results and Analysis
Part 6 - Conclusions
Part 7 - References

1. Introduction

Timely and accurate statistics about web pages are becoming increasingly important for academic and commercial use. Some relevant questions are: What percentage of web pages are in the .com domain? How many pages are indexed by a particular search engine? What is the distribution of sizes, modification times, and content of web pages?

In order to answer these questions we need to estimate the size of certain sets of web pages. In this project we implemented and ran a "webwalker", proposed by Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz in their paper [1], which is a technique for sampling web pages using a random walk on the web. The random walk surfs the web "at random": at each step it either follows a random link from the current page, or follows a random link that enters the current page "in reverse".

The main goal of this project was to create a generic design that is able to run all sorts of random walks using this code. Other goals of this project include implementing and experimenting with the webwalker to evaluate its performance, and using the webwalker to learn interesting properties of the web.

The generic design implemented in this project includes support for the search engine based random walkers we designed in our previous project.

The programming language used in this project is Java, and development was done in the Eclipse development environment. The search engine we used is Yahoo, since it provides an API to its search services.




















2. Project Description

The project consists of several important steps:

- Reading and understanding the article "Approximating Aggregate Queries about Web Pages via Random Walks" by Ziv Bar-Yossef et al. [1].

- Learning to use the Yahoo interface - a Yahoo public API for getting the incoming URLs for a specific webpage.

- Software structure design - designing the framework of the project: classes, interfaces, inheritance and relations. This was the major part of the project, as we were developing a general framework able to run different random walks for sampling web pages. This meant that the code we designed had to be generic and suited for different scenarios and implementations. For example, the code in this project was also meant to run our previous project - Search Based Random Walks.

- Implementing the code that we designed.

- Designing data structures for holding the information gathered by the run of the project.

- Planning the analysis for the algorithms - in this step we planned the parameters we would run our simulations on. We selected several parameters that would give us the needed information on the performance of the algorithm, and some tools for comparing it to other similar algorithms.

- Running the simulations for the algorithm - in this step we ran the simulations on our algorithm using the various parameters that we defined before; all the results were saved in .xml files.

- Implementing a result analyzer - this is the part where we implemented a result analyzer, a Java program that took the output data files and wrote the information needed for our result analysis to .csv files that can be opened with Excel.

- Analyzing the results - here we analyzed the results we got from the result analyzer, made the graphs in Excel, and saw how the random walk behaves in the real world.



3. Algorithm Description

The Webwalker algorithm description [1] (refer to chapters 3.1 and 3.2 of [1]):

WebWalker performs a regular undirected random walk while exploring the web, using link resources (i.e., the HTML text, the walk itself, and the Yahoo search engine). As WebWalker traverses the web it builds a d-regular undirected graph that is used to determine the next step of the walk. Each time WebWalker visits a new page it adds a corresponding node v to the graph. At this point we determine, and fix for the remainder of the walk, the d edges (possibly including self-loops) incident to v.

More explicitly, denote by N(v) the set of v's neighbors that are available from the above resources the first time WebWalker visits v. WebWalker considers only pages in N(v) to be neighbors of v throughout the walk. If it happens to discover a new link u -> v later, it ignores this link and does not add u to N(v). This is done in order to ensure consistency in the graph on which WebWalker walks: we want to make sure that the exact same set of neighbors is available to WebWalker every time it visits v.

At each step, in order to fix the bias of the random walk, at each node we compute the number of self-loops according to a geometric distribution with parameter p = 1 - w/d, where w = d - |N(v)| is the weight of the node (its number of self-loop edges) and d is the maximum web degree, defined as 1,000,000 in our case.




Web Walker Pseudo Code:

WebWalker_Visit(v) {
    If v was already visited, skip to SELF
    I := r random in-neighbors of v (from the Yahoo in-links API)
    O := all out-neighbors of v (from FileParser.getOutgoingLinks)
    For all u in (I + O) \ N(v) do {
        If u was not visited yet {
            Add u to N(v)
            Add v to N(u)
        }
    }
SELF:
    w := d - |N(v)|
    Compute the number of self loops spent at v according to p = 1 - w/d (distributed geometrically)
SELECT:
    Select a random u in N(v)
    If u is "bad", go back to SELECT
    WebWalker_Visit(u)
}
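The SELF step above can be sketched in Java (the project's language). This is a minimal illustration under our own names and structure, not the project's actual code: it draws the number of self-loop steps spent at v from a geometric distribution with leaving probability p = 1 - w/d, where w = d - |N(v)|.

```java
import java.util.Random;

// Sketch of the SELF step: how many self-loop steps the walk spends at a
// node before leaving it. Hypothetical helper class, for illustration only.
public class SelfLoopSketch {

    // p = 1 - w/d is the probability of leaving v at each step,
    // where w = d - |N(v)| is the number of self-loop edges at v.
    public static int sampleSelfLoops(int neighborCount, int d, Random rnd) {
        int w = d - neighborCount;            // self-loop edges at v
        double pLeave = 1.0 - (double) w / d; // equals |N(v)| / d
        int loops = 0;
        while (rnd.nextDouble() >= pLeave) {  // stay at v with prob. 1 - pLeave
            loops++;
        }
        return loops;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // With d = 1,000,000 (the project's setting) and only 100 real
        // neighbors, pLeave is tiny and the walk lingers for many steps.
        System.out.println("self-loops spent: " + sampleSelfLoops(100, 1_000_000, rnd));
    }
}
```

Note how a node whose degree equals d has no self-loops at all: pLeave is 1 and the walk leaves immediately.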













The Search Engine Based Random Walker algorithm description:

We will now discuss the random walk algorithm and its two implementations - MH and MD. For more details see [2, 5]. Both algorithms are variants of the basic random walker algorithm, described below in pseudo code.

SE: the search engine.
B: the number of steps.
x0: the initial URL.


1: Function RWSampler(SE, B, x0)
2:     X := x0
3:     for t = 1 to B do
4:         Y := sampleNeighbor(SE, X)
5:         if (accept(X, Y))
6:             X := Y
7:     return X
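The RWSampler loop can be sketched in Java. This is our own simplification for illustration: the neighbor sampler and the acceptance rule are injected as functions, so the same loop serves both the MD and MH variants; the toy "web" in main is obviously not a real search engine.

```java
import java.util.Random;
import java.util.function.BiPredicate;
import java.util.function.UnaryOperator;

// Sketch of the RWSampler loop from the pseudo code above. Illustrative
// only; names and the functional-interface plumbing are ours.
public class RwSamplerSketch {

    public static String rwSampler(String x0, int b,
                                   UnaryOperator<String> sampleNeighbor,
                                   BiPredicate<String, String> accept) {
        String x = x0;
        for (int t = 1; t <= b; t++) {
            String y = sampleNeighbor.apply(x); // candidate next URL
            if (accept.test(x, y)) {
                x = y;                           // walk to the candidate
            }                                    // otherwise stay at x
        }
        return x;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        // Toy "web": each URL's sampled neighbor is itself plus "/next".
        String last = rwSampler("http://example.com", 3,
                url -> url + "/next",
                (x, y) -> rnd.nextDouble() < 0.5); // coin-toss accept
        System.out.println(last);
    }
}
```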



The accept function determines whether the random walker should accept the next URL computed and walk to it, or stay at the current URL. Every variant of the random walk (MD and MH) implements this method differently.

r_mcmc(x, y): this function is implemented differently for each variant of the random walk (MH or MD). We elaborate below on the implementations of r_MH(X, Y) and r_MD(X, Y).

x: the current URL.
y: the candidate for the next URL.




1: Function accept(X, Y)
2:     compute r_mcmc(X, Y)
3:     toss a coin whose heads probability is r_mcmc(X, Y)
4:     return true if and only if the coin comes up heads


The neighbor sampler chooses the next candidate URL for the random walker. It first computes all the queries that correspond to x (the current URL), then chooses a random query and submits it to the search engine. If the query is valid, it randomly chooses a URL from the result set of the query.

A valid query is one that neither overflows (when the result set's length is greater than the search engine's limit) nor underflows (when the result set is empty).



1: Function sampleNeighbor(SE, x)
2:     queriesP(x) := getIncidentQueriesP(x)
3:     while (true) do
4:         Q := query chosen uniformly from queriesP(x)
5:         submit Q to the search engine SE
6:         if (Q neither overflows nor underflows)
7:             break
8:     results(Q) := results returned from SE
9:     Y := document chosen uniformly at random from results(Q)
10:    return Y


The basic Random Walk sampler works by submitting a query to the search engine, getting the response URL, parsing the file from the given URL into phrases (of a defined length), and finally randomly choosing a phrase and submitting it as a new query. The process is repeated B times.

This algorithm uses both the MD and MH variants of the random walk; we now give short pseudo code for both of them:


The Metropolis-Hastings (MH) algorithm

degP(x) = |queriesP(x)| - the number of shingles in node x.

1: Function accept(X, Y)
2:     r_MH(X, Y) := min((π(Y) · degP(X)) / (π(X) · degP(Y)), 1)
3:     toss a coin whose heads probability is r_MH(X, Y)
4:     return true if and only if the coin comes up heads
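The MH acceptance ratio can be sketched numerically. This is an illustration with our own names: for a uniform target distribution π the π terms cancel and r_MH reduces to min(degP(X)/degP(Y), 1), so moves toward lower-degree nodes are always taken and moves toward higher-degree nodes only sometimes.

```java
import java.util.Random;

// Sketch of the MH accept step above. Illustrative names, not the
// project's code; pi values are passed in explicitly.
public class MhAcceptSketch {

    public static double rMH(double piX, double piY, int degPX, int degPY) {
        return Math.min((piY * degPX) / (piX * (double) degPY), 1.0);
    }

    public static boolean accept(double r, Random rnd) {
        return rnd.nextDouble() < r; // coin with heads probability r
    }

    public static void main(String[] args) {
        // Uniform pi: moving from a node with 20 queries to one with 5
        // is accepted with probability min(20/5, 1) = 1; the reverse
        // move only with probability 5/20 = 0.25.
        System.out.println(rMH(1.0, 1.0, 20, 5));  // 1.0
        System.out.println(rMH(1.0, 1.0, 5, 20));  // 0.25
    }
}
```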


The Maximum Degree (MD) algorithm

1: Function accept(P, C, X)
2:     r_MD(X) := p(X) / (C · π(X))
3:     toss a coin whose heads probability is r_MD(X)
4:     return true if and only if the coin comes up heads





















4. Design and System Architecture

In this section we provide the software design of the system.

Basic Software Description

In this part we'll present the basic flow of the software:

The MainRandomWalker class is the main class of the system; here we define the initial URLs we want to run the algorithm for, the number of threads, etc.

NeighborHandler and RandomWalkHandler are generic classes suitable for all types of random walkers; according to the type of walker we want to run, we use:

- WebWalkerNeighborHandler and WebWalkerHandler for the webwalker algorithm;
- SearchEngineWalkerNeighborHandler and SearchEngineWalkerHandler for the search engine based walker algorithm.

The NeighborHandler class is responsible for extracting the neighbor nodes, from which the MD/MH algorithm then chooses our next step. The two implementations vary in their way of extracting neighbors (URLs vs. shingles) and validating the nodes.

The RandomWalkHandler is a simple class that creates the appropriate StepInfo - a data-holding class, SearchEngineWalkerStepInfo / WebWalkerStepInfo, according to the type of algorithm we're running.

The main class then activates a RandomWalk class; this is the class responsible for the run of the algorithm itself. It is the same for all types of algorithms, as it is a generic class capable of running all types of random walkers. The class initiates a MonteCarloMethod class, a general class that represents a Monte Carlo method - MD/MH. We implement an MDMonteCarloMethod and an MHMonteCarloMethod, depending on the method we want to use. These classes differ in their way of choosing neighbors and of calculating the number of self-loops for the algorithms.

This class is also responsible for writing the data to the .xml file as StepInfo objects. This is done using the open source XStream project, which allows you to serialize/deserialize a Java object in XML format.



At the end, the gathered results are read using a ResultAnalyzer class that reads the xml files, performs the data analysis and statistical calculations, and writes the final information to .csv files.


The utility package class diagram:



Constants
  d : int
  r : int
  index_depth : int
  QUOTES_CHAR : char
  QUOTES_STRING : String
  timeout : long
  arrOfPhrases : int[]
  pharsesForSimilarity : int[]
  isPrintInOutUrls : boolean

FileDownloader
  +DownloadFile(String urlStr, String fileName) : HttpResult

UrlDownloader
  +UrlDownloader(int threadIndex, String url, long timeout)
  +downloadURL(String url) : String

HttpResult
  +HttpResult(String errorResponse, Map<String, List<String>> header)

OpenUrlThread
  +OpenUrlThread(String url)
  +run() : void

XmlUtils
  +XmlUtils(int threadIndex)
  +printXml(StepInfo stepInfo) : void
  +closeXml() : void

FileParser
  +FileParser(String seName, boolean cache, String inputFileName, String url, boolean extractLinks)
  +run() : String
  +getText() : String
  +getOutgoingLinks() : Map<String, String>

YahooSearchEngineQuery
  +YahooSearchEngineQuery(int threadIndex)
  +getQueryResults(Query query) : Res1

YahooWebServiceGet
  +YahooWebServiceGet(int r)
  +getInLinks(String queryURL) : List<String>

XStreamOut
  +XStreamOut()
  +XStreamOut(OutputStream stream, boolean autoFlush)
  +write(T obj) : void

XStreamIn
  +XStreamIn()
  +XStreamIn(Reader reader)
  +read() : T
  +close() : void




The random walker class diagram (1):

<<interface>> NeighborHandler
  +extractNeighbors(Node prevNode, Node currNode) : Neighbors
  +isValid(Edge edge) : boolean
  +calculateDegree(Node currentNode) : int

<<interface>> Edge
  +getSource() : Node
  +getTarget() : Node

Neighbors
  +Neighbors(List<String> in, List<String> out)
  +Neighbors(SearchEngineResult[] out)
  +getNeighbors() : List<Node>
  +getIn() : List<Node>
  +getOut() : List<Node>

<<interface>> Node
  +equals(Object obj) : boolean

UrlEdge
  +UrlEdge(UrlNode source, UrlNode target)
  +getSource() : UrlNode
  +getTarget() : UrlNode

UrlNode
  +UrlNode(String url)
  +getUrl() : String
  +equals(Object obj) : boolean

<<interface>> MonteCarloMethod
  +calculateNextMonteCarloStep(Node currentNode, Node previousNode) : MonteCarloStep

MonteCarloStep
  +MonteCarloStep(int numberOfSelfLoops, Edge nextExternalEdge, List<Node> incomingLinks, List<Node> outgoingLinks)
  +getNextExternalEdge() : Edge
  +getNumberOfSelfLoops() : int
  +getIncomingLinks() : List
  +getOutgoingLinks() : List

MHMonteCarloMethod
  +MHMonteCarloMethod(NeighborHandler walkerNeighborHandler)
  +calculateNextMonteCarloStep(Node currentNode, Node previousNode)

MDMonteCarloMethod
  +MDMonteCarloMethod(NeighborHandler walkerNeighborHandler)
  +calculateNextMonteCarloStep(Node currentNode, Node previousNode)



The random walker class diagram (2):

<<interface>> NeighborHandler
  +extractNeighbors(Node prevNode, Node currNode) : Neighbors
  +isValid(Edge edge) : boolean
  +calculateDegree(Node currentNode) : int

SearchEngineWalkerNeighborHandler
  +SearchEngineWalkerNeighborHandler(int threadIndex, Node initialNode, UrlDownloader aUrlDownloader)
  +extractNeighbors(Node prevNode, Node currNode) : Neighbors
  +isValid(Edge edge) : boolean
  +calculateDegree(Node currNode) : int
  +getShingles(Node currNode) : Collection<String>
  +getShinglesForSimilarity(Node currNode) : Collection<String>
  +getNumberOfOverflows() : int
  +getNumberOfUnderflows() : int
  +getQueryStr() : String
  +getInitialShinglesSet() : Collection<String>

WebWalkerNeighborHandler
  +WebWalkerNeighborHandler(int threadIndex, Node initialNode, UrlDownloader aUrlDownloader, int r)
  +extractNeighbors(Node prevNode, Node currNode) : Neighbors
  +isValid(Edge edge) : boolean
  +calculateDegree(Node currNode) : int
  +getShingles(Node currNode) : Collection<String>
  +getInitialShinglesSet() : Collection<String>

<<interface>> RandomWalkHandler
  +createNewStepInfo(NeighborHandler neighborHandler, MonteCarloStep monteCarloStep, int stepNumber) : StepInfo

WebWalkerHandler
  +createNewStepInfo(NeighborHandler neighborHandler, MonteCarloStep monteCarloStep, int stepNumber) : StepInfo

SearchEngineWalkerHandler
  +createNewStepInfo(NeighborHandler neighborHandler, MonteCarloStep monteCarloStep, int stepNumber) : StepInfo

<<abstract>> StepInfo
  +getUrl() : String
  +getMonteCarloStep() : MonteCarloStep
  +getDate() : Date
  +getStepNumber() : int
  +getSimilarity() : double

WebWalkerStepInfo
  +getSimilarity() : double

SearchEngineWalkerStepInfo
  +getSimilarity() : double
  +getNumberOfOverflows() : int
  +getNumberOfUnderflows() : int
  +getQuery() : String
  +getNumOfShingles() : int

MainRandomWalker
  +MainRandomWalker()
  +main(String args[]) : void

RandomWalk
  +walk(int numberOfSteps, Node initialNode, RandomWalkHandler randomWalkHandler, NeighborHandler neighborHandler, UrlDownloader urlDownloader, MonteCarloMethod monteCarloMethod) : void

MainThread
  +MainThread(int threadIndex, int numberOfSteps, String monteCalroMetod, String initialURL, String walkerType)
  +run() : void





Presented next is the ResultAnalyzer class diagram:

<<abstract>> ResultAnalyzer
  stepInfosMap : HashMap<Integer, List<StepInfo>>
  sampleResolution : int
  totalNumOfSteps : int
  +printSelfLoopsSimilarities(String initialUrl) : void
  +calculateUHS(String initialUrl) : void
  +createConvergenceFiles(String initialUrlName, double minY, double maxY, double resolutionY, double minX, double maxX, double resolutionX) : void
  +readXmlFilesFromDir(String directory) : HashMap<Integer, List<StepInfo>>

ResultAnalyzerRunner
  +main(String[] args) : void

SearchEngineResultAnalyzer
  +SearchEngineResultAnalyzer(String directory)
  +readXmlFilesFromDir(String directory) : HashMap<Integer, List<StepInfo>>

WebWalkerResultAnalyzer
  +WebWalkerResultAnalyzer(String directory)
  +readXmlFilesFromDir(String directory) : HashMap<Integer, List<StepInfo>>






















Presented next is a basic sequence diagram explaining the basic workflow of the system:

Participants: MainRandomWalker, RandomWalk, MonteCarloMethod, NeighborHandler, RandomWalkHandler, StepInfo
Messages: walk(), initialize(), calculateNextMonteCarloStep(), extractNeighbors(), isValid(), getMonteCarloStep(), createNewStepInfo(), StepInfo()


The class diagrams above describe the workflow of the system; we'll now describe each class/interface in detail.













We'll now discuss the software design in a more extended manner.

Utility package

This is a utility package that holds all of the auxiliary classes and functions that we use in the code.

- UrlHandler - this class holds all the functions that deal with the extraction and handling of URLs: getUrlDomain(String url), getAbsoluteUrl(String baseUrl, String relUrl), getURLHostName(String currUrl), etc.

- YahooWebServiceGet - this class holds the functionality of getting the incoming neighbors of a given URL. The class has a method getIncomingURLs that opens an HttpClient connection to the Yahoo Web Service application and executes an HttpGet method. The HttpGet method returns a list of the incoming links from the Yahoo search engine (all the links that the Yahoo search engine samples as having an outgoing link to the current URL); this is equivalent to writing a link:"URL" query on the Yahoo home page.

- YahooSearchEngineQuery - a class that implements the SearchEngineBasedWalker part of submitting a query to the Yahoo search engine and getting the URLs for the current query. The class uses a Yahoo API for connecting and submitting the query to the search engine.

- UrlDownloader - a class responsible for downloading given URLs to text files so we can then parse them and extract shingles / outgoing URLs, depending on the type of algorithm that we're running. The class implements a HashMap mapping between the URL and the file we've downloaded and saved on the hard drive, where the key is the URL and the value is a unique file name. We implemented this kind of caching system, instead of dealing with one file at a time, to save running time: the algorithms often go back to the same URLs, and the procedure of downloading a URL is very time consuming, so this solution enables us to download a URL only once and never download the same URL twice.
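The caching idea described above can be sketched as follows. The class name and file-naming scheme here are our own simplification; the real UrlDownloader also performs the HTTP download and writes the page contents to disk.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the UrlDownloader caching idea: map each URL to a unique
// file name so every URL is "downloaded" at most once.
public class UrlCacheSketch {

    private final Map<String, String> urlToFile = new HashMap<>();
    private int nextId = 0;
    private int downloads = 0; // counts actual (simulated) downloads

    public String fetch(String url) {
        String cached = urlToFile.get(url);
        if (cached != null) {
            return cached;               // cache hit: no second download
        }
        downloads++;                     // simulate the expensive download
        String fileName = "page_" + (nextId++) + ".txt";
        urlToFile.put(url, fileName);
        return fileName;
    }

    public int downloadCount() {
        return downloads;
    }

    public static void main(String[] args) {
        UrlCacheSketch cache = new UrlCacheSketch();
        cache.fetch("http://a.example");
        cache.fetch("http://b.example");
        cache.fetch("http://a.example"); // revisited: served from the map
        System.out.println(cache.downloadCount()); // 2
    }
}
```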



- XmlUtils - a class holding the XStream objects and the methods for writing to and reading from an xml file. public void printXml(StepInfo stepInfo) - a method for reading data from the StepInfo object and writing it to an xml file.

- Constants - a class holding all the constants we need for the run of the algorithm:

1. d - the maximum web degree (index), 1,000,000 in our case;
2. r - the number of incoming URLs taken from the Yahoo incoming links API (for the webwalker algorithm);
3. index_depth - the indexing cut-off threshold (maximum number of phrases), for the search engine based walker;
4. timeout - the timeout for trying to download a URL;
5. phraseLengths - a parameter telling the FileParser how many words to parse at a time from the text;
6. arrOfPhrases - a parameter for the search engine based walker, telling the parser how many shingles to parse at a time (3, 4, 5 words);
7. isPrintInOutUrls - a general parameter telling whether we want to print all the in/out URLs to the xml result file.


walker.graph package

As the algorithm transforms the walk on the web into an undirected walk on a graph with nodes, this package holds the implementation of all the general classes representing our web walk graph.

- Node interface - this is a general interface used in the code, representing a specific node in the web graph that we're going over. Any new representation of the graph we'll want to add in the future won't require changes in the code, as we use the general Node throughout. In our case we use the UrlNode representation in the code.

- UrlNode class - this is the class used in the URL representation of the web graph; the class holds only the URL variable.

- Edge interface - this is a general interface representing an edge in our web graph representation. The interface has two methods, getTargetNode() and getSourceNode(), which return the needed nodes.

- UrlEdge class - this is the specific implementation of the Edge interface.

- Neighbors class - this is an auxiliary class for holding information regarding a node's neighbors (for example, all the out/in links of a current URL); it has two separate constructors: public Neighbors(List<String> in, List<String> out) for the webwalker algorithm and public Neighbors(SearchEngineResult[] out) for the search based walker algorithm. The class also has auxiliary methods for converting the given neighbor nodes: a private List<Node> stringToNode(List<String> urls) method for the webwalker algorithm and a private List<Node> searchEngineResultToNode(SearchEngineResult[] results) method for the search engine based walker.
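The graph types above can be sketched in Java. The signatures follow the class diagram earlier (which uses getSource()/getTarget(), while the prose says getSourceNode()/getTargetNode(); we use the diagram's names); the method bodies are our own illustration, not the project's source.

```java
import java.util.Objects;

// Sketch of the walker.graph types: a Node interface with a URL-backed
// implementation, and an edge joining two UrlNodes.
public class GraphSketch {

    public interface Node {
        boolean equals(Object obj);
    }

    public static class UrlNode implements Node {
        private final String url; // the only state a UrlNode carries
        public UrlNode(String url) { this.url = url; }
        public String getUrl() { return url; }
        @Override public boolean equals(Object obj) {
            return obj instanceof UrlNode && ((UrlNode) obj).url.equals(url);
        }
        @Override public int hashCode() { return Objects.hash(url); }
    }

    public static class UrlEdge {
        private final UrlNode source, target;
        public UrlEdge(UrlNode source, UrlNode target) {
            this.source = source;
            this.target = target;
        }
        public UrlNode getSource() { return source; }
        public UrlNode getTarget() { return target; }
    }

    public static void main(String[] args) {
        UrlNode a = new UrlNode("http://a.example");
        UrlEdge e = new UrlEdge(a, new UrlNode("http://b.example"));
        // equals is URL-based, so revisiting a page yields an equal node,
        // which is what lets the walk recognise already-visited URLs.
        System.out.println(a.equals(new UrlNode("http://a.example"))); // true
        System.out.println(e.getTarget().getUrl());
    }
}
```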


mcmethod package

This package holds the implementation of the two Monte Carlo methods we're using, MD (Maximum Degree) and MH (Metropolis-Hastings).

- MonteCarloStep class - this is a general auxiliary class that holds the data needed for the current step of the algorithm. The class holds two variables, numberOfSelfLoops and nextExternalEdge, which tell us the next URL we're going to.

- MonteCarloMethod interface - this is a general interface that has only one method, calculateNextMonteCarloStep(); this method is responsible for calculating the next step of the algorithm. The method uses the NeighborHandler for selecting the next algorithm step (selecting a random Node according to one of the MD/MH methods); it returns a MonteCarloStep, which is a representation of a single step in the algorithm. Each class that implements this interface has to implement this method according to its specific algorithm.

- MDMonteCarloMethod class - this class is the MD random walker implementation of the previously described interface. The class implements the calculation of the next step according to the Maximum Degree algorithm. public MonteCarloStep calculateNextMonteCarloStep() calculates the selection of the next step according to the MD method; private int calculateNumberOfSelfLoops(int N, Random rnd) calculates the number of self-loops according to the MD algorithm.

- MHMonteCarloMethod class - this class does the same as the MDMonteCarloMethod class, but using the Metropolis-Hastings algorithm.


randomwalk package

This is the main package, holding the main interfaces and the main class that runs the software.

- NeighborHandler interface - this is a generic interface that deals with the calculation and extraction of the URL neighbors at each step. It holds the following set of methods:

1. initialize() - a method that initializes and instantiates all the parameters we need for working with the node's neighbors.
2. extractNeighbors() - a method that gets all the neighbors of a current URL. The method receives two parameters: the current and previous Node (the previous node represents the previous step of the algorithm, the place (URL) where we were). The method returns a Neighbors object, which is a representation of all the neighbors of a current node (incoming and outgoing); this class is discussed later on.
3. isValid() - a method that checks whether the current node we're dealing with is a valid one; it checks various conditions that should be met by the algorithms, and also that the selected node is not a dead one (i.e., we can download it).
4. calculateDegree() - a method calculating the degree of the current node.

These are general methods that are required for the running of every random walk; each new algorithm that someone will want to add in the future will require only a different implementation of the current set of methods.


- RandomWalkHandler interface - this is a general interface for handling the different StepInfo classes we'll use during the algorithms; it has one method: public StepInfo createNewStepInfo(NeighborHandler neighborHandler, MonteCarloStep monteCarloStep, int stepNumber).

- RandomWalk class - this is the class that actually runs the whole algorithm; it has only one method: public void walk(int numberOfSteps, Node initialNode, RandomWalkHandler randomWalkHandler, NeighborHandler neighborHandler, UrlDownloader urlDownloader, String method, String walkerType). This class is responsible for initiating all the parameters for the run of the algorithm. It creates MDMonteCarloMethod / MHMonteCarloMethod objects depending on the type of Monte Carlo method we want to run, downloads URLs using the UrlDownloader, uses the appropriate NeighborHandler for selecting the next step, creates a new StepInfo using the appropriate RandomWalkHandler, and writes the data to an .xml file using the XmlUtils methods.

- MainRandomWalker class - this is the main class that runs the algorithm. The class gets its parameters from the command line (args[]) and runs the selected algorithm with the given parameters. The class is designed to be generic, so we can run any sort of random walk algorithm; all we need to do is implement the interfaces and classes described above with the correct implementation.


walker.se package

This package holds the implementation of the Search Engine based walker.

- SearchEngineNeighborHandler class - this is a specific implementation that deals with the neighbors of the search engine based random walk sampler algorithms from the previous project. The extractNeighbors() method is used to get neighbor URLs by submitting shingles (of different lengths) to the Yahoo search engine and getting the response URLs. The isValid method checks the validity of the currently submitted URL according to the search engine based random walk sampler algorithms.

- SearchEngineWalkerHandler class - implements the RandomWalkHandler interface with the specific public StepInfo createNewStepInfo(NeighborHandler neighborHandler, MonteCarloStep monteCarloStep, int stepNumber) method. It creates a new SearchEngineWalkerStepInfo class, a specific implementation of StepInfo, and initiates it with the correct parameters for the search engine based walker algorithm.


walker.ww package

This package holds the implementation of the Web Walker algorithm.

WebWalkerNeighborHandler class - this is a specific implementation for the WebWalker algorithm. The extractNeighbors() method gets the outgoing URLs (by parsing the HTML page) and the incoming URLs (by using the previously described YahooWebServiceGet class). The class also implements the specific isValid() method according to the algorithm.

WebWalkerHandler class - implements the RandomWalkerHandler interface with the specific method:

public StepInfo createNewStepInfo(NeighborHandler neighborHandler, MonteCarloStep monteCarloStep, int stepNumber)

This method creates a new WebWalkerStepInfo object, a specific implementation of StepInfo, and initializes it with the correct parameters for the web walker algorithm.
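The two NeighborHandler implementations above share a common contract. A minimal sketch of what that contract and a toy implementation might look like is shown below; the method names (extractNeighbors, isValid) follow the report's description, but the exact signatures and types in the real project may differ:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the NeighborHandler contract described above.
interface NeighborHandler {
    // Returns the candidate neighbor URLs of the current node.
    List<String> extractNeighbors(String url);

    // Checks whether a candidate URL is valid for the walk.
    boolean isValid(String url);
}

// A toy implementation, for illustration only: it "extracts" a fixed
// neighbor list and rejects non-HTTP URLs.
class FixedNeighborHandler implements NeighborHandler {
    @Override
    public List<String> extractNeighbors(String url) {
        return Arrays.asList(url + "/a", url + "/b");
    }

    @Override
    public boolean isValid(String url) {
        return url.startsWith("http://") || url.startsWith("https://");
    }
}
```

A real implementation such as WebWalkerNeighborHandler would replace the fixed list with HTML parsing (outgoing links) and Yahoo API calls (incoming links).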
















result.data package

This package holds all the data structures we've used during the run of the algorithm.

StepInfo class - this is the general data structure that holds all the information gathered from the simulations, like URL, date and other parameters we think should be saved during the run for the algorithm analysis at the end. The parameters:

1. stepNumber - the current step number during the run of the algorithm.
2. monteCarloStep - a MonteCarloStep class (described earlier).
3. date - the time when the entry was written.
4. url - the current URL we're dealing with.

The class holds only general information about the walk; any specific parameters needed for future different implementations of the algorithm will be implemented in a different class that will inherit from this class.




WebWalkerStepInfo class - this class inherits from the StepInfo class and adds additional parameters needed for the WebWalker algorithm. The parameters of the class are all the parameters inherited from the StepInfo class and, in addition:

similarity - a parameter telling how similar the current URL is to the initial one (discussed in the Analysis part).
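The inheritance scheme described above can be sketched as follows. This is an illustrative reconstruction from the parameter lists, not the project's actual source; MonteCarloStep is stubbed since its full definition appears elsewhere in the report:

```java
import java.util.Date;

// Stub for the MonteCarloStep class described earlier in the report.
class MonteCarloStep { }

// Sketch of the general StepInfo data structure described above.
class StepInfo {
    private final int stepNumber;                // current step number of the run
    private final MonteCarloStep monteCarloStep; // Monte Carlo step data
    private final Date date;                     // time the entry was written
    private final String url;                    // URL currently being visited

    StepInfo(int stepNumber, MonteCarloStep monteCarloStep, Date date, String url) {
        this.stepNumber = stepNumber;
        this.monteCarloStep = monteCarloStep;
        this.date = date;
        this.url = url;
    }

    int getStepNumber() { return stepNumber; }
    String getUrl() { return url; }
}

// Walk-specific subclasses add their own fields, e.g. the similarity
// parameter used by the WebWalker.
class WebWalkerStepInfo extends StepInfo {
    private final double similarity; // similarity to the initial URL, in %

    WebWalkerStepInfo(int stepNumber, MonteCarloStep mcs, Date date,
                      String url, double similarity) {
        super(stepNumber, mcs, date, url);
        this.similarity = similarity;
    }

    double getSimilarity() { return similarity; }
}
```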




SearchEngineWalkerStepInfo class - this class inherits from the StepInfo class and adds additional parameters needed for the search engine based random walk samplers' algorithms from the previous project. The parameters of the class are all the parameters inherited from the StepInfo class and, in addition:

1. similarity - a parameter telling how similar the current URL is to the initial one (discussed in the Analysis part).
2. numberOfOverflows - the number of URLs returned as overflowed by the Yahoo Search API until we found a valid one.
3. numberOfUnderflows - the number of URLs returned as underflowed by the Yahoo Search API until we found a valid one.
4. numOfShingles - the number of shingles in the current URL.
5. query - the current query submitted to the Yahoo search engine.

result.analyzer package

This package holds all the classes implementing the result analyzer we've used for analyzing the results gathered from the simulations.

ResultAnalyzer class - this is the class that does the analysis of the results gathered during the run of the algorithm. The class reads previously saved xml files and extracts the parameters needed for the analysis of the algorithm, like URL domains, similarity, host changes, etc.

private double std(double avg, double[] similarityList) - calculates the standard deviation of the similarity from the average.

protected void printSelfLoopsSimilarities(String initialUrl) - prints the similarity with the self loops.

public void calculateConvergence(double x, double y, FileWriter out1, FileWriter out2, String initialUrlName) - calculates the convergence parameter for the x, y values as defined in the Result Analyzing part of the document.

private void createUHSArrays(Iterator filesIterator, double[][] sampledUniqueHostChanges, double[][] realStepUniqueHostChanges) - creates the unique host changes arrays (defined later on), telling us what percentage of hosts the algorithm passed through during the walk.

The class has several more methods, mainly utility ones for writing data to .csv files, initializing arrays, etc.
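The std helper above takes a precomputed average and the list of similarity values. A plausible sketch of the computation is shown below; whether the real project divides by N or N-1 is not stated in the report, so the population form (divide by N) is an assumption:

```java
// Hypothetical sketch of the ResultAnalyzer.std helper described above:
// standard deviation of the similarity values around a precomputed average.
class StdSketch {
    static double std(double avg, double[] similarityList) {
        double sumSq = 0.0;
        for (double s : similarityList) {
            double d = s - avg;
            sumSq += d * d; // accumulate squared deviations from the average
        }
        return Math.sqrt(sumSq / similarityList.length);
    }
}
```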




ResultAnalyzerRunner class - this is the main class for running the ResultAnalyzer. The class initializes the needed parameters for the run: number of steps, resolution, and the directory the xml files are saved in. The class creates WebWalkerResultAnalyzer / SearchEngineResultAnalyzer classes depending on the type of algorithm we're analyzing. This class also defines which methods from the ResultAnalyzer we want to run (we won't need to run all the calculations each time).




WebWalkerResultAnalyzer class - this class extends the ResultAnalyzer class and implements only one method:

public static HashMap<Integer, List<StepInfo>> readXmlFilesFromDir(String directory)

This method reads the correct objects into the HashMap (WebWalkerStepInfo vs. SearchEngineWalkerStepInfo classes). This class is a specific implementation of the general ResultAnalyzer class, because here we have the WebWalkerStepInfo object, whose parameters differ from the SearchEngineWalkerStepInfo class.




SearchEngineResultAnalyzer class - this is the same implementation as the WebWalkerResultAnalyzer class, except it reads the SearchEngineWalkerStepInfo objects rather than the WebWalkerStepInfo ones.



A short guide for running the Project

The project can be run from the command line with the following parameters:

args[0] - the Monte Carlo method we want to run:
    MD - Maximum Degree Monte Carlo Method.
    MH - Metropolis Hastings Monte Carlo Method.

args[1] - the type of random walker we want to run, in our case:
    ss - Search Engine based Random Walk.
    ww - Web Walker Algorithm.

args[2] - the number of threads we wish to run simultaneously.

args[3] - the number of steps we want to run.

args[4] - the initial URL we want to run the algorithm for.

A sample command line is:

java -jar RandomWalker.jar MD 1 5 50000000 http://www.cnn.com/WORLD
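The argument layout above can be sketched as a small parser. This is an illustrative reconstruction; the field names and validation below are assumptions, not the actual MainRandoWalker code:

```java
// Hypothetical sketch of parsing the command-line layout described above.
class WalkerArgs {
    final String method;     // args[0]: "MD" or "MH"
    final String walkerType; // args[1]: "ss" or "ww"
    final int numThreads;    // args[2]: threads to run simultaneously
    final long numSteps;     // args[3]: number of steps
    final String initialUrl; // args[4]: initial URL of the walk

    WalkerArgs(String[] args) {
        if (args.length != 5) {
            throw new IllegalArgumentException(
                "usage: <MD|MH> <ss|ww> <threads> <steps> <initialUrl>");
        }
        method = args[0];
        walkerType = args[1];
        numThreads = Integer.parseInt(args[2]);
        numSteps = Long.parseLong(args[3]);
        initialUrl = args[4];
    }
}
```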











Presented below is a small example of an xml file generated by a run of the webwalker algorithm (it's a single WebWalkerStepInfo object):

<randomwalker.WebWalkerStepInfo>
  <similarity>100.0</similarity>
  <stepNumber>0</stepNumber>
  <monteCarloStep>
    <numberOfSelfLoops>4566</numberOfSelfLoops>
    <nextExternalEdge class="randomwalker.UrlEdge">
      <source>
        <url>http://www.cnn.com/WORLD/</url>
      </source>
      <target>
        <url>http://rss.cnn.com/~r/rss/cnn_topstories/~3/192353504/index.html</url>
      </target>
    </nextExternalEdge>
    <incomingLinksSize>101</incomingLinksSize>
    <outgoingLinksSize>118</outgoingLinksSize>
  </monteCarloStep>
  <date>2008-04-07 19:04:48.921 IDT</date>
  <url>http://www.cnn.com/WORLD/</url>
</randomwalker.WebWalkerStepInfo>

















5. Results and Analysis

General

In this section, we analyze the webwalker results according to different criteria. Eventually, we compare the results of the webwalker random walk to the Search Engine based random walker.

In order to examine the webwalker's performance we ran the webwalker random walk with 10 different initial URLs. For each initial URL we collected 5 samples of 50,000,000 steps in length.

The 10 different URLs were chosen to represent different types of common web pages on the World Wide Web. The URLs are:

1. News - CNN - www.cnn.com/world
2. Commercial - EBAY - http://art.ebay.com
3. Government - http://law.ato.gov.au/atolaw/view.htm?locid='PAC/19970038/775-55(2)'#775-55(2)
4. Academic - The Technion - http://www.admin.technion.ac.il/President/default.htm
5. Blog - http://sethgodin.typepad.com/
6. Search engine - Yahoo - http://dir.yahoo.com/
7. HealthCare - http://www.bbc.co.uk/health/
8. Entertainment - http://hollywood.com/
9. Search engine - Google - http://www.google.com/intl/en/about.html
10. Amazon - http://www.amazon.com/gp/site-directory/ref=_gw_

Criteria:

In order to analyze the results, different analysis criteria were chosen.

Similarity:

The similarity between two URLs is the number of unique words that appear in both pages divided by the total number of unique words in both pages. For each step of the algorithm, the similarity between the current step and the initial URL is measured.

The average similarity of the 5 samples of each initial URL was calculated for each algorithm step, along with the standard deviation, in order to present a graphic display of the similarity changes throughout the random walk.
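The similarity measure defined above is the Jaccard index of the two pages' word sets. A sketch of the computation is shown below; whitespace tokenization and lower-casing are assumptions, since the report does not specify how pages are split into words:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the similarity criterion defined above: unique words common
// to both pages divided by unique words in either page, scaled to a
// percentage as in the report's graphs.
class SimilaritySketch {
    static double similarityPercent(String pageA, String pageB) {
        Set<String> a = new HashSet<>(Arrays.asList(pageA.toLowerCase().split("\\s+")));
        Set<String> b = new HashSet<>(Arrays.asList(pageB.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b); // words that appear in both pages
        Set<String> union = new HashSet<>(a);
        union.addAll(b);           // all unique words in both pages
        return union.isEmpty() ? 0.0 : 100.0 * intersection.size() / union.size();
    }
}
```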


It is expected that a random walk of sufficient length will cause the similarity parameter to "converge" to a certain value.

convergence(X,Y) - We say that convergence(X,Y) begins at step i if and only if:

1. The fraction of steps from step i onward (i,...,N) whose similarity deviates from a value S by more than X (0<X<1) is no more than Y, where S is the average similarity over steps i,...,N.
2. i is the lowest step number that achieves (1).

N - total number of steps.

The convergence step of a given URL is the maximal convergence step observed among the runs starting from the given initial URL.

The convergence steps were measured for each initial URL in two different ways:

1. Algorithm step of convergence - the algorithm step (including self loops) at which convergence begins.
2. Real step convergence - the real step (not including self loops) at which convergence begins.

The values X=0.08 and Y=0.05 were chosen in this analysis.
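Under the definition above (with our reading that a step "deviates" when its similarity differs from the tail average S by more than X), the convergence step can be sketched as follows. This is an illustrative reconstruction; the project's calculateConvergence method may differ in details, and similarities are assumed here to be on a 0-1 scale:

```java
// Sketch of the convergence(X,Y) criterion defined above: find the lowest
// step i such that at most a fraction Y of the steps i..N-1 deviate from
// the tail average S by more than X.
class ConvergenceSketch {
    static int convergenceStep(double[] similarity, double x, double y) {
        int n = similarity.length;
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = i; j < n; j++) sum += similarity[j];
            double s = sum / (n - i); // tail average S over steps i..N-1
            int deviating = 0;
            for (int j = i; j < n; j++) {
                if (Math.abs(similarity[j] - s) > x) deviating++;
            }
            if ((double) deviating / (n - i) <= y) return i;
        }
        return -1; // no convergence observed within the run
    }
}
```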

Unique Hosts Visited % (UHV):

Defined as the number of unique hosts visited by the random walk divided by the number of real steps taken.

It is expected that URLs that hold a large number of outgoing links to the same host will be harder to "escape" from. Generally, a high percentage of unique host changes indicates that the algorithm visited many different hosts and thus provides better results.

Measurements of UHV% were taken for each initial URL at each algorithm step and real step. The average UHV% of all the samples of each initial URL was calculated, as well as the standard deviation.
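The UHV% measure can be sketched as follows. Host extraction via java.net.URI is an assumption for illustration; the project's createUHSArrays method works over the recorded step data rather than raw URL lists:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the UHV% criterion defined above: the number of unique hosts
// seen along the real steps of the walk, divided by the number of real
// steps, as a percentage.
class UhvSketch {
    static double uhvPercent(List<String> realStepUrls) {
        Set<String> hosts = new HashSet<>();
        for (String url : realStepUrls) {
            try {
                hosts.add(new URI(url).getHost());
            } catch (URISyntaxException e) {
                // malformed URL: ignore for host counting
            }
        }
        return realStepUrls.isEmpty()
                ? 0.0
                : 100.0 * hosts.size() / realStepUrls.size();
    }
}
```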

Final Similarity: defined as the average similarity over the last 10% of the algorithm steps.

For all initial URLs, convergence is achieved by the last 10% of the steps (see the convergence analysis above). The final similarity average was collected for all initial URLs, as well as the standard deviation.

This criterion is used to examine an expected phenomenon, in which the random walk reaches a group of URLs that are highly connected to each other and thus hard to "escape" from. A high standard deviation of the final similarity indicates a higher influence of this phenomenon and thus lower algorithm performance.




The experiments:

Our experiment included 2 main parts.

Part 1 - Webwalker random walk: In this part we collected 5 samples of 50,000,000 steps for each of the 10 initial URLs. Parameters used:

R=100 (max number of incoming links).
d=1,000,000 (maximal web index).

Part 2 - Search Engine based random walk: In this part we collected 5 samples for each method (MD or MH) for 3 initial URLs (TECHNION, CNN and AMAZON). For MH the number of steps used was 400, and for MD the number of steps was 10,000. Parameters used:

Index_depth=10,000 (the maximal number of shingles of a page).

Results:

Section 1 - average similarity

In this section the average similarity results are presented for 2 different initial URLs. The URLs chosen are the ones that produced the lowest and highest convergence steps (TECHNION and AMAZON respectively).

Graph description

The graphs present the average similarity at each algorithm step, sampled at a 100,000-step resolution. A second sequence (marked in pink) presents the similarity results plus the standard deviation of the 5 runs. This is done in order to show graphic results of the measurement accuracy.

The Graphs:




[Figure: "TECHNION - similarity with self loops" - Similarity (%) vs. step number * 100K, showing the average similarity and similarity+sdev series.]

[Figure: "Amazon - average similarity with self loops" - Similarity (%) vs. step number * 100K, showing the average similarity and similarity+sdev series.]


Analysis:

It is easy to deduce from the graphs that running the webwalker with TECHNION as the initial URL produces much faster similarity convergence.

It can also be seen that while the run is not in the convergence phase, the standard deviation is significantly higher than in the convergence phase. This happens because even a single non-converged run can increase the standard error significantly.


Section 2: Convergence

In this section the real and algorithm convergence steps of each initial URL are presented. The convergence step is measured according to the convergence definition above, with X=0.08 and Y=0.05. In the case of real step convergence, some runs may take more steps than others; however, since here we wish to estimate the number of queries / URLs opened, this fact doesn't affect our analysis.

Graph description:

The graphs are column graphs of the convergence step for each initial URL.

The Graphs:


[Figure: "Algorithm Step Convergence" - column graph of the algorithm convergence step for each initial URL (AMZON, BLOG, EBAY, HOLLYWOOD, CNN, YAHOO, GOOGLE, GOV, TECHNION, HEALTH).]

[Figure: "Real Step Convergence" - column graph of the real convergence step for each initial URL.]

Analysis:

We observe that URLs like EBAY, AMZON, BLOG, CNN, YAHOO and HOLLYWOOD require more time to converge, while URLs like TECHNION, GOOGLE, GOV, and HEALTH converge after only a few steps.

EBAY, AMAZON, CNN and HOLLYWOOD are commercial/news URLs that include many hyperlinks to similar pages under the same hosts. This means that "escaping" from the initial host can take a long time. These are also very popular pages, with many unrelated URLs linking to them, which means the probability of winding up in a similar page during the run increases. These facts explain why those URLs produce a high convergence step.

BLOG also shows a high convergence step. The reason is that it holds many links to other posts of the same blog, which makes it harder to "escape" from the initial URL.

TECHNION, GOV and HEALTH are relatively independent URLs that do not hold many hyperlinks to URLs of the same host. These pages are not as popular as the previous URLs discussed, and they hold text of a very specific nature. All of this can explain why they produce low convergence values.

The search engines, YAHOO and GOOGLE, are indeed popular pages; however, as their nature suggests, they hold many hyperlinks to various URLs across the web. Thus, it is very easy to "escape" from the initial URL, and this explains why they produce low convergence values.

Finally, we observe that the URLs TECHNION, GOOGLE, GOV and HEALTH produce convergence steps of 1. This indicates that in terms of similarity convergence they produce optimal results. The fact that the results are so extreme might suggest a weakness of the similarity convergence criterion in producing exact convergence values at high resolution.


Section 3: Unique Hosts Visited

This section presents the UHV results for two different initial URLs. As in section 1, TECHNION and AMAZON were chosen, as they produced the lowest and highest convergence values respectively.

Graph description:

The graphs show the average UHV% at each algorithm/real step, as well as the standard deviation. The algorithm steps were sampled at a 100,000-step resolution.

The Graphs:

[Figure: "Technion - sampled unique host changes" - UHV (%) vs. step * 100K, showing the average UHV and UHV+sdev series.]

[Figure: "AMAZON - sampled unique hosts visited" - UHV (%) vs. step * 100K, showing the average UHV and UHV+sdev series.]

[Figure: "TECHNION - average unique host changes for real steps" - UHV (%) vs. real step, showing the average UHV and UHV+sdev series.]


Analysis:

We observe that in the TECHNION graphs, the average UHV% begins at a high value and quickly converges to a value around 20%. AMAZON, on the other hand, starts at low UHV% values and later converges to around 20%. The AMAZON UHV convergence is much slower than the TECHNION's.

Both URLs eventually converge in similarity terms, and this explains why they eventually reach roughly the same UHV value (20%). The fact that AMAZON produces a higher similarity convergence step explains why it also produces a higher UHV convergence step.

This happens because of the initial URL's nature. AMAZON has many hyperlinks to pages of the same host, while the TECHNION has very few of those. This is why the UHV% of the TECHNION begins high and later decreases, while the AMAZON UHV% begins low and later increases.




Section 4 - Final Similarity

This section examines the standard deviation of the average final similarity of each URL. A higher standard deviation value indicates "weaker" results and a higher effect of the highly connected group of URLs phenomenon described above.

Graph description:

This column graph presents the standard deviation of the average final similarity of each initial URL.

The Graphs:

[Figure: "Sdev of the average similarity of the last 10% of the steps" - column graph of the sdev (%) for each initial URL (AMZON, BLOG, CNN, EBAY, GOOGLE, GOV, TECHNION, HEALTH, YAHOO, HOLLYWOOD).]

Analysis:

We observe that none of the initial URLs produces a high standard deviation of the final similarity, which indicates a weak effect of the highly connected URLs phenomenon.


Section 5 - Comparing the webwalker with the search engine based random walker

This section compares the webwalker results to the results of the search engine based random walkers. Recall that we applied the MD method to the webwalker random walk, and for the search engine based walker we applied both the MD and MH methods.

In this section we compare the convergence steps of 3 chosen URLs (TECHNION, CNN and AMAZON) in an attempt to determine which algorithm produces better results. In order to do this, we use real step convergence as the objective criterion.

Graph description:

This is a multiple column graph that presents, for each initial URL, the real convergence step for 3 different algorithms:

1) The webwalker with MD applied (as presented earlier).
2) The search engine based walker with MD applied.
3) The search engine based walker with MH applied.

The Graphs:


[Figure: "Real step convergence of different random walks" - column graph of the convergence step for AMAZON, CNN and TECHNION under the WebWalker MD, Search Engine MD, and Search Engine MH algorithms.]


Analysis:

We observe that for AMAZON and CNN, the search engine based walker produces far better results. As explained before, since these URLs contain many hyperlinks to the same host, it is difficult to "escape" from the initial URL. The search engine based walkers avoid this problem by extracting neighbors according to shingles rather than hyperlinks.

The TECHNION initial URL is of a different nature than the other two, and indeed the webwalker produces better results for it. The TECHNION initial URL has a relatively small number of shingles, and therefore the MH method in the search engine based walker finds it hard to escape from the initial URL.

When comparing the two methods (MH and MD), we observe that MD produces better results for all initial URLs. This is explained by the nature of the two methods. In general, MD is a better method, since calculating the number of self loops is done without having to open a new URL, while for MH every self loop requires opening a new URL. We observe that for initial URLs that have a small number of shingles the gap between the two methods is large, while for URLs with a high number of shingles the gap is small.



























6. Conclusions

4 different criteria were analyzed with regard to the webwalker random walk.

It was shown that, for every URL, the average similarity converges. For some the convergence occurs very fast; for others it takes longer. The nature of the initial URL has a significant effect on the random walk performance. URLs that hold many hyperlinks to pages of the same host take longer to converge than more "isolated" URLs. This is due to fewer host changes during the run.

When comparing the webwalker and the search engine based walkers, we conclude that the initial URL has a significant influence on the algorithm's performance. For initial URLs that hold many hyperlinks to the same hosts, as many commercial, news and blog URLs do, the search engine based walker produces far better results. For more "isolated" URLs we observe slightly better performance for the webwalker based algorithms.






















7. References

[1] Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz. Approximating Aggregate Queries about Web Pages via Random Walks.
http://www.ee.technion.ac.il/people/zivby/papers/webwalker/webwalker.ps
http://www.ee.technion.ac.il/people/zivby/papers/webwalker/webwalker.ppt

[2] Ziv Bar-Yossef, Maxim Gurevich. Random Sampling from a Search Engine's Corpus.
http://www.ee.technion.ac.il/people/zivby/papers/se/se.techreport.pdf

[3] Yahoo API: http://developer.yahoo.com/search/

[4] XStream project tutorial: http://xstream.codehaus.org/tutorial.html

[5] Levin Boris, Laserson Itamar. Automatic Evaluation of Search Engines. Instructor: Gurevich Maxim.
http://softlab-pro-web.technion.ac.il/Projects/2006Spring/Automatic%20Evaluation%20of%20Search%20Engines/final%20report%20remarks.doc