
PageSim: A Novel Link-based Measure of Web Page Similarity

LIN Zhenjiang, 28 April 2006

zjlin@cse.cuhk.edu.hk

http://www.cse.cuhk.edu.hk/~zjlin

Outline

1. Background
2. Motivation
3. Existing approaches
4. PageSim: a new approach
5. Demonstrations
6. Conclusion and future work

1. Background I
Mining the World-Wide Web I

- Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996).
- Web mining research integrates research from several research communities (Kosala and Blockeel, July 2000), such as:
  - Database (DB)
  - Information retrieval (IR)
  - The sub-areas of machine learning (ML)
  - Natural language processing (NLP)

1. Background II
Mining the World-Wide Web II

- The WWW is a huge, widely distributed, global information source for:
  - Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  - Hyperlink information
  - Access and usage information
  - Web site contents and organization

1. Background III
Mining the World-Wide Web III

- Growing and changing very rapidly
- Broad diversity of user communities
- Only a small portion of the information on the Web is truly relevant or useful to Web users
  - How to find high-quality Web pages on a specified topic?
- The WWW provides rich sources for data mining

1. Background IV
Challenges on the Web

- Finding relevant information
- Creating knowledge from the information available
- Personalization of the information
- Learning about customers / individual users




1. Background V
Web Mining Taxonomy

- Web Content Mining: extract/mine useful information or knowledge from web page contents, including text, image, audio, video, and metadata.
- Web Structure Mining: discover useful knowledge from the structure of hyperlinks.
- Web Usage Mining: discover user access patterns from Web usage logs.

1. Background VI
Web Structure Mining I

- Hyperlinks can be used to infer the notion of authority
  - The Web consists not only of pages, but also of hyperlinks pointing from one page to another
  - These hyperlinks contain an enormous amount of latent human annotation
  - A hyperlink pointing to another Web page can be considered the author's endorsement of that page.

1. Background VII
Web Structure Mining II

- Web page categorization (Chakrabarti et al., 1998)
- Discovering micro-communities on the Web
  - Example: Clever system (Chakrabarti et al., 1999), Google (Brin and Page, 1998)
- Schema discovery in semi-structured environments (identifying typical structuring information)

2. Motivation I
Finding related or similar web pages I

- Web search engines

2. Motivation II
Finding related or similar web pages II

- Web document classification

3. Existing approaches I

- Text-based
  - Classic IR, Jaccard’s coefficient, Adamic/Adar
- Pure link-based
  - Single-step: cocitation, common neighbor, …
  - Multi-step:
    - Companion (Dean, Henzinger, 1998)
    - SimRank (Jeh, Widom, 2002)
- Hybrid
  - Anchor text based (Haveliwala et al., 2002)

3. Existing approaches II

- Notations
  - Sim(a, b): similarity score of web pages a and b.
  - I(a): in-link neighbors of web page a.
  - O(a): out-link neighbors of web page a.
- Common neighbor method
  - Sim(a, b) = |O(a) ∩ O(b)| = |{c, d}| = 2
- Cocitation method
  - Sim(a, b) = |I(a) ∩ I(b)| = |{c, d}| = 2
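
The two single-step measures above can be sketched in a few lines of Python. The slide's figure is not available here, so the edge list below is a hypothetical graph chosen so that both intersections equal {c, d}; the function names are illustrative only.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Return (I, O): in-link and out-link neighbor sets for each page."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    return in_nbrs, out_nbrs

def common_neighbor(out_nbrs, a, b):
    """Sim(a, b) = |O(a) ∩ O(b)|."""
    return len(out_nbrs[a] & out_nbrs[b])

def cocitation(in_nbrs, a, b):
    """Sim(a, b) = |I(a) ∩ I(b)|."""
    return len(in_nbrs[a] & in_nbrs[b])

# Hypothetical graph (not the slide's figure): a and b both link to c and d,
# and c and d both link back to a and b.
edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "a"), ("c", "b"), ("d", "a"), ("d", "b")]
in_nbrs, out_nbrs = build_neighbors(edges)

print(common_neighbor(out_nbrs, "a", "b"))  # O(a) ∩ O(b) = {c, d}
print(cocitation(in_nbrs, "a", "b"))        # I(a) ∩ I(b) = {c, d}
```

Both measures only look one link away, which is why they are listed as single-step methods.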

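3. Existing approaches III

- SimRank
  - “Two pages are similar if they are referenced (cited, or linked to) by similar pages.”
  - (1) Sim(u, u) = 1; (2) Sim(u, v) = 0 if |I(u)| |I(v)| = 0.
  - Recursive definition (Jeh, Widom, 2002):
    Sim(u, v) = C / (|I(u)| |I(v)|) × Σ_{i=1..|I(u)|} Σ_{j=1..|I(v)|} Sim(I_i(u), I_j(v)),
    where I_i(u) is the i-th in-link neighbor of u.
  - C is a constant between 0 and 1.
  - The iteration starts with Sim(u, u) = 1, and Sim(u, v) = 0 if u ≠ v.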
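
A minimal sketch of the SimRank iteration described above, assuming a small graph given as an edge list. The value C = 0.8 and the fixed iteration count are assumptions for illustration; the slide only states that C lies between 0 and 1.

```python
from itertools import product

def simrank(nodes, edges, C=0.8, iterations=10):
    """Naive SimRank iteration over all page pairs (illustrative only)."""
    in_nbrs = {v: [] for v in nodes}
    for src, dst in edges:
        in_nbrs[dst].append(src)

    # Start: Sim(u, u) = 1 and Sim(u, v) = 0 for u != v.
    sim = {(u, v): 1.0 if u == v else 0.0
           for u, v in product(nodes, repeat=2)}

    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                new[(u, v)] = 1.0
            elif not in_nbrs[u] or not in_nbrs[v]:
                new[(u, v)] = 0.0  # rule (2): a page has no in-links
            else:
                total = sum(sim[(x, y)]
                            for x in in_nbrs[u] for y in in_nbrs[v])
                new[(u, v)] = C * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        sim = new
    return sim

# Page a links to both b and c, so b and c share their only in-link neighbor.
fork = simrank(["a", "b", "c"], [("a", "b"), ("a", "c")])
print(fork[("b", "c")])   # C * Sim(a, a) = 0.8

# A single chain a -> b -> c -> d: no two distinct pages share similar citers.
chain = simrank(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
print(chain[("a", "b")], chain[("b", "c")])
```

On the chain, every off-diagonal score stays at 0, because no two distinct pages are ever cited by a common or similar page.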

4. PageSim: a new approach I

- Two considerations
  - On the Web, not all links are equally important.
    - Overlooked by: common neighbor, cocitation
  - A similarity measure should be able to measure the similarity between any two web pages.
    - Overlooked by: SimRank
- PageSim
  - Takes the above problems into account.

4. PageSim: a new approach II

- Cocitation
  - Which page is more similar to d: c or e?
  - Suppose page a is YAHOO!’s homepage, and b is a personal web page.
  - Authoritative pages are more important.

4. PageSim: a new approach III

- SimRank
  - Are a and b similar?
  - SimRank answers “NO” in each case.
  - Are the answers reasonable?

4. PageSim: a new approach IV

- Page a linking to b and c means a “thinks”
  - b and c are similar.
  - both b and c are similar to a.
- Intuitions
  - Page a spreads similarity to its neighbors.
  - Authoritative pages spread more similarity.

4. PageSim: a new approach V

- PageSim
  - In PageSim, the PageRank (PR) score is used to measure the authority of a web page.
    - PR assigns global importance scores to all web pages.
  - Each page spreads its own similarity score (PR score) to its neighbors.
  - Each page also propagates other pages' similarity scores to its neighbors.
  - After the similarity score propagation has finished, each page contains an array of similarity scores.
- PageRank score propagation

4. PageSim: a new approach VI

- Example: similarity propagation (page a only)
  - PR(a)=100, PR(b)=55, PR(c)=102
  - Each page propagates 80% of its similarity score, split evenly among its neighbors.
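
The propagation rule above can be sketched as a recursive walk. The slide's figure is not available here, so the link structure below (a→b, a→c, b→c, c→a) is an assumption, chosen because it yields score vectors that round to the ones shown on the next slide. Restricting the walk to simple paths (a score never re-enters a page it has already passed through) is likewise an assumption, and `propagate` is an illustrative name.

```python
def propagate(out_links, pr, d=0.8):
    """Return sv, where sv[v][u] is the score page v received from page u."""
    sv = {v: {u: 0.0 for u in pr} for v in pr}
    for u in pr:
        sv[u][u] = pr[u]  # each page starts with its own PR score

    def walk(source, page, score, visited):
        targets = out_links.get(page, [])
        if not targets:
            return
        # 80% of the score at this page, split evenly among its out-links.
        share = d * score / len(targets)
        for nxt in targets:
            if nxt in visited:  # assumption: follow simple paths only
                continue
            sv[nxt][source] += share
            walk(source, nxt, share, visited | {nxt})

    for u in pr:
        walk(u, u, pr[u], {u})
    return sv

# Hypothetical link structure; PR values are taken from the slide.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = {"a": 100, "b": 55, "c": 102}
sv = propagate(out_links, pr)

for page in "abc":
    print(page, [round(sv[page][u]) for u in "abc"])
```

Page a sends 40 to each of b and c directly, and a further 32 reaches c through b, which gives c a total of 72 from a.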

4. PageSim: a new approach VII

- Example: similarity propagation II
  - PR(a)=100, PR(b)=55, PR(c)=102
  - Each page contains a similarity score vector (SV).
    - SV(a) = (100, 35, 82)
    - SV(b) = (40, 55, 33)
    - SV(c) = (72, 44, 102)
- PageSim score (PS) computation
  - PS(a, b) = Σ min( SV(a), SV(b) ), with the minimum taken entry by entry
    = 40 + 35 + 33 = 108
  - Two pages are more similar if they share more common similarity scores.
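
The PS computation above is an entry-wise minimum followed by a sum. A minimal sketch using the three score vectors from the slide (the function name `pagesim` is illustrative):

```python
def pagesim(sv_u, sv_v):
    """PS(u, v): sum of the entry-wise minimum of two score vectors."""
    return sum(min(x, y) for x, y in zip(sv_u, sv_v))

# Score vectors as given on the slide.
SV = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}

print(pagesim(SV["a"], SV["b"]))  # 40 + 35 + 33
print(pagesim(SV["a"], SV["c"]))  # 72 + 35 + 82
```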

4. PageSim: a new approach VIII

- Example: similarity spreading III
  - PageSim score matrix: PS_matrix = (PS(u, v)), of size n × n (lower triangle shown)

           a    b    c
      a: 217
      b: 108  128
      c: 189  117  219

  - PS_matrix is symmetric.
    - PS(a, b) = PS(b, a)
  - Any web page is most similar to itself.
    - PS(u, u) = max( PS(u, v) ), for any v.
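
Both properties follow directly from the min-sum definition of PS. A short derivation, writing SV(u)_i for the i-th entry of page u's score vector (this index notation is introduced here for illustration and does not appear on the slides):

```latex
\begin{align*}
PS(u,v) &= \sum_{i=1}^{n} \min\bigl(SV(u)_i,\, SV(v)_i\bigr)
         = \sum_{i=1}^{n} \min\bigl(SV(v)_i,\, SV(u)_i\bigr) = PS(v,u), \\
PS(u,u) &= \sum_{i=1}^{n} SV(u)_i
         \;\ge\; \sum_{i=1}^{n} \min\bigl(SV(u)_i,\, SV(v)_i\bigr) = PS(u,v).
\end{align*}
```

The first line uses the symmetry of min; the second uses the fact that min(x, y) never exceeds x.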

4. PageSim: a new approach IX

- Propagation radius pruning I
  - The time complexity of propagating one page’s similarity score to all the others is O(k^n), where k is the average number of out-links.
  - The similarity score propagated to distant pages is small enough to be omitted.
  - The complexity of propagation is reduced to O(k^r) by limiting the radius of propagation to r.
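
Radius pruning can be sketched by adding a depth limit to a path-based propagation routine. This is an illustrative sketch, not the author's implementation: the `radius` parameter name and the simple-path restriction are assumptions, and the 0.8 propagation factor is taken from the earlier example. The test graph is the four-page loop a→b→c→d→a with PR = 100 for every page, as in the loop-link demonstration.

```python
def propagate(out_links, pr, d=0.8, radius=None):
    """Return sv, where sv[v][u] is the score page v received from page u.

    radius=None means unlimited propagation; radius=r stops after r links.
    """
    sv = {v: {u: 0.0 for u in pr} for v in pr}
    for u in pr:
        sv[u][u] = float(pr[u])

    def walk(source, page, score, visited, depth):
        targets = out_links.get(page, [])
        if not targets or (radius is not None and depth >= radius):
            return  # pruned: beyond the propagation radius
        share = d * score / len(targets)
        for nxt in targets:
            if nxt in visited:  # assumption: follow simple paths only
                continue
            sv[nxt][source] += share
            walk(source, nxt, share, visited | {nxt}, depth + 1)

    for u in pr:
        walk(u, u, pr[u], {u}, 0)
    return sv

loop = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
pr = {page: 100 for page in "abcd"}

full = propagate(loop, pr)
pruned = propagate(loop, pr, radius=2)

print(round(sum(full["a"].values()), 1))    # 100 + 80 + 64 + 51.2
print(round(sum(pruned["a"].values()), 1))  # 100 + 80 + 64
```

With radius 2, page a no longer receives the 51.2 that would arrive from b after three links, which is the smallest contribution in its score vector.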


4. PageSim: a new approach X

- Propagation radius pruning II
  - Real data (CSE homepage) and synthetic data


5. Demonstrations I

- Example 1: single link
  - PageSim matrix

      a: 100
      b: 80    265
      c: 64    212    469.2
      d: 51.2  169.6  375.4  694.1

  - PR = (100, 185, 257.2, 318.6)
  - SimRank matrix

      a: 1
      b: 0  1
      c: 0  0  1
      d: 0  0  0  1

5. Demonstrations II

- Example 2: loop link
  - PageSim matrix

      a: 295.2
      b: 246.4  295.2
      c: 230.4  246.4  295.2
      d: 246.4  230.4  246.4  295.2

  - PR = (100, 100, 100, 100)
  - SimRank matrix

      a: 1
      b: 0  1
      c: 0  0  1
      d: 0  0  0  1

5. Demonstrations III

- Example 3: more complex
  - PageSim matrix

      1: 100.0
      2: 40.0   487.6
      3: 50.7   159.4  397.4
      4: 10.7   238.5  130.0  275.5
      5: 10.7   130.0  130.0  130.0  314.9

  - PR = (100, 40.0, 50.7, 10.7, 10.7)
  - SimRank matrix

      1: 1
      2: 0  1
      3: 0  0.25  1
      4: 0  0     0.5  1
      5: 0  0     0.5  1  1

  - PageSim results
    - v3 is most similar to v1.
    - v4 is most similar to v2.

6. Conclusion and future work I

- Conclusion
  - Web mining
  - Web page similarity measures: text-based, link-based, and hybrid
  - PageSim: PageRank score propagation
  - Propagation radius pruning
  - PageSim vs SimRank

6. Conclusion and future work II

- Future work
  - Evaluation of PageSim
    - Taking the traditional text-based similarity measure TF-IDF as ground truth.
  - Efficiency of computation
    - Since computing the PageSim score of two web pages is O(n), computing the scores of all n^2 pairs of pages is O(n^3).
  - Storage issue
    - Since each page needs an array of length n to store the similarity scores issued from all web pages, the storage needed by PageSim is O(n^2).


Q & A


Thank you!