Web Page Clustering based on Web Community Extraction

Nov 25, 2013 (3 years and 8 months ago)

Chikayama-Taura Lab.

M2 Shim Wonbo

Background

- Directory = Category
  - Open Directory Project
    - Used by Google, Lycos, etc.
  - Categorizes Web pages by hand
    - Accurate
    - Updated slowly
    - Not scalable
- World Wide Web
  - Rapid growth (= the number of clusters changes)
  - Updated daily (= cluster centers move)

Because of these two properties of the Web, a Web page clustering system that needs no human effort is required.

Purpose


Construct a Web page clustering system which
- finds clusters without human help
- is scalable
- clusters Web pages at high speed
- clusters Web pages accurately

Brief System View

[Figure: (a) Web pages -> DBG extraction -> (b) Web communities -> partitioning of remaining pages based on TF-IDF -> (c) Web page clustering]

Contribution


- Web Community
  - A new Web community topology is defined.
  - The extracted Web communities show higher precision than existing work.
- Web Page Clustering
  - Web communities are used as the centroids of clusters in the TF-IDF space.
  - Experimental results show meaningful clusters.

Agenda


Introduction


Related Work


Proposal


Evaluation


Conclusion

Existing Work


- Text-based clustering
  - Uses terms as features
  - Generally used algorithms
    - e.g. k-means, hierarchical algorithms, density-based clustering
- Link-based clustering
  - Also called Web community extraction
  - Extracts dense subgraphs from the Web graph
- Combination of text and link information
  - e.g. Contents-Link Coupled Web Page Clustering [Yitong et al., DEWS2004]

Text-based Clustering

- Merit
  - Accurate (because the text is considered)
- Problem
  - Unsupervised clustering
    - Complex to decide the number of clusters
  - Supervised learning and clustering
    - Difficult to label each training datum

Contents-Link Coupled Web Page Clustering
[Yitong et al., DEWS2004]

- Feature
  - Term frequency (p_term), out-link (p_out), in-link (p_in)
- Similarity
  - [equation not preserved]
- Clustering Algorithm
  - An extension of the k-means algorithm

Extraction of Web Community based on Link Analysis

- An Approach to Find Related Communities Based on Bipartite Graphs [P. Krishna Reddy et al., 2001]
- PlusDBG: Web Community Extraction Scheme Improving Both Precision and Pseudo-Recall [Saida et al., 2005]

Terminology


- Fan and Center
- Bipartite Graph (BG)
  - Complete BG (CBG)
  - Dense BG (DBG)

[Figure: fans and centers in (a) a CBG and (b) a DBG, with the thresholds p and q]

Algorithm for Extracting DBG
[Reddy et al., 2001]

- Finds a bipartite graph using co-citing and co-cited Web pages
- Extracts a DBG from the above graph

[Figure: extraction of DBG(3, 3) starting from a seed page]
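The second step above can be sketched as iterative pruning. This is a minimal sketch, assuming that DBG(p, q) means every fan links to at least p centers and every center is linked from at least q fans; that reading, the function name `prune_to_dbg`, and the edge-set representation are assumptions for illustration and are not taken from the slides.

```python
def prune_to_dbg(edges, p, q):
    """Iteratively drop fans with < p out-links and centers with < q in-links.

    edges: iterable of (fan, center) pairs of a bipartite graph.
    Returns the surviving set of (fan, center) edges (possibly empty).
    """
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for fan, center in edges:
            out_deg[fan] = out_deg.get(fan, 0) + 1
            in_deg[center] = in_deg.get(center, 0) + 1
        # Keep only edges whose both endpoints still satisfy the thresholds.
        kept = {(f, c) for f, c in edges
                if out_deg[f] >= p and in_deg[c] >= q}
        if kept == edges:
            return kept
        edges = kept


# Toy graph (hypothetical): fans f1..f3 fully linked to centers c1..c3,
# plus a weak fan f4 that cites only c1.
toy = {(f, c) for f in ("f1", "f2", "f3") for c in ("c1", "c2", "c3")}
toy.add(("f4", "c1"))
dbg = prune_to_dbg(toy, p=3, q=3)
fans = sorted({f for f, _ in dbg})
centers = sorted({c for _, c in dbg})
```

In the toy graph the weak fan f4 is pruned, leaving three fans and three centers.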

PlusDBG


- Uses a distance defined by the co-citing page rate between two pages
- Finds co-citing pages which are within a distance threshold
- Extracts a DBG from the above graph

PlusDBG shows higher precision than DBG does.

Web Community Extraction

O High speed
O Finds topics over the Web
X May extract unrelated Web pages as a community
  - Problem of DBG
  - Improvement of PlusDBG

Proposal

1. Extracts Web communities using the link structure.
2. Assigns the remainders to the closest Web community in the TF-IDF space.


- Connecter
  - A fan which cites two centers.
- Connectable
  - Two centers are connectable if they have more than two connecters.
- Web Community
  - A Web community C is a DBG composed of connectable centers and connecters.

[Figure: connectable centers and a connecter]

Proposed Web Community

Every center is connectable to every other center.
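The definitions above can be expressed as a few set operations. This is a minimal sketch, assuming the Web graph is given as a mapping from each page to the set of pages it cites; the names `connecters`, `connectable`, `is_community` and the toy links are hypothetical.

```python
from itertools import combinations


def connecters(out_links, c1, c2):
    """Fans that cite both centers c1 and c2."""
    return {fan for fan, cited in out_links.items()
            if c1 in cited and c2 in cited}


def connectable(out_links, c1, c2, min_connecters=3):
    """True if c1 and c2 have more than two (i.e. at least 3) connecters."""
    return len(connecters(out_links, c1, c2)) >= min_connecters


def is_community(out_links, centers):
    """True if every pair of centers is connectable."""
    return all(connectable(out_links, a, b)
               for a, b in combinations(sorted(centers), 2))


# Toy links (hypothetical): b, c, d cite both g and i; only a cites j.
out_links = {
    "a": {"g", "j"},
    "b": {"g", "i"},
    "c": {"g", "i"},
    "d": {"g", "i"},
}
shared = connecters(out_links, "g", "i")
```

Here g and i share three connecters and are connectable, while g and j share only one and are not.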

Proposed Web Community Extraction Algorithm

[Figure: example Web graph with pages a to j]

- Initial state: S = {}, T = {g}
- S' = {a, b, c, d}, T' = {e, f, h, i, j}
- t' = j: # connecters = 1, so T' = {e, f, h, i}
- t' = i: # connecters = 3, so S = {b, c, d}, T = {g, i}
- Output community = {a, b, c, d, e, f, g, h, i}

Labeling Remainders


Remainder: a Web page which is not extracted as a member of any community.

1. Calculate the centroids of the Web communities.
   [equation not preserved]
2. Label each remainder with a Web community ID.
   [equation not preserved]

Here v_i is the TF-IDF vector of a page v.
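The two labeling steps can be sketched as follows. This is a minimal sketch under stated assumptions: the centroid is taken as the mean of the member TF-IDF vectors and the closest community is chosen by Euclidean distance. The slides do not preserve the equations, so the metric, the sparse-dict representation and all names here are assumptions.

```python
import math


def centroid(vectors):
    """Mean of sparse TF-IDF vectors given as {term: weight} dicts."""
    total = {}
    for vec in vectors:
        for term, w in vec.items():
            total[term] = total.get(term, 0.0) + w
    return {term: w / len(vectors) for term, w in total.items()}


def distance(u, v):
    """Euclidean distance between two sparse vectors (assumed metric)."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def label_remainder(page_vec, centroids):
    """Return the ID of the community whose centroid is closest."""
    return min(centroids, key=lambda cid: distance(page_vec, centroids[cid]))


# Toy communities (hypothetical IDs and weights).
communities = {
    "bikes": [{"bike": 1.0, "engine": 0.5}, {"bike": 0.8, "engine": 0.7}],
    "films": [{"movie": 1.0, "actor": 0.6}, {"movie": 0.9}],
}
centroids = {cid: centroid(vecs) for cid, vecs in communities.items()}
label = label_remainder({"bike": 0.9}, centroids)
```

A remainder that mentions only "bike" ends up in the "bikes" community because that centroid is nearer in the term space.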

Agenda


- Introduction
- Related Work
- Proposal
- Evaluation
  - Preprocess
  - Web community extraction
  - Labeling result
- Conclusion

Preprocess


- Data set
  - 2.34 M pages, 20 M links
  - Almost 80% of the data set is Japanese pages.
- Create a link-only file
  - Links pointing outside the data set are deleted.
  - Duplicate pages which share 90% of their links are deleted.
  - Pages including 50 links are deleted.
  - Remaining data set: 1.45 M pages, 5.09 M links
- Create a TF-IDF file
  - Used TF-IDF: [equation not preserved]
  - Parser: MeCab
  - Terms which appeared in less than 0.1% or more than 90% of all documents are removed.
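The term filter in the last step can be sketched as a document-frequency cut. This is a minimal sketch, assuming the documents are already tokenized (for example by MeCab) into lists of terms. The weighting in `tf_idf` is the textbook tf * log(N / df) form, which is an assumption because the formula used in the slides is not preserved; all function names are hypothetical.

```python
import math
from collections import Counter


def filter_terms(docs, min_df=0.001, max_df=0.90):
    """Keep terms whose document-frequency ratio lies in [min_df, max_df]."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t for t, count in df.items() if min_df <= count / n <= max_df}


def tf_idf(doc, docs, vocab):
    """Textbook tf * log(N / df) weights; not necessarily the slides' formula."""
    n = len(docs)
    tf = Counter(t for t in doc if t in vocab)
    return {t: c * math.log(n / sum(1 for d in docs if t in d))
            for t, c in tf.items()}


# Toy corpus (hypothetical). The lower threshold is raised to 30% here only
# because a 0.1% cut cannot remove anything from five documents.
docs = [
    ["bike", "engine", "the"],
    ["bike", "the"],
    ["movie", "the"],
    ["actor", "movie", "the"],
    ["rare", "the"],
]
vocab = filter_terms(docs, min_df=0.3, max_df=0.90)
weights = tf_idf(docs[0], docs, vocab)
```

The term "the" appears in every document and is dropped by the upper bound, while the single-document terms are dropped by the lower bound.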

Distribution of Web Community Size

[Figure: distribution of Web community size]

Method            # communities   # extracted pages
PlusDBG 0.8              22,902             865,945
PlusDBG 1.0               8,077             922,053
PlusDBG 1.2               7,527             923,100
Proposed method          50,065             648,626

Distance from centroids to term vectors

Variance of distance

Example of Web communities

About motorbike manufacturers and links.

- http://bike.ak-m.jp/
- http://www.bike-cube.jp/
- http://bike.ak-m.jp/2006/01/post_32.html
- http://www.bike-cube.jp/index.php
- http://bike.ak-m.jp/2006/11/post_20.html
- http://www.kymco.co.jp/
- http://www1.suzuki.co.jp/motor/
- http://www.yamaha-motor.jp/mc/
- http://bike.ak-m.jp/
- http://www.peugeot-moto.com/
- http://www.apriliajapan.co.jp/index.html
- http://www.buell.jp/
- http://www.cagiva.co.jp/
- http://www.mitsuoka-motor.com/
- http://www.ducati.com/od/ducatijapan/jp/index.jhtml
- http://www.triumphmotorcycles.com/japan/
- http://www.harley-davidson.co.jp/index.html
- http://www.ktm-japan.co.jp/


Comparing to ODP


Definition of precision

1. For a Web community C, let OC be the subset of its pages that exist in ODP.
2. If |OC| < 3, the precision of C is undefined.
3. For r in OC, the Pscore of r is:
   [equation not preserved]
4. With Pscore, the precision of C is:
   [equation not preserved]

score(p, q) = 1 if p and q are in the same directory
score(p, q) = 0 otherwise

- Compared against the 4th and 5th levels of the ODP directories (Top/Regional/Japan/Arts/Movie)
- Number of ODP pages included in the data set: 47,093
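The precision definition above can be illustrated with a small sketch. The Pscore and precision equations are not preserved, so the aggregation below (Pscore as the mean of score(r, q) over the other pages in OC, and precision as the mean Pscore) is an assumption chosen only for illustration. Only `score` and the |OC| < 3 rule come from the slides; the function name, URLs and directory labels are hypothetical.

```python
def precision(community, odp_directory):
    """Hypothetical precision of a community against ODP.

    community: iterable of page URLs.
    odp_directory: {url: directory} for pages listed in ODP.
    Returns None when fewer than 3 community pages are in ODP.
    """
    oc = sorted({page for page in community if page in odp_directory})
    if len(oc) < 3:
        return None  # precision is undefined (step 2)

    def score(p, q):
        # 1 if both pages are in the same ODP directory, else 0.
        return 1 if odp_directory[p] == odp_directory[q] else 0

    def pscore(r):
        # ASSUMPTION: mean agreement of r with the other pages in OC.
        others = [q for q in oc if q != r]
        return sum(score(r, q) for q in others) / len(others)

    # ASSUMPTION: community precision is the mean Pscore over OC.
    return sum(pscore(r) for r in oc) / len(oc)


# Toy ODP mapping (hypothetical URLs and directories).
odp = {"u1": "Arts/Movie", "u2": "Arts/Movie", "u3": "Business"}
result = precision(["u1", "u2", "u3", "not-in-odp"], odp)
too_small = precision(["u1", "u2"], odp)
```

Under these assumptions, two of the three ODP pages agree with each other, so the community scores one third, and a community with only two ODP pages has no defined precision.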

Comparing to ODP

Method            # ODP pages   # communities including ODP pages   # directories the pages belong to
PlusDBG 0.8            23,287                                 459                                 426
PlusDBG 1.0            25,016                                 156                                 430
PlusDBG 1.2            25,405                                  81                                 435
Proposed method        12,406                               4,811                                 337

Precision of Web Communities (4th level)

Precision of Web Communities (5th level)

Summary of Web Community Extraction


- The proposed method extracted smaller Web communities than PlusDBG did.
- Members of each community were closer to the centroid in the TF-IDF space than members of PlusDBG communities were.
- My communities showed higher precision than PlusDBG's when compared to ODP.

Labeling Result


- Pages containing fewer than 10 terms are ignored.
- Compared to the ODP
  - ODP pages: 29,153
  - ODP directories: 1,862

Labeling Result (the 4th level)

Labeling Result (the 5th level)

Labeling example

Summary and Conclusion


- A DBG structure is defined as the Web community topology.
  - Every pair of centers should be connectable.
  - Every fan is a connecter of centers.
  - My DBG structure extracts more compact and more precise Web communities than existing work does.
- Clustering based on Web community extraction is proposed.
  - The centroids of communities in the TF-IDF space are used to label the remainders.
  - The clustering result showed meaningful page groups.

Future Work


- Coupling feature selection to improve the labeling result.
- Clustering the extracted centroids.

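Publications

- (To be presented) "Improvement of a Web Community Extraction Algorithm", Shim Wonbo, Kenjiro Taura, Takashi Chikayama, Data Engineering Workshop (DEWS), 2007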

Thank you for your attention


Extraction Algorithm

1. Select a seed page t and set T = {t}, S = {}.
2. Find S', the set of pages that cite any page in T.
3. Find T', the set of pages that are cited by any page in S' and are not in T.
4. Determine whether t' ∈ T' is connectable to all pages in T.
   1. If t' is connectable, set T = T ∪ {t'} and S = {connecters}, and go to 2.
   2. If not, select another t' ∈ T' and go to 4.
5. If |S| > 3 and |T| > 3, extract the page set as a Web community and delete it from the Web graph.
6. If any seed page t remains, go to 1.
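Steps 1 to 5 of the extraction algorithm can be sketched for a single seed as below. This is a minimal sketch under several assumptions: the graph is a mapping from each page to the set of pages it cites, S is accumulated as the union of all connecters found so far, candidates are tried in sorted order, and "connectable" means at least three connecters. The function name and the toy links are hypothetical; the links were chosen only so that the result matches the output set of the worked example, since the edges of the original figure are not preserved.

```python
def extract_community(out_links, seed, min_connecters=3):
    """Grow one community from a seed center; returns (fans S, centers T)."""
    centers, fans = {seed}, set()
    while True:
        # Step 2: S' = pages citing any current center.
        s_prime = {p for p, cited in out_links.items() if cited & centers}
        # Step 3: T' = pages cited by S' that are not yet centers.
        t_prime = set().union(*(out_links[p] for p in s_prime)) - centers
        added = False
        for cand in sorted(t_prime):
            # Step 4: connecters shared by the candidate and each center.
            shared = [{p for p in s_prime
                       if cand in out_links[p] and c in out_links[p]}
                      for c in centers]
            if all(len(s) >= min_connecters for s in shared):
                centers.add(cand)
                fans |= set().union(*shared)  # ASSUMPTION: S accumulates
                added = True
                break  # go back to step 2 with the enlarged T
        if not added:
            return fans, centers


# Toy links (hypothetical): fans a..d, candidate centers e..j.
out_links = {
    "a": {"g", "e", "f", "h", "j"},
    "b": {"g", "i", "e", "f", "h"},
    "c": {"g", "i", "e", "f", "h"},
    "d": {"g", "i", "e", "f", "h"},
}
fans, centers = extract_community(out_links, seed="g")
# Step 5: accept only if both sides have more than three pages.
accepted = len(fans) > 3 and len(centers) > 3
```

Page j is cited together with g by only one fan, so it is never added as a center, while the remaining pages form one community.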