PHRASE-BASED KNOWLEDGEABLE DOCUMENT INDEX MODEL FOR WEB DOCUMENT CLUSTERING



M. MOHAN REDDY
B.Tech Computer Science Engineering
Dr. S G Institute of Engineering & Technology, Markapur
E-Mail: mohanmallela@yahoo.co.in

SAPTAGIRI PRASAD
B.Tech Computer Science Engineering
Dr. S G Institute of Engineering & Technology, Markapur
E-Mail: saptha_1987@yahoo.co.in


ABSTRACT - Most document clustering techniques rely on single-term analysis of the document data set, such as the vector space model. More informative features, including phrases and their weights, are particularly important for achieving more accurate document clustering. Document clustering is useful in many applications, such as automatic categorization of documents, grouping of search engine results, and building taxonomies of documents. The motivation behind the work in this paper is our belief that document clustering should be based not only on single-word analysis but on phrases as well: the similarity between documents should rest on matching phrases rather than on single words only. We propose a system for Web document clustering based on two key concepts. The first is the use of weighted phrases as an essential constituent of documents, so that similarity between documents is based on matching phrases and their weights. The second is the incremental clustering of documents to maximize the tightness of clusters by carefully watching the similarity distribution inside each cluster.




1 INTRODUCTION

The system consists of four components:

1. A Web document restructuring scheme that identifies different document parts and assigns levels of significance to these parts according to their importance.

2. A novel phrase-based document indexing model, the Document Index Graph (DIG), that captures the structure of sentences in the document set rather than single words only. The DIG model is based on graph theory and utilizes graph properties to match any-length phrase from a document to any number of previously seen documents in a time nearly proportional to the number of words of the document.

3. A phrase-based similarity measure for scoring the similarity between two documents according to the matching phrases and their significance.

4. An incremental document clustering method based on maintaining high cluster cohesiveness.

The integration of these four components proved to be of superior performance to traditional document clustering methods. The overall system design is illustrated in Fig. 1.




2 WEB DOCUMENT STRUCTURE ANALYSIS

The proposed system analyzes the HTML document and restructures it according to a predetermined structure that assigns different levels of significance to different document parts. The result is a well-structured XML document that corresponds to the original HTML document, but with significance levels assigned to the different parts of the original document.

We assign one of three levels of significance to the different parts: HIGH, MEDIUM, and LOW. Examples of HIGH significance parts are the title, metakeywords, metadescription, and section headings. Examples of MEDIUM significance parts are text that appears in bold, italics, or color, hyperlinked text, image alternate text, and table captions.


LOW significance parts are usually comprised of the document body text that was not assigned any of the other levels.

This structuring scheme is exploited in measuring the similarity between two documents. For example, if we have a phrase match of HIGH significance in both documents, the similarity is rewarded more than if the match was between LOW significance phrases. This is justified by arguing that a phrase match in titles, for example, is much more informative than a phrase match in body text.

Fig. 1. Web document clustering system: Web documents pass through document structure identification, DIG representation, document similarity calculation, and incremental clustering (histogram based clustering, then semantic based clustering) to produce the document clusters.
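As a rough illustration of this scheme, the following sketch assigns a significance level to text based on the HTML part it came from. The exact tag-to-level mapping is our assumption; the paper only lists examples of each level.

```python
# Significance levels used when restructuring an HTML document.
HIGH, MEDIUM, LOW = 3, 2, 1

# Illustrative mapping from HTML parts to levels, following the
# examples above (assumed; the paper does not give a complete table).
TAG_SIGNIFICANCE = {
    "title": HIGH, "meta": HIGH,                # title, metakeywords/metadescription
    "h1": HIGH, "h2": HIGH, "h3": HIGH,         # section headings
    "b": MEDIUM, "strong": MEDIUM,              # bold text
    "i": MEDIUM, "em": MEDIUM, "font": MEDIUM,  # italics, colored text
    "a": MEDIUM, "caption": MEDIUM,             # hyperlinked text, table captions
}

def significance(tag_name: str) -> int:
    """Level for text inside the given tag; plain body text is LOW."""
    return TAG_SIGNIFICANCE.get(tag_name.lower(), LOW)
```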



3 DOCUMENT INDEX GRAPH

The proposed Document Index Graph (DIG
for short) indexes the documents while
maintaining the senten
ce structure in the
original documents. This allows us to make
use of more informative phrase matching
rather than individual words matching.
Moreover, DIG also captures the different
levels of significance of the original
sentences, thus allowing us to ma
ke use of
sentence significance.


3.1 DIG Construction

The DIG is built incrementally by processing one document at a time. When a new document is introduced, it is scanned in sequential fashion, and the graph is updated with the new sentence information as necessary. New words are added to the graph as necessary and connected with other nodes to reflect the sentence structure. The graph building process becomes less memory demanding when no new words are introduced by a new document (or very few new words are introduced). At this point, the graph becomes more stable, and the only operation needed is to update the sentence structure in the graph to accommodate the new sentences introduced. It is critical to note that introducing a new document will only require the inspection (or addition) of those words that appear in that document, and not every node in the graph. This is where the efficiency of the model comes from. Along with indexing the sentence structure, the level of significance of each sentence is also recorded in the graph. This allows us to recall such information when we measure the similarity with other documents.
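As a minimal sketch of this incremental construction (our own simplification, not the authors' implementation), the graph can be stored as an edge table keyed by word, with each edge recording which document, sentence, position, and significance level produced it:

```python
from collections import defaultdict

class DocumentIndexGraph:
    """Minimal DIG sketch: nodes are distinct words; a directed edge
    (w1 -> w2) records every (doc_id, sentence_id, position, level)
    at which w2 followed w1, so shared phrases can be traced later."""

    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, sentences):
        """sentences: list of (word_list, significance_level) pairs.
        Only the words of this document are inspected or added."""
        for sent_id, (words, level) in enumerate(sentences):
            for pos in range(len(words) - 1):
                self.edges[words[pos]][words[pos + 1]].append(
                    (doc_id, sent_id, pos, level))

    def documents_sharing(self, phrase):
        """Approximate phrase lookup: documents containing every
        consecutive word pair of the phrase. A full implementation
        would also verify sentence identity and position continuity."""
        docs = None
        for a, b in zip(phrase, phrase[1:]):
            hits = {d for (d, _, _, _) in self.edges[a][b]}
            docs = hits if docs is None else docs & hits
        return docs or set()
```

Because only the words of the incoming document are touched, the cost of adding a document scales with its length rather than with the size of the graph, which mirrors the efficiency argument above.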


4 A PHRASE-BASED SIMILARITY MEASURE

This phrase-based similarity measure is a function of four factors:

- the number of matching phrases P;
- the lengths of the matching phrases (l_i : i = 1, 2, ..., P);


- the frequencies of the matching phrases in both documents (f_{1i} and f_{2i} : i = 1, 2, ..., P); and
- the levels of significance (weights) of the matching phrases in both documents (w_{1i} and w_{2i} : i = 1, 2, ..., P).

The phrase similarity between two documents, d_1 and d_2, is calculated using the following empirical equation:

$$\mathrm{sim}_p(d_1, d_2) = \frac{\sqrt{\sum_{i=1}^{P} \left[ g(l_i)\,(f_{1i} w_{1i} + f_{2i} w_{2i}) \right]^{2}}}{\sum_j |s_{1j}|\, w_{1j} + \sum_k |s_{2k}|\, w_{2k}}$$


where g(l_i) is a function that scores the matching phrase length, giving a higher score as the matching phrase length approaches the length of the original sentence, and |s_{1j}| and |s_{2k}| are the original sentence lengths from documents d_1 and d_2, respectively. The equation rewards longer phrase matches with a higher level of significance and with higher frequency in both documents. The function g(l_i) in the implemented system was:



$$g(l_i) = \left( \frac{l_i}{|s_{1j}|} \right)^{\gamma}$$

where |s_{1j}| is the original sentence length and γ is a sentence fragmentation factor with values greater than or equal to 1. If γ is 1, two halves of a sentence could be matched independently and would be treated as a whole sentence match. However, by increasing γ we can avoid this situation and score whole sentence matches higher than fractions of sentences. A value of 1.2 for γ was found to produce the best results.
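A direct transcription of this measure might look as follows. This is a sketch assuming the matching phrases have already been extracted from the DIG; the tuple layout and parameter names are ours:

```python
def g(l_i: float, sentence_len: float, gamma: float = 1.2) -> float:
    """Phrase-length score: approaches 1 as the match covers the whole
    original sentence; gamma >= 1 penalizes sentence fragments."""
    return (l_i / sentence_len) ** gamma

def sim_p(matches, sentences1, sentences2, gamma=1.2):
    """matches: one (l_i, sent_len, f1, w1, f2, w2) tuple per matching
    phrase. sentences1/sentences2: (length, weight) pairs for every
    original sentence of documents d1 and d2."""
    num = sum((g(l, sl, gamma) * (f1 * w1 + f2 * w2)) ** 2
              for (l, sl, f1, w1, f2, w2) in matches) ** 0.5
    den = (sum(s * w for s, w in sentences1) +
           sum(s * w for s, w in sentences2))
    return num / den if den else 0.0
```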



5 INCREMENTAL DOCUMENT CLUSTERING

The role of a document similarity measure is to provide a judgment on the closeness of documents to each other.

Incremental clustering is an essential strategy for online applications, where time is a critical factor for usability. Incremental clustering algorithms work by processing data objects one at a time, incrementally assigning data objects to their respective clusters as they progress. The process is simple enough, but faces several challenges. How do we determine to which cluster the next object should be assigned? How do we deal with the problem of insertion order? Once an object has been assigned to a cluster, should its assignment be frozen, or is it allowed to be reassigned to other clusters later on? Usually, a heuristic method is employed to deal with these challenges. A "good" incremental clustering algorithm has to find the respective cluster for each newly introduced object without significantly sacrificing clustering accuracy due to insertion order or fixed object-to-cluster assignment. Methods of this kind include single-pass clustering, K-nearest-neighbor clustering, suffix tree clustering, DC-tree clustering, and histogram based incremental clustering.


5.1 Similarity Histogram Based Incremental Clustering

The clustering approach proposed here is an incremental dynamic method of building the clusters. We adopt an overlapped cluster model. The key concept of the similarity histogram based clustering method (referred to as SHC hereafter) is to keep each cluster at a high degree of coherency at any time. We represent the coherency of a cluster with a new concept called the Cluster Similarity Histogram: a concise statistical representation of the distribution of pair-wise document similarities in the cluster. A number of bins in the histogram correspond to fixed similarity value intervals. Each bin contains the count of pair-wise document similarities in the corresponding interval.

5.2 Creating Coherent Clusters Incrementally

The quality of a similarity histogram (cluster cohesiveness) is judged by calculating the ratio of the count of similarities above a certain similarity threshold S_T to the total count of similarities. The higher this ratio, the more coherent the cluster is.

Let n_c be the number of documents in a cluster. The number of pair-wise similarities in the cluster is m_c = n_c(n_c - 1)/2.


Let S = {s_i : i = 1, ..., m_c} be the set of similarities in the cluster. The histogram of the similarities in the cluster is represented as:

$$H = \{ h_i : i = 1, \ldots, B \}, \qquad h_i = \mathrm{count}(s_k) \ \text{such that}\ s_{li} \le s_k < s_{ui},$$

where

B: the number of histogram bins,
h_i: the count of similarities in bin i,
s_{li}: the lower similarity bound of bin i, and
s_{ui}: the upper similarity bound of bin i.

The histogram ratio (HR) of a cluster is the measure of the cohesiveness of the cluster described above, and is calculated as:

$$HR_c = \frac{\sum_{i=T}^{B} h_i}{\sum_{j=1}^{B} h_j}, \qquad T = \lceil S_T \cdot B \rceil,$$

where

HR_c: the histogram ratio of cluster c,
S_T: the similarity threshold, and
T: the bin number corresponding to the similarity threshold.

Basically, we would like to keep the histogram ratio of each cluster high. However, since we allow documents that can degrade the histogram ratio to be added, this could result in a chain effect of degrading the ratio to zero eventually. To prevent this, we set a minimum histogram ratio HR_min that clusters should maintain. We also do not allow adding a document that would bring the histogram ratio down significantly (even if it would remain above HR_min). This prevents a single bad document addition from severely degrading cluster quality.
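The add-or-reject policy can be sketched directly from these definitions. Here B, S_T, HR_min, and the permitted drop eps are illustrative values, not ones prescribed by the paper:

```python
import math

def histogram_ratio(similarities, B=10, S_T=0.5):
    """HR_c over B equal-width bins on [0, 1]; T = ceil(S_T * B) is
    the first (1-based) bin counted toward the coherent fraction."""
    if not similarities:
        return 1.0
    bins = [0] * B
    for s in similarities:
        bins[min(int(s * B), B - 1)] += 1
    T = math.ceil(S_T * B)
    return sum(bins[T - 1:]) / len(similarities)

def should_add(new_doc_sims, cluster_sims, HR_min=0.6, eps=0.05):
    """Accept a document only if the cluster's histogram ratio stays
    above HR_min and does not drop by more than eps (our policy knob).
    new_doc_sims: similarities of the new document to cluster members."""
    hr_old = histogram_ratio(cluster_sims)
    hr_new = histogram_ratio(cluster_sims + new_doc_sims)
    return hr_new >= HR_min and hr_old - hr_new <= eps
```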

















Fig. 2. Semantic based clustering system design: Web documents undergo syntactic analysis to produce parse trees, followed by semantic analysis and knowledge representation, and finally semantic based clustering to produce the document clusters.






5.3 Semantic Based Clustering

The histogram based clustering produces wrong results for the following example sentences:

Ex. 1: "He is an intelligent boy"
       "He is a brilliant boy"

Ex. 2: "Mother baked for 3 hrs."
       "The pie baked for 3 hrs."

In Ex. 1 the two sentences mean the same thing but share only part of their wording, while in Ex. 2 the sentences share the phrase "baked for 3 hrs." yet differ in meaning: "Mother" is the agent doing the baking, whereas "the pie" is the thing being baked.

The overall semantic based clustering system is shown in Fig. 2. The parse trees for Ex. 2 are shown in Fig. 3, and the knowledge representation for the same sentences is shown in Fig. 4.
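To make the Ex. 2 failure concrete, here is a toy sketch of the knowledge-representation comparison. The Obj/Act/St roles follow Fig. 4; in the full system they would be filled from the parse trees, but here they are hard-coded:

```python
# Knowledge representation of Ex. 2 as (Obj, Act, St) triples per Fig. 4.
sent_a = {"Obj": "Mother",  "Act": "baked", "St": "for 3 hrs."}
sent_b = {"Obj": "the pie", "Act": "baked", "St": "for 3 hrs."}

def semantic_match(t1, t2):
    """Sentences match only if every role carries the same filler; a
    surface match on the phrase 'baked for 3 hrs.' alone is not enough."""
    return all(t1[r] == t2[r] for r in ("Obj", "Act", "St"))

print(semantic_match(sent_a, sent_b))  # False: the Obj fillers differ
```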


6 APPLICATIONS

Applications of this framework include automatic grouping of search engine results, semantic based information retrieval, improved organization and viewing of documents, improved information retrieval system performance, and improved automatic speech recognition systems.













































Fig. 3. Parse trees for "Mother baked for 3 hrs." and "The pie baked for 3 hrs.": both sentences parse as S → NP VP, with VP → V ("baked") PP ("for 3 hrs."), differing only in the subject NP.












Fig. 4. Knowledge representation for the sentences above: each reduces to a triple (Obj1, Act1, St1), i.e. (Mother, baked, for 3 hrs.) and (The pie, baked, for 3 hrs.).







7 CONCLUSIONS AND FUTURE RESEARCH

We presented a system composed of four components in an attempt to improve document clustering in the Web domain. Information in Web documents does not lie in the content only, but also in their inherent semi-structure. We presented a Web document analysis component that is capable of identifying the weights of various Web document phrases and breaking the document down into its sentence constituents for further processing. Further, the resulting clusters are grouped based on the semantic similarities of their phrases. By this we improved the efficiency of the clustering model.

There are a number of future research directions to extend and improve this work. One direction that this work might continue on is to improve the accuracy of similarity calculation between documents by employing different similarity calculation strategies. Although the current scheme proved more accurate than traditional methods, there is still room for improvement.


8 REFERENCES

1. K. Hammouda and M. Kamel, "Phrase-Based Document Similarity Based on an Index Graph Model," Proc. 2002 IEEE Int'l Conf. Data Mining (ICDM 2002), pp. 203-210, 2002.

2. R. Kosala and H. Blockeel, "Web Mining Research: A Survey," ACM SIGKDD Explorations, vol. 2, no. 1, pp. 1-15, 2000.

3. J.L. Fagan, "Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods," PhD thesis, Dept. of Computer Science, Cornell Univ., Sept. 1987.

4. M. Charikar, C. Chekuri, T. Feder, and R. Motwani, "Incremental Clustering and Dynamic Information Retrieval," Proc. 29th Ann. ACM Symp. Theory of Computing, pp. 626-635, 1997.

5. F. Beil, M. Ester, and X. Xu, "Frequent Term-Based Text Clustering," Proc. Eighth Int'l Conf. Knowledge Discovery and Data Mining (KDD 2002), pp. 436-442, 2002.

6. W. Wong and A. Fu, "Incremental Document Clustering for Web Page Classification," Proc. 2000 Int'l Conf. Information Soc. in the 21st Century: Emerging Technologies and New Challenges (IS2000), 2000.

7. D. Boley, "Principal Direction Divisive Partitioning," Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 325-344, 1998.