A Survey of
Web Document Clustering
Wenyi Ni
Southern Methodist University
Department of Computer Science and Engineering
Abstract.
This survey presents recent research advances in the field of web document clustering. Several algorithms are introduced, including agglomerative hierarchical clustering, K-means, suffix tree clustering, link-based clustering, frequent-item-based clustering, and some other refined approaches. The survey also introduces two measures (entropy and F-measure) for evaluating the quality of clusters, and some algorithm comparisons in this paper are based on them.
Introduction.
With the boom of the Internet, the number of web pages is growing at an explosive rate. According to statistics from Google, there were about 3 billion web pages in 2003. Efficient retrieval of information has become a challenge. Dozens of Internet search engines have been designed to help users retrieve valuable information, yet many issues remain to be researched and improved. This survey addresses the research done on web document clustering technology in recent years.
Web document clustering is a core topic in the information retrieval field. It uses unsupervised algorithms to cluster a large number of web pages into several groups.
Let us take an example to illustrate why web document clustering is necessary. Everyone has used Google to search for information on the Internet. In response to a user's query, Google sends back a very large number of web pages. Although they are listed in order of importance, users still sometimes have to browse hundreds of web pages to find what they want. If we can organize the web pages into groups, users can skip the groups they are not interested in. They will not have to browse too many web pages before reaching their targets, which helps them complete their queries efficiently.
The problem, however, is how to group web pages. The answer is web document clustering. In this survey, we review the following aspects of web document clustering technology developed in recent years:
Key requirements for web document clustering.
How to represent a document in a mathematical model.
Different kinds of web document clustering algorithms.
Some refinements to the clustering algorithms.
How to choose an appropriate topic to present the clusters.
How to evaluate the algorithms and the resulting clusters.
Evaluation and comparison of different algorithms.
A real-world web document clustering application.
1. Key requirements for Web document clustering:
1.1 Relevance
The algorithm should produce clusters that separate documents relevant to the user's query from irrelevant ones. The documents in one cluster should discuss the same topic.
1.2 Browsable Summary
The user should be able to determine at a glance which cluster he is interested in. We therefore need to give each cluster a topic name, and the topic name should be representative.
1.3 CPU and Memory
The algorithm should be efficient enough to deal with a large number of documents, so that the user need not wait a long time for the result. The method should also be practical for real-world applications.
1.4 Overlap:
Since each document may have multiple topics, it is important to avoid confining each document to only one cluster.
2 Mathematical models of web documents
Different approaches use different mathematical models to represent web documents. Here, we discuss four common models used in recent research papers.
2.1. tf-idf model
The full name of tf-idf is term frequency-inverse document frequency. In this model, each document D is represented by a feature of the form $(d_1, d_2, \ldots, d_n)$, where $d_i$ is a word in D. The order of the $d_i$ is based on the weight of each word. The formula below is the common, widely used weight-calculating formula, although different approaches do not use exactly the same formula. Some extra parameters may be added to optimize the overall clustering performance. For example, in [8] the term weight is determined by a formula that combines Okapi's tf score and INQUERY's normalized idf score.
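Weight calculating formula in tf-idf:
\[ w_{ij} = tf_{ij} \times idf_j, \quad \text{where } idf_j = \log\frac{n}{df_j} \]
In the above formula, each factor is explained below:
$w_{ij}$ is the weight of the term $t_j$ in the Web page $P_i$.
$tf_{ij}$ is the number of occurrences of the term $t_j$ in the Web page $P_i$.
$idf_j$ is the inverse document frequency of the term $t_j$.
$df_j$ is the number of Web pages in the web document collection in which the term $t_j$ occurs.
$n$ is the total number of Web pages in the database.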
From this formula, we can see that the weight of each term is based on its inverse document frequency (IDF) in the document collection and the number of occurrences of the term in the document. Normally a word dictionary is used to strip out very common words. In addition, only the n highest-weighted terms (Zamir [1] uses 500) are selected as the feature.
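The weighting scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: pages are assumed to be already tokenized into word lists, stop-word removal is omitted, and the function names `tfidf_weights` and `top_terms` are hypothetical, not taken from any of the surveyed papers.

```python
import math
from collections import Counter

def tfidf_weights(pages):
    """pages: list of token lists. Returns one {term: w_ij} dict per page."""
    n = len(pages)
    # df[t]: number of pages in which term t occurs at least once
    df = Counter(term for page in pages for term in set(page))
    weights = []
    for page in pages:
        tf = Counter(page)  # tf[t]: occurrences of t in this page
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

def top_terms(weight, k):
    """Keep the k highest-weighted terms as the feature of a page."""
    return sorted(weight, key=weight.get, reverse=True)[:k]

pages = [
    ["cat", "ate", "cheese"],
    ["cat", "ate", "mouse", "too"],
    ["mouse", "ate", "cheese", "too"],
]
w = tfidf_weights(pages)
```

A term such as "ate" that occurs in every page gets weight 0, because its idf is log(n/n) = 0; this is why very common words contribute nothing to the feature.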
2.2 Suffix Tree model [1].
The whole web document is treated as a string. This string differs slightly from the original HTML file: word prefixes and suffixes are deleted, plurals are reduced to singular, sentence boundaries are marked, and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped. The identification of base clusters amounts to the creation of an inverted index of phrases for the web document collection.
The following figure is a suffix tree example [1]. It is built from three phrases: (1) "cat ate cheese", (2) "cat ate mouse too", (3) "mouse ate cheese too".
Figure 1. Sample of a suffix tree (courtesy of Zamir [1])
The suffix tree clustering algorithm is introduced in the next section of this survey.
2.3 Link based model [18]:
From each web document in the search result R, we extract all its out-links and in-links. In total we obtain N out-links and M in-links over all URLs. Each web document P is represented as two vectors: $P_{out}$ (N-dimensional) and $P_{in}$ (M-dimensional). The i-th item of $P_{out}$, written $P_{out,i}$, indicates whether P has the i-th out-link: it is 1 if so and 0 otherwise. $P_{in,j}$ is defined in the same way for in-links.
2.4 Frequent item set model [4]:
In this model, each document corresponds to an item and each possible feature corresponds to a transaction. The entries in the table (the domain of the document attribute) represent the frequency of occurrence of a specified feature (word) in that document.
[Table: feature-by-document frequency matrix with columns Doc1, Doc2, Doc3, … and one row per feature word (for example Olympics, race, bike); the cell values could not be recovered.]
Figure 2. A document collection represented as a transactional database
3. Different kinds of web document clustering approaches:
3.1 Algorithms using the tf-idf model
In the tf-idf model, we can use the following formulas to calculate the similarity between a pair of documents or clusters. Here are some popular similarity measures [15]:
Figure 3. Common similarity measurements.
Here $t_i$ and $t_j$ represent a pair of documents or clusters, and $t_{ih}$ represents the h-th term in document i.
Using one of these formulas and a proper threshold, we can easily determine whether two documents are similar.
The idea behind using the tf-idf model to represent documents is that the more words two documents have in common, the more likely they are to be similar.
Agglomerative hierarchical clustering and partitional (K-means) clustering are two techniques commonly used for document clustering with the tf-idf model.
3.1.1 Algorithm for agglomerative hierarchical clustering.
1. Start by regarding each document as an individual cluster.
2. Merge the most similar or closest pair of clusters (using one of the similarity or distance measures above).
3. Repeat step 2 until all objects are contained in a single cluster, which becomes the root of the tree.
4. Instead of step 3, we can set a threshold that determines the number of clusters as the stopping criterion for step 2.
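The steps above can be sketched as follows. This is a minimal illustration, not the implementation of any surveyed paper: it assumes documents are sparse tf-idf vectors stored as dicts, uses cosine similarity with group-average linkage (one of several possible choices), and stops when the best similarity falls below a threshold as in step 4. The names `cosine`, `group_average` and `ahc` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def group_average(c1, c2, docs):
    """Average pairwise similarity between the members of two clusters."""
    total = sum(cosine(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def ahc(docs, threshold):
    clusters = [[i] for i in range(len(docs))]       # step 1
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):               # step 2: most similar pair
            for y in range(x + 1, len(clusters)):
                s = group_average(clusters[x], clusters[y], docs)
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:                         # step 4: stop criterion
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

docs = [
    {"cat": 1, "cheese": 1},
    {"cat": 1, "cheese": 2},
    {"dallas": 1, "hotel": 1},
    {"dallas": 2, "hotel": 1},
]
clusters = ahc(docs, threshold=0.5)
```

With a threshold of 0.0 the loop never stops early for non-negative weights, so the merging runs until a single root cluster remains, as in step 3. The nested search for the best pair is what makes this naive version slow on large collections.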
3.1.2 Algorithm for K-means clustering.
1. Arbitrarily select K web documents as seeds; they are the initial centroids of the clusters.
2. Assign every other web document to the closest centroid.
3. Recompute the centroid of each cluster to obtain the new centroids.
4. Repeat steps 2 and 3 until the centroids no longer change.
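A minimal sketch of these four steps is shown below. It makes several assumptions for the sake of a reproducible example: documents are dense numeric vectors, closeness is squared Euclidean distance, the first K documents are used as seeds instead of an arbitrary selection, and a `max_iter` cap is added as a safety limit. The function name `kmeans` is hypothetical.

```python
def kmeans(vectors, k, max_iter=100):
    # Step 1: seeds. Taking the first k documents is an assumption made
    # here for reproducibility; the algorithm allows any arbitrary choice.
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each document to the closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Step 4: stop when nothing changes (centroids are then stable too).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

vectors = [[0, 0], [10, 10], [0, 1], [10, 11]]
assignment, centroids = kmeans(vectors, k=2)
```

Each pass touches every document once per centroid, which is where the O(kTn) cost quoted later in this survey comes from.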
3.1.3 A variant of K-means clustering: bisecting K-means
1. Select a cluster to split. There are several ways to select which cluster to split, and no significant difference exists between them in terms of clustering accuracy. We normally choose the largest cluster or the one with the least overall similarity.
2. Employ the basic K-means algorithm to subdivide the chosen cluster into two subclusters.
3. Repeat step 2 a constant number of times, then perform the split that produces the clusters with the highest overall similarity.
4. Repeat steps 1, 2 and 3 until the desired number of clusters is reached.
3.1.4 Principal direction divisive partitioning [4].
This algorithm uses the tf-idf model to calculate the distance between pairs of documents or clusters. All the documents start in a single large cluster, which is then split into subclusters in a recursive fashion. At each stage, the algorithm:
1. selects an unsplit cluster to split;
2. splits that cluster into two subclusters.
The criterion for selecting the cluster to split is a measure of the average distance from the documents in a cluster to the cluster mean, or the cluster size if it is desired to keep the resulting clusters approximately the same size.
The entire cycle is repeated as many times as desired, resulting in a binary tree. The leaf nodes constitute a partitioning of the entire document set.
3.2 Suffix Tree Clustering algorithm [1]:
The idea is that each document is an ordered sequence of words, and the tf-idf model loses this valuable information. With a suffix tree, we can exploit this property of documents to improve the quality of the clusters.
The STC algorithm includes the following 5 steps:
1. Document cleaning.
The string of text representing each web document is transformed using a light stemming algorithm. As mentioned before (in the model section), we transform the original document into a standard STC model document.
2. Identifying base clusters.
Create an inverted index of phrases from the web document collection using a suffix tree. The suffix tree of a collection of strings is a compact trie containing all the suffixes of all the strings in the collection. Each node of the suffix tree represents a group of documents and a phrase that is common to all of them. The label of the node represents the common phrase. Each node represents a base cluster.
3. Each base cluster is assigned a score that is a function of the number of documents it contains and the words that make up its phrase. For example:
\[ S(B) = |B| \cdot f(|P|) \]
where $|B|$ is the number of documents in base cluster B and $|P|$ is the number of words in phrase P that have a non-zero score. Words that are in the stop list, or that appear in too few (3 or fewer) or too many (more than 40% of the web document collection) documents, receive a score of 0. The function f penalizes single-word phrases, is linear for phrases that are two to six words long, and becomes constant for longer phrases.
4. Combine base clusters.
A binary similarity measure between base clusters is defined based on the overlap of their document sets. For example, consider two base clusters $B_m$ and $B_n$ with sizes $|B_m|$ and $|B_n|$, and let $|B_m \cap B_n|$ denote the number of documents common to both base clusters. The similarity of $B_m$ and $B_n$ is defined to be 1 if
\[ \frac{|B_m \cap B_n|}{|B_m|} > 0.5 \quad \text{and} \quad \frac{|B_m \cap B_n|}{|B_n|} > 0.5, \]
and 0 otherwise.
Two base clusters are connected if they have a similarity of 1. Using a single-link clustering algorithm, all connected base clusters are clustered together. All the documents in these base clusters constitute a web document cluster.
5. The clusters are scored based on the scores of their base clusters calculated in step 3. Report the top-scoring clusters (for example, 10 clusters).
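Step 4 can be illustrated with a short sketch. It assumes the base clusters have already been identified and are given as non-empty sets of document ids; the single-link merge is realized here as finding connected components with a small union-find, which is one possible way to implement it. The names `similar` and `merge_base_clusters` are hypothetical.

```python
def similar(bm, bn):
    """Binary similarity: 1 (True) if the overlap exceeds half of both sets."""
    common = len(bm & bn)
    return common / len(bm) > 0.5 and common / len(bn) > 0.5

def merge_base_clusters(base_clusters):
    """base_clusters: list of non-empty sets of document ids.
    Returns the merged clusters as a list of sets of document ids."""
    n = len(base_clusters)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Connect every pair of base clusters whose similarity is 1.
    for i in range(n):
        for j in range(i + 1, n):
            if similar(base_clusters[i], base_clusters[j]):
                parent[find(i)] = find(j)

    # The documents of all connected base clusters form one final cluster.
    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())

base = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {3, 4}]
final = merge_base_clusters(base)
```

Note that {1, 2, 3} and {3, 4} are not directly similar, but both are connected to {2, 3, 4}, so single-link merging places them in the same final cluster. Because a document can belong to several base clusters, the final clusters may overlap in general.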
3.3 Link based algorithm [18].
The idea is that web pages that share common links with each other are very likely to be tightly related. In this approach [16], we use the link-based document model and the cosine similarity measure.
For documents P and Q, the similarity is calculated with the following formula:
\[ \mathrm{Cosine}(P, Q) = \frac{P \cdot Q}{\|P\| \, \|Q\|} \]
where
\[ P \cdot Q = P_{out} \cdot Q_{out} + P_{in} \cdot Q_{in}, \]
\[ \|P\|^2 = \sum_i P_{out,i}^2 + \sum_j P_{in,j}^2 \quad \text{(the total number of out-links and in-links of document P)}, \]
\[ \|Q\|^2 = \sum_i Q_{out,i}^2 + \sum_j Q_{in,j}^2 \quad \text{(the total number of out-links and in-links of document Q)}. \]
For a document P and a cluster S:
\[ \mathrm{Similarity}(P, S) = \mathrm{Cosine}(P, C) = \frac{P \cdot C}{\|P\| \, \|C\|} = \frac{P_{out} \cdot C_{out} + P_{in} \cdot C_{in}}{\|P\| \, \|C\|} \]
where C is the centroid of cluster S,
\[ C_{out} = \frac{1}{|S|} \sum_{D \in S} D_{out}, \quad C_{in} = \frac{1}{|S|} \sum_{D \in S} D_{in}, \quad \|C\|^2 = \sum_i C_{out,i}^2 + \sum_j C_{in,j}^2, \]
and $|S|$ is the number of documents in cluster S.
The term "near common link of a cluster" means a link shared by a majority of the members of one cluster. If a link is shared by 50% of the members of the cluster, we call it a "50% near common link".
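The similarity and centroid definitions above can be illustrated with a small sketch. It assumes each document is given as two 0/1 lists (out-link and in-link vectors) of fixed length, and it reads "shared by x% of the members" as "at least x%". The function names `cosine_links`, `centroid` and `near_common_links` are hypothetical.

```python
import math

def cosine_links(p_out, p_in, q_out, q_in):
    """Cosine similarity over the concatenated out-link and in-link vectors."""
    dot = (sum(a * b for a, b in zip(p_out, q_out))
           + sum(a * b for a, b in zip(p_in, q_in)))
    norm_p = math.sqrt(sum(a * a for a in p_out) + sum(a * a for a in p_in))
    norm_q = math.sqrt(sum(a * a for a in q_out) + sum(a * a for a in q_in))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def centroid(vectors):
    """Component-wise mean of the member vectors of a cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def near_common_links(vectors, fraction):
    """Indices of links shared by at least `fraction` of the cluster members."""
    return [i for i, share in enumerate(centroid(vectors)) if share >= fraction]

# Document P has out-links 0 and 1 and in-link 0; Q has out-link 0 and in-link 0.
sim = cosine_links([1, 1, 0], [1, 0], [1, 0, 0], [1, 0])
shared = near_common_links([[1, 1, 0], [1, 0, 0], [1, 0, 1]], 0.5)
```

Because the vectors are binary, each centroid component is exactly the fraction of members that have the corresponding link, so the same centroid serves both the similarity computation and the near-common-link test.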
The link based algorithm includes 4 steps:
1. Filter irrelevant web documents.
A web document is regarded as irrelevant if the sum of its in-links and out-links is less than 2.
2. Use near common links of clusters to guarantee intra-cluster cohesiveness.
In the tests of [16], a 30% near common link was found to be appropriate. Every cluster is required to have at least one 30% near common link.
3. Assign each web document to a cluster to generate base clusters.
Each web document is assigned to an existing cluster if (a) the similarity between the document and the corresponding cluster is above the similarity threshold and (b) the document has a link in common with the near common links of the corresponding cluster. If no cluster meets these conditions, the document becomes a new cluster by itself. The centroid vector is used when calculating the similarity between a document and a cluster, and it is recalculated after a new document joins the cluster.
4. Generate final clusters by merging base clusters.
Two base clusters are merged recursively if they share a majority of their members. A merging threshold is used to control the merging procedure.
3.4 Association Rule Hypergraph Partitioning (ARHP) [4].
This algorithm uses a hypergraph H = (V, E) consisting of a set of vertices V and a set of hyperedges E, where V corresponds to the set of documents being clustered and each hyperedge corresponds to a set of related documents. An association rule discovery algorithm is used to find the frequent item sets. Association rules capture the relationships among items that are present in a transaction. The algorithm is composed of two steps:
1. Discover all the frequent item sets.
2. Generate association rules from these frequent item sets.
Support(C) is the number of transactions that contain C, where C is a subset of the item set I. An association rule is an expression of the form $x \Rightarrow y$, where $x, y \subseteq I$. The support S of the rule $x \Rightarrow y$ is defined as $\mathrm{Support}(x \cup y)/|T|$, where T is the set of transactions, and the confidence is defined as $\mathrm{Support}(x \cup y)/\mathrm{Support}(x)$. The task of discovering association rules is to find all rules $x \Rightarrow y$ whose support S is greater than a given minimum support threshold and whose confidence is greater than a given minimum confidence threshold.
Apriori is one such association rule algorithm, and it can be used to find related items (web documents).
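The support and confidence definitions can be checked on a toy example. In this sketch the items are document ids and each transaction is the set of documents that contain one feature word, following the frequent item set model of Section 2.4; the data and the names `support` and `rule_stats` are made up for illustration, and this is not an implementation of Apriori itself.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_stats(x, y, transactions):
    """Support and confidence of the rule x => y."""
    both = support(x | y, transactions)
    sx = support(x, transactions)
    s = both / len(transactions)
    confidence = both / sx if sx else 0.0
    return s, confidence

# Each transaction: the documents in which one (hypothetical) word occurs.
transactions = [
    {"d1", "d2"},
    {"d1", "d2", "d3"},
    {"d1"},
    {"d3"},
]
s, confidence = rule_stats({"d1"}, {"d2"}, transactions)
```

Here d1 and d2 occur together in 2 of the 4 transactions (support 0.5), and 2 of the 3 transactions containing d1 also contain d2 (confidence 2/3), so the rule passes, for example, a minimum support of 0.4 and a minimum confidence of 0.6.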
3.5 DC-tree algorithm [2]
This algorithm is implemented by building a document cluster tree (DC-tree). In the DC-tree, every leaf node represents a cluster made up of its entries (documents), and every non-leaf node represents a cluster made up of all its entries (subclusters or documents). It is an incremental algorithm: every incoming document is guided to its corresponding cluster, which is a leaf node.
The DC-tree has four parameters:
1. Branching factor (B): each non-leaf node contains at most B entries. Each entry is a subcluster or a document.
2. Two similarity thresholds (S1 and S2): 0 <= S1 < S2 <= 1.
3. Minimum number of children in a DC leaf node (M).
Algorithm for document insertion:
1. Identify the corresponding leaf node.
Start from the root node. The document iteratively descends the DC-tree by choosing the most similar child node whose similarity value is higher than the threshold S1. If no similarity value is higher than S1, insert the document as a new document leaf node in an empty entry of the current node. If there is no empty entry in this node (it would exceed B entries), then split the node into two groups.
2. Modify the leaf node.
If the document reaches a DC leaf node and the closest leaf entry X satisfies the similarity threshold requirement (S2), the document is combined with X, and the DC entry for X is updated to reflect the combination. If no such X is found and there is an empty entry in the leaf node, the document is simply inserted into this entry. Otherwise the node is split into two groups.
3. Modify the path from the leaf node to the root.
After a document is inserted into a leaf node, update all the non-leaf nodes on the path from the root to the leaf node. If there was no split, just add the entry to the node. If there was a split, two entries are added to the parent node instead of one. If there is no room in the parent, the split cascades upward until it reaches the root.
4. Some refinements to the clustering algorithms.
4.1. A grammar based phrase extraction algorithm [10].
Most document clustering techniques treat a document as a list of words, ignoring the order or context of the words. This algorithm captures some of the context of the words by using phrases rather than individual words as the features.
4.1.1 Parameters in the algorithm
1. N(wi) and N(wi, wj): the frequency counts of a word wi and of a bigram <wi, wj>, where a bigram is a pair of adjacent words.
2. The mutual information association measure between words wi and wj:
\[ A(w_i, w_j) = MI(w_i, w_j) = \log_2 \frac{N(w_i, w_j)/N_b}{N(w_i)\,N(w_j)/N_w^2} \]
Using this formula, the association weight is calculated for each of the bigrams in the training corpus. Each merged bigram, together with a new non-terminal symbol wk and its association weight A(wi, wj), is stored in the grammar table.
4.1.2 The grammar generation algorithm is as follows:
1. Make a frequency table of all the words and bigrams in the training corpus.
2. From the frequency table, calculate the association weight of each bigram.
3. Find the bigram <wi, wj> with the maximum positive association weight A(wi, wj). Quit the algorithm if none is found.
4. In the training corpus, replace all instances of the bigram <wi, wj> with a new symbol wk. This is now the new corpus.
5. Add <wi, wj>, wk, and A(wi, wj) to the grammar.
6. Update the frequency table of all the words and bigrams.
7. Go to step 2.
The merge operation creates a new symbol wk that can be used in subsequent merges. This approach achieves better performance than the traditional word-based approach: its results are more accurate as measured by entropy and F-measure.
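The grammar generation loop can be sketched as below. Several points are assumptions rather than details from [10]: the text does not define N_b and N_w, so they are taken here to be the total numbers of bigram and word tokens in the corpus; the new symbol wk is named by joining the two words with an underscore; the frequency table is simply recomputed on every pass; and a `max_rules` cap is added so the example always terminates quickly. All function names are hypothetical.

```python
import math
from collections import Counter

def association(corpus):
    """Return {bigram: MI association weight} for a list of token lists."""
    words = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
    nw, nb = sum(words.values()), sum(bigrams.values())  # assumed N_w, N_b
    return {
        (a, b): math.log2((n / nb) / (words[a] * words[b] / nw ** 2))
        for (a, b), n in bigrams.items()
    }

def replace_bigram(sent, bigram, symbol):
    """Replace every occurrence of `bigram` in `sent` with `symbol`."""
    out, i = [], 0
    while i < len(sent):
        if tuple(sent[i:i + 2]) == bigram:
            out.append(symbol)
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out

def build_grammar(corpus, max_rules=10):
    grammar = []
    for _ in range(max_rules):
        weights = association(corpus)             # steps 1, 2 and 6
        if not weights:
            break
        best = max(weights, key=weights.get)      # step 3
        if weights[best] <= 0:
            break                                 # no positive weight: quit
        symbol = "_".join(best)                   # new symbol wk (assumed naming)
        corpus = [replace_bigram(s, best, symbol) for s in corpus]  # step 4
        grammar.append((best, symbol, weights[best]))               # step 5
    return grammar, corpus

corpus = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["new", "york", "city"],
    ["the", "city"],
    ["the", "times"],
]
grammar, merged = build_grammar(corpus, max_rules=1)
```

After one merge, "new york" is treated as a single token, so a later pass could combine it with "city" into a three-word phrase. That is how the merge operation builds longer phrases out of earlier symbols.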
4.2 Feature coverage method:
Recall that in the tf-idf model we select the n terms with the highest weights in the document collection. For web document clustering, however, a web document often contains only ten words or fewer, and these words may not be included in the list of the n highest-ranked terms. This leads to a situation in which some documents have all-zero weights in their feature vectors. Let coverage be the percentage of documents that have at least one non-zero weight in their feature vectors. We need to improve this coverage percentage.
Let k be the approximate cluster size, so that the number of clusters is N/k, where N is the number of documents in the collection. The algorithm includes the following steps:
1. Randomly select a subset of documents of size m from the corpus.
2. Extract all the words in the documents. Remove stop words from the dictionary. Combine words with the same root. Count the document frequency of these words.
3. Set lower = k and upper = k.
4. Select all words with a document frequency in the range from lower to upper.
5. If the coverage of these words is larger than the target coverage threshold, quit the iteration. Otherwise set lower = lower - 1 and upper = upper + 1, and go to step 4.
After the execution of this method, we obtain the values of the variables "lower" and "upper". We use them as the two bounds in the document frequency table and select all the words between these two bounds to represent each document with its corresponding feature vector.
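Steps 3 to 5 can be sketched as follows. The sketch assumes the sampling, stop-word removal and stemming of steps 1 and 2 have already been done, so each document is a non-empty set of words. It also deviates slightly from the text: the test is "coverage reaches the target" (>=) so that a target of 100% can be met, and a guard is added to stop when the bounds can no longer widen. The name `select_bounds` is hypothetical.

```python
def select_bounds(doc_words, k, target):
    """doc_words: list of word sets. Widen [lower, upper] around k until the
    selected words cover at least `target` (a fraction) of the documents."""
    df = {}
    for words in doc_words:
        for w in words:
            df[w] = df.get(w, 0) + 1          # document frequency of each word
    max_df = max(df.values(), default=0)
    lower = upper = k                          # step 3
    while True:
        # Step 4: words whose document frequency lies within the bounds.
        selected = {w for w, n in df.items() if lower <= n <= upper}
        covered = sum(1 for words in doc_words if words & selected)
        if covered / len(doc_words) >= target:  # step 5
            return lower, upper, selected
        if lower <= 1 and upper >= max_df:
            return lower, upper, selected       # guard: nothing left to add
        lower, upper = lower - 1, upper + 1

docs = [{"a", "b"}, {"a", "c"}, {"a", "d"}, {"e"}, {"b"}]
lower, upper, selected = select_bounds(docs, k=2, target=0.8)
```

Starting at k keeps the words whose document frequency is close to the expected cluster size; such words are neither too rare to be shared within a cluster nor so common that they appear in every cluster.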
4.3 Scatter/Gather [20].
This is a mixed clustering algorithm made up of three major steps, each of which has its own sub-algorithms:
1. Find k centers.
2. Assign each document in the collection to a center.
3. Optimize the partition.
4.3.1 Find k centers.
We can use either of the following two algorithms to find the k centers.
1. Buckshot
This is a linear-time algorithm for finding k approximate centers in a web document collection. The idea is straightforward: choose a small random sample of $\sqrt{kn}$ documents from the collection and apply the cluster subroutine (AHC) to it. Because AHC is quadratic in the sample size, this yields k approximate cluster centers in O(kn) time. Although the method is not deterministic, different random samples of documents usually produce similar partitions, as statistical theory suggests.
2. Fractionation
This is another linear-time method for finding k approximate centers in a web document collection. It breaks a collection C of N documents into N/m buckets, where m > k, and then applies the cluster subroutine to each bucket. Since each bucket is much smaller than the original collection, clustering one bucket takes at most m² time, and the total time for one pass is (N/m) × m² = Nm. The resulting groups are treated as the items of the next pass, and the procedure is repeated until the number of clusters has been reduced to k. If each pass reduces the number of items by a factor p < 1, the overall time needed is O((1 + p + p² + …)mN), so this is also a linear-time algorithm.
4.3.2. Assign documents to the k centers
After obtaining the k centers, we use the assign-to-nearest algorithm to assign each document in collection C to its nearest center. This step needs at most kn time.
4.3.3. Refine the partition
1. Split.
Split groups that score poorly. The score can be the average similarity between documents in the cluster, or the average similarity of a document to the cluster centroid.
2. Join.
The join refinement merges each document group in a partition P with its most similar group.
3. Assign to nearest.
4. Iterate steps 1, 2 and 3 to improve the quality of the clusters.
Each step in this algorithm is O(kn), so it is linear. Since all three major steps of the Scatter/Gather algorithm are linear, the whole algorithm is linear.
5. How to choose an appropriate topic to present the clusters.
After the clustering algorithm has created the clusters, the documents in each cluster are ordered by their original rank. It is important to show users the clusters with meaningful titles. Different clustering approaches use different methods. With the tf-idf document similarity approach, we use the most representative word of all the documents. With phrase-based clustering or STC, we can use a representative phrase.
As an alternative, we can use a representative document in the cluster. We can choose:
1. The cluster centroid, or the document that is most similar to the actual centroid.
2. The highest-ranked document in the cluster. The reason is that if this document is not relevant, the rest of the cluster is very likely not relevant either.
3. The lowest-ranked document. If the lowest-ranked document is relevant, it is very likely that all documents in the cluster are relevant.
6. How to evaluate the quality of the resulting clusters.
Validating clustering algorithms and comparing the performance of different algorithms is complex because it is difficult to find an objective measure of cluster quality.
There are two common types of quality measure. The first type allows us to compare different sets of clusters without reference to external knowledge and is called an internal quality measure. For example, we can use a measure of "overall similarity" based on the pairwise similarity of the documents in a cluster. The second type is the external quality measure, which evaluates quality by comparing the clusters produced by the clustering technique to known classes.
There are two common external quality measures. One is entropy [13], which provides a measure of goodness for un-nested clusters or for the clusters at one level of a hierarchical clustering. The other is the F-measure, which is more oriented toward measuring the effectiveness of a hierarchical clustering.
6.1. Entropy
For each cluster, the class distribution of the data is calculated first. That is, for cluster j we calculate $p_{ij}$, the probability that a document in cluster j belongs to class i. Then, using this class distribution, the entropy of each cluster j is calculated using the following formula:
\[ E_j = -\sum_i p_{ij} \log(p_{ij}) \]
where the sum is taken over all classes.
The total entropy for a set of clusters is calculated as the sum of the entropies of the clusters, each weighted by the size of the cluster:
\[ E_{cs} = \sum_{j=1}^{m} \frac{n_j E_j}{n} \]
where $n_j$ is the size of cluster j, m is the number of clusters, and n is the total number of documents.
From the formula, we can see that the best quality (zero entropy) is reached when all the documents in each cluster fall into the same class, as known before clustering.
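The two entropy formulas translate directly into code. In this sketch each cluster is given as the list of the known class labels of its documents; the natural logarithm is used, which is an assumption since the text does not state the base. The function names are hypothetical.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """E_j: entropy of the class distribution inside one cluster."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def total_entropy(clusters):
    """E_cs: cluster entropies weighted by cluster size n_j / n."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

# One pure cluster and one cluster split evenly between two classes.
clusters = [["sports", "sports"], ["sports", "travel"]]
e = total_entropy(clusters)
```

The pure cluster contributes 0 and the mixed one contributes log 2, so the total is half of log 2; lower values mean the clusters agree better with the known classes.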
6.2. F measure
This measure combines the ideas of precision and recall. We treat each cluster as if it were the result of a query and each class as if it were the desired set of documents for a query. We then calculate the recall and precision of that cluster for each given class. More specifically, for cluster j and class i:
\[ \mathrm{Recall}(i, j) = \frac{n_{ij}}{n_i} \]
\[ \mathrm{Precision}(i, j) = \frac{n_{ij}}{n_j} \]
where $n_{ij}$ is the number of members of class i in cluster j, $n_j$ is the number of members of cluster j, and $n_i$ is the number of members of class i.
The F measure of cluster j and class i is then calculated as follows:
\[ F(i, j) = \frac{2 \cdot \mathrm{Recall}(i, j) \cdot \mathrm{Precision}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)} \]
For an entire hierarchical clustering, the F measure of any class is the maximum value it attains at any node in the tree, and an overall value for the F measure is computed by taking the weighted average of all values of the F measure, as given by the following:
\[ F = \frac{\sum_i n_i \max_j F(i, j)}{\sum_i n_i} \]
where $n_i$ is the number of documents in class i and F(i, j) is the F measure of cluster j and class i.
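A short sketch of the overall F measure follows. It assumes classes and clusters are both given as sets of document ids; for a hierarchical clustering, `clusters` would contain the document set of every node in the tree. The function name `f_measure` is hypothetical.

```python
def f_measure(clusters, classes):
    """Overall F: class-size-weighted average of the best F(i, j) per class."""
    n = sum(len(cls) for cls in classes)
    total = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            nij = len(cls & clu)
            if nij == 0:
                continue                      # F(i, j) is 0 for disjoint sets
            recall = nij / len(cls)
            precision = nij / len(clu)
            best = max(best, 2 * recall * precision / (precision + recall))
        total += len(cls) * best
    return total / n

classes = [{1, 2}, {3, 4}]
perfect = f_measure([{1, 2}, {3, 4}], classes)
mixed = f_measure([{1, 2, 3}, {4}], classes)
```

In the mixed case, class {1, 2} is best matched by {1, 2, 3} (recall 1, precision 2/3, F = 0.8) and class {3, 4} by {4} (recall 1/2, precision 1, F = 2/3), giving an overall value of about 0.73. Unlike entropy, higher values are better here.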
7. Evaluation and comparison of different algorithms.
Agglomerative Hierarchical Clustering (AHC) algorithms are probably the most commonly used. They are typically slow when applied to large document collections: single-link and group-average methods typically take O(n²) time, while complete-link methods typically take O(n³) time. AHC also needs a halting criterion, which is usually based on a predetermined constant (for example, halt when 10 clusters remain), and AHC is very sensitive to this criterion. If it mistakenly merges multiple "good" clusters, the resulting cluster will be meaningless to the user. In the web clustering domain, this kind of halting criterion often causes poor results. The clusters in the resulting hierarchy are non-overlapping, and the parent cluster contains only the general documents.
K-means is much faster. It is a linear-time clustering algorithm with time complexity O(kTn), where k is the number of desired clusters and T is the number of iterations. The single-pass method is O(Kn), where K is the number of clusters created. Both the basic and bisecting K-means algorithms are relatively efficient and scalable. In addition, they are so easy to implement that they are widely used in different clustering applications. Another advantage is that the K-means algorithm can produce overlapping clusters.
A major disadvantage of K-means is that it requires users to specify k, the number of clusters, in advance, which may be impossible to estimate in some cases. An incorrect estimate of k may lead to poor clustering accuracy. K-means is also not suitable for discovering clusters of very different sizes, which are very common in document clustering. Moreover, the K-means algorithm is sensitive to noise and outlier data objects, as they may substantially influence the mean value, which in turn lowers the clustering accuracy.
Suffix tree clustering is a linear-time clustering algorithm with time complexity O(n). It first builds a suffix tree from the document collection, so the time needed to build the tree and the memory needed to store it can be large. Zamir [1] uses the snippets returned by the search engine, instead of the whole documents, to build the tree. Each snippet contains 50 words on average, and 20 words after word cleaning, so this method dramatically decreases the size of the suffix tree. The precision of the result is not affected much when using snippets, so it is a good trade-off. Later in this survey, I introduce a real-world application that is implemented with STC (MetaCrawler).
The link based algorithm is a linear-time clustering algorithm with time complexity O(mn), where m is the number of iterations needed for the clustering process to converge (convergence is guaranteed by the K-means algorithm) and n is the number of pages processed by the clustering algorithm. Since m << n, it is a linear algorithm.
The Apriori algorithm is used to compute frequent itemsets in a document-based transaction database. This algorithm uses a level-wise search in which K-itemsets are used to explore (K+1)-itemsets in order to mine frequent itemsets from the database. A large number of frequent itemsets and the N scans of the database affect the speed of this algorithm. The quality of the resulting clusters is better than that of traditional clustering algorithms.
Figure 4. Execution time comparison of common clustering methods (courtesy of [3])
We can see that almost all the algorithms are linear except AHC, which is an O(n²) algorithm.
The following is the comparison of the entropy of different clustering algorithms for K = 16 (K is the number of clusters) [16].
Figure 5. Precision comparison of K-means, AHC and their variations (16 clusters) (courtesy of [16])
The following is the comparison of the entropy of different clustering algorithms for K = 32 (K is the number of clusters).
Figure 6. Precision comparison of K-means, AHC and their variations (32 clusters)
The following is the comparison of the entropy of different clustering algorithms for K = 64 (K is the number of clusters).
Figure 7. Precision comparison of K-means, AHC and their variations (64 clusters)
From the above statistics, the bisecting K-means algorithm always produces the best-quality clusters when compared to AHC and regular K-means. Bisecting K-means is also a linear algorithm (O(kn)). So among the traditional clustering algorithms (AHC, K-means), bisecting K-means is the best.
Comparison of AHC, K-means and STC using the F-measure:
Figure 8. Average precision of K-means, AHC and STC (courtesy of [1])
We can see that phrase-based clustering algorithms usually achieve better quality than word-based ones. STC achieves the best quality because it is a context-based algorithm.
8. A real-world web document clustering application.
MetaCrawler [21] is a web document clustering application developed by researchers from the University of Washington. It uses three steps to process a query:
1. It posts the user's query to multiple search engines in parallel, such as Google, AltaVista, FindWhat and so on.
2. It gets the results from these search engines and combines them into a refined ranked list.
3. It uses the STC clustering algorithm to cluster these web documents and gives each cluster a representative topic.
For example, a traveler comes to Dallas and wants to get some information about the city. He types the keyword "dallas" in the search area. Let us check what he gets.
Figure 9. Using MetaCrawler to cluster web search results.
From the search result, we can see that the web documents are organized into 10 clusters. The user can easily find the topic he is interested in. This is the power of web document clustering.
Conclusion:
This survey presents research on web document clustering. The aspects we are most concerned with are the quality of the resulting clusters and the time complexity of the algorithm. The tf-idf based algorithms use high-dimensional vectors to represent documents, which leads to a time complexity problem. So far, these algorithms can only be applied to small collections of documents. This means that to cluster the results of a web search (usually >100,000), we process only the top-ranked documents rather than the entire collection. Some other approaches, such as STC and frequent itemsets, suffer from memory problems. Every method has its own advantages and deficiencies. Some refinements have been made to improve efficiency. STC uses snippets instead of whole documents to solve the memory problem. Documents can be represented with phrases instead of words in the tf-idf model, which decreases the data dimension and increases accuracy (because this approach takes document context into account). Researchers have also invented divide-and-conquer algorithms, such as Scatter/Gather, which is introduced in this survey. The importance of web document clustering continues to grow with the rapid growth of the Internet. How to process web documents so as to improve cluster quality in a reasonable time is a key question in this field.
References:
1. Oren Zamir and Oren Etzioni. Web Document Clustering: A Feasibility Demonstration. 1998.
2. Wai-chiu Wong and Ada Wai-chee Fu. Incremental Document Clustering for Web Page Classification. July 1, 2000.
3. Jian Zhang, Jianfeng Gao, Ming Zhou, Jiaxing Wang. Improving the Effectiveness of Information Retrieval with Clustering and Fusion. 2001.
4. Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, Kyle Hastings, George Karypis. Partitioning-Based Clustering for Web Document Categorization. 1999.
5. Jun-Hui Her, Sung-Hae Jun, Jun-Feyog Choi, Jung-Hyun Lee. A Bayesian Neural Network Model for Dynamic Web Document Clustering. IEEE TENCON, 1999.
6. Noam Slonim, Nir Friedman, Naftali Tishby. Unsupervised Document Classification using Sequential Information Maximization. 2002.
7. Anton Leuski and James Allan. Improving Interactive Retrieval by Combining Ranked Lists and Clustering. 2000.
8. Anton Leuski. Evaluating Document Clustering for Interactive Information Retrieval. 2000.
9. Krishna Gade, George Karypis. Multilevel K-way Document Clustering: Experiments & Analysis. 1999.
10. J. Bakus, M. F. Hussin, M. Kamel. A SOM-Based Document Clustering Using Phrases. 2000.
11. Guihong Cao, Dawei Song, Peter Bruza. Suffix Tree Clustering on Post-retrieval Documents. 2003.
12. Steven Noel, Vijay Raghavan, C.H. Henry Chu. Document Clustering, Visualization, and Retrieval via Link Mining. 2001.
13. Taeho C. Jo. Evaluation Function of Document Clustering based on Term Entropy. 1997.
14. Hinrich Schutze, Craig Silverstein. Xerox Palo Alto Research Center. Projections for Efficient Document Clustering. 2001.
15. Margaret H. Dunham. Data Mining: Introductory and Advanced Topics. 2003.
16. Michael Steinbach, George Karypis, Vipin Kumar. A Comparison of Document Clustering Techniques. 2001.
17. Xiaofeng He, Hongyuan Zha, Chris H.Q. Ding, Horst D. Simon. Web document clustering using hyperlink structures. 2002.
18. Yitong Wang and Masaru Kitsuregawa. Link based clustering of web search results. 2002.
19. Benjamin C.M. Fung, Ke Wang, Martin Ester. Hierarchical document clustering using frequent itemsets. 2002.
20. Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. 1992.
21. http://www.metacrawler.com
22. Inderjit S. Dhillon, James Fan, Yuqiang Guan. Efficient clustering of very large document collections. 2001.
23. Ying Zhao, George Karypis. Criterion Functions for Document Clustering: Experiments and Analysis. 2002.
24. Chris Ding, Xiaofeng He. Cluster merging and splitting in hierarchical clustering algorithms. 2002.
25. Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. 1998.