A Survey of Web Document Clustering
Southern Methodist University
Department of Computer Science and Engineering
This survey presents the research advances in the web document clustering field in recent years. Several clustering algorithms are introduced: agglomerative hierarchical clustering, K-means, suffix tree based clustering, frequent item based clustering, and some other refined approaches. This survey also introduces the methods (Entropy and F measure) used to evaluate the quality of clusters. Some algorithm comparisons are made in this paper based on them.
With the booming of the Internet, the number of web pages has grown explosively. According to statistics from Google, there were about 3 billion web pages in the year 2003. Finding valuable information has become a challenge. Dozens of Internet search engines have been designed to help users locate valuable information, while there are still a lot of problems to be researched and improved. This survey will address the research that has been done on web document clustering technology in recent years.
Web document clustering is the core topic of this survey. It uses unsupervised learning to cluster a large amount of web pages into several groups. Let's take an example to illustrate why web document clustering is necessary. Users retrieve information from the Internet through search engines. In response to a query from a web client, Google will send back a huge number of pages. Although they are listed by order of importance, users still sometimes have to browse hundreds of web pages to find what they want. If we can group the web pages into clusters, users can skip the groups they are not interested in and do not have to browse too many web pages before reaching their target. This will help users run their queries efficiently.
However, the problem is: how do we group web pages? The answer is web document clustering. In this survey, we will cover the following aspects of web document clustering technology that have been developed in recent years:
Key requirements for web document clustering.
How to represent a document in a mathematical model.
Different kinds of web document clustering approaches.
Refinements to the clustering algorithms.
How to choose an appropriate topic to present the clusters.
How to evaluate the algorithms and resulting clusters.
Evaluation and comparison of different algorithms.
A real world web document clustering application.
1. Key requirements for Web document clustering
The algorithm should produce clusters that separate documents relevant to the user's query from irrelevant ones. The documents in one cluster should talk about the same topic.
The user should be able to determine which cluster he is interested in at a glance, so we need to give a topic name to each cluster. The topic name should be representative.
CPU and Memory: the algorithm should be efficient enough to deal with a large amount of documents. The user should not have to wait a long time for the result, and the method should be approachable by real world applications.
Since each document may have multiple topics, it is important to avoid confining each document to only one cluster.
2. Mathematical models of web documents
Different approaches use different mathematical models to represent web documents. Here are four common models that are used in recent research papers.
2.1 tf-idf model
The full name of tf-idf is term frequency - inverse document frequency. In this model, each document D is considered to be represented by a feature vector of the form (d1, d2, ..., dn), where each di is a word in D. The order of the di is based on the weight of each word. The formula below is the common weight calculating formula that is widely used, while in different approaches the formula is not exactly the same. Some extra parameters may be added to optimize the whole clustering performance; for example, in one approach the term weight is determined by combining the term's tf score and its normalized idf score.

Figure 1. Weight calculating formula in tf-idf model:

    weight(d) = tf(d) x idf(d) = tf(d) x log(n / df(d))

In the above formula, each factor is explained below:
tf(d) is the number of occurrences of the term d in the Web page P.
idf(d) is the inverse document frequency, where df(d) is the number of Web pages in which d occurs in the database, and n is the total number of Web pages in the database.
From this formula, we can see that the weight of each term is based on its inverse document frequency in the document collection and the occurrences of this term in the document. Normally we have a stop word dictionary to strip out the very common words. Also, we just select the n (Zamir uses 500) highest weighted terms as the feature vector.
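The weighting scheme above can be sketched in a few lines of code. This is a minimal illustration, not any surveyed paper's implementation; the function name, the tokenized input format, and the top-n default are assumptions.

```python
import math
from collections import Counter

def tfidf_features(docs, top_n=500):
    """Compute tf-idf weights per document and keep the top_n heaviest terms.

    docs: list of token lists (one list of words per web page, after stop
    word removal). Returns one {term: weight} dict per document."""
    n = len(docs)
    # document frequency: number of pages each term occurs in
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    features = []
    for doc in docs:
        tf = Counter(doc)
        # weight = tf * log(n / df); terms occurring everywhere get weight 0
        weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
        top = dict(sorted(weights.items(), key=lambda kv: -kv[1])[:top_n])
        features.append(top)
    return features
```

Note that a term appearing in every page receives idf = log(1) = 0 and therefore carries no weight, which matches the intuition that ubiquitous words cannot separate topics.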
2.2 Suffix Tree model
The whole web document is treated as a string. The string is a little different from the original HTML file: word prefixes and suffixes are deleted and plurals are reduced to singular. Sentence boundaries are marked and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped. The first step in the identification of base clusters is the creation of an inverted index of phrases for the web document collection. Figure 2 is an ST example; it is built on three phrases: (1) "cat ate cheese", (2) "cat ate mouse too", (3) "mouse ate cheese too" (courtesy of Zamir). The algorithm of suffix tree clustering will be introduced in the next section of this survey.
2.3 Link based model
From each web document in search result R, we extract all its out-links and in-links. Finally we get N out-links and M in-links over all URLs. Each web document P is represented as 2 vectors, Pout and Pin. The i-th item of vector Pout represents whether the web document P has the corresponding out-link: if it has, the i-th item is 1, else 0. The meaning is identical for Pin.
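The binary vector construction can be sketched as follows; the function name and input shapes are illustrative assumptions.

```python
def link_vectors(doc_links, all_out, all_in):
    """Represent each page as binary out-link / in-link vectors.

    doc_links: dict page -> (set of out-link URLs, set of in-link URLs).
    all_out: ordered list of the N out-link URLs over all pages.
    all_in: ordered list of the M in-link URLs over all pages.
    Returns dict page -> (Pout, Pin) with 0/1 entries."""
    vecs = {}
    for page, (outs, ins) in doc_links.items():
        pout = [1 if u in outs else 0 for u in all_out]
        pin = [1 if u in ins else 0 for u in all_in]
        vecs[page] = (pout, pin)
    return vecs
```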
2.4 Frequent item set model
In this model, each document corresponds to an item and each possible feature corresponds to a transaction. The entries in the table (the domain of the document attribute) represent the frequency of occurrence of a specified feature (word) in that document.
Figure: a collection of documents presented as a transactional database.
3. Different kinds of web document clustering approaches
For algorithms using the tf-idf model, we can use a similarity formula to compare a pair of documents. There are several popular similarity measures; the cosine measure is a common example:

Figure 3. Common similarity measurements.

    cosine(i, j) = sum_k (dik x djk) / (||di|| x ||dj||)

where i and j represent a pair of documents and dik is the weight of the k-th term in document i. Using one of those formulas and a proper threshold, we can easily determine whether two documents are similar. The idea of using the tf-idf model to represent documents is that the more common words two documents share, the more likely they are to be similar.
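As a minimal sketch of the cosine measure over the sparse {term: weight} representation (the function name and dict-based vector format are assumptions):

```python
import math

def cosine_similarity(wi, wj):
    """Cosine of the angle between two tf-idf vectors given as
    {term: weight} dicts; 1.0 means identical direction, 0.0 no overlap."""
    common = set(wi) & set(wj)
    dot = sum(wi[t] * wj[t] for t in common)
    ni = math.sqrt(sum(w * w for w in wi.values()))
    nj = math.sqrt(sum(w * w for w in wj.values()))
    if ni == 0 or nj == 0:
        return 0.0
    return dot / (ni * nj)
```

A clustering algorithm would then compare this value against a chosen threshold to decide whether two documents belong together.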
3.1 Hierarchical and partitional clustering
Hierarchical clustering and Partition (K-means) clustering are two clustering techniques that are commonly used for document clustering.
Algorithm for agglomerative hierarchical clustering:
1. Start with each document as an individual cluster.
2. Merge the most similar or closest pair of clusters (using the similarity or distance measure).
3. Step 2 is iteratively executed until all objects are contained within a single cluster, which becomes the root of the tree.
Instead of step 3, we can set a threshold to determine the number of clusters as the stopping condition of step 2.
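The steps above, with the threshold-based stopping variant, can be sketched as follows. This is an illustrative O(n^3) implementation with single-link cluster similarity; the function signature and the choice of single-link are assumptions.

```python
def ahc(items, similarity, threshold):
    """Agglomerative clustering: start with singletons, repeatedly merge the
    most similar pair of clusters, and stop once no pair of clusters is more
    similar than `threshold` (instead of merging to a single root)."""
    clusters = [[i] for i in range(len(items))]

    def cluster_sim(a, b):  # single-link: best pair across the two clusters
        return max(similarity(items[i], items[j]) for i in a for j in b)

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:  # halting criterion replaces step 3
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```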
Algorithm for K-means:
1. Arbitrarily select K web documents as seeds; they are the initial centroids of each cluster.
2. Assign all other web documents to the closest centroid.
3. Recompute the centroid of each cluster to get its new centroid.
4. Repeat steps 2 and 3 until the centroid of each cluster doesn't change.
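A minimal sketch of these four steps on dense vectors (the function name, seeding via random sampling, and the iteration cap are assumptions):

```python
import random

def kmeans(vectors, k, iters=100, seed=0):
    """Basic k-means: pick k seed documents as initial centroids, assign
    every vector to its nearest centroid, recompute centroids, and repeat
    until the assignment stabilises."""
    rnd = random.Random(seed)
    centroids = [list(v) for v in rnd.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        changed = False
        for i, v in enumerate(vectors):
            best = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centroids[c])))
            if best != assign[i]:
                assign[i] = best
                changed = True
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:  # keep old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:  # centroids no longer move: converged
            break
    return assign, centroids
```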
A variant of K-means is bisecting K-means:
1. Select a cluster to split. (There are several ways to select which cluster to split, and no significant difference exists in terms of clustering accuracy. We normally choose the largest cluster or the one with the least overall similarity.)
2. Employ the basic k-means algorithm to subdivide the chosen cluster.
3. Repeat step 2 a constant number of times, then perform the split that produces the clusters with the highest overall similarity.
4. Repeat the above steps 1, 2, 3 until the desired number of clusters is reached.
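A compact sketch of bisecting k-means, splitting the largest cluster each round and keeping the best of several trial 2-means splits. The split-quality score here (total squared distance to the sub-cluster centroid, lower is better) stands in for the paper's "overall similarity"; the function signature and trial count are assumptions.

```python
import random

def bisecting_kmeans(vectors, k, trials=5, seed=0):
    """Bisecting k-means: repeatedly pick the largest cluster, split it in
    two with basic 2-means, keeping the best of `trials` candidate splits."""
    rnd = random.Random(seed)

    def two_means(idx):  # one 2-means split of the clusters at indices idx
        cent = [list(vectors[s]) for s in rnd.sample(idx, 2)]
        for _ in range(20):
            halves = ([], [])
            for i in idx:
                d = [sum((a - b) ** 2 for a, b in zip(vectors[i], c)) for c in cent]
                halves[0 if d[0] <= d[1] else 1].append(i)
            for h in range(2):
                if halves[h]:
                    cent[h] = [sum(vectors[i][j] for i in halves[h]) / len(halves[h])
                               for j in range(len(cent[h]))]
        return halves

    def cohesion(idx):  # total squared distance to the centroid (lower = tighter)
        c = [sum(vectors[i][j] for i in idx) / len(idx) for j in range(len(vectors[0]))]
        return sum(sum((vectors[i][j] - c[j]) ** 2 for j in range(len(c))) for i in idx)

    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()          # split the largest cluster
        best = min((two_means(target) for _ in range(trials)),
                   key=lambda h: cohesion(h[0]) + cohesion(h[1])
                   if h[0] and h[1] else float("inf"))
        clusters.extend([best[0], best[1]])
    return clusters
```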
Principal direction divisive partitioning (PDDP).
In this algorithm, we use the tf-idf model to calculate the distance between pairs of documents or clusters. All the documents are in a single large cluster at the beginning; the algorithm then proceeds by splitting it, dividing the original cluster into subclusters in a recursive fashion. At each stage, it does the following:
1. Selects an unsplit cluster to split. The choice can be based on a threshold measuring the average distance between the documents in a cluster, or on the cluster size if it is desired to keep the resulting clusters at approximately the same size.
2. Splits that cluster into two subclusters.
The entire cycle is repeated as many times as desired, resulting in a binary tree. The leaf nodes constitute a partitioning of the entire document set.
3.2 Suffix Tree Clustering algorithm
The idea is that each document is an ordered sequence of words, and the tf-idf model loses this valuable information. With a suffix tree, we can exploit this document attribute to improve the quality of the clusters.
The STC algorithm includes the following steps:
Document cleaning.
The string of text representing each web document is transformed using a light stemming algorithm. As mentioned before (in the model section), we transform the original document into a standard STC string.
Identifying base clusters.
Create an inverted index of phrases from the web document collection using a suffix tree. The suffix tree of a collection of strings is a compact trie containing all the suffixes of all the strings in the collection. Each node of the suffix tree represents a group of documents and a phrase that is common to all of them. The label of the node represents the common phrase, so each node stands for a base cluster.
Each base cluster is assigned a score that is a function of the number of documents it contains and the words that make up its phrase. For example:

    s(B) = |B| x f(|P|)

where |B| is the number of documents in base cluster B and |P| is the number of words in phrase P that have a non-zero score. Words that appear in the stop list, or that appear in too few (3 or fewer) or too many (more than 40% of the web document collection) documents, receive a score of 0. The function f penalizes single-word phrases, is linear for phrases that are two to six words long, and becomes constant for longer phrases.
Combine base clusters.
Define a binary similarity measure between base clusters based on the overlap of their document sets. For example, given two base clusters B1 and B2 with sizes |B1| and |B2|, let |B1 n B2| represent the number of documents common to both base clusters. Define the similarity of B1 and B2 to be 1 if:

    |B1 n B2| / |B1| > 0.5  and  |B1 n B2| / |B2| > 0.5

Otherwise it is 0.
Two base clusters are connected if they have a similarity of 1. Using a single-link algorithm, all the connected base clusters are clustered together, and all the documents in these base clusters constitute a web document cluster.
The clusters are scored based on the scores of their base clusters that were calculated in the previous step. Report the top scoring clusters (for example, 10 clusters).
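The combine step above amounts to taking connected components of the overlap graph. A minimal sketch (the function name and the set-of-ids input format are assumptions):

```python
def stc_merge(base_clusters, overlap=0.5):
    """STC combine step: connect two base clusters when the overlap of their
    document sets exceeds `overlap` in both directions, then take connected
    components (single-link) as the final clusters.

    base_clusters: list of sets of document ids."""
    n = len(base_clusters)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            inter = len(base_clusters[i] & base_clusters[j])
            if (inter / len(base_clusters[i]) > overlap and
                    inter / len(base_clusters[j]) > overlap):
                adj[i].append(j)
                adj[j].append(i)
    seen, final = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], set()     # depth-first search for one component
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v])
        seen |= comp
        # the documents of a final cluster are the union of its base clusters
        final.append(set().union(*(base_clusters[v] for v in comp)))
    return final
```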
3.3 Link based algorithm.
The idea is that web pages which share common links with each other are very likely to be tightly related. In this approach, we use the link based model for documents and the Cosine similarity measure.
For documents P and Q, similarity is calculated with the following formula:

    Similarity(P, Q) = (P . Q) / (||P|| x ||Q||)

where ||P|| is computed from the total number of out-links and in-links of document P, and ||Q|| from the total number of out-links and in-links of document Q.
For a document P and a cluster S, Similarity(P, S) = (P . C) / (||P|| x ||C||), where C is the centroid of cluster S (the average of the link vectors of its members) and |S| is the number of documents in cluster S.
A near common link of a cluster means a link shared by the majority of the members of one cluster. If a link is shared by 50% of the members of the cluster, we call it a 50% near common link.
The link based algorithm includes 4 steps:
1. Filter irrelevant web documents:
A web document is regarded as irrelevant if the sum of its in-links and out-links is less than 2.
2. Choose a near common link threshold for each cluster to guarantee intra-cluster cohesion. In their tests, they found that a 30% near common link is appropriate: they require that every cluster should have at least one 30% near common link.
3. Assign each web document to a cluster, generating base clusters.
Each web document is assigned to an existing cluster if (a) the similarity between the document and the corresponding cluster is above the similarity threshold, and (b) the document has a link in common with the near common links of the corresponding cluster. If no such cluster meets the demands, the document becomes a new cluster itself.
The centroid vector is used when calculating the similarity between a document and a cluster. It is recalculated after a new document joins the cluster.
4. Generate final clusters by merging base clusters.
Recursively merge two base clusters if they share a majority of their members. A merging threshold is used to control the merging procedure.
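The near common link test used in steps 2 and 3 can be sketched as follows (the function name and the dict-based input are assumptions; `fraction=0.3` corresponds to the 30% setting above):

```python
def near_common_links(cluster_docs, links, fraction=0.3):
    """Return the links shared by at least `fraction` of a cluster's members,
    i.e. the '30% near common links' with the default setting.

    cluster_docs: list of document ids in the cluster.
    links: dict doc id -> set of its links (out-links and in-links)."""
    counts = {}
    for d in cluster_docs:
        for l in links[d]:
            counts[l] = counts.get(l, 0) + 1
    need = fraction * len(cluster_docs)
    return {l for l, c in counts.items() if c >= need}
```

A document then joins a cluster only if its own link set intersects this returned set (condition (b) above) and its cosine similarity to the centroid passes the threshold (condition (a)).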
3.4 Frequent item set based algorithm
In this algorithm, we use a hypergraph H = (V, E) consisting of a set of vertices V and a set of hyperedges E, where V corresponds to the set of documents being clustered and each hyperedge corresponds to a set of related documents. We use an association rule discovery algorithm to find the frequent item sets.
Association rules capture the relationships among items that are present in a transaction. The algorithm is composed of two steps:
1. Discover all the frequent item sets.
2. Generate association rules from these frequent item sets.
Support(C) is the number of transactions that contain C, where C is a subset of item set I. An association rule is an expression of the form x -> y, where x and y belong to I. The support S of the rule x -> y is defined as Support(x u y)/|T|, where T is the set of transactions, and the confidence is defined as Support(x u y) / Support(x). The task of discovering association rules is to find all rules x -> y whose support S is greater than a given minimum support threshold and whose confidence is greater than a given minimum confidence threshold.
Apriori is such an association rule algorithm to find related items (web documents).
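A minimal sketch of Apriori's level-wise frequent itemset discovery (the first of the two steps above); the function signature and the set-of-sets transaction format are assumptions:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: frequent k-itemsets generate candidate
    (k+1)-itemsets, and a candidate is kept only if every k-subset of it is
    itself frequent (the prune step) and its own support is high enough.

    transactions: list of sets of items. Returns {frozenset: support}."""
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {c for c in items
             if sum(c <= t for t in transactions) / n >= min_support}
    k = 1
    while level:
        for c in level:
            frequent[c] = sum(c <= t for t in transactions) / n
        # join step: union pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # prune step + support check
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and sum(c <= t for t in transactions) / n >= min_support}
        k += 1
    return frequent
```

The rule-generation step would then scan each frequent itemset and emit the rules x -> y whose confidence Support(x u y)/Support(x) clears the confidence threshold.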
3.5 DC-tree algorithm
This algorithm is implemented by building a document cluster tree (DC-tree). In the DC-tree, a leaf node represents a cluster made up of its entries (documents), and a non-leaf node represents a cluster made up of all its entries (subclusters or documents). It is an incremental algorithm: every incoming document is guided to its corresponding cluster, which is a leaf node.
In the DC-tree, there are these parameters:
1. Branching factor (B): a node contains at most B entries, where each entry is a subcluster.
2. Two similarity thresholds (S1 and S2): 0 <= S1 < S2 <= 1.
3. Minimum number of children in a DC leaf node (M).
Inserting a document proceeds in three steps:
1. Identify the corresponding leaf node.
Start from the root node. The document iteratively descends the DC-tree by choosing the most similar child node whose similarity value is higher than threshold S1. If no similarity value is higher than S1, insert the document as a new entry into an empty entry of the node. If there is no empty entry in this node (>B), then split the node in two.
2. Modify the leaf node:
If the document reaches a DC leaf node and the closest leaf entry (X) satisfies the similarity threshold requirement (S2), it will be combined with X, and the DC entry for X is updated for the combination. If no such X is found and there is an empty entry in the leaf node, then just insert the document into this entry. Otherwise split the node into two groups.
3. Modify the path from the leaf node to the root:
After a document is inserted into a leaf node, update all the non-leaf nodes that are included in the path from the root to the leaf node. If there is no split, just add the entry to this node. If there is a split, two entries will be added to the parent node instead of one. If there is no room in the parent, the split will cascade upward until reaching the root.
4. Refinements to the clustering algorithm
4.1 A grammar based phrase extraction algorithm.
Most document clustering techniques treat a document as a list of words, ignoring the order or context of the words. This algorithm captures some context of the words by using phrases rather than individual words as the features.
Parameters in the algorithm:
1. N(wi) and N(wi, wj): the frequency counts of a word wi and of a bigram <wi, wj>, where a bigram means a pair of adjacent words.
2. The mutual information association weight A(wi, wj) between words wi and wj:

    A(wi, wj) = log( P(wi, wj) / (P(wi) x P(wj)) )

Using this formula, the association weight is calculated for each of the bigrams in the training corpus. Each merged bigram is replaced with a new non-terminal symbol wk, and the bigram <wi, wj>, the new symbol wk, and the association weight A(wi, wj) are stored in the grammar.
The grammar generation algorithm is as follows:
1. Make a frequency table for all the words and bigrams in the training corpus.
2. From the frequency table, calculate the association weight for each bigram.
3. Find a bigram <wi, wj> with the maximum positive association weight A(wi, wj). Quit the algorithm if none is found.
4. In the training corpus, replace all the instances of the bigram <wi, wj> with a new symbol wk. This is now the new corpus.
5. Add <wi, wj>, wk, and A(wi, wj) to the grammar.
6. Update the frequency table of all the words and bigrams.
7. Go to step 2.
The merge operation creates a new symbol wk that can be used in subsequent merges. This approach achieves better performance than the traditional word based approach; its results are more accurate as measured by Entropy and F measure.
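The seven-step loop above can be sketched as follows. This is an illustrative greedy implementation, not the paper's code: the pointwise-mutual-information weight over raw counts, the merge cap, and the symbol naming are all assumptions.

```python
import math
from collections import Counter

def build_phrase_grammar(corpus, max_merges=10):
    """Greedy bigram merging: repeatedly replace the bigram with the highest
    positive association weight by a new non-terminal symbol, recording each
    merge as a grammar rule. corpus: list of token lists.
    Returns (grammar rules, rewritten corpus)."""
    corpus = [list(doc) for doc in corpus]
    grammar = []
    for step in range(max_merges):
        words = Counter(w for doc in corpus for w in doc)
        bigrams = Counter((a, b) for doc in corpus for a, b in zip(doc, doc[1:]))
        if not bigrams:
            break
        total = sum(words.values())

        def assoc(bg):  # pointwise mutual information over raw counts
            (a, b), nab = bg
            return math.log(nab * total / (words[a] * words[b]))

        best, n_best = max(bigrams.items(), key=assoc)
        if assoc((best, n_best)) <= 0:  # step 3: quit if no positive weight
            break
        new_symbol = f"<P{step}>"
        grammar.append((best, new_symbol))
        # step 4: rewrite every occurrence of the bigram with the new symbol
        for d, doc in enumerate(corpus):
            out, i = [], 0
            while i < len(doc):
                if i + 1 < len(doc) and (doc[i], doc[i + 1]) == best:
                    out.append(new_symbol)
                    i += 2
                else:
                    out.append(doc[i])
                    i += 1
            corpus[d] = out
    return grammar, corpus
```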
4.2 Feature coverage method.
Recall that in the tf-idf model, we select the n terms with the highest weight in the document collection. But for web document clustering, a web document usually contains just ten words or fewer, and these words may not be included in the n highest ranked list. This leads to a situation where some documents have all-zero weights in their feature vectors. Let's say coverage is the percentage of documents containing at least one non-zero weight in their feature vector. We need to improve the coverage percentage.
K is the approximate cluster size, so the number of clusters is N/K, where N is the number of documents in the collection. The algorithm includes the following steps:
1. Randomly select a subset of documents of size m from the corpus.
2. Extract all the words in the documents. Remove stop words using a dictionary. Combine words with the same root. Count the document frequency of each word.
3. Set lower = k and upper = k.
4. Select all words with document frequency in the range from lower to upper.
5. If the coverage of these words is larger than the target coverage threshold, quit the iteration. Otherwise set lower = lower - 1 and upper = upper + 1, and go to step 4.
After the execution of this method, we get the values "lower" and "upper". Use them as the two bounds in the document frequency table, and select all the words between these two bounds to represent each document with its corresponding feature vector.
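The widening iteration in steps 3-5 can be sketched as follows (the function name, the tokenized input, and the extra termination guard are assumptions):

```python
from collections import Counter

def coverage_bounds(docs, k, target_coverage=0.9):
    """Widen a document-frequency window [lower, upper] around k until the
    fraction of documents containing at least one in-window word reaches
    target_coverage. docs: list of token lists. Returns (lower, upper)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    lower = upper = k
    while True:
        selected = {w for w, f in df.items() if lower <= f <= upper}
        covered = sum(1 for doc in docs if selected & set(doc))
        # stop when coverage is reached, or the window already spans all
        # possible document frequencies (guard against an endless loop)
        if covered / len(docs) >= target_coverage or (lower <= 1 and upper >= len(docs)):
            return lower, upper
        lower, upper = max(lower - 1, 1), upper + 1
```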
4.3 Scatter/Gather.
It is a mixed clustering algorithm. The whole algorithm is made up of three major steps, and every step has some sub-algorithms. The major steps are:
1. Find k centers.
2. Assign each document in the collection to a center.
3. Optimize the partitions.
Step 1. Find k centers.
We can use either of the two following algorithms to find the k centers.
The first (called Buckshot in the Scatter/Gather paper) is a linear algorithm to find k approximate centers from a web document collection. The idea is straightforward: choose a small random sample of the documents (the number of documents is (kn)^(1/2)) from the collection. With the cluster subroutine (AHC), we can get k approximate cluster centers in O(kn) time. Although this method is not deterministic, several random samples of documents often produce similar partitions according to statistical theory.
The second (the Fractionation procedure) is another linear time method to find k approximate centers. It iteratively breaks a collection C of N documents into N/m buckets, where m > k, and then applies the cluster subroutine to each bucket. Since each bucket is much smaller than the original collection, clustering one bucket needs at most m^2 time, and the total time for this sub-clustering is (N/m) x m^2, that is, Nm. We iteratively partition C in this way until the number of groups is less than k. If the rate at which the number of groups decreases per iteration is p < 1, then the overall time needed for this algorithm is O((1 + p + p^2 + ...)mN), so it is a linear algorithm.
Step 2. Assign documents to the k centers.
After we get the k centers, we use the assign-to-nearest-center algorithm to assign each document in collection C to its nearest center. We need at most kn time to finish this step.
Step 3. Refine the partition.
Split groups that score poorly. The score can be the average similarity between documents in the cluster, or the average similarity of a document to the cluster centroid. The join refinement merges document groups in a partition P into the most similar other group. Then assign each document to the nearest center again. Iterate these split, join, and assign operations to improve the quality of the clusters. Each step in this refinement is O(kn), so it is linear.
Since all three major steps in the Scatter/Gather algorithm are linear, the whole algorithm is linear.
5. How to choose an appropriate topic to present the clusters.
After the clustering algorithm creates the clusters, we order the documents by their original rank within each cluster. One important thing is that we should show users the clusters with meaningful titles. For different clustering approaches, we use different methods. For document similarity approaches with tf-idf, we use the most representative words of all the documents. In phrase based approaches or STC, we can use a shared phrase. As an alternative, we can use a representative document in the cluster. We can choose:
The cluster centroid, or the document that is most similar to the actual centroid.
The highest ranked document in the cluster; the reason is that if this document is non-relevant, the rest of the cluster is very likely non-relevant.
The lowest ranked document; if the lowest is relevant, then it is very likely that all the documents in this cluster are relevant.
6. How to evaluate the quality of the resulting clusters.
Validating clustering algorithms and comparing the performance of different algorithms are complex tasks because it is difficult to find an objective measure of cluster quality. There are two common types of measure. One type allows us to compare different sets of clusters without reference to external knowledge; it is called an internal quality measure. We can use a measure of overall similarity based on the pairwise similarity of the documents in a cluster. The other type is the external quality measure, which evaluates the quality of the clusters by comparing the clusters produced by the clustering technique to known classes.
There are two common external quality measures. One is entropy, which provides a measure of goodness for un-nested clusters or for the clusters at one level of a hierarchical clustering. The other is the F measure, which is more oriented toward measuring the effectiveness of a hierarchical clustering.
6.1 Entropy
For each cluster, the class distribution of the data is calculated first. For example, for cluster j we calculate pij, the probability that a document in cluster j belongs to class i. Then, using this class distribution, the entropy of each cluster j is calculated with the following formula:

    E_j = - sum_i pij log(pij)

where the sum is taken over all classes. The total entropy for a set of clusters is calculated as the sum of the entropies of each cluster weighted by the size of each cluster:

    E = sum_{j=1..m} (n_j / n) x E_j

where n_j is the size of cluster j, m is the number of clusters, and n is the total number of documents.
From this formula, we can see that the best quality is achieved when all the documents in each cluster fall into the same class that is known before clustering.
6.2 F measure
The F measure combines the precision and recall ideas. We treat each cluster as if it were the result of a query and each class as if it were the desired set of documents for that query. We then calculate the recall and precision of that cluster for each given class. More specifically, for cluster j and class i:

    Recall(i, j) = n_ij / n_i
    Precision(i, j) = n_ij / n_j

where n_ij is the number of members of class i in cluster j, n_j is the number of members of cluster j, and n_i is the number of members of class i.
The F measure of cluster j and class i is then calculated as follows:

    F(i, j) = (2 x Recall(i, j) x Precision(i, j)) / (Precision(i, j) + Recall(i, j))

For an entire hierarchical clustering, the F measure of any class is the maximum value it attains at any node in the tree, and an overall value for the F measure is computed by taking the weighted average of all values for the F measure:

    F = sum_i (n_i / n) x max_j F(i, j)

where n is the number of documents and F(i, j) is the F measure of cluster j and class i.
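Both external measures can be computed directly from the formulas above. A minimal sketch (the function names and input formats are assumptions; natural log is used for the entropy):

```python
import math

def cluster_entropy(clusters, classes):
    """Weighted entropy E = sum_j (n_j / n) * E_j with
    E_j = -sum_i p_ij log p_ij, where p_ij is the fraction of cluster j
    belonging to class i. Lower is better; 0 means every cluster is pure.

    clusters: list of lists of doc ids; classes: dict doc id -> class label."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = {}
        for d in c:
            counts[classes[d]] = counts.get(classes[d], 0) + 1
        ej = -sum((k / len(c)) * math.log(k / len(c)) for k in counts.values())
        total += len(c) / n * ej
    return total

def f_measure(clusters, classes):
    """Overall F: for each class take the best F(i, j) over clusters j,
    then average weighted by class size. Higher is better; 1.0 is perfect."""
    n = sum(len(c) for c in clusters)
    score = 0.0
    for lab in set(classes.values()):
        n_i = sum(1 for v in classes.values() if v == lab)
        best = 0.0
        for c in clusters:
            n_ij = sum(1 for d in c if classes[d] == lab)
            if n_ij == 0:
                continue
            recall, precision = n_ij / n_i, n_ij / len(c)
            best = max(best, 2 * recall * precision / (recall + precision))
        score += n_i / n * best
    return score
```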
7. Evaluation and comparison of different algorithms.
Agglomerative Hierarchical Clustering (AHC) algorithms are probably the most commonly used. They are typically slow when applied to large document collections: single-link methods typically take O(n^2) time, while complete-link methods typically take O(n^3) time. AHC also needs a halting criterion, based on predetermined constants (for example: halt when 10 clusters remain), and it is very sensitive to this criterion. If it mistakenly merges multiple clusters, the resulting cluster will be meaningless to the user. In the web clustering domain, this kind of halting criterion often causes poor results. The clusters in the resulting hierarchy are overlapping; the parent cluster contains only the more general documents.
K-means is much faster. It is a linear time clustering algorithm; its time complexity is O(kTn), where k is the number of desired clusters and T is the number of iterations. The single-pass method is O(Kn), where K is the number of clusters created. Both the basic and bisecting k-means algorithms are relatively efficient and scalable, and they are so easy to implement that they are widely used in different clustering applications. Another advantage is that the K-means algorithm can produce overlapping clusters. A major disadvantage of k-means is that it requires the user to specify k, the number of clusters, in advance, which may be impossible to estimate in some cases. An incorrect estimation of k may lead to poor clustering accuracy. Also, it is not suitable for discovering clusters of very different sizes, which are common in document clustering. Moreover, the k-means algorithm is sensitive to noise and outliers, as they may substantially influence the mean value, which in turn lowers the clustering accuracy.
Suffix tree clustering is a linear time clustering algorithm; its time complexity is O(n). It first builds a suffix tree based on the collection of documents, so the time to build the tree and the memory needed to store the STC tree are huge. Zamir used the snippets returned by the search engine to build the tree; each snippet contains 50 words on average, and 20 words after word cleaning. Compared to using the whole document, this method dramatically decreases the size of the suffix tree, and the precision of the result is not affected much when using snippets, so it is a good trade-off. Later in this survey, I will introduce a real world application that is implemented with STC (MetaCrawler).
The link based algorithm is a linear time clustering algorithm with time complexity O(mn), where m is the number of iterations needed for the clustering process to converge (the convergence is guaranteed by the k-means algorithm) and n is the number of pages in the clustering algorithm. Since m << n, it is a linear algorithm.
The frequent itemset approach uses the Apriori algorithm for computing frequent itemsets in a document based transaction database. This algorithm uses a level-wise search: k-itemsets are used to explore (k+1)-itemsets to mine frequent itemsets from the database. The large number of frequent itemsets and the N scans of the database affect the speed of this algorithm, but the quality of the resulting clusters is better than that of traditional clustering algorithms.
Figure 4. Executing time comparison of common clustering methods (courtesy of ).
We can see that almost all the algorithms are linear except AHC (it is at least O(n^2)).
Figures 5-7 show the comparison of the Entropy for different clustering algorithms for K=16, 32, and 64, where K is the number of clusters (courtesy of ).
Figure 5. Precision comparison of K-means, AHC and their variations (16 clusters)
Figure 6. Precision comparison of K-means, AHC and their variations (32 clusters)
Figure 7. Precision comparison of K-means, AHC and their variations (64 clusters)
From the above statistics, the bisecting K-means algorithm always gets the best quality clusters when compared to AHC and normal k-means. Also, bisecting K-means is a linear algorithm (O(kn)). So among the traditional clustering algorithms (AHC, k-means and their variations), bisecting K-means is the best.
Comparing AHC, K-means and STC using the F measure:
Figure 8. Average precision of K-means, AHC and STC (courtesy of ).
We can see that phrase based clustering algorithms usually get better quality than word based ones. STC gets the best quality, because it is a context based algorithm.
8. A real world web document clustering application.
MetaCrawler is a web document clustering application. It uses two steps to process a query:
1. It posts the user's query to multiple search engines in parallel, such as Google, AltaVista, FindWhat, and so on. It then gets the results from these different search engines and combines them into a refined list.
2. It uses the STC clustering algorithm to cluster these web documents and gives each cluster a representative title.
For example, a traveler comes to Dallas and wants to get some information about this city. He types the keyword "dallas" into the search area. Let's check what he gets.
Figure 9. Using MetaCrawler to cluster web search results.
From the search result, we can see that the web documents are organized in clusters. The user can easily find the topic he is interested in. This is the power of web document clustering.
This survey presents the research in the web document clustering domain. The most important aspects concerned are the quality of the resulting clusters and the time complexity of the algorithms. For tf-idf based algorithms, we use high dimensional vectors to represent documents, which leads to a time complexity problem, so until now we can only apply these algorithms to a small collection of documents. This means that if we want to cluster the results from a web search (usually >100,000 documents), we just process the top ranked documents rather than the entire collection. Some other approaches suffer from memory problems, such as STC and frequent itemsets. Every method has its own advantages and deficiencies, and some refinements have been made to improve efficiency: we use snippets instead of the whole document in STC to solve the memory problem, and we represent documents with phrases instead of individual words in the tf-idf model to decrease the data dimension and increase the accuracy (because this approach considers context related features of the document). Some divide and conquer algorithms have been invented by researchers, such as Scatter/Gather, which is introduced in this survey. Web document clustering continues to grow with the rapid growth of the Internet. How to process web documents to improve the quality of the clusters in a reasonable time is a key point in this field.
References
1. Oren Zamir and Oren Etzioni. Web Document Clustering: A Feasibility Demonstration.
2. Wai-chiu Wong and Ada Wai-chee Fu. Incremental Document Clustering for Web Page Classification.
3. Jian Zhang, Jianfeng Gao, Ming Zhou, Jiaxing Wang. Improving the Effectiveness of Information Retrieval with Clustering and Fusion.
4. Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, Kyle Hastings, George Karypis. Partitioning-Based Clustering for Web Document Categorization.
5. ... Hae Jun, Jun ... A Bayesian Neural Network Model for Dynamic Web Document Clustering. 1999 IEEE TENCON.
6. Noam Slonim, Nir Friedman, Naftali Tishby. Unsupervised Document Classification using Sequential Information Maximization.
7. Anton Leuski and James Allan. Improving Interactive Retrieval by Combining Ranked Lists and Clustering.
8. ... Document Clustering for Interactive Information Retrieval.
9. Krishna Gade, George Karypis. ...-way Document Clustering: Experiments & ...
10. ...-Based Document Clustering Using Phrases.
11. Guihong Cao, Dawei Song, Peter Bruza. Suffix Tree Clustering on Post-retrieval Documents.
12. Vijay Raghavan, C.H. Henry Chu. Document Clustering, Visualization, and Retrieval via Link Mining.
13. Taeho C. Jo. Evaluation Function of Document Clustering based on Term Entropy.
14. Hinrich Schutze, Craig Silverstein. Projections for Efficient Document Clustering. Xerox Palo Alto Research Center.
15. ... Data Mining: Introductory and Advanced Topics.
16. Michael Steinbach, George Karypis, Vipin Kumar. A Comparison of Document Clustering Techniques.
17. Xiaofeng He, Hongyuan Zha, Chris H.Q. Ding, Horst D. Simon. Web Document Clustering Using Hyperlink Structures.
18. Yitong Wang and Masaru Kitsuregawa. Link Based Clustering of Web Search Results.
19. Benjamin C.M. Fung, Ke Wang, Martin Ester. Hierarchical Document Clustering Using Frequent Itemsets.
20. Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections.
22. Inderjit S. Dhillon, James Fan, Yuqiang Guan. Efficient Clustering of Very Large Document Collections.
23. Ying Zhao, George Karypis. Criterion Functions for Document Clustering: Experiments and Analysis.
24. Chris Ding, Xiaofeng He. Cluster Merging and Splitting in Hierarchical Clustering Algorithms.
25. ... Extensions to the k-means Algorithm for Clustering Large Data Sets.