Written By: Zainab M. Al-Qenaei and Bela Sharma

School of Information Sciences

Department of Information Science and Telecommunications








INFSCI 2140 - Information Storage and Retrieval

Final Project

Summer, 2004









Professor Peter Brusilovsky












Written By: Zainab M. Al-Qenaei and Bela Sharma

Table of Contents

1. Introduction
2. Hierarchical Clustering Algorithm
3. Non-hierarchical Clustering Algorithm
4. K-Means
5. The C++ code
5.1 Given Information
5.2 Variables and Parameters used in the code
5.3 Program Flow
6. Drawbacks of the algorithm
7. Contribution
7.1 Bela Sharma
7.2 Zainab Al-Qenaei
8. Bibliography

1. Introduction


Clustering is the process of organizing objects into groups whose members are
similar in some way, while the objects in one group are dissimilar to the objects in
the other groups. Thus clustering is the process of grouping objects into subsets that
have meaning in the context of a particular problem. Clustering does not need any
prior information about the groups the objects belong to. It helps to uncover
relationships within very complex data sets.


Today, cluster analysis is used in many applications. The following are some of the
fields that use clustering algorithms:




Biology: classifying plants and animals given their features, and classifying genes.

Marketing: finding groups of customers with similar behavior, such as previous
buying records, in a large customer database.

Medical field: clustering diseases, cures for diseases and symptoms of diseases,
which can lead to useful taxonomies.

Psychiatry: diagnosing clusters of symptoms for successful therapy.

Library: ordering books.

City planning: identifying groups of houses by house type and geographical location.

Earthquake studies: clustering earthquake epicenters to identify dangerous zones.

WWW: clustering web log data to find groups of similar access patterns.


In general, whenever one needs to classify a large amount of data into manageable
and meaningful information, clustering is of great use. There are two main kinds of
clustering algorithms:

Hierarchical Clustering Algorithms

Non-Hierarchical Clustering Algorithms


2. Hierarchical Clustering Algorithm


Hierarchical clustering is a sequence of partitions in which each partition is
nested into the next partition in the sequence. Hierarchical algorithms can be either
agglomerative (bottom-up) or divisive (top-down). An agglomerative hierarchical
clustering algorithm starts with each object as a separate group. These groups are
successively combined based on similarity until only one group remains. The divisive
algorithm performs the task in the reverse order: it starts with all objects in one
group and splits it successively. The two main hierarchical clustering methods are the
single-link method and the complete-link method. The problem with hierarchical
algorithms is that once a merge has been done, it cannot be undone. This could cause a
problem if an erroneous merge is done. Moreover, merge points need to be carefully
chosen.
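The agglomerative procedure can be illustrated with a minimal sketch. It is not part of the project code: it assumes one-dimensional objects, and the names Cluster, singleLink, completeLink and agglomerate are hypothetical. Single link takes the smallest distance between any two members of two clusters, and complete link takes the largest.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Cluster = std::vector<double>;  // hypothetical: a cluster of 1-D objects

// Single-link distance: the smallest distance between any pair of members.
double singleLink(const Cluster& a, const Cluster& b) {
    double best = std::numeric_limits<double>::max();
    for (double x : a)
        for (double y : b)
            best = std::min(best, std::fabs(x - y));
    return best;
}

// Complete-link distance: the largest distance between any pair of members.
double completeLink(const Cluster& a, const Cluster& b) {
    double worst = 0.0;
    for (double x : a)
        for (double y : b)
            worst = std::max(worst, std::fabs(x - y));
    return worst;
}

// Starts with one cluster per object and repeatedly merges the two closest
// clusters (single link) until only `target` clusters remain.
std::vector<Cluster> agglomerate(const std::vector<double>& points, std::size_t target) {
    std::vector<Cluster> clusters;
    for (double p : points) clusters.push_back(Cluster{p});

    while (clusters.size() > target && clusters.size() > 1) {
        std::size_t bestI = 0, bestJ = 1;
        double bestDist = singleLink(clusters[0], clusters[1]);
        for (std::size_t i = 0; i < clusters.size(); ++i) {
            for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                const double d = singleLink(clusters[i], clusters[j]);
                if (d < bestDist) { bestDist = d; bestI = i; bestJ = j; }
            }
        }
        // Merge cluster bestJ into cluster bestI; the merge is never undone.
        clusters[bestI].insert(clusters[bestI].end(),
                               clusters[bestJ].begin(), clusters[bestJ].end());
        clusters.erase(clusters.begin() + static_cast<std::ptrdiff_t>(bestJ));
    }
    return clusters;
}
```

For the points 1, 2, 10, 11 and 30 with a target of two clusters, this sketch ends with 1, 2, 10 and 11 in one cluster and 30 alone in the other.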




3. Non-hierarchical Clustering Algorithm


Non-hierarchical clustering produces a single partition in which the relationships
between clusters are not determined. In this algorithm, an arbitrary number of
clusters is temporarily chosen. The members of each cluster are checked against a
selected distance measure and then relocated into the more appropriate clusters with
higher similarity. In this clustering method new clusters are formed in successive
passes either by merging or splitting clusters. The three main non-hierarchical
clustering methods are single-pass, relocation and nearest neighbor. Single-pass
methods produce clusters that depend upon the order in which the objects are
processed, and so will not be considered further. Relocation methods, such as
k-means, assign objects to a user-defined number of seed clusters and then
iteratively reassign objects to see if better clusters result. Such methods are prone
to reaching local optima rather than a global optimum, and it is generally not possible
to determine when or whether the global optimum has been reached. Nearest
neighbor methods assign objects to the same cluster as some number of their
nearest neighbors. User-defined parameters determine how many nearest neighbors
need to be considered, and the necessary level of similarity between nearest neighbor
lists. The problem with non-hierarchical clustering is that it can be difficult to
compare the quality of the clusters produced. Moreover, different initial partitions can
result in different final clusters.
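The order dependence of single-pass methods can be seen in a small sketch. It assumes one common variant, sometimes called leader clustering, in which an object joins the first cluster whose first member lies within a threshold. The function name singlePass and the one-dimensional data are hypothetical and were not part of the project.

```cpp
#include <cmath>
#include <vector>

// Single-pass ("leader") clustering of 1-D values: each value joins the first
// cluster whose leader (its first member) lies within `threshold`; otherwise
// it starts a new cluster. Clusters are returned in creation order.
std::vector<std::vector<double>> singlePass(const std::vector<double>& values,
                                            double threshold) {
    std::vector<std::vector<double>> clusters;
    for (double v : values) {
        bool placed = false;
        for (auto& c : clusters) {
            if (std::fabs(v - c.front()) <= threshold) {
                c.push_back(v);
                placed = true;
                break;
            }
        }
        if (!placed) clusters.push_back(std::vector<double>{v});
    }
    return clusters;
}
```

With a threshold of 1, the input order 1, 2, 3 gives two clusters, while the order 2, 1, 3 gives a single cluster, although the objects are the same.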


Distance Functions:


With both of these kinds of clustering algorithm, the most important issue is to
determine the similarity between objects, so that clusters can be formed from objects
with a high similarity to each other. The most common distance functions used to
determine the similarity are the Manhattan and Euclidean distance functions. A distance
function yields a higher value for pairs of objects that are less similar to one another.
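As a minimal sketch of the two distance functions named above, assuming objects are represented as equal-length vectors of numeric weights (the function names euclideanDistance and manhattanDistance are hypothetical, not taken from the project code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean distance: square root of the sum of squared differences.
// Assumes a and b have the same length.
double euclideanDistance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Manhattan distance: sum of absolute differences.
// Assumes a and b have the same length.
double manhattanDistance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += std::fabs(a[i] - b[i]);
    }
    return sum;
}
```

For the vectors (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7. Identical vectors give 0 under both functions.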


For this project, we were given a set of data in the form of html files. Our goal
was to cluster these files using the best algorithm available for clustering. The main
requirements we considered our clustering algorithm should satisfy were scalability,
dealing with different types of attributes, insensitivity to the order of input records,
handling of high dimensionality, interpretability and usability. After doing research
on the types of clustering algorithm and keeping our data in mind, we decided to use
the Euclidean distance function and the K-means algorithm.


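The Euclidean distance between two vectors x = (x_1, ..., x_n) and
y = (y_1, ..., y_n) is as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$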


4. K-Means


K-Means clustering generates a specific number of disjoint, flat, non-hierarchical
clusters. The K-Means method is numerical, unsupervised, non-deterministic and
iterative. After some research, we found that the following K-Means algorithm from
Dr. Mohammed Waleed Kadous' website best served the purpose of our project.


[Figure: the K-Means algorithm from Dr. Mohammed Waleed Kadous' website]

The K-Means algorithm shown above can be stated as follows:


We have n objects in a given d-dimensional metric space. We then need to
determine a partition of the objects into k clusters such that the objects in a cluster are
more similar to each other than to objects in different clusters. The value of k and the
clustering criterion are ours to specify. In our project we selected the value of k and
a clustering criterion; then, for each data object, we select the cluster that optimizes
the criterion. The k-means algorithm initializes k clusters by arbitrarily selecting one
object to represent each cluster. Each of the remaining objects is assigned to a
cluster, and the clustering criterion is used to calculate the cluster mean. These means
are used as the new cluster points, and each object is reassigned to the cluster that it
is most similar to. This continues until there is no longer a change when the clusters
are recalculated.
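The procedure stated above can be sketched as follows. This is an illustrative sketch, not the project's code: it assumes the clustering criterion is the squared Euclidean distance to the cluster mean, it takes the first k objects as seeds, and the names Point, squaredDistance and kMeans are hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical: one object as a vector of weights

// Squared Euclidean distance; the square root is not needed to find the nearest mean.
double squaredDistance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Returns the cluster index of each object.
// Assumes 0 < k <= objects.size() and that all objects have the same length.
std::vector<std::size_t> kMeans(const std::vector<Point>& objects, std::size_t k,
                                std::size_t maxIterations = 100) {
    // Seed the k means with the first k objects.
    std::vector<Point> means(objects.begin(),
                             objects.begin() + static_cast<std::ptrdiff_t>(k));
    std::vector<std::size_t> assignment(objects.size(), 0);
    const std::size_t dims = objects[0].size();

    for (std::size_t iter = 0; iter < maxIterations; ++iter) {
        bool changed = false;

        // Assignment step: each object goes to the cluster with the nearest mean.
        for (std::size_t i = 0; i < objects.size(); ++i) {
            std::size_t best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = squaredDistance(objects[i], means[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            if (assignment[i] != best) { assignment[i] = best; changed = true; }
        }

        // Stop once a full pass changes nothing (after at least one update).
        if (!changed && iter > 0) break;

        // Update step: each mean becomes the average of its members.
        std::vector<Point> sums(k, Point(dims, 0.0));
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < objects.size(); ++i) {
            ++counts[assignment[i]];
            for (std::size_t d = 0; d < dims; ++d) sums[assignment[i]][d] += objects[i][d];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // keep the old mean for an empty cluster
            for (std::size_t d = 0; d < dims; ++d)
                means[c][d] = sums[c][d] / static_cast<double>(counts[c]);
        }
    }
    return assignment;
}
```

The iteration cap is a safeguard added for the sketch. The stopping rule itself is the one described above: the loop ends when recalculating the clusters no longer changes any assignment.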


To improve our clustering method for our project, we removed stop-words. The
words that were not stop-words were then stemmed and checked against the dictionary
created for this project. For the words that were found in our dictionary, we built the
term vectors and then used Euclidean distance to calculate the distance between
vectors. We then used the K-means algorithm to generate the clusters.





5. The C++ code


This section explains the C++ code that was written to perform the K-means
clustering of the html files.


5.1 Given Information


The following were given prior to writing the code:

Six html files with a total of 471 subsections.

Ten clusters were to be used for clustering those subsections.

126 dictionary terms, extracted after scanning the html files, to be used as the
basis for clustering.

Stemming C++ code.


5.2 Variables and Parameters used in the code


These are the most important variables and data classes that were used in the program:


struct subsection {
    char filepointer[9];
    int cluster_no;
    int sub_no;
    int vector[126];
};

subsection arraysub[ARRAYSIZE];


An array of structures was used to save the following information about each
subsection:

filepointer, which equals inpHTML1, inpHTML2, inpHTML3, inpHTML4,
inpHTML5 or inpHTML6.

cluster_no, the number of the cluster to which the subsection is assigned.

sub_no, the order of the subsection in its html file; for example, if it equals 23,
this subsection is the 23rd subsection in that html file.

vector[126], which holds the term weights. The weight of each term is its
frequency, that is, the number of times the term occurs in the subsection.


int cluster[10][127];

A two dimensional array was used for the clusters. The first dimension represents the
number of clusters. Each cluster has 127 locations: the first 126 locations hold the
term weights of the dictionary terms, and the last location holds the number of
subsections assigned to the cluster. This number is needed when modifying the
weights of terms in each cluster according to the K-means algorithm.
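One way the last location could be used is sketched below. This is an assumption about the layout's purpose rather than the project's actual code, and the names kTerms, kSlots, accumulate and finalizeMean are hypothetical. Term weights are summed into a cluster row while the count is kept in the 127th slot, and the sums are then divided by the count.

```cpp
constexpr int kTerms = 126;         // number of dictionary terms, as in the report
constexpr int kSlots = kTerms + 1;  // the last slot holds the member count

// Adds one subsection's term weights to a cluster row and bumps its count.
void accumulate(int (&row)[kSlots], const int (&weights)[kTerms]) {
    for (int t = 0; t < kTerms; ++t) row[t] += weights[t];
    ++row[kTerms];
}

// Turns the accumulated sums into mean weights; the count slot is left as is.
void finalizeMean(int (&row)[kSlots]) {
    const int count = row[kTerms];
    if (count == 0) return;  // an empty cluster keeps its zero weights
    for (int t = 0; t < kTerms; ++t) row[t] /= count;  // integer division truncates
}
```

Because the array holds int values, the division truncates: a summed weight of 5 over 2 subsections becomes 2, not 2.5. Storing the weights as double would avoid that loss of precision.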


char sectionword[11]="<!--#2#-->";

This was used to determine the beginning and end of each subsection in an html file.
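A minimal sketch of how such a marker can delimit subsections follows. It assumes that the text between two consecutive markers is one subsection, which is one reading of the description above. The function splitSubsections is hypothetical and works on a string already read from the file.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the pieces of `html` that lie between consecutive occurrences of
// `marker`. Text before the first marker and after the last one is ignored.
std::vector<std::string> splitSubsections(const std::string& html,
                                          const std::string& marker = "<!--#2#-->") {
    std::vector<std::string> sections;
    if (marker.empty()) return sections;  // an empty marker cannot delimit anything

    std::size_t start = html.find(marker);
    while (start != std::string::npos) {
        start += marker.size();  // skip past the opening marker
        const std::size_t next = html.find(marker, start);
        if (next == std::string::npos) break;  // no closing marker
        sections.push_back(html.substr(start, next - start));
        start = next;  // the closing marker opens the next subsection
    }
    return sections;
}
```

For the input `head<!--#2#-->one<!--#2#-->two<!--#2#-->tail`, the sketch returns the two subsections `one` and `two`.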



char dictionary[126][20];

A two dimensional array to store the terms that were thought to be the most relevant
to the content of the html files.


int tracksub=0;

Counter used to track the order of a subsection in an html file.


int trackarray=0;

Counter used to track the total number of subsections used.


ifstream inpHTML1( "file1.htm", ios::in );

This line of code was used to open an html file to scan through its subsections.


ofstream outpTitle1( "title1.htm", ios::out );

This line of code was used for each html file to copy the title of the book in order to
print it out before each subsection in the cluster output file.


5.3 Program Flow


For each html input file the following steps were coded:


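[Flowchart: processing of each html input file]

Open html file.

Copy title.

Reach the beginning of the 1st subsection.

Input word in subsection.

Stop word? If yes, input the next word. If no, stem word.

Check against dictionary of terms.

Relevant term? If yes, increment the weight in subsection vector. If no, input the
next word.

Increment subsection number and continue with the next subsection.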
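The per-word steps above can be sketched as a function that turns the text of one subsection into a term-frequency vector. This is an illustration under assumptions, not the project's code: the names toLower, stemWord and termVector are hypothetical, and stemWord is only a stub standing in for a real stemmer such as the Snowball one.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Lower-cases a word so that comparisons ignore case.
std::string toLower(std::string word) {
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return word;
}

// Hypothetical stub: a real program would call a stemmer here.
std::string stemWord(const std::string& word) {
    return word;
}

// Counts how often each dictionary term occurs in `text`, skipping stop words.
// The dictionary is assumed to hold terms in their stemmed, lower-case form.
std::vector<int> termVector(const std::string& text,
                            const std::vector<std::string>& stopWords,
                            const std::vector<std::string>& dictionary) {
    std::vector<int> weights(dictionary.size(), 0);
    std::istringstream words(text);
    std::string word;
    while (words >> word) {  // whitespace-separated words; punctuation is not stripped
        const std::string lower = toLower(word);
        if (std::find(stopWords.begin(), stopWords.end(), lower) != stopWords.end())
            continue;  // stop word: skip it
        const std::string stem = stemWord(lower);
        const auto it = std::find(dictionary.begin(), dictionary.end(), stem);
        if (it != dictionary.end())  // relevant term: increment its weight
            ++weights[static_cast<std::size_t>(it - dictionary.begin())];
    }
    return weights;
}
```

With the stop words "the" and "and" and the dictionary "fuzzy", "logic", "neural", the text "The fuzzy logic and Fuzzy sets" yields the weights 2, 1 and 0.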


After that the K-means clustering algorithm was coded in the following way:


Seed points for the 10 clusters (the weight for each term in the vector) were chosen to
be the first 10 subsections, just as a starting point.


// Copy the term weights of the first 10 subsections into the 10 clusters.
for (int i = 0; i < 10; ++i) {
    for (int j = 0; j < 126; ++j)
        cluster[i][j] = arraysub[i].vector[j];
}


The following was then repeated for each subsection until the term weights of the
clusters no longer differed from the newly calculated weights. The new weights were
calculated as the average of the weights of the subsections that were included in a
cluster.



[Flowchart: K-means clustering loop]

Open a subsection.

Calculate the Euclidean distance between the subsection and each cluster.

Assign the subsection to the cluster of the least distance.

Update the weights of the clusters.

Same weights as before? If no, repeat from the first step. If yes, end the cluster
algorithm.

Finally, after the subsections were assigned to the least-distance cluster, those
subsections were copied to the cluster output files, with the title of the book before
each subsection.


6. Drawbacks of the algorithm


Examining the cluster output html files led to the following remarks:

When assigning subsections to clusters in the first loop of the algorithm, the
subsections were scattered around clusters 3-8. However, after 6 loops of the
algorithm they were all centered on clusters 3, 4 and 5.

The algorithm stopped after 6 loops: on the 6th pass over the clusters, the cluster
centers no longer changed, so the algorithm had converged. As noted in Section 3,
such a stable solution may be a local rather than a global optimum.

The dictionary of terms was extracted manually, so if different terms were used
there would definitely be different clustering outputs for the subsections.

The seed points used for the clusters were chosen to be the first 10 subsections,
which had a great effect on the later calculation of the new cluster centers. So
assigning different seed points will result in different output.

The algorithm requires the number of clusters to be decided in advance. That was
inefficient because the C++ program had to be hard-coded to accept 10 clusters. A
dramatic change in the program is required to see the output if more clusters are
assigned. Basically, an increase in the number of clusters results in a longer C++
program.
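The last two drawbacks could be eased along the following lines. This is a sketch of one possible design, not something the project implemented, and the names chooseSeeds and makeClusters are hypothetical. The number of clusters becomes a run-time parameter, and the seed points are drawn at random instead of always being the first 10 subsections.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Picks k distinct object indices to serve as seed points.
// Assumes k <= objectCount. The same `seed` value reproduces the same choice.
std::vector<std::size_t> chooseSeeds(std::size_t objectCount, std::size_t k,
                                     unsigned seed) {
    std::vector<std::size_t> indices(objectCount);
    std::iota(indices.begin(), indices.end(), std::size_t{0});  // 0, 1, 2, ...
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);
    indices.resize(k);  // keep the first k shuffled indices
    return indices;
}

// Cluster storage sized at run time instead of a fixed cluster[10][127].
std::vector<std::vector<double>> makeClusters(std::size_t k, std::size_t termCount) {
    return std::vector<std::vector<double>>(k, std::vector<double>(termCount, 0.0));
}
```

With this layout, trying 12 clusters instead of 10 is a change of one argument, for example makeClusters(12, 126), rather than a change to the program text. Running the clustering with several different seed values also gives a rough picture of how sensitive the output is to the seed points.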



























7. Contribution


7.1 Bela Sharma


I did thorough research on different types of clustering algorithm. I did my
research using some books in the library and through web sites. I found some
very good websites that had good information on clustering. Then Zainab and I met
and tried to divide our work. After I was confident about clustering, I tried to select
the best algorithm that would suit our needs for this project. We found that K-Means
was the best solution for us. Then I researched the different algorithms available for
K-means. Finally I liked the one from Dr. Mohammed Waleed Kadous' website. I
handed that to Zainab.



Then I wrote the pseudocode to be used for our program. For my next step I
read through all the html documents and created a dictionary of 126 words to be used
in our program. The next step for me was to do research on stop words and stemming.
I found a good stop-word list and stemming algorithm for our program on the
Internet. I also wrote a program to calculate Euclidean distance, which was then
integrated in our program by Zainab. I wrote the loop for K-means in pseudocode
format. Zainab integrated all the code. I designed and developed the web site for our
demo, i.e. the html files for our main cluster page, which has the links to all the
cluster files. Finally, I wrote part of this paper.


7.2 Zainab Al-Qenaei



After deciding on the project topic I started to read about clustering algorithms
and what exactly clustering is, since it was a new subject to me. After having enough
information and deciding as a team to cluster using the K-means algorithm, I did more
reading on several algorithms for K-means. For me the next step was to start
programming. I realized that the basic part of the algorithm is to have "stemmed
words", so I contacted Martin Porter by email (martin.porter@grapeshot.co.uk) and
asked him for the C++ version of the Porter stemming algorithm. He directed me to
the Snowball version of the stemmer at http://snowball.tartarus.org/projects.php.
After successfully downloading it and testing it in my actual C++ code, I started the
real programming given the dictionary terms from Bela. Professor Richard Franklin
from the Katz Graduate School of Business, University of Pittsburgh, also gave me
permission to use part of the code from his C++ course to read and write files in C++.
Given the pseudocode of the algorithm, I programmed the part that scans each
document, removes stop words and stems the rest. Then finally I programmed the
K-means clustering algorithm itself. One final step was realizing that the stemmer we
used in our program had some strange way of stemming, so I stemmed the dictionary
terms as they would have been stemmed by the stemmer we used. That's why we have
"fuzzi" instead of "fuzzy" in the code. Finally, after running the C++ program using
Microsoft Visual Studio .NET 2003, I added some lines of code to change the format
of the output clusters.




8. Bibliography


http://www.clustan.com/k-means.html

http://www2.cs.uregina.ca/~hamilton/courses/831/notes/clustering/clustering.html

http://www.statsoftinc.com/textbook/stcluan.html

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/

http://www.profc.udec.cl/~gabriel/tutoriales/rsnote/cp11/cp11-3.htm

http://www.cse.unsw.edu.au/~waleed/phd/tr9806/node67.html#SECTION00930000000000000000 (Mohammed Waleed Kadous)

H. M. Deitel and P. J. Deitel, C++ How to Program, Prentice Hall, 2003