Media Analyzer
David Witherspoon

University of Colorado at Boulder

Boulder, Colorado 80309







ABSTRACT

Users of document and web search engines often have to scan through many short descriptions (snippets) of the documents returned as relevant to their search query. This paper presents the ability to organize search results into groups, or clusters, which lets the user focus their search by selecting a cluster of related documents that interests them based on the initial query.

The paper presents a search application that allows the user to perform multiple search queries and view the resulting relevant documents both as traditional linear text and as a graphical visualization. The application allows the user to choose between a K-means clustering algorithm and Carrot2's Lingo clustering algorithm to cluster the documents returned by their search query.

Keywords

Information Retrieval, K-means, Lingo, Clustering, Lucene Indexing, Prefuse.

1. INTRODUCTION

1.1 Motivation

I am currently working on a Research and Development (R&D) project that has a basic search capability: the user issues a search query and is presented with a list of results. This approach breaks down when the query is too general, containing a small number of terms where those terms have a high document frequency. The document frequency of a term is the number of documents in the collection that contain that term [2]; a high document frequency indicates that the term exists in nearly every document, so the query returns many documents that are not relevant to what the user is trying to find.

After the query has been performed, the results are presented as a list on a page, and there are more than likely multiple pages. The user then has to read through all of the snippets to determine whether each document is relevant to what they are truly trying to find. In general, if users cannot find what they are looking for on the first page and need to continue through the next page of N, they will typically abandon the search and try again.

Another concern with a typical search query is the presentation of the document results to the user. In a typical search application the results are presented as a linear list of document snippets, with a limit on the number of documents displayed per page. The search component developed on the R&D project follows this typical search results format. The problem is that relevant documents could sit on pages 5, 10, and 15 of the total N pages of results, but the user will never scan through that many documents to find them. Finding relevant documents spread across many pages requires more time than any common user would spend.

I am looking at providing a solution to both of these problems that will help the current search component within our R&D project.

1.2 Contributions

The concerns above led to the development of the application presented in this paper. The first concern, a result list containing too many documents (relevant and not) for the user to look through, is addressed by clustering the documents that are similar. The advantage of organizing similar results into a group, or cluster, is that it helps the user locate a cluster of documents, based on the cluster name, that is relevant to their task without having to abandon the search query. Presenting each cluster of documents with a cluster name allows the user to scan many cluster topic names quickly and decide whether to look at the documents contained in that cluster. Once the user has selected a cluster that interests them, they can see all of the documents that are part of the cluster, which should therefore be relevant to what they are looking for.

The clustering algorithms I will be using are my own implementation of K-means and an integration of Carrot2's Lingo clustering algorithm.

This leads to the second aspect of the project: the way the data is presented to the user. In this application I provide two ways of presenting the clusters of documents from the performed search: a text-based display and a graphical visualization. The text-based display shows each cluster name as a header, with the documents belonging to that cluster presented below it. This gives the user the ability to glance at a cluster's name to see whether they might be interested. The second approach presents the results graphically using Prefuse. This view allows the user to see all of the clusters linked off the search query, and then all of the documents linked off their cluster.

Because the current search component is tightly integrated with company-specific code, I will be creating a system from scratch. Also, because the data used by the R&D project is classified, I will be using the Reuters news data set. While developing the application I will keep in mind that it needs to be designed so that it can fit within our R&D project with few modifications.

2. RELATED WORK

2.1 Vector Space Model

To compare the similarities between documents, the idea arose of representing each document in a high-dimensional space. The structure used to support this representation is the Vector Space Model (VSM). The VSM is a technique that transforms the problem of comparing textual documents into a problem of comparing algebraic vectors in a high-dimensional space [4] [1] [7]. When a document is indexed, each of its unique terms and the number of occurrences of each term within the document are kept within the VSM. We can then compare term frequencies, term by term, between multiple documents to determine how similar two documents are to each other. This concept is used within the application to compute document similarity in the k-means and Lingo clustering algorithms. We will see later how the Lucene indexing is set up to store the vector space model, or term frequency vector, for each document; the application then uses this information to calculate the similarities of the documents.
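As a concrete illustration of this representation, the sketch below (plain Java, with a hypothetical `TermVector` class; not the application's actual code) reduces a document's text to the sparse term frequency vector that the VSM compares:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a document reduced to a sparse term frequency vector,
// the per-document structure that the Vector Space Model compares.
public class TermVector {
    // Tokenize on whitespace, lower-case, and count occurrences of each term.
    public static Map<String, Integer> of(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        // "oil" occurs twice, so its entry in the sparse vector is 2.
        Map<String, Integer> v = of("Oil prices rose as oil supply fell");
        System.out.println(v.get("oil"));
    }
}
```

Only terms that actually occur in the document appear in the map, which is exactly the sparsity property discussed for the Term Frequency Vector later in the paper.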

2.2 Suffix Tree Clustering

Suffix Tree Clustering (STC) is a clustering algorithm based on identifying the phrases that are common to groups of documents [8]. Each cluster is defined by the common phrase shared by all of the documents in that cluster: a sequential group of words or terms common to all of the documents, which is then assigned as the phrase for that cluster. Suffix Tree Clustering has three logical steps, the first of which is document cleaning [8]. Document cleaning removes terms that have a high document frequency and are therefore not unique to a few documents within the corpus. As in the first processing step of other clustering algorithms, this uses a stemmer and other filters to remove these common terms. The next step is identifying base clusters, which can be viewed as creating an inverted index of phrases for the corpus [8]; this is accomplished with a technique called suffix trees [8]. The final step is to combine the base clusters into clusters. This is done because base clusters may overlap, in that a document may contain the phrases of multiple base clusters. The algorithm therefore merges base clusters with high overlap into a new cluster. Other clustering algorithms perform a similar last step, called cluster pruning.
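The base-cluster idea can be illustrated with a much-simplified sketch (a hypothetical `PhraseIndex` class; a real STC implementation uses suffix trees, not the brute-force scan shown here, and handles phrases of any length, not just bigrams). Every contiguous two-word phrase is recorded along with the set of documents that share it, which is the "inverted index of phrases" notion:

```java
import java.util.*;

// Simplified illustration of STC base clusters: map each contiguous
// two-word phrase to the set of document ids that contain it. Phrases
// shared by many documents are candidate base clusters.
public class PhraseIndex {
    public static Map<String, Set<Integer>> bigrams(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            String[] w = docs.get(id).toLowerCase().split("\\s+");
            for (int i = 0; i + 1 < w.length; i++) {
                String phrase = w[i] + " " + w[i + 1];
                index.computeIfAbsent(phrase, k -> new HashSet<>()).add(id);
            }
        }
        return index;
    }
}
```

In real STC, base clusters whose document sets overlap heavily would then be merged, as described above.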


2.3 Lingo Clustering Algorithm

The Lingo clustering algorithm is a novel algorithm for clustering search results with an emphasis on the quality of each cluster's description [4] [3]. Lingo was designed to find the labels of the clusters first and assign the documents to the clusters last. This avoids the problem of documents being clustered together without a good label being available to describe the cluster. The Lingo clustering algorithm is made up of five major steps, and the algorithm (provided in pseudocode) can be found in [4]. The first step, as with almost all clustering algorithms, is preprocessing, which performs stemming, stop word removal, and other filtering. The algorithm then moves on to frequent phrase extraction, where frequent phrases are recurring ordered sequences of terms appearing in the input documents [4]; this is similar to the second step of the suffix tree clustering algorithm. During this process the Lingo algorithm looks for candidate cluster labels among these frequent phrases and single terms. The candidate labels must fulfill the following criteria [4]:

1. Appear in the input documents at least a certain number of times.

2. Not cross sentence boundaries.

3. Be a complete phrase.

4. Not begin or end with a stop word.

Once the frequent phrases have been discovered, they are used in the next step of the algorithm, cluster label induction. The cluster label induction process includes term-document matrix building, abstract concept discovery, phrase matching, and label pruning, as defined in [4]. After the cluster label induction process is completed there are k clusters, each with an assigned label. The next step is to use the Vector Space Model to assign documents to the cluster labels [4]. A document is assigned to a cluster if its similarity exceeds the Snippet Assignment Threshold, which is another control parameter. If a document does not exceed the threshold for any of the clusters, the Lingo algorithm assigns it to the cluster labeled Others. At this point the algorithm has assigned every document within the corpus to one of the k+1 clusters (the Others cluster makes the +1), and the final step is to sort the clusters by their score. A cluster's score is the label score multiplied by the number of documents within the cluster [4]. The Lingo clustering algorithm has now finished, providing a clustering of the documents contained in the search results, each with a human-friendly label.
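The threshold-based assignment step can be sketched as follows. This is an illustrative simplification (a hypothetical `LabelAssigner` class using a plain term-overlap score), not Carrot2's actual implementation or its VSM-based similarity measure:

```java
import java.util.*;

// Hedged sketch of the final Lingo assignment step: a document goes to the
// best-matching cluster label if its similarity exceeds a threshold,
// otherwise to "Others". Similarity here is a simple fraction of label
// terms present in the document, standing in for the real VSM score.
public class LabelAssigner {
    static double overlap(Set<String> docTerms, Set<String> labelTerms) {
        int shared = 0;
        for (String t : labelTerms) {
            if (docTerms.contains(t)) shared++;
        }
        return labelTerms.isEmpty() ? 0.0 : (double) shared / labelTerms.size();
    }

    public static String assign(Set<String> docTerms, List<Set<String>> labels,
                                List<String> labelNames, double threshold) {
        String best = "Others";          // fallback cluster
        double bestScore = threshold;    // must strictly exceed the threshold
        for (int i = 0; i < labels.size(); i++) {
            double s = overlap(docTerms, labels.get(i));
            if (s > bestScore) {
                bestScore = s;
                best = labelNames.get(i);
            }
        }
        return best;
    }
}
```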

2.4 K-means Clustering Algorithm

K-means is different from both of the other clustering algorithms discussed. The major difference is that k-means finds the clusters first instead of finding the labels of the clusters. K-means clustering starts with k clusters and assigns k randomly selected centroids, or means, to those clusters. The algorithm then looks at the documents within the corpus and compares each document's term frequency vector to the k cluster mean vectors to determine which cluster the document is most similar to. Once the documents have been assigned to clusters, the next step is to calculate the new mean of each cluster from the term frequency vectors of the documents that are part of that cluster; the new mean vector is then assigned to each cluster. The process of assigning the documents and calculating the new mean vectors is repeated until convergence [1] [5] [6]. After convergence has been met, the final step is to select the label of each cluster. This can be accomplished either by selecting a list of words common to all the documents in the cluster or by selecting the title of the document that is closest to the cluster's mean vector [5].

3. RESOURCES AND DATA SOURCES

3.1 Prefuse

Prefuse is a visualization toolkit that I will use to display the search query, the query's relationships to the different clusters, and each cluster's relationship to the documents it contains. Prefuse is available at http://prefuse.org/, where you can download the latest version and see many examples of visualizing your data.

3.2 Reuters Data Source

The data source I used was the XML version of Reuters-21578, which has well-defined XML elements and attributes. The data is easy to find on the web by searching for "Reuters 21578 xml dataset" in a standard search engine; for example, one location where I found it is http://modnlp.berlios.de/reuters21578.html. The data set contains 22 files, where 21 of the files contain 1,000 documents each and the last file contains 578 documents.

3.3 PostgreSQL

The database used within this application is PostgreSQL 8.3, which can be found at http://www.postgresql.org/. There are more recent versions of PostgreSQL, but this is the version used by the R&D project I am working on, so it made sense to keep the same version, since the plan is to integrate this work as a feature within that application. There is no reason you cannot use another relational database such as Oracle or SQL Server; any of them will work for what the application uses the database for, which is explained in section 4.

3.4 Lingo Clustering

The Lingo clustering algorithm is one of the clustering algorithms provided by Carrot2. Carrot2 is an open source framework for building search engines, and its Java API can be downloaded at http://project.carrot2.org/download-java-api.html. Carrot2 supports multiple clustering algorithms, but I decided to work with the Lingo clustering algorithm.

3.5 Lucene

Lucene is an open source search framework, which is used to perform the indexing and search capabilities within the application. Lucene can be downloaded from http://lucene.apache.org/.

4. PROJECT APPROACH

4.1 Data Gathering and Indexing

I started creating the application by locating and downloading the XML version of the Reuters data set, which provided me with a corpus containing 21,578 documents. Once the documents were downloaded, I needed to create the table within the database that would hold all of the documents and their docId. The docId is an attribute contained within the document on the Reuter element. After creating the table, I was ready to process the 21,578 XML documents. I created a file reader that reads the 22 files, extracts each document into a JDOM document, and stores the document in a collection. This collection of JDOM documents is used by two components. The first component stores the documents by docId in the table created in the PostgreSQL database; the application takes advantage of storing the complete document, which is discussed in section 4.4. The second component takes each document and uses Lucene to index it.

The process of indexing each JDOM document starts by creating a Plain Old Java Object (POJO) that will contain all of the data from the JDOM document, then assigning the POJO all of the elements and attributes contained in the JDOM document using XPath. Once this is completed I have a POJO that represents the original Reuter XML document, which I use to create a Lucene document. This abstraction allows me to switch the indexing component to something other than Lucene if needed. When creating the Lucene document, I determine what the fields are going to be and how they are going to be stored and indexed. The fields I decided to create are DocId, Title, Subject, and Body. All of these fields are stored, and all except DocId are indexed. The Body field is the only one for which I indicated that the TermVector should be stored. This tells Lucene to store the relative Term Frequency Vector for each document, which I use later in the application. The Term Frequency Vector is a vector of all the terms within the document and the number of times each term occurs in the document [2]. The Term Frequency Vector does not contain every term in the dictionary; it contains a term only if it exists in the document (so the term frequency is one or more).

Now that the Lucene Document has been created, I need the Analyzer and IndexWriter so that Lucene can write the document to the index. The Analyzer I am using is the SnowballAnalyzer, which uses the StandardTokenizer to split the Lucene document into individual tokens. A chain of filters then performs different filtering functions on the tokens: the first filter normalizes the tokens, then the tokens are lower-cased, then the stop word list is applied, and finally the SnowballFilter is applied. The stop word list contains words that are extremely common and do not provide much help in matching the user's query to the documents within the index [2]; the filter that applies the stop word list removes the tokens that match it. The SnowballFilter applies a stemmer to the words from the tokenizer, which is the process of applying heuristics to remove letters from a word in the hope of finding its root [2]. For example, the stemmer aims to reduce both cat and cats to the same word, cat. Now that the Analyzer is defined, I can use it and the location of the index to create the Lucene IndexWriter. I use the IndexWriter to add the Lucene Document created earlier, so that it can be processed by the Analyzer and indexed. At this point I have indexed all of the Reuter documents and stored the complete documents in the database.
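The filtering chain can be illustrated with a minimal stand-in (a hypothetical `SimpleAnalyzer`; the real application uses Lucene's SnowballAnalyzer, whose stop word list and Snowball stemmer are far more sophisticated than the placeholders here):

```java
import java.util.*;

// Illustrative stand-in for the analysis chain described above: tokenize,
// normalize, lower-case, drop stop words, then apply a crude suffix-stripping
// "stemmer". The stop word list and the single plural-stripping rule are
// placeholder assumptions, much simpler than Snowball stemming.
public class SimpleAnalyzer {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to"));

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            // Normalize: lower-case and strip non-letter characters.
            String t = raw.toLowerCase().replaceAll("[^a-z]", "");
            if (t.isEmpty() || STOP_WORDS.contains(t)) continue;
            // Naive stemming rule: strip a trailing "s" ("cats" -> "cat").
            if (t.endsWith("s") && t.length() > 3) t = t.substring(0, t.length() - 1);
            tokens.add(t);
        }
        return tokens;
    }
}
```

The same analyzer must be used at both index time and query time, which is why the search component in section 4.2 reuses it.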

4.2 Search

After storing and indexing the entire corpus, I added the capability for the user to enter a search query and get back the top 200 relevant documents. We start by creating a QueryParser that uses the same Analyzer used to index the documents to analyze the query statement provided by the user. The other part of the QueryParser is the definition of which field(s) will be searched; in this case we search over the Title and Body fields. When the user provides a query, the application uses the QueryParser to parse it and create a search statement containing the disjunction of all the resulting terms across both fields. If I had used the conjunction of the terms, the application would miss many documents that were relevant but did not contain all of the terms. It therefore makes more sense to use the disjunction of the terms, so that relevant documents are not left out of the result set. Next, the search statement is passed to the Lucene IndexSearcher, which searches across the entire corpus and returns a collection of the top 200 documents.
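The difference between disjunctive and conjunctive matching can be sketched as below (a hypothetical `DisjunctiveMatch` class; Lucene's QueryParser builds real query objects and scores matches rather than scanning strings):

```java
import java.util.Locale;

// Sketch of why the search uses a disjunction (OR) of query terms rather
// than a conjunction (AND): with OR semantics, a document matches if ANY
// query term appears in its title or body, so documents missing one of the
// query terms are not lost from the result set.
public class DisjunctiveMatch {
    public static boolean matches(String title, String body, String query) {
        String haystack = (title + " " + body).toLowerCase(Locale.ROOT);
        for (String term : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (haystack.contains(term)) return true;  // OR semantics
        }
        return false;
    }
}
```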

4.3 Clustering

4.3.1 K-means Clustering Algorithm

At this point we have the top 200 relevant documents and need to get the TermFreqVector for each document that was returned, since this information is used by the k-means clustering algorithm. K-means is a method of cluster analysis that aims to partition n documents into k clusters in which each document belongs to the cluster with the nearest mean []. My implementation of the k-means clustering algorithm starts by selecting the initial means of the k clusters. Instead of being completely random, I decided to select documents within the result set at a defined offset, where the offset is the number of documents divided by the number of clusters. Starting from the first document in the collection, I assign its term frequency vector to the first cluster; to find the next document, I add the offset, take the term frequency vector from that document, and assign it to the next cluster. I repeat this until all of the clusters have an initial mean, as seen in Figure 1.


Figure 1. Step 1: Assign Initial Means

Once the initial means have been established, the next step is to assign all of the documents to the cluster with the nearest mean using a similarity algorithm, as seen in Figure 2.



Figure 2. Step 2: Assign Documents to Clusters

The similarity algorithm I am using is the Euclidean distance between the cluster's mean vector and the document's term frequency vector. Since the term frequency vector does not contain every term in the dictionary, only the terms in that document, there may be terms in the document that are not in the cluster's mean vector and vice versa. Therefore, I create a collection S containing all of the terms from both the document's term frequency vector and the cluster's mean vector. I then compare all of the terms, summing the squares of the differences between each term's frequency in the document and its value in the cluster's mean. After summing across all the terms, I take the square root of that value and return it as the distance. The formula is provided below:






distance(doc, cluster) = sqrt( Σ over t in S of (tf_doc(t) − mean_cluster(t))² )    (1)
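A minimal sketch of this distance computation over sparse term frequency vectors (a hypothetical `SparseDistance` class; terms missing from one vector count as zero, which is what taking the union S accomplishes):

```java
import java.util.*;

// Euclidean distance between a document's term frequency vector and a
// cluster's mean vector, both stored sparsely as maps. S is the union of
// terms from both vectors; a term absent from one side contributes 0.
public class SparseDistance {
    public static double euclidean(Map<String, Double> doc, Map<String, Double> mean) {
        Set<String> s = new HashSet<>(doc.keySet());
        s.addAll(mean.keySet());                       // S = union of terms
        double sum = 0.0;
        for (String term : s) {
            double d = doc.getOrDefault(term, 0.0) - mean.getOrDefault(term, 0.0);
            sum += d * d;                              // squared difference
        }
        return Math.sqrt(sum);
    }
}
```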


At this point we look at each document and compare its term frequency vector with every cluster's mean to determine which cluster has the shortest distance. Once we determine which cluster is closest to the document, we assign the document to that cluster, and we continue this process until all of the documents are assigned to a cluster. Now that all of the documents are assigned, we go through each cluster and calculate its new mean vector: for each term across all of the documents in the cluster, we calculate the mean for that term, producing a new mean term vector for the cluster, as shown in Figure 3.



Figure 3. Step 3: Assign New Means to Clusters

We now compare the new mean term vector with the original mean term vector to determine whether we have reached convergence. If the mean term vector is unchanged, the documents should remain in the same clusters and we have reached convergence; if the mean term vector continues to change, the documents within that cluster must still be changing. If we have not reached convergence, we repeat steps 2 and 3 until we have. Once converged, we have the k clusters defined and the documents that belong to them, as shown in Figure 4.



Figure 4. Step 4: Clusters Converge
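Steps 1 through 4 above can be condensed into a compact sketch (a hypothetical `KMeansSketch` class; the real implementation works on Lucene TermFreqVectors rather than plain maps, but the loop structure is the same):

```java
import java.util.*;

// Compact sketch of the k-means loop described in steps 1-4: pick initial
// means at a fixed offset through the result set, assign each document to
// the nearest mean by Euclidean distance, recompute the means, and repeat
// until the assignments stop changing.
public class KMeansSketch {
    static double dist(Map<String, Double> a, Map<String, Double> b) {
        Set<String> s = new HashSet<>(a.keySet());
        s.addAll(b.keySet());
        double sum = 0;
        for (String t : s) {
            double d = a.getOrDefault(t, 0.0) - b.getOrDefault(t, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static int[] cluster(List<Map<String, Double>> docs, int k) {
        int offset = docs.size() / k;
        List<Map<String, Double>> means = new ArrayList<>();
        for (int c = 0; c < k; c++) {                          // step 1: initial means
            means.add(new HashMap<>(docs.get(c * offset)));
        }
        int[] assign = new int[docs.size()];
        boolean changed = true;
        while (changed) {                                      // repeat steps 2 and 3
            changed = false;
            for (int i = 0; i < docs.size(); i++) {            // step 2: nearest mean
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist(docs.get(i), means.get(c)) < dist(docs.get(i), means.get(best))) {
                        best = c;
                    }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            for (int c = 0; c < k; c++) {                      // step 3: new mean vector
                Map<String, Double> mean = new HashMap<>();
                int n = 0;
                for (int i = 0; i < docs.size(); i++) {
                    if (assign[i] != c) continue;
                    n++;
                    for (Map.Entry<String, Double> e : docs.get(i).entrySet()) {
                        mean.merge(e.getKey(), e.getValue(), Double::sum);
                    }
                }
                if (n > 0) {
                    final int count = n;
                    mean.replaceAll((t, v) -> v / count);      // per-term average
                    means.set(c, mean);
                }
            }
        }
        return assign;                                         // step 4: converged
    }
}
```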

At this point all of the clusters and their associated documents have been discovered, and the labels for the clusters need to be created. To come up with a cluster's label, the application searches through all of the documents associated with that cluster and determines which one is closest to the centroid using the Euclidean distance formula. Once that document has been determined, its title is used as the label for the cluster. This works well because the documents in the cluster are very similar and the titles have a limited number of terms.

4.3.2 Lingo Clustering Algorithm

I decided that it would be good to allow multiple algorithms to be used for clustering, depending on the user's choice. This allows different algorithms to be added to the application in the future without much integration work, and it is the reason I included the Lingo clustering algorithm: to prove that integrating a new clustering algorithm would work. After downloading the Lingo clustering algorithm and all of its dependent jars from the Carrot2 web site, I began integrating it into the application. I first integrated the search capability of the Lingo API into the application's search box. Once the application could let a user enter a search query, send the request to the Lingo API, and get the result set back, I was ready to integrate the result set into the Java objects I had already created for the k-means clustering. At that point I just needed a way to get all of the information from their result set into my result set so that the clusters and related documents could be displayed in the two views that had been created. Once the integration was completed, I could easily use the Lingo clustering algorithm to cluster the top 200 documents returned by the search query.

4.4 User Interface Design

The application is a thick client that provides the typical area to fill in your search query. Currently it allows the user to select between the k-means and the Lingo clustering algorithms. This can be seen in Figure 5.



Figure 5. Clustering Algorithm Selection

Once the algorithm has been selected and the user enters the search query, a new tab is created that contains both the text version of the results and the graphical visualization of the clusters and documents. The default is the text version of the clustered results, where the header of each section shows the name of the cluster, the number of documents that are a part of that cluster, and the score of the cluster. The score of the cluster can be implemented differently depending on the clustering algorithm.

An example of using the application with the search query "corp", shown below in Figure 6, demonstrates the results that are presented to the user. The user is also presented with a search header that shows the search terms in the tab label and indicates that the search result contains 200 documents and 41 clusters.



Figure 6. Text Based Results

In the text area of the results page you can see a list of all the clusters, denoted by their headers, with each document associated with the cluster listed below. Each document entry presents the title and a snippet of the body, which lets the user see whether they are interested in the document.

Switching over to the Visualization tab under the search results, you have the ability to view the clusters and the documents in six different ways. Figure 7 shows the results presented in the cloud view. When the user selects nodes within the graph, the other nodes linked to them are highlighted. In Figure 7 I had selected the search query, which is highlighted in red; at the same time, all of the clusters connected to the search query are highlighted in tan, which allows the user to scan over the cluster titles and their documents much more quickly than a list.



Figure 7. Cloud Visualization


When you select a cluster node, both the query node and the document nodes that belong to that cluster are highlighted, as shown in Figure 8. Figure 8 also shows the graph presented in the tree visualization, demonstrating that the user can work in whichever visualization works best for them. As you can see, this allows the user to easily scan over the documents that are part of the cluster to see whether they are relevant to what they are looking for. In this example of the "corp" search, the user might be interested in finding more information on payouts that have occurred, so they would select the "PAYOUT" cluster node to see the links to the relevant documents.



Figure 8. Tree Visualization

After looking at the document titles, the user might be more interested in the contents of a document. The user would then select the document and right-click. As you can see in Figure 9, the Display Document context menu item is presented to the user, and clicking it displays the document.



Figure 9. Document Selection

When the user asks the application to display the document, this is where storing the DocId in both the index and the database pays off. The application takes the DocId from the document object, selects the record with that DocId, retrieves the original record from the database, and presents it in a dialog, as shown in Figure 10. At this point the user is able to see the complete contents of a relevant document from a general search query, by following the cluster of documents, without having to perform another, more specific query.



Figure 10. Displaying Document Contents

Finally, the application also supports presenting multiple queries to the user at the same time, as shown in Figure 11. Here I followed the standard of opening each new search query in a new tab, as most web browsers do. The user can always remove the search tabs they are no longer interested in. This allows you to compare search results between tabs by looking back and forth.


Figure 11. Multiple Searches

For navigation of the graph, I have provided an overview on the right-hand side of the application, which shows what is being displayed on the left-hand side within the white box. The user can select the box and move it around to change what is viewed on the left-hand side, or select and move the left-hand side directly. The user can also use the mouse wheel to zoom in and out, and finally the user can select the left-most button in the tab's toolbar to center the contents of the graph within the frame. All of these features allow the user to interact with the graph of the search results and more quickly find clusters of documents that are relevant to them.

5. EVALUATION AND RESULTS

I am evaluating the clusters generated by the two algorithms by comparing how closely each cluster generated by the application matches the set of previously assigned topics. In the Reuters corpus every document has already been judged and assigned a topic, and I am taking advantage of the work done by people to select a specific topic for each document. After the search query has been run, the top 200 relevant documents have been returned, and they have been clustered, I take those clusters and calculate the Precision, Recall, and F(1) measure for all the clusters. The formulas use the following variables:

RelevantItemsRetrieved = number of documents judged to be of topic T in cluster C.

RetrievedItems = number of documents in cluster C.

RelevantItems = number of documents judged to be of topic T in all the clusters.

The following formulas are used to calculate Precision (P), Recall (R), and F(1) for cluster C and topic T:

Precision(C, T) = RelevantItemsRetrieved / RetrievedItems

Recall(C, T) = RelevantItemsRetrieved / RelevantItems

F(1) = (2PR) / (P + R)
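A direct transcription of these formulas (a hypothetical `ClusterEval` class, not the application's actual evaluation code):

```java
// Per-cluster evaluation measures, transcribed from the formulas above:
// precision and recall for cluster C and topic T, and the F(1) measure
// combining them.
public class ClusterEval {
    // P = RelevantItemsRetrieved / RetrievedItems
    public static double precision(int relevantRetrieved, int retrieved) {
        return (double) relevantRetrieved / retrieved;
    }

    // R = RelevantItemsRetrieved / RelevantItems
    public static double recall(int relevantRetrieved, int relevant) {
        return (double) relevantRetrieved / relevant;
    }

    // F(1) = (2PR) / (P + R)
    public static double f1(double p, double r) {
        return (2 * p * r) / (p + r);
    }
}
```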

Performing the search query "Corp" with the Lingo clustering algorithm, I got Recall = 0.97, Precision = 0.80, and F(1) = 0.87. Performing the same query with the K-means clustering algorithm, I got Recall = 0.97, Precision = 0.91, and F(1) = 0.94. So there appears to be a slight precision increase when the clusters are created based on similarity to a centroid, instead of using the labels to define what the clusters are and how the documents are assigned to them.

6. CONCLUSION

The goal of this project was to let the user scan through multiple documents returned by a general search query more easily by clustering the similar documents. This allows the user to look at the cluster labels to determine whether the documents are relevant without having to scan all of the documents. By providing both the text version of the clusters and related documents and the graphical visualization of the documents, I believe I have achieved this goal.

By supporting both the k-means and the Lingo clustering algorithms, I have been able to experience two different ways of building the clusters. The Lingo clustering algorithm, with its focus on producing meaningful cluster names, is much more advanced than the technique I implemented. On the other hand, the work done on the k-means clustering algorithm and its cluster labeling was a great learning experience and produced good results.

7. FUTURE WORK

Presenting the search query results in a graphical visualization does help with scanning through the results more quickly, thanks to the clustering and the relationships between the query, clusters, and documents. In future development, the graphical display presents the opportunity to add more information that could help the user even more. With clustering algorithms that provide a relevance score to the query, the lines linking the two could reflect this score through different colors, a score label on the link, or even varying line thickness. Since this is a visual display, it allows more attributes to be added or data to be presented in different and more meaningful ways.

A future enhancement to the k-means clustering would be to look into using kd-trees to speed up the process of determining the k-means clusters. Another area to look into would be comparing the clusters generated by k-means versus k-medoids to see which produces better clusters.

Other functionality I would like to add to this application is Information Extraction, specifically Named Entity Extraction and relationship mining. I can see the possibility of clustering documents based on the named entities that are extracted: for example, the application could group documents by the people or locations found in them, and so on. This could be another way of sorting the information to find things that are relevant.

The last area of improvement I would look at is performing some type of cluster reduction step for the k-means algorithm. This would allow flexibility in the number of clusters to find, because in some cases the true number of unique, non-overlapping clusters might be much smaller than the k clusters the system originally starts with.

8. REFERENCES

[1] Han, J. and Kamber, M. 2006. Data Mining: Concepts and Techniques, Second Edition. Morgan Kaufmann, San Francisco, CA.

[2] Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.

[3] Osinski, S. Improving Quality of Search Results Clustering with Approximate Matrix Factorizations. Poznan Supercomputing and Networking Center, Poznan, Poland.

[4] Osinski, S., Stefanowski, J., and Weiss, D. 2004. Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition. Poznan University of Technology, Poland.

[5] Russell, S. and Norvig, P. 2003. Artificial Intelligence: A Modern Approach, Second Edition. Prentice Hall.

[6] Segaran, T. 2007. Programming Collective Intelligence: Building Smart Web 2.0 Applications, First Edition. O'Reilly.

[7] Steinbach, M., Karypis, G., and Kumar, V. 2000. A Comparison of Document Clustering Techniques. University of Minnesota, Technical Report #00-034.

[8] Zamir, O. and Etzioni, O. Web Document Clustering: A Feasibility Demonstration. Department of Computer Science and Engineering, University of Washington, Seattle, Washington.