University of Massachusetts Dartmouth
MULTIVARIATE DATA MINING USING INDEXED K-MEANS
A Thesis in
Computer Engineering
by
TR Satish Kumaar
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Master of Science
University of Massachusetts, Dartmouth
January 2003
Copyright 2003 by TR Satish Kumaar
I grant the University of Massachusetts Dartmouth the non-exclusive right to use the work
for the purpose of making single copies of the work available to the public on a
not-for-profit basis if the University's circulating copy is lost or destroyed.
____________________________________
TR Satish Kumaar
Date: ______________________________
We approve the thesis of TR Satish Kumaar
Date of signature
__________________________ __________________
Paul J. Fortier
Professor of Electrical and Computer Engineering
Thesis Advisor
__________________________ __________________
Hong Liu
Professor of Electrical and Computer Engineering
Graduate Committee
__________________________ __________________
Howard E. Michel
Assistant Professor of Electrical and Computer Engineering
Graduate Committee
__________________________ __________________
Dr. Dayalan P. Kasilingam
Associate Professor
Graduate Program Director, Department of
Electrical and Computer Engineering
__________________________ __________________
Antonio H. Costa
Professor
Chairperson,
Department of Electrical and Computer Engineering
________________________ __________________
Farhad Azadivar
Dean, College of Engineering
__________________________ __________________
Richard J. Panofsky
Associate Vice Chancellor for Academic Affairs and Graduate Studies
ABSTRACT
Multivariate data mining using indexed k-means
By TR Satish Kumaar
Raw information grows at an ever-increasing rate, dictating a need for tools to turn such
data into useful information and knowledge; this is where data mining comes into play. The
knowledge gained can be used for applications ranging from business management,
production control, and market analysis to engineering design and science exploration.
There are many approaches to knowledge discovery, including association rules, decision
trees, k-nearest neighbor, classification, clustering, and genetic algorithms.
The focus of this thesis is to mine a multivariate dataset, using indexing within a k-means
clustering algorithm, to discover rules. The results are also compared with those of the
ordinary k-means method so as to analyze and test them for accuracy and importance. The
thesis tests whether clustering is possible for a multivariate dataset with static
variables using the indexed k-means algorithm, and investigates whether better information
can be obtained with this process than with the regular k-means method. The indexed k-means
method gives more precise and useful information than the regular k-means method.
Questions that are not answered by the k-means method are answered by the indexed k-means
method. For example, the indexed method can state that "in the month of January one can get
Fish 120 at (42.25, 70.25) with a market value of 60.75 and a landed weight of 26.25 kg,"
while the regular k-means method did not produce a cluster for the month of January. The
indexed k-means method has a significant implication: it reduces the computational power
needed to cluster a huge dataset by replacing it with a smaller dataset, without losing
precious knowledge in the data, and it gives more useful and precise information than the
regular k-means method.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1 Emergence of data mining
1.2 What is data mining?
1.2.1 Architecture of data mining
1.2.2 Data mining versus Query tool
1.2.3 Data mining Functionalities
1.2.4 Classification of data mining
1.2.5 Practical Problems of data mining
1.2.6 Data mining issues and ethics
1.3 What is multivariate data mining?
1.3.1 When is multivariate analysis used?
1.4 What is K-mean mining?
1.4.1 How does k-mean work?
1.4.2 Why k-means is not enough?
1.5 Motivation
1.5.1 How can we achieve indexed k-mean method?
1.6 Research contribution
1.6.1 Assumptions
1.7 Thesis organization
CHAPTER 2 RELATED WORK
2.1 Data mining approach
2.2 Mining complex data in large data and information repositories
2.3 Clustering Analysis
2.3.1 Partitioning methods
2.3.1.1 k-means algorithm
2.3.1.2 K-medoids method
2.3.2 Hierarchical methods
2.3.3 Density-based methods
2.3.4 Grid-based methods
2.3.5 Model-based methods
2.3.6 EM algorithm
2.4 Indexed based algorithms
2.5 Which techniques to use for which tasks
2.6 Multidimensional data model
CHAPTER 3 ALGORITHM AND DATASET
3.1 Different forms of knowledge
3.2 Getting started
3.3 KDD Process
3.3.1 Data selection
3.3.2 Cleaning
3.3.3 Enrichment
3.3.4 Coding
3.3.5 Data mining
3.3.6 Reporting
3.4 Data Sources
3.4.1 Data modeling
3.4.2 Preprocessing
3.4.3 Data cleaning
3.5 Algorithm for indexed k-means
3.6 Discover of interesting patterns
3.6.1 Interestingness measure
3.7 Presentations and visualization of discovered patterns
3.8 Implementation Tools and software
CHAPTER 4 RESULTS AND ANALYSIS
4.1 Experimental Results
4.1.1 Output for indexed k-means
4.1.2 Output for k-means
4.2 Analyses
4.2.1 Interpreting the patterns found
4.2.2 Testing and Performance
4.2.3 Comparison between k-means and indexed k-means
4.3 Discussion
CHAPTER 5 FUTURE WORK
5.1 Conclusion
5.2 Future Work
5.3 Research Directions
APPENDIX A SOURCE CODE
BIBLIOGRAPHY
ACKNOWLEDGMENTS
The author wishes to express appreciation to:
My supervisor Dr. Paul J. Fortier, whose ideas and comments made this thesis possible
Members of my committee for their time and interest
My parents for their unconditional love and support
LIST OF FIGURES
Fig 2.1: Flowchart for K-means Algorithm
Fig 3.1: Flowchart for Indexed k-means Algorithm
Fig 4.1.1: 3-Dimensional Figure of Landed_kg, Percentage of Occurrence and Month
Fig 4.1.2: 3-Dimensional Figure of Latitude Degree, Longitude Degree and Fish ID
Fig 4.1.3: 3-Dimensional Figure of Latitude Degree, Longitude Degree and Landed Weight of fish
Fig 4.1.4: 3-Dimensional Figure of Latitude Degree, Longitude Degree and Market Values of Fish
Fig 4.1.5: 3-Dimensional Figure of Latitude Degree, Longitude Degree and Month
Fig 4.1.6: 3-Dimensional Figure of Market Values, Fish ID and Percentage of Occurrence of Fish
Fig 4.1.7: 3-Dimensional Figure of Market values, Landed Kg and Fish ID
Fig 4.1.8: 3-Dimensional Figure of Market values, Landed Kg and Month
Fig 4.1.9: 3-Dimensional Figure of Market values, Landed Kg and Percentage of Occurrence of Fish
Fig 4.1.10: 3-Dimensional Figure of Market values, Month and Percentage of Occurrence of Fish
Fig 4.1.11: 3-Dimensional Figure of Month, Landed KG and Fish Id
Fig 4.1.12: 3-Dimensional Figure of Month, Market Values and Fish Id
Fig 4.1.13: 3-Dimensional Figure of Percentage, Month and Fish Id
Fig 4.1.14: 3-Dimensional Figure of Latitude degree, Longitude degree and Percentage of Occurrence
Fig 4.1.15: 3-Dimensional Figure of Fish Id, Landed KG and Percentage of Occurrence
LIST OF TABLES
Table 2.1: Techniques and Tasks
Table 4.1: Output of K-means
Table 4.2: Analysis of Indexed K-means
Chapter 1
INTRODUCTION
1.1 Emergence of data mining
In one of his short stories, The Library of Babel, the South American writer Jorge
Luis Borges describes an infinite library, which consists of an endless network of
rooms with bookshelves. Although most of the books have no meaning and have
unintelligible titles like 'Axaxaxas mlo', people wander through it until they die,
and scholars develop wild hypotheses that somewhere in the library there must be
a central catalog, or that all the books that one could possibly think of must be
somewhere in the library. None of these hypotheses can be verified because the
library contains an infinite amount of data but no information [1]. The Library of
Babel may be interpreted as an interesting but cruel metaphor for the situation in
which modern humans find themselves: we live in an expanding universe in
which there is too much data (a growth driven by the mechanical production of
texts) and too little information. The development of new techniques to find the
required information in huge amounts of data is one of the main challenges for
software developers today [1].
1.2 What is data mining?
Against this background, great interest is being shown in the new field of 'data
mining' or KDD (knowledge discovery in databases). KDD is like mining, where
enormous quantities of debris have to be removed before diamonds or gold can
be found. Similarly, with a computer, one can automatically find the one
'information diamond' among the tons of data debris in one's database.
It was proposed at the first international KDD conference in Montreal in 1995
that the term 'KDD' be employed to describe the whole process of extracting
knowledge from data, a multidisciplinary field of research in which knowledge
means relationships and patterns between data elements. The term 'data mining'
is used exclusively for the discovery stage of the KDD process [1].
Knowledge discovery as a process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into
forms appropriate for mining)
5. Data mining (where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the interesting patterns representing knowledge)
7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined knowledge to the user)
1.2.1 Architecture of data mining
The architecture of a typical data mining system has the following components:
Database, data warehouse, or other information repository
Database or data warehouse server: It is responsible for fetching the
relevant data, based on the user's data mining request
Knowledge base: This is the domain knowledge that is used to guide the
search, or evaluate the interestingness of resulting patterns
Data mining engine: Consists of a set of functional modules for tasks
such as characterization, association, classification, cluster analysis, and evolution
and deviation analysis
Pattern evaluation module: It employs interestingness measures and
interacts with the data mining modules in order to focus the search towards
interesting patterns
Graphical user interface: This communicates between users and the data
mining system, allowing the user to interact with the system by specifying a query
or task, to provide information that helps focus the search, and to visualize the
patterns in different forms
1.2.2 Data mining versus query tools
Query tools and data mining tools are complementary. Normal queries can
answer questions like "who bought which product on which date?", while data
mining tools can answer questions like "what are the most important trends in
customer behavior?", which are much harder to answer using SQL [1]. Of course,
these questions could be answered using SQL by a process of trial and error, but it
could take days or months to find an optimal segmentation for a large database,
whereas a machine-learning algorithm can find the answer automatically in a
much shorter time. Once the data mining tool has found a segmentation, you can
use your query environment again to query and analyze the profiles found.
One could say that if you know exactly what you are looking for, use SQL; but if
you know only vaguely what you are looking for, turn to data mining [1].
1.2.3 Data mining functionalities
Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks. Data mining tasks can be classified into two categories:
descriptive and predictive. Descriptive mining tasks characterize the general
properties of the data in the database. Predictive mining tasks perform inference
on the current data in order to make predictions.
Data mining functionalities, and the patterns they can discover, are as follows:
(a) Concept/class description (characterization and discrimination): Data
can be associated with classes or concepts. Descriptions of a class or concept
in summarized, concise, and yet precise terms are called class/concept descriptions
[2]. These descriptions can be derived via (1) data characterization, (2) data
discrimination, or (3) both data characterization and discrimination.
(b) Data characterization is a summarization of the general characteristics or
features of a target class of data. The output of data characterization can be
presented in various forms of charts and tables. The resulting descriptions can
also be presented as generalized relations or in rule form (characteristic rules).
(c) Data discrimination is a comparison of the general features of the
target class with those of one or more contrasting classes. Discrimination
descriptions include comparative measures that help distinguish between the
target and contrasting classes; expressed in rule form, they are referred to as
discriminant rules.
(d) Association analysis: The discovery of association rules showing
attribute-value conditions that occur frequently together in a given set of data. It
is widely used for market basket or transaction data analysis.
(e) Classification and prediction: The process of finding a set of models that
describe and distinguish data classes or concepts, for the purpose of being able to
use the model to predict the class of objects whose class label is unknown. The
derived model can be presented in various forms, such as classification (IF-THEN)
rules, decision trees, mathematical formulae, or neural networks.
Classification can also be used for predicting the class label of data objects.
(f) Cluster analysis: Analyzes data objects without consulting a known class
label. The objects are clustered or grouped based on the principle of maximizing
the intraclass similarity and minimizing the interclass similarity. Each cluster that
is formed can be viewed as a class of objects, from which rules can be derived.
(g) Outlier analysis: A database may contain data objects that do not
comply with the general behavior or model of the data. These data objects are
outliers. Most data mining methods discard outliers as noise or exceptions. In
some applications, such as fraud detection, the rare events can be more interesting
than the more regularly occurring ones.
(h) Evolution analysis: Describes and models regularities or trends for
objects whose behavior changes over time. This may include characterization,
discrimination, association, classification, or clustering of time-related data;
distinct features of such an analysis include time-series data analysis, sequence or
periodicity pattern matching, and similarity-based data analysis.
1.2.4 Classification of data mining
Diverse disciplines contribute to data mining; hence data mining research is expected
to generate a large variety of data mining systems. A clear classification is therefore
needed to help users identify the systems that best match their needs.
Data mining systems can be categorized according to various criteria, as follows.
(a) Classification according to the kinds of databases mined: Systems can be
classified according to different criteria such as the data models, types of data, or
applications involved.
(b) Classification according to the kinds of knowledge mined: Systems can be
categorized based on the kinds of knowledge they mine, that is, on data mining
functionalities.
(c) Classification according to the kinds of techniques utilized: Systems can be
described according to the degree of user interaction (e.g. autonomous, interactive
exploratory, or query-driven systems) or by the methods of data analysis employed
(e.g. database-oriented or data warehouse-oriented techniques).
(d) Classification according to the applications adapted: Different
applications such as finance, telecommunications, DNA, stock markets, and e-mail
require the integration of application-specific methods.
1.2.5 Practical problems of data mining
A lot of data mining projects get bogged down in a forest of problems like:
Lack of long-term vision: "what do we want from our files in the future?"
Not all files are up to date: data vary greatly in quality
Struggle between departments: they may not want to give up their data
Poor cooperation from the electronic data processing department: "Just give
us the queries and we will find the information you want."
Legal and privacy restrictions: some data cannot be used for legal reasons
Files are hard to connect for technical reasons: there is a discrepancy
between a hierarchical and a relational database, or data models are not up to date
Timing problems: files can be compiled centrally, but with a six-month delay
Interpretation problems: the meaning or usage of the data is unknown
1.2.6 Data mining issues and ethics
The usage of data, particularly data about people, has serious ethical implications,
and practitioners of data mining techniques must act responsibly by making
themselves aware of the ethical issues that surround their particular application.
When applied to people, data mining is frequently used to determine who gets a
loan, a special offer, and so on. Certain kinds of discrimination (racial, sexual,
religious, and so on) are not only unethical but also illegal. However, the situation
is complex because it depends on the application. Using such information for
medical diagnosis is certainly ethical, but using the same information when
mining loan payment behavior is not. Even when sensitive information is
discarded, there is a risk that models will be built that rely on variables that can
be shown to substitute for racial or sexual characteristics. For example, people
frequently live in areas that are associated with particular ethnic identities, and so
using an area code in a data mining study runs the risk of building models that are
based on race even though racial information has been explicitly excluded from
the data.
1.3 What is multivariate data mining?
Multivariate data can be defined as a set of entities E, where the i-th element of E
consists of a vector with n variables. Each variable may be independent of, or
interdependent with, one or more of the other variables.
An n-dimensional dataset E comprises elements $E_i = (x_{i1}, x_{i2}, \ldots, x_{in})$.
Each observation $x_{ij}$ may be independent of or interdependent on one or more of
the other observations. Observations may be discrete or continuous in nature, or
may take on nominal values.
Multivariate data is difficult to visualize effectively because of:
Dimensional constraints: it is difficult to visualize data in more than 3 dimensions
Size of data set
Occlusion: data patterns are difficult to find
Saturation: data visualization is difficult
Scarcity: too few data points to find patterns
Examples of the types of multivariate data:
Physical interpretation, such as geographical data
A sequence of time-varying information, such as stock prices
Multivariate data analysis can be used for any table of data; even one with a few
rows and many columns is converted into a few meaningful plots that display the
information in the data, the real information, in a way that is easy to understand.
Typical applications:
Quality control and quality
optimization (food, beverages, paints, drugs)
Process optimization and process control
Development and optimization of measurement methods
Prospecting for oil, ore, water, minerals, etc.
Classification of bacteria, viruses, tissues, and other medical specimens
Analysis of economic and administrative tables
Design of new drugs
1.3.1 When is multivariate analysis used?
"Variate" refers to variables, and "multi" means several or many. Multivariate
analysis is appropriate whenever two or more variables are observed on a number
of individuals. The result is often called a "data set."
It is customary to envision a data set as being comprised of rows and columns.
The rows pertain to each observation, such as each person or each completed
questionnaire in a large survey. The columns pertain to each variable, such as a
response or an observed characteristic for each person.
Rows: records; individuals; cases; respondents; subjects; patients; etc.
Columns: fields; variables; characteristics; responses; etc.
Data sets can be immense; a single study may have a sample size of 1,000
respondents, each answering 100 questions. Here the data set would be 1,000 by
100, or 100,000 cells of data. Hence, the need for summarization is evident.
Simple univariate or bivariate statistics would not provide it: if an average were
computed for each variable, 100 means would result, and if all pairwise
correlations were computed, there would be close to 5,000 separate values.
Cluster analysis might yield five clusters. Multiple regression could identify six
significant predictor variables. Multiple discriminant analysis perhaps would find
seven significant variables, and so on. It should be evident that parsimony can be
achieved by using multivariate techniques when analyzing most data sets. Another
reason for using multivariate techniques is that they automatically assess the
significance of all linear combinations of the observed variables.
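The counts quoted in this example can be checked with a short calculation. The
symbols r (number of respondents) and q (number of questions) are introduced here
only for illustration.

```latex
\begin{align*}
\text{cells} &= r \times q = 1000 \times 100 = 100{,}000 \\
\text{means} &= q = 100 \\
\text{pairwise correlations} &= \binom{q}{2} = \frac{100 \times 99}{2} = 4950
\end{align*}
```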
1.4 What is k-means mining?
The k-means algorithm takes the input parameter k and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low. Cluster similarity is measured in regard to the mean
value of the objects in a cluster, which can be viewed as the cluster's center of gravity.
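One common way to write these quantities down is sketched here, together with the
squared-error criterion that is typically paired with them. The notation (C_i for the
i-th cluster, m_i for its mean, p for an object, E for the criterion) is assumed for
illustration and is not taken from the thesis.

```latex
\[
m_i = \frac{1}{|C_i|} \sum_{p \in C_i} p,
\qquad
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^{2}
\]
```

A smaller E means that objects lie closer to their own cluster means, that is, the
intracluster similarity is higher.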
1.4.1 How does k-means work?
The k-means algorithm proceeds as follows. First, it randomly selects k of the
objects, each of which initially represents a cluster mean or center. Each of
the remaining objects is assigned to the cluster to which it is the most
similar, based on the distance between the object and the cluster mean. It then
computes the new mean for each cluster. This process iterates until the criterion
function converges. (Typically, the squared-error criterion is used.)
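The steps just described can be sketched in Java as follows. This is a minimal
illustration, not the implementation from Appendix A: the class name KMeansSketch, the
method names, the fixed seed, and the cap of 100 iterations are all assumptions made
here, and "convergence" is approximated by stopping when no object changes cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KMeansSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the mean that is closest to the given object. */
    public static int nearest(double[] object, double[][] means) {
        int best = 0;
        for (int c = 1; c < means.length; c++) {
            if (squaredDistance(object, means[c]) < squaredDistance(object, means[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Returns the cluster label (0..k-1) assigned to every object. */
    public static int[] cluster(double[][] objects, int k, long seed) {
        int n = objects.length;
        int dims = objects[0].length;

        // Step 1: pick k distinct objects at random as the initial means.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) order.add(i);
        Collections.shuffle(order, new Random(seed));
        double[][] means = new double[k][];
        for (int c = 0; c < k; c++) means[c] = objects[order.get(c)].clone();

        int[] labels = new int[n];
        Arrays.fill(labels, -1);
        boolean changed = true;
        // Assumption: at most 100 passes, stopping early once assignments are stable.
        for (int iter = 0; iter < 100 && changed; iter++) {
            // Step 2: assign each object to the cluster with the closest mean.
            changed = false;
            for (int i = 0; i < n; i++) {
                int c = nearest(objects[i], means);
                if (c != labels[i]) {
                    labels[i] = c;
                    changed = true;
                }
            }
            // Step 3: recompute the mean of every cluster.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[labels[i]]++;
                for (int j = 0; j < dims; j++) sums[labels[i]][j] += objects[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old mean
                for (int j = 0; j < dims; j++) means[c][j] = sums[c][j] / counts[c];
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical one-dimensional data with two well-separated groups.
        double[][] objects = {{0}, {1}, {2}, {1000}, {1001}, {1002}};
        System.out.println(Arrays.toString(cluster(objects, 2, 42L)));
    }
}
```

Which of the two groups receives label 0 depends on the random initial selection, so
only the grouping itself, not the label numbers, should be relied on.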
1.4.2 Why is k-means not enough?
The k-means method can be applied only when the mean of a cluster is defined.
This may not be the case in some applications, such as when data with categorical
attributes are involved. The necessity for users to specify k, the number of
clusters, in advance can be seen as a disadvantage. The k-means method is not
suitable for discovering clusters with nonconvex shapes or clusters of very
different size. Moreover, it is sensitive to noise and outlier data points, since a
small number of such data can substantially influence the mean value.
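The sensitivity of the mean to outliers can be illustrated with a few numbers. The
values below are hypothetical and the class name MeanSensitivitySketch is an assumption
made for this sketch; the point is only that one extreme value moves the mean far away
from the bulk of the data.

```java
public class MeanSensitivitySketch {

    /** Arithmetic mean of the given values. */
    public static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Hypothetical measurements clustered around 12.
        double[] clean = {10.0, 11.0, 12.0, 13.0, 14.0};
        // The same measurements plus a single outlier.
        double[] withOutlier = {10.0, 11.0, 12.0, 13.0, 14.0, 120.0};

        System.out.println(mean(clean));       // prints 12.0
        System.out.println(mean(withOutlier)); // prints 30.0
    }
}
```

With the outlier included, the mean (30.0) no longer lies near any of the five regular
values, which is why a cluster center computed this way can be misleading.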
1.5 Motivation
To get a sense of how multivariate data mining can enrich the patterns
discovered, let us look at an area in which such patterns are useful. Quality
control is one of the areas in which multivariate data mining is used.
For example, eight different properties are measured on products as part of the
quality control before delivery. You have a table with these eight values measured
on one hundred and fourteen product samples from the last year. With this table, a
quality manager can answer questions like: Are there any trends? Are the eight
properties related to each other, and if so how? Was there a difference when the new
production process was started six months ago? Is there any relation between the
product's quality and the values of the sixteen process variables? Can we improve
the process? Do we have to measure all eight properties to guarantee good
products?
In effect, this is achieved by using data mining algorithms such as k-means
clustering. K-means clustering relies on distance calculations between cluster
centroids and patterns. As the number of patterns and the number of centroids
increase, the time needed to complete the computations increases. This computational
load requires high-performance computers and/or algorithmic improvements.
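As a rough illustration of why this load grows, the number of distance evaluations can
be estimated as follows. The symbols are assumptions introduced here: n patterns, k
centroids, d attributes per pattern, and t iterations until convergence.

```latex
\[
\text{distance evaluations} \approx t \cdot n \cdot k,
\qquad
\text{arithmetic cost} \approx O(t \cdot n \cdot k \cdot d)
\]
```

For instance, with hypothetical values n = 10,000, k = 10 and t = 20, about 2,000,000
distance evaluations would be needed.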
My research proposes a method that combines the k-means algorithm with
indexes. This combination provides a better-localized result, without any loss of
information, along with less computation. In the above example, we can index the
dataset by product and then run the k-means algorithm over each product
to find meaningful information. The thesis investigates whether the indexed
k-means method is possible for a multivariate dataset with static
variables. It tries to answer whether better knowledge can be acquired using
this process than with the regular k-means method, and to test the validity of the
algorithm.
1.5.1 How can we achieve the indexed k-means method?
The dataset was indexed using a static variable with a small number of discrete
values. For each value of that static variable, the resulting subset of the dataset
was taken and the regular k-means algorithm (implemented in Java) was run over it
to find the information in that subset. For comparison, the whole dataset was
taken without indexing and the k-means algorithm was run over it.
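The indexing step described above can be sketched as follows. This is only a
minimal illustration in Python, not the Java implementation used in the thesis.
The function names (partition_by_index, indexed_clustering,
whole_dataset_clustering), the representation of records as dictionaries, and
the cluster_fn parameter are assumptions introduced here for illustration; any
clustering routine, such as k-means, could be passed in as cluster_fn.

```python
from collections import defaultdict


def partition_by_index(records, index_key):
    """Group records by the value of one static (discrete) variable."""
    groups = defaultdict(list)
    for record in records:
        groups[record[index_key]].append(record)
    return dict(groups)


def to_vectors(records, feature_keys):
    """Extract the numeric attributes of each record as a feature vector."""
    return [[float(r[f]) for f in feature_keys] for r in records]


def indexed_clustering(records, index_key, feature_keys, cluster_fn):
    """Run cluster_fn separately on the subset belonging to each index value."""
    results = {}
    for value, group in partition_by_index(records, index_key).items():
        results[value] = cluster_fn(to_vectors(group, feature_keys))
    return results


def whole_dataset_clustering(records, feature_keys, cluster_fn):
    """Comparison run: cluster the whole dataset without indexing."""
    return cluster_fn(to_vectors(records, feature_keys))
```

Each subset is smaller than the whole dataset, so every run of the clustering
routine performs fewer distance calculations than a run over all records.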
1.6 Research Contribution
This thesis addresses a problem that has not been looked at before, namely how
to combine the k-means algorithm with indexing. The thesis proposes this
indexed k-means algorithm and compares the information it gathers with that
gathered by the regular k-means method.
1.6.1 Assumptions:
1. We assume that knowledge is stored in only 5 attributes or dimensions, even
though the algorithm supports more dimensions. With more dimensions, more
computation is required.
2. We also assume that within each unordered dimension, only one value may be
present in the database record. For example, if the additional dimension refers to
the fish ID, then there cannot be two different fish IDs associated with a record.
3. Noise and outlier data points have been discarded in the preprocessing of the
dataset, since a small number of such data can substantially influence the mean
value. These data may or may not have carried useful information.
4. Datasets from a particular year have been taken, to reduce the computations.
However, some knowledge may have been lost in that process.
1.7 Thesis organization
This thesis is organized as follows. In Chapter 2, related work is discussed,
including other approaches for finding knowledge and research done in the area
of multivariate data analysis. Included here is an in-depth discussion of the two
algorithms, k-means and indexed data mining, on which our proposed indexed
k-means algorithm is based. Chapter 3 explains how these two algorithms are
integrated to form the new algorithm and describes the comparison algorithm,
k-means. Chapter 4 shows the results of the knowledge analysis and possible
optimizations. Chapter 5 concludes with a look at the future directions of this
research.
Chapter 2
RELATED WORKS
2.1 Data mining approaches
Data mining is a young interdisciplinary field, drawing from areas such as
database systems, data warehousing, statistics, machine learning, data
visualization, information retrieval, and high performance computing [4]. Other
contributing areas include neural networks, pattern recognition, spatial data
analysis, image databases, signal processing, probabilistic graph theory, and
inductive logic programming. Data mining needs the integration of approaches
from multiple disciplines [4].
A large set of data analysis methods has been developed in statistics. Machine
learning has also contributed substantially to classification and induction
problems. Neural networks have shown their effectiveness in classification,
prediction and clustering analysis tasks. However, with increasingly large amounts
of data stored in databases for data mining, these methods face challenges in
efficiency and scalability. Efficient data structures, indexing and data accessing
techniques developed in database research contribute to high performance data
mining. Many existing data analysis methods need to be re-examined and made
set-oriented; scalable algorithms should also be developed for effective data
mining [4].
Another difference between traditional data analysis and data mining is that
traditional data analysis is assumption-driven, in the sense that a hypothesis is
formed and validated against the data, whereas data mining is discovery-driven,
in the sense that patterns are automatically extracted from data, which requires
substantial search effort [4]. Therefore, high performance computing will play an
important role in data mining. Parallel, distributed, and incremental data mining
methods should be developed, and parallel computer architectures and other high
performance computing techniques should also be explored in data mining.
Human eyes can readily identify patterns and regularities in data sets or data
mining results. Data and knowledge visualization is therefore an effective
approach for the presentation of data and knowledge, exploratory data analysis,
and interactive data mining.
Data mining in a data warehouse is one step beyond on-line analytic processing
(OLAP) of data warehouse data [3]. By integrating OLAP and data cube
technologies, an on-line analytical mining mechanism contributes to interactive
mining of multiple abstraction spaces of data cubes.
2.2 Mining complex data in large data and information repositories [4]
Data mining is not confined to relational, transactional, and data warehouse data.
There are high demands for mining spatial, text, multimedia and time-series data,
and for mining complex, heterogeneous, semi-structured and unstructured data,
including web-based information repositories [5,6].
Complex data may require advanced data mining techniques. For example, for
object-oriented and object-relational databases, object-cube based generalization
techniques can be developed for handling complex structured objects, methods,
class/subclass hierarchies, etc. Mining can then be performed on the
multi-dimensional abstraction spaces provided by object-cubes.
A spatial database stores spatial data, which represent points, lines, and regions,
and non-spatial data, which represent other properties of spatial objects and their
non-spatial relationships. A spatial data cube can be constructed which consists of
both spatial and non-spatial dimensions and/or measures. Since a spatial measure
may represent a group of aggregations that may produce a great number of
aggregated spatial objects, it is impossible to pre-compute and store all such
spatial aggregations. Therefore, selective materialization of aggregated spatial
objects is a good tradeoff between storage space and online computation time [4].
Spatial data mining can be performed in a spatial data cube as well as directly in a
spatial database. A multi-tier computation technique can be adopted in spatial
data mining to reduce spatial computation. For example, when mining spatial
association rules, one can first apply rough spatial computations, such as the
minimal bounding rectangle method, to filter out most of the sets of spatial
objects (e.g., those not spatially close enough), and then apply relatively costly,
refined spatial computation only to the set of promising candidates.
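The two-tier idea can be sketched as follows in Python. This is a hedged
illustration only: spatial objects are assumed to be lists of (x, y) vertices,
the refined computation is simplified to the smallest vertex-to-vertex
distance, and the function names and the threshold parameter are hypothetical.
The rough filter is safe because the distance between two bounding rectangles
never exceeds the distance between the points they enclose.

```python
import math


def mbr(points):
    """Minimal bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def mbr_distance(a, b):
    """Smallest distance between two rectangles (0 if they overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def refined_distance(p, q):
    """Costly step (simplified here): smallest vertex-to-vertex distance."""
    return min(math.dist(u, v) for u in p for v in q)


def close_pairs(objects, threshold):
    """Return index pairs of objects that lie within threshold of each other."""
    boxes = [mbr(o) for o in objects]
    pairs = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            # Tier 1: rough filter on bounding rectangles.
            if mbr_distance(boxes[i], boxes[j]) > threshold:
                continue
            # Tier 2: refined computation only for promising candidates.
            if refined_distance(objects[i], objects[j]) <= threshold:
                pairs.append((i, j))
    return pairs
```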
Text analysis methods and content-based image retrieval techniques play an
important role in mining text and multimedia data, respectively. These techniques
can be integrated with data cube and data mining techniques for effective mining
of such types of data.
It is challenging to mine knowledge from the World-Wide-Web because of the
huge amount of unstructured and semi-structured data. However, Web access
patterns can be mined from the preprocessed and cleaned Web log records; hot
Web sites can be identified based on their access frequencies and the number of
links pointing to the corresponding sites.
2.3 Clustering Analysis:
Clustering analysis identifies clusters embedded in the data, where a cluster is
a collection of data objects that are "similar" to one another. Similarity can be
expressed by distance functions, specified by users or experts. A good clustering
method produces high quality clusters in which the inter-cluster similarity is
low and the intra-cluster similarity is high. For example, one may cluster the
houses in an area according to their house category and geographical locations.
Unlike classification, clustering and unsupervised learning do not rely on
predefined classes and class-labeled training examples. For this reason, clustering
is a form of learning by observation, rather than learning by examples.
Conceptual clustering groups objects to form a class, described by a concept.
This differs from conventional clustering, which measures similarity based on
geometric distance. Conceptual clustering has two functions: (1) it discovers the
appropriate classes; (2) it forms descriptions for each class, as in classification.
The guideline of striving for high intraclass and low interclass similarity still
applies.
Data mining research has been focused on high quality and scalable clustering
methods for large databases and multidimensional data warehouses.
An example of clustering is what most people do when they sort laundry:
grouping permanent press, dry cleaning, whites, and brightly colored clothes,
which matters because the items in each group have similar characteristics. It
turns out that these clusters have important common attributes about the way they
behave when washed. Clustering is straightforward in concept but, of course,
difficult to do well; in practice, clustering is often more dynamic.
An example of the nearest neighbor prediction algorithm is to look at the
people in your neighborhood. You may notice that, in general, they all have
somewhat similar incomes. However, there may still be a wide variety of incomes
among even your closest neighbors. The nearest neighbor prediction algorithm
works in very much the same way, except that nearness in a database may consist
of a variety of factors. It performs quite well in terms of automation because
many of the algorithms are robust with respect to dirty and missing data.
The nearest neighbor prediction algorithm, simply stated, is as follows: "Objects
that are 'near' each other will also have similar prediction values. Thus, if you
know the prediction value of one of the objects, you can predict it from its
nearest neighbors [7]."
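The idea can be sketched in a few lines of Python. This is an illustrative
sketch only: it assumes numeric feature vectors, Euclidean distance as the
measure of nearness, and a prediction formed by averaging the known values of
the k nearest records. The function name knn_predict and its parameters are
hypothetical and do not come from [7].

```python
import math


def knn_predict(known, query, k=3):
    """Predict a value for query from the k nearest known records.

    known is a list of (feature_vector, value) pairs.
    """
    nearest = sorted(known, key=lambda rec: math.dist(rec[0], query))[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Incomes (in thousands) of four households located by two coordinates.
households = [
    ([0.0, 0.0], 50.0),
    ([0.0, 1.0], 52.0),
    ([1.0, 0.0], 54.0),
    ([10.0, 10.0], 200.0),
]
estimate = knn_predict(households, [0.2, 0.2], k=3)
```

The distant household does not influence the estimate because it is not among
the three nearest neighbors of the query point.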
The typical requirements of clustering in data mining are:
Scalability: Clustering on a sample of a given large data set may lead to biased
results, so highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are
designed to cluster interval-based (numerical) data. However, applications may
require clustering other types of data, such as binary, categorical (nominal), and
ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Clusters can be of any shape.
Hence, it is important to develop algorithms that can detect clusters of arbitrary
shape.
Minimal requirement for domain knowledge to determine input
parameters: The clustering results can be sensitive to input parameters.
Ability to deal with noise: Some clustering algorithms are sensitive to
missing, unknown, outlier, or erroneous data and lead to clusters of poor quality.
Insensitivity to the order of input records: Some clustering algorithms are
sensitive to the order of input data.
High dimensionality: It is challenging to cluster data objects in
high-dimensional space, especially considering that such data can be very sparse
and highly skewed.
Constraint-based clustering: Applications may need to perform clustering
under various kinds of constraints. A challenging task is to find groups of data
with good clustering behavior that satisfy specified constraints.
Interpretability and usability: Clustering needs to be tied to specific
semantic interpretations and applications. It is important to study how an
application goal may influence the selection of clustering methods.
There are many clustering techniques, organized into the following categories:
partitioning, hierarchical, density-based, grid-based, and model-based methods.
Clustering can also be used for outlier detection.
2.3.1 Partitioning methods:
Given a database of n objects or data tuples, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k <= n. That
is, it classifies the data into k groups, which together satisfy the following
requirements: (1) each group must contain at least one object; (2) each object must
belong to exactly one group.
Given k, the number of partitions to construct, a partitioning method creates an
initial partitioning. It then uses an iterative relocation technique that attempts to
improve the partitioning by moving objects from one group to another.
To achieve global optimality, partitioning-based clustering would require the
exhaustive enumeration of all of the possible partitions. Instead, most
applications adopt one of two popular heuristic methods: (1) the k-means
algorithm, where each cluster is represented by the mean value of the objects in
the cluster; (2) the k-medoids algorithm, where each cluster is represented by one
of the objects located near the center of the cluster. These heuristic clustering
methods work well in finding spherical-shaped clusters in small to medium-sized
databases. To find clusters with complex shapes and to cluster very large data
sets, partitioning-based methods need to be extended.
2.3.1.1 k-means algorithm:
Given a database of n objects and k, the number of clusters to form, a
partitioning algorithm organizes the objects into k partitions (k <= n), where each
partition represents a cluster. The clusters are formed to optimize an objective
partitioning criterion, often called a similarity function, such as distance, so that
the objects within a cluster are "similar," whereas the objects of different clusters
are "dissimilar" in terms of the database attributes.
Algorithm: k-means, a partitioning algorithm based on the mean value of the
objects in the cluster.
Input: The number of clusters k and a database containing n objects.
Output: A set of k clusters that minimizes the squared-error criterion.
Method:
(1) Arbitrarily choose k objects as the initial cluster centers
(2) Repeat
(3) (Re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster
(4) Update the cluster means, i.e., calculate the mean value of the objects for
each cluster
(5) Until no change occurs
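The steps above can be sketched in Python as follows. This is a minimal
sketch, not the Java implementation used in this thesis. It assumes numeric
feature vectors and squared Euclidean distance as the dissimilarity measure;
the function names and the max_iter and seed parameters are assumptions added
here so that the example is reproducible and always terminates.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step (1): arbitrarily choose k objects as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):  # Step (2): repeat
        # Step (3): (re)assign each object to the nearest cluster mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step (5): stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step (4): update the cluster means.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = mean_point(members)
    return centers, assignment
```

Because the initial centers are chosen at random, different seeds may lead to
different final clusterings on data that is less clearly separated.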
The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high and
the intercluster similarity is low. Cluster similarity is measured in regard to the
mean value of the objects in a cluster, which can be viewed as the cluster's center
of gravity.
"How does the k-means algorithm work?" The k-means algorithm proceeds as
follows. First, it randomly selects k of the objects, each of which initially
represents a cluster mean or center. Each of the remaining objects is assigned to
the cluster to which it is the most similar, based on the distance between the
object and the cluster mean. The algorithm then computes the new mean for
each cluster. This process iterates until the criterion function converges.
Typically, the squared-error criterion is used, which is defined as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2,$$
where E is the sum of the square-error for all the objects in the database, p is
the point in space representing a given object, and m_i is the mean of cluster C_i
(both p and m_i are multidimensional). This criterion tries to make the resulting k
clusters as compact and as separate as possible.
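As a small worked illustration of this criterion, the following Python sketch
computes E for a given partition. The function name squared_error and the
representation of a partition as a list of clusters, each a list of points,
are assumptions made for this example.

```python
def squared_error(clusters):
    """Sum over all clusters C_i of |p - m_i|^2 for every point p in C_i."""
    total = 0.0
    for members in clusters:
        # m_i: component-wise mean of the cluster.
        m = [sum(col) / len(members) for col in zip(*members)]
        for p in members:
            total += sum((x - y) ** 2 for x, y in zip(p, m))
    return total


# Two ways of splitting the same three points into k = 2 clusters.
compact = [[[0.0, 0.0], [2.0, 0.0]], [[5.0, 5.0]]]
spread = [[[0.0, 0.0], [5.0, 5.0]], [[2.0, 0.0]]]
e_compact = squared_error(compact)
e_spread = squared_error(spread)
```

The partition that groups the two nearby points together yields the smaller
value of E, which is the partition the criterion prefers.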
The algorithm attempts to determine k partitions that minimize the squared-error
function. It works well when the clusters are compact clouds that are rather well
separated from one another. The method is relatively scalable and efficient in
processing large data sets because the computational complexity of the algorithm
is O(nkt), where n is the total number of objects, k is the number of clusters, and
t is the number of iterations. Normally, k << n and t << n. The method often
terminates at a local optimum.
The k-means method, however, can be applied only when the mean of a cluster is
defined. This may not be the case in some applications, such as when data with
categorical attributes are involved. The necessity for users to specify k (number of
clusters) in advance can be seen as a disadvantage. The k-means method is not
suitable for discovering clusters with non-convex shapes or clusters of very
different size. Moreover, it is sensitive to noise and outlier data points since a
small number of such data can substantially influence the mean value.
There are a few variants of the k-means method. These differ in the selection of
the initial k means, the calculation of dissimilarity, and strategies for calculating
cluster means. An interesting strategy that often yields good results is to first
apply a hierarchical agglomeration algorithm to determine the number of clusters
and find initial clusters, and then use iterative relocation to improve them.
Another variant of k-means is the k-modes method, which extends the k-means
paradigm to cluster categorical data by replacing the means of clusters with
modes, using new dissimilarity measures to deal with categorical objects and a
frequency-based method to update the modes of clusters. The k-means and the
k-modes methods can be integrated to cluster data with mixed numeric and
categorical values, resulting in the k-prototypes method.
"How can we make the k-means algorithm more scalable?" A recent effort on
scaling the k-means algorithm is based on the idea of identifying three kinds of
regions in data: regions that are compressible, regions that must be maintained in
main memory, and regions that are discardable. An object is compressible if it is
not discardable but belongs to a tight subcluster. A data structure known as a
clustering feature is used to summarize objects that have been discarded or
compressed. If an object is neither discardable nor compressible, then it should
be retained in main memory. To achieve scalability, the iterative clustering
algorithm only includes the clustering features of the compressible objects and
the objects that must be retained in main memory, thereby turning a
secondary-memory-based algorithm into a main-memory-based algorithm.
[Figure: flowchart. Start; initiate the centers of the k clusters; evaluate the
cluster assignment of the vectors; compute new cluster centers with respect to
the new cluster assignment; evaluate the new cluster assignments; if the new
cluster assignments differ from the previous ones (yes), repeat; otherwise (no),
stop.]
Fig 2.1 Flowchart for k-means Algorithm
2.3.1.2 k-medoids method:
The k-means algorithm is sensitive to outliers since an object with an extremely
large value may substantially distort the distribution of data. "How might the
algorithm be modified to diminish such sensitivity?" Instead of taking the mean
value of the objects in a cluster as a reference point, the medoid can be used,
which is the most centrally located object in a cluster. Thus the partitioning
method can still be performed based on the principle of minimizing the sum of
the dissimilarities between each object and its corresponding reference point.
This forms the basis of the k-medoids method.
The strategy of the k-medoids clustering algorithm is to find k clusters in n
objects by first arbitrarily finding a representative object (the medoid) for each
cluster. Each remaining object is clustered with the medoid to which it is the
most similar. The strategy then iteratively replaces one of the medoids by one of
the non-medoids as long as the quality of the resulting clustering is improved.
This quality is estimated using a cost function that measures the average
dissimilarity between an object and the medoid of its cluster. To determine
whether a nonmedoid object, o_random, is a good replacement for a current medoid,
o_j, the following four cases are examined for each of the nonmedoid objects, p.
Case 1: p currently belongs to medoid o_j. If o_j is replaced by o_random as a
medoid and p is closest to one of o_i, i ≠ j, then p is reassigned to o_i.
Case 2: p currently belongs to medoid o_j. If o_j is replaced by o_random as a
medoid and p is closest to o_random, then p is reassigned to o_random.
Case 3: p currently belongs to medoid o_i, i ≠ j. If o_j is replaced by o_random
as a medoid and p is still closest to o_i, then the assignment does not change.
Case 4: p currently belongs to medoid o_i, i ≠ j. If o_j is replaced by o_random
as a medoid and p is closest to o_random, then p is reassigned to o_random.
Each time a reassignment occurs, a difference in square-error E is contributed to
the cost function. The cost function therefore calculates the difference in
square-error value if a current medoid is replaced by a nonmedoid object. The
total cost of swapping is the sum of the costs incurred by all nonmedoid objects.
If the total cost is negative, then o_j is replaced with o_random, since the actual
square-error E would be reduced. If the total cost is positive, the current medoid
o_j is considered acceptable.
Algorithm: k-medoids, a partitioning algorithm based on medoids or central
objects.
Input: The number of clusters k and a database containing n objects.
Output: A set of k clusters that minimizes the sum of the dissimilarities of all the
objects to their nearest medoid.
Method:
(1) Arbitrarily choose k objects as the initial medoids
(2) Repeat
(3) Assign each remaining object to the cluster with the nearest medoid
(4) Randomly select a nonmedoid object, o_random
(5) Compute the total cost, S, of swapping o_j with o_random
(6) If S < 0 then swap o_j with o_random to form the new set of k medoids
(7) Until no change
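One possible reading of these steps is sketched below in Python. It is a
hedged sketch rather than a definitive implementation. It assumes Euclidean
distance as the dissimilarity, computes the swap cost S as the change in the
total dissimilarity of all objects to their nearest medoid, and interprets
"until no change" as "until no candidate swap lowers the total cost". The
function names and the seed parameter are hypothetical.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    # Step (1): arbitrarily choose k objects (by index) as initial medoids.
    medoids = rng.sample(range(len(points)), k)
    improved = True
    while improved:  # Step (2): repeat
        improved = False
        # Step (4): visit the nonmedoid objects in random order.
        candidates = [i for i in range(len(points)) if i not in medoids]
        rng.shuffle(candidates)
        for o_random in candidates:
            for j in range(k):
                trial = medoids[:j] + [o_random] + medoids[j + 1:]
                # Step (5): total cost S of swapping o_j with o_random.
                s = total_cost(points, trial) - total_cost(points, medoids)
                if s < 0:  # Step (6): accept a swap that lowers the cost.
                    medoids = trial
                    improved = True
                    break
            if improved:
                break
    # Step (3): assign each object to the cluster of its nearest medoid.
    assignment = [
        min(range(k), key=lambda c: math.dist(p, points[medoids[c]]))
        for p in points
    ]
    return medoids, assignment
```

The loop terminates because every accepted swap strictly lowers the total
cost and there are only finitely many sets of k medoids.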
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms
introduced. It attempts to determine k partitions for n objects. After an initial
random selection of k medoids, the algorithm repeatedly tries to make a better
choice of medoids. All of the possible pairs of objects are analyzed, where one
object in each pair is considered a medoid and the other is not. The quality of the
resulting clustering is calculated for each such combination. An object, o_j, is
replaced with the object causing the greatest reduction in square-error. The set of
best objects for each cluster in one iteration forms the medoids for the next
iteration. For large values of n and k, such computation becomes very costly.
"Which method is more robust, k-means or k-medoids?" The k-medoids method
is more robust than k-means in the presence of noise and outliers because a
medoid is less influenced by outliers or other extreme values than a mean.
However, its processing is more costly than the k-means method. Both methods
require the user to specify k, the number of clusters.
2.3.2 Hierarchical methods:
A hierarchical method creates a hierarchical decomposition of the given set of
data objects. A hierarchical method can be classified as being either agglomerative
or divisive, based on how the hierarchical decomposition is formed. The
agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that
are close to one another, until all of the groups are merged into one or until a
termination condition holds. The divisive approach, also called the top-down
approach, starts with all of the objects in the same cluster. In each successive
iteration, a cluster is split up into smaller clusters, until eventually each object
is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is
done, it can never be undone. This rigidity is useful in that it leads to smaller
computation costs by not worrying about the combination of different choices.
However, such techniques cannot correct erroneous decisions. There are two
approaches to improve the quality of hierarchical clustering: (1) perform careful
analysis of object "linkages" at each hierarchical partitioning, such as in CURE
and Chameleon; (2) integrate hierarchical agglomeration and iterative relocation
by first using a hierarchical agglomerative algorithm and then refining the result
using iterative relocation, as in BIRCH.
2.3.3 Density-based methods:
Most partitioning methods cluster objects based on the distance between objects.
Such methods can find only spherical-shaped clusters and encounter difficulty in
discovering clusters of arbitrary shapes. Other clustering methods have been
developed based on the notion of density. The general idea is to continue growing
the given cluster as long as the density (number of objects) in the "neighborhood"
exceeds some threshold; that is, for each object within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise and discover clusters of
arbitrary shape.
DBSCAN is a typical density-based method that grows clusters according to a
density threshold. OPTICS is a density-based method that computes an
augmented clustering ordering for automatic and interactive cluster analysis.
2.3.4 Grid-based methods:
Grid-based methods quantize the object space into a finite number of cells that
form a grid structure. All of the clustering operations are performed on the grid
structure (quantized space). The main advantage of this approach is its fast
processing time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized space.
STING is a typical example of a grid-based method. CLIQUE and WaveCluster
are two clustering algorithms that are both grid and density based.
2.3.5 Model-based methods:
Model-based methods hypothesize a model for each of the clusters and find the
best fit of the data to the given model. A model-based algorithm may locate
clusters by constructing a density function that reflects the spatial distribution of
the data points. It also leads to a way of automatically determining the number of
clusters based on standard statistics, taking "noise" or outliers into account and
thus yielding robust clustering methods.
Some clustering algorithms integrate the ideas of several clustering methods, so
that it is sometimes difficult to classify a given algorithm as uniquely belonging to
only one clustering method category. Furthermore, some applications may have
clustering criteria that require the integration of several clustering techniques.
2.3.6 EM algorithm
The EM (Expectation Maximization) algorithm extends the k-means paradigm in
a different way. Instead of assigning each object to a dedicated cluster, it assigns
each object to a cluster according to a weight representing the probability of
membership. In other words, there are no strict boundaries between clusters.
Therefore, new means are computed based on weighted measures.
Here, we know neither the distribution that each training instance came from
nor the parameters of the mixture model. So we adopt the procedure used
for the k-means clustering algorithm, and iterate. We start with initial guesses for
the five parameters (for a mixture of two clusters A and B: the mean and standard
deviation of each cluster and the probability of cluster A), use them to calculate
the cluster probabilities for each instance, use these probabilities to re-estimate
the parameters, and repeat. This procedure is called the EM algorithm. The first
step, the calculation of the cluster probabilities (which are the "expected" class
values), is the "expectation"; the second, the calculation of the distribution
parameters, is the "maximization" of the likelihood of the distributions given the
data.
Adjustments must be made to the parameter estimation equations to account for
the fact that it is only the cluster probabilities, not the clusters themselves, that
are known for each instance. These probabilities act like weights. If w_i is the
probability that instance i belongs to cluster A, the mean μ_A and standard
deviation σ_A of cluster A are given by
$$\mu_A = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$
$$\sigma_A^2 = \frac{w_1 (x_1 - \mu_A)^2 + w_2 (x_2 - \mu_A)^2 + \cdots + w_n (x_n - \mu_A)^2}{w_1 + w_2 + \cdots + w_n}$$
where the x_i are all the instances, not just those belonging to cluster A. Now
consider how to terminate the iteration. The k-means algorithm stops when the
classes of the instances don't change from one iteration to the next, which means
that a "fixed point" has been reached. The EM algorithm converges toward a fixed
point but never actually gets there. Despite that, we can see how close it is by
calculating the overall likelihood that the data came from this model, given the
values of the parameters. This overall likelihood is obtained by multiplying the
probabilities of the individual instances i:
$$\prod_i \left( p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B] \right)$$
The probabilities given the clusters A and B are determined from the normal
distribution function f(x; μ, σ). This overall likelihood is a measure of the
"goodness" of the clustering, and increases at each iteration of the EM algorithm.
Again, there is a technical difficulty with equating the probability of a particular
value of x with f(x; μ, σ). In this case, the effect does not disappear because no
probability normalization operation is applied. The upshot is that the likelihood
expression is not a probability and does not necessarily lie between zero and one;
nevertheless, its magnitude reflects the quality of the clustering. In practical
implementations, its logarithm is calculated instead, summing the logs of the
individual components and avoiding all multiplications. The overall conclusion
still holds: you should iterate until the increase in log-likelihood becomes
negligible.
For example, a practical implementation might iterate until the difference
between successive values of log-likelihood is less than 10^{-10} for ten successive
iterations. The log-likelihood may increase very sharply over the first few
iterations and then converge quickly to a point that is virtually stationary.
Although the EM algorithm is guaranteed to converge to a maximum, this is a
local maximum and may not necessarily be the same as the global maximum. For a
better chance of obtaining the global maximum, the whole procedure should be
repeated several times, with different initial guesses for the parameter values. The
overall log-likelihood figure can be used to compare the different final
configurations obtained: just choose the largest of the local maxima.
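The procedure described in this section can be sketched in Python for the
simplest case. The sketch is an illustration under stated assumptions, not a
definitive implementation: it handles one numeric attribute and exactly two
clusters A and B, takes the five initial parameter guesses as arguments, and
stops once the log-likelihood has increased by less than tol for patience
successive iterations. The function and parameter names are hypothetical, and
the sketch does not guard against a cluster collapsing onto a single value.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density f(x; mu, sigma) of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_clusters(xs, mu_a, sigma_a, mu_b, sigma_b, p_a,
                    tol=1e-10, patience=10, max_iter=1000):
    prev_ll = -math.inf
    ll = prev_ll
    small_steps = 0
    for _ in range(max_iter):
        # Expectation: weighted density of each instance under A and B.
        dens_a = [p_a * normal_pdf(x, mu_a, sigma_a) for x in xs]
        dens_b = [(1.0 - p_a) * normal_pdf(x, mu_b, sigma_b) for x in xs]
        # Log-likelihood: a sum of logs instead of a product.
        ll = sum(math.log(a + b) for a, b in zip(dens_a, dens_b))
        small_steps = small_steps + 1 if ll - prev_ll < tol else 0
        if small_steps >= patience:
            break
        prev_ll = ll
        w = [a / (a + b) for a, b in zip(dens_a, dens_b)]  # weights for A
        # Maximization: weighted means and standard deviations.
        sw_a = sum(w)
        sw_b = len(xs) - sw_a
        mu_a = sum(wi * x for wi, x in zip(w, xs)) / sw_a
        mu_b = sum((1.0 - wi) * x for wi, x in zip(w, xs)) / sw_b
        sigma_a = math.sqrt(
            sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, xs)) / sw_a)
        sigma_b = math.sqrt(
            sum((1.0 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, xs)) / sw_b)
        p_a = sw_a / len(xs)
    return mu_a, sigma_a, mu_b, sigma_b, p_a, ll


xs = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
mu_a, sigma_a, mu_b, sigma_b, p_a, ll = em_two_clusters(
    xs, 0.0, 5.0, 15.0, 5.0, 0.5)
```

To look for the global maximum, the call can be repeated with different
initial guesses and the result with the largest returned log-likelihood kept.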
2.4 Index-based algorithms:
Given a data set, the index-based algorithm uses multidimensional indexing
structures, such as R-trees or k-d trees, to search for neighbors of each object o
within radius d around that object. Let M be the maximum number of objects
within the d-neighborhood of an outlier. Therefore, once M + 1 neighbors of
object o are found, it is clear that o is not an outlier. This algorithm has a
worst-case complexity of O(k*n^2), where k is the dimensionality and n is the
number of objects in the data set. The index-based algorithm scales well as k
increases. However, this complexity evaluation takes only the search time into
account even though the task of building an index, in itself, can be
computationally intensive.
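The early-termination rule can be sketched in Python as follows. Note the
assumption made here: for brevity, the neighbor search is a plain linear scan
standing in for the R-tree or k-d tree, so this sketch shows only the stopping
rule and not the index itself. The function names are hypothetical.

```python
import math


def is_outlier(points, i, d, m):
    """Object i is an outlier if at most m other objects lie within radius d."""
    neighbors = 0
    for j, q in enumerate(points):
        if j != i and math.dist(points[i], q) <= d:
            neighbors += 1
            if neighbors > m:  # M + 1 neighbors found: not an outlier
                return False
    return True


def find_outliers(points, d, m):
    """Indices of all objects that are outliers for radius d and limit m."""
    return [i for i in range(len(points)) if is_outlier(points, i, d, m)]


points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
outliers = find_outliers(points, d=1.5, m=1)
```

In this example each of the first four objects has three neighbors within the
radius, so the scan for it stops after the second neighbor is found, while the
last object has none and is reported as an outlier.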
2.5 Which Techniques to Use for Which Tasks [8]
Tasks (columns): Classification, Estimation, Prediction, Affinity Group,
Clustering, Description
Techniques (rows): Standard Statistics, Market Basket Analysis, Memory-Based
Reasoning, Genetic Algorithm, Cluster Detection, Link Analysis, Decision Tree,
Neural Networks
Table 2.1 Techniques and Tasks.
The choice of data mining techniques to apply at a given point in the cycle
depends on the particular data-mining task to be accomplished and on the data
available for analysis. The approach to selecting a data mining technique has two steps:
1. Translate the business problem to be addressed into a series of data
mining tasks.
2. Understand the nature of the available data in terms of the content and
the types of data fields, and the structure of the relationships between records.
The data mining approach is mostly influenced by the following data
characteristics:
A preponderance of categorical variables
A preponderance of numeric variables
A large number of fields (independent variables) per record
Multiple target fields (dependent variables)
Variable-length records
Time-ordered data
Free-text data
2.5 Multidimensional Data Model
The multidimensional data model takes the form of a data cube, which allows data to
be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
Dimensions are the perspectives or entities according to which an organization
wants to keep records, like time, item, branch, and location in a sales store. Each
dimension may have a table associated with it, called a dimension table, which
further describes the dimension.
A multidimensional data model is typically organized around a central theme, like
sales, for instance. A fact table represents this theme, where facts are numerical
measures. The fact table contains the names of the facts, or measures, as well as
keys to each of the related dimension tables. Multidimensional models exist in the
form of a star schema, a snowflake schema, or a fact constellation schema.
Star schema:
The most common modeling paradigm is the star schema, in
which the data warehouse contains (1) a large central table (fact table) containing
the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables
(dimension tables), one for each dimension. The schema graph resembles a starburst,
with the dimension tables displayed in a radial pattern around the central fact
table.
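A small star schema can be sketched with an in-memory SQLite database. The
table names, columns, and sample rows below are illustrative assumptions built
around the sales example in the text; they do not describe the fisheries
database used later in this thesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension, each describing its dimension.
CREATE TABLE dim_time   (time_key   INTEGER PRIMARY KEY, quarter     TEXT);
CREATE TABLE dim_item   (item_key   INTEGER PRIMARY KEY, item_name   TEXT);
CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT);

-- Central fact table: numerical measures plus a key to each dimension table.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO dim_time   VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_item   VALUES (1, 'bread'), (2, 'milk');
INSERT INTO dim_branch VALUES (1, 'north');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 10, 100.0),
    (1, 2, 1,  5,  75.0),
    (2, 1, 1, 20, 200.0);
""")

# View the sales measure along the time dimension: one join per dimension used.
rows = conn.execute("""
    SELECT t.quarter, SUM(f.dollars_sold)
    FROM fact_sales AS f
    JOIN dim_time AS t ON f.time_key = t.time_key
    GROUP BY t.quarter
    ORDER BY t.quarter
""").fetchall()
```

Because every dimension table is joined directly to the fact table, a query
needs only one join for each dimension it uses.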
Snowflake schema:
The snowflake schema is a variant of the star schema
model, where some dimension tables are normalized, thereby splitting the data
into additional tables. The resulting schema graph forms a snowflake shape. The
difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space, because a
large dimension table can become enormous when the dimensional structure is
included as columns. However, the saved space is negligible in comparison to the
magnitude of the fact table. The snowflake structure also reduces the effectiveness
of browsing, since more joins are needed to execute a query. Consequently, the
system performance may be adversely impacted. Hence, the snowflake schema is
not as popular as the star schema in data warehouse design.
Fact constellation:
Sophisticated applications may require multiple fact tables to
share dimension tables. This kind of schema can be viewed as a collection of
stars, and hence is called a galaxy schema or a fact constellation.
A data warehouse collects information about subjects that span the entire
organization, such as customers, items, sales, assets, and personnel, and thus its
scope is enterprise-wide. For data warehouses, the fact constellation schema is
commonly used since it can model multiple, interrelated subjects. A data mart, on
the other hand, is a departmental subset of the data warehouse that focuses on
selected subjects, and thus its scope is department-wide. For data marts, the star
or snowflake schema is commonly used since both are geared towards modeling
single subjects, although the star schema is more popular and efficient.
Chapter 3
ALGORITHM AND DATASET
This chapter discusses the design and implementation of indexed k-means
clustering on the fisheries database. For this purpose, the k-means clustering
technique is implemented. First, the dataset on which the clustering algorithm is
implemented is studied. Second, the implementation of the indexed k-means and
k-means algorithms is discussed.
3.1 Different forms of knowledge:
The key issue in KDD is to realize that there is more information hidden in the
data than you are able to distinguish at first sight. In data mining we have four
different types of knowledge that can be extracted from the data:
1. Shallow knowledge: This is information that can be easily retrieved
from databases using a query tool such as structured query language (SQL).
2. Multi-dimensional knowledge: This is the information that can be
analyzed using online analytical processing (OLAP) tools. With OLAP tools you have the
ability to rapidly explore all sorts of clustering and different orderings of the data,
but it is important to realize that most of the things you can do with an OLAP
tool can also be done using SQL. The advantage of OLAP tools is that they are
optimized for this kind of search and analysis operation. However, OLAP is not
as powerful as data mining; it cannot search for optimal solutions.
3. Hidden knowledge: This is data that can be found relatively easily by
using pattern recognition or machine-learning algorithms. Again, one could use
SQL to find these patterns but this would probably prove extremely
time-consuming. A pattern recognition algorithm could find regularities in a database
in minutes or at most a couple of hours, whereas you would have to spend
months using SQL to achieve the same result.
4. Deep knowledge: This is information that is stored in the database but
can only be located if there is a clue that indicates where to look. Hidden
knowledge comes from a search space that a search algorithm can explore
successfully. Deep knowledge typically comes from a search space with only a tiny
local optimum and nothing in the dataset to indicate where it lies, so a
search algorithm could roam around without
achieving any significant result. An example of this is encrypted information
stored in a database. It is almost impossible to decipher a message that is
encrypted if you do not have the key, which indicates that, for the present time at
any rate, there is a limit to what one can learn.
3.2 Getting started:
The starting point for any data mining activity is the formulation of a specific
information requirement related to a specific action, i.e., what do we want to
know and what do we want to do with this knowledge? Data mining is pointless
unless the finding of the knowledge is followed up with the appropriate actions.
A data-mining environment can be realized on many different levels using several
different techniques; the following list gives an indication of the steps that should
be taken to start a KDD project:
1. Make a list of requirements. For what purpose would a KDD
environment be realized? What are the criteria of success? How will success be
measured?
2. Make an overview of existing hardware and software: networks,
databases, applications, servers, and so on.
3. Evaluate the quality of the available data. For what purpose was it
collected?
4. Make an inventory of the available databases, both internally and
externally.
5. Is a data warehouse in existence? What kind of data is available? Can we
zoom in on details of operational data?
6. Formulate the knowledge that the organization needs both now and in
the future, in order to be able to function optimally.
7. Identify groups of knowledge workers or decision makers who are to
apply the results. What kinds of decisions will they need to take? Which patterns
are useful to them and which are not, both now and in the future?
8. Analyze whether the knowledge found can actually be used by the
organization. It is useless to distill client profiles from mailing files if, for technical
reasons, the mailing department cannot handle the selections found.
9. List the processes and transformations these databases have to go
through before they can be used.
3.3 KDD process:
The knowledge discovery process consists of six stages: data selection, cleaning,
enrichment, coding, data mining, and reporting.
3.3.1 Data Selection:
Once you have formulated your information requirements, the next logical step is
to collect and select the data you need. In most cases, this data will be stored in
operational databases used by the information systems in the organization.
However, gathering this information in a centralized database is not always an
easy task since it may involve low-level conversion of data, such as from flat files
to relational tables. The operational data used in different parts of the
organization varies in quality. Some databases are updated on a day-to-day basis;
others may contain information that dates back several years.
Therefore, a data warehouse is an important aspect of the data mining process.
Although it is not essential to have a good data warehouse in operation to set up
a KDD activity, it is very helpful. A data warehouse presents a stable and reliable
environment in which to collect operational data.
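The flat-file-to-relational conversion mentioned above might look like the
following sketch. It assumes a comma-separated flat file with a header row; the
customer table, its columns, and the sample records are hypothetical and serve
only to show the type conversion and loading step.

```python
import csv
import io
import sqlite3

# Hypothetical flat file content; in practice this would be read from disk.
FLAT_FILE = "id,name,joined\n1,Ann,1999-03-01\n2,Bob,2001-11-15\n"

def load_flat_file(text, conn):
    """Parse flat-file text and load it into a relational table."""
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, joined TEXT)"
    )
    reader = csv.DictReader(io.StringIO(text))
    # Low-level conversion: every flat-file field arrives as a string.
    rows = [(int(r["id"]), r["name"], r["joined"]) for r in reader]
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load_flat_file(FLAT_FILE, conn)
names = conn.execute("SELECT name FROM customer ORDER BY id").fetchall()
```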
3.3.2 Cleaning:
Once data is collected, the next stage is cleaning. Because the amount of pollution
that exists in the data might not be easily detectable, it is a good idea to
examine the data in order to obtain a feeling for the possibilities, which is difficult
with a large dataset. When databases are very large, it is always advisable to select
some random samples and analyze them to get a rough idea of what one can
expect. For example, in an organization, the date of birth of a person may be
stored correctly, but the age field may not be correct. Before a data mining
operation, one has to clean the data as much as possible, and this can be done
automatically in most cases. It is not realistic, however, to expect to be able to
remove all the pollution in advance since some anomalies in the data will only be
discovered during the data mining process itself. Checking domain consistency
needs to be carried out by programs that have deep semantic knowledge of the
attributes that are being checked. Most forms of pollution are produced by the
way in which the data is gathered in the field; removing this kind of pollution
will almost always involve re-engineering the business process.
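The date-of-birth example can be turned into a simple automatic consistency
check. The sketch below assumes records are dictionaries with date_of_birth and
age fields; the field names, helper functions, and sample records are
hypothetical.

```python
import random
from datetime import date

def age_on(birth, today):
    """Age in whole years on a given day, derived from the date of birth."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1  # birthday has not yet occurred this year
    return years

def inconsistent_ages(records, today):
    """Return the records whose stored age disagrees with the date of birth."""
    return [r for r in records if r["age"] != age_on(r["date_of_birth"], today)]

def sample_records(records, size, seed=0):
    """Draw a random sample for a first look at a very large table."""
    rng = random.Random(seed)
    return rng.sample(records, min(size, len(records)))

records = [
    {"id": 1, "date_of_birth": date(1970, 6, 1), "age": 32},   # consistent
    {"id": 2, "date_of_birth": date(1980, 1, 10), "age": 21},  # stale age field
]
today = date(2003, 1, 15)
polluted = inconsistent_ages(sample_records(records, size=2), today)
```

A check of this kind catches only the inconsistencies it was written for;
anomalies that depend on the meaning of the attributes still surface later,
during the mining process itself.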
3.3.3 Enrichment:
Once the data is cleaned, enriching it becomes a necessity. Additional databases may
be available on a commercial basis; these can provide information on a variety of
subjects, including demographic data, such as the average prices of houses and
cars, types of insurance that people have, and so on.
Matching the information from bought-in databases with the company’s own
database can be difficult. A well-known problem is the reconstruction of family
relationships in databases: a company may buy a database containing