An Evaluation of Commercial Data Mining


Nov 8, 2013 (4 years ago)


Emily Davis

Supervisor: John Ebden

Abstract: As data mining becomes an increasingly popular tool used by business and
analysts it becomes necessary to examine some of the main principles in this field. It
is necessary to research not only the tools and software available for data mining but
also the processes and methodologies behind their use. This literature survey presents
some of the basic processes behind data mining and three groups of software currently
available for data mining.

1.1 Introduction

A wide variety of sources are available on data mining and its related concepts but it
is most important to examine the basic principles of data mining covered in these
sources. It is then necessary to examine how these principles are put into practice
when conducting data mining as methodologies or processes. It is also necessary to
discover how the software available for data mining implements these principles.

1.2 Data Mining Theory

[15;Ch5] present the basic theory behind data mining. According to the authors,
concept description is present in even the simplest descriptive forms of data mining.
Concept description generates descriptions for the characterization and comparison of
data. Characterization provides a summary of the collection of data and comparisons
allow for data collections to be compared. These concept descriptions also provide a
means for data generalization which is useful for describing the concepts present in
the data in concise terms and at generalized levels of abstraction.


This is in fact what data mining attempts to do by providing some sort of description
or abstraction of what is contained in the greater data set.

1.3 Data Mining Classification

Various ways to classify data mining into categories according to a conceptual
approach are suggested by a number of sources. [2] attempts to classify into
categories the various techniques of data mining and specifies two main categories:
directed data mining and undirected data mining. [12] divide data mining into two
categories, supervised and unsupervised learning. [8] also make a distinction between
data mining and data modelling. [2] mention considering the goals of the data mining
project when classifying data mining and accordingly what techniques can be used to
fulfil them. Prescriptive techniques are useful for making predictions and descriptive
techniques help with understanding of a problem space.



According to [2;pg5], directed data mining involves using the data to build a model
that describes one particular variable of interest in terms of the rest of the data. This
category envelops techniques such as Classification, Estimation and Prediction.


Undirected data mining builds a model with no single target variable but rather to
establish the relationships among all the variables. Included in this category are
Affinity Groupings or association discovery, clustering (classification with no
predefined data) and description or visualization. [2]


[12;Ch2] define input variables as independent variables and output variables as
dependent variables. It can then be stated that dependent variables do not exist in
unsupervised learning as no output variable is produced but a descriptive relationship
is produced. In supervised learning a predictive, dependent variable is produced as
output.

According to [8], data mining results in patterns that are understandable such as
decision trees, rules and associations. Data modelling produces a model that fits the
data but can be understandable (trees, rules) or presented as a black box as in neural
networks.


In keeping with these definitions it is possible to say that directed data mining,
supervised learning and [8]’s definition of data mining describe similar predictive
techniques and will be referred to as supervised learning. Undirected data mining,
unsupervised learning and [8]’s data modelling are in the same category as descriptive
techniques and will be referred to as unsupervised learning.



[15] suggest further ways data mining can be classified. It is suggested that this can be
done according to the kind of database used to store the data, according to the kind of
knowledge mined from the data, according to the techniques used to mine the data or
according to the application adapted to conduct the data mining.

1.4 Data Mining Algorithms

Algorithms are used to implement the techniques in these various categories. Many of
the algorithms in the same category work in similar ways but differ in their
technical implementation so it is useful to discuss the techniques behind the
algorithms in the different categories of data mining.

Unsupervised learning covers techniques that include clustering, association rule
induction, neural networks and association discovery or market basket analysis.

According to [12], clustering involves building a knowledge structure that groups
instances of data into classes and allows for the discovery of concept structures in the
data. [2;pg103] introduce K-means cluster detection which takes primarily numeric
input that has been normalised. The data set is divided into k number of clusters
according to the location of all members of a particular cluster in the data. Clustering
makes use of the Euclidean distance formula to determine the location of data
instances and their position in clusters and so requires numerical values that have
been properly scaled. When choosing the number of clusters to create it is possible to
choose a number that doesn’t match the natural structure of the data which leads to
poor results. For this reason [2] say it is often necessary to experiment with the
number of clusters to be used.
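
The K-means procedure described above can be illustrated with a minimal sketch. It
is not taken from [2] or [12]: the function names, the min-max scaling step, the fixed
number of iterations and the sample records are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale every column to [0, 1] so that no field dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def k_means(rows, k, iterations=20, seed=0):
    """Assign each row to its nearest centroid, then recompute the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)  # assumption: start from k randomly chosen rows
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: euclidean(row, centroids[c]))
                  for row in rows]
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical records (age, annual income); the two fields have very different scales.
records = [(25, 20000), (27, 22000), (24, 21000),
           (60, 90000), (62, 95000), (58, 88000)]
scaled = min_max_scale(records)
labels, centroids = k_means(scaled, k=2)
print(labels)
```

Running the sketch with different values of k on the same data is one simple way to
carry out the experimentation with the number of clusters that [2] recommend.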

[12] say clustering is useful for determining if meaningful relationships exist in the
data, evaluating the performance of supervised learning models, detecting outliers in
the data and even determining input attributes for supervised learning. [2] add that
using clustering means that no prior knowledge of the structures to be discovered in
the data is required.

[2] give some advice for interpreting clusters and reiterate that they may have no
practical value in some cases. Decision trees can be built from clusters and clusters
can be used for visualisation. It is also useful to examine the distribution of variables
from cluster to cluster. The authors advocate the use of clusters when a natural
grouping in the data is suspected and when there are many competing patterns in the
data as clustering reduces the complexity of the data.


Another form of unsupervised learning is the neural network. [1] describe neural
networks as network construction wherein the network learns weights at each node in
the network and segments the state space of the data with gradients or sloping
lines (Fig 3.10). [2] explain this technique as having two layers, an input layer and an
output layer. Each input to the network gets its own node which consists of a
transformation of input variables fed in. The input unit is connected to the output unit
with a weighting and the inputs are combined in the output unit with a combination
function. The combined value is then passed through the transfer function, which
together with the combination function makes up the activation function.
[2] continue by describing how a neural network is trained. This involves setting
weights on inputs to best approximate a target and is important for optimizing the
neural network. Three steps are involved in training: obtaining the variables of a
training instance, calculating outputs using the existing weights, and calculating
errors and adjusting the weights accordingly.
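
The three training steps can be sketched for a single output unit as follows. This is
an illustration under stated assumptions rather than the procedure given in [2]: the
logistic transfer function, the bias term, the learning rate, the delta-rule weight
update and the toy training data are all assumed here.

```python
import math


def combine(inputs, weights, bias):
    """Combination function: weighted sum of the inputs plus a bias term."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def transfer(value):
    """Transfer function: a logistic curve that squashes the sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def train(instances, targets, rate=0.5, epochs=2000):
    """Repeat the three training steps for every instance in every epoch."""
    weights = [0.0] * len(instances[0])
    bias = 0.0
    for _ in range(epochs):
        # Step 1: take the variables of one training instance.
        for inputs, target in zip(instances, targets):
            # Step 2: calculate the output using the existing weights.
            output = transfer(combine(inputs, weights, bias))
            # Step 3: calculate the error and adjust the weights accordingly.
            error = target - output
            gradient = error * output * (1.0 - output)
            weights = [w + rate * gradient * x for w, x in zip(weights, inputs)]
            bias += rate * gradient
    return weights, bias


# Hypothetical scaled inputs; the target is 1 only when both inputs are high.
instances = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 0.0, 0.0, 1.0]
weights, bias = train(instances, targets)
predictions = [round(transfer(combine(x, weights, bias))) for x in instances]
print(predictions)
```

The learned weights are only numbers; nothing in them reads as a rule, which is the
sense in which the results of a neural network are hard to explain.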


According to [2] neural networks produce very good results but they can be difficult
to use and understand as no rules are produced. The authors continue to say that
neural networks require extensive data preparation as inputs must be scaled,
categorical data must be converted to numerical data without introducing any ordering
and missing values must be dealt with. The authors suggest that a problem with neural
networks is that the results cannot be explained and so they should be used when
results are more useful than understanding and not when there are a high number of
inputs.
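
The preparation steps listed above can be sketched as follows. The specific choices
are assumptions made for illustration and are not prescribed by [2]: missing values
are replaced by the field mean, numeric inputs are min-max scaled, and categories are
converted with one 0/1 column per category so that no ordering is introduced.

```python
def scale(values):
    """Min-max scale numeric values to [0, 1]; missing entries (None) become the mean."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    filled = [mean if v is None else v for v in values]
    low, high = min(filled), max(filled)
    return [(v - low) / (high - low) if high > low else 0.0 for v in filled]


def one_hot(values):
    """Give each category its own 0/1 column so that no ordering is implied."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


ages = [20, None, 40, 60]                   # hypothetical field with one missing value
colours = ["red", "green", "red", "blue"]   # hypothetical categorical field

scaled_ages = scale(ages)
encoded = one_hot(colours)
print(scaled_ages)
print(encoded)
```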


Market basket analysis and association discovery are further examples of
unsupervised learning. According to [12] market basket analysis involves using
association rules to determine relationships among products. [8] state that association
discovery is the second most common data mining technology and involves the
discovery of associations between data fields.


Supervised learning covers techniques that include prediction, classification,
estimation, decision trees and association rules.

[12] describe classification as a technique where the dependent or output variable is
categorical. The emphasis of the model is to assign new instances of data to
categorical classes. The authors describe estimation as a similar technique that is used
to determine the value of an unknown output attribute that is numerical. [12] say
prediction differs from the two techniques mentioned above in that it is used to
determine future outcomes of data.



[1] explain that decision trees are created using data splitting rules and applying more
data splitting rules to the resulting subsets of data (Fig 3.8). [2;Pg111] introduce two
types of decision trees, classification and regression trees. Classification decision trees
label records and assign them to classes and then report on class probability or the
probability that a record will be classified as part of a particular class. Regression
decision trees estimate the value of a variable that takes on numerical values
according to the class it would fall in.


[2] discuss how, in decision tree data mining, a record flows through the tree along a
path determined by a series of tests until a terminal node is reached and it is then
given a class label. Different criteria are used to determine when splits in the tree
occur and it is mentioned that most software allows the user to choose splitting
criteria as well as control minimum node size and maximum tree depth (Fig 5.7).


[2] state that problems associated with decision trees are over-fitting the data,
recursive partitioning of the data and finding an initial split. The authors suggest that
the best initial split is a separation where a single class appears to dominate. This split
should result in a reduction of diversity which allows the measurement of a potential
split. The authors advise that it is not possible to discover rules that involve
relationships between variables when using a decision tree. It is also mentioned that
often data can be lost when using input variables for a decision tree due to the order of
the observations. A decreased amount of data preparation is required with decision
trees as they are not sensitive to differences of scales amongst inputs, nor to outliers
and skewed distributions. Categorical variables tend to create problems using this
technique. Although decision trees generate understandable rules they may become
very long. The authors emphasise that it is important to clearly indicate the best fields
to be used when constructing the tree.
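
The idea of measuring a potential split by its reduction of diversity can be sketched
as below. The Gini index is used here as one common diversity measure; this choice,
the function names and the sample loan records are assumptions made for illustration
and are not taken from [2].

```python
from collections import Counter


def gini(labels):
    """Gini diversity: 0 when one class fills the node, higher when classes are mixed."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def diversity_reduction(records, field, target):
    """Diversity before the split minus the size-weighted diversity of the children."""
    parent = [r[target] for r in records]
    children = {}
    for r in records:
        children.setdefault(r[field], []).append(r[target])
    after = sum(len(c) / len(parent) * gini(c) for c in children.values())
    return gini(parent) - after


def best_split(records, fields, target):
    """Pick the field whose split reduces diversity the most."""
    return max(fields, key=lambda f: diversity_reduction(records, f, target))


# Hypothetical loan records, invented for illustration.
records = [
    {"owns_home": "yes", "region": "north", "decision": "accept"},
    {"owns_home": "yes", "region": "south", "decision": "accept"},
    {"owns_home": "yes", "region": "north", "decision": "accept"},
    {"owns_home": "no", "region": "south", "decision": "reject"},
    {"owns_home": "no", "region": "north", "decision": "reject"},
    {"owns_home": "no", "region": "south", "decision": "accept"},
]
chosen = best_split(records, ["owns_home", "region"], "decision")
print(chosen)
```

In this sample, splitting on the hypothetical field owns_home leaves one branch in
which a single class dominates completely, whereas splitting on region leaves both
branches as mixed as the original data.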


Decision trees are useful for classification and predictions as they assign records to
broad categories and output rules that can be easily translated. [1] states decision trees
result in ‘if…then’ rules and decision lists which are lists of ‘if…then’ rules (Fig 3.).


Association rules are similar to decision trees and according to [8] association rule
induction is the most established and effective of the current data mining
technologies. This technique involves the definition of a business goal and the use of
rule induction to generate patterns relating this goal to other data fields. The patterns
are generated as trees with splits on data fields. [8] emphasise that this technique is
most useful if it involves the user in the process. This allows the user to add their
domain knowledge to the process and decide on attributes for generating splits.


1.5 Choosing a Technique

[12] provide advice on choosing a data mining technique for use. When determining
whether to use supervised or unsupervised learning techniques the authors advise that
the following should be considered:



If a clear explanation of results is required it is often more useful to use
supervised learning as techniques such as neural networks provide black box
results.





If there is a set of input data and output data it is more likely that supervised
learning will be used as dependent and independent variables exist. If the input
and output data have interesting interactions association rules are
recommended. Also to be considered is whether the data is numerical or
categorical.



If it is known which attributes best define the data decision trees can be used
as these help to determine which attributes are most predictive of class
membership. Clustering and neural networks assume all attributes are of equal
importance.



If the data sets have missing values neural networks may be a good choice as
the authors state that they tend to outperform other techniques when noisy data
is used.



If time is a constraint the authors recommend decision trees and production
rules as they feel these techniques are faster.

[8] further state that when increased accuracy is required of a model it is beneficial to
create multiple models using the same data mining technique until the optimal model
is created.

2.1 Data Mining Process

Once the theory behind data mining is covered it is possible to examine the data
mining process and issues associated with this area. An important aspect to examine is
the role of the data miner in this process.


[2;ch1] emphasises that there is often too much focus on the automatic techniques
available for data mining and not enough on the exploration and analysis of the
problem and the data. Also suggested by [2;Ch3] is the importance of the combination
of technical and business skills needed by a data miner. To explain this statement they
use an example of undirected data mining. Undirected data mining could be
described, say the authors, as a set of semi-transparent boxes. This process still
requires interaction as people are required to determine if the resulting patterns
produced during data mining are relevant.


[10;ch1] emphasises that analysts are still required in the data mining process to
assess results and validate predictions made by models. This is necessary as software
lacks the human experience required to differentiate between relevant and irrelevant
conclusions that may be drawn from data mining.

An important point stated by [12;ch2;pg59] is that “It is the analysis of results
provided by a human element that ultimately dictates the success or failure of a data
mining project”.

Data mining consists of a number of activities that result in a solution to a problem.
At the KDD (Knowledge Discovery in Databases) panel Ramakrishnan expressed
concern regarding the amount of time spent analysing data and expressed the need for
a more efficient data mining process. He felt it was necessary to develop a coherent
methodology or ‘tool-kit’ for data mining that would include machine learning,
statistics, databases, information retrieval and multimedia analysis. [8] feels that a
step by step data mining methodology needs to be developed so that not only experts
can conduct data mining and that this methodology should be repeatable for most data
mining projects. These statements show the need for a well defined data mining
process to be used by data miners.

Various sources present different processes but many of the processes have steps in
common.


[12;ch5] introduces the KDD data mining process where emphasis is placed on data
preparation for model building.



Goal identification. Properly identifying goals to be accomplished by the data
mining project helps with domain understanding and determining what is to be
accomplished.

Creating the target data. It is emphasized that at this stage a human expert is
required to choose the initial data for the project.

Data preprocessing in order to deal with noisy data. This stage involves
locating duplicate records in the data set, locating incorrect attributes,
smoothing the data and dealing with outliers in the data set.

Data transformation involves the addition or removal of attributes and
instances, normalizing of data and type conversions.

The actual data mining is the next step and at this stage the model is built from
training and test data sets.

The resulting model is then interpreted to determine if the results it presents
are useful or interesting.

The model or acquired knowledge is then applied to the problem.



[8] states that steps should include:

Problem analysis, when it will be determined whether the problem is suitable
for data mining and what data and technologies are available. Also at this
stage it will be important to determine what will be done with the results of the
data mining to put the problem in context.

Data preparation should be part of the methodology and should cover
transformation and cleansing of the data into its required format for data
mining.

Data exploration allows the miner to discover any errors in the data as well as
getting to terms with what is actually contained in the data.

Pattern generation should follow which involves applying the algorithms and
validating and interpreting the patterns that result.

Pattern deployment should be conducted as decided in the problem analysis
stage.

Pattern Monitoring to ensure that the model reflects shifts in data over time.

Ease of use is another aspect of usability and this should be supported by all
steps of the methodology.



[10] gives a concise methodology for conducting data mining and identifies problems
commonly associated with the process. These include the nature of data in the
database as it is often dynamic, incomplete, noisy or very large. Also problematic is
the adequacy or relevance of stored data as well as errors in the stored data. The
methodology is constructed from a number of steps.



Identification of the business problem and definition of the business goal to be
met by data mining. Proper identification and definition will lead to a solution
that is geared towards measurable outcomes from implementing the resulting
models.

Data processing, the most time consuming step in the process. Processing involves
pre-processing or cleansing of the data, data integration, variable
transformation and splitting or sampling from the database.

Data exploration and descriptive analysis must be conducted next. This allows
the analyst or data miner to discover the unexpected in the data as well as to
confirm the expected.

Data mining solutions involve conducting the data mining using supervised or
unsupervised learning techniques.

Model validation is required in order to confirm the usability of the developed
model. Validation can be conducted using a validation data set and assesses
the quality of the model fit to the data as well as protecting the model from
over- or under-fitting.

Interpretation and decision making conclude the methodology and attempt to
transform the patterns discovered during data mining into knowledge.

[12] present a simple 4 step process model for data mining.



The first step is to assemble the data. The data may be stored in a data
warehouse, database or spreadsheet and needs to be extracted and assembled
before data mining is conducted.



The data is then presented to the data mining software. At this stage it is
necessary to determine whether supervised or unsupervised learning will be
used, what parameters will be used, how the data will be split into test and
build sets and what attributes of the data will be used.



The results of the data mining are interpreted next. This involves determining
whether the results will be useful or are interesting.



The last stage is to apply the results to a new problem.




[13;ch2] presents the steps for data mining in some detail.



Problem definition involves analysis of the business problem as well as the
data mining problem, problem scope definition, model evaluation metrics and
data mining objectives definition. This stage often requires a data availability
study which determines the ability of attributes to predict, what relationships
are hoped to be found, whether patterns or predictions are required as results,
how the data is distributed and how the data is stored.

Data preparation. During this stage all the data that relates to the problem must
be gathered and cleaned. Cleaning involves dealing with null values, outliers,
inconsistencies and table properties. This process can be automated to an
extent by calculating minimum and maximum values for fields, means and
standard deviations and data distributions.

Model building. Before the model is built the data must be randomly separated
into training and testing data sets and over-sampling must be considered in
order to balance the predictions. The data must then be explored to determine
which columns to include in the model. Exploration is assisted using
Visualization techniques and correlation matrices. The data is then passed
through the data mining algorithm and the result is a mathematical model.

Model validation involves the comparison of several data mining models using
the test data set. Validation code and lift charts are used to assist this process.

The validated model is then deployed to assist with business decision making
and as more data is collected the model is updated.



The metadata generated during this process consists of removed columns,
previous models and model effectiveness details.
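
The automated checks mentioned under data preparation in the steps above (minimum
and maximum values, means and standard deviations for a field) can be sketched as
follows. The function names, the use of None to mark a null value and the rule of
flagging values more than two standard deviations from the mean are assumptions made
for illustration and do not come from [13].

```python
import statistics


def profile(values):
    """Summarise one numeric field; None marks a null value."""
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),
    }


def outliers(values, limit=2.0):
    """Return the values lying more than `limit` standard deviations from the mean."""
    summary = profile(values)
    return [v for v in values
            if v is not None
            and abs(v - summary["mean"]) > limit * summary["stdev"]]


# Hypothetical field with one null value and one suspicious entry.
field = [10, 12, None, 11, 13, 12, 11, 10, 13, 90]
summary = profile(field)
suspects = outliers(field)
print(summary)
print(suspects)
```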

Common to all the presented processes are thorough data preparation and exploration,
interpretation and validation of the resulting models and the data mining itself. This
means that any data mining project requires some sort of data preparation and a set of
techniques to validate and evaluate the results of the data mining.

[1;ch2] emphasises the importance of proper data preparation for data mining and
describes the types of data generally encountered during this process and the manner
in which they can be dealt with. [1;ch12] say the benefits of preparing data in this
manner include the creation of more effective models faster.


According to [1;ch2], constants are discarded for the purposes of data mining
preparation as they have no pattern finding value. Empty and missing values are also
explained. An empty value is one that has no real world value whereas a missing
value is one that does exist but is not entered into that data set. These are dealt with in
different ways by different data mining tools.


[4] explains that data cleaning involves checking for physical inconsistencies in the
data, checking for null values and checking for logical inconsistencies. Enrichment
adds information to the data. This information can be in the form of calculated fields
or external data. Transformation can change the data either physically (at the level of
field types) or logically (which changes the granularity of the data).




[15] suggest that data integration should be conducted to remove any naming
inconsistencies amongst the data sources and that data transformation should be
conducted as it results in normalization and aggregation of the data where required.
Data reduction can be conducted to result in a reduced representation of data that
produces the same analytical results as using the entire data set.


[10] state that it is also necessary to determine whether the entire database will be
used for mining or whether samples will be taken from the database. Sampling can be
advantageous as it allows for training and validation data sets to be kept aside, it tends
to be more efficient as it decreases the time spent on data cleansing and exploration
and allows for increased data exploration and visualisation. The latter often leads to
more insight for the data miner and as a result more accurate models.

According to [10] sampling can be conducted in three ways but samples must always
represent the entire database. The first is simple, random sampling where all records
have equal probability of selection. Cluster sampling involves dividing the database
into clusters and then randomly selecting clusters. Stratified random sampling divides
the database into subpopulations and samples are then taken from these in proportion
to subpopulation size.
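
The three sampling approaches can be sketched as below. The record layout, the field
names segment and branch, and the sample sizes are hypothetical and chosen only to
illustrate the definitions given by [10].

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, rng):
    """Every record has the same probability of selection."""
    return rng.sample(records, n)


def cluster_sample(records, key, n_clusters, rng):
    """Group records by `key`, pick whole clusters at random, keep all their records."""
    clusters = defaultdict(list)
    for r in records:
        clusters[r[key]].append(r)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [r for c in chosen for r in clusters[c]]


def stratified_sample(records, key, fraction, rng):
    """Sample each subpopulation in proportion to its size."""
    strata = defaultdict(list)
    for r in records:
        strata[r[key]].append(r)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample


rng = random.Random(42)
# Hypothetical database: 80 retail and 20 corporate customers spread over 5 branches.
records = [{"id": i, "segment": "retail" if i < 80 else "corporate", "branch": i % 5}
           for i in range(100)]

simple = simple_random_sample(records, 10, rng)
clustered = cluster_sample(records, "branch", 2, rng)
stratified = stratified_sample(records, "segment", 0.1, rng)
print(len(simple), len(clustered), len(stratified))
```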

[1] state that at least two outputs are required from data preparation: the training data
set which is used for building the model and the testing data set which helps detect
overtraining (noise trained into the model). These data sets are used by the data mining
tool later in the process.
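
Producing the two data sets can be sketched as a random split. The 70/30 proportion,
the fixed seed and the function name are assumptions made for illustration; [1] do not
prescribe them.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1.0 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for prepared data records
training, testing = split_train_test(records)
print(len(training), len(testing))
```

A model that scores much better on the training set than on the testing set is a
sign of the overtraining that the testing set is meant to detect.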


[12] discuss aspects of evaluating the performance of models built during data
mining. According to these authors, when evaluating performance it is necessary to
consider whether the benefits of data mining outweigh the cost, whether the results of
the data mining can be interpreted and whether the results can be used with
confidence.

2.2 Evaluating the Output of Data Mining

[12] state that evaluation of supervised learning models involves determining the level
of predictive accuracy. [12;7] say that supervised models tend to be evaluated using
test data sets. Such models can be evaluated by comparing the test set error rates of
supervised learning models created from the same training data to determine accuracy
of models and which model to apply. Three methods of comparison can be
implemented. The test sets can be made up of independent randomly selected data.
The same test sets can be used on both models and a pair-wise comparison can be
conducted or the overall correctness of the models compared.

To test classification correctness of the model on the test data set a confusion matrix
is used. A three class example confusion matrix is provided by the text as shown by
[12] below:






          Class 1                        Class 2                        Class 3

Class 1   Count of Class C11 Instances   Count of Class C12 Instances   Count of Class C13 Instances

Class 2   Count of Class C21 Instances   Count of Class C22 Instances   Count of Class C23 Instances

Class 3   Count of Class C31 Instances   Count of Class C32 Instances   Count of Class C33 Instances


The following rules are presented by [12] for interpreting the confusion matrix. The
values on the main diagonal represent correct classifications. Row Class i represents
those instances belonging to Class i and column Class i indicates those instances
classified as Class i. Accuracy of the model is determined by summing the values
along the main diagonal and dividing this result by the number of test set instances.
Confusion matrices are best used for evaluating the accuracy of models using
categorical data.

In the case of the following table, Model A is used to classify categorical data into
two classes, Accept and Reject. The model correctly classified 600 Accept instances
from the data and correctly classified 300 Reject instances. However, there were
actually 625 Accept instances in the data and 375 Reject instances. The model also
classified 675 instances as Accept instances and 325 instances as Reject instances.
The accuracy of the model is then determined by dividing 900 by 1000 and results in
a 90% accuracy or error rate of 10%.

Model A         Computed Accept   Computed Reject

Actual Accept   600               25

Actual Reject   75                300
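
The accuracy calculation for Model A can be reproduced with a short sketch. The
function and variable names are illustrative; the figures are those of the table above.

```python
def accuracy(matrix):
    """Sum of the main diagonal divided by the number of test set instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Rows are actual classes and columns are computed classes (Accept, Reject).
model_a = [
    [600, 25],   # actual Accept
    [75, 300],   # actual Reject
]

actual_totals = [sum(row) for row in model_a]          # instances per actual class
computed_totals = [sum(col) for col in zip(*model_a)]  # instances per computed class
acc = accuracy(model_a)
error_rate = 1.0 - acc
print(actual_totals, computed_totals, acc)
```

The row totals give the 625 actual Accept and 375 actual Reject instances, and the
column totals give the 675 and 325 instances the model assigned to each class.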


[12] state that when evaluating numerical output of models the most common
techniques used are error rates but [4] feels that these techniques are generally
considered to be brute force measurements as they calculate the percentage of correct
predictions. According to [12] and [4], the mean absolute error of a model is the
average absolute difference between computed and desired outcome. The mean
squared error rate is the average squared difference between computed and desired
outcome and the root of this is the Rms (root mean squared error rate).
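
These three error measures follow directly from their definitions, as the sketch below
shows. The function names and the sample values are hypothetical.

```python
import math


def mean_absolute_error(computed, desired):
    """Average absolute difference between computed and desired outcomes."""
    return sum(abs(c - d) for c, d in zip(computed, desired)) / len(desired)


def mean_squared_error(computed, desired):
    """Average squared difference between computed and desired outcomes."""
    return sum((c - d) ** 2 for c, d in zip(computed, desired)) / len(desired)


def root_mean_squared_error(computed, desired):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(computed, desired))


computed = [2.0, 4.0, 7.0, 9.0]   # hypothetical model outputs
desired = [1.0, 4.0, 5.0, 9.0]    # hypothetical correct values

mae = mean_absolute_error(computed, desired)
mse = mean_squared_error(computed, desired)
rms = root_mean_squared_error(computed, desired)
print(mae, mse, rms)
```

Because the differences are squared, the single error of 2 contributes four times as
much to the mean squared error as the error of 1.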

[2;ch7] state that it is possible to measure model performance using a lift or
cumulative gains chart. Since the goal of the model performance is stability it should
hold true when applied to unseen data and this is depicted on a lift chart.
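
The points plotted on a cumulative gains or lift chart can be computed as in the
following sketch. It assumes a model that gives each record a score and a known 0/1
outcome for each record; the function name, the use of four buckets and the sample
data are assumptions and are not taken from [2].

```python
def cumulative_gains(scores, outcomes, buckets=4):
    """For each top-scored share of records, the share of positives captured and the lift."""
    ranked = [o for _, o in sorted(zip(scores, outcomes),
                                   key=lambda pair: pair[0], reverse=True)]
    total_positive = sum(ranked)
    points = []
    for b in range(1, buckets + 1):
        cutoff = len(ranked) * b // buckets
        share = b / buckets
        gain = sum(ranked[:cutoff]) / total_positive
        points.append((share, gain, gain / share))  # (share of records, gain, lift)
    return points


# Hypothetical model scores and actual outcomes (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
outcomes = [1, 1, 0, 1, 0, 0, 1, 0]

for share, gain, lift in cumulative_gains(scores, outcomes):
    print(share, gain, lift)
```

Computing the same points on the training data and on unseen data, and comparing the
two curves, is one way to check the stability that the chart is meant to show.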

[12] say a technique to use when evaluating unsupervised learning models is to
employ supervised learning. The text gives an example of such a technique using
clustering. Once the clustering is performed a cluster is thought of as a class and
assigned a class name. Random samples are chosen from instances of each class in the
same ratio to that of the data set. A supervised model is then built with the class
names as output and the random samples as the training set. The remaining instances
are used to test the accuracy of the clustering model.

[12;ch7] point out that the training data used for model building should also be
evaluated especially when a supervised model shows low accuracy. It is necessary to
confirm that the training data set properly represents the data to be mined.


[15] emphasise that only a small percentage of the patterns extracted from the data
during data mining are interesting from a problem solving point of view. Measures of
interestingness are used to determine the extent of the usefulness of the patterns in
these cases. These measures include such things as whether the pattern is easily
understood, whether the pattern is valid with a degree of certainty, whether the pattern
is potentially useful, whether the pattern is novel, whether it confirms a hypothesis of
some kind and whether it represents knowledge.

[9] suggest that problems with data mining exist in two areas, commercial and
scientific. The commercial area of data mining has inadequate tools and solutions
available and the scientific area needs to attract more contributions from
fields related to data mining.


Usama M. Fayyad made certain observations in [9] and these emphasised two major
problems in data mining. He felt that at the moment too large an amount of time is
spent extracting and manipulating data as opposed to mining and exploring it. He also
felt that the technology was too theoretical at present and that it was becoming
necessary to create a methodology for data mining so that those with less experience
could practise effectively.


3 Data Mining Tools Available and their Implementation


Three sets of data mining tools were researched for this literature survey. They include
Oracle9i’s Data Mining, IBM Intelligent Miner and SQL Server Data Mining and
Analysis Server.



3.1 IBM

[14] emphasises the shift in the field of data mining from standalone technology to
integration with databases and the deployment of data mining in business
applications. [14] states that with their software the data mining is integrated directly
into the database for faster application performance.

According to [3.1] IBM has three products available as part of their DB2 Enterprise
Edition Intelligent Miner suite. These are Intelligent Miner Modelling, Scoring and
Visualisation (which allows for graphical representations of results). The algorithmic
categories supported by Intelligent Miner include Associations Discovery,
Demographic Clustering and Tree Classification.



A SQL interface allows the modelling functions to be embedded into business
applications and the main steps of the data mining process to be carried out.




Procedures are used to discover the prediction model and to detect outliers in the
data as a form of data preparation. These are implemented as DB2 Java stored
procedures.
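Outlier detection as a data preparation step can be illustrated with a minimal sketch. The interquartile-range rule used here is a common textbook technique chosen for illustration; the survey does not say which method the IBM procedures use, and the function name and sample data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    This is a generic rule of thumb, not the method used by Intelligent Miner.
    """
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical column of customer ages with one implausible entry.
ages = [23, 25, 27, 29, 31, 33, 35, 120]
print(iqr_outliers(ages))
```

Records flagged in this way would normally be inspected or removed before the model is built.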


[14] defines scoring as the process of applying the model to new data. IM Scoring is
a database extension, defined by standard SQL extensions, that can run in batch or
real-time mode. It consists of a two-level structure of IM Scoring functions that
apply the model produced by the algorithm; the results of this application are
termed IM Scoring results.
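The difference between batch and real-time scoring can be sketched as follows. This is only an illustration of the idea of applying an existing model to new data: the rule-based model, the field name `complaints` and the function names are all hypothetical and do not represent the IM Scoring SQL functions.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, float]
Model = Callable[[Record], str]


def churn_model(record: Record) -> str:
    """Hypothetical, already-built model: a single hand-written rule."""
    return "churn" if record["complaints"] >= 3 else "stay"


def score_one(model: Model, record: Record) -> str:
    """Real-time style scoring: one new record in, one prediction out."""
    return model(record)


def score_batch(model: Model, records: Iterable[Record]) -> List[str]:
    """Batch style scoring: apply the same model to every new record."""
    return [model(r) for r in records]


new_customers = [{"complaints": 0}, {"complaints": 3}, {"complaints": 5}]
print(score_one(churn_model, new_customers[0]))
print(score_batch(churn_model, new_customers))
```

In both modes the model itself is unchanged; only the way new data is presented to it differs.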

[14] describes IM Modelling as a tool that provides a set of functions as an add-on
service to the database. These functions consist of sets of user-defined tables,
user-defined functions, methods and stored procedures. Modelling is performed by
calling the functions through the SQL API.

According to [14], IM Visualization helps users understand the data mining model.
Java visualizers present model results in the form of visual summaries.

[14] advocates the use of a standard data mining process when using this software.
The steps in the process include data preparation, model building, model testing,
visualizing results and applying the model. IM Modelling decreases the amount of
work required during data preparation, completes the task of model building and
allows for testing. IM Visualization is used to visualize results, and IM Scoring
is used to apply the model.

3.2 SQL Server


[3.2] introduces SQL Server Data Mining. This Microsoft product uses two algorithms
for data mining: the MS Decision Trees Algorithm and the MS Clustering Algorithm.
The MS Decision Trees Algorithm is a classification algorithm that predicts the
values in a column based on the values of other columns. The MS Clustering
Algorithm groups records into clusters.
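The idea of predicting one column from the values of another can be sketched with a one-level decision tree (a "decision stump"). This is a deliberately simplified illustration, not the MS Decision Trees Algorithm; the column names `income` and `buys` and the sample rows are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, feature, target):
    """Find the threshold on `feature` that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [r[target] for r in rows if r[feature] <= threshold]
        right = [r[target] for r in rows if r[feature] > threshold]
        if not right:  # a split must leave rows on both sides
            continue
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(x != l_lab for x in left) + sum(x != r_lab for x in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, l_lab, r_lab)
    if best is None:
        raise ValueError("feature has a single value; no split is possible")
    return best[1:]  # (threshold, label_if_at_or_below, label_if_above)


def predict(stump, value):
    threshold, l_lab, r_lab = stump
    return l_lab if value <= threshold else r_lab


rows = [
    {"income": 20, "buys": "no"},
    {"income": 25, "buys": "no"},
    {"income": 40, "buys": "yes"},
    {"income": 55, "buys": "yes"},
]
stump = fit_stump(rows, "income", "buys")
print(stump, predict(stump, 30))
```

A full decision tree repeats this split recursively on each side and considers every available input column.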


According to [4], the data mining query interface in SQL Server is accessed
primarily through a wizard interface, but it can also be accessed through an OLE DB
command object. The resulting model is stored as part of an object hierarchy in the
Analysis Services directory. Patterns mined from the data are stored in summary
form, in terms of the dimensions, patterns and relationships, so that the
predictive power of the model persists even if the row-level data changes.


The data mining strategies employed in SQL Server, as mentioned by [4], include
self-service (as part of the wizard interface), integration of OLAP and data
mining, and universal data access, which allows results to be shared with multiple
applications and models to be developed from relational or dimensional data
sources.


[4] says that the features available with this data mining suite include the data
mining model, which is described as a relational table with special columns that
can be used to derive patterns. A SQL ‘select’ statement can be used to make
predictions with this feature.



[4] also mentions that this software has DTS (Data Transformation Services), a
query tool for building prediction packages that contain the trained data mining
model and point to the untrained data source.


[13] describes the tools used in the data mining process by SQL Server. Visual
Basic 6.0 is used to view any code required during the process. SQL Server 2000
handles the manipulation, management and storage of the data. Analysis Services
builds the data mining models and makes predictions using the models. Analysis
Services creates an object called the Data Mining Model (DMM) from a mining model
wizard or a programming language. This creation results in a container structure,
similar to a table, that contains descriptions of the columns of data and the
algorithm used. Training the model populates the table with relationships and rules
from the model. Two types of models may be built with these tools: decision trees
and clustering. DTS allows for the importing and transformation of data.


3.3 Oracle


[11] describes the use of the Oracle9i Data Mining Java API and covers building
models, testing models, computing lift and scoring models.

[11;ch1] states that Oracle embeds data mining in the database, which allows for
integration with other database applications. All data mining functions are
provided through the Java API, giving the programmer complete control over data
mining functions.

According to [11;ch2], the Oracle Data Mining (ODM) suite is made up of two
components: the data mining Java API and the Data Mining Server (DMS). The DMS is a
server-side, in-database component that performs the data mining in a way that is
easily available and scalable. The DMS also provides a repository of metadata for
the input and result objects of data mining.

3.3.1 Oracle Algorithms

ODM supports the following algorithms, as stated by [11]:

- Adaptive Bayes Network supporting decision trees (classification)
- Naive Bayes (classification)
- Model Seeker (classification)
- k-Means (clustering)
- O-Cluster (clustering)
- Predictive variance (attribute importance)
- Apriori (association rules)



The choice of algorithm for ODM depends on the data available for mining as well as
the types of results and conclusions required. [11] follows with a discussion of
when each algorithm is applicable, as well as an in-depth explanation of the
workings of each algorithm.
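To give a flavour of how one of the listed algorithms works, the following is a minimal sketch of textbook k-means clustering. It is not the ODM implementation: the starting centroids are supplied by the caller, a fixed number of iterations is used, and the sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points: Sequence[Point], centroids: Sequence[Point], iterations: int = 10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters: List[List[Point]] = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is one reason the choice of algorithm and its settings must be matched to the data.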

According to [11], Model Seeker is a component that automatically builds data
mining models with decreased user input, or allows users to select the models to be
built. Binning of attributes, which many ODM algorithms require, can be automated
or conducted by the user through Model Seeker as part of pre-processing or
algorithm use. Model Seeker also allows for the evaluation of models by testing and
the calculation of lift.
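Lift can be illustrated with a short sketch. The definition assumed here is the usual one: the rate of positive cases among the records the model ranks highest, divided by the rate of positive cases in the whole test set. The function name and the scored test data are hypothetical, and the calculation is not taken from the ODM API.

```python
def lift_at(scored, fraction):
    """Lift for the top `fraction` of records.

    `scored` is a list of (predicted_probability, actual_label) pairs,
    with actual_label equal to 1 for a positive case and 0 otherwise.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    overall_rate = sum(label for _, label in ranked) / len(ranked)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positive cases")
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Hypothetical test set: 10 records, 4 of which are actual positives.
scored = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
    (0.50, 1), (0.40, 0), (0.30, 0), (0.20, 0), (0.10, 0),
]
print(lift_at(scored, 0.2))
print(lift_at(scored, 0.5))
```

A lift above 1 means the model finds positive cases in its top-ranked records more often than random selection would.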

When using Model Seeker, a summary is generated for each model built, which allows
the user to select the best model. The best model is the one with the largest value
for the weighted target positive and total negative relative accuracy. The weight
is set as the relative importance of the positive category.
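The selection rule can be sketched as follows. The exact formula is not given in this survey, so the sketch assumes the figure of merit is a weighted average of the accuracy on actual positives and the accuracy on actual negatives, with the weight expressing the relative importance of the positive category. The function names, model names and confusion-matrix counts are hypothetical.

```python
def figure_of_merit(tp, fn, tn, fp, weight):
    """Assumed figure of merit: weighted average of the accuracy on actual
    positives and the accuracy on actual negatives."""
    positive_accuracy = tp / (tp + fn)
    negative_accuracy = tn / (tn + fp)
    return (weight * positive_accuracy + negative_accuracy) / (weight + 1)


def best_model(summaries, weight):
    """Return the name of the model with the largest figure of merit."""
    return max(summaries, key=lambda name: figure_of_merit(*summaries[name], weight))


# Hypothetical test summaries: (true pos, false neg, true neg, false pos).
summaries = {
    "model_a": (30, 20, 90, 10),
    "model_b": (45, 5, 70, 30),
}
print(best_model(summaries, weight=2))
print(best_model(summaries, weight=0.25))
```

With a weight above 1 the model that finds more of the positive cases is preferred; with a weight below 1 the model that makes fewer mistakes on negative cases is preferred.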


3.3.2 Functionality of Oracle Algorithms and Data Mining

[11] says that mining tasks are available to perform data mining operations,
including building and testing models, computing lift, applying models (scoring)
and importing and exporting models. The following table, taken from [11], provides
a summary of the ODM tasks and when they apply:



Function / Algorithm        Build  Test  Compute Lift  Apply (Score)  Import PMML  Export PMML
Classification              X      X     X             X
Naïve Bayes Classification  X      X     X             X              X            X
Clustering                  X                          X
Association Rules           X                                         X            X
Attribute Importance        X












[6] describes the Data Mining for Java (DM4J) aspect of the data mining suite. DM4J
has wizards that control the preparation and mining of data as well as the
evaluation and scoring of models. It can automatically generate Java and SQL code
to transfer the data mining into integrated data mining or business intelligence
applications.




[15;ch10] provides some advice on points to be considered when choosing a data
mining system. According to the authors, aspects to consider include the data types
supported by the system; typical system issues, for example platforms and operating
systems; the data sources required; the functions and data mining methodologies
available; the extent of coupling with the database or data warehouse; scalability
(row and column); the visualization tools available; the data mining query
languages available with the system; and the GUIs available.

4 Conclusion

It can be concluded that data mining must not be thought of as an automated process
but as one that requires the insight and experience of the data miner. To ensure
that this insight is included in a data mining project, an extensive methodology or
process must be used to conduct data mining. This process is often supported by the
various software tools available for data mining, but these still require extensive
user input.










References:

[1] Data Preparation for Data Mining, Dorian Pyle, San Francisco, California,
Morgan Kaufmann, 2000.


[2] Mastering Data Mining: The Art and Science of Customer Relationship
Management, Michael J.A. Berry and Gordon S. Linoff, USA, Wiley Computer
Publishing, 2000.


[3.1] The IBM Home Page. 15/04/04, IBM. <http://www-306.ibm.com/software/data/iminer/>.


[3.2] The Microsoft Home Page. 5/05/04, Microsoft.
<http://www.microsoft.com/sql/evaluation/features/datamine.asp>.




[4] Data mining in SQL Server 2000, Barry de Ville, 01/01, InstantDoc #16175, SQL
Server Magazine, <http://www.winnetmag.com/Article/ArticleID/16175/16175.html>.



[5] The Oracle Home Page. 06/04/04, Oracle.
<http://www.oracle.com/ip/deploy/database/oracle9i/collateral/index.html?bi_dm_faq.html>.


[6] The Oracle Home Page. 24/03/04, Oracle.
<http://otn/oracle.com/products/bi/odm/9idm4jv2.html>.



[7] The Oracle Home Page. 06/04/04, Oracle.
<http://otn/oracle.com/products/bi/pdf/9idm_bup.pdf>.


[8] White Paper: Data Mining - Beyond Algorithms, Dr Akeel Al-Attar, 2004,
<http://www.attar.com/tutor/mining.htm>.

[9] Summary from the KDD-03 Panel -- Data Mining: The Next 10 Years, Usama M.
Fayyad, Gregory Piatetsky-Shapiro, Ramasamy Uthurusamy, SIGKDD Explorations,
Volume 5, Issue 2, Page 191, <http://www.acm.org/sigkdd/kdd2003/panels.html>.


[10] Data Mining Using SAS Applications, George Fernandez, USA, Chapman and
Hall/CRC, 2003.

[11] Oracle9i Data Mining Concepts, Release 2 (9.2), Oracle Home Page,
<http://www.lc.leidenuniv.nl/awcourse/oracle/datamine.920/a95961/preface.htm>.


[12] Data mining: a tutorial-based primer, Richard J. Roiger and Michael W.
Geatz, Boston, Massachusetts, Addison Wesley, 2003.

[13] Preparing and Mining Data in Microsoft SQL Server 2000 and Analysis
Services, Seth Paul, Nitin Gautam, Raymond Ballint, published September 2002,
updated January 2003.

[14] Enhance your Application: Simple Integration of Advanced Data Mining
Functions, Corinne Baragoin, Ronnie Chan, Helena Gottschalk, Gregor Meyer, Paulo
Pereira, Jaap Verhees, 2002, <http://www.redbooks.ibm.com/redbooks/SG246879.html>.

[15] Data mining: concepts and techniques, Jiawei Han and Micheline Kamber,
San Francisco, California, Morgan Kaufmann, 2001.




