EVALUATION OF DATA MINING


Arisekola Akanbi

EVALUATION OF DATA MINING

Thesis
CENTRAL OSTROBOTHNIA UNIVERSITY OF APPLIED SCIENCES
Degree Programme in Information Technology
May 2011




Thesis Abstract

Department: Technology and Business
Date: May 2011
Author: Arisekola Akanbi
Degree Programme: Degree Programme in Information Technology
Thesis Topic: Evaluation of data mining
Instructor: Kauko Kolehmainen
Supervisor: Kauko Kolehmainen
Pages: 63 + APPENDIX
Key words: Data, data mining, information and knowledge

Developments in networking, processor and storage technologies have increased the amount of data flowing into organizations and led to the creation of mega databases and data warehouses to handle the bulk of transactional data in digital form. This has created a pressing need to develop processes and tools that can explicitly analyze such data, extracting valuable trends and correlations and generating interesting information that yields knowledge from the data.

Data mining is the technology that meets the challenge of extracting knowledge from these vast data burdens. It provides a user-oriented approach to discovering novel, hidden patterns in data. Important disciplines ranging from machine learning, information retrieval and statistics to artificial intelligence have had an impact on the development of data mining. Given the geometric increase in data flow, ever more advanced and sophisticated information can be expected to lie hidden in datasets.

The goal of the thesis was to evaluate data mining in theory and in practice. An overview of database systems, data warehousing, data mining goals, applications and algorithms was carried out, and data mining tools were also reviewed.

Microsoft SQL Server 2008, in conjunction with the Excel 2007 data mining add-ins, was used to demonstrate data mining tasks in practice, using data samples from the Microsoft AdventureWorks database and the Weka repository. In conclusion, the results of the tasks performed with the Microsoft Excel data mining add-ins revealed how reliable, easy and efficient data mining can be.




TABLE OF CONTENTS

1 INTRODUCTION
2 DATABASE SYSTEMS
  2.1 Databases
  2.2 Relationship between data mining and data warehousing
  2.3 Data warehousing
3 DATA MINING
  3.1 Brief history and evolution
  3.2 Knowledge discovery in databases
  3.3 Knowledge discovery process models
  3.4 The need for data mining
  3.5 Data mining goals
  3.6 Applications of data mining
    3.6.1 Marketing
    3.6.2 Supply chain visibility
    3.6.3 Geospatial decision making
    3.6.4 Biomedicine and science application
    3.6.5 Manufacturing
    3.6.6 Telecommunications and control
4 DISCOVERED KNOWLEDGE
  4.1 Association rules
    4.1.1 Association rules on transactional data
    4.1.2 Multilevel association rules
  4.2 Classification
    4.2.1 Decision tree
  4.3 Clustering
  4.4 Data mining algorithms
    4.4.1 Naïve Bayes algorithm
    4.4.2 Apriori algorithm
    4.4.3 Sampling algorithm
    4.4.4 Frequent-pattern tree algorithm
    4.4.5 Partition algorithm
    4.4.6 Regression
    4.4.7 Neural networks
    4.4.8 Genetic algorithm
5 APPLIED DATA MINING
  5.1 Data mining environment
  5.2 Installing the SQL server
  5.3 Data mining add-ins for Microsoft Office 2007
  5.4 Installing the add-ins for Excel 2007
  5.5 Connecting to the analysis service
  5.6 Effect of the add-ins
    5.6.1 Analyze key influencers
    5.6.2 Detect categories
    5.6.3 Fill from example tool
    5.6.4 Forecast tool
    5.6.5 Highlight exceptions tool
    5.6.6 Scenario analysis tool
    5.6.7 Prediction calculator
    5.6.8 Shopping basket analysis
6 ANALYSIS SCENARIO AND RESULT
7 CONCLUSION
REFERENCES
APPENDIX


1 INTRODUCTION


It is an established fact that we live in an information-technology-driven society, where knowledge is an invaluable asset to any individual, organization or government. Companies are supplied with huge amounts of data on a daily basis, and they need to focus on refining these data so as to get the most important and useful information into their data warehouses. The need for a technology to help solve this quest for information has been on the research and development front for several years now.

Data mining is a relatively new technology which can be used to extract valuable information from the data warehouses and databases of companies and governments. It involves the extraction of hidden information from huge datasets. It helps in detecting anomalies in data and predicting future patterns and behaviour in a highly efficient way.

Data mining is implemented using tools, and the automated analysis provided by these tools goes beyond the evaluation of datasets to providing tangible clues that human experts would not have been able to detect, because they have never experienced or expected such patterns. Applying data mining makes it easier for companies and governments to reach quality decisions from available data which would otherwise have taken a much longer time based on human expertise alone.

Data mining techniques can be applied in a wide range of organizations, so long as they deal with collecting data, and there are several data mining software packages available on the market today to help companies tackle decision-making problems and thereby overcome competition from other companies in the same business.

The goal of this thesis work is the evaluation of data mining in theory and in practice, and the thesis could also be used for academic purposes. It gives an overview of database systems and data warehousing, provides insight into data mining as a field, and tries out some of the data mining tools used to accomplish the process. Achieving this objective involves reviewing the main algorithms employed during data mining by most data mining tools and carrying out some scenario analysis to demonstrate the process using one or more data mining tools.


The tool used in this work is Microsoft SQL Server 2008, in conjunction with the Excel 2007 data mining add-ins. However, this tool only uses data that has already been collected and prepared, because it basically models and analyzes ready data. The other data mining tools tried were IBM Intelligent Miner, Tanagra data miner, and Weka.


The contents of this work start with an overview of database systems and databases, these being the root technologies that led to data mining by way of evolution. Then there is a brief literature review on data warehousing and its relation to data mining, since all useful data collected by organizations are kept there before they can be subjected to any further mining or analysis prior to decision making. There is an overview of data mining as a field: its evolution, what motivated its coming into existence, data mining objectives and the process of knowledge discovery in databases.

The knowledge discovered is then placed into classes based on the method used and the outcome. The main algorithms employed during the data mining process are also analyzed, some with example citations and graphs to illustrate the algorithm. Some of the numerous possible applications of data mining are also discussed.

Applied data mining using the Microsoft Office Excel 2007 data mining add-ins is also discussed, ranging from the SQL Server and the add-ins to the effect of the add-ins in the form of tools and algorithms employed during the analysis of ready data. For emphasis, the objective of this work is to evaluate data mining in theory and practice.





2 DATABASE SYSTEMS


In this chapter, an overview of database systems, their evolution, databases, data warehousing, and the relationship between data warehousing and data mining will be given.

Database understanding would be incomplete without some knowledge of the major aspects which constitute the building blocks and framework of database systems, and these fields include Structured Query Language (SQL), Extensible Markup Language (XML), relational database concepts, object-oriented concepts, clients and servers, security, Unified Modeling Language (UML), data warehousing, data mining and emerging applications. The relational database idea was put forward to separate the storage of data from its conceptual depiction, and hence to provide a logical foundation for content storage. The birth of object-oriented programming languages brought about the idea of object-oriented databases, even though they are not as popular as relational databases today. (Elmasri & Navathe 2007, 23.)

Adding and retrieving information from databases is fundamentally achieved by the use of SQL, while interchanging data on the web was made possible and enhanced by publishing languages like HyperText Markup Language (HTML) and XML. The data are kept in web servers so that other users can have access to them, and subsequent development and research into database management led to the realization of data warehousing and data mining, even though both had been applied before the names were coined. (Elmasri & Navathe 2007, 23.)



XML is a standard proposed by the World Wide Web Consortium. It helps users to describe and store structured or semi-structured data and to exchange data in a platform- and tool-independent way. It helps to implement and standardize communication between knowledge discovery and database systems. The Predictive Model Markup Language (PMML), a standard based on XML, has also been identified as enhancing interoperability among different mining tools and achieving integration with other applications ranging from database systems to decision support systems. (Cios, Pedrycz, Swiniarski & Kurgan 2007, 20.)


The Data Mining Group (DMG) developed PMML to represent analytic models. It is supported by leading business intelligence and analytics vendors like IBM, SAS, MicroStrategy, Oracle and SAP. (Webopedia 2011.)

Database systems can be classified as OLTP (on-line transaction processing) systems and decision support systems, such as warehouses, on-line analytical processing (OLAP) and mining. Archives of data from OLTP form decision support systems, which have the aim of learning from past instances. OLTP involves many short, update-intensive commands and is the main function of relational database management systems. (Mitra & Acharya 2006, 24.)




2.1 Databases

A database is a well structured aggregation of data that are associated in a meaningful way and that can be accessed in various logical ways by several users. Database systems are systems in which the translation and storage of data are of paramount value. The requirement for data to remain optimally usable by many users after several years characterizes database systems. (Sumathi & Esakkirajan 2007, 2.)

A database is sometimes abbreviated as DB. It is a collection of organized data arranged so that a computer program can quickly and easily select required parts of the data; it can be thought of as an electronic filing system. (Webopedia 2011.)

A traditional database is organized into fields, records and files, where a field is a single piece of information, a record is a complete set of fields, and a file is a collection of records. A database management system is needed to access data or information from a database. (Webopedia 2011.) Graph 1 below is a simple representation of how information technology and databases have evolved over time towards data mining.

[Figure: a flowchart of the stages listed below.]

·  Data collection and database creation (1960s and earlier): primitive file processing.
·  Database management systems (1970s to early 1980s): hierarchical and network database systems; relational database systems; data modeling tools such as entity-relationship models; indexing and accessing methods (B-trees, hashing, etc.); query languages such as SQL; user interfaces, forms and reports; query processing and query optimization; transactions, concurrency control and recovery; on-line transaction processing (OLTP).
·  Advanced database systems (mid-1980s to present): advanced data models (extended relational, object-relational, etc.); advanced applications (spatial, temporal, multimedia, active, stream and sensor, scientific and engineering, knowledge-based).
·  Web-based databases (1990s to present): XML-based database systems; integration with information retrieval; data and information integration.
·  Advanced data analysis: data warehousing and data mining (late 1980s to present): data warehousing and knowledge discovery (generalization, classification, association, clustering, frequent-pattern and structured-pattern analysis, outlier analysis, trend and deviation analysis, etc.); advanced data mining applications (stream data mining, bio-data mining, time-series analysis, text mining, web mining, intrusion detection, etc.); data mining and society (privacy-preserving data mining).
·  New generation of integrated data and information systems (present to future).

GRAPH 1. Evolution of database technology (adapted from Han, Kamber & Pei 2005, 2)

2.2 Relationship between data mining and data warehousing

There has been explosive growth in database technology and in the amount of data collected. Developments in data collection methods, bar code usage and the computerization of transactions have provided us with enormous amounts of data. The huge size of the data and the great computation involved in knowledge discovery hamper the ability to analyze the readily available data in order to extract more intelligent and useful information, while data mining is all about enhancing decision making and predictions, or the process of data-driven extraction of not so obvious but useful information from large databases. (Sumathi & Sivanandam 2006, 5.)

Today's competitive marketplace challenges even the most successful companies to protect and retain their customer base, manage supplier partnerships and control costs while at the same time increasing their revenue. These and other information management processes are only achievable if the information managers have accurate and reliable access to the data, hence prompting the need for the creation of data warehouses. (Sumathi & Sivanandam 2006, 6.)


Interestingly, data warehousing provides online analytical processing tools for the interactive analysis of multidimensional data of varied granularities, which enhances data mining, and mining functions such as prediction, classification and association can be integrated with OLAP operations, thus enhancing the mining of knowledge. This buttresses the fact that data warehouses have become an increasingly important platform for data analysis and data mining. (Sumathi & Sivanandam 2006, 6.)


2.3 Data warehousing

A data warehouse is an enabled relational database system designed to support very large databases at a significantly higher level of performance and manageability. It is an environment, not a product. (Han et al. 2001, 39.)

It is an architectural construct of information that is difficult to reach or present in traditional operational data stores. A data warehouse is also referred to as a subject-oriented, integrated, time-variant and non-volatile collection of data which supports the management decision-making process. Subject-oriented means that all tangible, relevant data pertaining to a subject are collected and stored as a single set in a useful format. (Han et al. 2001, 40.)

Integrated relates to the fact that data are stored in a globally accepted style with consistent naming trends, measurement, encoding structures and physical features, even when the underlying operational systems store the data differently. Non-volatile simply implies that data in a data warehouse are in a read-only state, and hence can always be found and accessed in the warehouse. Time-variant denotes the period the data have been available, because such data usually have long-term states. (Han et al. 2001, 41.)

The process of constructing and using data warehouses is called data warehousing. Data warehouses comprise consolidated data from several sources, augmented with summary information and covering a long time period. They are much larger than other kinds of databases, with sizes ranging from several gigabytes to terabytes. Typical workloads involve ad hoc, fairly complex queries, and fast response times are important. (Ramakrishnan & Gehrke 2003, 679.)

OLAP, however, is a basic function of a data warehouse system. It focuses on data analysis and decision making based on the content of the data warehouse, and it is subject-oriented, implying that it is organized around a certain main subject. It is built by integrating multiple, heterogeneous data sources like flat files, on-line transaction records and relational databases in a given format. (Mitra & Acharya 2006, 24.)

Data cleaning, integration and consolidation techniques are often employed to ensure consistency in nomenclature, encoding structures, attribute measures and much more among different data sources, and they can be viewed as an important preprocessing step for data mining. Data warehouses primarily provide information from a historical perspective. (Mitra & Acharya 2006, 26.)

Every key structure in the data warehouse contains some atom of time, explicitly or implicitly, even though the keys of operational data may not contain the time atom. Data warehouses neither need to be updated operationally nor require transaction processing, recovery or concurrency control systems; all they need is the initial addition of the data and access to it. (Mitra & Acharya 2006, 25.)

3 DATA MINING


The term data mining could be perceived to have derived its name from the similarities between searching for valuable information in a large database and mining a mountain for a vein of valuable ore: both processes require either sifting through an immense amount of material or intelligently probing it to find where the value resides. (Lew & Mauch 2006, 16.)

Data mining, also termed knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing proactive, knowledge-driven decisions to be made. They scour databases for hidden patterns and predictive information that experts may miss because it lies outside their expectations. (Lew & Mauch 2006, 16-17.)

The Gartner Group refers to data mining as "the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques". (Larose 2005, 2.)


3.1 Brief history and evolution

Management Information Systems (MIS) in the 1960s and Decision Support Systems (DSS) in the 1970s did a lot by providing large amounts of data for processing and execution, but unfortunately what was obtained from these systems was more mere data than information, and that was not enough to enhance business activities and decisions. (Mueller & Lemke 2003, 5.)


A summary of the trend of data mining evolution, from the data collection stage to the data mining stage, can be seen in Table 1 below.


TABLE 1. Steps in the evolution of data mining (adapted from Thearling 2010)

Evolutionary step | Business question | Enabling technologies | Characteristics
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | Retrospective, static data delivery
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month?" | Advanced algorithms, multiprocessor computers, massive databases | Prospective, proactive information delivery


Data mining could be viewed as having evolved from fields with an age-long history, though the term was only introduced in the 1990s. It can be traced back to major families of origin which include classical statistics, artificial intelligence, database systems, pattern recognition and machine learning. Basically, without statistics there would be no data mining, because statistics is the bedrock upon which most data mining techniques are based. (Data mining software 2011.)

The classical statistics concept studies data and data relationships; hence it plays a significant role at the heart of today's data mining tools. Artificial intelligence, on the other hand, was built upon heuristics as opposed to classical statistics; it applies human-thought-like processing to statistical problem domains. Machine learning can be perceived as a merger of artificial intelligence and advanced statistical analysis, since it lets computer programs learn about the data they process. (Data mining software 2011.)

Machine learning prompts programs to make decisions based on the data studied, using statistics and advanced artificial intelligence heuristics and algorithms. Traditional query and report tools have been used to describe and extract what is in a database: the user forms a hypothesis about a relationship and verifies or discounts it with a series of queries against the data. (Data mining software 2011.)


3.2 Knowledge discovery in databases



Data will not fulfil their potential if they are not processed into information, after which knowledge can be gained from them through further processing. Knowledge discovery involves a process that yields new knowledge; it details the sequence of steps (data mining inclusive) that ought to be adhered to in order to discover knowledge in data, where each step is achieved using some software tools. Knowledge discovery is a nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data in databases. It involves several steps, and each attempts to discover some knowledge using some knowledge discovery method. (Han et al. 2005, 10.)

Knowledge discovery entrenches the whole knowledge exploration process, ranging from data storage and access, analysis of large data volumes using efficient and scalable algorithms, and interpretation and visualization of results, to human-machine interaction modeling and support. (Han et al. 2005, 11.) A unique feature of the model is the definition of input and output states of data, because the output of a step is used as the input of a subsequent step in the process; at the end, the output is the discovered knowledge, portrayed in terms of patterns, rules, classifications, associations, trends and statistical analysis. (Han et al. 2005, 11.)


3.3 Knowledge discovery process models

The knowledge discovery process has been placed into two main models, called the Fayyad et al. (academic) model and the Anand and Buchner (industrial) model. The Fayyad model is represented below:

[Figure: the knowledge discovery chain Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns -> (Interpretation/Evaluation) -> Knowledge.]

GRAPH 2. Knowledge discovery process (adapted from Fayyad, Piatetsky & Smyth 1996)


Developing and understanding the application domain entails learning the useful beforehand knowledge and the end user's aim for the discovered knowledge. The next phase is creating a target data set, which involves querying the existing data to select the desired subset, by selecting subsets of attributes and data points to be used for the task. (Han et al. 2005, 14.)


Data cleaning and preprocessing entails eradicating outliers, handling noise and missing values in data, and accounting for time sequence information and known changes. It leads to data reduction and projection, which consists of finding valuable attributes by utilizing dimension reduction and transformation methods, and discovering invariant representations of the data. (Han et al. 2005, 15.)

Choosing the data mining task involves matching the relevant prior knowledge and the objective of the user with a specific data mining method. Choosing the data mining algorithm basically involves selecting methods to search for patterns in the data and deciding which models and parameters of the method are appropriate. (Han et al. 2005, 15-16.)

Data mining, being the next phase, involves pattern generation in a particular representational form, such as classification rules, decision trees, etc. Consolidation of discovered knowledge involves incorporating the discovered knowledge into the performance system, and documenting and reporting it. (Han et al. 2005, 15-16.)

The industrial model, also tagged the CRISP-DM knowledge discovery process, is summarized in the graphs below.




GRAPH 3 (cont.). Phases of the CRISP-DM process model (adapted from CRISP 2011)

1. Business understanding: determination of business objectives; assessment of the situation; determination of data mining goals; generation of a project plan.
2. Data understanding: collection of initial data; description of data; exploration of data; verification of data quality.
3. Data preparation: data selection; data cleansing; construction of data; integration of data; data substeps formatting.
4. Modeling: selection of modeling technique; generation of test design; creation of models; assessment of generated models.
5. Evaluation: evaluation of results; process review; determination of the next step.
6. Deployment: plan deployment; plan monitoring and maintenance; generation of final report; review of the process substeps.

GRAPH 4. Details of the CRISP process model (adapted from CRISP 2011)


3.4 The need for data mining

The achievements of the digital revolution and the escalation of the internet have brought about a great amount of multi-dimensional data in virtually all human endeavors, with data types ranging from text, image, audio, speech and hypertext to graphics and video, thus providing organizations with a great deal of data. However, the whole of the data might not be useful if it does not provide the tangible, unique information that could be utilized in solving a problem. The quest to generate information from existing data prompted the need for data mining. (Mitra & Acharya 2003, 2.)

The impressive development of data mining can be attributed to the conglomeration of several factors, such as the explosive growth in data collection, the storing of data in warehouses (thus enhancing the accessibility and reliability of databases), the availability of increased access to data from the internet and intranets, the strong quest to raise market share globally, the evolution of mining software, and the awesome development in computing power and storage ability. (Larose 2005, 4.)


3.5 Data mining goals

Data mining is basically done with the aim of achieving certain objectives, which range from classification, prediction and identification to optimization.

·  Classification: this involves allocating data into classes or categories as a result of combining yardsticks.

·  Prediction: mining in this instance helps to single out features in data and their tendencies over time.

·  Identification: trends or patterns in certain data can help to identify the existence of items, events or actions in a given scenario or case.

·  Optimization: mining also facilitates the optimization of the use of scarce resources, in turn maximizing output variables within constraint conditions. (Elmasri & Navathe 2007, 947.)




3.6 Applications of data mining

The traditional approach to data analysis for decision making used to involve furnishing domain experts with statistical modeling techniques so as to develop hand-crafted solutions for specific problems, but the influx of mega data having millions of rows and columns, the spontaneous construction and deployment of data-driven analytics, and the demand by users for easily readable and understandable results have prompted the inevitable need for data mining. (Sumathi & Sivanandam 2006, 166.) Data mining technologies are deployed in several decision-making scenarios in organizations. Their importance cannot be overemphasized, as they are applicable in several fields, some of which are discussed below.


3.6.1 Marketing

This involves the analysis of customer behavior in purchasing patterns; the determination of market strategies varying from advertising to location; targeted mailing; segmentation of customers, products and stores; and catalog design and advertisement strategy. (Elmasri & Navathe 2007, 970.)



3.6.2 Supply chain visibility

Companies have automated portions of their supply chain, enabling the collection of significant data about inventory, supplier performance, logistics of materials and finished goods, material expenditures, and the accuracy of plans for order delivery. Data mining applications also span price optimization and workforce analysis in organizations. (Sumathi & Sivanandam 2006, 169.)



3.6.3 Geospatial decision making

In climate data and earth ecosystem scenarios, data mining supports the automatic extraction and analysis of interesting patterns, involving modeling ecological data and designing efficient algorithms for finding spatiotemporal patterns in the form of tele-connection patterns, or recurring and persistent climate patterns. This operation is usually carried out using the clustering technique, which divides the data into meaningful groups, helping to automate the discovery of tele-connections. (Sumathi & Sivanandam 2006, 174.)


3.6.4 Biomedicine and science application

Biology used to be a field dominated by an attitude of "formulate hypothesis, conduct experiment, evaluate results", but under the impact of data mining it has evolved into a field with a big-science attitude of "collect and store data, mine for new hypotheses, confirm with data or supplemental experiments". (Sumathi & Sivanandam 2006, 170.)


It also includes the discovery of patterns in radiological images, the analysis of microarray (gene-chip) experimental data to cluster genes and to relate them to symptoms or diseases, and the analysis of the side effects and effectiveness of certain drugs. (Sumathi & Sivanandam 2006, 171.)


3.6.5 Manufacturing

The applications in this area relate to optimizing the resources used in the optimal design of manufacturing processes, and to product design based on customers' feedback. (Elmasri & Navathe 2007, 970.)



3.6.6 Telecommunications and control

Data mining is applied to the vastly available high volume of data consisting of call records and other telecommunication-related data, which in turn is applied in toll-fraud detection, consumer marketing and improving services. (Sumathi & Sivanandam 2006, 178.) Data mining is also applied in security operations and services, information analysis and delivery, text and web mining instances, banking and commercial applications, as well as insurance. (Han et al. 2005, 456.)



4 DISCOVERED KNOWLEDGE


Data can only be useful when they are converted into information, and they become paramount when some knowledge is gained from the generated information; this is the most vital phase of data handling in any setup that deals with decision making. The knowledge obtained can be inductive or deductive, where deductive knowledge deduces new information by applying pre-specified logical rules to some data. Inductive knowledge is the form of knowledge referred to where data mining is concerned, as it discovers new rules and patterns from the given data. (Elmasri & Navathe 2007, 948.) The knowledge acquired from data mining is classified in the forms below, though knowledge can also result from a combination of any of them:

·  Association rules: these correlate the presence of a set of items with a range of values for another set of variables.

·  Classification hierarchies: these aim at progressing from an existing set of transactions or actions to generate a hierarchy of classes.

·  Sequential patterns: these seek some form of sequence in events or activities.

·  Patterns within time series: this involves detecting similarities within positions of a time series of data, that is, sets of data obtained at regular intervals.

·  Clustering: this relates to the segmentation of a given collection of items or actions into sets of similar elements. (Elmasri & Navathe 2007, 949.)


4.1 Association rules

Association rule mining deals with unsupervised data, as it finds interesting associations, dependencies and relationships in vast data itemsets. These items are kept as transactions that can be created by an external process or fetched from data warehouses or relational databases. Given the scalable nature of association rule algorithms and the ever-increasing size of accumulating data, the use of association rules for knowledge extraction is somewhat inevitable, as discovering interesting associations gives a source of information that is used for making decisions. (Han et al. 2005, 256.)


Association rules are applied in areas such as market-basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, data preprocessing, genomics, etc. Market-basket analysis is the most intuitive application of association rules, as it strives to analyze customer tendencies by finding associations between items purchased by customers. (Han et al. 2005, 264-270.)


An example of the application of association rules is shown in the graph below, where a sales transaction table is used to identify items which are often bought together, so as to be able to make cross-selling decisions during discount sales periods. (Maclennan, Tang & Crivat 2009, 7.) It simply shows that customers who buy milk also buy cheese, and customers who buy cheese could also buy wine; likewise customers buy either Coke or Pepsi together with juice, and the same applies to customers buying beer or wine, cake or donuts.

[Figure: the items MILK, CHEESE, WINE, BEER, COKE, PEPSI, JUICE, CAKE, DONUT and BEEF, linked according to how often they are bought together.]

GRAPH 4. Items association (adapted from MacLennan et al. 2009, 7)


4.1.1 Association rules on transactional data

Items are denoted as Boolean variables, while a collection is denoted as a Boolean vector; the vector is analyzed to determine which variables are frequently taken together by different users, or in other words associated with each other. These co-occurrences are represented in association rules written as:

LHS => RHS [support, confidence]

The left-hand side (LHS) implies the right-hand side (RHS) with a given value of support and confidence.

Support and confidence are used to determine the quality of a given rule in terms of its usefulness (strength) and certainty. Support denotes how many instances (transactions) from the data set that was used to generate the rule include items from both the left-hand side and the right-hand side, while confidence expresses how many instances that include items from the left-hand side also include items from the right-hand side; the measured values are given as percentages. (Han et al. 2005, 290-291.)
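
As a minimal illustrative sketch (not part of the thesis demonstration), the following Python fragment computes support and confidence for a toy rule {milk} => {cheese} over a handful of invented transactions:

    # Toy transactions; items and counts are invented for illustration.
    transactions = [
        {"milk", "cheese", "bread"},
        {"milk", "cheese"},
        {"milk", "juice"},
        {"beer", "wine"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    lhs, rhs = {"milk"}, {"cheese"}
    rule_support = support(lhs | rhs)                # 2/4 = 50 %
    rule_confidence = rule_support / support(lhs)    # 0.50 / 0.75 = 67 %
    print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")

Here the rule would be reported as milk => cheese [50 %, 67 %].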

An association rule is interesting if it satisfies minimum values of confidence and support, which are stipulated by the user. Association rules are derived when data describe events that occur at the same time or in close proximity. The main association rule types are single-dimensional and multidimensional, and both can be grouped as either Boolean or quantitative. (Han et al. 2005, 292.)

The Boolean case relates to the presence or absence of an event or item, while the quantitative case considers values which are partitioned into item intervals. Basically, the data to be used in any mining applying association rules ought to be given in a transactional form, and hence should consistently have a transaction identification (ID) and information about all items. (Han et al. 2005, 292.)



4.1.2 Multilevel association rules

Finding association rules at low levels in cases where items form a hierarchy can be difficult. Such association rules can instead be found at higher levels, existing as established knowledge. Multilevel association rules are created by performing a top-down, iterative deepening search: in essence, by first finding strong rules at the high levels of the hierarchy before searching for lower-level, weaker rules. (Han et al. 2005, 244.)

The main methods for multilevel association rules fall into the following classes. The uniform-support-based method involves the same minimum support threshold being applied to create rules at all levels of abstraction. The reduced-support-based method primarily fixes the shortcomings of the uniform support method, since every level of abstraction is supplied with its own minimum support threshold, and the lower the level, the smaller the threshold. (Han et al. 2005, 244-245.)

The level-by-level independent method involves a breadth-first check being carried out, whereby every node in the hierarchy is checked regardless of its parent node's frequency. Level cross-filtering by single item involves checking items at a given level only if their parents at the previous level are frequent. Level cross-filtering by k-itemset is a method in which a k-itemset at a given level is checked only if its parent k-itemset at the previous level is frequent. (Han et al. 2005, 246.)


4.2 Classification

Classification is the process of learning a model which describes different classes of data, where the classes are usually predetermined. It is also known as supervised learning, as it involves building a model which can be used to classify new data. The process begins with already classified data, usually called the training set, where each training record contains an attribute called the class label. (Elmasri & Navathe 2007, 961.) It is essentially the act of splitting up objects so that each one is assigned to one of a number of mutually exhaustive and exclusive categories known as classes. (Bramer 2007, 23.)


Classification basically involves finding a set of models or functions which describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose label is unknown. Classification models may operate on rules involving decision trees, neural networks, mathematical rules or formulae, or probability. (Han et al. 2005, 280.)

A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch stands for an outcome of the test, and the tree leaves represent classes or class distributions. (Han et al. 2005, 282.) The approach to classification that uses probability theory to find the most likely classification is known as Naïve Bayes. (Bramer 2007, 24.)


4.2.1 Decision tree

A decision tree is a flow-chart-like tree figure used to represent the information in given data, where each internal node represents a test on an attribute, every branch denotes the outcome of a test, and the leaf nodes represent classes. The uppermost node in a tree is the root node. (Han et al. 2005, 284-286.)

A typical decision tree is shown below, where the tree represents magazine subscription by people, with information on their age, car type, number of children if any, and their subscription. Internal nodes are denoted by rectangles, leaf nodes by ovals. The table below gives sample data for the subscription.



TABLE 2. Sample training data for magazine subscription (adapted from Ye 2003, 5)

ID | Age | Car    | Children | Subscription
 1 |  23 | Sedan  |    0     | Yes
 2 |  31 | Sports |    1     | No
 3 |  36 | Sedan  |    1     | No
 4 |  25 | Truck  |    2     | No
 5 |  30 | Sports |    0     | No
 6 |  36 | Sedan  |    0     | No
 7 |  25 | Sedan  |    0     | Yes
 8 |  36 | Truck  |    1     | No
 9 |  30 | Sedan  |    2     | Yes
10 |  31 | Sedan  |    1     | Yes
11 |  45 | Sedan  |    1     | Yes


[Figure: the root node tests Age (<= 30 or > 30); the internal nodes (rectangles) test Car Type (Sedan versus Sports, Truck) and Number of Children (0 versus > 0), and the leaves (ovals) carry the Yes/No subscription classes.]

GRAPH 5. Decision tree for magazine subscription (adapted from Ye 2003, 6)
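
To make the flow-chart reading of the tree concrete, the Python fragment below encodes a tree of this shape as nested dictionaries and classifies a record by walking from the root to a leaf. The split values here are assumptions read off the figure for illustration, not a tree trained on Table 2:

    # Hypothetical tree in the shape of GRAPH 5; leaves are class labels.
    tree = {"attr": "age",
            "le30": {"attr": "car", "Sedan": "Yes", "other": "No"},
            "gt30": {"attr": "children", "zero": "No",
                     "pos": {"attr": "car", "Sedan": "Yes", "other": "No"}}}

    def classify(record, node=tree):
        # Walk the tree, testing one attribute per internal node.
        if isinstance(node, str):
            return node                      # reached a leaf (class label)
        if node["attr"] == "age":
            branch = "le30" if record["age"] <= 30 else "gt30"
        elif node["attr"] == "children":
            branch = "zero" if record["children"] == 0 else "pos"
        else:
            branch = "Sedan" if record["car"] == "Sedan" else "other"
        return classify(record, node[branch])

    print(classify({"age": 25, "car": "Sedan", "children": 0}))   # -> Yes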



4.3 Clustering

Clustering basically aims to place objects into groups such that records in a group are similar to each other and dissimilar to records in other groups, and the groups are said to be disjoint. Clustering analysis is also called segmentation or taxonomy analysis, as it tries to identify homogeneous subclasses of cases in a given population. (Elmasri & Navathe 2007, 964.) Some of the approaches to cluster analysis are discussed below.



Hierarchical clustering permits users to choose a definition of distance, then select a linking method for forming clusters, after which the number of clusters that best suits the data is estimated. This approach to clustering creates a representation of clusters in icicle plots and dendrograms. A dendrogram is defined as a binary tree with a distinguished root that has all the data items at its leaves. (Cios et al. 2007, 260.)



K-means clustering simply requires the user to indicate the number of clusters in advance, after which the algorithm estimates how to assign cases to the k clusters. K-means clustering is less computer-intensive, and hence it is often preferred when data sets are large, say over a thousand cases; it creates a table showing the mean-square error, called an ANOVA table. (Garson 2010.)
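
A bare-bones sketch of the k-means loop, assigning two-dimensional points (say, age and income) to the nearest of k centres and re-averaging, could look as follows (an illustration of the idea, with invented data):

    import random

    def kmeans(points, k, iterations=20):
        centres = random.sample(points, k)          # the user fixes k in advance
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                # Assign each point to its nearest centre (squared distance).
                i = min(range(k), key=lambda c: (p[0] - centres[c][0]) ** 2
                                               + (p[1] - centres[c][1]) ** 2)
                clusters[i].append(p)
            # Move every centre to the mean of its cluster.
            centres = [(sum(p[0] for p in c) / len(c),
                        sum(p[1] for p in c) / len(c)) if c else centres[i]
                       for i, c in enumerate(clusters)]
        return clusters

    people = [(25, 20000), (27, 22000), (45, 61000), (48, 65000), (33, 40000)]
    print(kmeans(people, k=3))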




Two-step clustering generates pre-clusters, after which it clusters the pre-clusters using hierarchical methods. This approach handles very high volumes of data, and it has the largest array of output options, including variable plots. (Garson 2010.)


Graph 6 below is a diagrammatic representation of clustering, where people are placed into clusters based on their income levels and age.


[Figure: a scatter of people plotted by Age against Income, grouped into Cluster 1, Cluster 2 and Cluster 3.]

GRAPH 6. Output of clustering people on income basis (adapted from MacLennan et al. 2009, 7)


4.4 Data mining algorithms

Data mining algorithms are the mechanisms that create the data mining model, which is the main phase of the data mining process. In the subsequent subsections these algorithms are discussed.


4.4.1 Naïve Bayes algorithm

Collecting frequent itemsets involves the consideration of all possible itemsets, computing their support, and checking whether they exceed the minimum support threshold. A naive algorithm involves searching a high number of itemsets while scanning many transactions each time, making the number of tests that need to be conducted exponentially high, thus causing problems and excessive time consumption; this shortcoming prompted the birth of more efficient algorithms. (Han et al. 2005, 296.)


An example for demonstrating the algorithm involves having several tiny particles of two colors, red and green, as shown in the graph below. The particles are classified as either red or green, and new cases can then easily be assigned to a class based on the graph.


GRAPH 7. Naive Bayes classifier (adapted from Statsoft 2011)

It is obvious that the green particles are twice as numerous as the red particles; hence, on handling a new case which has not been handled before, it is twice as likely that the particle will belong to the green group rather than the red group. (Statsoft 2011.)
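
The prior-times-likelihood reasoning behind this example can be sketched in a few lines of Python (the training data below are invented; this illustrates the idea, not the tool used in the thesis):

    from collections import Counter

    # Invented training set of (class, attribute) pairs; green is twice as
    # common as red, so the prior alone already favours green.
    training = [("green", "small"), ("green", "small"), ("green", "large"),
                ("green", "small"), ("red", "large"), ("red", "small")]

    def naive_bayes(attribute):
        classes = Counter(label for label, _ in training)
        scores = {}
        for label, count in classes.items():
            prior = count / len(training)
            likelihood = sum(1 for l, a in training
                             if l == label and a == attribute) / count
            scores[label] = prior * likelihood   # naive independence assumption
        return max(scores, key=scores.get)

    print(naive_bayes("small"))   # -> 'green'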


4.4.2 Apriori algorithm


This algorithm applies prior knowledge of an important attribute of frequent itemsets. The Apriori property of any itemset declares that all non-empty subsets of a frequent itemset have to be frequent; hence, where a given itemset is not frequent (if it does not meet the minimum support threshold), every superset of this itemset will also not be frequent, since it cannot occur more frequently than the original itemset. (Cios et al. 2007, 295.)
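
A compact Python sketch of the level-wise Apriori search, using the pruning property just described, could look as follows (a simplified illustration, not the implementation used by the SQL Server tools):

    from itertools import combinations

    def apriori(transactions, min_support):
        # Keep only candidates whose support reaches the threshold.
        def frequent(cands):
            return {c for c in cands
                    if sum(1 for t in transactions if c <= t)
                       / len(transactions) >= min_support}

        level = frequent({frozenset([i]) for t in transactions for i in t})
        result = set()
        while level:
            result |= level
            # Join k-itemsets into (k+1)-candidates, then prune every
            # candidate that has an infrequent k-subset (Apriori property).
            cands = {a | b for a in level for b in level
                     if len(a | b) == len(a) + 1}
            cands = {c for c in cands
                     if all(frozenset(s) in level
                            for s in combinations(c, len(c) - 1))}
            level = frequent(cands)
        return result

    print(apriori([{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}], 0.5))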


4.4.3 Sampling algorithm

This algorithm is basically about taking a small sample of the main database of transactions, then establishing the frequent itemsets from the sample. Where such frequent itemsets form a superset of the frequent itemsets of the whole database, one can affirm the real frequent itemsets by scrutinizing the remainder of the database in order to determine the exact support values of the superset itemsets. This is basically a form of the Apriori algorithm, though with a lowered minimum support. (Elmasri & Navathe 2007, 952.)

Second scans of the database are usually required because of missed itemsets, and determining whether there were any missed itemsets gave room for the idea of the negative border. In relation to a set of frequent itemsets S over a set of items I, the negative border is the set of minimal itemsets contained in the power set of I and not in S; in a nutshell, the negative border of a set of frequent itemsets consists of the closest itemsets that could possibly be frequent. (Elmasri & Navathe 2007, 952.)


Consider an example having a set of items I = {A, B, C, D, E}, and let the combined frequent itemsets of size 1 to 3 be S = {{A}, {B}, {C}, {D}, {AB}, {AC}, {BC}, {AD}, {CD}, {ABC}}. Here the negative border is {{E}, {BD}, {ACD}}. The set {E} is the only 1-itemset not contained in S, {BD} is the only 2-itemset not in S whose 1-itemset subsets all are, and {ACD} is the only 3-itemset whose 2-itemset subsets are all in S. The negative border is important, since it is necessary to determine the support for the itemsets in the negative border to ensure that no large itemsets are missed. (Elmasri & Navathe 2007, 953.)
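
The example can be reproduced mechanically; the short Python sketch below (an illustrative helper, not from the thesis) enumerates the minimal itemsets outside S whose proper subsets are all frequent:

    from itertools import combinations

    items = "ABCDE"
    S = {frozenset(s) for s in
         ["A", "B", "C", "D", "AB", "AC", "BC", "AD", "CD", "ABC"]}

    border = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if frozenset(combo) in S:
                continue
            # All (k-1)-subsets must be frequent; the empty set counts as such.
            if all(not sub or frozenset(sub) in S
                   for sub in combinations(combo, k - 1)):
                border.add(frozenset(combo))

    print(sorted("".join(sorted(b)) for b in border))   # ['ACD', 'BD', 'E']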


4.4.4 Frequent-pattern tree algorithm

This algorithm came into being because the Apriori algorithm involves creating and testing a huge number of candidate itemsets; the frequent-pattern tree algorithm eliminates the creation of such large candidate itemsets. A compressed sample of the database is first created, based on the frequent-pattern tree; this tree keeps useful itemset information and provides an avenue for the efficient finding of frequent itemsets. The main mining process is divided into smaller tasks, each of which operates on a conditional frequent-pattern tree, a branch of the main tree. (Elmasri & Navathe 2007, 955.)
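
The construction of the compressed tree can be sketched in Python as follows; this illustrative fragment only builds the prefix-sharing tree (the conditional-tree mining step is omitted):

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 1, {}

    def build_fp_tree(transactions, min_count):
        # First pass: count items and keep the frequent ones.
        counts = {}
        for t in transactions:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        keep = {i for i, c in counts.items() if c >= min_count}
        # Second pass: insert each transaction, items ordered by falling
        # frequency, sharing common prefixes to compress the database.
        root = Node(None, None)
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in keep),
                               key=lambda i: (-counts[i], i)):
                if item in node.children:
                    node.children[item].count += 1
                else:
                    node.children[item] = Node(item, node)
                node = node.children[item]
        return root

    tree = build_fp_tree([{"A", "B"}, {"B", "C"}, {"A", "B", "C"}], min_count=2)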

4.4.5 Partition algorithm

The partitioning algorithm operates by splitting the database into non-overlapping subsets, which are treated as separate databases; all bulk itemsets for a partition are called local frequent itemsets, and they are created in one pass, after which the Apriori algorithm is efficiently applied on each partition, provided it fits into primary memory. Partitions are chosen such that each one can be accommodated in main memory, so that it is read only once. (Elmasri & Navathe 2007, 957.)

The main shortcoming of this algorithm is that the minimum support for each partition depends on the size of the partition rather than the size of the main database for large itemsets. After the first scan, a union of the frequent itemsets from every partition is taken, forming the global candidate frequent itemsets for the whole database. The global candidate large itemsets found in the first scan are confirmed in the second scan, when their support is measured for the whole database. This algorithm can be implemented in a parallel or distributed manner for enhanced performance. (Elmasri & Navathe 2007, 957.)
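
Assuming the apriori function sketched in section 4.4.2 is available, the two-pass partition idea can be outlined like this (a simplified illustration):

    def partition_frequent(transactions, n_parts, min_support):
        # Pass 1: locally frequent itemsets per partition; their union
        # forms the global candidate set for the whole database.
        size = max(1, len(transactions) // n_parts)
        candidates = set()
        for start in range(0, len(transactions), size):
            part = transactions[start:start + size]
            candidates |= apriori(part, min_support)   # from section 4.4.2
        # Pass 2: confirm each candidate's support on the full database.
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t)
                   / len(transactions) >= min_support}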


4.4.6 Regression

Regression is a special application of the classification rule. If a classification rule is regarded as a function over the variables that maps these variables into a target class variable, the rule is called a regression rule. A common application of regression occurs when, in place of mapping a tuple of data from a relation to a specific class, the value of a variable is predicted based on the tuple itself. (Elmasri & Navathe 2007, 967.)


Regression involves smoothing data by fitting the data to a function. It can be linear or multiple: linear regression involves finding the best line to fit two variables so that one can be used to predict the other, while multiple regression deals with more than two variables. (Han et al. 2005, 321.) For example, where there is a single categorical predictor, such as female or male, a legitimate regression analysis has been undertaken if one compares two income histograms, one for men and one for women. (Berk 2003.)
.)


4.4.7 Neural networks

This is a technique derived from artificial intelligence which uses generalized regression and provides an iterative method to implement it. It operates using a curve-fitting approach to infer a function from a given sample, and it is a learning approach which uses a test sample for initial learning and inference. Neural networks are placed into two classes, namely supervised and unsupervised networks. Adaptive methods that try to reduce the output error are supervised learning methods, while unsupervised learning methods develop internal representations without sample outputs. (Elmasri & Navathe 2007, 968-969.)

Neural networks can be used where some information is known and one would like to infer some unknown information. An example is stock market prediction: last week's and today's stock prices are known, but one wants to know tomorrow's stock prices. (Statsoft 2011.)
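
The iterative, error-reducing character of supervised learning can be illustrated with a single neuron trained by gradient descent (a toy sketch, far simpler than a real network):

    def train(samples, lr=0.01, epochs=500):
        w = b = 0.0
        for _ in range(epochs):
            for x, y in samples:
                error = (w * x + b) - y     # output error on this sample
                w -= lr * error * x         # nudge the weight against the error
                b -= lr * error             # nudge the bias against the error
        return w, b

    w, b = train([(1, 2), (2, 4), (3, 6)])  # invented data following y = 2x
    print(f"prediction for x = 4: {w * 4 + b:.2f}")   # close to 8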


4.4.8 Genetic algorithm

Genetic algorithms are a class of randomized search procedures capable of adaptive search over a large range of search-space topologies. They were developed by John Holland in the 1960s. A genetic algorithm applies the techniques of evolution, dependent on the optimization of functions, in artificial intelligence to generate a solution: it develops a sample of possible solutions to a problem domain, takes out the solutions that are better and recombines them to create a new generation of solutions, and lastly uses the new solutions to replace the poorer of the originals, after which the whole cycle is repeated. (Hill 2011.)


The solutions generated by genetic algorithms are differentiated from those of other techniques because genetic algorithms use a set of solutions during each generation instead of a single solution. The set of solutions at the disposal of a generation represents the memory of the search done so far. A genetic algorithm finds a near-optimal balance between knowledge gain and exploitation by manipulating encoded solutions. It is a randomized algorithm, unlike many other algorithms, and its ability to solve problems in parallel makes it powerful in data mining. (Elmasri & Navathe 2007, 969.)
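
The select-recombine-mutate-replace cycle described above can be sketched as a toy Python genetic algorithm that maximizes the number of 1-bits in a bit string (an invented fitness function, purely for illustration):

    import random

    def evolve(fitness, length=16, pop_size=30, generations=50):
        pop = [[random.randint(0, 1) for _ in range(length)]
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[:pop_size // 2]            # select the fitter half
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, length)    # one-point crossover
                child = a[:cut] + b[cut:]
                if random.random() < 0.05:           # occasional mutation
                    i = random.randrange(length)
                    child[i] = 1 - child[i]
                children.append(child)
            pop = parents + children                 # replace the poorer half
        return max(pop, key=fitness)

    best = evolve(fitness=sum)
    print(best, "fitness =", sum(best))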







5 APPLIED DATA MINING


The process of evaluating data mining can only be complete after a practical demonstration is done. In trying to carry out a practical mining task, several mining tools were used in the course of this project, ranging from IBM Intelligent Miner, Estard Data Miner and the SQL Server data warehouse to SQL Server used with Microsoft Excel 2007. Due to logistic and administrative limitations in using most of these tools, I chose SQL Server together with Microsoft Excel 2007 for the mining task, since it provides a trial version which grants much access with lesser administrative requirements.

[Figure: the steps Data Acquisition -> Data Preparation -> Modeling -> Validation -> Application.]

GRAPH 8. Data mining process (adapted from MacLennan et al. 2009, 188)

The graph above simply shows the steps involved in a simple data mining process. However, for the purposes of this thesis work, the data acquisition and preparation phases were skipped, since ready data were obtained from a repository.


5.1 Data mining environment

To demonstrate the data mining process, the mining software chosen was Microsoft Excel 2007, used in conjunction with Microsoft SQL Server. The exact version used was Microsoft SQL Server 2008, though other, earlier versions exist too. The server is Microsoft's enterprise-class database solution. It consists of four components, namely the database engine, Analysis Services, Integration Services and Reporting Services, and these four components work together to create business intelligence. (Ganas 2009, 2.)


The database engine basically facilitates the storage of data in tables and allows users to analyze the given data using commands in the SQL language. Considering the database engine from the business intelligence perspective, its primary function is the storage of collected data, and it has the capacity to store hundreds of gigabytes of data, which is also termed a "data warehouse" or "data mart". (Ganas 2009, 2.)

The Analysis Services part of SQL Server is responsible for the analysis of data using Online Analytical Processing (OLAP) cubes and the data mining algorithms. The cube is fundamentally a pre-meditated pivot table. It is located on the server and stores the raw data, along with pre-calculated data, in a multi-dimensional format. The data in an OLAP cube can be accessed using an Excel pivot table. OLAP cubes are valuable since they make it easy and convenient for users to handle and analyze extremely large amounts of data. (Ganas 2009, 2.)

The other component of SQL Server is Integration Services, which basically extracts, transforms and loads data. Its primary purpose is to transfer data between different storage formats, as in an instance where it pulls data out of an Excel file and uploads it into an SQL Server table. It is also a data cleaning tool, which makes it really relevant, since dirty data can make it difficult to develop valid statistical models. One of the cleaning techniques included in Integration Services is fuzzy logic, which is applied to clean data by isolating and removing questionable values. (Ganas 2009, 2-3.)

Reporting Services is the fourth component of SQL Server; it is a web-based reporting tool. Basically, it creates a web page where users can see reports that were generated using the data in the SQL Server tables. It also includes a web application known as Report Builder, which allows users to create ad hoc reports without knowing the SQL language. This facility enhances the easy transmission of business intelligence to several users. (Ganas 2009, 3.)


5.2 Installing the SQL server

Upon adhering to the installation requirements before the installation proper, the
installation wizard was used, as it allows the user to specify which features are to
be installed, the installation location, and the administrator privileges. The wizard
is also used to grant users access to the components of the server. (MacLennan
et al. 2009, 16.)


5.3 Data mining add-ins for Microsoft Office 2007

The data mining add-ins for Microsoft Office 2007 make it possible to exploit the
full potential of the SQL Server, hence an instance of the add-ins has to be
installed on a machine which already has an SQL Server and all its components
installed. The add-ins comprise three parts, namely the Table analysis tools, the
Data mining client, and the Data mining template for Visio. They also include the
Server configuration utility, which handles the details of configuring the add-ins
and the connection to the Analysis Services. (MacLennan et al. 2009, 16.)

The Data mining client for Excel 2007 allows users to build complex models in
Excel while processing those models on a high-performance server running
Analysis Services, and this reduces the time and effort required to extract and
analyze information from ordinary raw data using the most powerful algorithms
available. (Ganas 2009, 3.)



5.4 Installing the add-ins for Excel 2007

The main component of the add-ins which is of utmost importance in this case of
data mining is the Data mining client (which is basically installed on a computer),
as it acts as a link between the user and the SQL Server running Analysis
Services. The client architecture allows multiple users, each one working on his
own computer, to benefit from the power of a single Analysis Services server.

Installing the add-ins on a computer and specifying the name of the server running
Analysis Services is relatively straightforward; a wizard guides users through the
installation process, provided that the user has administrative rights to all aspects
of the server and the add-ins. (MacLennan et al. 2009, 20.)


5.5 Connecting to the Analysis Services

On clicking the Analyze menu, there is also the option of connecting to an Analysis
Services instance, which must be done to facilitate a successful evaluation of the
given data. The connection button was labeled <No connection>. On clicking it, the
Analysis Services connection dialog box appeared, requesting a server name and
an instance name; the given server name and instance name were then entered.
The figure below shows an instance of the connection phase.

By default, it is recommended that the Windows authentication option is used in
connecting to the Analysis Services, because the server supports only Windows-
authenticated clients. (MacLennan et al. 2009, 21.)
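
Clients such as Excel reach the server through an OLE DB connection; a typical
connection string of the kind used under the hood might look like the line below,
where the server and database names are hypothetical:

   Provider=MSOLAP;Data Source=localhost\SQL2008;Initial Catalog=AdventureWorksDM;Integrated Security=SSPI

The Integrated Security=SSPI part corresponds to the Windows authentication
recommended above.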

GRAPH 9. Connecting to Analysis Services


5.6 Effect of the add-ins

Upon completion of the installation of the add-ins on the computer, the data mining
ribbon can be readily seen on the menu of a launched Microsoft Office Excel 2007
spreadsheet. Through the ribbon appear the icons, which are divided in a logical
and organized fashion that mimics the order of tasks involved in a typical data
mining process. They include the data preparation tools, comprising Explore Data,
Clean Data and Partition Data, which basically acquire and prepare data for
analysis. (Mann 2007, 29.)


There is also the data modeling functionality, which offers a choice of algorithms
to use in the data mining. The accuracy and validation option aids in testing and
validating the mining model against real data before deploying the model into
production. The model usage option allows one to query and browse the Analysis
Services server for existing mining models. The management option enables the
management of the mining models, such as renaming, deleting, clearing,
reprocessing, exporting or importing of models. (Mann 2007, 29.)
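
Some of these management operations have direct counterparts in DMX, the
SQL-like data mining language of Analysis Services. The sketch below uses a
hypothetical model name, with a placeholder path and password:

   -- Back up a mining model to a file
   EXPORT MINING MODEL [CustomerCategories] TO 'C:\Backup\CustomerCategories.abf'
       WITH PASSWORD = 'secret'

   -- Remove a mining model from the server
   DROP MINING MODEL [CustomerCategories]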


The other icon on the worksheet is the connection tab, which facilitates connection
to a server; it is discussed in detail in the next subsection. Once a connection has
been successfully made to the Analysis Services, if some data is added to the
spreadsheet, and the whole or part of the data is selected and formatted as a table
using the "Format as Table" option from the menu, the data adopts the newly
selected format and the "Table Tools" option appears, as shown in the graph
below; it consists of the Analyze and Design options on the menu. (Mann
2007, 30.)


On clicking the Analyze button, eight functionalities of the Analysis Services
appear, which make it possible to perform tasks without any need to know about
the underlying data mining algorithms. They include:

· Analyze Key Influencers
· Detect Categories
· Fill From Example
· Forecast
· Highlight Exceptions
· Scenario Analysis
· Prediction Calculator
· Shopping Basket Analysis



GRAPH 10. Formatting an Excel sheet

The other option, the Design tab, provides options to remove duplicates, convert
to range and summarize with a PivotTable; it also provides the option of exporting
the formatted data.



GRAPH 11. The analysis ribbons


5.6.1 Analyze Key Influencers

The Analyze Key Influencers tool, when applied, basically brings out the
relationship between all other chosen columns of a given table and one specified
column, and then produces a report showing in detail which of the columns have a
major influence on the stipulated column and how that influence manifests itself.
Its implementation generates a temporary mining model using the Naïve Bayes
algorithm. (MacLennan et al. 2009, 22.)

Take for instance a table having columns for annual income, geographic location,
number of kids and purchases: on applying the tool and selecting the purchases
column, it would be possible to determine whether it is income that plays the major
role in the purchases an individual makes, or one of the other columns. Upon
specifying the target column, the Analyze Key Influencers button is clicked and a
dialog box pops up requesting the selection of the column to be analyzed for key
factors; there is also an option to restrict the other columns that can be used for
the analysis process. (MacLennan et al. 2009, 23.)
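
Since the tool builds a temporary Naïve Bayes model behind the scenes, an
equivalent model could be defined by hand in DMX. The sketch below uses the
hypothetical column names from the example above:

   CREATE MINING MODEL [KeyInfluencers]
   (
       CustomerID         LONG KEY,
       AnnualIncome       LONG DISCRETIZED,      -- Naive Bayes needs discrete inputs, so income is bucketed
       GeographicLocation TEXT DISCRETE,
       NumberOfKids       LONG DISCRETE,
       Purchases          TEXT DISCRETE PREDICT  -- the target column being analyzed
   )
   USING Microsoft_Naive_Bayes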


Once the selection has been made, the Run button is clicked and within seconds
a report is displayed, such as in the graph below.

GRAPH 12. Key influencer report

A look at the graph above simply demonstrates that the income column has the
greater influence over the weekly purchases of the people, represented by the
favors column, though the number of kids also has much influence in a few
instances. (MacLennan et al. 2009, 23.)


5.6.2 Detect Categories

Handling a huge amount of data can be cumbersome, hence it is more advisable
and convenient to regroup all the data into smaller categories, so that the elements
in each category have enough similarities to be seen as similar. The Detect
Categories tool applies the clustering algorithm, thus making data analysis
convenient and easy. It detects groups in the given data after analyzing the data,
and then it places the rows into groups based on their similarities, also
emphasizing the details which prompted the category creation. (MacLennan et al.
2009, 29.)

Applying the Detect Categories functionality of the table analysis tools simply
involves selecting and formatting the given data, after which the Analyze ribbon
displays; on clicking the Detect Categories button, a dialog box pops up. The box
displays the option of selecting the columns from the data that the user would like
to analyze, and the user can do so by checking or un-checking any column of his
choice. (MacLennan et al. 2009, 29.)

There is also the option of the tool appending the detected category column to the
original Excel table, which is checked by default, and thirdly the option of selecting
the number of categories, which is usually in the auto-detect state by default.
(MacLennan et al. 2009, 29-30.)
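
The grouping the tool performs corresponds to a clustering model on the server
side. A minimal DMX sketch of such a model, again with hypothetical column
names, could look as follows; setting CLUSTER_COUNT to 0 mirrors the auto-detect
default mentioned above:

   CREATE MINING MODEL [CustomerCategories]
   (
       CustomerID         LONG KEY,
       AnnualIncome       LONG CONTINUOUS,
       GeographicLocation TEXT DISCRETE,
       NumberOfKids       LONG DISCRETE
   )
   USING Microsoft_Clustering (CLUSTER_COUNT = 0)  -- 0 lets the algorithm choose the number of groups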

On clicking Run in the dialog box, the process is completed within a few seconds
and then the results are displayed. The displayed result, called the categories
report, has three parts, the first one showing the created categories and the
number of rows in each, as shown in the graph below.


GRAPH 13. Categories created in the category report

The second part of the category report shows the features of each category,
ranked by their relevance and importance. It is a table having four columns: the
first column holds the category name, the second the feature from the original
column, the third its value, and the fourth the relative importance, showing how
significant each feature is to the created category. (MacLennan et al. 2009, 31.)


A simple graph below shows how the category characteristics look.


GRAPH 14. Category report characteristics

The third part of the category report is the category profiles. It appears as bar
charts, showing the distribution of any of the original data characteristics over all
the generated categories. Each of the bars in the chart contains more than one
color, the segments denoting the proportion of rows in the category. There is a
color legend on the right-hand side which clearly portrays the proportion of each
feature in the category. The generated categories can be renamed according to
the user's wish. (MacLennan et al. 2009, 33.)


5.6.3 Fill From Example tool

This data mining tool has an auto-fill capability, in the sense that it is able to learn
from any given example of data and automatically generate the subsequent data
based on the trends and relationships in the example. It basically operates only on
columns of the Excel spreadsheet, so long as two or more example values have
been given in the target column. The reliability of the result of this tool mainly
depends on the amount of sample data or values given in the target column;
hence, the more sample data there is, the greater the reliability of the tool's result,
and vice versa. (MacLennan et al. 2009, 35.)
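
On the server side this amounts to training a predictive model on the filled-in rows
and querying it for the empty ones. The sources used here do not name the
algorithm, so the DMX sketch below, with hypothetical table and column names,
simply uses logistic regression as one plausible choice:

   CREATE MINING MODEL [FillFromExample]
   (
       RowID        LONG KEY,
       AnnualIncome LONG CONTINUOUS,
       NumberOfKids LONG DISCRETE,
       TargetColumn TEXT DISCRETE PREDICT  -- the partially filled column to complete
   )
   USING Microsoft_Logistic_Regression

   -- Predict the missing values for rows that lack an entry in the target column
   SELECT t.RowID, Predict([TargetColumn])
   FROM [FillFromExample]
   PREDICTION JOIN
       OPENQUERY([DataSource], 'SELECT RowID, AnnualIncome, NumberOfKids FROM EmptyRows') AS t
   ON  [FillFromExample].AnnualIncome = t.AnnualIncome
   AND [FillFromExample].NumberOfKids = t.NumberOfKids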

On selecting and formatting the given data, the table analysis tool option appeared
as expected and the Fill From Example tool button was clicked. A dialog box came
up, showing the option of which column to select for the sample data task; the tool
most often suggests a likely column, but there was still the possibility to choose a
target column different from the suggested one. There is also the option of
choosing which columns to use for the analysis in conjunction with the specified
column. On clicking the Run button, the process is completed, and a pattern report
for the specified column is generated on a new Excel worksheet; a new column is
also appended at the end of the original sheet, showing the original and newly
generated complete column. (Brennan 2011.)

The generated report has four columns showing the original column names, their
values, whether or not they favorably impact the target column, and lastly the
relative effect, shown by the aid of horizontal bars.


GRAPH 15. Completed table after the fill from example process

The results of the fill from example process can be refined by carrying out the
process again, if the displayed result is not close to the expected one, and the