the knowledge discovery process

Nov 20, 2013

Computer Science

Universiteit Maastricht

Institute for Knowledge and Agent Technology

Data mining and the knowledge discovery process

Summer Course 2006

H.H.L.M. Donkers

Content


Opening / acquaintance


What is data mining


Data mining methodology


Course perspective


Course contents

Data - Information - Knowledge


Data: symbols


Information: data that are processed to be useful;
provides answers to "who", "what", "where", and
"when" questions


Knowledge: application of data and information;
answers "how" questions


Understanding: appreciation of "why"


Wisdom: evaluated understanding.


(Russell Ackoff - http://www.outsights.com/systems/dikw/dikw.htm)


Data - Information - Knowledge

http://www.outsights.com/systems/dikw/dikw.htm


What is Data Mining


Traditionally


“Data mining is the extraction of implicit,
previously unknown, and potentially useful
information from data.”


Witten & Frank (2000). Data Mining.


What is Data Mining


Traditionally


“The application of specific algorithms for
extracting patterns from data, it is a part of
knowledge discovery from databases”


Fayyad (1997). From data mining to knowledge
discovery in databases.


What is Data Mining


Traditionally


“Data mining is a process, not just a series of
statistical analyses.”


SAS Institute (2003). Finding the solution to
data mining.

What is Data Mining


Traditionally


Computer Science


(Semi-)automated
application of algorithms
for pattern discovery


Algorithms developed in
the field of Artificial
Intelligence (machine
learning)


Part of the process of
knowledge discovery


Statistics


Process of discovering
patterns in data


(Manual) application of a
series of statistical
techniques (among which
machine learning)


Incorporates


Exploration


Sampling


Modeling


Validation

Data mining = Statistics + Marketing

What is Data Mining


A Fusion

“An analytic process designed to explore data in
search of consistent patterns and/or
systematic relationships between variables,
and then to validate the findings by applying
the detected patterns to new subsets of data.
The ultimate goal is prediction.”


Statsoft (2003). Data Mining Techniques.
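
The validation step in this definition (applying a detected pattern to a new subset of data) can be sketched as follows. This is a minimal illustration, not a method from the cited sources: the toy data, the single-threshold rule, and the names `fit_threshold_rule` and `accuracy` are all assumptions made for the example.

```python
def fit_threshold_rule(rows):
    """Find the threshold t for the rule "predict 1 if x >= t".

    rows: list of (x, label) pairs with label in {0, 1}.
    Every observed x is tried as a candidate threshold; the first one
    with the most correct predictions on `rows` is returned.
    """
    candidates = sorted({x for x, _ in rows})
    best_t, best_correct = candidates[0], -1
    for t in candidates:
        correct = sum((x >= t) == bool(y) for x, y in rows)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def accuracy(threshold, rows):
    """Fraction of rows on which the threshold rule predicts the label."""
    hits = sum((x >= threshold) == bool(y) for x, y in rows)
    return hits / len(rows)


# Hypothetical toy data: the pattern is detected on `train` only.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
# A separate subset that played no part in finding the pattern.
holdout = [(1.5, 0), (2.5, 0), (5.0, 1), (6.5, 1), (9.0, 1)]

threshold = fit_threshold_rule(train)
train_acc = accuracy(threshold, train)
holdout_acc = accuracy(threshold, holdout)
print(threshold, train_acc, holdout_acc)
```

The accuracy on the held-out subset is the number to report: the rule was chosen to fit the training subset, so its score there is optimistic.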

What is Data Mining


A Fusion

“An information extraction activity whose goal is
to discover hidden facts contained in
databases. Using a combination of machine
learning, statistical analysis, modeling
techniques and database technology, data
mining finds patterns and subtle relationships
in data and infers rules that allow the
prediction of future results.”


Rudjer Boskovic Institute
(2001). DMS Tutorial.

Data Mining In This Course


We use the book by Witten & Frank


Computer science (machine learning) approach


Emphasis on algorithms for pattern discovery
and rule extraction


What are the underlying models


What are the properties of the algorithms


When to use (for which tasks)


How to apply and tune them


How to interpret and assess the results

Data Mining Process


These algorithms are only part of a process
that computer scientists call Knowledge
Discovery and statisticians call Data Mining


The process starts with the recognition of a
problem and ends with the control of a
deployed solution


The whole process needs to be supported for a
successful application

Methodologies for Data Mining


As Data Mining comes of age, several
methodologies have been developed, each
with its own perspective. We will discuss
three of them:


Fayyad et al. (Computer science)


E.g., WEKA


SEMMA (SAS) (Statistics)


SAS Enterprise Miner, R


CRISP-DM (SPSS, OHRA, among others) (Business)


SPSS Clementine

Fayyad’s KDD Methodology

Data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge
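
The chain of stages in Fayyad's process can be sketched as a sequence of small functions, each consuming the output of the previous one. This is only an illustration under stated assumptions: the customer records, the field names, and the midpoint-between-class-means "pattern" are hypothetical and stand in for whatever data and mining algorithm a real project would use.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, monthly_spend, churned)
RAW = [
    (1, 25, 40.0, "no"),
    (2, 31, None, "no"),   # missing value, removed during cleaning
    (3, 47, 95.0, "yes"),
    (4, 52, 110.0, "yes"),
    (5, 29, 35.0, "no"),
    (6, 44, 90.0, "yes"),
]


def select(records):
    """Selection: keep only the fields relevant to the question."""
    return [(spend, churned) for _id, _age, spend, churned in records]


def preprocess(target):
    """Preprocessing and cleaning: drop rows with missing values."""
    return [row for row in target if row[0] is not None]


def transform(processed):
    """Transformation: encode the symbolic label as 0/1."""
    return [(spend, 1 if churned == "yes" else 0) for spend, churned in processed]


def mine(transformed):
    """Data mining: a deliberately simple pattern, the midpoint between
    the mean spend of the two classes."""
    pos = [x for x, y in transformed if y == 1]
    neg = [x for x, y in transformed if y == 0]
    return (mean(pos) + mean(neg)) / 2


def interpret(threshold):
    """Interpretation / evaluation: phrase the pattern as a readable rule."""
    return f"IF monthly_spend >= {threshold:.2f} THEN churned = yes"


target = select(RAW)
processed = preprocess(target)
transformed = transform(processed)
pattern = mine(transformed)
knowledge = interpret(pattern)
print(knowledge)
```

Keeping each stage as its own function mirrors the diagram: any stage can be revisited and rerun without touching the others.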

SEMMA Methodology

Supported by the SAS Enterprise Miner environment

SAMPLE: Input data, Sampling, Data partition

EXPLORE: Distribution explorer, Multiplot, Insight, Association, Variable selection

MODIFY: Transform variable, Filter outliers, Clustering, SOM / Kohonen

MODEL: Regression, Tree, Neural Network, Ensemble

ASSESS: Assessment, Score, Report
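
The "Data partition" step of the SAMPLE stage can be illustrated with a seeded shuffle that splits the rows into training, validation, and test sets. This is a generic sketch, not the SAS Enterprise Miner node: the function name `partition`, the 60/20/20 fractions, and the seed are assumptions chosen for the example.

```python
import random


def partition(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle a copy of `rows` reproducibly and cut it into three parts.

    The test set receives whatever remains after the training and
    validation sets have been taken.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


rows = list(range(10))  # stand-in for ten input records
train, valid, test = partition(rows)
print(len(train), len(valid), len(test))
```

Fixing the seed makes the partition repeatable, so later stages are always compared on the same validation and test rows.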

CRISP-DM Methodology


Developed by data-mining companies (SPSS,
NCR, OHRA, DaimlerChrysler), funded by the
European Commission


Tool-independent / industry-independent


Hierarchical process model:
(1) Generic phases, (2) Generic tasks, (3) Specific tasks, (4) Task instances


Supported by the SPSS Clementine environment

CRISP-DM Methodology

Phases: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment

TASKS (Business understanding)


Business objective


Assess situation


Data mining goals


Project plan

CRISP-DM Methodology

TASKS (Data understanding)


Collect data


Describe data


Explore data


Verify data quality

CRISP-DM Methodology

TASKS (Data preparation)


Select data


Clean data


Construct data


Integrate data


Format data

CRISP-DM Methodology

TASKS (Modeling)


Select modeling techniques


Design the test


Build model


Assess model

CRISP-DM Methodology

TASKS (Evaluation)


Evaluate results


Review process


Determine next steps

CRISP-DM Methodology

TASKS (Deployment)


Plan deployment


Plan monitoring and maintenance


Final report


Review project

A Comparison

Fayyad: Data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge

SEMMA: Sample → Explore → Modify → Model → Assess

CRISP-DM: Business understanding → Data understanding → Data preparation → Modeling → Evaluation → Deployment

A Small Poll (July 2002)

Which DM Methodology do you use?

[Bar chart. Categories: Crisp DM, SEMMA, My organisation's, My own, Other, None. Axis: 0 to 100.]

Source: http://www.kdnuggets.com/polls/2002/methodology.htm

Poll repeated (2004)

Which DM Methodology do you use?

[Bar chart. Categories: Crisp DM, SEMMA, My organisation's, My own, Other, None. Axis: 0 to 80.]

Source: http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm

Course perspective and goal


The perspective is from computer science (machine learning): Fayyad’s approach


The emphasis is on techniques for the
automated discovery of patterns in data and
the automated extraction of rules (the model
phase of SEMMA and CRISP-DM)


The goal is to get acquainted with these
techniques, so you can use them in the
methodology of your choice

Course contents


Data preparation (Tuesday)


Selection, preprocessing, transformation


Techniques, algorithms and models


Decision trees (Monday)


Instance based and Bayesian learning (Wednesday)


Neural networks (Wednesday)


Association rules (Thursday)


Clustering (Thursday)


Support Vector Machines (Friday)


Evaluation of learned models (Tuesday)

Course contents


For each technique you learn


For which tasks it is suitable


Classification, rules, prediction, …


Restrictions on input data (numerical, symbolic, etc.)


What algorithms are available


What parameters should be tuned


How to interpret the results


How to evaluate the model
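
As one possible illustration of "how to evaluate the model", the sketch below estimates the accuracy of a 1-nearest-neighbour classifier (an instance-based learner) with k-fold cross-validation. The one-dimensional toy data, the fold assignment by row position, and the function names are assumptions made for the example; they are not taken from the course material.

```python
def nearest_neighbour_predict(train, x):
    """Return the label of the training point closest to x (1-NN)."""
    _, label = min(train, key=lambda row: abs(row[0] - x))
    return label


def k_fold_accuracy(rows, k):
    """Average 1-NN accuracy over k folds.

    Fold i uses every k-th row starting at position i as the test set
    and all remaining rows as the training set.
    """
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        hits = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / k


# Hypothetical one-dimensional data with two well-separated classes.
data = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (5.0, "b"), (5.3, "b"), (4.9, "b")]
cv_accuracy = k_fold_accuracy(data, k=3)
print(cv_accuracy)
```

Every row is used for testing exactly once and never appears in the training set of its own fold, which is what makes the averaged score an estimate of performance on unseen data.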