FROM BUSINESS OBJECTIVES TO DATA MINING: TOWARDS A SYSTEMATIC WAY OF DATA MINING PROJECT DEVELOPMENT

Facultad de Informática


FROM BUSINESS OBJECTIVES TO DATA MINING: TOWARDS A SYSTEMATIC WAY OF DATA MINING PROJECT DEVELOPMENT



Ernestina Menasalvas

Facultad de Informática

Universidad Politécnica de Madrid, Spain

emenasalvas@fi.upm.es

November 2004




Background (I)


1995: doctoral student.


Visit University of Regina (Prof. Ziarko)


Visit Warsaw University (Prof. Pawlak)


1998: Defended thesis: Data Mining process model (Anita Wasilewska & C. Fernandez-Baizan)


Since then:


Databases professor: databases, data mining


Coordinator of the Data Mining group at Facultad de
Informática UPM


Techniques: Rough Sets, Bayes, …


Methodologies for data mining process management


Evaluation in Data Mining


Experimentation in Web Mining


Web Mining: Web Goal Mining


Background (II)


Projects developed:


Pure Research:


Data mining integrated into RDBMS


Web Profiler


Methodology for Data Mining process management


Research and application:


Data Mining applied on different domains:


Car dealers


Travel agency


….


Data Mining Project Development


Methodologies for Data Mining project
development


Is Data Mining really a science?


Are we developing projects as an art?


Has research achieved the same results in all areas?


Algorithms


Data Preparation


Data enrichment


Conceptualization of Data Mining problems

Data Mining: an art, a science?


Since data mining appeared, many algorithms have been implemented


Standards:


CRISP-DM


SEMMA


PMML 3.0



Process depends on the expertise of the data miner



User speaks about business problems



Data Miner speaks about algorithms

Data Mining as a project


Data mining is a data-intensive activity


Data understanding


Data Preparation


Database manager:


Transactional databases


Data warehouses


The end result of a data mining project is a tool (a software product) that supports a better decision-making process:


Software development project


IT department has to be involved


Project Management


Why?


In order to organize the development process and to produce a project plan


How?


Establish how the process is going to be developed:


Sequential


Incremental


What?


Establish how the process is split into phases and define the tasks to be carried out in each step:


RUP


XP


CommonKADS




Lifecycle models: a way of doing things, independent of the process being developed

Methodology: particular tasks, with the detail of the tasks to be developed

Common pitfalls of data mining implementation


The common pitfalls of data mining implementation are the following:


Not being able to efficiently communicate mining results
within an organization.


Not having the right data to conduct effective analysis.


Not using existing data correctly.


Not being able to evaluate results



Questions that arise:


Can the adequacy of a data set for a problem be established when preparing the project plan?


How can the data set be used to produce the expected results?


How can we evaluate the results?


Cost estimation?

Data Mining Approaches


Vendor independent:

CRISP-DM

Based on commercial tools:

CATs

SEMMA


CRM Methodology:


CRM Catalyst

Process model, not a real methodology

Based on CRISP-DM

Global CRM process

Does not concentrate on the data mining step

Cross-Industry Standard Process for Data Mining: CRISP-DM

Data Mining as a project: CATs


CATs: Clementine Application Templates [CATs]



Specific libraries of best practices that provide immediate value right out of the box


They follow the CRISP-DM standard. Every CAT stream is assigned to a CRISP-DM phase


They provide long term value as they can always be used
with a new data set for new insight in other projects.


Available as an add-on module to Clementine; they include:

Telco CAT - improve retention and cross-selling efforts for telecommunications

CRM CAT - understand and predict customer migration between segments

Microarray CAT - accelerate biological discoveries, find genes

Fraud CAT - predict and detect instances of fraud in financial transactions, claims, tax returns …

Web CAT

What is a CAT?

[CATs]

SEMMA (1)


SEMMA (Sample, Explore, Modify, Model, Assess): [SEMMA]


It is not a data mining methodology


Rather, it is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining.


Enterprise Miner can be used as part of any iterative
data mining methodology adopted by the client.


Naturally, steps such as formulating a well-defined business or research problem and assembling quality, representative data sources are critical to the overall success of any data mining project.

SEMMA (2)

SEMMA is focused on the model development aspects of data mining: [SEMMA]

Sample the data to extract a portion of a large data set big enough to contain significant information, yet small enough to manipulate quickly.

Explore the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas.

Modify the data by creating, selecting and transforming the variables to focus the model selection problem.

Model the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome. Modelling techniques include neural networks, tree classifiers, statistical models, etc.

Assess the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs.
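
As a rough illustration of the five SEMMA steps, the sketch below strings them together in plain Python on synthetic data. It does not use SAS Enterprise Miner: the data, the function names and the single-threshold "model" are all assumptions made for this example.

```python
import random
import statistics

def sample(rows, n, seed=0):
    """Sample: draw a manageable random subset of the data."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

def explore(rows):
    """Explore: summarise the input variable to spot trends and anomalies."""
    values = [x for x, _ in rows]
    return {"mean": statistics.mean(values), "stdev": statistics.pstdev(values)}

def modify(rows, summary):
    """Modify: transform the variable (here, standardise it)."""
    scale = summary["stdev"] or 1.0
    return [((x - summary["mean"]) / scale, y) for x, y in rows]

def model(rows):
    """Model: search for the threshold that best separates the two classes."""
    best_t, best_acc = 0.0, -1.0
    for t, _ in rows:
        acc = sum((x >= t) == y for x, y in rows) / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def assess(rows, threshold):
    """Assess: estimate how well the model performs on held-out rows."""
    return sum((x >= threshold) == y for x, y in rows) / len(rows)

# Synthetic data (assumption): the label is True when the raw value exceeds 50.
rng = random.Random(42)
data = [(v, v > 50) for v in (rng.uniform(0, 100) for _ in range(1000))]

subset = sample(data, 200)
summary = explore(subset)
prepared = modify(subset, summary)
train, test = prepared[:150], prepared[150:]
threshold = model(train)
accuracy = assess(test, threshold)
```

Each function corresponds to one SEMMA step, so a real project could replace any of them (for example, a proper learning algorithm in `model`) without changing the overall flow.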

Methods for Project Management: CRM Catalyst (1)

Developed jointly by CustomISe, MACS and SalesPathways. Together they have formed the Catalyst Foundation: http://www.crmmethodology.com/

Motivations:

CRM projects are difficult to execute successfully because of the wide range of factors influencing their success, so it can take a long time to make CRM work properly for an organisation.

Solution: CRM Catalyst.

The methodology acts as a catalyst for CRM projects, enabling them to achieve their objectives more reliably and in less time.

It gives a project life cycle with a set of defined phases broken down into steps with clearly stated inputs and outputs.


Methods for Project Management: CRM Catalyst (2)

Implementation requires a data mining development process

Implementation is knowledge intensive

The results are obtained in a progressive way

Progressive lifecycle model

In some steps a knowledge-intensive methodology could be appropriate

Main steps in a Data Mining Project

1. Define the goals:

Business and data mining experts together have to define the goals

Each goal must be defined with measurements for success

2. Obtain the models:

Apply data mining algorithms.

Preprocessing is important

3. Evaluate results:

Ascertain the value of an object according to specified criteria, operationalised in terms of measures.

4. Deploy:

Decide which patterns and models can be deployed

5. Evaluate:

Once the product is in operation, the results should be verified


1. Define the goals


Distinguish between:

Data Mining goals

Business goals

How do we translate?

Business goal: Increase the lifetime value of valuable customers

Data mining goal: Classification? Estimation? Association?

This has to be solved in the Business Understanding step of CRISP-DM

Business Understanding in the CRISP-DM Process

Tasks of the Business Understanding phase and their outputs:

Determine Business Objectives: Background; Business Objectives; Business Success Criteria

Assess Situation: Inventory of Resources; Requirements, Assumptions & Constraints; Risks & Contingencies; Terminology; Costs & Benefits

Determine Data Mining Goals: Data Mining Goals; Data Mining Success Criteria

Produce Project Plan: Project Plan; Initial Assessment of Tools & Techniques

1.1 Determine business objectives and success criteria


Not only business objectives but also measures have to be established, so that the results can be evaluated



Business objectives:


What is the customer's primary objective?


Increase the number of loyal customers


Selling more of a certain product


Have a positive marketing campaign



Business success criteria:


What constitutes a successful outcome of the project?


Objective measures so that success can be established


ROI

1.2 Costs & Benefits


Perform a cost-benefit analysis


Compute the benefits of the project


Which measures do we have?


ROI


CAPEX


OPEX....


Compute the costs of the project (equipment, human
resources...)


Which methodology do we have?


COCOMO for software


Quantify the risk that the project fails


Knowledge not available


Data not available


Proper tools not available


Data Mining Estimation Model


Establishing a parametric estimation model for data mining (Marban’03): DMCOMO (Data Mining COst MOdel)
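
To make the cost-benefit measures concrete, here is a minimal sketch of ROI and of the basic COCOMO effort equation for software (effort = a * KLOC^b person-months, with Boehm's published coefficients). It is not DMCOMO, whose equation and drivers are not reproduced in these slides; the function names and the example figures are assumptions.

```python
def roi(benefit, cost):
    """Return on investment as a fraction of cost: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost

# Basic COCOMO coefficients (a, b) for effort in person-months (Boehm, 1981).
COCOMO_MODES = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}

def cocomo_effort(kloc, mode="organic"):
    """Estimated effort in person-months for `kloc` thousand lines of code."""
    a, b = COCOMO_MODES[mode]
    return a * kloc ** b

# Hypothetical figures, for illustration only.
example_roi = roi(benefit=150_000, cost=100_000)
example_effort = cocomo_effort(10, mode="organic")
```

A parametric model for data mining would follow the same idea, but with drivers such as the data, model, platform, tools, project and people factors listed in the following slides instead of lines of code.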

Data Mining Cost Estimation


Main factors in a Data Mining project


Data Sources (number, kind, nature, …)


Data mining problem to be solved (descriptive,
predictive, …)


Development platform


Available tools


Expertise of the development team


Drivers



Data Drivers



Model Drivers



Platform Drivers



Tools and techniques Drivers


Project Drivers


People Drivers

1.3 Data Mining goals and success criteria

Data mining goals:


Translate the customer's primary objective into a data
mining goal, e.g.


Loyalty program translated into segmentation problem


Decreasing the attrition rate transformed into classification
problem


Data mining success criteria:


Determine success in technical terms


Translate the notion of success into confidence, support, lift and other parameters


Determine the cost of errors
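
The technical measures named above can be computed directly for an association rule A -> C: support is the fraction of transactions containing both A and C, confidence is support(A and C) / support(A), and lift is confidence / support(C). The sketch below uses a small invented basket data set; the function name and the data are assumptions for illustration.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(antecedent <= t for t in transactions)       # contain A
    n_c = sum(consequent <= t for t in transactions)       # contain C
    n_ac = sum((antecedent | consequent) <= t for t in transactions)  # contain both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

# Invented transactions, each one a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence, lift = rule_measures(baskets, {"bread"}, {"milk"})
```

A lift below 1, as in this toy example, means the antecedent makes the consequent less likely than its baseline rate, which is the kind of threshold a data mining success criterion can state explicitly.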


How do we make the translation?

Methodology


Which methodology should be followed to translate business objectives into data mining objectives?


Unfortunately, there is no such methodology. First we have to answer:


How is a business objective expressed?


What is a data mining goal?


How are data mining goals achieved?


What are the requirements of data mining functions?

In order to describe everything in a standard way:

Conceptualize the problem

Conceptualization in other disciplines


Databases:


E/R diagrams


Independent of the domain


A tool for business understanding and for the database designer


Translation from E/R to implementation


[Figure: three-level database architecture: External view 1 … External view n, Conceptual Schema, Internal Schema]

Proposed three-level architecture

Business problems (external level)

Conceptual Schema: requirements of algorithms will be solved at this level

Internal Schema: tool requirements to be solved (SAS, WEKA, Clementine…)

Three-layer architecture for data mining


It is the bridge:


Between business goals and the final tool


Independent of the domain


Provides independence:


Changes in the tool do not affect the solution



It has to be decided what to model in the
conceptualization


Automatic translation of business goals into data
mining goals


Data mining goals + constraints = feasible data mining goals
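
One way to picture "data mining goals + constraints = feasible data mining goals" is a lookup from business goals to candidate problem types, filtered by what the available data allows. The sketch below is only an illustration: the two mappings reuse the examples given earlier in the talk (loyalty program to segmentation, attrition to classification), while the catalogue structure, the requirement fields and the row thresholds are invented assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MiningGoal:
    problem_type: str    # e.g. "classification", "segmentation"
    needs_labels: bool   # does the method require a labelled target?
    min_rows: int        # hypothetical minimum amount of data

# Hypothetical conceptual-level catalogue (thresholds are invented).
CATALOGUE = {
    "loyalty program": [MiningGoal("segmentation", False, 500)],
    "decrease attrition rate": [MiningGoal("classification", True, 1000)],
}

def feasible_goals(business_goal, has_labels, n_rows):
    """Keep only the candidate goals whose requirements the data satisfies."""
    candidates = CATALOGUE.get(business_goal, [])
    return [
        g for g in candidates
        if (has_labels or not g.needs_labels) and n_rows >= g.min_rows
    ]

loyalty = feasible_goals("loyalty program", has_labels=False, n_rows=800)
attrition = feasible_goals("decrease attrition rate", has_labels=False, n_rows=5000)
```

Because the catalogue never mentions a specific tool, swapping SAS for WEKA or Clementine would leave this layer unchanged, which is the independence the architecture aims for.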

Elements to conceptualize


Elements to be taken into account:


Data:


Quality from data mining point of view


Adequacy for the problem


Classification for data mining purposes


Knowledge:


Related to the process being analyzed


Related to the data used


People


Owners of data


Experts in the process


Data mining problems requirements


Data mining methods requirements

Proposed process

DMMO


Data Mining Modelling Objects:


Data


Knowledge


Constraints of data and applications


Data Mining objects


Algorithms


Measures


Methods



To bridge the gap between data miners and
business users


Is the data adequate for analysis?


The adequacy of the data is analyzed taking into account the goals to be fulfilled.


Data, together with the knowledge extracted from the experts, can be transformed so that simply feeding it to a certain data mining algorithm produces the required patterns.



Quality of the data, in this context:

is not only related to technical quality (proper model, percentage of null values),

but also has to do with:

the meaning of the attributes,

where each piece of data comes from,

the relationships among the data, and

finally, how the data fulfil the requirements of the data mining functions
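
The technical side of data quality is the easiest to automate. As a minimal sketch, the code below computes the percentage of null values per attribute and reports the attributes that exceed a tolerated threshold; the record layout, the attribute names and the 30% threshold are assumptions for illustration. The semantic aspects listed above (meaning, provenance, relationships) still need the domain experts.

```python
def null_percentage(rows, attributes):
    """Percentage of rows in which each attribute is missing (None or absent)."""
    total = len(rows)
    result = {}
    for attr in attributes:
        missing = sum(1 for row in rows if row.get(attr) is None)
        result[attr] = 100.0 * missing / total if total else 0.0
    return result

def failing_attributes(rows, required, max_null_pct):
    """Attributes whose null percentage exceeds the tolerated maximum."""
    pct = null_percentage(rows, required)
    return {attr: p for attr, p in pct.items() if p > max_null_pct}

# Invented customer records.
customers = [
    {"age": 34, "segment": "A"},
    {"age": None, "segment": "B"},
    {"age": 51, "segment": None},
    {"age": None, "segment": "A"},
]
nulls = null_percentage(customers, ["age", "segment"])
problems = failing_attributes(customers, ["age", "segment"], max_null_pct=30.0)
```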

2. Data Mining: obtain models


Apply data mining process model


Associated problems solved by the 3 layers
architecture:


Comparison of approaches


Evaluate costs


Pros and cons of approaches


Only experience or a conceptualization can help


The conceptual model will help to establish the
process to obtain each feasible model.


Requirements and transformations implicit in the
model

2.1 Determine type of problem


What are data mining problems?


Classification


Estimation


Association


Segmentation



In the conceptual model, the requirements for each type will be established



2.2 Apply the CRISP-DM process model

The data mining problem has to be settled before going into the modeling step

Requirements will be established in Business Understanding

Requirements will be checked in Data Understanding and Data Preparation

Preparation will be guided by the conceptual model

Feasibility can be evaluated before applying the model

[Figure: CRISP-DM cycle: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, and back to Business Understanding]

3. Evaluate results

[Spiliopoulou, Berendt]


Evaluation: the act of ascertaining the value of an object according to specified criteria, operationalised in terms of measures.


Object= model already obtained


Criteria and measures have to do with the goals


Evaluation requires a well-defined notion of success, which must be in place before

the evaluation takes place

the data mining phase starts

any work with the data starts


i.e. already during the business understanding process.


Here once again conceptualization plays its role
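
A small sketch of what "success defined in advance" could look like in practice: the criteria are fixed as data before any modelling, and the evaluation only compares a model's results with them. The class, the thresholds, the error costs and the example labels below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuccessCriteria:
    """Thresholds agreed during business understanding (hypothetical values)."""
    min_accuracy: float
    max_error_cost: float

def evaluate(actual, predicted, cost_fp, cost_fn, criteria):
    """Compare a model's predictions against criteria fixed in advance."""
    pairs = list(zip(actual, predicted))
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    false_pos = sum(1 for a, p in pairs if p and not a)
    false_neg = sum(1 for a, p in pairs if a and not p)
    error_cost = false_pos * cost_fp + false_neg * cost_fn
    success = (accuracy >= criteria.min_accuracy
               and error_cost <= criteria.max_error_cost)
    return {"accuracy": accuracy, "error_cost": error_cost, "success": success}

criteria = SuccessCriteria(min_accuracy=0.7, max_error_cost=100.0)
actual = [1, 1, 0, 0, 1, 0, 0, 0]      # invented true labels (1 = churner)
predicted = [1, 0, 0, 0, 1, 1, 0, 0]   # invented model output
report = evaluate(actual, predicted, cost_fp=10.0, cost_fn=50.0, criteria=criteria)
```

Keeping the criteria in a separate, immutable object makes it harder to adjust the definition of success after the results are known.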


Evaluation in the CRISP-DM Process

The CRISP-DM process is

a never-ending cycle of iterations

a non-sequential process, where backtracking to previous phases is usually necessary


In each sequential instantiation evaluation takes place:






But it is a cycle


In all the iterations all the steps should be revisited


Results have to be evaluated!!

[Figure: CRISP-DM cycle: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, and back to Business Understanding]

4. Deployment


All the models that have a positive evaluation can be deployed


For the measurements of success to be trusted, deployment has to follow the rules established at the beginning of the project



The real evaluation has not yet been performed

5. Evaluate after deployment


After deployment there is a need to prove that the improvements are really due to the actions taken after a data mining discovery and not to any other factor or action carried out in the company


None of the obvious claims about success of data
mining have ever been systematically tested.


Experiments are crucial to establish if the impact
of the deployment is really positive or negative


Experiments have to be designed at the
beginning of the project
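
One simple experimental design that fits this requirement is a controlled comparison: a treated group receives the action derived from the mining results, a control group does not, and the difference in outcome rates is tested for significance. The sketch below uses a standard two-proportion z-test; the group sizes and counts are invented, and the choice of this particular test is an assumption, not something prescribed in the talk.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (difference in rates, z statistic, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value from the normal distribution: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value

# Invented figures: purchases among customers targeted with the mined model
# (treated) versus a randomly held-out control group.
uplift, z, p_value = two_proportion_z_test(120, 1000, 90, 1000)
```

Random assignment to the two groups is what allows the difference to be attributed to the deployed action rather than to other activities in the company, which is why the experiment has to be planned before deployment.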


Conclusions


Data mining projects are being developed more as an art than as a science


Many algorithms have been implemented, but there is no systematic proof, after deployment in real cases, that one is better than another


Conceptual model is required:


To map business goals to the model


To map data mining algorithms to a conceptual model


Achievements of the model:


Will be used along the process to guide the project


Evaluation tool

Future work


Conceptual model


Define DMMO objects


Evaluation techniques related to the model:


Evaluate data mining goals


Evaluate business goals


Experimentation methods:


obtrusively and


non-obtrusively

References


Bettina Berendt, Myra Spiliopoulou, Ernestina Menasalvas: Evaluation in Web Mining. Tutorial at ECML/PKDD 2004, Pisa, Italy, 20th September 2004.

Menasalvas, Millán, Gonzalez-Aranda, Segovia: Towards a Methodology for Data Mining Project Development: The Importance of Abstraction.

Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van Someren, Myra Spiliopoulou, Gerd Stumme: Web Mining: From Web to Semantic Web, First European Web Mining Forum, EWMF 2003, Cavtat-Dubrovnik, Croatia, September 22, 2003, Revised Selected and Invited Papers. Springer 2004.

Myra Spiliopoulou, Carsten Pohle: Modelling and Incorporating Background Knowledge in the Web Mining Process. Pattern Detection and Discovery 2002: 154-169.

www.crisp-dm.org

www.spss.com/clementine/cats.htm

www.sas.com/technologies/analytics/datamining/miner/semma.html

www.crmmethodology.com

www.emetrics.org/articles/whitepaper.html




THANKS