Introduction to KDD for Tony's MI Course


CogNova

Technologies

1

Knowledge Discovery

in Databases (KDD)


An Introduction

Daniel L. Silver

Copyright (c), 2003

All Rights Reserved


Agenda


Introduction to KDD & DM



Data Mining and E-Commerce


Overview of the KDD Process


Benefits, Costs, Status and Trends



“We are drowning in information, but
starving for knowledge.”
John Naisbitt




Data Warehousing and Data Mining

Knowledge Discovery in Databases (KDD)


Introduction

KDD is a rapidly maturing field ...


Also referred to as:


Data dredging, Data harvesting, Data archeology


A multidisciplinary field:


Database and data warehousing


Data and model visualization methods


On-line Analytical Processing


Statistics and machine learning


Knowledge management


Introduction

Why has KDD come to the fore now?


Competitive focus - Knowledge Management


Abundance of business and industry data


Inexpensive, powerful computing engines


Strong theoretical/mathematical foundations


machine learning & logical inference


statistics and dynamical systems


database management systems


Introduction


What is KDD?

A Process


The selection and processing of data for:


the identification of novel, accurate, and
useful patterns, and


the modeling of real-world phenomena.



Data Warehousing and Data Mining are major components of the KDD process.


The KDD Process

[Diagram: Data Sources -> Data Consolidation -> Consolidated Data (Data Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]


Introduction


KDD In Context

[Diagram: the KDD process (Data Sources -> Data Consolidation -> Selection and Preprocessing -> Data Mining -> Interpretation and Evaluation -> Knowledge) embedded in “The Virtuous Cycle” of Berry & Linoff. Cycle steps: Identify Problem or Opportunity, Act on Knowledge, Measure Effect of Action. Cycle labels: Problem, Knowledge, Strategy, Results]


Introduction - CRISP

Cross Industry Standard Process for Data Mining

Developed by employees at SPSS, NCR, DaimlerChrysler


Iterative process with 6 major steps:


Business Understanding


Data Understanding


Data Preparation


Modeling


Evaluation


Deployment


Marketing Embraces KM, DW, DM

Why? … Relationship Marketing, a.k.a. Customer Relationship Management

[Diagram labels: Marketing, Traditional Marketing, MIS, Data Warehousing, Data Mining]




What is Relationship Marketing?


Knowing your customers on an individual basis

Maximizing lifetime value, not individual sales

Developing and maintaining a mutually beneficial relationship

Acquire, retain, win back desirable customers

Arbuckle’s Market: “The Corner Store”


Knowledge Discovery

What can KDD do for an organization?

Impact on Marketing


Target marketing at a credit card company

Consumer usage analysis at a telecom provider

Loyalty assessment at a service bureau

Quality of service analysis at an appliance chain


Application Areas

Private/Commercial Sector


Marketing: segmentation, product targeting, customer value and retention, ...

Finance: investment support, portfolio management

Banking & Insurance: credit and policy approval

Security: fraud detection, access control

Science and medicine: hypothesis discovery, prediction, classification, diagnosis

Manufacturing: process modeling, quality control, resource allocation

Engineering: pattern recognition, signal processing

Internet: smart search engines, web marketing


Application Areas

Public/Gov’t Sector


Finance: investment management, price forecasting

Taxation: adaptive monitoring, fraud detection

Health care: medical diagnosis, risk assessment, cost/quality control

Education: process and quality modeling, resource forecasting

Insurance: workers’ compensation analysis

Security: bomb, iceberg detection

Transportation: simulation and analysis

Statistics: demographic analysis, municipal planning


Data Mining and E-Commerce


Business Evolution on the Web

[Chart: Functionality plotted against Time or Maturity, with four stages]

Publishing: static web pages

Interactivity: dynamic web pages

Transactions: Web-enabled applications

Processes: agents, on-line data mining


E-Commerce and the Emergence of E-Business

[Diagram labels: Enterprise Resource Planning, Supply Chain Management, Customer Relationship Management, Selling Chain Management, Procurement Management, Knowledge Management; Suppliers, Services, Customers, Distributors, Partners, Government, Agents; Intranet, Middleware]

New Era of Cross-Functional Integrated Applications



New Applications for DM

CRM = Customer Relationship Management

B2C focus

Marketing, Sales, Service

DW/DM?

User profiling/on-line target marketing

Customer tailored messages

Click-stream analysis


New Applications for DM

CRM = Customer Relationship Management

Click-stream analysis:

How many visits to a site? To a page?

Patterns / trajectories of visits

Relationship to marketing campaigns

Shopping cart analysis (75% abandoned)

Source is server-side data collection

Cognos has a number of BI tools
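
The click-stream questions above (visits per page, visit trajectories, cart abandonment) can be sketched in a few lines of Python. This is only an illustration: the log format, the page names and the function names are all assumptions, not taken from any particular BI tool.

```python
from collections import Counter

# Hypothetical server-side log: (session_id, page) pairs in time order.
LOG = [
    ("s1", "/home"), ("s1", "/product/42"), ("s1", "/cart"), ("s1", "/checkout"),
    ("s2", "/home"), ("s2", "/cart"),
    ("s3", "/home"), ("s3", "/search"),
    ("s4", "/home"), ("s4", "/product/7"), ("s4", "/cart"),
]


def page_visits(log):
    """Count hits per page."""
    return Counter(page for _, page in log)


def trajectories(log):
    """Group pages by session, preserving the order of the visit."""
    paths = {}
    for session, page in log:
        paths.setdefault(session, []).append(page)
    return paths


def cart_abandonment_rate(log, cart="/cart", done="/checkout"):
    """Share of sessions that reached the cart but never checked out."""
    with_cart = [p for p in trajectories(log).values() if cart in p]
    if not with_cart:
        return 0.0
    abandoned = [p for p in with_cart if done not in p]
    return len(abandoned) / len(with_cart)
```

In this toy log, three sessions reach the cart and two of them never check out, so `cart_abandonment_rate(LOG)` comes out at two thirds.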


Typical B2C Shopping Trip

[Diagram labels: View Homepage, Registration, Address Book, Shopping Cart, Product Advisor, Search, Receive Ack., Submit Order, Enter Payment Info., Enter Shipping Info., Select Products, Navigate]

Personalization through:

- Web-based data mining = web mining

- Collaborative Filtering

- Individual User Profiling
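
As a rough illustration of collaborative filtering, the sketch below predicts a user's rating for a product from the ratings of similar users. It assumes a user-based approach with cosine similarity, which is one common variant and not necessarily what any given site uses; the ratings and names are invented.

```python
from math import sqrt

# Hypothetical ratings: user -> {product: rating on a 1-5 scale}.
RATINGS = {
    "ann": {"book": 5, "cd": 3, "dvd": 4},
    "bob": {"book": 5, "cd": 3, "game": 5},
    "cat": {"book": 1, "cd": 5, "game": 1},
}


def cosine(a, b):
    """Cosine similarity over the products both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = sqrt(sum(a[p] ** 2 for p in shared))
    norm_b = sqrt(sum(b[p] ** 2 for p in shared))
    return dot / (norm_a * norm_b)


def predict(user, product, ratings):
    """Similarity-weighted average of other users' ratings for the product."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or product not in theirs:
            continue
        weight = cosine(ratings[user], theirs)
        num += weight * theirs[product]
        den += weight
    return num / den if den else None
```

Here `predict("ann", "game", RATINGS)` leans toward bob's high rating, because ann's ratings match bob's more closely than cat's.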



New Applications for DM

ERP = Enterprise Resource Planning


Forecasting and Planning


Purchasing and Material Management


Inventory Management


Finished Product distribution


Accounting and Finance


DW/DM?


On-line inventory prediction

On-line fraud detection

On-line process modeling


Four Approaches to Data Mining

1. Purchase models from external sources based on similar data (like buying an existing photo)

2. Purchase software with embedded expertise and use it on your data (like buying a fully automatic camera)

3. Hire an outside consultant to work with your data (like hiring a professional photographer)

4. Master the skills of data mining and develop your own models (like becoming a photographer)


The KDD Process



The KDD Process

[Diagram: Data Sources -> Data Consolidation -> Consolidated Data (Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]


The KDD Process

Possible results for any one effort:


Confirmation of the obvious

New knowledge - the data mine “nugget”

No significant relations found (random data)


The KDD Process

Core Problems & Approaches


Problems:

identification of relevant data

representation of data

search for valid pattern or model

Approaches:

top-down deduction by expert

interactive visualization of data/models

* bottom-up induction from data *

[Figure labels: Probability of sale plotted against Income and Age; Data Mining; OLAP]


The KDD Process

The Architecture of a KDD System

[Diagram labels: Graphical User Interface; Data Consolidation, Selection and Preprocessing, Data Mining, Interpretation and Evaluation; Data Sources, Warehouse, Knowledge]


The KDD Process

[Diagram: Data Consolidation -> Warehouse -> Selection and Preprocessing -> Data Mining -> p(x)=0.02 -> Interpretation and Evaluation -> Knowledge]


Data Consolidation

Garbage in, garbage out

The quality of results relates directly to quality of the data

50%-70% of KDD process effort will be spent on data consolidation, cleansing and preprocessing

Major justification for a corporate Data Warehouse


Data Consolidation & Warehousing

From data sources to consolidated data repository

[Diagram labels: sources RDBMS, Legacy DBMS, Flat Files, External; Data Consolidation and Cleansing; Warehouse or Datamart; Analysis and Info Sharing; flows Inflow, Metaflow, Upflow, Downflow, Outflow]


Data Consolidation

The Process


Collect & Consolidate

Define requirements - Generate data model

Identify authoritative sources (internal/external)

Extract required data (Prism, Passport, InfoPump)

Integrate into working database (ODS)

Generate meta-data = data about the data

Clean - Measure data quality at the source

Completeness - Accuracy - Integrity (Vality)

Load only clean data into warehouse

Schedule periodic source checking/cleansing
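
A minimal sketch of the "measure quality, load only clean data" steps, assuming records arrive as Python dictionaries. The field names and the plausibility rule for age are invented for illustration; real cleansing rules come from the data model and the business.

```python
# Hypothetical source records; None marks a missing value.
RECORDS = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 41000},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": 131, "income": 38000},  # implausible age
]


def completeness(records, field):
    """Fraction of records in which the field is present."""
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


def is_clean(record):
    """Assumed rule set: no missing values, and age within 0-120."""
    if any(value is None for value in record.values()):
        return False
    return 0 <= record["age"] <= 120


def clean_only(records):
    """Keep the records that pass every check, ready for loading."""
    return [r for r in records if is_clean(r)]
```

Running `completeness(RECORDS, "age")` reports how much of the source is usable before anything is loaded, and `clean_only(RECORDS)` is the subset that would go to the warehouse.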


Data Warehousing


A Process

Definition: The strategic collection, cleansing, and consolidation of organizational data to meet operational, analytical, and communication needs.

75% of early DW projects were not completed

Data warehousing is not a project

It is an on-going set of organizational activities

Must be driven by business benefits




Data Warehouse


An Objective


A clean, consistent and reliable source of organizational data

A Data Warehouse differs from an Operational DB in that it is subject oriented (not application oriented) and contains integrated data, as well as summaries and histories

A departmental DW is referred to as a Data Mart

Focus is on local, specific needs

More common than corporate-wide data warehouses



Data Warehousing

Common Choices for a Warehouse Repository

RDBMS (Oracle, Sybase, DB2, Red Brick)

supports very large, multipurpose databases

multidimensional access via ROLAP methods

slow for massive/complex data analysis

MDBMS (Accumate, Essbase, Oracle Express)

fast, full-feature OLAP

size limitations - 5 GB of raw data (100 GB total)

few standards, proprietary systems



Relationship between DW and DM?

[Diagram: Data Warehousing is the source of consolidated data for Analysis; Analysis is the rationale for data consolidation. Analysis covers Query/Reporting, OLAP and Data Mining, on a Strategic/Tactical scale]


The KDD Process

[Diagram: Data Consolidation -> Warehouse -> Selection and Preprocessing -> Data Mining -> p(x)=0.02 -> Interpretation and Evaluation -> Knowledge]


Selection and Preprocessing


Generate a set of examples


choose sampling method


consider sample complexity


deal with volume bias issues


Reduce attribute dimensionality


remove redundant and/or correlated attributes


combine attributes (sum, multiply, difference)


Reduce attribute value ranges


group symbolic discrete values


quantize continuous numeric values


OLAP and visualization tools play a key role (Han calls this descriptive data mining)
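
Two of the reductions listed above, quantizing a continuous value and removing a redundant attribute, can be sketched as follows. The bin edges, the 0.95 correlation threshold and the sample columns are assumptions chosen for illustration, and `pearson` assumes no column is constant.

```python
from math import sqrt


def quantize(value, edges):
    """Return the index of the bin that value falls in.

    edges are ascending cut points; bin 0 lies below edges[0].
    """
    return sum(1 for edge in edges if value >= edge)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical attributes: income_k is income in thousands, so it is redundant.
COLUMNS = {
    "age": [25, 40, 31, 58],
    "income": [30000, 52000, 61000, 45000],
    "income_k": [30, 52, 61, 45],
}
```

With cut points `[18, 35, 65]`, an age of 37 lands in bin 2, and `drop_correlated(COLUMNS)` keeps `age` and `income` while discarding the rescaled copy.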


OLAP: On-Line Analytical Processing

OLAP Functionality


Dimension selection


slice & dice


Rotation


allows change in perspective


Filtration



value range selection


Hierarchies


drill-downs to lower levels

roll-ups to higher levels


[Figure: OLAP cube of Profit Values with dimensions Year by Month, Product Class by Product Name, and Sales Region]
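
The OLAP operations above can be imitated on a tiny in-memory fact table. This is only a sketch of the idea: the table, the dimension names and the two helper functions are hypothetical and bear no relation to how a real OLAP server stores a cube.

```python
from collections import defaultdict

# Hypothetical fact table: (year, month, region, product_class, profit).
FACTS = [
    (2002, 1, "East", "Toys", 100),
    (2002, 1, "West", "Toys", 80),
    (2002, 2, "East", "Books", 50),
    (2003, 1, "East", "Toys", 120),
    (2003, 1, "West", "Books", 70),
]
DIMS = {"year": 0, "month": 1, "region": 2, "product_class": 3}


def aggregate(facts, by):
    """Sum profit grouped by the named dimensions.

    Fewer dimensions is a roll-up; more dimensions is a drill-down.
    """
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMS[d]] for d in by)
        totals[key] += row[4]
    return dict(totals)


def slice_(facts, **fixed):
    """Keep only the facts whose dimensions match the fixed values."""
    return [r for r in facts if all(r[DIMS[d]] == v for d, v in fixed.items())]
```

For example, `aggregate(FACTS, ["year"])` is a roll-up to yearly profit, `aggregate(FACTS, ["year", "month"])` drills down to months, and `slice_(FACTS, region="East")` fixes one dimension before aggregating.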


Selection and Preprocessing


Transform data


decorrelate and normalize values

map time-series data to static representation

Encode data

representation must be appropriate for the Data Mining tool which will be used

continue to reduce attribute dimensionality where possible without loss of information

Tools: OLAP and visualization tools, as well as transformation and encoding software
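
Two of the transformations named above can be sketched briefly. Min-max scaling is assumed here as the normalization, and a sliding window is assumed as the way of turning a time series into static examples; both are common choices rather than the only ones.

```python
def min_max(values):
    """Rescale values linearly onto the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def windows(series, width):
    """Map a time series to static examples.

    Each example pairs the previous `width` values with the value that follows.
    """
    return [
        (series[i:i + width], series[i + width])
        for i in range(len(series) - width)
    ]
```

With `width=3`, the series `[1, 2, 3, 4, 5]` becomes two examples: `([1, 2, 3], 4)` and `([2, 3, 4], 5)`, each usable as an ordinary input/output pair.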


Selection and Preprocessing


DEMO

Cognos - PowerPlay

An On-line Analytical Processing (OLAP) System


The KDD Process

[Diagram: Data Consolidation -> Warehouse -> Selection and Preprocessing -> Data Mining -> p(x)=0.02 -> Interpretation and Evaluation -> Knowledge]


Overview of Data Mining Methods


Automated Exploration/Discovery

e.g. discovering new market segments

distance and probabilistic clustering algorithms

Prediction/Classification

e.g. forecasting gross sales given current factors

regression, neural networks, genetic algorithms

Explanation/Description

e.g. characterizing customers by demographics and purchase history

inductive decision trees, association rule systems

[Figure labels: points in the x1-x2 plane; a curve f(x) against x; a rule "if age > 35 and income < $35k then ..."]

Focus is on induction of a model from specific examples


Data Mining Methods

Automated Exploration and Discovery


Distance-based numerical clustering

metric grouping of examples (KNN)

graphical visualization can be used

Bayesian clustering

search for the number of classes which result in best fit of a probability distribution to the data

AutoClass (NASA) is one of the best examples

[Figure: examples plotted by Income and Age]
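
As an illustration of distance-based clustering, here is a small k-means sketch. k-means is swapped in as one well-known distance-based method; it is not named in the slides, and the customer data and starting centroids are invented.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def k_means(points, centroids, rounds=10):
    """Plain k-means from given starting centroids; returns (centroids, labels)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(rounds):
        labels = [closest(p, centroids) for p in points]
        new = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new.append(old)  # an empty cluster keeps its centroid
        if new == centroids:
            break
        centroids = new
    return centroids, [closest(p, centroids) for p in points]


# Hypothetical customers as (age, income in $k).
CUSTOMERS = [(25, 30), (27, 32), (24, 28), (55, 80), (58, 85), (60, 78)]
```

Calling `k_means(CUSTOMERS, [(25, 30), (55, 80)])` separates the younger, lower-income group from the older, higher-income one. In practice the attributes would be normalized first so that no single scale dominates the distance.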


Data Mining Methods

Prediction and Classification


Function approximation (curve fitting)

Classification (concept learning, pattern recognition)



Methods:


Statistical regression


Artificial neural networks


Genetic algorithms


Nearest neighbour algorithms

[Figure labels: a network with inputs I1-I4 and outputs O1, O2; a curve f(x) against x; classes A and B in the x1-x2 plane]
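
Of the methods listed, a nearest neighbour classifier is the shortest to write down. The sketch below uses majority voting among the k closest training examples; the training points and class labels are made up.

```python
from collections import Counter


def knn_classify(query, examples, k=3):
    """Label query by majority vote among its k nearest training examples."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(examples, key=lambda ex: dist2(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: ((x1, x2), class).
TRAIN = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]
```

A query near the first group, such as `knn_classify((2.0, 2.0), TRAIN)`, is labelled A; one near the second group is labelled B.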


Data Mining Methods

Generalization


The objective of learning is to achieve good generalization to new cases, otherwise just use a look-up table.

Generalization can be defined as a mathematical interpolation or regression over a set of training points:

[Figure: f(x) plotted against x]


Data Mining Methods

Generalization


Generalization accuracy can be guaranteed for a specified confidence level given a sufficient number of examples

Models can be validated with a previously unseen test set, or their accuracy can be approximated by cross-validation methods

[Figure: f(x) plotted against x]
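
Cross-validation can be sketched with a deliberately simple model. The example assumes a least-squares straight line as the model and mean squared error as the measure; the fold assignment (every third point) and the data are illustrative choices only.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b to (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx


def cross_validate(points, folds=3):
    """Mean squared error on held-out points, averaged over the folds."""
    errors = []
    for f in range(folds):
        test = points[f::folds]
        train = [p for i, p in enumerate(points) if i % folds != f]
        a, b = fit_line(train)
        errors.append(sum((a * x + b - y) ** 2 for x, y in test) / len(test))
    return sum(errors) / folds


# Hypothetical data lying exactly on y = 2x + 1.
DATA = [(x, 2 * x + 1) for x in range(9)]
```

Each point is held out exactly once, so the averaged error estimates how the model behaves on cases it was not fitted to. On this noise-free data the error is essentially zero; on real data it would not be.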


Data Mining Methods


DEMO

Ward Systems - NeuroShell 2

An artificial neural network system


Data Mining Methods

Explanation and Description


Learn a generalized hypothesis (model) from selected data

Description/Interpretation of model provides new human knowledge


Methods:


Inductive decision tree and rule systems


Association rule systems


Link Analysis

[Figure: decision tree with test nodes A?, B?, C?, D?, showing the Root, a Leaf and a Yes branch]
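
For association rule systems, the two standard measures of a rule are its support and its confidence. The sketch below computes both over a handful of invented market baskets; it does not search for rules, it only scores a given one.

```python
# Hypothetical market baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(lhs, rhs, baskets):
    """Estimated P(rhs | lhs) for the rule lhs -> rhs."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)
```

In this data the rule butter -> bread holds in every basket that contains butter, so its confidence is 1.0, while bread -> milk holds in two of the three bread baskets.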


Modeling & Data Mining


DEMO

Angoss - KnowledgeSEEKER

An inductive decision tree/rule system


The KDD Process

[Diagram: Data Consolidation and Warehousing -> Warehouse -> Selection and Preprocessing -> Data Mining -> p(x)=0.02 -> Interpretation and Evaluation -> Knowledge]


Interpretation and Evaluation

Evaluation


Statistical validation and significance testing


Qualitative review by experts in the field


Pilot surveys to evaluate model accuracy

Interpretation


Inductive tree and rule models can be read directly


Clustering results can be graphed and tabulated

Code can be automatically generated by some systems (ANNs, IDTs, Regression models)


Interpretation and Evaluation

Visualization tools can be very helpful:


sensitivity analysis (I/O relationship)


histograms of value distributions


time-series plots and animation

requires training and practice

[Figure: Response plotted against Velocity and Temp]
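
Sensitivity analysis of an input/output relationship can be approximated by nudging one input at a time and watching the output. The sketch assumes a one-sided finite difference, and the `response` function is a made-up stand-in for a trained model.

```python
def sensitivity(model, inputs, delta=0.01):
    """Approximate change in output per unit change of each input.

    Uses a one-sided finite difference around the given operating point.
    """
    base = model(**inputs)
    result = {}
    for name, value in inputs.items():
        bumped = dict(inputs, **{name: value + delta})
        result[name] = (model(**bumped) - base) / delta
    return result


def response(velocity, temp):
    """Hypothetical fitted model, standing in for an ANN or regression."""
    return 3.0 * velocity + 0.5 * temp
```

For this linear stand-in, `sensitivity(response, {"velocity": 10.0, "temp": 20.0})` recovers the two coefficients; for a nonlinear model the result would depend on the operating point chosen.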


Benefits, Costs, Status and Trends



Benefits of KDD


Maximum utility from corporate data


discovery of new knowledge


generation of predictive models


Important feedback to data warehousing effort


identification and justification of essential data


Reduction of application dev’t backlog

model development vs. software development

Effect on bottom line of organization

cost reduction, increased productivity, risk avoidance … competitive advantage


Requirements and Costs of KDD


Hardware - computationally intensive

Software - micro < $20k, integrated suites $100k+

Data - internal collection, surveys, external sources

Human resources

DB/DP/DC expertise to consolidate and preprocess data

Machine learning and stats competence

Application knowledge & project mgmt

70% of the effort is expended on the data consolidation and preprocessing activities


The Current Status and Trends


Standards and methodology lag technology


Many products:


micro DM packages (Cognos, Angoss)

macro - integrated suites (SAS, IBM, SPSS)

most will not work with each other

Software costs have risen x100 over 6 years

Major players yet to be determined in certain verticals

Internet - “the” sink and source of data


Legal and ethical issues on the horizon


Knowledge Discovery Products




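Installed and Planned

[Bar chart: KDD tools installed and planned, on an axis from 0% to 60%. Categories: Query Tools, MOLAP, Multidimensional Database, Data Mining, ROLAP. Bar values as extracted, pairing with categories not recoverable: 46%, 55%, 40%, 43%, 14%, 31%, 16%, 37%, 31%, 17%]

Source: 1997 survey of IS managers by Sentry

What will the competition be doing?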


Top 5 DM Trends for 2002-04

from Aaron Zornes, META Group:

More use of predictive models

Better models based on more data

Systems will be easier for domain knowledge experts to use

Use of XML-based Predictive Model Markup Language (PMML) for knowledge exchange

Standards/methods for integration with DBMS for rapid creation and deployment



The Current Status and Trends

What has prevented the use of Data Mining?


Products:


General in nature, not tailored for business


Missing standard interfaces to organizational data


Emphasis on sales and not training/consulting


Customers:


Frightened by technical skill set required


Uncertain of mining results and ROI


Convinced warehouse must be completed first


Lacking knowledge of external data sources


Key Technologies for KDD


Data warehousing and distributed database


Parallel computing


AI and expert systems


Machine learning and statistical inference


Visualization (including Virtual Reality)


Internet - future sink and source of data


adaptive filters, knowledge extractors


smart web services


Current Management Issues


Ownership of data and knowledge


Security of customer data


Responsibility for accuracy of information

Ethical practices - fair use of data


A List of Major Vendors

Lots of Players

Approaching market from hardware, database, statistical, machine learning, education, financial/marketing, and management consulting:

IBM, SAS, SPSS, SGI, Thinking Machines, Cognos, ZDM Scientific, Neuralware, Information Discovery, American Heuristics, Data Distilleries, SuperInduction


THE END


danny.silver@acadiau.ca