Data Mining Tools


Overview & Tutorial

Ahmed Sameh

Prince Sultan University

Department of Computer Science &
Info Sys

May 2010

(Some slides belong to IBM)



1

2

Introduction Outline


Define data mining


Data mining vs. databases


Basic data mining tasks


Data mining development


Data mining issues

Goal:

Provide an overview of data mining.

3

Introduction


Data is growing at a phenomenal
rate


Users expect more sophisticated
information


How?


UNCOVER HIDDEN INFORMATION

DATA MINING


4

Data Mining Definition


Finding hidden information in a
database


Fit data to a model


Similar terms


Exploratory data analysis


Data driven discovery


Deductive learning

5

Data Mining Algorithm


Objective: Fit Data to a Model


Descriptive


Predictive


Preference


Technique to choose
the best model


Search


Technique to search the
data


“Query”

6

Database Processing vs. Data
Mining Processing


Database processing:

Query: well defined; expressed in SQL

Data: operational data

Output: precise; a subset of the database


Data mining processing:

Query: poorly defined; no precise query language

Data: not operational data

Output: fuzzy; not a subset of the database


7

Query Examples


Database: Find all customers who have purchased milk.

Data Mining: Find all items which are frequently purchased with milk. (association rules)


Database: Find all credit applicants with last name of Smith.

Data Mining: Find all credit applicants who are poor credit risks. (classification)


Database: Identify customers who have purchased more than $10,000 in the last month.

Data Mining: Identify customers with similar buying habits. (clustering)

8

Related Fields



Statistics

Machine

Learning

Databases

Visualization

Data Mining and

Knowledge Discovery

9

Statistics, Machine Learning
and Data Mining


Statistics:

more theory-based

more focused on testing hypotheses


Machine learning:

more heuristic

focused on improving performance of a learning agent

also looks at real-time learning and robotics, areas not part of data mining


Data Mining and Knowledge Discovery:

integrates theory and heuristics

focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results


Distinctions are fuzzy

Definition


A class of database applications that analyze
data in a database using tools which look
for trends or anomalies.


Data mining was popularized in large part by IBM.

Purpose


To look for hidden patterns or previously
unknown relationships among the data in a
group of data that can be used to predict future
behavior.


Ex: Data mining software can help retail
companies find customers with common
interests.

Background Information


Many of the techniques used by today's data
mining tools have been around for many years,
having originated in the artificial intelligence
research of the 1980s and early 1990s.


Data Mining tools are only now being applied
to large-scale database systems.

The Need for Data Mining


The amount of raw data stored in corporate
data warehouses is growing rapidly.


There is too much data and complexity that
might be relevant to a specific problem.


Data mining promises to bridge the analytical
gap by giving knowledge workers the tools to
navigate this complex analytical space.

The Need for Data Mining, cont’


The need for information has resulted in the
proliferation of data warehouses that integrate
information from multiple sources to support
decision making.


Often include data from external sources, such
as customer demographics and household
information.

Definition (Cont.)

Data mining is the exploration and analysis of large quantities
of data in order to discover
valid, novel, potentially useful,
and ultimately understandable patterns in data.


Valid
: The patterns hold in general.

Novel
: We did not know the pattern
beforehand.

Useful
: We can devise actions from the
patterns.

Understandable
: We can interpret and
comprehend the patterns.

Of “laws”, Monsters, and Giants…


Moore’s law: processing “capacity” doubles
every 18 months :
CPU, cache, memory


Its more aggressive cousin:


Disk storage “capacity” doubles every 9
months

[Chart: Disk TB Shipped per Year, 1988–2000 — disk TB growth 112%/y vs. Moore's Law 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

What do the two
“laws” combined
produce?

A rapidly growing
gap between our
ability to generate
data, and our ability
to make use of it.
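The two growth "laws" above reduce to simple compound-growth arithmetic: a doubling period of d months implies an annual growth factor of 2^(12/d). A minimal sketch (the 18-month and 9-month figures come from the slide):

```python
# Annualized growth rate implied by a capacity-doubling period.
# Compound growth: factor per year = 2 ** (12 / doubling_months).

def annual_growth_rate(doubling_months: float) -> float:
    """Fractional growth per year if capacity doubles every `doubling_months`."""
    return 2 ** (12.0 / doubling_months) - 1

# Moore's law: doubling every 18 months -> ~58.7% per year (as in the chart).
print(f"18-month doubling: {annual_growth_rate(18):.1%}/y")
# Disk capacity: doubling every 9 months -> ~152% per year.
print(f" 9-month doubling: {annual_growth_rate(9):.1%}/y")
```

Note the chart's 112%/y disk figure corresponds to a doubling time of roughly 11 months, slightly slower than the 9-month rule of thumb on the slide.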

What is Data Mining?

Finding interesting structure in
data


Structure:
refers to statistical patterns,
predictive models, hidden relationships



Examples of tasks addressed by Data Mining


Predictive Modeling (classification,
regression)


Segmentation (Data Clustering )


Summarization


Visualization

19

Major Application Areas for

Data Mining Solutions


Advertising


Bioinformatics


Customer Relationship Management (CRM)


Database Marketing



Fraud Detection


eCommerce


Health Care


Investment/Securities


Manufacturing, Process Control


Sports and Entertainment


Telecommunications


Web

20

Data Mining


The non-trivial extraction of novel, implicit, and
actionable knowledge from large datasets.


Extremely large datasets


Discovery of the non-obvious


Useful knowledge that can improve processes


Cannot be done manually


Technology to enable data exploration, data analysis,
and data visualization of very large databases at a high
level of abstraction,
without a specific hypothesis in
mind
.


Sophisticated data search capability that uses statistical
algorithms to discover patterns and correlations in data.

21

Data Mining (cont.)

22

Data Mining (cont.)


Data Mining is a step of Knowledge Discovery
in Databases (
KDD
) Process


Data Warehousing


Data Selection


Data Preprocessing


Data Transformation


Data Mining


Interpretation/Evaluation


Data Mining is sometimes referred to as KDD
and DM and KDD tend to be used as
synonyms

23

Data Mining Evaluation

24

Data Mining is Not …


Data warehousing


SQL / Ad Hoc Queries / Reporting


Software Agents


Online Analytical Processing (OLAP)


Data Visualization

25

Data Mining Motivation


Changes in the Business Environment


Customers becoming more demanding


Markets are saturated


Databases today are huge:


More than 1,000,000 entities/records/rows


From 10 to 10,000 fields/attributes/variables


Gigabytes and terabytes


Databases are growing at an unprecedented
rate


Decisions must be made rapidly


Decisions must be made with maximum
knowledge

Why Use Data Mining Today?

Human analysis skills are inadequate:


Volume and dimensionality of the data


High data growth rate


Availability of:


Data


Storage


Computational power


Off-the-shelf software


Expertise

An Abundance of Data


Supermarket scanners, POS data


Preferred customer cards


Credit card transactions


Direct mail response


Call center records


ATM machines


Demographic data


Sensor networks


Cameras


Web server logs


Customer web site trails

Evolution of Database Technology


1960s: IMS, network model


1970s: The relational data model, first relational
DBMS implementations


1980s: Maturing RDBMS, application-specific
DBMS (spatial data, scientific data, image data,
etc.), OODBMS


1990s: Mature, high-performance RDBMS
technology, parallel DBMS, terabyte data
warehouses, object-relational DBMS, middleware
and web technology


2000s: High availability, zero-administration,
seamless integration into business processes


2010: Sensor database systems, databases on
embedded systems, P2P database systems,
large-scale pub/sub systems, ???

Much Commercial Support


Many data mining tools


http://www.kdnuggets.com/software



Database systems with data mining
support


Visualization tools


Data mining process support


Consultants

Why Use Data Mining Today?

Competitive pressure!

“The secret of success is to know something that
nobody else knows.”

Aristotle Onassis



Competition on service, not only on price (Banks,
phone companies, hotel chains, rental car
companies)


Personalization, CRM


The real-time enterprise


“Systemic listening”


Security, homeland defense

The Knowledge Discovery Process

Steps:

1.
Identify business problem

2.
Data mining

3.
Action

4.
Evaluation and measurement

5.
Deployment and integration into
business processes

Data Mining Step in Detail

2.1 Data preprocessing


Data selection: Identify target
datasets and relevant fields


Data cleaning


Remove noise and outliers


Data transformation


Create common units


Generate new fields

2.2 Data mining model construction

2.3 Model evaluation
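The preprocessing steps above (selection, cleaning, transformation, generating new fields) can be sketched on a toy record set; all field names, units, and thresholds here are illustrative, not from any real system:

```python
# Toy data-preprocessing pass: drop noisy outliers, convert to common
# units, and generate a derived field. All names/values are illustrative.

records = [
    {"id": 1, "height_in": 70, "weight_lb": 150},
    {"id": 2, "height_in": 68, "weight_lb": 9999},   # noisy outlier
    {"id": 3, "height_in": 72, "weight_lb": 180},
]

def preprocess(rows):
    out = []
    for r in rows:
        if r["weight_lb"] > 1000:        # data cleaning: remove outliers
            continue
        clean = {
            "id": r["id"],
            "height_cm": round(r["height_in"] * 2.54, 1),    # common units
            "weight_kg": round(r["weight_lb"] * 0.4536, 1),
        }
        # data transformation: generate a new field (body-mass index)
        clean["bmi"] = round(clean["weight_kg"] / (clean["height_cm"] / 100) ** 2, 1)
        out.append(clean)
    return out

print(preprocess(records))  # record 2 is dropped; the others gain a bmi field
```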

Preprocessing and Mining

Original Data → (Data Integration and Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → (Interpretation) → Knowledge

34

Data Mining Techniques

Descriptive: Clustering, Association, Sequential Analysis

Predictive: Classification, Regression

Techniques include: Decision Tree, Rule Induction, Neural Networks, Nearest Neighbor Classification


35
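Of the techniques listed above, nearest-neighbor classification is the easiest to sketch in a few lines: label a new point with the class of its closest training example. The points and risk labels below are made up for illustration:

```python
# 1-nearest-neighbor classifier: assign a query point the class of the
# closest training point (Euclidean distance). Data is illustrative.
import math

train = [((1.0, 1.0), "low-risk"), ((1.2, 0.8), "low-risk"),
         ((5.0, 5.0), "high-risk"), ((4.8, 5.3), "high-risk")]

def classify(point):
    return min(train, key=lambda t: math.dist(point, t[0]))[1]

print(classify((1.1, 0.9)))  # low-risk
print(classify((5.1, 4.9)))  # high-risk
```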

Data Mining Models and Tasks

36

Basic Data Mining Tasks


Classification
maps data into
predefined groups or classes


Supervised learning


Pattern recognition


Prediction


Regression

is used to map a data item
to a real valued prediction variable.


Clustering
groups similar data
together into clusters.


Unsupervised learning


Segmentation


Partitioning

37
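The clustering task above (unsupervised grouping of similar data) can be sketched with a minimal k-means loop; the 1-D values and starting centers are made up for illustration:

```python
# Minimal k-means on 1-D data: alternate an assignment step (each point to
# its nearest center) and an update step (centers move to cluster means).

def kmeans_1d(xs, centers, iters=20):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for x in xs:                                   # assignment step
            nearest = min(centers, key=lambda c: abs(x - c))
            clusters[nearest].append(x)
        centers = [round(sum(v) / len(v), 6)           # update step
                   for v in clusters.values() if v]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
print(kmeans_1d(data, [0.0, 5.0]))  # -> [1.0, 8.0]
```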

Basic Data Mining Tasks (cont’d)


Summarization
maps data into subsets
with associated simple descriptions.


Characterization


Generalization


Link Analysis

uncovers relationships
among data.


Affinity Analysis


Association Rules


Sequential Analysis determines sequential
patterns.

38
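The affinity-analysis / association-rule task above starts from co-occurrence counting: how often does a pair of items appear in the same basket? A minimal support-counting sketch over made-up baskets:

```python
# Association-rule basics: count pair co-occurrence support across market
# baskets and keep pairs above a minimum support. Baskets are made up.
from itertools import combinations
from collections import Counter

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"milk", "eggs"}, {"bread", "butter"}]

pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):
        pair_counts[pair] += 1

min_support = 0.5   # pair must appear in at least 50% of baskets
frequent = {p: c / len(baskets) for p, c in pair_counts.items()
            if c / len(baskets) >= min_support}
print(frequent)  # {('bread', 'milk'): 0.5, ('eggs', 'milk'): 0.5}
```

Full association-rule miners (e.g. Apriori) extend this idea to larger itemsets and derive rules with confidence scores from the supports.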

Ex: Time Series Analysis


Example: Stock Market


Predict future values


Determine similar patterns over time


Classify behavior

39
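The simplest form of the "predict future values" task above is a moving-average forecast over the last few observations; the price series is illustrative:

```python
# Naive time-series forecasting: predict the next value as the moving
# average of the last k observations. Prices are illustrative.

def forecast(series, k=3):
    window = series[-k:]
    return sum(window) / len(window)

prices = [100.0, 102.0, 101.0, 104.0, 106.0]
print(forecast(prices))  # (101 + 104 + 106) / 3 = 103.666...
```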

Data Mining vs. KDD


Knowledge Discovery in
Databases (KDD):

process of
finding useful information and
patterns in data.


Data Mining:

Use of algorithms to
extract the information and patterns
derived by the KDD process.

40

Data Mining Development


Similarity Measures


Hierarchical Clustering


IR Systems


Imprecise Queries


Textual Data


Web Search Engines



Bayes Theorem


Regression Analysis


EM Algorithm


K-Means Clustering


Time Series Analysis


Neural Networks


Decision Tree Algorithms


Algorithm Design Techniques


Algorithm Analysis


Data Structures


Relational Data Model


SQL


Association Rule Algorithms


Data Warehousing


Scalability Techniques


41

KDD Issues


Human Interaction


Overfitting



Outliers



Interpretation


Visualization


Large Datasets


High Dimensionality

42

KDD Issues (cont’d)


Multimedia Data


Missing Data


Irrelevant Data


Noisy Data


Changing Data


Integration


Application

43

Visualization Techniques


Graphical


Geometric


Icon-based


Pixel-based


Hierarchical


Hybrid

44

Data Mining Applications

45

Data Mining Applications:

Retail


Performing basket analysis


Which items customers tend to purchase together. This
knowledge can improve stocking, store layout
strategies, and promotions.


Sales forecasting


Examining time-based patterns helps retailers make
stocking decisions. If a customer purchases an item
today, when are they likely to purchase a
complementary item?


Database marketing


Retailers can develop profiles of customers with certain
behaviors, for example, those who purchase designer-label
clothing or those who attend sales. This
information can be used to focus cost-effective
promotions.


Merchandise planning and allocation


When retailers add new stores, they can improve
merchandise planning and allocation by examining
patterns in stores with similar demographic
characteristics. Retailers can also use data mining to
determine the ideal layout for a specific store.

46

Data Mining Applications:

Banking


Card marketing


By identifying customer segments, card issuers and
acquirers can improve profitability with more effective
acquisition and retention programs, targeted product
development, and customized pricing.


Cardholder pricing and profitability


Card issuers can take advantage of data mining
technology to price their products so as to maximize
profit and minimize loss of customers. Includes
risk-based pricing.


Fraud detection


Fraud is enormously costly. By analyzing past
transactions that were later determined to be
fraudulent, banks can identify patterns.



Predictive life-cycle management


DM helps banks predict each customer’s lifetime value
and to service each segment appropriately (for example,
offering special deals and discounts).

47

Data Mining Applications:

Telecommunication


Call detail record analysis


Telecommunication companies accumulate detailed
call records. By identifying customer segments with
similar use patterns, the companies can develop
attractive pricing and feature promotions.


Customer loyalty


Some customers repeatedly switch providers, or
“churn”, to take advantage of attractive incentives
by competing companies. The companies can use
DM to identify the characteristics of customers who
are likely to remain loyal once they switch, thus
enabling the companies to target their spending on
customers who will produce the most profit.

48

Data Mining Applications:

Other Applications


Customer segmentation


All industries can take advantage of DM to discover
discrete segments in their customer bases by
considering additional variables beyond traditional
analysis.


Manufacturing


Through choice boards, manufacturers are beginning to
customize products for customers; therefore they must
be able to predict which features should be bundled to
meet customer demand.


Warranties


Manufacturers need to predict the number of customers
who will submit warranty claims and the average cost of
those claims.


Frequent flier incentives


Airlines can identify groups of customers that can be
given incentives to fly more.

49

A producer wants to know….


Which are our lowest/highest margin customers?


Who are my customers and what products are they buying?


Which customers are most likely to go to the competition?


What impact will new products/services have on revenue and margins?


What product promotions have the biggest impact on revenue?


What is the most effective distribution channel?

50

Data, Data everywhere

yet ...


I can’t find the data I need


data is scattered over the
network


many versions, subtle
differences



I can’t get the data I need


need an expert to get the data


I can’t understand the data I
found


available data poorly documented




I can’t use the data I found


results are unexpected


data needs to be transformed
from one form to another

51

What is a Data Warehouse?



A single, complete and
consistent store of data
obtained from a variety
of different sources
made available to end
users in a way they
can understand and use
in a business context.



[Barry Devlin]

52

What are the users saying...


Data should be integrated
across the enterprise


Summary data has a real
value to the organization


Historical data holds the
key to understanding data
over time


What-if capabilities are
required

53

What is Data Warehousing?



A process of transforming data
into information and
making it available to
users in a timely
enough manner to
make a difference


[Forrester Research, April 1996]


Data → Information

54

Very Large Data Bases


Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes

Petabytes -- 10^15 bytes: Geographic Information Systems

Exabytes -- 10^18 bytes: National Medical Records

Zettabytes -- 10^21 bytes: Weather images

Yottabytes -- 10^24 bytes: Intelligence Agency Videos

55

Data Warehousing --


It is a process


Technique for assembling and
managing data from various
sources for the purpose of
answering business questions,
thus making decisions that
were not previously possible


A decision support database
maintained separately from
the organization’s operational
database

56

Data Warehouse



A data warehouse is a


subject-oriented


integrated


time-varying


non-volatile


collection of data that is used primarily in
organizational decision making.




-- Bill Inmon, Building the Data Warehouse, 1996

Data Warehousing Concepts


Decision support is key for companies wanting
to turn their organizational data into an
information asset


Traditional databases are transaction-oriented,
while the data warehouse is data-retrieval
optimized for decision support


Data Warehouse

"A subject
-
oriented, integrated, time
-
variant,
and non
-
volatile collection of data in support of
management's decision
-
making process"


OLAP (on-line analytical processing), Decision
Support Systems (DSS), Executive Information
Systems (EIS), and data mining applications

57

What does a data warehouse do?




integrate diverse information from
various systems, which enables users to
quickly produce powerful ad-hoc queries
and perform complex analysis



create an infrastructure for reusing the
data in numerous ways



create an open systems environment to
make useful information easily accessible
to authorized users



help managers make informed decisions

58

Benefits of Data Warehousing


Potential high returns on investment


Competitive advantage


Increased productivity of corporate
decision-makers

59

Comparison of OLTP and Data Warehousing

OLTP systems vs. data warehousing systems:

Holds current data | Holds historic data

Stores detailed data | Stores detailed, lightly, and highly summarized data

Data is dynamic | Data is largely static

Repetitive processing | Ad hoc, unstructured, and heuristic processing

High level of transaction throughput | Medium to low transaction throughput

Predictable pattern of usage | Unpredictable pattern of usage

Transaction driven | Analysis driven

Application oriented | Subject oriented

Supports day-to-day decisions | Supports strategic decisions

Serves large number of clerical / operational users | Serves relatively lower number of managerial users

60

Data Warehouse Architecture


Operational Data


Load Manager


Warehouse Manager


Query Manager


Detailed Data


Lightly and Highly Summarized Data


Archive / Backup Data


Meta-Data


End-user Access Tools

61

End-user Access Tools


Reporting and query tools


Application development tools


Executive Information System (EIS)
tools


Online Analytical Processing (OLAP)
tools


Data mining tools

62

Data Warehousing Tools and Technologies


Extraction, Cleansing, and Transformation
Tools


Data Warehouse DBMS


Load performance


Load processing


Data quality management


Query performance


Terabyte scalability


Networked data warehouse


Warehouse administration


Integrated dimensional tools


Advanced query functionality

63

Data Marts


A subset of data warehouse that
supports the requirements of a
particular department or business
function

64

Online Analytical Processing (OLAP)


OLAP


The dynamic synthesis, analysis, and
consolidation of large volumes of
multi-dimensional data


Multi-dimensional OLAP


Cubes of data

65

[Figure: a data cube with dimensions Time, City, and Product type]
Problems of Data Warehousing


Underestimation of resources for
data loading


Hidden problem with source systems


Required data not captured


Increased end-user demands


Data homogenization


High demand for resources


Data ownership


High maintenance


Long duration projects


Complexity of integration

66

Codd's Rules for OLAP


Multi-dimensional conceptual view


Transparency


Accessibility


Consistent reporting performance


Client-server architecture


Generic dimensionality


Dynamic sparse matrix handling


Multi-user support


Unrestricted cross-dimensional operations


Intuitive data manipulation


Flexible reporting


Unlimited dimensions and aggregation levels

67

OLAP Tools


Multi-dimensional OLAP (MOLAP)


Multi-dimensional DBMS (MDDBMS)


Relational OLAP (ROLAP)


Creation of multiple multi-dimensional
views of the two-dimensional relations


Managed Query Environment (MQE)


Deliver selected data directly from the
DBMS to the desktop in the form of a
data cube, where it is stored, analyzed,
and manipulated locally

68

Data Mining


Definition


The process of extracting valid, previously
unknown, comprehensible, and actionable
information from large database and using
it to make crucial business decisions


Knowledge discovery


Association rules


Sequential patterns


Classification trees


Goals


Prediction


Identification


Classification


Optimization


69

Data Mining Techniques


Predictive Modeling


Supervised training with two phases


Training phase : building a model using
large sample of historical data called
the training set


Testing phase : trying the model on
new data


Database Segmentation


Link Analysis


Deviation Detection


70

What are Data Mining Tasks?


Classification


Regression


Clustering



Summarization


Dependency modeling


Change and Deviation Detection

71

What are Data Mining Discoveries?


New Purchase Trends


Plan Investment Strategies


Detect Unauthorized Expenditure


Fraudulent Activities


Crime Trends


Smuggler border crossings

72

73

Data Warehouse Architecture

Data Warehouse

Engine

Optimized Loader

Extraction

Cleansing

Analyze

Query

Metadata Repository

Relational

Databases

Legacy

Data

Purchased

Data

ERP

Systems

74

Data Warehouse for Decision
Support & OLAP


Putting Information technology to help the
knowledge worker make faster and better
decisions


Which of my customers are most likely to go
to the competition?


What product promotions have the biggest
impact on revenue?


How did the share price of software
companies correlate with profits over last 10
years?

75

Decision Support


Used to manage and control business


Data is historical or point-in-time


Optimized for inquiry rather than update


Use of the system is loosely defined and
can be ad-hoc


Used by managers and end-users to
understand the business and make
judgements

76

Data Mining works with Warehouse
Data


Data Warehousing
provides the Enterprise
with a memory



Data Mining provides
the Enterprise with
intelligence


77

We want to know ...


Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?


Which types of transactions are likely to be fraudulent
given the demographics and transactional history of a
particular customer?


If I raise the price of my product by Rs. 2, what is the
effect on my ROI?


If I offer only 2,500 airline miles as an incentive to
purchase rather than 5,000, how many lost responses will
result?


If I emphasize ease-of-use of the product as opposed to its
technical capabilities, what will be the net effect on my
revenues?


Which of my customers are likely to be the most loyal?


Data Mining helps extract such information

78

Application Areas

Industry

Application

Finance

Credit Card Analysis

Insurance

Claims, Fraud Analysis

Telecommunication

Call record analysis

Transport

Logistics management

Consumer goods

promotion analysis

Data Service providers

Value added data

Utilities

Power usage analysis

79

Data Mining in Use


The US Government uses Data Mining to
track fraud


A Supermarket becomes an information
broker


Basketball teams use it to track game
strategy


Cross Selling


Warranty Claims Routing


Holding on to Good Customers


Weeding out Bad Customers

80

What makes data mining possible?


Advances in the following areas are
making data mining deployable:


data warehousing


better and more data (i.e., operational,
behavioral, and demographic)


the emergence of easily deployed data
mining tools and


the advent of new data mining
techniques.


--

Gartner Group

81

Why Separate Data Warehouse?


Performance


Operational databases are designed and tuned for known
transactions and workloads.


Complex OLAP queries would degrade performance for
operational transactions.


Special data organization, access, and implementation
methods are needed for multidimensional views and queries.


Function


Missing data: Decision support requires historical data, which
op dbs do not typically maintain.


Data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from many
heterogeneous sources: op dbs, external sources.


Data quality: Different sources typically use inconsistent data
representations, codes, and formats which have to be
reconciled.

82

What are Operational Systems?


They are OLTP systems


Run mission critical
applications


Need to work with
stringent performance
requirements for
routine tasks


Used to run a
business!




83

RDBMS used for OLTP


Database Systems have been used
traditionally for OLTP


clerical data processing tasks


detailed, up to date data


structured repetitive tasks


read/update a few records


isolation, recovery and integrity are
critical

84

Operational Systems



Run the business in real time


Based on up-to-the-second data


Optimized to handle large
numbers of simple read/write
transactions


Optimized for fast response to
predefined transactions


Used by people who deal with
customers, products -- clerks,
salespeople, etc.


They are increasingly used by
customers

85

Examples of Operational Data

Data | Industry | Usage | Technology | Volumes

Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium

Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large

Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very Large

Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large

Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium

86

Application-Orientation vs. Subject-Orientation

Application-Orientation (operational database): Loans, Credit Card, Trust, Savings

Subject-Orientation (data warehouse): Customer, Vendor, Product, Activity
87

OLTP vs. Data Warehouse


OLTP systems are tuned for known
transactions and workloads while
workload is not known a priori in a data
warehouse


Special data organization, access methods
and implementation methods are needed
to support data warehouse queries
(typically multidimensional queries)


e.g., average amount spent on phone calls
between 9AM-5PM in Pune during the month
of December


88

OLTP vs Data Warehouse


OLTP:

Application Oriented

Used to run business

Detailed data

Current up to date

Isolated Data

Repetitive access

Clerical User


Warehouse (DSS):

Subject Oriented

Used to analyze business

Summarized and refined

Snapshot data

Integrated Data

Ad-hoc access

Knowledge User (Manager)
89

OLTP vs Data Warehouse


OLTP:

Performance sensitive

Few records accessed at a time (tens)

Read/update access

No data redundancy

Database size 100 MB - 100 GB


Data Warehouse:

Performance relaxed

Large volumes accessed at a time (millions)

Mostly read (batch update)

Redundancy present

Database size 100 GB - a few terabytes
90

OLTP vs Data Warehouse


OLTP


Transaction
throughput is the
performance metric


Thousands of users


Managed in
entirety



Data Warehouse


Query throughput
is the performance
metric


Hundreds of users


Managed by
subsets

91

To summarize ...


OLTP systems are used to “run” a business


The Data Warehouse helps to “optimize” the business
92

Why Now?


Data is being produced


ERP provides clean data


The computing power is available


The computing power is affordable


The competitive pressures are
strong


Commercial products are available

93

Myths surrounding OLAP Servers
and Data Marts


Data marts and OLAP servers are departmental
solutions supporting a handful of users


Million-dollar massively parallel hardware is
needed to deliver fast response times for complex queries


OLAP servers require massive and unwieldy
indices


Complex OLAP queries clog the network with
data


Data warehouses must be at least 100 GB to be
effective


Source
--

Arbor Software Home Page

II. On-Line Analytical Processing (OLAP)

Making Decision
Support Possible

95

Typical OLAP Queries


Write a multi-table join to compare sales for each
product line YTD this year vs. last year.


Repeat the above process to find the top 5
product contributors to margin.


Repeat the above process to find the sales of a
product line to new vs. existing customers.


Repeat the above process to find the customers
that have had negative sales growth.

96
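The first query above (compare sales per product line this year vs. last year) reduces to a group-by over two periods. A minimal sketch in Python terms, on made-up rows (a real system would push this into SQL):

```python
# Compare sales for each product line this year vs. last year by grouping
# flat (line, year, amount) rows. Rows and years are illustrative.
from collections import defaultdict

sales = [("Video", 2023, 120.0), ("Video", 2024, 150.0),
         ("Audio", 2023, 80.0),  ("Audio", 2024, 60.0)]

totals = defaultdict(float)
for line, year, amount in sales:
    totals[(line, year)] += amount           # group by (line, year)

for line in sorted({l for l, _, _ in sales}):
    this_y, last_y = totals[(line, 2024)], totals[(line, 2023)]
    print(f"{line}: {this_y:.0f} vs {last_y:.0f} ({this_y - last_y:+.0f})")
```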

* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

What Is OLAP?


Online Analytical Processing -- a term coined by
E.F. Codd in a 1994 paper contracted by
Arbor Software*


Generally synonymous with earlier terms such as
Decisions Support, Business Intelligence, Executive
Information System


OLAP = Multidimensional Database


MOLAP: Multidimensional OLAP (Arbor Essbase,
Oracle Express)


ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)

97

The OLAP Market


Rapid growth in the enterprise market


1995: $700 Million


1997: $2.1 Billion


Significant consolidation activity among
major DBMS vendors


10/94: Sybase acquires ExpressWay


7/95: Oracle acquires Express


11/95: Informix acquires Metacube


1/97: Arbor partners up with IBM


10/96: Microsoft acquires Panorama


Result: OLAP shifted from small vertical
niche to mainstream DBMS category

98

Strengths of OLAP


It is a powerful visualization paradigm


It provides fast, interactive response
times


It is good for analyzing time series


It can be useful to find some clusters and
outliers


Many vendors offer OLAP tools

99

Nigel Pendse, Richard Creeth -- The OLAP Report


OLAP Is FASMI


Fast


Analysis


Shared


Multidimensional


Information

100

Multi-dimensional Data

[Figure: a 3-D data cube -- Product (Toothpaste, Juice, Cola, Milk, Cream, Soap) by Region (W, S, N) by Time (months 1-7)]

Dimensions: Product, Region, Time

Hierarchical summarization paths:

Product: Industry → Category → Product

Region: Country → Region → City → Office

Time: Year → Quarter → Month → Week → Day


“Hey…I sold $100M worth of goods”

101

A Visual Operation: Pivot (Rotate)

[Figure: a Product-by-Date grid (Juice, Cola, Milk, Cream vs. 3/1-3/4) with cell values such as 10, 47, 30, 12, rotated so that rows and columns swap]

102
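The pivot (rotate) operation above just regroups flat (product, date, value) facts by swapping which field indexes the rows and which the columns. A minimal illustration with made-up facts:

```python
# Pivot (rotate): regroup flat (product, date, units) facts so that either
# field can serve as the row axis. Values are illustrative.
facts = [("Juice", "3/1", 10), ("Cola", "3/2", 47),
         ("Milk", "3/3", 30), ("Cream", "3/4", 12)]

def pivot(rows, row_field=0, col_field=1):
    table = {}
    for r in rows:
        table.setdefault(r[row_field], {})[r[col_field]] = r[2]
    return table

by_product = pivot(facts)                         # rows = products
by_date = pivot(facts, row_field=1, col_field=0)  # rotated: rows = dates
print(by_product["Juice"])  # {'3/1': 10}
print(by_date["3/2"])       # {'Cola': 47}
```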

“Slicing and Dicing”

[Figure: “The Telecomm Slice” -- a cube with dimensions Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Region (India, Far East, Europe)]

103

Roll-up and Drill Down

Sales Channel → Region → Country → State → Location Address → Sales Representative

Roll-up moves to a higher level of aggregation; drill down moves to low-level details.

Results of Data Mining Include:


Forecasting what may happen in the
future


Classifying people or things into
groups by recognizing patterns


Clustering people or things into
groups based on their attributes


Associating what events are likely to
occur together


Sequencing what events are likely to
lead to later events

Data mining is
not


Brute-force crunching of bulk data


“Blind” application of
algorithms


Going to find relationships
where none exist


Presenting data in different
ways


A database intensive task


A difficult to understand
technology requiring an
advanced degree in
computer science

Data Mining versus OLAP


OLAP -- On-line Analytical Processing


Provides you with a very good view of
what is happening, but cannot predict
what will happen in the future or why it
is happening

Data Mining Versus Statistical
Analysis


Data Mining


Originally developed to act
as expert systems to solve
problems


Less interested in the
mechanics of the
technique


If it makes sense then
let’s use it


Does not require
assumptions to be made
about data


Can find patterns in very
large amounts of data



Requires understanding
of data and business
problem


Data Analysis


Tests for statistical
correctness of models


Are statistical
assumptions of models
correct?


E.g. Is the R-Square good?


Hypothesis testing


Is the relationship
significant?


Use a t-test to validate significance


Tends to rely on sampling


Techniques are not
optimised for large
amounts of data


Requires strong statistical
skills

Examples of What People are
Doing with Data Mining:


Fraud/Non-Compliance Anomaly detection


Isolate the factors that
lead to fraud, waste and
abuse


Target auditing and
investigative efforts
more effectively


Credit/Risk Scoring


Intrusion detection


Parts failure prediction


Recruiting/Attracting
customers


Maximizing
profitability (cross
selling, identifying
profitable customers)


Service Delivery and
Customer Retention


Build profiles of
customers likely
to use which
services


Web Mining

What data mining has done for...

Scheduled its workforce

to provide faster, more accurate
answers to questions.

The US Internal Revenue Service

needed to improve customer
service and...

What data mining has done for...

analyzed suspects’ cell phone
usage to focus investigations.

The US Drug Enforcement
Agency needed to be more
effective in their drug “busts”
and

What data mining has done for...

Reduced direct mail costs by 30%
while garnering 95% of the
campaign’s revenue.

HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments and...

Suggestion: Predicting Washington

C-SPAN has launched a digital archive of 500,000 hours of audio debates.

Text mining or audio mining of these talks could reveal answers to certain questions such as….

Example Application: Sports

IBM Advanced Scout analyzes

NBA game statistics


Shots blocked


Assists


Fouls



Google: “IBM Advanced Scout”

Advanced Scout


Example pattern: An analysis of the

data from a game played between

the New York Knicks and the Charlotte

Hornets revealed that “
When Glenn Rice
played the shooting guard position, he
shot 5/6 (83%) on jump shots."



Pattern is interesting:

The average shooting percentage for the
Charlotte Hornets during that game was
54%.

Data Mining: Types of Data


Relational data and transactional data


Spatial and temporal data, spatio-temporal observations

Time-series data


Text


Images, video


Mixtures of data


Sequence data



Features from processing other data
sources

Data Mining Techniques


Supervised learning


Classification and regression


Unsupervised learning


Clustering


Dependency modeling


Associations, summarization, causality


Outlier and deviation detection


Trend analysis and change detection

Different Types of Classifiers


Linear discriminant analysis (LDA)


Quadratic discriminant analysis
(QDA)


Density estimation methods


Nearest neighbor methods


Logistic regression


Neural networks


Fuzzy set theory


Decision Trees

Test Sample Estimate


Divide D into D1 and D2

Use D1 to construct the classifier d

Then use the resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d

Unbiased and efficient, but removes D2 from the training dataset D

V-fold Cross Validation

Procedure:

Construct classifier d from D

Partition D into V datasets D1, …, DV

Construct classifier di using D \ Di

Calculate the estimated misclassification error R(di,Di) of di using test sample Di

Final misclassification estimate:

Weighted combination of individual misclassification errors:

R(d,D) = 1/V Σ R(di,Di)
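The V-fold procedure can be sketched as follows; the interleaved fold assignment and the toy sign classifier are illustrative choices:

```python
def cross_validation_error(data, labels, train_fn, v=5):
    """V-fold CV: train d_i on D \\ D_i, test on D_i, average the errors."""
    n = len(data)
    folds = [list(range(i, n, v)) for i in range(v)]   # interleaved partition
    total = 0.0
    for fold in folds:
        hold = set(fold)
        tr_x = [data[i] for i in range(n) if i not in hold]
        tr_y = [labels[i] for i in range(n) if i not in hold]
        d_i = train_fn(tr_x, tr_y)
        errs = sum(d_i(data[i]) != labels[i] for i in fold)
        total += errs / len(fold)
    return total / v   # R(d,D) = 1/V * sum of R(d_i, D_i)

# A trivially learnable problem: the label is the sign of the feature.
data = [-3, -2, -1, 1, 2, 3, -4, 4, -5, 5]
labels = [int(x > 0) for x in data]
train_fn = lambda xs, ys: (lambda x: int(x > 0))
est = cross_validation_error(data, labels, train_fn, v=5)
```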

Cross-Validation: Example

[Figure: classifier d trained on all of D, alongside fold classifiers d1, d2, d3]

Cross-Validation

Misclassification estimate obtained through cross-validation is usually nearly unbiased

Costly computation (we need to compute d, and d1, …, dV); computation of di is nearly as expensive as computation of d

Preferred method to estimate quality of learning algorithms in the machine learning literature

Decision Tree Construction


Three algorithmic components:


Split selection (CART, C4.5, QUEST,
CHAID, CRUISE, …)


Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)


Data access (CLOUDS, SLIQ, SPRINT,
RainForest, BOAT, UnPivot operator)

Goodness of a Split

Consider node t with impurity phi(t)

The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR)) is:

Δphi(s,t) = phi(t) - pL phi(tL) - pR phi(tR)
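The slides leave the impurity function phi abstract; with Gini impurity as one common concrete choice, the reduction can be computed as:

```python
from collections import Counter

def gini(labels):
    """Gini impurity phi(t) = 1 - sum over classes of p_j squared."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(left, right):
    """Delta phi(s,t) = phi(t) - p_L * phi(t_L) - p_R * phi(t_R)."""
    parent = left + right
    p_l = len(left) / len(parent)
    p_r = len(right) / len(parent)
    return gini(parent) - p_l * gini(left) - p_r * gini(right)

# A perfect split separates the two classes completely:
delta = impurity_reduction(["a", "a"], ["b", "b"])
```

Here the parent impurity is 0.5 and both children are pure, so the full 0.5 is recovered; split selection picks the predicate s maximizing this quantity.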

Pruning Methods


Test dataset pruning


Direct stopping rule


Cost-complexity pruning


MDL pruning


Pruning by randomization testing


Stopping Policies

A stopping policy indicates when further
growth of the tree at a node t is
counterproductive.


All records are of the same class


The attribute values of all records are
identical


All records have missing values


At most one class has a number of records larger than a user-specified number


All records go to the same child node if t
is split (only possible with some split
selection methods)

Test Dataset Pruning


Use an independent test sample D’
to estimate the misclassification cost
using the resubstitution estimate
R(T,D’) at each node


Select the subtree T’ of T with the
smallest expected cost

Missing Values


What is the problem?


During computation of the splitting
predicate, we can selectively ignore
records with missing values (note that
this has some problems)


But if a record r misses the value of the
variable in the splitting attribute, r can
not participate further in tree
construction

Algorithms for missing values address
this problem.

Mean and Mode Imputation

Assume record r has missing value
r.X, and splitting variable is X.


Simplest algorithm:


If X is numerical (categorical), impute
the overall mean (mode)


Improved algorithm:


If X is numerical (categorical), impute
the mean(X|t.C) (the mode(X|t.C))
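The simplest (overall mean/mode) variant can be sketched as follows; the improved variant would apply the same function to just the records of class C at node t:

```python
from collections import Counter

def impute(values):
    """Fill None entries with the mean (numeric) or mode (categorical)."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = sum(present) / len(present)            # overall mean
    else:
        fill = Counter(present).most_common(1)[0][0]  # overall mode
    return [fill if v is None else v for v in values]

nums = impute([1.0, None, 3.0])               # mean of {1, 3} is 2
cats = impute(["red", "red", None, "blue"])   # mode is "red"
```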

Decision Trees: Summary


Many applications of decision trees


There are many algorithms available for:


Split selection


Pruning


Handling Missing Values


Data Access


Decision tree construction still active
research area (after 20+ years!)


Challenges: Performance, scalability,
evolving datasets, new applications

Supervised vs. Unsupervised Learning

Supervised

y=F(x): true function

D: labeled training set

D: {xi, F(xi)}

Learn:

G(x): model trained to predict labels of D

Goal:

E[(F(x) - G(x))²] ≈ 0

Well defined criteria: Accuracy, RMSE, ...

Unsupervised


Generator: true model


D: unlabeled data
sample


D: {x
i
}


Learn

??????????


Goal:

??????????


Well defined criteria:

??????????

Clustering: Unsupervised Learning


Given:


Data Set D (training set)


Similarity/distance metric/information


Find:


Partitioning of data


Groups of similar/close items

Similarity?


Groups of similar customers


Similar demographics


Similar buying behavior


Similar health


Similar products


Similar cost


Similar function


Similar store





Similarity usually is domain/problem
specific

Clustering: Informal Problem
Definition

Input:


A data set of N records, each given as a d-dimensional data feature vector.

Output:


Determine a natural, useful “partitioning”
of the data set into a number of (k)
clusters and noise such that we have:


High similarity of records within each cluster (intra-cluster similarity)

Low similarity of records between clusters (inter-cluster similarity)


Types of Clustering


Hard Clustering:


Each object is in one and only one
cluster


Soft Clustering:


Each object has a probability of being
in each cluster


Clustering Algorithms


Partitioning-based clustering

K-means clustering

K-medoids clustering

EM (expectation maximization) clustering

Hierarchical clustering

Divisive clustering (top down)

Agglomerative clustering (bottom up)

Density-based methods

Regions of dense points separated by sparser regions of relatively low density

K-Means Clustering Algorithm

Initialize k cluster centers

Do

Assignment step: Assign each data point to its closest cluster center

Re-estimation step: Re-compute cluster centers

While (there are still changes in the cluster centers)

Visualization at:
http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html



Issues

Why does K-Means work:

How does it find the cluster centers?

Does it find an optimal clustering?


What are good starting points for the algorithm?


What is the right number of cluster centers?


How do we know it will terminate?
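The assign/re-estimate loop can be sketched as follows; seeding with the first k points is an arbitrary assumption, and choosing good starting points is exactly one of the issues listed above:

```python
def kmeans(points, k, iters=100):
    """Assign each point to its closest center, then re-estimate the centers."""
    centers = [list(points[i]) for i in range(k)]   # seed with the first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        new = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:   # no center moved: the loop has converged
            break
        centers = new
    return centers

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers = sorted(kmeans(pts, 2))
```

On this toy data the centers settle on the two visible groups after a couple of iterations.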


Agglomerative Clustering

Algorithm:


Put each item in its own cluster (all singletons)


Find all pairwise distances between clusters


Merge the two closest clusters


Repeat until everything is in one cluster


Observations:


Results in a hierarchical clustering


Yields a clustering for each possible number of
clusters


Greedy clustering: Result is not “optimal” for any
cluster size
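For one-dimensional points, the greedy merge loop looks like this (single-link distance is an assumed choice; complete link would take the max instead):

```python
def agglomerative(points):
    """Single-link agglomerative clustering; records each merge (a dendrogram)."""
    clusters = [[p] for p in points]   # start with all singletons
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerative([1.0, 1.1, 5.0])
```

Cutting the recorded merge sequence at any depth yields a clustering for that number of clusters, as the observations above note.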

Density-Based Clustering


A cluster is defined as a connected dense
component.


Density is defined in terms of number of
neighbors of a point.


We can find clusters of arbitrary shape


Market Basket Analysis


Consider shopping cart filled with
several items


Market basket analysis tries to
answer the following questions:


Who makes purchases?


What do customers buy together?


In what order do customers purchase
items?


Market Basket Analysis

Given:


A database of
customer
transactions


Each transaction is
a set of items



Example:

Transaction with
TID 111 contains
items {Pen, Ink,
Milk, Juice}

TID   CID   Date     Item    Qty
111   201   5/1/99   Pen     2
111   201   5/1/99   Ink     1
111   201   5/1/99   Milk    3
111   201   5/1/99   Juice   6
112   105   6/3/99   Pen     1
112   105   6/3/99   Ink     1
112   105   6/3/99   Milk    1
113   106   6/5/99   Pen     1
113   106   6/5/99   Milk    1
114   201   7/1/99   Pen     2
114   201   7/1/99   Ink     2
114   201   7/1/99   Juice   4


Market Basket Analysis (Contd.)


Co-occurrences


80% of all customers purchase items X,
Y and Z together.


Association rules


60% of all customers who purchase X
and Y also buy Z.


Sequential patterns


60% of customers who first buy X also
purchase Y within three weeks.

Confidence and Support

We prune the set of all possible
association rules using two
interestingness measures:


Confidence of a rule:

X --> Y has confidence c if P(Y|X) = c

Support of a rule:

X --> Y has support s if P(XY) = s

We can also define


Support of an itemset (a co-occurrence) XY:


XY has support s if P(XY) = s
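The two definitions translate directly into code; the toy transactions are assumptions for illustration:

```python
def support(itemset, transactions):
    """Support of itemset X: fraction of transactions containing X."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of X --> Y: P(Y|X) = support(X union Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

T = [{"milk", "bread"}, {"milk", "bread", "cheese"},
     {"milk", "juice"}, {"bread", "juice"}]
s = support({"milk", "bread"}, T)       # 2 of the 4 transactions
c = confidence({"milk"}, {"bread"}, T)  # 2 of the 3 milk transactions
```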

Market Basket Analysis:
Applications


Sample Applications


Direct marketing


Fraud detection for medical insurance


Floor/shelf planning


Web site layout


Cross-selling

Applications of Frequent Itemsets


Market Basket Analysis


Association Rules


Classification (especially: text, rare
classes)


Seeds for construction of Bayesian
Networks


Web log analysis


Collaborative filtering

Association Rule Algorithms


More abstract problem redux


Breadth-first search

Depth-first search

Problem Redux

Abstract:


A set of items {1,2,…,k}


A database of transactions (itemsets) D={T1, T2, …, Tn},

Tj subset {1,2,…,k}


GOAL:

Find all itemsets that appear in
at least x transactions


(“appear in” == “are subsets
of”)

I subset T: T supports I


For an itemset I, the number of
transactions it appears in is
called the
support

of I.

x is called the
minimum support
.

Concrete:


I = {milk, bread, cheese,
…}


D = {
{milk,bread,cheese},
{bread,cheese,juice}, …}


GOAL:

Find all itemsets that appear
in at least 1000
transactions


{milk,bread,cheese}
supports {milk,bread}


Problem Redux (Contd.)

Definitions:


An itemset is
frequent

if it
is a subset of at least x
transactions. (FI.)


An itemset is
maximally
frequent

if it is frequent
and it does not have a
frequent superset. (MFI.)


GOAL: Given x, find all
frequent (maximally
frequent) itemsets (to be
stored in the
FI (MFI)
).


Obvious relationship:

MFI subset FI

Example:

D={
{1,2,3}, {1,2,3},
{1,2,3}, {1,2,4}

}

Minimum support x = 3


{1,2}

is frequent

{1,2,3}

is maximal frequent

Support(
{1,2}
) = 4


All maximal frequent
itemsets:
{1,2,3}
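The example can be checked with a brute-force enumeration; real algorithms (Apriori and friends) exist precisely because this search is exponential in the number of items:

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Brute-force FI: every itemset contained in >= minsup transactions."""
    items = sorted(set().union(*transactions))
    fi = []
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            count = sum(set(cand) <= t for t in transactions)
            if count >= minsup:
                fi.append(set(cand))
    return fi

def maximal(fi):
    """MFI: frequent itemsets with no frequent proper superset."""
    return [s for s in fi if not any(s < t for t in fi)]

# The slide's example: minimum support x = 3.
D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
fi = frequent_itemsets(D, minsup=3)
mfi = maximal(fi)
```

This reproduces the slide: {1,2} is frequent (support 4), and {1,2,3} is the only maximal frequent itemset.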

Applications


Spatial association rules


Web mining


Market basket analysis


User/customer profiling


Extensions: Sequential Patterns

In the "Market Basket Analysis" replace Milk, Pen, etc. with names of medications and use the idea in a new hospital data mining proposal

The idea of swarm intelligence

Add to it the extra analysis of the induction rules in this set of slides.



Kraft Foods
:
Direct Marketing


Company maintains a large database of purchases by customers.


Data mining

1. Analysts identified associations among groups of products
bought by particular segments of customers.

2. Sent out 3 sets of coupons to various households.



Better response rates: 50% increase in sales for one of its products



Continue to use of this approach



Health Insurance Commission of Australia
:
Insurance Fraud


Commission maintains a database of insurance claims, including laboratory tests ordered during the diagnosis of patients.


Data mining

1. Identified the practice of "up coding" to reflect more
expensive tests than are necessary.

2. Now monitors orders for lab tests.


Commission expects to save US$1,000,000 / year by
eliminating the practice of "up coding”.




HNC Software
: Credit Card Fraud



Payment Fraud


Large issuers of cards may lose


$10 million / year due to fraud


Difficult to identify the few transactions among thousands which
reflect potential fraud


Falcon software


Mines data through neural networks


Introduced in September 1992



Models each cardholder's requested transaction against the customer's
past spending history.


processes several hundred requests per second


compares current transaction with customer's history


identifies the transactions most likely to be frauds


enables bank to stop high-risk transactions before they are authorized


Used by many retail banks: currently monitors



160 million card accounts for fraud


New Account Fraud



Fraudulent applications for credit cards are growing at 50 %
per year


Falcon Sentry software


Mines data through neural networks and a rule base


Introduced in September 1992


Checks information on applications against data from
credit bureaus


Allows card issuers to simultaneously:


increase the proportion of applications received


reduce the proportion of fraudulent applications
authorized


New Account Fraud


Quality Control





IBM Microelectronics
:
Quality Control



Analyzed manufacturing data on Dynamic Random Access Memory
(DRAM) chips.


Data mining

1. Built predictive models of



manufacturing yield (% non-defective)


effects of production parameters on chip performance.

2. Discovered critical factors behind



production yield &



product performance.

3. Created a new design for the chip


increased yield saved millions of dollars in direct
manufacturing costs


enhanced product performance by substantially lowering the
memory cycle time


B & L Stores


Belk and Leggett Stores =

one of the largest retail chains

280 stores in southeast U.S.

data warehouse contains 100s of gigabytes (billions of characters) of data



data mining to:


increase sales


reduce costs


Selected DSS Agent from MicroStrategy, Inc.


analyze merchandising (patterns of sales)


manage inventory

Retail Sales




DSS Agent



uses intelligent agents for data mining


provides multiple functions


recognizes sales patterns among stores


discovers sales patterns by


time of day


day of year


category of product


etc.



swiftly identifies trends & shifts in customer tastes



performs Market Basket Analysis (MBA)


analyzes Point-of-Sale or -Service (POS) data


identifies relationships among products and/or services purchased


E.g. A customer who buys Brand X slacks has a 35% chance of
buying Brand Y shirts.


Agent tool is also used by other Fortune 1000 firms


average ROI > 300 %


average payback in 1 ~ 2 years

Market Basket Analysis



Case Based Reasoning

(CBR)

[Figure] General scheme for a case-based reasoning (CBR) model: the target case is matched against similar precedents in the historical database, such as cases A and B.




Case Based Reasoning (CBR)




Learning through the accumulation of experience



Key issues


Indexing:

storing cases for quick, effective access of precedents


Retrieval:

accessing the appropriate precedent cases



Advantages


Explicit knowledge form recognizable to humans


No need to re-code knowledge for computer processing



Limitations


Retrieving precedents based on superficial features

E.g. Matching Indonesia with U.S. because both have similar population size


Traditional approach ignores the issue of generalizing knowledge


Genetic Algorithm


Generation of candidate solutions using the procedures of biological
evolution.


Procedure



0. Initialize.

Create a population of potential solutions ("organisms").

1. Evaluate.

Determine the level of "fitness" for each solution.

2. Cull.

Discard the poor solutions.

3. Breed.

a. Select 2 "fit" solutions to serve as parents.

b. From the 2 parents, generate offspring.

* Crossover: Cut the parents at random and switch the 2 halves.

* Mutation: Randomly change the value in a parent solution.

4. Repeat.

Go back to Step 1 above.
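The five steps can be sketched for the "OneMax" toy problem; all parameters (population size, mutation rate, the fitness function itself) are illustrative assumptions:

```python
import random

def genetic_search(fitness, length=10, pop_size=20, generations=60, seed=1):
    """Initialize -> evaluate -> cull -> breed (crossover + mutation) -> repeat."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)        # 1. evaluate
        pop = pop[:pop_size // 2]                  # 2. cull the poor solutions
        while len(pop) < pop_size:                 # 3. breed
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)
            cut = rng.randrange(1, length)         # crossover: cut and switch halves
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:                 # mutation: flip a random bit
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            pop.append(child)                      # 4. repeat
    return max(pop, key=fitness)

# OneMax toy problem: fitness of a bit string is its number of 1 bits.
best = genetic_search(sum)
```

Because the cull keeps the best half each generation, the best fitness never decreases; on OneMax the population drifts toward the all-ones string.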



Genetic Algorithm (Cont.)


Advantages


Applicable to a wide range of problem domains.


Robustness: can obtain solutions even when the performance function is highly irregular or input data are noisy.


Implicit parallelism: can search in many directions concurrently.



Limitations


Slow, like neural networks.

But: computation can be distributed


over multiple processors


(unlike neural networks)




Source
: www.pathology.washington.edu


Multistrategy Learning


Every technique has advantages & limitations



Multistrategy approach


Take advantage of the strengths of diverse techniques


Circumvent the limitations of each methodology

Types of Models


Prediction Models for
Predicting and Classifying


Regression algorithms
(predict numeric
outcome):
neural
networks
, rule induction,
CART (OLS regression,
GLM)


Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)


Descriptive Models for
Grouping and Finding
Associations


Clustering/Grouping algorithms: K-means, Kohonen


Association algorithms:
apriori
, GRI




Neural Networks


Description


Difficult interpretation


Tends to ‘overfit’ the data


Extensive amount of training time


A lot of data preparation


Works with all data types

Rule Induction

Description


Intuitive output


Handles all forms of numeric data, as well as non-numeric (symbolic) data


C5 Algorithm

a special case of rule
induction


Target variable must be symbolic

Apriori

Description


Seeks
association rules
in dataset


‘Market basket’ analysis


Sequence discovery

Data Mining Is


The automated process of finding
relationships and patterns in stored
data



It is different from the use of SQL
queries and other business
intelligence tools




Data Mining Is


Motivated by business need, large
amounts of available data, and
humans’ limited cognitive processing
abilities


Enabled by data warehousing,
parallel processing, and data mining
algorithms

Common Types of Information
from Data Mining


Associations -- identifies occurrences that are linked to a single event

Sequences -- identifies events that are linked over time

Classification -- recognizes patterns that describe the group to which an item belongs

Common Types of Information
from Data Mining


Clustering -- discovers different groupings within the data

Forecasting -- estimates future values


Commonly Used Data Mining
Techniques


Artificial neural networks


Decision trees


Genetic algorithms


Nearest neighbor method


Rule induction

The Current State of Data Mining
Tools


Many of the vendors are small companies


IBM and SAS have been in the market for
some time, and more “biggies” are
moving into this market


BI tools and RDMS products are
increasingly including basic data mining
capabilities


Packaged data mining applications are
becoming common


The Data Mining Process


Requires personnel with domain,
data warehousing, and data mining
expertise


Requires data selection, data
extraction, data cleansing, and data
transformation


Most data mining tools work with
highly granular flat files


Is an iterative and interactive
process

Why Data Mining


Credit ratings/targeted marketing
:


Given a database of 100,000 names, which persons are
the least likely to default on their credit cards?


Identify likely responders to sales promotions


Fraud detection


Which types of transactions are likely to be fraudulent,
given the demographics and transactional history of a
particular customer?



Customer relationship management
:


Which of my customers are likely to be the most loyal,
and which are most likely to leave for a competitor?


Data Mining helps extract such
information

Applications


Banking: loan/credit card approval


predict good customers based on old customers


Customer relationship management:


identify those who are likely to leave for a competitor.


Targeted marketing:


identify likely responders to promotions


Fraud detection: telecommunications,
financial transactions


from an online stream of events identify fraudulent events


Manufacturing and production:


automatically adjust knobs when process parameter
changes



Applications (continued)


Medicine: disease outcome, effectiveness
of treatments


analyze patient disease history: find
relationship between diseases


Molecular/Pharmaceutical: identify new
drugs


Scientific data analysis:


identify new galaxies by searching for sub-clusters


Web site/store design and promotion:


find affinity of visitor to pages and modify
layout

The KDD process


Problem formulation


Data collection


subset data: sampling might hurt if highly skewed data


feature selection: principal component analysis,
heuristic search


Pre-processing: cleaning


name/address cleaning, different meanings (annual,
yearly), duplicate removal, supplying missing values


Transformation:


map complex objects e.g. time series data to features
e.g. frequency


Choosing mining task and mining method:


Result evaluation and Visualization:


Knowledge discovery is an iterative process

Relationship with other fields


Overlaps with machine learning, statistics,
artificial intelligence, databases,
visualization but more stress on


scalability of number of features and instances


stress on algorithms and architectures
whereas foundations of methods and
formulations provided by statistics and
machine learning.


automation for handling large, heterogeneous
data


Some basic operations


Predictive:


Regression


Classification


Collaborative Filtering


Descriptive:


Clustering / similarity matching


Association rules and variants


Deviation detection

Classification


Given old data about customers and
payments, predict new applicant’s
loan eligibility.

[Figure: previous customers, described by Age, Salary, Profession, Location, and Customer type, train a classifier; decision rules such as "Salary > 5 L" and "Prof. = Exec" label a new applicant's data as good/bad]

Classification methods


Goal:
Predict class Ci = f(x1, x2, …, xn)


Regression: (linear or any other
polynomial)


a*x1 + b*x2 + c = Ci.


Nearest neighbour


Decision tree classifier: divide decision
space into piecewise constant regions.


Probabilistic/generative models


Neural networks: partition by non-linear boundaries


Define proximity between instances,
find neighbors of new instance and
assign majority class


Case based reasoning: when attributes are more complicated than real-valued.
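For a single real-valued attribute, the proximity-plus-majority-vote scheme is only a few lines; the training pairs are made up for illustration:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training instances."""
    neighbors = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [(1.0, "low"), (1.2, "low"), (0.8, "low"),
         (9.0, "high"), (9.5, "high"), (10.0, "high")]
label = knn_predict(train, 1.1)
```

Note the cons listed below apply here too: every prediction scans the whole training set, and the distance treats all attributes alike.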




Nearest neighbor



Cons



Slow during application.



No feature selection.



Notion of proximity vague





Pros

+

Fast training



Clustering


Unsupervised learning when old data with
class labels not available e.g. when
introducing a new product.


Group/cluster existing customers based on
time series of payment history such that
similar customers in same cluster.


Key requirement: Need a good measure of
similarity between instances.


Identify micro-markets and develop policies for each


Applications


Customer segmentation e.g. for targeted
marketing


Group/cluster existing customers based on
time series of payment history such that
similar customers in same cluster.


Identify micro-markets and develop policies for each


Collaborative filtering:


group based on common items purchased


Text clustering


Compression

Distance functions


Numeric data: Euclidean, Manhattan distances


Categorical data: 0/1 to indicate
presence/absence followed by


Hamming distance (# dissimilarity)


Jaccard coefficient: # of matching 1s / (# of positions with a 1)

data-dependent measures: similarity of A and B depends on co-occurrence with C.


Combined numeric and categorical data:


weighted normalized distance:
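For 0/1 presence/absence vectors, the two categorical measures above can be written as follows (reading the Jaccard ratio as shared 1s over positions where either vector has a 1):

```python
def hamming(a, b):
    """Hamming distance: number of positions where the 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Jaccard similarity: shared 1s / positions where either vector has a 1."""
    both = sum(x and y for x, y in zip(a, b))
    either = sum(x or y for x, y in zip(a, b))
    return both / either if either else 1.0

a = [1, 0, 1, 1]   # presence/absence encoding of categorical attributes
b = [1, 1, 0, 1]
h = hamming(a, b)  # the vectors differ in two positions
j = jaccard(a, b)  # 2 shared 1s out of 4 positions with any 1
```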

Clustering methods


Hierarchical

clustering


agglomerative vs. divisive

single link vs. complete link


Partitional

clustering


distance-based: K-means

model-based: EM

density-based


Partitional methods: K-means


Criteria: minimize sum of square of
distance



Between each point and centroid of the
cluster.


Between each pair of points in the
cluster


Algorithm:


Select initial partition with K clusters:
random, first K, K separated points


Repeat until stabilization:


Assign each point to closest cluster
center


Generate new cluster centers


Adjust clusters by merging/splitting

Collaborative Filtering


Given database of user preferences,
predict preference of new user


Example: predict what new movies you will
like based on


your past preferences


others with similar past preferences


their preferences for the new movies


Example: predict what books/CDs a person
may want to buy



(and suggest it, or give discounts to
tempt customer)
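A tiny user-based sketch of the idea; the agreement-count similarity and the toy ratings are illustrative assumptions (real systems use correlation- or cosine-based similarity):

```python
def predict_rating(ratings, user, item):
    """Predict user's rating of item as a similarity-weighted average over
    other users who rated it; similarity = count of co-rated items on which
    the two users gave the same rating."""
    num = den = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        shared = set(prefs) & set(ratings[user]) - {item}
        sim = sum(1 for i in shared if prefs[i] == ratings[user][i])
        num += sim * prefs[item]
        den += sim
    return num / den if den else None

ratings = {
    "ann": {"matrix": 5, "titanic": 1},
    "bob": {"matrix": 5, "titanic": 1, "alien": 5},
    "cat": {"matrix": 1, "titanic": 5, "alien": 2},
}
guess = predict_rating(ratings, "ann", "alien")
```

Ann agrees with Bob on every shared movie and with Cat on none, so Bob's rating of the new movie dominates the prediction.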

Association rules


Given set T of groups of items


Example: set of item sets
purchased


Goal: find all rules on itemsets of the form a --> b such that

support of a and b > user threshold s

conditional probability (confidence) of b given a > user threshold c


Example: Milk --> bread

Purchase of product A --> service B

T: {Milk, cereal}, {Tea, milk}, {Tea, rice, bread}, {cereal}

Prevalent vs. Interesting

Analysts already know about prevalent rules

Interesting rules are those that deviate from prior expectation

Mining's payoff is in finding surprising phenomena

[Cartoon: in 1995 "Milk and cereal sell together!" is a surprise; by 1998 the same finding draws "Zzzz..."]

Applications of fast itemset
counting

Find correlated events:


Applications in medicine: find
redundant tests


Cross selling in retail, banking


Improve predictive capability of
classifiers that assume attribute
independence



New similarity measures of
categorical attributes [
Mannila et al,
KDD 98
]

Application Areas

Industry

Application

Finance

Credit Card Analysis

Insurance

Claims, Fraud Analysis

Telecommunication

Call record analysis

Transport

Logistics management

Consumer goods

promotion analysis

Data Service providers

Value added data

Utilities

Power usage analysis

Usage scenarios


Data warehouse mining:


assimilate data from operational sources


mine static data


Mining log data


Continuous mining: example in process
control


Stages in mining:



data selection


pre-processing: cleaning


transformation


mining


result evaluation


visualization

Mining market


Around 20 to 30 mining tool vendors


Major tool players:


Clementine,


IBM’s Intelligent Miner,


SGI’s MineSet,


SAS’s Enterprise Miner.


All pretty much the same set of tools


Many embedded products:


fraud detection:


electronic commerce applications,


health care,


customer relationship management: Epiphany

Vertical integration:


Mining on the web


Web log analysis for site design:



what are popular pages,


what links are hard to find.


Electronic stores sales enhancements:


recommendations, advertisement:


Collaborative filtering
:
Net perception,
Wisewire


Inventory control: what was a shopper
looking for and could not find..

State of the art in mining: OLAP integration


Decision trees [
Information discovery,
Cognos]


find factors influencing high profits


Clustering

[Pilot software]