Data Mining Tools
Overview & Tutorial
Ahmed Sameh
Prince Sultan University
Department of Computer Science &
Info Sys
May 2010
(Some slides belong to IBM)
Introduction Outline
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
Goal:
Provide an overview of data mining.
Introduction
Data is growing at a phenomenal
rate
Users expect more sophisticated
information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
Finding hidden information in a
database
Fit data to a model
Similar terms
Exploratory data analysis
Data driven discovery
Deductive learning
Data Mining Algorithm
Objective: Fit Data to a Model
Descriptive
Predictive
Preference: technique to choose the best model
Search: technique to search the data (the "query")
Database Processing vs. Data
Mining Processing
Database query: well defined; expressed in SQL
Data mining query: poorly defined; no precise query language
Database data: operational data
Data mining data: not operational data
Database output: precise; a subset of the database
Data mining output: fuzzy; not a subset of the database
Query Examples
Database queries:
Find all customers who have purchased milk.
Find all credit applicants with last name of Smith.
Identify customers who have purchased more than $10,000 in the last month.
Data mining queries:
Find all items which are frequently purchased with milk. (association rules)
Find all credit applicants who are poor credit risks. (classification)
Identify customers with similar buying habits. (clustering)
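The contrast between the two query styles can be sketched in a few lines; the transaction set below is invented purely for illustration.

```python
# A database-style query returns an exact subset of the data; a
# data-mining-style query discovers a pattern (here, co-occurrence
# counts) that no single record contains.
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "butter"},
    {"bread", "butter"},
]

# Database-style query: precise, a subset of the database.
milk_baskets = [t for t in transactions if "milk" in t]

# Mining-style query: which items are frequently purchased with milk?
co_counts = Counter()
for basket in milk_baskets:
    co_counts.update(basket - {"milk"})

print(len(milk_baskets))          # 3 baskets contain milk
print(sorted(co_counts.items()))  # [('bread', 2), ('butter', 2)]
```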
Related Fields
Statistics
Machine
Learning
Databases
Visualization
Data Mining and
Knowledge Discovery
Statistics, Machine Learning
and Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning:
more heuristic
focused on improving the performance of a learning agent
also looks at real-time learning and robotics (areas not part of data mining)
Data Mining and Knowledge Discovery:
integrates theory and heuristics
focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
The distinctions are fuzzy.
Definition
A class of database applications that analyze data in a database using tools which look for trends or anomalies.
Data mining was popularized by IBM.
Purpose
To look for hidden patterns or previously
unknown relationships among the data in a
group of data that can be used to predict future
behavior.
Ex: Data mining software can help retail
companies find customers with common
interests.
Background Information
Many of the techniques used by today's data
mining tools have been around for many years,
having originated in the artificial intelligence
research of the 1980s and early 1990s.
Data mining tools are only now being applied to large-scale database systems.
The Need for Data Mining
The amount of raw data stored in corporate
data warehouses is growing rapidly.
There is too much data and complexity that
might be relevant to a specific problem.
Data mining promises to bridge the analytical gap by giving knowledge workers the tools to navigate this complex analytical space.
The Need for Data Mining, cont’
The need for information has resulted in the proliferation of data warehouses that integrate information from multiple sources to support decision making.
Often include data from external sources, such
as customer demographics and household
information.
Definition (Cont.)
Data mining is the exploration and analysis of large quantities
of data in order to discover
valid, novel, potentially useful,
and ultimately understandable patterns in data.
Valid: the patterns hold in general.
Novel: we did not know the pattern beforehand.
Useful: we can devise actions from the patterns.
Understandable: we can interpret and comprehend the patterns.
Of “laws”, Monsters, and Giants…
Moore’s law: processing “capacity” doubles
every 18 months :
CPU, cache, memory
Its more aggressive cousin: disk storage "capacity" doubles every 9 months.
[Chart: disk TB shipped per year, 1988-2000, on a log scale from 1E+3 to 1E+7; disk growth 112%/y vs. Moore's Law 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
What do the two
“laws” combined
produce?
A rapidly growing
gap between our
ability to generate
data, and our ability
to make use of it.
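To make the gap concrete, a quick sketch compounds the two quoted annual growth rates; the rates are the slide's figures, while the ten-year horizon is an arbitrary choice for illustration.

```python
# Compound the two quoted growth rates over a decade: processing
# capacity at 58.7%/y vs. disk capacity at 112%/y.
cpu_rate, disk_rate = 0.587, 1.12
years = 10

cpu_growth = (1 + cpu_rate) ** years    # roughly a 100x increase
disk_growth = (1 + disk_rate) ** years  # roughly an 1800x increase
gap = disk_growth / cpu_growth          # data outgrows processing ~18x

print(round(cpu_growth), round(disk_growth), round(gap))
```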
What is Data Mining?
Finding interesting structure in
data
Structure:
refers to statistical patterns,
predictive models, hidden relationships
Examples of tasks addressed by Data Mining
Predictive Modeling (classification,
regression)
Segmentation (Data Clustering )
Summarization
Visualization
Major Application Areas for
Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
Data Mining
The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets.
Extremely large datasets
Discovery of the non-obvious
Useful knowledge that can improve processes
Cannot be done manually
Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.
Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.
Data Mining (cont.)
Data Mining is a step of the Knowledge Discovery in Databases (KDD) process:
Data Warehousing
Data Selection
Data Preprocessing
Data Transformation
Data Mining
Interpretation/Evaluation
Data mining is sometimes referred to as KDD, and DM and KDD tend to be used as synonyms.
Data Mining Evaluation
Data Mining is Not …
Data warehousing
SQL / Ad Hoc Queries / Reporting
Software Agents
Online Analytical Processing (OLAP)
Data Visualization
Data Mining Motivation
Changes in the Business Environment
Customers becoming more demanding
Markets are saturated
Databases today are huge:
More than 1,000,000 entities/records/rows
From 10 to 10,000 fields/attributes/variables
Gigabytes and terabytes
Databases are growing at an unprecedented rate
Decisions must be made rapidly
Decisions must be made with maximum
knowledge
Why Use Data Mining Today?
Human analysis skills are inadequate:
Volume and dimensionality of the data
High data growth rate
Availability of:
Data
Storage
Computational power
Off-the-shelf software
Expertise
An Abundance of Data
Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
ATMs
Demographic data
Sensor networks
Cameras
Web server logs
Customer web site trails
Evolution of Database Technology
1960s: IMS, network model
1970s: The relational data model, first relational
DBMS implementations
1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
2000s: High availability, zero-administration, seamless integration into business processes
2010: Sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???
Much Commercial Support
Many data mining tools
http://www.kdnuggets.com/software
Database systems with data mining
support
Visualization tools
Data mining process support
Consultants
Why Use Data Mining Today?
Competitive pressure!
“The secret of success is to know something that
nobody else knows.”
Aristotle Onassis
Competition on service, not only on price (Banks,
phone companies, hotel chains, rental car
companies)
Personalization, CRM
The real-time enterprise
“Systemic listening”
Security, homeland defense
The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes
Data Mining Step in Detail
2.1 Data preprocessing
Data selection: Identify target
datasets and relevant fields
Data cleaning
Remove noise and outliers
Data transformation
Create common units
Generate new fields
2.2 Data mining model construction
2.3 Model evaluation
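The preprocessing sub-steps in 2.1 can be sketched as follows; the records, field names, and the plausibility threshold are all invented for illustration.

```python
# Data selection, cleaning, and transformation on a tiny invented dataset.
raw = [
    {"id": 1, "height_cm": 170, "weight_kg": 65},
    {"id": 2, "height_cm": 185, "weight_kg": 90},
    {"id": 3, "height_cm": 9999, "weight_kg": 70},  # noisy outlier
    {"id": 4, "height_cm": 160, "weight_kg": 55},
]

# Data selection: keep only the fields relevant to the mining task.
selected = [{"height_cm": r["height_cm"], "weight_kg": r["weight_kg"]}
            for r in raw]

# Data cleaning: remove records with implausible (outlier) values.
cleaned = [r for r in selected if r["height_cm"] < 250]

# Data transformation: create common units and generate a new field.
transformed = [
    {"height_m": r["height_cm"] / 100,
     "bmi": r["weight_kg"] / (r["height_cm"] / 100) ** 2}
    for r in cleaned
]
print(len(transformed))  # 3 records survive cleaning
```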
Preprocessing and Mining
Original Data -> (Data Integration and Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Model Construction) -> Patterns -> (Interpretation) -> Knowledge
Data Mining Techniques
Descriptive: Clustering, Association, Sequential Analysis
Predictive: Classification, Regression, Decision Trees, Rule Induction, Neural Networks, Nearest Neighbor Classification
Data Mining Models and Tasks
Basic Data Mining Tasks
Classification: maps data into predefined groups or classes (supervised learning, pattern recognition, prediction)
Regression: maps a data item to a real-valued prediction variable
Clustering: groups similar data together into clusters (unsupervised learning, segmentation, partitioning)
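As a minimal sketch of the classification task (supervised learning over predefined groups), here is a nearest-centroid rule; the one-dimensional data and the two class names are invented for illustration.

```python
# "Train" by computing one centroid per predefined class, then classify
# new points by the nearest centroid.
train = [(1.0, "low"), (1.5, "low"), (8.0, "high"), (9.0, "high")]

centroids = {}
for label in {"low", "high"}:
    values = [x for x, y in train if y == label]
    centroids[label] = sum(values) / len(values)

def classify(x):
    # Assign x to the class whose centroid is nearest.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

print(classify(2.0))  # "low"
print(classify(7.5))  # "high"
```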
Basic Data Mining Tasks (cont’d)
Summarization: maps data into subsets with associated simple descriptions (characterization, generalization)
Link Analysis: uncovers relationships among data (affinity analysis, association rules)
Sequential Analysis: determines sequential patterns
Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior
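Two of the tasks above, predicting future values and finding similar patterns over time, can be sketched with a moving average and a sliding-window distance; the price series and window length are invented for illustration.

```python
# Simple time-series analysis: moving-average prediction and
# most-similar-window search.
prices = [10, 11, 12, 11, 13, 14, 13, 15, 16, 15]
k = 3

# Predict the next value as the mean of the last k observations.
prediction = sum(prices[-k:]) / k

# Find the earlier window most similar to the latest one
# (squared Euclidean distance).
latest = prices[-k:]
def dist(window):
    return sum((a - b) ** 2 for a, b in zip(window, latest))

windows = [prices[i:i + k] for i in range(len(prices) - k)]
most_similar = min(windows, key=dist)

print(prediction)
print(most_similar)
```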
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.
Data Mining Development
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Bayes Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Neural Networks
Decision Tree Algorithms
Algorithm Design Techniques
Algorithm Analysis
Data Structures
Relational Data Model
SQL
Association Rule Algorithms
Data Warehousing
Scalability Techniques
KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
KDD Issues (cont’d)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
Visualization Techniques
Graphical
Geometric
Icon-based
Pixel-based
Hierarchical
Hybrid
Data Mining Applications
Data Mining Applications:
Retail
Performing basket analysis
Which items customers tend to purchase together. This
knowledge can improve stocking, store layout
strategies, and promotions.
Sales forecasting
Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
Database marketing
Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.
Merchandise planning and allocation
When retailers add new stores, they can improve
merchandise planning and allocation by examining
patterns in stores with similar demographic
characteristics. Retailers can also use data mining to
determine the ideal layout for a specific store.
Data Mining Applications:
Banking
Card marketing
By identifying customer segments, card issuers and
acquirers can improve profitability with more effective
acquisition and retention programs, targeted product
development, and customized pricing.
Cardholder pricing and profitability
Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
Fraud detection
Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.
Predictive life-cycle management
DM helps banks predict each customer's lifetime value and to service each segment appropriately (for example, offering special deals and discounts).
Data Mining Applications:
Telecommunication
Call detail record analysis
Telecommunication companies accumulate detailed
call records. By identifying customer segments with
similar use patterns, the companies can develop
attractive pricing and feature promotions.
Customer loyalty
Some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives offered by competing companies. Companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling them to target their spending on customers who will produce the most profit.
Data Mining Applications:
Other Applications
Customer segmentation
All industries can take advantage of DM to discover
discrete segments in their customer bases by
considering additional variables beyond traditional
analysis.
Manufacturing
Through choice boards, manufacturers are beginning to
customize products for customers; therefore they must
be able to predict which features should be bundled to
meet customer demand.
Warranties
Manufacturers need to predict the number of customers
who will submit warranty claims and the average cost of
those claims.
Frequent flier incentives
Airlines can identify groups of customers that can be
given incentives to fly more.
A producer wants to know:
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
Data, Data everywhere, yet ...
I can't find the data I need: data is scattered over the network; many versions, subtle differences
I can't get the data I need: need an expert to get the data
I can't understand the data I found: available data is poorly documented
I can't use the data I found: results are unexpected; data needs to be transformed from one form to another
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]
What are the users saying...
Data should be integrated
across the enterprise
Summary data has a real
value to the organization
Historical data holds the
key to understanding data
over time
What-if capabilities are required
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [Forrester Research, April 1996]
Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart -- 24 terabytes
Petabytes -- 10^15 bytes: geographic information systems
Exabytes -- 10^18 bytes: national medical records
Zettabytes -- 10^21 bytes: weather images
Yottabytes -- 10^24 bytes: intelligence agency videos
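The scales above, written as powers of ten (using the standard SI prefix "yotta" for 10^24); the Walmart figure is the one quoted on the slide.

```python
# Byte scales as powers of ten, and the quoted Walmart warehouse size.
scales = {
    "terabyte": 10 ** 12,
    "petabyte": 10 ** 15,
    "exabyte": 10 ** 18,
    "zettabyte": 10 ** 21,
    "yottabyte": 10 ** 24,
}

walmart_bytes = 24 * scales["terabyte"]
print(walmart_bytes / scales["petabyte"])  # 24 TB is 0.024 PB
```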
Data Warehousing -- It is a Process
Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible.
A decision support database maintained separately from the organization's operational database.
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
Data Warehousing Concepts
Decision support is key for companies wanting
to turn their organizational data into an
information asset
Traditional databases are transaction-oriented, while a data warehouse is data-retrieval optimized for decision support.
Data Warehouse
"A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"
OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications
What does a data warehouse do?
Integrates diverse information from various systems, enabling users to quickly produce powerful ad-hoc queries and perform complex analysis
Creates an infrastructure for reusing the data in numerous ways
Creates an open systems environment to make useful information easily accessible to authorized users
Helps managers make informed decisions
Benefits of Data Warehousing
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers
Comparison of OLTP and Data Warehousing
OLTP systems | Data warehousing systems
Holds current data | Holds historic data
Stores detailed data | Stores detailed, lightly and highly summarized data
Data is dynamic | Data is largely static
Repetitive processing | Ad hoc, unstructured, and heuristic processing
High level of transaction throughput | Medium to low transaction throughput
Predictable pattern of usage | Unpredictable pattern of usage
Transaction driven | Analysis driven
Application oriented | Subject oriented
Supports day-to-day decisions | Supports strategic decisions
Serves large number of clerical/operational users | Serves relatively lower number of managerial users
Data Warehouse Architecture
Operational Data
Load Manager
Warehouse Manager
Query Manager
Detailed Data
Lightly and Highly Summarized Data
Archive / Backup Data
Meta-Data
End-user Access Tools
End-user Access Tools
Reporting and query tools
Application development tools
Executive Information System (EIS) tools
Online Analytical Processing (OLAP) tools
Data mining tools
Data Warehousing Tools and Technologies
Extraction, Cleansing, and Transformation
Tools
Data Warehouse DBMS
Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Networked data warehouse
Warehouse administration
Integrated dimensional tools
Advanced query functionality
Data Marts
A subset of the data warehouse that supports the requirements of a particular department or business function
Online Analytical Processing (OLAP)
OLAP
The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
Multi-dimensional OLAP: cubes of data over dimensions such as Time, City, and Product type
Problems of Data Warehousing
Underestimation of resources for
data loading
Hidden problem with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration
Codd's Rules for OLAP
Multi-dimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client-server architecture
Generic dimensionality
Dynamic sparse matrix handling
Multi-user support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited dimensions and aggregation levels
OLAP Tools
Multi-dimensional OLAP (MOLAP)
Multi-dimensional DBMS (MDDBMS)
Relational OLAP (ROLAP)
Creation of multiple multi-dimensional views of the two-dimensional relations
Managed Query Environment (MQE)
Delivers selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally
Data Mining
Definition
The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions
Knowledge discovery
Association rules
Sequential patterns
Classification trees
Goals
Prediction
Identification
Classification
Optimization
Data Mining Techniques
Predictive Modeling
Supervised training with two phases
Training phase: building a model using a large sample of historical data called the training set
Testing phase: trying the model on new data
Database Segmentation
Link Analysis
Deviation Detection
What are Data Mining Tasks?
Classification
Regression
Clustering
Summarization
Dependency modeling
Change and Deviation Detection
What are Data Mining Discoveries?
New Purchase Trends
Plan Investment Strategies
Detect Unauthorized Expenditure
Fraudulent Activities
Crime Trends
Smugglers (border crossing)
Data Warehouse Architecture
[Diagram: relational databases, legacy data, purchased data, and ERP systems feed an optimized loader (extraction, cleansing) into the data warehouse engine; a metadata repository sits alongside, and analyze/query tools sit on top.]
Data Warehouse for Decision
Support & OLAP
Putting Information technology to help the
knowledge worker make faster and better
decisions
Which of my customers are most likely to go
to the competition?
What product promotions have the biggest
impact on revenue?
How did the share price of software companies correlate with profits over the last 10 years?
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgements
Data Mining works with Warehouse
Data
Data Warehousing
provides the Enterprise
with a memory
Data Mining provides
the Enterprise with
intelligence
We want to know ...
Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent
given the demographics and transactional history of a
particular customer?
If I raise the price of my product by Rs. 2, what is the
effect on my ROI?
If I offer only 2,500 airline miles as an incentive to
purchase rather than 5,000, how many lost responses will
result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Application Areas
Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value-added data
Utilities | Power usage analysis
Data Mining in Use
The US Government uses Data Mining to
track fraud
A Supermarket becomes an information
broker
Basketball teams use it to track game
strategy
Cross Selling
Warranty Claims Routing
Holding on to Good Customers
Weeding out Bad Customers
What makes data mining possible?
Advances in the following areas are
making data mining deployable:
data warehousing
better and more data (i.e., operational,
behavioral, and demographic)
the emergence of easily deployed data
mining tools and
the advent of new data mining
techniques.
-- Gartner Group
Why Separate Data Warehouse?
Performance
Operational databases are designed and tuned for known transactions and workloads; complex OLAP queries would degrade performance for operational transactions.
Special data organization, access and implementation methods are needed for multidimensional views and queries.
Function
Missing data: decision support requires historical data, which operational databases do not typically maintain.
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
Data quality: different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
What are Operational Systems?
They are OLTP systems
Run mission critical
applications
Need to work with
stringent performance
requirements for
routine tasks
Used to run a
business!
RDBMS used for OLTP
Database Systems have been used
traditionally for OLTP
clerical data processing tasks
detailed, up to date data
structured repetitive tasks
read/update a few records
isolation, recovery and integrity are
critical
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers and products -- clerks, salespeople, etc.
They are increasingly used by customers
Examples of Operational Data
Data | Industry | Usage | Technology | Volumes
Customer file | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
Call record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
Production record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium
Application-Orientation vs. Subject-Orientation
Application-orientation (operational database): Loans, Credit Card, Trust, Savings
Subject-orientation (data warehouse): Customer, Vendor, Product, Activity
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads, while workload is not known a priori in a data warehouse.
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., the average amount spent on phone calls between 9AM-5PM in Pune during the month of December.
OLTP vs Data Warehouse
OLTP | Warehouse (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up-to-date | Snapshot data
Isolated data | Integrated data
Repetitive access | Ad-hoc access
Clerical user | Knowledge user (manager)
OLTP vs Data Warehouse
OLTP | Data Warehouse
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100MB-100GB | Database size 100GB-a few terabytes
OLTP vs Data Warehouse
OLTP | Data Warehouse
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets
To summarize ...
OLTP systems are used to "run" a business; the data warehouse helps to "optimize" the business.
Why Now?
Data is being produced
ERP provides clean data
The computing power is available
The computing power is affordable
The competitive pressures are
strong
Commercial products are available
Myths surrounding OLAP Servers
and Data Marts
Data marts and OLAP servers are departmental
solutions supporting a handful of users
Million dollar massively parallel hardware is
needed to deliver fast time for complex queries
OLAP servers require massive and unwieldy
indices
Complex OLAP queries clog the network with
data
Data warehouses must be at least 100 GB to be
effective
-- Source: Arbor Software home page
II. On-Line Analytical Processing (OLAP)
Making Decision Support Possible
Typical OLAP Queries
Write a multi-table join to compare sales for each product line YTD this year vs. last year.
Repeat the above process to find the top 5
product contributors to margin.
Repeat the above process to find the sales of a
product line to new vs. existing customers.
Repeat the above process to find the customers
that have had negative sales growth.
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
What Is OLAP?
Online Analytical Processing -- coined by E.F. Codd in a 1994 paper contracted by Arbor Software*
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information Systems
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
The OLAP Market
Rapid growth in the enterprise market
1995: $700 Million
1997: $2.1 Billion
Significant consolidation activity among
major DBMS vendors
10/94: Sybase acquires ExpressWay
7/95: Oracle acquires Express
11/95: Informix acquires Metacube
10/96: Microsoft acquires Panorama
1/97: Arbor partners with IBM
Result: OLAP shifted from a small vertical niche to a mainstream DBMS category
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response
times
It is good for analyzing time series
It can be useful to find some clusters and
outliers
Many vendors offer OLAP tools
OLAP Is FASMI: Fast, Analysis, Shared, Multidimensional, Information
-- Nigel Pendse, Richard Creeth, The OLAP Report
Multi-dimensional Data
"Hey... I sold $100M worth of goods"
[Figure: a data cube over Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N), and Month (1-7).]
Dimensions: Product, Region, Time
Hierarchical summarization paths:
Product: Industry -> Category -> Product
Region: Country -> Region -> City -> Office
Time: Year -> Quarter -> Month/Week -> Day
A Visual Operation: Pivot (Rotate)
[Figure: a 2-D view of Product (Juice 10, Cola 47, Milk 30, Cream 12) against Date (3/1-3/4); pivoting rotates the displayed axes of the cube.]
"Slicing and Dicing"
[Figure: a cube with dimensions Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Region (India, Far East, Europe); the "Telecomm slice" fixes Product = Telecomm.]
Roll-up and Drill Down
[Figure: a sales hierarchy from higher levels of aggregation down to low-level details: Sales Channel -> Region -> Country -> State -> Location/Address -> Sales Representative. Roll-up moves up the hierarchy; drill down moves toward the detail.]
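Roll-up can be sketched directly on a cube stored as a Python dict; the dimension order and the sales figures are invented for illustration.

```python
# A data cube keyed by (product, region, month); roll-up aggregates
# away dimensions, drill-down keeps the finer-grained keys.
from collections import defaultdict

cube = {
    ("Juice", "W", 1): 10, ("Juice", "S", 1): 5,
    ("Cola",  "W", 1): 47, ("Cola",  "W", 2): 20,
    ("Milk",  "N", 2): 30,
}

def roll_up(cube, keep):
    """Aggregate the cube down to the dimensions in `keep`
    (0=product, 1=region, 2=month)."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[d] for d in keep)] += value
    return dict(out)

by_product = roll_up(cube, keep=(0,))
print(by_product)  # {('Juice',): 15, ('Cola',): 67, ('Milk',): 30}
```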
Results of Data Mining Include:
Forecasting what may happen in the
future
Classifying people or things into
groups by recognizing patterns
Clustering people or things into
groups based on their attributes
Associating what events are likely to
occur together
Sequencing what events are likely to
lead to later events
Data mining is not:
Brute-force crunching of bulk data
"Blind" application of algorithms
Going to find relationships where none exist
Presenting data in different ways
A database-intensive task
A difficult-to-understand technology requiring an advanced degree in computer science
Data Mining versus OLAP
OLAP (On-line Analytical Processing) provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening.
Data Mining Versus Statistical
Analysis
Data Mining
Originally developed to act as expert systems to solve problems
Less interested in the mechanics of the technique: if it makes sense, then let's use it
Does not require assumptions to be made about the data
Can find patterns in very large amounts of data
Requires understanding of the data and the business problem
Data Analysis
Tests for statistical correctness of models: are the statistical assumptions of the models correct? (e.g., is the R-square good?)
Hypothesis testing: is the relationship significant? (use a t-test to validate significance)
Tends to rely on sampling
Techniques are not optimised for large amounts of data
Requires strong statistical skills
Examples of What People are
Doing with Data Mining:
Fraud/Non-Compliance
Anomaly detection
Isolate the factors that lead to fraud, waste and abuse
Target auditing and investigative efforts more effectively
Credit/Risk Scoring
Intrusion detection
Parts failure prediction
Recruiting/Attracting customers
Maximizing profitability (cross selling, identifying profitable customers)
Service Delivery and Customer Retention
Build profiles of customers likely to use which services
Web Mining
What data mining has done for...
The US Internal Revenue Service needed to improve customer service and... scheduled its workforce to provide faster, more accurate answers to questions.
What data mining has done for...
The US Drug Enforcement Agency needed to be more effective in their drug "busts" and... analyzed suspects' cell phone usage to focus investigations.
What data mining has done for...
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments and... reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Suggestion: Predicting Washington
C-SPAN has launched a digital archive of 500,000 hours of audio debates. Text mining or audio mining of these talks could reveal answers to certain questions such as...
Example Application: Sports
IBM Advanced Scout analyzes NBA game statistics
Shots blocked
Assists
Fouls
Google: "IBM Advanced Scout"
Advanced Scout
Example pattern: An analysis of the data from a game played
between the New York Knicks and the Charlotte Hornets revealed
that "When Glenn Rice played the shooting guard position, he
shot 5/6 (83%) on jump shots."
Pattern is interesting:
The average shooting percentage for the Charlotte Hornets during
that game was 54%.
Data Mining: Types of Data
Relational data and transactional data
Spatial and temporal data, spatio-temporal observations
Time-series data
Text
Images, video
Mixtures of data
Sequence data
Features from processing other data sources
Data Mining Techniques
Supervised learning
Classification and regression
Unsupervised learning
Clustering
Dependency modeling
Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
Different Types of Classifiers
Linear discriminant analysis (LDA)
Quadratic discriminant analysis (QDA)
Density estimation methods
Nearest neighbor methods
Logistic regression
Neural networks
Fuzzy set theory
Decision Trees
Test Sample Estimate
Divide D into D1 and D2
Use D1 to construct the classifier d
Then use resubstitution estimate R(d,D2) to calculate the
estimated misclassification error of d
Unbiased and efficient, but removes D2 from training dataset D
V-fold Cross Validation
Procedure:
Construct classifier d from D
Partition D into V datasets D1, ..., DV
Construct classifier di using D \ Di
Calculate the estimated misclassification error R(di,Di) of di
using test sample Di
Final misclassification estimate:
Weighted combination of individual misclassification errors:
R(d,D) = 1/V Σ R(di,Di)
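The procedure above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the "learner" is a stand-in majority-class classifier (an assumption for the example), and any train(records) → predict function could be substituted.

```python
# V-fold cross-validation sketch: partition D into V folds, train on
# D \ Di, test on Di, and average the per-fold misclassification rates.
from collections import Counter

def train_majority(records):
    """Toy learner: always predict the most common label in the training set."""
    majority = Counter(label for _, label in records).most_common(1)[0][0]
    return lambda x: majority

def cv_misclassification(D, V, train=train_majority):
    """R(d,D) = 1/V * sum_i R(di, Di)."""
    folds = [D[i::V] for i in range(V)]                  # D1, ..., DV
    errors = []
    for i, test in enumerate(folds):
        # train classifier di on D \ Di
        train_set = [r for j, f in enumerate(folds) if j != i for r in f]
        d_i = train(train_set)
        wrong = sum(1 for x, y in test if d_i(x) != y)
        errors.append(wrong / len(test))                 # R(di, Di)
    return sum(errors) / V

data = [(0, 'a'), (1, 'a'), (2, 'a'), (3, 'a'),
        (4, 'a'), (5, 'b'), (6, 'b'), (7, 'a')]
print(cv_misclassification(data, V=4))   # -> 0.25
```

As the slide notes, the estimate costs V extra training runs; each di is nearly as expensive to build as d itself.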
Cross-Validation: Example
[Figure: the full classifier d alongside the fold classifiers d1, d2, d3]
Cross-Validation
Misclassification estimate obtained through cross-validation is
usually nearly unbiased
Costly computation (we need to compute d, and d1, ..., dV);
computation of di is nearly as expensive as computation of d
Preferred method to estimate quality of learning algorithms in
the machine learning literature
Decision Tree Construction
Three algorithmic components:
Split selection (CART, C4.5, QUEST, CHAID, CRUISE, ...)
Pruning (direct stopping rule, test dataset pruning,
cost-complexity pruning, statistical tests, bootstrapping)
Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)
Goodness of a Split
Consider node t with impurity phi(t)
The reduction in impurity through splitting predicate s
(t splits into children nodes t_L with impurity phi(t_L) and
t_R with impurity phi(t_R)) is:
Δphi(s,t) = phi(t) - p_L phi(t_L) - p_R phi(t_R)
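As a concrete instance of this formula, here is a small sketch using the Gini index as the impurity function phi (one common choice; CART uses it, though the formula holds for any impurity measure):

```python
# Impurity reduction Δphi(s,t) = phi(t) - p_L*phi(t_L) - p_R*phi(t_R),
# with phi taken to be the Gini index.

def gini(labels):
    """Gini impurity: phi(t) = 1 - sum_c p(c|t)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def impurity_reduction(labels_L, labels_R):
    """labels_L / labels_R: class labels routed to the left/right child."""
    parent = labels_L + labels_R
    n = len(parent)
    p_L, p_R = len(labels_L) / n, len(labels_R) / n
    return gini(parent) - p_L * gini(labels_L) - p_R * gini(labels_R)

# A perfect split leaves both children pure: 0.5 - 0 - 0 = 0.5
print(impurity_reduction(['y', 'y'], ['n', 'n']))   # -> 0.5
```

Split selection scores each candidate predicate s this way and picks the one with the largest reduction.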
Pruning Methods
Test dataset pruning
Direct stopping rule
Cost-complexity pruning
MDL pruning
Pruning by randomization testing
Stopping Policies
A stopping policy indicates when further growth of the tree at a
node t is counterproductive.
All records are of the same class
The attribute values of all records are identical
All records have missing values
At most one class has a number of records larger than a
user-specified number
All records go to the same child node if t is split (only
possible with some split selection methods)
Test Dataset Pruning
Use an independent test sample D’
to estimate the misclassification cost
using the resubstitution estimate
R(T,D’) at each node
Select the subtree T’ of T with the
smallest expected cost
Missing Values
What is the problem?
During computation of the splitting predicate, we can
selectively ignore records with missing values (note that this
has some problems)
But if a record r misses the value of the splitting variable,
r cannot participate further in tree construction
Algorithms for missing values address this problem.
Mean and Mode Imputation
Assume record r has missing value r.X, and the splitting
variable is X.
Simplest algorithm:
If X is numerical (categorical), impute the overall mean (mode)
Improved algorithm:
If X is numerical (categorical), impute the mean(X|t.C)
(the mode(X|t.C))
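The "simplest algorithm" can be sketched as follows; None stands in here for the missing-value marker (a choice made for this example only):

```python
# Mean imputation for numerical values, mode imputation for
# categorical values, with None marking a missing entry.
from statistics import mean, mode

def impute(values):
    """Replace None entries with the mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)        # numerical attribute -> overall mean
    else:
        fill = mode(observed)        # categorical attribute -> overall mode
    return [fill if v is None else v for v in values]

print(impute([1.0, 3.0, None]))              # -> [1.0, 3.0, 2.0]
print(impute(['red', 'red', 'blue', None]))  # fills with 'red'
```

The improved variant on the slide conditions the mean/mode on the class label at node t, i.e. it computes the fill value separately per class.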
Decision Trees: Summary
Many applications of decision trees
There are many algorithms available for:
Split selection
Pruning
Handling Missing Values
Data Access
Decision tree construction still active
research area (after 20+ years!)
Challenges: Performance, scalability,
evolving datasets, new applications
Supervised vs. Unsupervised Learning
Supervised
y = F(x): true function
D: labeled training set
D: {xi, F(xi)}
Learn:
G(x): model trained to predict the labels of D
Goal:
E[(F(x) - G(x))^2] ≈ 0
Well defined criteria:
Accuracy, RMSE, ...
Unsupervised
Generator: true model
D: unlabeled data sample
D: {xi}
Learn: ??????????
Goal: ??????????
Well defined criteria: ??????????
Clustering: Unsupervised Learning
Given:
Data Set D (training set)
Similarity/distance metric/information
Find:
Partitioning of data
Groups of similar/close items
Similarity?
Groups of similar customers
Similar demographics
Similar buying behavior
Similar health
Similar products
Similar cost
Similar function
Similar store
…
Similarity usually is domain/problem
specific
Clustering: Informal Problem Definition
Input:
A data set of N records each given as a d-dimensional data
feature vector.
Output:
Determine a natural, useful "partitioning" of the data set into
a number of (k) clusters and noise such that we have:
High similarity of records within each cluster (intra-cluster
similarity)
Low similarity of records between clusters (inter-cluster
similarity)
Types of Clustering
Hard Clustering:
Each object is in one and only one
cluster
Soft Clustering:
Each object has a probability of being
in each cluster
Clustering Algorithms
Partitioning-based clustering
K-means clustering
K-medoids clustering
EM (expectation maximization) clustering
Hierarchical clustering
Divisive clustering (top down)
Agglomerative clustering (bottom up)
Density-based methods
Regions of dense points separated by sparser regions of
relatively low density
K-Means Clustering Algorithm
Initialize k cluster centers
Do
Assignment step: Assign each data point to its closest cluster center
Re-estimation step: Re-compute cluster centers
While (there are still changes in the cluster centers)
Visualization at:
http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
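The assign / re-estimate loop above can be sketched directly. This is a toy 1-D version (Lloyd's algorithm) for illustration, not a production implementation:

```python
# 1-D K-means: alternate assignment and re-estimation until the
# cluster centers stop changing.
def kmeans(points, centers, max_iter=100):
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Re-estimation step: recompute each center as its cluster mean
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:       # no change -> converged
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([1.0, 2.0, 10.0, 11.0], centers=[0.0, 5.0])
print(centers)   # -> [1.5, 10.5]
```

This also hints at the "Issues" below: the result depends on the starting centers, and termination follows because each step can only decrease the within-cluster squared distance.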
Issues
Why does K-means work:
How does it find the cluster centers?
Does it find an optimal clustering?
What are good starting points for the algorithm?
What is the right number of cluster centers?
How do we know it will terminate?
Agglomerative Clustering
Algorithm:
Put each item in its own cluster (all singletons)
Find all pairwise distances between clusters
Merge the two closest clusters
Repeat until everything is in one cluster
Observations:
Results in a hierarchical clustering
Yields a clustering for each possible number of clusters
Greedy clustering: Result is not "optimal" for any cluster size
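The algorithm above can be sketched for 1-D points. Single-link distance (closest pair of members) is assumed here as the inter-cluster distance; complete-link or average-link would work the same way:

```python
# Single-link agglomerative clustering: start with singletons and
# repeatedly merge the two closest clusters, recording each level
# of the resulting hierarchy.
def agglomerate(points):
    clusters = [[p] for p in points]             # every item its own cluster
    history = [list(map(tuple, clusters))]
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest
        del clusters[j]
        history.append(list(map(tuple, clusters)))
    return history   # one clustering per possible number of clusters

for level in agglomerate([1.0, 2.0, 9.0]):
    print(level)
```

The returned history is exactly the observation on the slide: one clustering for each possible number of clusters, produced greedily.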
Density-Based Clustering
A cluster is defined as a connected dense component.
Density is defined in terms of number of neighbors of a point.
We can find clusters of arbitrary shape
Market Basket Analysis
Consider shopping cart filled with
several items
Market basket analysis tries to
answer the following questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase
items?
Market Basket Analysis
Given:
A database of customer transactions
Each transaction is a set of items
Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

TID  CID  Date    Item   Qty
111  201  5/1/99  Pen    2
111  201  5/1/99  Ink    1
111  201  5/1/99  Milk   3
111  201  5/1/99  Juice  6
112  105  6/3/99  Pen    1
112  105  6/3/99  Ink    1
112  105  6/3/99  Milk   1
113  106  6/5/99  Pen    1
113  106  6/5/99  Milk   1
114  201  7/1/99  Pen    2
114  201  7/1/99  Ink    2
114  201  7/1/99  Juice  4
Market Basket Analysis (Contd.)
Co-occurrences
80% of all customers purchase items X, Y and Z together.
Association rules
60% of all customers who purchase X and Y also buy Z.
Sequential patterns
60% of customers who first buy X also purchase Y within three weeks.
Confidence and Support
We prune the set of all possible association rules using two
interestingness measures:
Confidence of a rule:
X → Y has confidence c if P(Y|X) = c
Support of a rule:
X → Y has support s if P(XY) = s
We can also define
Support of an itemset (a co-occurrence) XY:
XY has support s if P(XY) = s
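These two measures are straightforward to estimate from a transaction database. A small sketch, using the four transactions from the market basket table above:

```python
# Support and confidence estimated from a list of transactions,
# each represented as a set of items.
def support(itemset, transactions):
    """P(itemset): fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """P(Y|X) = support(X union Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

# Transactions 111-114 from the table above
T = [{'Pen', 'Ink', 'Milk', 'Juice'},
     {'Pen', 'Ink', 'Milk'},
     {'Pen', 'Milk'},
     {'Pen', 'Ink', 'Juice'}]

print(support({'Pen', 'Ink'}, T))          # -> 0.75 (3 of 4 transactions)
print(confidence({'Pen'}, {'Ink'}, T))     # -> 0.75
```

So the rule Pen → Ink has support 0.75 and confidence 0.75 on this toy database.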
Market Basket Analysis:
Applications
Sample Applications
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling
Applications of Frequent Itemsets
Market Basket Analysis
Association Rules
Classification (especially: text, rare
classes)
Seeds for construction of Bayesian
Networks
Web log analysis
Collaborative filtering
Association Rule Algorithms
More abstract problem redux
Breadth-first search
Depth-first search
Problem Redux
Abstract:
A set of items {1,2,...,k}
A database of transactions (itemsets) D = {T1, T2, ..., Tn},
Tj subset of {1,2,...,k}
GOAL: Find all itemsets that appear in at least x transactions
("appear in" == "are subsets of")
I subset of T: T supports I
For an itemset I, the number of transactions it appears in is
called the support of I.
x is called the minimum support.
Concrete:
I = {milk, bread, cheese, ...}
D = { {milk,bread,cheese}, {bread,cheese,juice}, ... }
GOAL: Find all itemsets that appear in at least 1000 transactions
{milk,bread,cheese} supports {milk,bread}
Problem Redux (Contd.)
Definitions:
An itemset is frequent if it is a subset of at least x
transactions. (FI.)
An itemset is maximally frequent if it is frequent and it does
not have a frequent superset. (MFI.)
GOAL: Given x, find all frequent (maximally frequent) itemsets
(to be stored in the FI (MFI)).
Obvious relationship: MFI subset of FI
Example:
D = { {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }
Minimum support x = 3
{1,2} is frequent
{1,2,3} is maximally frequent
Support({1,2}) = 4
All maximal frequent itemsets: {1,2,3}
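A brute-force enumeration makes the FI/MFI definitions concrete and reproduces the example above. (Real algorithms such as Apriori prune this search; this sketch deliberately does not.)

```python
# Enumerate all itemsets with support >= x, then keep the maximal ones.
from itertools import combinations

def frequent_itemsets(D, x):
    items = sorted({i for T in D for i in T})
    FI = []
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            s = set(cand)
            # support of s = number of transactions containing it
            if sum(1 for T in D if s <= T) >= x:
                FI.append(s)
    # maximally frequent: no frequent proper superset exists
    MFI = [I for I in FI if not any(I < J for J in FI)]
    return FI, MFI

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
FI, MFI = frequent_itemsets(D, x=3)
print({1, 2} in FI, MFI)   # -> True [{1, 2, 3}]
```

As on the slide: {1,2} is frequent (support 4), and {1,2,3} is the only maximal frequent itemset at minimum support x = 3.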
Applications
Spatial association rules
Web mining
Market basket analysis
User/customer profiling
Suggested Extensions: Sequential Patterns
In the "Market Itemset Analysis" replace Milk, Pen, etc. with
names of medications and use the idea in a new hospital data
mining proposal
Take the idea of swarm intelligence and add to it the extra
analysis of the induction rules in this set of slides.
Kraft Foods: Direct Marketing
Company maintains a large database of purchases by customers.
Data mining
1. Analysts identified associations among groups of products
bought by particular segments of customers.
2. Sent out 3 sets of coupons to various households.
• Better response rates: 50% increase in sales for one of its products
• Continued use of this approach
Health Insurance Commission of Australia: Insurance Fraud
Commission maintains a database of insurance claims, including
laboratory tests ordered during the diagnosis of patients.
Data mining
1. Identified the practice of "upcoding" to reflect more
expensive tests than are necessary.
2. Now monitors orders for lab tests.
• Commission expects to save US$1,000,000 / year by eliminating
the practice of "upcoding".
HNC Software: Credit Card Fraud
Payment Fraud
Large issuers of cards may lose $10 million / year due to fraud
Difficult to identify the few transactions among thousands which
reflect potential fraud
Falcon software
Mines data through neural networks
Introduced in September 1992
Models each cardholder's requested transaction against the
customer's past spending history.
processes several hundred requests per second
compares current transaction with customer's history
identifies the transactions most likely to be frauds
enables bank to stop high-risk transactions before they are authorized
Used by many retail banks: currently monitors 160 million card
accounts for fraud
New Account Fraud
Fraudulent applications for credit cards are growing at 50% per year
Falcon Sentry software
Mines data through neural networks and a rule base
Checks information on applications against data from credit bureaus
Allows card issuers to simultaneously:
increase the proportion of applications received
reduce the proportion of fraudulent applications authorized
IBM Microelectronics: Quality Control
Analyzed manufacturing data on Dynamic Random Access Memory
(DRAM) chips.
Data mining
1. Built predictive models of manufacturing yield (% non-defective)
and of the effects of production parameters on chip performance.
2. Discovered critical factors behind production yield and
product performance.
3. Created a new design for the chip
increased yield saved millions of dollars in direct
manufacturing costs
enhanced product performance by substantially lowering the
memory cycle time
B & L Stores
Belk and Leggett Stores = one of the largest retail chains
280 stores in southeast U.S.
data warehouse contains 100s of gigabytes (billions of
characters) of data
data mining to:
increase sales
reduce costs
Selected DSS Agent from MicroStrategy, Inc.
analyze merchandizing (patterns of sales)
manage inventory
Retail Sales
DSS Agent
uses intelligent agents for data mining
provides multiple functions
recognizes sales patterns among stores
discovers sales patterns by
time of day
day of year
category of product
etc.
swiftly identifies trends & shifts in customer tastes
performs Market Basket Analysis (MBA)
analyzes Point-of-Sale or -Service (POS) data
identifies relationships among products and/or services purchased
E.g. A customer who buys Brand X slacks has a 35% chance of
buying Brand Y shirts.
Agent tool is also used by other Fortune 1000 firms
average ROI > 300%
average payback in 1 ~ 2 years
Case Based Reasoning (CBR)
[Figure: a target case matched against precedent cases A and B]
General scheme for a case based reasoning (CBR) model. The
target case is matched against similar precedents in the
historical database, such as cases A and B.
Case Based Reasoning (CBR)
Learning through the accumulation of experience
Key issues
Indexing: storing cases for quick, effective access of precedents
Retrieval: accessing the appropriate precedent cases
Advantages
Explicit knowledge form recognizable to humans
No need to re-code knowledge for computer processing
Limitations
Retrieving precedents based on superficial features
E.g. Matching Indonesia with U.S. because both have similar
population size
Traditional approach ignores the issue of generalizing knowledge
Genetic Algorithm
Generation of candidate solutions using the procedures of
biological evolution.
Procedure
0. Initialize. Create a population of potential solutions ("organisms").
1. Evaluate. Determine the level of "fitness" for each solution.
2. Cull. Discard the poor solutions.
3. Breed.
a. Select 2 "fit" solutions to serve as parents.
b. From the 2 parents, generate offspring.
* Crossover: Cut the parents at random and switch the 2 halves.
* Mutation: Randomly change the value in a parent solution.
4. Repeat. Go back to Step 1 above.
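Steps 0-4 can be sketched on a toy problem. The task here (evolving 8-bit strings toward all ones, fitness = number of 1 bits) and all parameter values are illustrative assumptions, not part of the slides:

```python
# Toy genetic algorithm: initialize, evaluate, cull, breed
# (crossover + mutation), repeat.
import random

def evolve(bits=8, pop_size=20, generations=60, seed=1):
    random.seed(seed)
    # 0. Initialize: a population of random bit strings
    pop = [[random.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # 1. Evaluate + 2. Cull: sort by fitness, keep the fitter half
        pop.sort(key=sum, reverse=True)
        pop = pop[:pop_size // 2]
        # 3. Breed: crossover two fit parents, occasionally mutate
        while len(pop) < pop_size:
            p1, p2 = random.sample(pop[:5], 2)
            cut = random.randrange(1, bits)      # crossover point
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.1:            # mutation
                i = random.randrange(bits)
                child[i] ^= 1
            pop.append(child)
        # 4. Repeat
    return max(pop, key=sum)

best = evolve()
print(sum(best))   # fitness of the best organism found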
Genetic Algorithm (Cont.)
Advantages
Applicable to a wide range of problem domains.
Robustness: can obtain solutions even when the performance
function is highly irregular or input data are noisy.
Implicit parallelism: can search in many directions concurrently.
Limitations
Slow, like neural networks.
But: computation can be distributed over multiple processors
(unlike neural networks)
Source: www.pathology.washington.edu
Multistrategy Learning
Every technique has advantages & limitations
Multistrategy approach
Take advantage of the strengths of diverse techniques
Circumvent the limitations of each methodology
Types of Models
Prediction Models for Predicting and Classifying
Regression algorithms (predict numeric outcome): neural
networks, rule induction, CART (OLS regression, GLM)
Classification algorithms (predict symbolic outcome): CHAID,
C5.0 (discriminant analysis, logistic regression)
Descriptive Models for Grouping and Finding Associations
Clustering/Grouping algorithms: K-means, Kohonen
Association algorithms: Apriori, GRI
Neural Networks
Description
Difficult interpretation
Tends to ‘overfit’ the data
Extensive amount of training time
A lot of data preparation
Works with all data types
Rule Induction
Description
Intuitive output
Handles all forms of numeric data, as well as non-numeric
(symbolic) data
C5 Algorithm
a special case of rule induction
Target variable must be symbolic
Apriori
Description
Seeks association rules in the dataset
'Market basket' analysis
Sequence discovery
Data Mining Is
The automated process of finding
relationships and patterns in stored
data
It is different from the use of SQL
queries and other business
intelligence tools
Data Mining Is
Motivated by business need, large
amounts of available data, and
humans’ limited cognitive processing
abilities
Enabled by data warehousing,
parallel processing, and data mining
algorithms
Common Types of Information from Data Mining
Associations -- identifies occurrences that are linked to a
single event
Sequences -- identifies events that are linked over time
Classification -- recognizes patterns that describe the group
to which an item belongs
Clustering -- discovers different groupings within the data
Forecasting -- estimates future values
Commonly Used Data Mining
Techniques
Artificial neural networks
Decision trees
Genetic algorithms
Nearest neighbor method
Rule induction
The Current State of Data Mining Tools
Many of the vendors are small companies
IBM and SAS have been in the market for some time, and more
"biggies" are moving into this market
BI tools and RDBMS products are increasingly including basic
data mining capabilities
Packaged data mining applications are becoming common
The Data Mining Process
Requires personnel with domain,
data warehousing, and data mining
expertise
Requires data selection, data
extraction, data cleansing, and data
transformation
Most data mining tools work with
highly granular flat files
Is an iterative and interactive
process
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least
likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection:
Which types of transactions are likely to be fraudulent, given
the demographics and transactional history of a particular
customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and
which are most likely to leave for a competitor?
Data mining helps extract such information
Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications,
financial transactions
from an online stream of events, identify fraudulent
events
Manufacturing and production:
automatically adjust knobs when process parameter
changes
Applications (continued)
Medicine: disease outcome, effectiveness
of treatments
analyze patient disease history: find
relationship between diseases
Molecular/Pharmaceutical: identify new
drugs
Scientific data analysis:
identify new galaxies by searching for sub
clusters
Web site/store design and promotion:
find affinity of visitor to pages and modify
layout
The KDD process
Problem formulation
Data collection
subset data: sampling might hurt if highly skewed data
feature selection: principal component analysis, heuristic search
Pre-processing: cleaning
name/address cleaning, different meanings (annual, yearly),
duplicate removal, supplying missing values
Transformation:
map complex objects e.g. time series data to features e.g. frequency
Choosing mining task and mining method
Result evaluation and visualization
Knowledge discovery is an iterative process
Relationship with other fields
Overlaps with machine learning, statistics, artificial
intelligence, databases, visualization, but with more stress on
scalability in the number of features and instances
stress on algorithms and architectures, whereas the foundations
of methods and formulations are provided by statistics and
machine learning
automation for handling large, heterogeneous data
Some basic operations
Predictive:
Regression
Classification
Collaborative Filtering
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
Classification
Given old data about customers and payments, predict a new
applicant's loan eligibility.
[Figure: previous customers (age, salary, profession, location,
customer type) feed a classifier, which learns decision rules
such as "Salary > 5L" and "Prof. = Exec"; the rules are applied
to a new applicant's data to predict good/bad.]
Classification methods
Goal: Predict class Ci = f(x1, x2, ..., xn)
Regression: (linear or any other polynomial)
a*x1 + b*x2 + c = Ci
Nearest neighbour: define proximity between instances, find
neighbors of new instance and assign majority class
Decision tree classifier: divide decision space into piecewise
constant regions.
Probabilistic/generative models
Neural networks: partition by non-linear boundaries
Case based reasoning: when attributes are more complicated than
real-valued.
Nearest neighbor
• Cons
- Slow during application.
- No feature selection.
- Notion of proximity vague
• Pros
+ Fast training
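The pros and cons fall straight out of a sketch: "training" is just storing the data, and every query scans it. A minimal majority-vote k-NN, assuming Euclidean distance as the proximity measure:

```python
# k-nearest-neighbour classification: find the k closest stored
# records and assign the majority class among them.
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs."""
    def dist(p):
        # Euclidean proximity between a stored record and the query
        return sum((a - b) ** 2 for a, b in zip(p, query)) ** 0.5
    neighbours = sorted(train, key=lambda rec: dist(rec[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((0.0, 0.0), 'low'), ((0.2, 0.1), 'low'),
         ((1.0, 1.0), 'high'), ((0.9, 1.1), 'high'), ((0.1, 0.2), 'low')]
print(knn_predict(train, (0.05, 0.05)))   # -> 'low'
```

The full scan at query time is the "slow during application" con; replacing it with an index (e.g. a k-d tree) is the usual remedy.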
Clustering
Unsupervised learning when old data with class labels not
available, e.g. when introducing a new product.
Group/cluster existing customers based on time series of payment
history such that similar customers are in the same cluster.
Key requirement: Need a good measure of similarity between instances.
Identify micro-markets and develop policies for each
Applications
Customer segmentation, e.g. for targeted marketing
Group/cluster existing customers based on time series of payment
history such that similar customers are in the same cluster.
Identify micro-markets and develop policies for each
Collaborative filtering: group based on common items purchased
Text clustering
Compression
Distance functions
Numeric data: Euclidean, Manhattan distances
Categorical data: 0/1 to indicate presence/absence, followed by
Hamming distance (# of dissimilarities)
Jaccard coefficients: # of similarities in 1s / (# of 1s)
data dependent measures: similarity of A and B depends on
co-occurrence with C.
Combined numeric and categorical data: weighted normalized distance
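The two categorical measures above can be sketched on 0/1 presence/absence vectors:

```python
# Hamming distance (# of positions that differ) and Jaccard
# coefficient (# of shared 1s over # of positions with any 1)
# on 0/1 presence/absence encodings of categorical data.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 1.0

a = [1, 0, 1, 1]   # presence/absence encoding
b = [1, 1, 0, 1]
print(hamming(a, b))   # -> 2 (two positions differ)
print(jaccard(a, b))   # -> 0.5 (2 shared 1s / 4 positions with a 1)
```

Note the difference in emphasis: Hamming counts disagreements including shared absences, while Jaccard ignores positions where both vectors are 0, which matters for sparse data such as market baskets.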
Clustering methods
Hierarchical clustering
agglomerative vs. divisive
single link vs. complete link
Partitional clustering
distance-based: K-means
model-based: EM
density-based
Partitional methods: K-means
Criteria: minimize sum of squared distances
Between each point and the centroid of the cluster.
Between each pair of points in the cluster
Algorithm:
Select initial partition with K clusters: random, first K,
K separated points
Repeat until stabilization:
Assign each point to closest cluster center
Generate new cluster centers
Adjust clusters by merging/splitting
Collaborative Filtering
Given database of user preferences,
predict preference of new user
Example: predict what new movies you will
like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person
may want to buy
(and suggest it, or give discounts to
tempt customer)
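The movie example above can be sketched with a simple user-based scheme. The similarity function here (inverse of mean absolute rating difference on co-rated items) is an illustrative choice; Pearson correlation or cosine similarity are the more common alternatives:

```python
# User-based collaborative filtering: predict a user's rating of an
# unseen item as a similarity-weighted average of other users' ratings.
def predict_rating(ratings, user, item):
    """ratings: {user: {item: score}}."""
    def similarity(u, v):
        shared = set(ratings[u]) & set(ratings[v])   # co-rated items
        if not shared:
            return 0.0
        diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in shared) / len(shared)
        return 1.0 / (1.0 + diff)     # closer past preferences -> higher weight

    num = den = 0.0
    for other in ratings:
        if other != user and item in ratings[other]:
            w = similarity(user, other)
            num += w * ratings[other][item]
            den += w
    return num / den if den else None

ratings = {'ann':  {'m1': 5, 'm2': 4},
           'bob':  {'m1': 5, 'm2': 4, 'm3': 5},
           'carl': {'m1': 1, 'm2': 2, 'm3': 1}}
print(predict_rating(ratings, 'ann', 'm3'))   # -> 4.2
```

Ann's tastes match Bob's exactly and differ from Carl's, so Bob's high rating of m3 dominates the prediction.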
Association rules
Given set T of groups of items
Example: set of item sets purchased
Goal: find all rules on itemsets of the form a --> b such that
support of a and b > user threshold s
conditional probability (confidence) of b given a > user threshold c
Example: Milk --> bread
Purchase of product A --> service B

T:
Milk, cereal
Tea, milk
Tea, rice, bread
cereal
Prevalent vs. Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining's payoff is in finding surprising phenomena
[Cartoon: in 1995, "Milk and cereal sell together!" is a surprise;
by 1998 the same finding draws only a "Zzzz..."]
Applications of fast itemset
counting
Find correlated events:
Applications in medicine: find
redundant tests
Cross selling in retail, banking
Improve predictive capability of
classifiers that assume attribute
independence
New similarity measures of categorical attributes [Mannila et al., KDD 98]
Application Areas
Industry                Application
Finance                 Credit Card Analysis
Insurance               Claims, Fraud Analysis
Telecommunication       Call record analysis
Transport               Logistics management
Consumer goods          Promotion analysis
Data Service providers  Value added data
Utilities               Power usage analysis
Usage scenarios
Data warehouse mining:
assimilate data from operational sources
mine static data
Mining log data
Continuous mining: example in process control
Stages in mining:
data selection
pre-processing: cleaning
transformation
mining
result evaluation
visualization
Mining market
Around 20 to 30 mining tool vendors
Major tool players:
Clementine,
IBM’s Intelligent Miner,
SGI’s MineSet,
SAS’s Enterprise Miner.
All pretty much the same set of tools
Many embedded products:
fraud detection:
electronic commerce applications,
health care,
customer relationship management: Epiphany
Vertical integration:
Mining on the web
Web log analysis for site design:
what are popular pages,
what links are hard to find.
Electronic stores sales enhancements:
recommendations, advertisement:
Collaborative filtering
:
Net perception,
Wisewire
Inventory control: what was a shopper looking for and could not find.
State of the art in mining
OLAP integration
Decision trees [Information Discovery, Cognos]
find factors influencing high profits
Clustering [Pilot Software]