© 2006-2008 Robert T. Monroe
Carnegie Mellon University
Data Mining
Business Intelligence Tools and Techniques
Robert Monroe
April 15, 2008
Goals For Today
• Introduce data mining, the data mining process, and common business scenarios where it is useful
• Understand where data mining fits in the BI universe
  – How it compares to and complements other BI tools
  – Relationship to data warehousing, OLAP, and relational DBs
• Introduce common data mining models, concepts, and techniques
  – Classification
  – Clustering
  – Link Analysis
Acknowledgement
• This evening’s talk, examples, and ideas were derived primarily from two sources:
  – Course materials created by Professor Michael Trick for his Mining Data For Decision Making class, 45-873
  – Introduction to Data Mining and Knowledge Discovery, by the Two Crows Corporation
• Many thanks to Professor Trick and the Two Crows Corporation
Introduction to Data Mining
Definitions
• Data mining is the art and science of automatically analyzing large data stores to discover meaningful patterns and relationships
  – Michael Trick, Carnegie Mellon University
• Key ideas:
  – Art and science: data mining requires active participation of human experts to be used effectively; it is not a “fire and forget” process
  – Automatic: standard analyses are run automatically; selecting which analysis to run on which data, and how to interpret the results, is where the human expert comes in
  – Discover: you don’t know what you’re going to get
  – Meaningful: there are many more useless patterns than useful ones
  – Patterns and relationships: are what data mining techniques discover
Data Mining and Business Intelligence
• Data mining tools are part of a complete and well-balanced BI toolkit
• Data mining complements data warehousing and OLAP
• OLAP helps you ask structured questions and interactively explore answers
• Data mining helps an analyst determine structure and relationships in the data that were not a priori obvious

[Architecture diagram: Operational Databases feed, via ETL, a Data Warehouse; further ETL steps populate Data Marts and an OLAP Cube; Data Mining Tools, an OLAP GUI, and Reporting and Vis Tools issue queries against these stores]
Drowning in Data
• One of the reasons that data mining is becoming more popular is the explosion of data collected and available
• Our ability to collect and store data seems to have surpassed our ability to make sense of it
Common Data Mining Goals
• Classification
  – Divide a database into predefined groups that are very different from each other, but whose members are very similar
• Estimation/Prediction
  – Much like classification, but attempts to predict an outcome in the future
• Clustering and Affinity Grouping
  – Identify things that “go together”
  – You don’t know what the clustering criteria will be a priori
Applications of Data Mining
• There are many, many potential applications of data mining techniques
• Basic requirements:
  – Availability of large data sets
  – A desire to discover patterns and relationships among the data
Applications of Data Mining (Examples)
• Financial Services
  – Fraud detection and protection
  – Credit risk scoring
  – Customer lifetime value analysis
• Retail
  – Marketing campaign response analysis and prediction
  – Personalization
  – Market basket analysis
  – Store layout
Applications of Data Mining (Examples)
• Manufacturing
  – Defect root cause analysis
  – Predicting order flow
• Medicine
  – Diagnostics
  – Pathology and epidemic tracking and prediction
  – Drug discovery research
• Others?
The Data Mining Process
Describe the data → Build a predictive model → Test the model → Verify the model → Refine and repeat

Key elements necessary for successful data mining:
  – Business domain knowledge
  – Precise formulation of the problem you are trying to solve
  – Understanding of how to use the data mining tools (and algorithms)
  – Good data
    • Enough of the right data
    • Clean data
  – Willingness to learn, refine, and iterate
Classification
Classification
• The goal of classification is to assign each member of a set of records/objects to predefined classes
• For example:
  – Assign a loan application to one of three categories:
    • High risk, medium risk, or low risk
  – Assign a customer to one of three categories:
    • Highly profitable, potentially profitable, money-loser
  – In drug discovery, assign a chemical compound to:
    • High potential, low potential, likely harmful
  – Other examples?
Classification Algorithm: Decision Trees
• Decision trees are a common classification technique
• Decision trees create a series of decision points (rules) that can be used to classify a given record
• Decision tree algorithms generate the rules automatically by inference from previous data
• Example: Is this loan applicant a good risk or a bad risk?

  Income > $40k?
  ├─ yes → Job > 5 years?
  │         ├─ yes → Good Risk
  │         └─ no  → Bad Risk
  └─ no  → High debt?
            ├─ yes → Bad Risk
            └─ no  → Good Risk
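The tree above can be read as plain if/else rules. Here is a minimal Python sketch; the field names (`income`, `years_on_job`, `high_debt`) are our own illustrative choices, not from the slides:

```python
def classify_loan(income, years_on_job, high_debt):
    """Return 'Good Risk' or 'Bad Risk' for a loan applicant."""
    if income > 40_000:
        # High-income branch: job tenure decides
        return "Good Risk" if years_on_job > 5 else "Bad Risk"
    # Low-income branch: debt load decides
    return "Bad Risk" if high_debt else "Good Risk"

print(classify_loan(50_000, 7, False))  # Good Risk
print(classify_loan(30_000, 2, True))   # Bad Risk
```

Each `if` corresponds to one decision point in the diagram; classifying a record is just a walk from the root to a leaf.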
Decision Trees Example: Loan Risk
• Once we have the tree, predicting whether the loan is too risky is straightforward
• Just answer the questions posed at tree branches until you get to a leaf node (Good Risk/Bad Risk)

  Income > $40k?
  ├─ yes → Job > 5 years?  (yes → Good Risk, no → Bad Risk)
  └─ no  → High debt?      (yes → Bad Risk, no → Good Risk)
Building the Decision Tree is the Challenge
• Data mining tools utilize statistical techniques to identify which characteristics best predict the desired classification
• From these analyses they create rules
  – Income > 40k AND Job > 5 years => Good Risk
  – Income > 40k AND Job < 5 years => Bad Risk
  – …
• Putting all of these rules together forms a decision tree
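As a sketch of how such a statistical technique might score a candidate split, here is a Gini-impurity calculation. Gini is one common splitting criterion (not necessarily the one the slides assume), and the incomes, labels, and threshold below are invented for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 = perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(records, labels, test):
    """Weighted Gini impurity after splitting records by `test`."""
    left = [lbl for r, lbl in zip(records, labels) if test(r)]
    right = [lbl for r, lbl in zip(records, labels) if not test(r)]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

incomes = [55, 62, 30, 25, 48, 28]                     # in $k
risk = ["Good", "Good", "Bad", "Bad", "Good", "Bad"]
# Lower weighted impurity = better rule; this split separates perfectly
print(split_score(incomes, risk, lambda x: x > 40))    # 0.0
```

A greedy tree builder evaluates many candidate tests like this, picks the best-scoring one as the next decision point, and recurses on each branch.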
Decision Tree Benefits and Drawbacks
• Benefits:
  – Easy to use, create, and understand the results
  – Works well for discrete data; can also work with continuous data
  – Size of tree can be specified independent of size of database
  – Can work with few or many attributes
• Drawbacks:
  – It is easy to “overfit” the data
    • Addressed by stopping rules and tree pruning
  – Greedy algorithms may not make globally optimal classifications
Classification Algorithm: K-Nearest Neighbor
• The K-Nearest Neighbor algorithm is an alternative to decision trees for automated classification
• Given a database of classified records, determine the best classification for a new record R
• Use a distance function to determine the “k closest” neighbors to record R
  – Intuitively, this should identify the records that are “most like” R
• For a given value of k, determine R’s k nearest neighbors and compute which classification best fits R
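The steps above can be sketched in a few lines of Python, using Euclidean distance and a majority vote among the k neighbors. The two-dimensional points and labels are invented for illustration:

```python
import math
from collections import Counter

def knn_classify(data, new_point, k):
    """data: list of ((x, y), label) pairs.
    Returns the majority label of the k points closest to new_point."""
    by_distance = sorted(data, key=lambda item: math.dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
          ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(points, (2, 2), k=1))  # A
print(knn_classify(points, (7, 7), k=3))  # B
```

Note that the whole dataset is sorted for every query, which previews the drawback discussed on the next slide.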
K-Nearest Neighbor Example
[Scatter plot of labeled items in a Dimension X / Dimension Y space, with one new unlabeled item]
Question: How to classify the new item (k=1)?
K-Nearest Neighbor Example
[Same scatter plot in the Dimension X / Dimension Y space]
Question: How to classify the new item (k=6)?
K-Nearest Neighbor Benefits and Drawbacks
• Benefits:
  – Easy to implement
  – Easy to explain the results
• Drawbacks:
  – Difficult to get a good distance function in many cases
    • Especially true when dealing with non-continuous variables
    • A poor distance function makes the results essentially worthless
  – The entire dataset is required (and must be evaluated) to classify each new instance
    • Potentially difficult to do interactive analysis on large data sets
Clustering (Affinity Grouping)
Clustering
• The goal of clustering is to divide a database into different groups
  – The groups are not known ahead of time (unlike classification)
  – Each group should be significantly different from the others
  – Members of the same group should be very similar to each other, along some dimension(s)
• Examples:
  – Determine appropriate ways to segment customers
  – Partition geographic regions and divisions
  – Bundling packages of goods for special discounts
Intuition Behind Clustering
[Two panels: given many data points, group them into clusters]
How to do the Grouping? K-Means Algorithm
• K-Means is one of many clustering algorithms available
• The k-means algorithm (to define k clusters):
  1. Place k points into the space represented by the objects that are being clustered. These points represent initial group centroids
  2. Assign each object to the group that has the closest centroid
  3. When all objects have been assigned, recalculate the positions of the k centroids
  4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated

Clustering algorithm from the Clustering Algorithm Tutorial by Matteo Matteucci, Politecnico di Milano, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html
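The four steps above can be sketched directly for 2-D points. This is a minimal illustration with Euclidean distance; the data points and the starting centroids (step 1) are invented:

```python
import math

def kmeans(points, centroids):
    """Iterate steps 2-4: assign, recalculate, repeat until stable."""
    while True:
        # Step 2: assign each point to the group with the closest centroid
        groups = [[] for _ in centroids]
        for p in points:
            closest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[closest].append(p)
        # Step 3: recalculate each centroid as the mean of its group
        new_centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            if g else c
            for g, c in zip(groups, centroids)
        ]
        # Step 4: stop when no centroid moves
        if new_centroids == centroids:
            return centroids, groups
        centroids = new_centroids

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(pts, centroids=[(0, 0), (10, 10)])
print(centers)
```

With these points the loop converges in two passes, leaving one centroid in each of the two obvious clumps; real implementations also handle empty groups and poor initial placements more carefully.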
K-Means Algorithm Example: First Iteration, k=3
1. Place k points into the space represented by the objects that are being clustered; these points represent initial group centroids
2.1 Assign each object to the group that has the closest centroid
3.1 When all objects have been assigned, recalculate the positions of the k centroids
→ Iterate
K-Means Algorithm Example: Second Iteration, k=3
2.2 Assign each object to the group that has the closest centroid
3.2 When all objects have been assigned, recalculate the positions of the k centroids
→ Iterate again
K-Means Algorithm Example: Third Iteration, k=3
2.3 Assign each object to the group that has the closest centroid
3.3 When all objects have been assigned, recalculate the positions of the k centroids
→ No centroids move, so done
Clustering Challenges
• Difficult to get a good distance function in many cases
  – Especially true when dealing with non-continuous variables
  – A poor distance function makes the results essentially worthless
• Clustering algorithms can be slow with large data sets
• Difficult to interpret the results
  – … or to determine which cluster results are “good” or “better”
  – Determining causality vs. random dead-end correlations
Link Analysis
Link Analysis
• Link analysis is a mechanism for identifying relationships among different values in a database
• Association discovery finds rules about items that appear together in a single event
  – Market basket analysis is a very common example
• Sequence discovery finds common sequences of related events over time
  – Click-stream analysis on the web is a good example
• Association and sequence discovery are implemented with essentially the same techniques
Associations
• Association rules are written: A => B
  – Where A is the antecedent statement (or LHS)
  – … and B is the consequent statement (or RHS)
  – An association rule means: if A, then B
• Example association rule:
  – If a customer buys hot fudge they also buy ice cream
  – Antecedent (A): “customer buys hot fudge”
  – Consequent (B): “customer buys ice cream”
Association Analysis Concepts: Support
• Support is the frequency with which a particular association appears in the database
• Support(A) = (# of records with A) / (total # of records)
• Example: In a data set of 1000 food store market baskets:
  – Number of baskets that include hot fudge = 75
  – Number of baskets that include ice cream = 50
  – Number of baskets that include hot fudge AND ice cream = 25
  – Number of baskets that include hot fudge AND ice cream AND peanuts = 5
  – Support(“hot fudge AND ice cream”) = 25/1000 = 2.5%
  – Support(“hot fudge AND ice cream AND peanuts”) = 5/1000 = 0.5%
• Question: what value does support provide?
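The support numbers above can be checked directly. In this sketch the basket contents are invented so that the counts match the slide (75 baskets with hot fudge, 25 with both hot fudge and ice cream, 5 with all three items, 1000 baskets total):

```python
# Build a synthetic database of 1000 market baskets matching the counts
baskets = ([{"hot fudge", "ice cream", "peanuts"}] * 5 +
           [{"hot fudge", "ice cream"}] * 20 +
           [{"hot fudge"}] * 50 +
           [{"bread"}] * 925)

def support(items, baskets):
    """Fraction of baskets containing every item in `items`."""
    return sum(items <= b for b in baskets) / len(baskets)

print(support({"hot fudge", "ice cream"}, baskets))             # 0.025
print(support({"hot fudge", "ice cream", "peanuts"}, baskets))  # 0.005
```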
Association Analysis Concepts: Confidence
• In addition to support, we need to determine the relative frequency with which the combination indicated by an association occurs
• Confidence measures how often the consequent occurs, given that the antecedent has occurred
• Confidence = (frequency of A and B) / (frequency of A)
• Example: In a data set of 1000 food store market baskets:
  – Number of baskets that include hot fudge = 75
  – Number of baskets that include ice cream = 50
  – Number of baskets that include peanuts = 50
  – Number of baskets that include hot fudge AND ice cream = 25
  – Number of baskets that include hot fudge AND ice cream AND peanuts = 5
  – Confidence(“hot fudge => ice cream”) = 25/75 = 33%
  – Confidence(“hot fudge AND ice cream => peanuts”) = 5/25 = 20%
Association Analysis Concepts: Lift
• Lift is yet another measure for the validity of an association rule
• Lift measures the influence that an occurrence of A has on the likelihood that B will occur
• Lift = (Confidence of A => B) / (frequency of B)
• Example: In a data set of 1000 food store market baskets:
  – Number of baskets that include hot fudge = 75
  – Number of baskets that include ice cream = 50
  – Number of baskets that include peanuts = 50
  – Number of baskets that include hot fudge AND ice cream = 25
  – Number of baskets that include hot fudge AND ice cream AND peanuts = 5
  – Lift(“hot fudge => ice cream”) = 0.33 / 0.05 = 6.6
  – Lift(“hot fudge AND ice cream => peanuts”) = 0.2 / 0.05 = 4
• What conclusions can we draw re: hot fudge, ice cream, peanuts?
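The confidence and lift arithmetic above can be checked in a few lines; the variable names are ours. (The slide's 6.6 comes from dividing the rounded 33% confidence; the unrounded value is closer to 6.7.)

```python
n = 1000
hot_fudge = 75
ice_cream = 50
peanuts = 50
fudge_and_cream = 25
fudge_cream_peanuts = 5

# Confidence(A => B) = freq(A and B) / freq(A)
conf_fudge_cream = fudge_and_cream / hot_fudge           # 25/75 ≈ 0.33
conf_fc_peanuts = fudge_cream_peanuts / fudge_and_cream  # 5/25 = 0.20

# Lift(A => B) = Confidence(A => B) / freq(B)
lift_fudge_cream = conf_fudge_cream / (ice_cream / n)    # ≈ 6.7
lift_fc_peanuts = conf_fc_peanuts / (peanuts / n)        # 4.0

print(round(conf_fudge_cream, 2), round(lift_fudge_cream, 1))  # 0.33 6.7
print(round(conf_fc_peanuts, 2), round(lift_fc_peanuts, 1))    # 0.2 4.0
```

A lift well above 1 in both cases suggests that buying hot fudge genuinely raises the chance of the consequent items appearing in the same basket, rather than the items merely being individually popular.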
Selecting Association Rules
• Link analysis algorithms search for association rules with good (high) support, confidence, and lift
• Thresholds for “good” values for these measures will vary depending on application and context
  – Higher numbers indicate stronger rules
• Recurring reminder: application of human interpretation and judgment is essential to effective use of this technique
• Warning: evaluating association rules can quickly become very computationally expensive
In-Class Exercise: Evaluate Association Rules
• Calculate the support, confidence, and lift for the candidate association rules in problem 1 on the handout
Wrap-Up
Goals For Today
• Introduce data mining, the data mining process, and common business scenarios where it is useful
• Understand where data mining fits in the BI universe
  – How it compares to and complements other BI tools
  – Relationship to data warehousing, OLAP, and relational DBs
• Introduce common data mining models, concepts, and techniques
  – Classification
  – Clustering
  – Link Analysis