
Data Mining

Business Intelligence Tools and
Techniques

Robert Monroe

April 15, 2008

© 2006-2008 Robert T. Monroe, Carnegie Mellon University


Goals For Today

- Introduce data mining, the data mining process, and common business scenarios where it is useful
- Understand where data mining fits in the BI universe
  - How it compares to and complements other BI tools
  - Relationship to data warehousing, OLAP, and relational databases
- Introduce common data mining models, concepts, and techniques
  - Classification
  - Clustering
  - Link Analysis



Acknowledgement

- This evening's talk, examples, and ideas were derived primarily from two sources:
  - Course materials created by Professor Michael Trick for his Mining Data For Decision Making class, 45-873
  - Introduction to Data Mining and Knowledge Discovery, by the Two Crows Corporation
- Many thanks to Professor Trick and the Two Crows Corporation




Introduction to Data Mining



Definitions

- Data mining is the art and science of automatically analyzing large data stores to discover meaningful patterns and relationships
  - Michael Trick, Carnegie Mellon University

- Key ideas:
  - Art and science: data mining requires the active participation of human experts to be used effectively; it is not a “fire and forget” process
  - Automatic: standard analyses are run automatically; selecting which analysis to run on which data, and how to interpret the results, is where the human expert comes in
  - Discover: you don't know what you're going to get
  - Meaningful: there are many more useless patterns than useful ones
  - Patterns and relationships: these are what data mining techniques discover



Data Mining and Business Intelligence

- Data mining tools are part of a complete and well-balanced BI toolkit
- Data mining complements data warehousing and OLAP
  - OLAP helps you ask structured questions and interactively explore answers
  - Data mining helps an analyst determine structure and relationships in the data that were not a priori obvious

[Architecture diagram: operational databases feed a data warehouse and data marts through ETL; the warehouse loads an OLAP cube; data mining tools, an OLAP GUI, and reporting and visualization tools access these stores via queries]


Drowning in Data

- One of the reasons that data mining is becoming more popular is the explosion of data collected and available
- Our ability to collect and store data seems to have surpassed our ability to make sense of it



Common Data Mining Goals

- Classification
  - Assign the records in a database to predefined groups that are very different from each other, but whose members are very similar
- Estimation/Prediction
  - Much like classification, but attempts to predict an outcome in the future
- Clustering and Affinity Grouping
  - Identify things that “go together”
  - You don't know what the clustering criteria will be a priori



Applications of Data Mining

- There are many, many potential applications of data mining techniques
- Basic requirements:
  - Availability of large data sets
  - A desire to discover patterns and relationships among the data




Applications of Data Mining (Examples)

- Financial Services
  - Fraud detection and protection
  - Credit risk scoring
  - Customer lifetime value analysis
- Retail
  - Marketing campaign response analysis and prediction
  - Personalization
  - Market basket analysis
  - Store layout



Applications of Data Mining (Examples)

- Manufacturing
  - Defect root cause analysis
  - Predicting order flow
- Medicine
  - Diagnostics
  - Pathology and epidemic tracking and prediction
  - Drug discovery research
- Others?



The Data Mining Process

The process is iterative:
1. Describe the data
2. Build a predictive model
3. Test the model
4. Verify the model
5. Refine and repeat

Key elements necessary for successful data mining:
- Business domain knowledge
- Precise formulation of the problem you are trying to solve
- Understanding of how to use the data mining tools (and algorithms)
- Good data
  - Enough of the right data
  - Clean data
- Willingness to learn, refine, and iterate




Classification



Classification

- The goal of classification is to assign each member of a set of records/objects to predefined classes
- For example:
  - Assign a loan application to one of three categories: high risk, medium risk, or low risk
  - Assign a customer to one of three categories: highly profitable, potentially profitable, money-loser
  - In drug discovery, assign a chemical compound to: high potential, low potential, likely harmful
- Other examples?



Classification Algorithm: Decision Trees

- Decision trees are a common classification technique
  - Decision trees create a series of decision points (rules) that can be used to classify a given record
  - Decision tree algorithms generate the rules automatically by inference from previous data
- Example: Is this loan applicant a good risk or a bad risk?

  Income > $40k?
    yes -> Job > 5 years?
             yes -> Good Risk
             no  -> Bad Risk
    no  -> High debt?
             yes -> Bad Risk
             no  -> Good Risk



Decision Trees Example: Loan Risk

- Once we have the tree, predicting whether the loan is too risky is straightforward
- Just answer the questions posed at the tree branches until you get to a leaf node (Good Risk/Bad Risk)

  Income > $40k?
    yes -> Job > 5 years?
             yes -> Good Risk
             no  -> Bad Risk
    no  -> High debt?
             yes -> Bad Risk
             no  -> Good Risk
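
To make the tree walk concrete, here is a minimal Python sketch that hard-codes the loan-risk tree above; the record field names (income, years_on_job, high_debt) are illustrative assumptions, not from the original slides.

    def classify_loan(applicant):
        # Walk the loan-risk decision tree shown above.
        if applicant["income"] > 40_000:
            # Income above $40k: job tenure decides the outcome.
            return "Good Risk" if applicant["years_on_job"] > 5 else "Bad Risk"
        # Income at or below $40k: debt load decides the outcome.
        return "Bad Risk" if applicant["high_debt"] else "Good Risk"

    # Example: $55k income, 7 years on the job, low debt -> "Good Risk"
    print(classify_loan({"income": 55_000, "years_on_job": 7, "high_debt": False}))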



Building the Decision Tree is the Challenge

- Data mining tools utilize statistical techniques to identify which characteristics best predict the desired classification
- From these analyses they create rules, for example:
  - Income > 40k AND Job > 5 years => Good Risk
  - Income > 40k AND Job < 5 years => Bad Risk
- Putting all of these rules together forms a decision tree
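
As a sketch of how a tool might induce such rules automatically, the snippet below fits scikit-learn's DecisionTreeClassifier to a tiny invented loan dataset and prints the learned rules. The data, feature encoding, and choice of library are assumptions for illustration, not the tooling used in the course.

    # Minimal sketch of automated tree induction, assuming scikit-learn is
    # installed; the toy loan data below is invented for illustration.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Each row: [income ($k), years on current job, high debt (0/1)]
    X = [[55, 7, 0], [45, 2, 0], [30, 4, 1], [28, 9, 0], [60, 6, 0], [48, 1, 0]]
    y = ["good", "bad", "bad", "good", "good", "bad"]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["income", "job_years", "high_debt"]))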





Decision Tree Benefits and Drawbacks

- Benefits:
  - Easy to use, create, and understand the results
  - Works well for discrete data; can also work with continuous data
  - Size of the tree can be specified independently of the size of the database
  - Can work with few or many attributes
- Drawbacks:
  - It is easy to “overfit” the data
    - Addressed by stopping rules and tree pruning
  - Greedy algorithms may not make globally optimal classifications




Classification Algorithm: K-Nearest Neighbor

- The K-Nearest Neighbor algorithm is an alternative to decision trees for automated classification
- Given a database of classified records, determine the best classification for a new record R
- Use a distance function to determine the “k-closest” neighbors to record R
  - Intuitively, this should identify the records that are “most like” R
- For a given value of k, determine R's k-nearest neighbors and compute which classification best fits R
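
A minimal pure-Python sketch of the idea, assuming numeric attributes and Euclidean distance (the drawbacks slide below notes that choosing a good distance function is the hard part in practice):

    import math
    from collections import Counter

    def knn_classify(r, training_data, k):
        """Classify record r by majority vote among its k nearest neighbors.

        training_data is a list of (point, label) pairs, where each point is
        a numeric tuple; Euclidean distance is assumed for illustration.
        """
        nearest = sorted(training_data, key=lambda pl: math.dist(r, pl[0]))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    data = [((1, 1), "A"), ((1, 2), "A"), ((2, 2), "A"),
            ((5, 5), "B"), ((6, 5), "B"), ((6, 6), "B")]
    print(knn_classify((2, 1), data, k=3))   # -> "A"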



K-Nearest Neighbor Example

[Scatter plot: labeled points along dimensions X and Y, plus one new, unlabeled item]

Question: How to classify the new item (k=1)?



K-Nearest Neighbor Example

[Scatter plot: the same labeled points along dimensions X and Y, plus the new, unlabeled item]

Question: How to classify the new item (k=6)?



K-Nearest Neighbor Benefits and Drawbacks

- Benefits:
  - Easy to implement
  - Easy to explain the results
- Drawbacks:
  - Difficult to get a good distance function in many cases
    - Especially true when dealing with non-continuous variables
    - A poor distance function makes the results essentially worthless
  - The entire dataset is required (and must be evaluated) to classify each new instance
    - Potentially difficult to do interactive analysis on large data sets



Clustering (Affinity Grouping)



Clustering

- The goal of clustering is to divide a database into different groups
  - The groups are not known ahead of time (unlike classification)
  - Each group should be significantly different from the others
  - Members of the same group should be very similar to each other, along some dimension(s)
- Examples:
  - Determine appropriate ways to segment customers
  - Partition geographic regions and divisions
  - Bundle packages of goods for special discounts




Intuition Behind Clustering

- Given many data points, group them into clusters

[Illustration: a scatter of unlabeled data points, then the same points grouped into clusters]



How to do the Grouping? K-Means Algorithm

- K-Means is one of many clustering algorithms available
- The k-means algorithm (to define k clusters):
  1. Place k points into the space represented by the objects that are being clustered. These points represent the initial group centroids
  2. Assign each object to the group that has the closest centroid
  3. When all objects have been assigned, recalculate the positions of the k centroids
  4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated

Clustering algorithm from the Clustering Algorithm Tutorial by Matteo Matteucci, Politecnico di Milano: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html
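
The four steps translate almost directly into code. Below is a compact Python sketch over 2-D points, using random initial centroids and Euclidean distance; these choices are illustrative, and production implementations add smarter initialization and stopping criteria.

    import math
    import random

    def k_means(points, k, max_iters=100):
        centroids = random.sample(points, k)           # step 1: initial centroids
        groups = []
        for _ in range(max_iters):
            groups = [[] for _ in range(k)]
            for p in points:                           # step 2: nearest centroid
                i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
                groups[i].append(p)
            new_centroids = [                          # step 3: recompute centers
                tuple(sum(d) / len(g) for d in zip(*g)) if g else centroids[i]
                for i, g in enumerate(groups)]
            if new_centroids == centroids:             # step 4: stop when stable
                break
            centroids = new_centroids
        return centroids, groups

    pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (5, 1), (5, 2)]
    centers, clusters = k_means(pts, k=3)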



K-Means Algorithm Example: First Iteration, k=3

1. Place k points into the space represented by the objects that are being clustered; these points represent the initial group centroids
2.1 Assign each object to the group that has the closest centroid
3.1 When all objects have been assigned, recalculate the positions of the k centroids
=> Iterate

[Scatter plot: three initial centroids placed among the data points, each point assigned to its closest centroid, and the centroids recomputed]



K-Means Algorithm Example: Second Iteration, k=3

1. Place k points into the space represented by the objects that are being clustered; these points represent the initial group centroids
2.2 Assign each object to the group that has the closest centroid
3.2 When all objects have been assigned, recalculate the positions of the k centroids
=> Iterate again

[Scatter plot: objects reassigned to the moved centroids, and the centroids recomputed again]



K-Means Algorithm Example: Third Iteration, k=3

1. Place k points into the space represented by the objects that are being clustered; these points represent the initial group centroids
2.3 Assign each object to the group that has the closest centroid
3.3 When all objects have been assigned, recalculate the positions of the k centroids
=> No centroids move, so done

[Scatter plot: assignments and centroids are unchanged from the previous iteration, so the algorithm terminates]




Clustering Challenges

- Difficult to get a good distance function in many cases
  - Especially true when dealing with non-continuous variables
  - A poor distance function makes the results essentially worthless
- Clustering algorithms can be slow with large data sets
- Difficult to interpret the results
  - … or to determine which cluster results are “good” or “better”
  - Determining causality vs. random dead-end correlations




Link Analysis



Link Analysis

- Link analysis is a mechanism for identifying relationships among different values in a database
- Association discovery finds rules about items that appear together in a single event
  - Market basket analysis is a very common example
- Sequence discovery finds common sequences of related events over time
  - Click stream analysis on the web is a good example
- Association and sequence discovery are implemented with essentially the same techniques



Associations

- Association rules are written: A => B
  - A is the antecedent statement (or LHS)
  - B is the consequent statement (or RHS)
  - The rule A => B means “if A, then B”
- Example association rule:
  - “If a customer buys hot fudge, they also buy ice cream”
  - Antecedent (A): “customer buys hot fudge”
  - Consequent (B): “customer buys ice cream”




Association Analysis Concepts: Support

- Support is the frequency with which a particular association appears in the database
  - Support(A) = (# of records with A) / (total # of records)
- Example: in a data set of 1000 food store market baskets:
  - Number of baskets that include hot fudge = 75
  - Number of baskets that include ice cream = 50
  - Number of baskets that include hot fudge AND ice cream = 25
  - Number of baskets that include hot fudge AND ice cream AND peanuts = 5
- Support(“hot fudge AND ice cream”) = 25/1000 = 2.5%
- Support(“hot fudge AND ice cream AND peanuts”) = 5/1000 = 0.5%
- Question: what value does support provide?
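
A one-function Python sketch of this calculation, assuming each basket is represented as a set of item names:

    def support(baskets, items):
        """Fraction of baskets that contain every item in `items`."""
        items = set(items)
        return sum(1 for b in baskets if items <= b) / len(baskets)

    # With the slide's counts, 25 of 1000 baskets contain both items, so
    # support(baskets, {"hot fudge", "ice cream"}) comes out to 25/1000 = 2.5%.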



Association Analysis Concepts: Confidence

- In addition to support, we need to know how often the consequent actually occurs when the antecedent does
- Confidence measures how often the consequent occurs, given that the antecedent has occurred
  - Confidence(A => B) = (frequency of A and B) / (frequency of A)
- Example: in a data set of 1000 food store market baskets:
  - Number of baskets that include hot fudge = 75
  - Number of baskets that include ice cream = 50
  - Number of baskets that include peanuts = 50
  - Number of baskets that include hot fudge AND ice cream = 25
  - Number of baskets that include hot fudge AND ice cream AND peanuts = 5
- Confidence(“hot fudge => ice cream”) = 25/75 ≈ 33%
- Confidence(“hot fudge AND ice cream => peanuts”) = 5/25 = 20%
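
The same style of sketch works for confidence: restrict attention to the baskets containing the antecedent, then ask what fraction of those also contain the consequent (baskets again assumed to be sets of items):

    def confidence(baskets, antecedent, consequent):
        """Confidence(A => B): of the baskets containing A, the fraction
        that also contain B."""
        a, b = set(antecedent), set(consequent)
        with_a = [bk for bk in baskets if a <= bk]
        return sum(1 for bk in with_a if b <= bk) / len(with_a)

    # Slide's numbers: 75 baskets contain hot fudge and 25 of those also
    # contain ice cream, so confidence = 25/75, about 33%.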



Association Analysis Concepts: Lift

- Lift is another measure of the strength of an association rule
- Lift measures the influence that an occurrence of A has on the likelihood that B will occur
  - Lift(A => B) = Confidence(A => B) / (frequency of B)
- Example: in a data set of 1000 food store market baskets:
  - Number of baskets that include hot fudge = 75
  - Number of baskets that include ice cream = 50
  - Number of baskets that include peanuts = 50
  - Number of baskets that include hot fudge AND ice cream = 25
  - Number of baskets that include hot fudge AND ice cream AND peanuts = 5
- Lift(“hot fudge => ice cream”) = 0.33 / 0.05 ≈ 6.6
- Lift(“hot fudge AND ice cream => peanuts”) = 0.20 / 0.05 = 4
- What conclusions can we draw re: hot fudge, ice cream, and peanuts?
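
Lift follows directly from the previous two measures: divide the rule's confidence by the consequent's baseline frequency. Another small sketch over baskets represented as item sets:

    def lift(baskets, antecedent, consequent):
        """Lift(A => B) = confidence(A => B) / frequency(B)."""
        a, b = set(antecedent), set(consequent)
        freq_b = sum(1 for bk in baskets if b <= bk) / len(baskets)
        with_a = [bk for bk in baskets if a <= bk]
        conf = sum(1 for bk in with_a if b <= bk) / len(with_a)
        return conf / freq_b

    # Slide's numbers: confidence(hot fudge => ice cream) = 25/75 ≈ 0.33 and
    # frequency(ice cream) = 50/1000 = 0.05, so lift ≈ 6.6: hot fudge buyers
    # are about 6.6x more likely than the average shopper to buy ice cream.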



Selecting Association Rules

- Link analysis algorithms search for association rules with good (high) support, confidence, and lift
- Thresholds for “good” values for these measures will vary depending on application and context
  - Higher numbers indicate stronger rules
- Recurring reminder: application of human interpretation and judgment is essential to effective use of this technique
- Warning: evaluating association rules can quickly become very computationally expensive



In-Class Exercise: Evaluate Association Rules

- Calculate the support, confidence, and lift for the candidate association rules in problem 1 on the handout



Wrap-Up



Goals For Today

- Introduce data mining, the data mining process, and common business scenarios where it is useful
- Understand where data mining fits in the BI universe
  - How it compares to and complements other BI tools
  - Relationship to data warehousing, OLAP, and relational databases
- Introduce common data mining models, concepts, and techniques
  - Classification
  - Clustering
  - Link Analysis