Data Mining

Session 1

Professor Amit Basu

abasu@smu.edu

The Domain of Data Mining

Data mining draws on three fields (drawn as overlapping circles on the slide):

Databases: relational databases, data warehouses, query languages

Artificial Intelligence: information retrieval, expert systems, pattern recognition, case-based reasoning

Statistics: regression, clustering, Bayesian statistics

Data Mining and Statistics


Many data mining models are based on statistical methods

DM is exploratory data analysis

DM deals with very large datasets

DM is a secondary use of the data:

Data was originally collected for other purposes

Data is structured in ways that are often not directly usable for DM

Data preparation is a difficult part of DM

The Knowledge Discovery Process

The process pipeline on the slide:

Databases → Data Scrubbing → Data Warehouse → Data Mining → Data Visualization → Reports, Actions, Monitors

Data Mining Approaches


Description/Summarization


Classification


Estimation


Prediction


Association


Clustering


Graphs


Decision Trees


Rules


Relationships


Templates/Cases

Data Mining Technologies


Data Warehouses and OLAP


Statistics


Neural Networks


Genetic Algorithms


Machine Learning


Association Rules

The Promise of Data Mining


Business Process Improvement


Customer Service


Product Development and Research


Product Quality


Marketing


Segmentation


Target marketing


CRM

The Other Side: Learning Things
that are not True


Patterns may not represent any underlying
rule


Sample may not reflect its parent
population, hence bias


Data may be at the wrong level of detail
(granularity; aggregation)





The Other Side:

Learning Things that are
True, but not Useful


Learning things that are already known


Umbrellas in London


Married people buy baby food



Learning things that cannot be used


More medicine sales after earthquakes

Two Issues for Today’s Session


Preparing Data for Data Mining


Classification modeling with Decision Tree
Models

Preparing Data


Assembling a suitable data set


Relevant


From appropriate sources


Over appropriate period


Adequate


Rich enough for analysis and model type


Missing data


Transforming


Sampling


Partitioning
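A minimal sketch of these preparation steps in Python with pandas; the file name, column names, and sampling fraction are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assemble: load from an appropriate source (hypothetical file and columns)
customers = pd.read_csv("customers.csv")

# Missing data: drop rows that lack the target, impute a numeric input
customers = customers.dropna(subset=["churn"])
customers["income"] = customers["income"].fillna(customers["income"].median())

# Transforming: recode a categorical variable as indicator columns
customers = pd.get_dummies(customers, columns=["region"])

# Sampling: work with a 10% random sample if the full dataset is too large
sample = customers.sample(frac=0.1, random_state=42)

# Partitioning: hold out part of the data for validation/testing
train, test = train_test_split(sample, test_size=0.3, random_state=42)
```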


Decision Trees


A tree structure in which:

Each internal node is labeled with an attribute x_i

Each arc is labeled with a predicate on the attribute in the parent node

Each leaf node is labeled with an output class C_j
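A minimal sketch of this structure in Python (class and field names are illustrative, not from the slides):

```python
from dataclasses import dataclass, field

@dataclass
class Leaf:
    output_class: str            # class C_j at a leaf node

@dataclass
class Node:
    attribute: str               # attribute x_i tested at this internal node
    arcs: list = field(default_factory=list)   # (predicate, subtree) pairs

    def classify(self, record: dict) -> str:
        # Follow the arc whose predicate holds for this record's value
        for predicate, child in self.arcs:
            if predicate(record[self.attribute]):
                return child.classify(record) if isinstance(child, Node) else child.output_class
        raise ValueError("no arc predicate matched")
```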

Example

Name      Gender  Height (m)  Output
Mary      F       1.60        Short
Bill      M       2.00        Tall
Jean      F       1.90        Medium
Sheila    F       1.88        Medium
Steffy    F       1.70        Short
John      M       1.85        Medium
Jo        F       1.60        Short
Dave      M       1.70        Short
Adam      M       2.20        Tall
Steve     M       2.10        Tall
Florence  F       1.80        Medium
Todd      M       1.95        Medium
Katy      F       1.90        Medium
Angie     F       1.80        Medium
Wynona    F       1.75        Medium

[Decision tree on the slide: the root tests Gender; each branch leads to a Height test (splits at <1.7 on one branch and <1.8 on the other) whose leaves are the classes Short, Medium, and Tall.]
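The tree in the figure can be written as nested conditionals. A sketch in Python; the slide shows only the first split on each branch (<1.7 for F, <1.8 for M), so the remaining thresholds are assumptions read off the table:

```python
def classify(gender: str, height: float) -> str:
    if gender == "F":
        # slide shows a split near 1.7; <=1.7 fits Steffy (1.70, Short)
        return "Short" if height <= 1.7 else "Medium"   # no Tall females in the table
    else:
        if height < 1.8:        # split shown on the slide
            return "Short"
        elif height < 2.0:      # assumed second split: Bill (2.00) is Tall
            return "Medium"
        else:
            return "Tall"

print(classify("F", 1.6), classify("M", 1.85), classify("M", 2.1))  # Short Medium Tall
```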

How Does it Work?

1. Partition the data into input and output variables (ideally with a single output variable; with multiple outputs there are many combinations to model)

2. Find a way to break the data set into subsets based on specific values of some variable(s)

3. Construct a decision tree

4. Translate the tree into appropriate rules (a scikit-learn sketch of these steps follows)
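A sketch of the four steps using scikit-learn, one common implementation (the small DataFrame reuses a few rows of the example table; the encoding choices are assumptions):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Step 1: partition into input variables X and a single output variable y
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "M", "F"],
    "Height": [1.6, 2.0, 1.9, 1.7, 2.1, 1.75],
    "Output": ["Short", "Tall", "Medium", "Short", "Tall", "Medium"],
})
X = pd.get_dummies(df[["Gender", "Height"]])   # make Gender numeric
y = df["Output"]

# Steps 2-3: find entropy-reducing splits and construct the tree
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Step 4: translate the tree into readable rules
print(export_text(tree, feature_names=list(X.columns)))
```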


[Scatter plot on the slide: two classes of points, x and o, plotted on axes X1 and X2.]

Example

[The same x/o scatter on axes X1 and X2, now partitioned by two splits (Split1, Split2) that separate most of the x's from the o's.]

Entropy-based Partitioning

The greater the entropy of a dataset, the more mixed its classes are (its features are less distinct)

If a dataset is partitioned and the two partitions have the same class distribution as the whole, the partitioning does not reduce entropy

The goal is to create partitions that minimize entropy

The best case is when partitioning results in zero entropy: each partition contains a single class

Select the split at each node that minimizes the entropy of the resulting data sets
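A small sketch of entropy-based split selection in pure Python (function names and the candidate thresholds are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a set of class labels; 0 means a single class."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(values, labels, threshold):
    """Weighted entropy of the two partitions induced by value <= threshold."""
    left  = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# Pick the candidate split with the lowest resulting entropy
# (equivalently, the highest information gain over the parent node).
heights = [1.6, 2.0, 1.9, 1.7, 2.1, 1.75]
classes = ["Short", "Tall", "Medium", "Short", "Tall", "Medium"]
best = min([1.7, 1.8, 1.9, 2.0], key=lambda t: split_entropy(heights, classes, t))
print(best, split_entropy(heights, classes, best))
```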

Overfitting


When should you stop splitting nodes?


Is it useful to have a tree in which every node is “pure”, i.e., has instances from a single class?


Why/Why not?


Role of validation data
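One common answer, sketched with scikit-learn on synthetic data: grow trees of increasing depth and keep the depth where held-out validation accuracy peaks, rather than splitting until every node is pure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in range(1, 11):
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, t.score(X_tr, y_tr), t.score(X_val, y_val))
# Training accuracy keeps rising with depth; validation accuracy typically
# peaks and then falls once the tree starts fitting noise (overfitting).
```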

[The same x/o scatter on axes X1 and X2, now partitioned by six splits (Split1 through Split6) so that every region is pure, including a region carved out for the lone o that falls among the x's: an overfitted tree.]

Overfitting

Impact of Overfitting

Design Issues


Selection of data, categorization of variables

Customization of the DT run (see the parameter sketch after this list):

accuracy of clusters (entropy level limits)

size (depth) of the decision tree

estimating the gain from a split

support at a node after a split
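These knobs map onto the parameters of a typical DT implementation; a sketch using scikit-learn's names for the slide's concepts:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",         # cluster accuracy: entropy-based splitting
    max_depth=5,                 # size (depth) of the decision tree
    min_impurity_decrease=0.01,  # estimated gain required to make a split
    min_samples_leaf=20,         # support required at a node after a split
)
```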


Applicability of DT/ML

Dimension                    DT/ML     Comments
Accuracy                     mod/high  controllable; prone to overfitting
Explainability               high      intuitive trees and rules
Response speed               high      simple traversal
Compactness                  moderate  depends on algorithms
Scalability                  moderate  learning depends on dataset size
Flexibility                  high      performance may vary with changes
Embeddability                high      relatively self-contained
Tolerance for complexity     moderate  discovers simple relationships
Tolerance for noise          moderate  can ignore outlying values
Tolerance for sparse data    low
Dev. speed                   moderate  setup is non-trivial
Need for experts             moderate  to review rules/trees
Computing resources needed   moderate  depends on dataset & complexity