CSC 5800: Intelligent Systems: Algorithms and Tools

Acknowledgement: This lecture is partially based on the slides from
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, “Introduction to Data Mining”,
Addison-Wesley (2005).
Course Review
What is Data?
• Collection of data objects and
their attributes
• An attribute is a property or
characteristic of an object
– Examples: eye color of a
person, temperature, etc.
– Attribute is also known as
variable, field,
characteristic, or feature
• A collection of attributes
describe an object
– Object is also known as
record, point, case,
sample, entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Columns are attributes; rows are objects.)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Attribute is a characteristic/feature/property.
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum
value
Types of Attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}
– Interval
• Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, −
– Multiplication: *, /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
Attribute Type | Description | Examples | Operations

Nominal: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal: the values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio: for ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
Asymmetric Attributes
• Only presence (a non-zero attribute value) is regarded as
important
• Stored in sparse matrix form
• Examples:
– Words present in documents
– Courses taken by students
– Items present in customer transactions
• It can be either
– Asymmetric Binary
– Asymmetric Discrete
– Asymmetric Continuous
• Most students look very similar if they are compared based
on the courses that they don’t take
Types of Data Sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of
records, each of which consists of a fixed
set of attributes
(Example: the Refund / Marital Status / Taxable Income / Cheat table shown earlier.)
Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
• Such data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding
term occurs in the document.
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Graph Data
• (1) Data with Relationships among objects
– Examples: Generic graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Graph Data
• (2) Data with Objects that are Graphs
– Substructure Mining is an important area
– E.g., chemical data – the benzene molecule: C6H6
Ordered Data
• (1) Sequential Data – sequences of transactions
– E.g., people who buy DVD players buy DVDs later.
(Figure: each element of the sequence is a set of items/events.)
Ordered Data
• (2) Sequence data – no time stamps, but order is still important. E.g., genome data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• (3) Time Series data – a series of measurements taken over time
– E.g., financial data
Ordered Data
• (4) Spatio-Temporal Data
– E.g., average monthly temperature of land and ocean collected for a variety of geographical locations
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Data Aggregation
• Combining two or more attributes (or
objects) into a single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale (or resolution)
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
• May remove noise or outliers
Sampling
• Sampling is the main technique employed for
data selection.
– It is often used for both the preliminary investigation of
the data and the final data analysis.
• Statisticians sample because obtaining the
entire set of data of interest is too expensive
or time consuming.
• Sampling is used in data mining because
processing the entire set of data of interest is
too expensive or time consuming.
Sampling …
• The key principle for effective sampling is
the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has
approximately the same property (of interest)
as the original set of data
Types of Sampling
• Simple Random Sampling
– Sampling without replacement
– Sampling with replacement
• Stratified sampling
Types of Sampling
• Stratified sampling
– Simple random sampling may have very poor
performance in the presence of skew
– Split the data into several partitions; then draw
random samples from each partition
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in
the space that it occupies
• Definitions of density and distance between points, which is critical for
clustering and outlier detection, become less meaningful
• The exponential growth of hypervolume as a function of dimensionality [Bellman '61]
• For example, 100 evenly-spaced sample points suffice to sample a unit interval with no more than 0.01 distance between points;
• sampling a 10-dimensional unit hypercube with a lattice of spacing 0.01 needs 10^20 sample points:
• the 10-dimensional hypercube can be said to be a factor of 10^18 "larger" than the unit interval.
Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by
data mining algorithms (can be made sub-linear)
– Allow data to be more easily visualized, (even if
not into 2 or 3 dimensions, pair-wise combination
possibilities are greatly reduced)
– May help to eliminate irrelevant features or reduce
noise
– Better interpretable models
Dimensionality Reduction
• Techniques:
– Principal Component Analysis (PCA)
– Locally Linear Embedding (LLE)
– Multidimensional Scaling (MDS)
– ISOMAP
Principal Component Analysis (PCA)
• Also named the discrete Karhunen-Loève transform
(or KLT, named after Kari Karhunen and Michel Loève)
• Also known as the Hotelling transform (in honor of Harold Hotelling)
• PCA is mathematically defined as an orthogonal linear
transformation that transforms the data to a new
coordinate system such that the greatest variance by
any projection of the data comes to lie on the first
coordinate (called the first principal component), the
second greatest variance on the second coordinate, and
so on.
• A Line in a 3D space is actually one dimensional
Principal Component Analysis (PCA)
• Input: data matrix X[N,M]
• Output: set of transformed coordinates
• Algorithm:
(1) Calculate the empirical mean
(2) Find the covariance matrix; for a pair of attributes x and y,
$C_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar X)(y_i - \bar Y)$
Principal Component Analysis (PCA)
(3) Find the eigenvectors and eigenvalues of
the covariance matrix
(4) Rearrange the eigenvectors and eigenvalues
(5) Compute the cumulative energy content for
each eigenvector
Principal Component Analysis (PCA)
(6) Select a subset of the eigenvectors as basis
vectors
Save the first L columns of V as the M × L
matrix W:
The goal is to choose as small a value of L as
possible while achieving a reasonably high
value of g on a percentage basis.
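As a concrete illustration, here is a minimal NumPy sketch of steps (1)–(6); the function name, the random input, and the choice L = 2 are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def pca(X, L):
    """Project the N x M data matrix X onto its first L principal components."""
    mean = X.mean(axis=0)                  # (1) empirical mean
    Xc = X - mean
    C = np.cov(Xc, rowvar=False)           # (2) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # (3) eigenvectors/eigenvalues
    order = np.argsort(eigvals)[::-1]      # (4) sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    g = np.cumsum(eigvals) / eigvals.sum() # (5) cumulative energy content
    W = eigvecs[:, :L]                     # (6) first L eigenvectors as M x L basis W
    return Xc @ W, g

X = np.random.rand(100, 5)
Y, g = pca(X, 2)
print(Y.shape, g)   # (100, 2) and the cumulative variance explained
```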
MULTIDIMENSIONAL SCALING (MDS)
• Definition: MDS transforms a distance matrix into a set of coordinates such that the (Euclidean) distances derived from these coordinates approximate the original distances as well as possible.
• Basic Idea: transform the distance matrix into a cross-product matrix and then find its eigen-decomposition
• Minimize: the stress quantity
$stress = \sqrt{\frac{\sum_{ij}(d_{ij} - d'_{ij})^2}{\sum_{ij} d_{ij}^2}}$
Isometric Feature Mapping (ISOMAP)
• Dashed blue line: Euclidean distance
• Solid blue line: geodesic distance (the shortest-path distance if we are only allowed to "travel" along the manifold)
ISOMAP
Step 1: Calculate (Euclidean) distances between all pairs of data points (i,j) and store them in distance matrix D_X as D_X(i,j)
Step 2: Construct a neighborhood graph by connecting two points i and j
  a) if they are closer than ε (ε-Isomap), D_X(i,j) < ε, or
  b) if i is one of the K nearest neighbors of j (K-Isomap).
  Let the "weight" of the edge between i and j be the distance between them
Step 3: Compute the shortest path between each pair of points (using Dijkstra's shortest-path algorithm, for example) and store the path lengths in the matrix D_G as approximate geodesic distances
Step 4: Apply classical MDS to matrix D_G in order to find an embedding of the data in d-dimensional Euclidean space
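A short sketch of the four steps using scikit-learn's K-Isomap implementation (assumed available); the data, n_neighbors, and n_components are illustrative.

```python
import numpy as np
from sklearn.manifold import Isomap

X = np.random.rand(200, 10)      # 200 points in 10 dimensions
iso = Isomap(n_neighbors=8,      # Step 2: K-nearest-neighbor graph
             n_components=2)     # Step 4: embed in d = 2 dimensions
Y = iso.fit_transform(X)         # internally: distances, graph, shortest paths, MDS
print(Y.shape)                   # (200, 2)
```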
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
– Duplicate information. Much or all of the information
contained in one or more other attributes
– Example: (1) purchase price of a product and the
amount of sales tax paid, (2) city, state and zip code.
• Irrelevant features
– contain no information that is useful for the data mining
task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection Techniques
• Brute-force approach:
– Try all possible feature subsets (2^n − 1 of them) as input to the data mining algorithm
• Embedded approaches:
– Feature selection occurs naturally as part of the data
mining algorithm
• Filter approaches:
– Features are selected before data mining algorithm is run
• Wrapper approaches:
– Use the data mining algorithm as a black box to find best
subset of attributes
Feature Creation
• Create new attributes that can capture the
important information in a data set much
more efficiently than the original attributes
• Three general methodologies:
– Feature Extraction
• domain-specific
– Mapping Data to New Space
– Feature Construction
• combining features (density=mass/volume)
Mapping Data to a New Space
• Fourier transform
• Wavelet transform
(Example figures: two sine waves; two sine waves + noise; the frequency spectrum after the transform.)
Discretization Without Using Class Labels
(Figures: original data, and discretization by equal interval width, equal frequency, and K-means.)
Attribute Transformation
• A function that maps the entire set of
values of a given attribute to a new set of
replacement values such that each old
value can be identified with one of the new
values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization
Data Normalization
• Min-max normalization: to [new_min_A, new_max_A]
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
– Ex. Let the price range of different products be $12 to $98, normalized to [0.0, 1.0]. Then $73.60 is mapped to (73.6 − 12)/(98 − 12) × (1.0 − 0.0) + 0.0 = 0.716
Data Normalization
• Z-score normalization (μ: mean, σ: standard deviation):
$v' = \frac{v - \mu}{\sigma}$
• Ex. Let μ = 54, σ = 16. Then (73.6 − 54)/16 = 1.225
Data Normalization
• Normalization by decimal scaling:
$v' = \frac{v}{10^j}$
where j is the smallest integer such that Max(|v'|) < 1
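A minimal sketch of the three normalizations, reusing the slide's numbers (price range $12–$98, value $73.60, μ = 54, σ = 16); the value array for decimal scaling is illustrative.

```python
import numpy as np

v = 73.6
print((v - 12) / (98 - 12))        # min-max to [0, 1]: 0.716
print((v - 54) / 16)               # z-score: 1.225

# decimal scaling: j is the smallest integer with max|v'| < 1
values = np.array([12.0, 73.6, 98.0])
j = int(np.ceil(np.log10(np.abs(values).max())))
print(values / 10**j)              # [0.12  0.736 0.98 ]
```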
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
(Table: similarity/dissimilarity definitions for nominal, ordinal, interval, and ratio attributes.)
Euclidean Distance
• Euclidean Distance
$dist = \sqrt{\sum_{k=1}^{n}(p_k - q_k)^2}$
where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the k-th attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix:
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0
Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance
$dist = \left(\sum_{k=1}^{n}|p_k - q_k|^r\right)^{1/r}$
where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
• r = 2. Euclidean distance
• r → ∞. Chebyshev distance (L_max norm, L∞ norm)
– This is the maximum difference between any component of the vectors
• Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance: Examples
• How many steps does
the King need to
move from one place
to another? –
Chebyshev Distance
• How many steps does
the Rook need to
move from one place
to another? –
L1 Norm
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1 Distance Matrix:
     p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0

L2 Distance Matrix:
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

L∞ Distance Matrix:
     p1  p2  p3  p4
p1   0   2   3   5
p2   2   0   1   3
p3   3   1   0   2
p4   5   3   2   0
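A sketch reproducing the three matrices above with SciPy (assumed available); the point array mirrors the table.

```python
import numpy as np
from scipy.spatial.distance import cdist

P = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])    # p1..p4
print(cdist(P, P, 'minkowski', p=1))               # L1 (city block)
print(np.round(cdist(P, P, 'minkowski', p=2), 3))  # L2 (Euclidean)
print(cdist(P, P, 'chebyshev'))                    # L∞ (Chebyshev)
```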
Mahalanobis Distance
$mahalanobis(p, q) = (p - q)\,\Sigma^{-1}\,(p - q)^T$
where Σ is the covariance matrix of the input data X:
$\Sigma_{j,k} = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - \bar X_j)(X_{ik} - \bar X_k)$
For the red points (in the figure), the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
Mahalanobis Distance
Covariance Matrix:
$\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
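A sketch verifying the example with the (p − q) Σ⁻¹ (p − q)ᵀ form used above; points and covariance are taken from the slide.

```python
import numpy as np

Sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
Sinv = np.linalg.inv(Sigma)

def mahalanobis(p, q):
    d = np.asarray(p) - np.asarray(q)
    return d @ Sinv @ d          # the slide's (p-q) Σ⁻¹ (p-q)ᵀ form

print(mahalanobis((0.5, 0.5), (0, 1)))       # 5.0
print(mahalanobis((0.5, 0.5), (1.5, 1.5)))   # 4.0
```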
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
1. (Positive definiteness) d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q.
2. (Symmetry) d(p, q) = d(q, p) for all p and q.
3. (Triangle Inequality) d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
• A distance that satisfies these properties is a metric
Common Properties of a Similarity
• Similarities, also have some well known
properties.
1.s(p, q) = 1 (or maximum similarity) only if p = q.
2.s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between
points (data objects), p and q.
Similarity Between Binary Vectors
• A common situation is that objects, p and q, have only binary attributes
• Compute similarities using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
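A minimal sketch computing both coefficients for the example vectors above.

```python
import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

m11 = np.sum((p == 1) & (q == 1))
m00 = np.sum((p == 0) & (q == 0))
m10 = np.sum((p == 1) & (q == 0))
m01 = np.sum((p == 0) & (q == 1))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)
jac = m11 / (m01 + m10 + m11)
print(smc, jac)   # 0.7 0.0
```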
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) = 0.3150
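The same computation in a few lines of NumPy.

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315
```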
Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects, p and q, and then take their dot product:
p'_k = (p_k − mean(p)) / std(p)
q'_k = (q_k − mean(q)) / std(q)
correlation(p, q) = p' • q'
Using Weights to Combine Similarities
• May not want to treat all attributes the same.
– Use weights w_k which are between 0 and 1 and sum to 1.
What is data exploration?
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing or
analysis
– Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by data
analysis tools
• Related to the area of Exploratory Data Analysis
(EDA)
– Created by statistician John Tukey
– Chapter 1 of the NIST Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
A preliminary exploration of the data to
better understand its characteristics.
Data Exploration Techniques
• In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as
exploratory techniques
– In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
• We will focus on
– Summary statistics
– Visualization
– Online Analytical Processing (OLAP)
Iris Sample Data Set
• Many of the exploratory data techniques are
illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning
Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– Created by Douglas Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes
• Sepal width and length
• Petal width and length
Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
Summary Statistics
• Summary statistics are numbers that
summarize properties of the data
– Summarized properties include frequency,
location and spread
– Examples: location – mean; spread – standard deviation
– Most summary statistics can be calculated in a
single pass through the data
Measures of Spread
• Range is the difference between the max and min
• The variance or standard deviation is the most common
measure of the spread of a set of points.
• However, this is also sensitive to outliers, so that other
measures are often used.
Visualization
• Visualization is the conversion of data into a visual or
tabular format so that the characteristics of the data and
the relationships among data items or attributes can be
analyzed or reported.
• Visualization of data is one of the most powerful and
appealing techniques for data exploration.
– Humans have a well developed ability to analyze large
amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
• The following shows the Sea Surface
Temperature (SST) for July 1982
– 250,000 data points are summarized in a single figure
Selection
• Is the elimination or the de-emphasis of
certain objects and attributes
• Selection may involve choosing a subset of
attributes
– Dimensionality reduction is often used to reduce the
number of dimensions to two or three
– Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset
of objects
– A region of the screen can only show so many points
– Can sample, but want to preserve points in sparse areas
Visualization Techniques: Histograms
• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of
objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
Two-Dimensional Histograms
• Show the joint distribution of the values of
two attributes
• Example: petal width and petal length
Visualization Techniques
• Box Plots
– Another way of displaying the distribution of data (especially percentiles)
(Figure: a box plot marking the 10th, 25th, 50th, 75th, and 90th percentiles, with outliers plotted beyond.)
Example of Box Plots
• Box plots can be used to compare attributes

Pie Chart

Box Plots for individual classes

Empirical CDFs

Percentile Plots

Scatter Plot Array of Iris Attributes
Contour Plot Example: SST Dec, 1998 (color bar in Celsius)

Visualization of the Iris Data Matrix (values standardized; color bar in standard deviations)

Visualization of the Iris Correlation Matrix

Parallel Coordinates Plots for Iris Data
Star Plots for Iris Data
Setosa    Versicolour    Virginica
Chernoff Faces for Iris Data
Setosa    Versicolour    Virginica
On-Line Analytical Processing (OLAP)
• Proposed by E. F. Codd, the father of the relational
database.
• Relational databases put data into tables, while OLAP
uses a multidimensional array representation.
– Such representations of data previously existed in
statistics and other fields
• There are a number of data analysis and data exploration
operations that are easier with such a data representation.
Creating a Multidimensional Array
• Two key steps in converting tabular data
into a multidimensional array.
– First, identify which attributes are to be the dimensions
and which attribute is to be the target attribute whose
values appear as entries in the multidimensional array.
– Second, find the value of each entry in the
multidimensional array by summing the values (of the
target attribute) or count of all objects that have the
attribute values corresponding to that entry.
Example: Iris data
• We show how the attributes, petal length,
petal width, and species type can be
converted to a multidimensional array
– First, we discretized the petal width and length to have
categorical values: low, medium, and high
– We get the following table - note the count attribute
Example: Iris data (continued)
• Each unique tuple of petal width, petal
length, and species type identifies one
element of the array.
• This element is assigned the corresponding
count value.
• The figure illustrates
the result.
• All non-specified
tuples are 0.
Example: Iris data (continued)
• Slices of the multidimensional array are
shown by the following cross-tabulations
OLAP Operations: Data Cube
• The key operation of OLAP is the formation of a data cube
• A data cube is a multidimensional representation of data,
together with all possible aggregates.
• By all possible aggregates, we mean the aggregates that
result by selecting a proper subset of the dimensions and
summing over all remaining dimensions.
• For example, if we choose the species type dimension of
the Iris data and sum over all other dimensions, the result
will be a one-dimensional entry with three entries, each of
which gives the number of flowers of each type.
A Sample Data Cube
(Figure: a three-dimensional data cube with dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC), and Country (U.S.A., Canada, Mexico), together with marginal sums; for example, one cell on a summed face gives the total annual sales of TVs in the U.S.A.)
• Consider a data set that records the sales of products at a
number of company stores at various dates.
• This data can be represented
as a 3 dimensional array
• There are 3 two-dimensional
aggregates (3 choose 2 ),
3 one-dimensional aggregates,
and 1 zero-dimensional
aggregate (the overall total)
Data Cube Example
one of the two dimensional aggregates, along
with two of the one-dimensional
aggregates, and the overall total
Data Cube Example (continued)
OLAP Operations: Slicing and Dicing
• Slicing is selecting a group of cells from the
entire multidimensional array by specifying
a specific value for one or more
dimensions.
• Dicing involves selecting a subset of cells
by specifying a range of attribute values.
– This is equivalent to defining a subarray from
the complete array.
• In practice, both operations can also be
accompanied by aggregation over some
dimensions.
OLAP Operations: Roll-up and Drill-down
• Attribute values often have a hierarchical
structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state
(province, etc.), and city.
– Products can be divided into various categories, such
as clothing, electronics, and furniture.
OLAP Operations: Roll-up and Drill-down
• This hierarchical structure gives rise to the
roll-up and drill-down operations.
– For sales data, we can aggregate (roll up) the
sales across all the dates in a month.
– Conversely, given a view of the data where the
time dimension is broken into months, we could
split the monthly sales totals (drill down) into
daily sales totals.
– Likewise, we can drill down or roll up on the
location or product ID attributes.
Data Mining
• Classification [Predictive]
• Association Rule Discovery [Descriptive]
• Clustering [Descriptive]
• Anomaly Detection [Predictive]
Illustrating Classification Task
Learn Model → Apply Model

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Instance based Learning
• Neural Networks
• Bayes Classification
• Support Vector Machines
• Ensemble Methods
Example of a Decision Tree
Model: Decision Tree learned from the training data.
(Splitting attributes: HOwn (Yes → NO; No → MarSt); MarSt (Married → NO; Single, Divorced → TaxInc); TaxInc (< 80K → NO; > 80K → YES).)

Another Example of Decision Tree
(The same data also fits a tree that splits on MarSt first, then HOwn, then TaxInc.)
There could be more than one tree that fits the same data!
Decision Tree Classification Task
Learn Model → Decision Tree → Apply Model
(Same training set (Tid 1–10) and test set (Tid 11–15) as in the previous slide.)
Apply Model to Test Data

Test Data:
Home Owner  Marital Status  Taxable Income  Cheat
No          Married         80K             ?

(Walking the tree: HOwn = No → MarSt; MarSt = Married → leaf NO.)
Assign Cheat to "No"
Decision Tree Classification Task
(As above: learn a decision tree from the training set, then apply it to the test set.)
Hunt's Algorithm
(Growing the tree on the ten-record Home Owner / Marital Status / Taxable Income / Cheat training set shown earlier, step by step:
1. Start with a single leaf: Don't Cheat.
2. Split on HOwn: Yes → Don't Cheat; No → Don't Cheat.
3. Refine the No branch by Marital Status: Single, Divorced → Cheat; Married → Don't Cheat.
4. Refine the Single, Divorced branch by Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.)
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
• Issues
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values.
  CarType → Family | Sports | Luxury
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  CarType → {Family, Luxury} | {Sports}   OR   CarType → {Sports, Luxury} | {Family}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values.
  Size → Small | Medium | Large
• Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} | {Large}   OR   Size → {Medium, Large} | {Small}
• What about this split? Size → {Small, Large} | {Medium}
  (It does not preserve the order of the values.)

Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
– Binary Decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more compute intensive
Splitting Based on Continuous Attributes
(Figure: binary vs. multi-way splits on a continuous attribute such as Taxable Income.)

How to determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1
Which test condition is the best?
How to determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
– Non-homogeneous: high degree of impurity
– Homogeneous: low degree of impurity

Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
How to Find the Best Split
(Before splitting, the node has class counts C0: N00 and C1: N01, with impurity M0. Splitting on attribute A produces nodes N1 and N2 with class counts (N10, N11) and (N20, N21) and combined impurity M12; splitting on B produces nodes N3 and N4 with combined impurity M34.)
Gain = M0 − M12 vs. M0 − M34
Measure of Impurity: GINI
• Gini Index for a given node t:
$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
(NOTE: p(j | t) is the relative frequency of class j at node t.)
– Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information

C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
Examples for computing GINI
$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Gini = 1 − (1/6)² − (5/6)² = 0.278
C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Gini = 1 − (2/6)² − (4/6)² = 0.444
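A minimal sketch of these computations; `counts` holds the class counts at a node.

```python
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5
```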
Splitting Based on GINI
• When a node p is split into k partitions (children), the quality of the split is computed as
$GINI_{split} = \sum_{i=1}^{k}\frac{n_i}{n}\,GINI(i)$
where n_i = number of records at child i, and n = number of records at node p.
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:
CarType:  Family  Sports  Luxury
C1        1       2       1
C2        4       1       1
Gini = 0.393

Two-way split (find the best partition of values):
CarType:  {Sports, Luxury}  {Family}
C1        3                 1
C2        2                 4
Gini = 0.400

CarType:  {Sports}  {Family, Luxury}
C1        2         2
C2        1         5
Gini = 0.419
Continuous Attributes: Computing Gini Index
• Use binary decisions based on one value
• Several choices for the splitting value
– Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.
(Example data: the ten-record Home Owner / Marital Status / Taxable Income / Cheat table shown earlier.)
Continuous Attributes: Computing Gini Index...
• For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index

Sorted Values (Taxable Income):  60  70  75  85  90  95  100  120  125  220
Class (Cheat):                   No  No  No  Yes Yes Yes No   No   No   No
Candidate split positions:       55  65  72  80  87  92  97   110  122  172  230
Gini at each split:              0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
The best split is Taxable Income ≤ 97 (Gini = 0.300).
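A sketch of this sorted linear scan: one pass over the candidate splits, updating class counts and tracking the best weighted Gini. Split positions are taken as midpoints between adjacent sorted values (so 97.5 rather than the slide's rounded 97).

```python
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])
cheat  = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])   # 1 = Yes

def gini(counts):
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    return 1.0 - np.sum(p ** 2)

n = len(income)
best = (None, 1.0)
for i in range(n + 1):
    left = np.bincount(cheat[:i], minlength=2).astype(float)
    right = np.bincount(cheat[i:], minlength=2).astype(float)
    g = (left.sum() * gini(left) + right.sum() * gini(right)) / n
    v = (income[i - 1] + income[i]) / 2 if 0 < i < n else None
    if g < best[1]:
        best = (v, g)
print(best[0], round(best[1], 3))   # 97.5 0.3
```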
Alternative Splitting Criteria based on INFO
• Entropy at a given node t:
$Entropy(t) = -\sum_j p(j \mid t)\,\log_2 p(j \mid t)$
(NOTE: p(j | t) is the relative frequency of class j at node t.)
– Measures the homogeneity of a node.
• Maximum (log n_c) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are similar to the GINI index computations
Examples for computing Entropy
$Entropy(t) = -\sum_j p(j \mid t)\,\log_2 p(j \mid t)$
C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = −0 log 0 − 1 log 1 = −0 − 0 = 0
C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
Splitting Based on INFO...
• Information Gain:
$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k}\frac{n_i}{n}\,Entropy(i)$
Parent node p is split into k partitions; n_i is the number of records in partition i
– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Splitting Based on INFO...
• Gain Ratio:
$GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where $SplitINFO = -\sum_{i=1}^{k}\frac{n_i}{n}\log\frac{n_i}{n}$
Parent node p is split into k partitions; n_i is the number of records in partition i
– Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
– Designed to overcome the disadvantage of Information Gain
Splitting Criteria based on Classification Error
• Classification error at a node t:
$Error(t) = 1 - \max_i P(i \mid t)$
• Measures the misclassification error made by a node.
• Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for Computing Error
$Error(t) = 1 - \max_i P(i \mid t)$
C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 − max(0, 1) = 1 − 1 = 0
C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
Comparison among Splitting Criteria
For a 2-class problem:
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class
• Stop when the number of records has fallen below some minimum threshold
• Early termination (to be discussed later)

Notes on Overfitting
• Overfitting can be due to a lack of representative samples or to noise
• Overfitting results in decision trees that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways of estimating errors
Occam's Razor
• Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
• For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
• Therefore, one should include model complexity when evaluating a model

Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among competing models?
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
– Rather than how fast it classifies or builds models, scalability, etc.
• Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a (TP)     b (FN)
CLASS    Class=No    c (FP)     d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Metrics for Performance Evaluation…
• Most widely-used metric:
Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
• Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example

Cost Matrix

                     PREDICTED CLASS
C(i|j)               Class=Yes   Class=No
ACTUAL   Class=Yes   C(Yes|Yes)  C(No|Yes)
CLASS    Class=No    C(Yes|No)   C(No|No)

C(i|j): cost of misclassifying a class j example as class i
Example: cancer vs. non-cancer
Computing Cost of Classification
Cost Matrix:
           PREDICTED CLASS
C(i|j)     +     −
ACTUAL +   −1    100
CLASS  −   1     0

Model M1:                    Model M2:
         PREDICTED                    PREDICTED
         +     −                      +     −
ACTUAL + 150   40            ACTUAL + 250   45
       − 60    250                  − 5     200

M1: Accuracy = 80%, Cost = 3910
M2: Accuracy = 90%, Cost = 4255
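A sketch reproducing these numbers: the total cost is the elementwise product of the confusion and cost matrices, summed.

```python
import numpy as np

cost = np.array([[-1, 100],
                 [ 1,   0]])   # rows: actual +/−, cols: predicted +/−

for name, conf in [("M1", np.array([[150, 40], [60, 250]])),
                   ("M2", np.array([[250, 45], [5, 200]]))]:
    acc = np.trace(conf) / conf.sum()
    total_cost = np.sum(conf * cost)
    print(name, acc, total_cost)   # M1: 0.8, 3910; M2: 0.9, 4255
```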
Cost vs Accuracy
Count matrix:
                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   a          b
CLASS    Class=No    c          d

Cost matrix:
                     PREDICTED CLASS
                     Class=Yes  Class=No
ACTUAL   Class=Yes   p          q
CLASS    Class=No    q          p

N = a + b + c + d
Accuracy = (a + d)/N
Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N − a − d)
     = qN − (q − p)(a + d)
     = N[q − (q − p) × Accuracy]
Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
Cost-Sensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
• F-measure is the harmonic mean between r and p
• Joint measure between r and p
Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
Methods of Estimation
• Holdout
– Reserve 2/3 for training and 1/3 for testing
• Random subsampling
– Repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k−1 partitions, test on the remaining one
– Leave-one-out: k = n
• Stratified sampling
• Bootstrap
– Sampling with replacement
Training and Testing
• A natural performance measure for classification problems: error rate
– Success: the instance's class is predicted correctly
– Error: the instance's class is predicted incorrectly
– Error rate: proportion of errors made over the whole set of instances
• Resubstitution error: error rate obtained from the training data
• Resubstitution error is (hopelessly) optimistic!

Training and Testing
• Test set: independent instances that have played no part in the formation of the classifier
– Assumption: both training data and test data are representative samples of the underlying problem
• Test and training data may differ in nature
– Example: classifiers built using customer data from two different towns A and B
– To estimate the performance of a classifier from town A in a completely new town, test it on data from B
– Assumption: the distribution of the test samples is similar to that of the training samples
Note on parameter tuning
• It is important that the test data is not used in any way to create the classifier
• Some learning schemes operate in two stages:
– Stage 1: build the basic structure
– Stage 2: optimize parameter settings
• The test data can't be used for parameter tuning!
• The proper procedure uses three sets: training data, validation data, and test data
• Validation data is used to optimize parameters
• Generally, the larger the training data the better the classifier (but returns diminish)
• The larger the test data the more accurate the error estimate

Holdout Procedure
• Holdout procedure: a method of splitting the original data into training and test sets
– Dilemma: ideally both the training set and the test set should be large!
• What to do if the amount of data is limited?
– The holdout method reserves a certain amount for testing and uses the remainder for training
• Usually: one third for testing, the rest for training
– Problem: the samples might not be representative
– Example: a class might be missing in the test data
• An advanced version uses stratification
• Ensures that each class is represented with approximately equal proportions in both subsets
Holdout Procedure
• Advanced version uses stratification
– Ensures that each class is represented with approximately equal proportions in both subsets
• The holdout estimate can be made more reliable by repeating the process with different subsamples
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimal: the different test sets overlap
• Can we prevent overlapping?
Cross-validation
• Cross-validation avoids overlapping test sets
– First step: split the data into k subsets of equal size
– Second step: use each subset in turn for testing, the remainder for training
• This procedure is called k-fold cross-validation
• Often the subsets are stratified before the cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
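A sketch of stratified k-fold cross-validation with scikit-learn (assumed available); the dataset and classifier are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print(scores.mean())   # mean accuracy over the 10 folds
```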
More on Cross-validation
• Standard method for evaluation: stratified tenfold cross-validation
• Why ten?
– Extensive experiments have shown that this is the best choice to get an accurate estimate
– There is also some theoretical evidence for this
• Stratification reduces the estimate's variance
• Even better: repeated stratified cross-validation
– E.g., tenfold cross-validation is repeated ten times and the results are averaged (reduces the variance)

Leave-One-Out Cross-validation
• Leave-One-Out: a particular form of cross-validation:
– Set the number of folds to the number of training instances
– i.e., for n training instances, build the classifier n times
• Makes best use of the data
• Involves no random sub-sampling
• Very computationally expensive
• Disadvantage of Leave-One-Out CV:
– stratification is not possible
– It guarantees a non-stratified sample because there is only one instance in the test set!
The bootstrap
• CV uses sampling without replacement
– The same instance, once selected, cannot be selected again for a particular training/test set
• The bootstrap uses sampling with replacement to form the training set
– Sample a dataset of n instances n times with replacement to form a new dataset of n instances
– Use this data as the training set
– Use the instances from the original dataset that don't occur in the new training set for testing

The 0.632 bootstrap
• This method is also called the 0.632 bootstrap
– A particular instance has a probability of 1 − 1/n of not being picked on each draw
– Thus its probability of ending up in the test data is (1 − 1/n)^n ≈ e⁻¹ ≈ 0.368
• This means the training data will contain approximately 63.2% of the instances
Estimating error with the bootstrap
• The error estimate on the test data will be very pessimistic
– Trained on just ~63% of the instances
• Therefore, combine it with the resubstitution error:
err = 0.632 × err_test + 0.368 × err_train
• The resubstitution error gets less weight than the error on the test data
• Repeat the process several times with different replacement samples; average the results
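A sketch of one 0.632-bootstrap round; the classifier and dataset are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n = len(X)

idx = rng.integers(0, n, size=n)            # sample n instances with replacement
oob = np.setdiff1d(np.arange(n), idx)       # out-of-bag instances for testing

clf = DecisionTreeClassifier().fit(X[idx], y[idx])
err_test = 1 - clf.score(X[oob], y[oob])    # pessimistic: trained on ~63.2%
err_train = 1 - clf.score(X[idx], y[idx])   # optimistic resubstitution error
err = 0.632 * err_test + 0.368 * err_train  # the 0.632 combination above
print(err)
```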
Predicting probabilities
• Performance measure so far: success rate
• Also called the 0-1 loss function:
– 0 if the prediction is correct
– 1 if the prediction is incorrect
• Most classifiers produce class probabilities
• Depending on the application, we might want to check the accuracy of the probability estimates
• 0-1 loss is not the right thing to use in those cases

Quadratic loss function
• p1 … pk are probability estimates for an instance
• c is the index of the instance's actual class
• a1 … ak = 0, except for ac, which is 1
• Quadratic loss is: Σ_j (p_j − a_j)²
• Want to minimize the expected quadratic loss
The kappa statistic
• Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right)
• Number of successes: sum of the entries in the diagonal (D)
• Kappa statistic: κ = (D_observed − D_random) / (D_perfect − D_random)
measures relative improvement over a random predictor

Cost-sensitive Classification
• Can take costs into account when making predictions
– Basic idea: only predict the high-cost class when very confident about the prediction
• Given: predicted class probabilities
– Normally we just predict the most likely class
– Here, we should make the prediction that minimizes the expected cost
– Expected cost: dot product of the vector of class probabilities and the appropriate column in the cost matrix
Cost-sensitive Learning
• So far we haven't taken costs into account at training time
• Most learning schemes do not perform cost-sensitive learning (Homework 2)
• They generate the same classifier no matter what costs are assigned to the different classes
– Example: standard decision tree learner
• Simple methods for cost-sensitive learning: (Homework 3)
– Resampling of instances according to costs
– Weighting of instances according to costs
• Taking costs into the training procedure is an algorithm-specific task.

Lift Charts
• In practice, costs are rarely known
• Decisions are usually made by comparing possible scenarios
– Example: promotional mailout to 1,000,000 households
– Mail to all: 0.1% respond (1000).
– A data mining tool identifies a subset of the 100,000 most promising households, of which 0.4% respond (400): 40% of the responses for 10% of the cost, which may pay off.
– Or identify a subset of the 400,000 most promising households, of which 0.2% respond (800).
• A lift chart allows a visual comparison
Generating Lift Charts
• Sort instances according to the predicted probability of being positive
• x axis is sample size
• y axis is number of true positives

Sample Lift Chart
(Figure: a lift chart; x axis is sample size, y axis is number of true positives.)
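A sketch of the sort-and-accumulate step behind a lift chart; the labels and scores here are synthetic, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                    # actual labels
y_prob = np.clip(y_true * 0.3 + rng.random(1000), 0, 1)   # mock predicted scores

order = np.argsort(-y_prob)                  # most promising first
cum_tp = np.cumsum(y_true[order])            # y axis: cumulative true positives
sample_size = np.arange(1, len(y_true) + 1)  # x axis: sample size
print(sample_size[99], cum_tp[99])           # true positives in the top 100
```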
ROC (Receiver Operating Characteristic)
• Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms
• An ROC curve plots TP (on the y-axis) against FP (on the x-axis)
• The performance of each classifier is represented as a point on the ROC curve
– Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point

ROC Curve
(TP, FP):
• (0,0): declare everything to be the negative class
• (1,1): declare everything to be the positive class
• (1,0): ideal
• Diagonal line:
– Random guessing
– Below the diagonal line: prediction is opposite of the true class
Precision-Recall Graphs
• Recall: the percentage of the total relevant documents in a database retrieved by your search.
– If you knew that there were 1000 relevant documents in a database and your search retrieved 100 of these relevant documents, your recall would be 10%.
• Precision: the percentage of relevant documents in relation to the number of documents retrieved.
– If your search retrieves 100 documents and 20 of these are relevant, your precision is 20%.

Summary of the Plots
Evaluating Numeric Prediction
• Difference: error measures
• Actual target values: a1, a2, …, an
• Predicted target values: p1, p2, …, pn
• Most popular measure: mean squared error, $\frac{1}{n}\sum_{i=1}^{n}(p_i - a_i)^2$

Performance measures
(Table of further numeric error measures.)
Test of Significance
• Given two models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
– How much confidence can we place on the accuracy of M1 and M2?
– Can the difference in performance be explained as a result of random fluctuations in the test set?

Confidence Interval for Accuracy
• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
– N = 100, acc = 0.8
– Let 1 − α = 0.95 (95% confidence)
– From the probability table, Z_{α/2} = 1.96

1 − α:  0.99  0.98  0.95  0.90
Z:      2.58  2.33  1.96  1.65

N:         50     100    500    1000   5000
p(lower):  0.670  0.711  0.763  0.774  0.789
p(upper):  0.888  0.866  0.833  0.824  0.811
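The bounds in this table can be reproduced with a Wilson-style normal-approximation interval, which I believe is the formula behind this slide; a sketch:

```python
import math

def acc_interval(acc, n, z=1.96):
    # Wilson-style interval for an accuracy estimate on n test instances
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = acc_interval(0.8, n)
    print(n, round(lo, 3), round(hi, 3))   # matches p(lower)/p(upper) above
```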
Rule-Based Classifier
• Classify records by using a collection of "if…then…" rules
• Rule: (Condition) → y
– where
  • Condition is a conjunction of attributes
  • y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
  • (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
  • (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No

Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

The rule R1 covers the hawk => Bird
The rule R3 covers the grizzly bear => Mammal

Rule Coverage and Accuracy
• Coverage of a rule:
– Fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule:
– Fraction of the records satisfying the antecedent that also satisfy the consequent
(Example data: the ten-record Refund / Marital Status / Taxable Income table shown earlier.)
(Status=Single) → No
Coverage = 40%, Accuracy = 50%
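A sketch computing coverage and accuracy of (Status=Single) → No on the ten-record table.

```python
status = ["Single", "Married", "Single", "Married", "Divorced",
          "Married", "Divorced", "Single", "Married", "Single"]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

covered = [i for i in range(len(status)) if status[i] == "Single"]
coverage = len(covered) / len(status)
accuracy = sum(cheat[i] == "No" for i in covered) / len(covered)
print(coverage, accuracy)   # 0.4 0.5
```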
Characteristics of Rule-Based Classifier
• Mutually exclusive rules
– Classifier contains mutually exclusive rules if the rules are independent of each other
– Every record is covered by at most one rule
• Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
– Each record is covered by at least one rule

From Decision Trees To Rules
(Tree: Refund (Yes → NO; No → Marital Status); Marital Status ({Married} → NO; {Single, Divorced} → Taxable Income); Taxable Income (< 80K → NO; > 80K → YES).)
Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
The rule set contains as much information as the tree
Rules Can Be Simplified
(Using the same tree and the ten-record training set as above.)
Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification
• Rules are no longer mutually exclusive
– A record may trigger more than one rule
– Solution?
  • Ordered rule set
  • Unordered rule set – use voting schemes
• Rules are no longer exhaustive
– A record may not trigger any rules
– Solution?
  • Use a default class
Ordered Rule Set
• Rules are rank-ordered according to their priority
– An ordered rule set is known as a decision list
• When a test record is presented to the classifier
– It is assigned to the class label of the highest-ranked rule it has triggered
– If none of the rules fire, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
Building Classification Rules
• Direct Method:
– Extract rules directly from data
– e.g.: RIPPER, CN2, Holte's 1R
• Indirect Method:
– Extract rules from other classification models (e.g., decision trees, neural networks, etc.)
– e.g.: C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion is met

Example of Sequential Covering
(Figure: (ii) Step 1)
Example of Sequential Covering…
(Figure: (iii) Step 2, rule R1; (iv) Step 3, rules R1 and R2)

Aspects of Sequential Covering
• Rule Growing
• Instance Elimination
• Rule Evaluation
• Stopping Criterion
• Rule Pruning
Rule Growing
• Two common strategies: general-to-specific (start from an empty rule and add conjuncts) and specific-to-general (start from a single record and generalize)

Rule Growing (Examples)
• CN2 Algorithm:
  – Start from an empty conjunct: {}
  – Add the conjunct that minimizes the entropy measure: {A}, {A,B}, …
  – Determine the rule consequent by taking the majority class of the instances covered by the rule
• RIPPER Algorithm:
  – Start from an empty rule: {} => class
  – Add the conjunct that maximizes FOIL's information gain measure:
    • R0: {} => class (initial rule)
    • R1: {A} => class (rule after adding a conjunct)
    • Gain(R0, R1) = t [ log (p1/(p1+n1)) – log (p0/(p0+n0)) ]
      where t: number of positive instances covered by both R0 and R1
            p0: number of positive instances covered by R0
            n0: number of negative instances covered by R0
            p1: number of positive instances covered by R1
            n1: number of negative instances covered by R1
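A direct transcription of the gain measure in Python (base-2 logarithms assumed; the slide does not fix the base). Since R1 specializes R0, every positive instance covered by R1 is also covered by R0, so t = p1:

import math

def foil_gain(p0, n0, p1, n1):
    t = p1  # positives covered by both R0 and its specialization R1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Example: R0 covers 100 positives and 400 negatives; adding a conjunct
# yields R1 covering 30 positives and 10 negatives.
print(foil_gain(100, 400, 30, 10))  # ~57.2: the specialized rule is much purer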
Instance Elimination
• Why do we need to eliminate instances?
  – Otherwise, the next rule would be identical to the previous rule
• Why do we remove positive instances?
  – To ensure that the next rule is different
• Why do we remove negative instances?
  – To prevent underestimating the accuracy of the rule
  – Compare rules R2 and R3 in the diagram
Rule Evaluation
• Metrics (n: number of instances covered by the rule; nc: number of positive instances covered by the rule; k: number of classes; p: prior probability):
  – Laplace = (nc + 1) / (n + k)
  – M-estimate = (nc + k·p) / (n + k)
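Both metrics as short Python functions:

def laplace(n_c, n, k):
    # (positives covered + 1) / (total covered + number of classes)
    return (n_c + 1) / (n + k)

def m_estimate(n_c, n, k, p):
    # Shrinks the empirical accuracy n_c/n toward the prior p (strength k)
    return (n_c + k * p) / (n + k)

# A rule covering 4 records, 2 of them positive, in a 2-class problem:
print(laplace(2, 4, 2))          # 0.5
print(m_estimate(2, 4, 2, 0.5))  # 0.5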
Stopping Criterion and Rule Pruning
• Stopping criterion
  – Compute the gain
  – If the gain is not significant, discard the new rule
• Rule Pruning
  – Similar to post-pruning of decision trees
  – Reduced Error Pruning (see the sketch below):
    • Remove one of the conjuncts in the rule
    • Compare the error rate on the validation set before and after pruning
    • If the error improves, prune the conjunct
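A minimal sketch of reduced error pruning, assuming a rule is a list of conjuncts and error(conjuncts, validation_set) is a hypothetical helper returning the rule's validation error rate:

def reduced_error_pruning(rule, validation_set, error):
    improved = True
    while improved and len(rule) > 1:
        improved = False
        baseline = error(rule, validation_set)
        for i in range(len(rule)):
            candidate = rule[:i] + rule[i + 1:]           # drop one conjunct
            if error(candidate, validation_set) < baseline:
                rule, improved = candidate, True          # keep the pruned rule
                break
    return rule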
Indirect Methods
Instance-Based Classifiers
(Figure: a set of stored cases with attributes Atr1 … AtrN and class labels A, B, B, C, A, C, B; an unseen case with the same attributes and an unknown class.)
• Store the training records
• Use the training records to predict the class label of unseen cases
Nearest-Neighbor Classifiers
(Figure: an unknown record plotted among the stored training records.)
• Requires three things
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute its distance to the other training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
1-nearest-neighbor: the decision boundaries induced by the stored records form a Voronoi diagram.
Nearest Neighbor Classification
• Compute the distance between two points:
  – Euclidean distance: d(p, q) = sqrt( Σi (pi − qi)² )
• Determine the class from the nearest-neighbor list
  – Take the majority vote of the class labels among the k nearest neighbors
  – Optionally weigh each vote according to distance
    • weight factor: w = 1/d²
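A small distance-weighted k-NN sketch in Python (the toy data is illustrative):

import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, x, k=3, weighted=True):
    # train: list of (point, label) pairs; x: the unknown record
    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], x))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, x)
        # Weight each vote by 1/d^2; fall back to a plain vote when d = 0
        votes[label] += 1.0 / (d * d) if (weighted and d > 0) else 1.0
    return max(votes, key=votes.get)

train = [((0, 0), "A"), ((0, 1), "A"), ((3, 3), "B"), ((4, 3), "B")]
print(knn_classify(train, (1, 1)))  # "A": the two A points are much closer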
Nearest Neighbor Classification…
• k-NN classifiers are lazy learners
  – They do not build models explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is relatively expensive
Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:

  P(C | A) = P(A, C) / P(A)
  P(A | C) = P(A, C) / P(C)

• Bayes theorem:

  P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem
• Given:
  – A doctor knows that meningitis causes stiff neck 50% of the time
  – The prior probability of any patient having meningitis is 1/50,000
  – The prior probability of any patient having a stiff neck is 1/20
• If a patient has a stiff neck, what is the probability he/she has meningitis?

  P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
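The same arithmetic in Python:

p_s_given_m = 0.5       # P(stiff neck | meningitis)
p_m = 1 / 50000         # prior P(meningitis)
p_s = 1 / 20            # prior P(stiff neck)
print(p_s_given_m * p_m / p_s)  # Bayes theorem: 0.0002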
Bayesian Classifiers
• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An)
  – The goal is to predict class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?
Bayesian Classifiers
• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:

    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)

  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
• How do we estimate P(A1, A2, …, An | C)?
Naïve Bayes Classifier
• Assume independence among the attributes Ai when the class is given:
  – P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  – P(Ai | Cj) can be estimated for all Ai and Cj
  – A new point is classified to Cj if P(Cj) Πi P(Ai | Cj) is maximal
How to Estimate Probabilities from Data?
• Class prior: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai that belong to class Ck
  – Examples:
    P(Status=Married | No) = 4/7
    P(Refund=Yes | Yes) = 0
(Training data: the same 10-record table as above, with the class label named Evade.)
How to Estimate Probabilities from Data?
• For continuous attributes:
  – Discretize the range into bins
    • one ordinal attribute per bin
    • violates the independence assumption
  – Two-way split: (A < v) or (A > v)
    • choose only one of the two splits as the new attribute
  – Probability density estimation:
    • Assume the attribute follows a normal distribution
    • Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
    • Once the probability distribution is known, we can use it to estimate the conditional probability P(Ai | c)
How to Estimate Probabilities from Data?
• Normal distribution: one for each (Ai, ci) pair
• For (Income, Class=No):
  – sample mean = 110
  – sample variance = 2975
(Training data: the same 10-record table as above.)
  P(Ai | cj) = 1 / sqrt(2π σij²) × exp( −(Ai − μij)² / (2 σij²) )

  P(Income=120 | No) = 1 / (sqrt(2π) × 54.54) × exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
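The density evaluated in Python (matching the 0.0072 above):

import math

def gaussian(x, mean, var):
    # Normal density used by naive Bayes for a continuous attribute
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(gaussian(120, 110, 2975))  # ~0.0072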
Example of Naïve Bayes Classifier

P(Refund=Yes | No) = 3/7
P(Refund=No | No) = 4/7
P(Refund=Yes | Yes) = 0
P(Refund=No | Yes) = 1
P(Marital Status=Single | No) = 2/7
P(Marital Status=Divorced | No) = 1/7
P(Marital Status=Married | No) = 4/7
P(Marital Status=Single | Yes) = 2/7
P(Marital Status=Divorced | Yes) = 1/7
P(Marital Status=Married | Yes) = 0
For taxable income:
  If class=No: sample mean = 110, sample variance = 2975
  If class=Yes: sample mean = 90, sample variance = 25

Given a test record: X = (Refund=No, Married, Income=120K)

P(X | Class=No) = P(Refund=No | Class=No) × P(Married | Class=No) × P(Income=120K | Class=No)
                = 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Class=Yes) = P(Refund=No | Class=Yes) × P(Married | Class=Yes) × P(Income=120K | Class=Yes)
                 = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X) ==> Class = No
Naïve Bayes Classifier
• If one of the conditional probabilities is zero, the entire expression becomes zero
• Probability estimation (c: number of classes; p: prior probability; m: parameter):

  Original:   P(Ai | C) = Nic / Nc
  Laplace:    P(Ai | C) = (Nic + 1) / (Nc + c)
  m-estimate: P(Ai | C) = (Nic + m·p) / (Nc + m)
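The three estimates in Python; the example shows how Laplace smoothing repairs the zero from P(Refund=Yes | Yes) = 0/3 above:

def original(n_ic, n_c):
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    # Never zero, so a single unseen attribute value no longer
    # wipes out the whole naive Bayes product.
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    # Blends the empirical estimate with the prior p, with strength m.
    return (n_ic + m * p) / (n_c + m)

print(original(0, 3))             # 0.0
print(laplace_estimate(0, 3, 2))  # 0.2 (c = 2 classes)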
Example of Naïve Bayes Classifier
Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals
python no no no no non-mammals
salmon no no yes no non-mammals
whale yes no yes no mammals
frog no no sometimes yes non-mammals
komodo no no no yes non-mammals
bat yes yes no yes mammals
pigeon no yes no yes non-mammals
cat yes no no yes mammals
leopard shark yes no yes no non-mammals
turtle no no sometimes yes non-mammals
penguin no no sometimes yes non-mammals
porcupine yes no no yes mammals
eel no no yes no non-mammals
salamander no no sometimes yes non-mammals
gila monster no no no yes non-mammals
platypus no no no yes mammals
owl no yes no yes non-mammals
dolphin yes no yes no mammals
eagle no yes no yes non-mammals
Given a test record:

Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

A: attributes of the test record, M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027

P(A | M) P(M) > P(A | N) P(N) ==> Mammals
Artificial Neural Networks
PERCEPTRON (Rosenblatt, 1958)
Perceptron Training Rule
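The slide's equations did not survive extraction; the standard perceptron training rule is wi ← wi + η (t − o) xi, with target t and thresholded output o in {−1, +1}. A minimal sketch under that convention:

def perceptron_update(w, b, x, t, eta=0.1):
    # o: thresholded output; the weights only move when o != t
    o = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
    w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    b = b + eta * (t - o)
    return w, b

# One pass over a toy dataset with targets in {-1, +1}:
w, b = [0.0, 0.0], 0.0
for x, t in [([0.0, 1.0], 1), ([1.0, 0.0], -1)]:
    w, b = perceptron_update(w, b, x, t)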
Gradient Descent
Incremental Gradient Descent
Sigmoid Unit
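These slides were mostly figures and equations; as one compact sketch of what they cover, a single sigmoid unit y = σ(w·x + b) trained by incremental (per-example) gradient descent on the squared error (conventions assumed, not taken from the slides):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(data, eta=0.5, epochs=2000):
    # Incremental gradient descent: update after every example rather than
    # summing the gradient over the whole training set.
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, t in data:
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = (y - t) * y * (1 - y)  # d(1/2 (t - y)^2) / d(w.x + b)
            w = [wi - eta * grad * xi for wi, xi in zip(w, x)]
            b -= eta * grad
    return w, b

# Learn the (linearly separable) OR function:
w, b = train_sigmoid_unit([([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)])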
Network Diagram
(Figure: a feed-forward network with inputs x1, …, xn, one hidden layer, and a single output y; the weights wij connect the layers, each unit has a bias bi, and t denotes the target.)
• Inputs: xi   Output: y   Weights: wij   Biases: bi   Targets: t
• # of Input Nodes: n
• # of Hidden Layers: 1
• # of Hidden Nodes: k
• # of Output Nodes: 1

Cost function over Q training examples:

  C(w) = (1/Q) Σ_{i=1}^{Q} ( t(i) − y(i, w, x) )²
Backpropagation
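The backpropagation slide was a figure; here is a compact sketch for the one-hidden-layer network above, assuming sigmoid activations and the squared-error cost C:

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, t, W1, b1, w2, b2, eta=0.5):
    k = len(b1)
    # Forward pass: hidden activations h, network output y
    h = [sigmoid(sum(W1[j][i] * x[i] for i in range(len(x))) + b1[j]) for j in range(k)]
    y = sigmoid(sum(w2[j] * h[j] for j in range(k)) + b2)
    # Backward pass: error terms for C = 1/2 (t - y)^2
    delta_out = (y - t) * y * (1 - y)
    delta_hid = [delta_out * w2[j] * h[j] * (1 - h[j]) for j in range(k)]
    # Gradient-descent weight updates
    for j in range(k):
        w2[j] -= eta * delta_out * h[j]
        for i in range(len(x)):
            W1[j][i] -= eta * delta_hid[j] * x[i]
        b1[j] -= eta * delta_hid[j]
    b2 -= eta * delta_out
    return W1, b1, w2, b2

# Train on XOR (needs the hidden layer; convergence depends on the random init):
random.seed(0)
k = 2
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(k)]
b1, w2, b2 = [0.0] * k, [random.uniform(-1, 1) for _ in range(k)], 0.0
for _ in range(5000):
    for x, t in [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]:
        W1, b1, w2, b2 = backprop_step(x, t, W1, b1, w2, b2)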
Support Vector Machines
• Find the hyperplane that maximizes the margin ==> B1 is better than B2

Support Vector Machines
• Decision boundary: w · x + b = 0; margin hyperplanes: w · x + b = 1 and w · x + b = −1

  f(x) = 1 if w · x + b ≥ 1
        −1 if w · x + b ≤ −1

  Margin = 2 / ||w||
Support Vector Machines
• We want to maximize: Margin = 2 / ||w||
  – This is equivalent to minimizing: L(w) = ||w||² / 2
  – Subject to the following constraints:

    f(xi) = 1 if w · xi + b ≥ 1
           −1 if w · xi + b ≤ −1

    i.e., yi (w · xi + b) ≥ 1 for every training record (xi, yi)
• This is a constrained optimization problem
  – Numerical approaches can solve it (e.g., quadratic programming)
Support Vector Machines
• What if the problem is not linearly separable?
Support Vector Machines
• What if the problem is not linearly separable?
  – Introduce slack variables ξi
  – Need to minimize:

    L(w) = ||w||² / 2 + C Σ_{i=1}^{N} ξi^k

  – Subject to:

    f(xi) = 1 if w · xi + b ≥ 1 − ξi
           −1 if w · xi + b ≤ −1 + ξi
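The slides solve this with quadratic programming; as an illustrative alternative (not from the slides), a subgradient-descent sketch of the equivalent hinge-loss form of the k = 1 objective, L(w) = ||w||²/2 + C Σi max(0, 1 − yi(w·xi + b)):

def svm_subgradient(data, C=1.0, eta=0.01, epochs=2000):
    # data: list of (x, y) pairs with y in {-1, +1}
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0  # gradient of ||w||^2 / 2 is w itself
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:  # slack > 0
                gw = [gwi - C * y * xi for gwi, xi in zip(gw, x)]
                gb -= C * y
        w = [wi - eta * gwi for wi, gwi in zip(w, gw)]
        b -= eta * gb
    return w, b

w, b = svm_subgradient([([2, 2], 1), ([2, 3], 1), ([0, 0], -1), ([0, 1], -1)])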
Large Margin
Nonlinear Support Vector Machines
• What if the decision boundary is not linear?
Summary
• Maximizing the Margin
• Linear SVM
• Lagrange Multipliers
• Linearly Non-separable SVM
• Nonlinear SVM – the XOR problem
• The Kernel Trick
Ensemble Methods
• Learning phase: build base classifiers h1, h2, …, hS from different training sets T1, T2, …, TS (drawn from the data T) and/or with different learning algorithms
• Application phase: given a test record (x, ?), combine the base predictions, h* = F(h1, h2, …, hS), to output the final prediction (x, y*)
Ensemble Methods
• Boosting
• Bagging (see the sketch after this list)
• Stacking
• Error Correcting Output Coding
• Random Forests
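A minimal bagging sketch: train each base classifier on a bootstrap sample and combine by majority vote (base_learner is a hypothetical fit function returning a callable classifier):

import random
from collections import Counter

def bagging_fit(records, base_learner, S=10, seed=0):
    # Each of the S training sets is a bootstrap sample (drawn with
    # replacement) of the original training data.
    rng = random.Random(seed)
    models = []
    for _ in range(S):
        sample = [rng.choice(records) for _ in range(len(records))]
        models.append(base_learner(sample))
    return models

def bagging_predict(models, x):
    # h* = F(h1, ..., hS): combine the base predictions by majority vote
    votes = Counter(h(x) for h in models)
    return votes.most_common(1)[0][0]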
How to make an effective ensemble?
Two basic decisions when designing ensembles:
1