CSC 5800:
Intelligent Systems:
Algorithms and Tools
Acknowledgement: This lecture is partially based on the slides from
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, "Introduction to Data Mining",
Addison-Wesley (2005).
Course Review
What is Data?
• Collection of data objects and
their attributes
• An attribute is a property or
characteristic of an object
– Examples: eye color of a
person, temperature, etc.
– Attribute is also known as
variable, field,
characteristic, or feature
• A collection of attributes
describe an object
– Object is also known as
record, point, case,
sample, entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(columns are attributes; rows are objects)
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Attribute is a characteristic/feature/property.
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
• But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum
value
Types of Attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a
scale from 1-10), grades, height in {tall, medium,
short}
– Interval
• Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
• The type of an attribute depends on which of the following
properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, −
– Multiplication: *, /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
Attribute Type / Description / Examples / Operations

Nominal
  Description: the values are just different names, i.e., nominal
  attributes provide only enough information to distinguish one object
  from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal
  Description: the values provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval
  Description: the differences between values are meaningful, i.e., a unit
  of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio
  Description: both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass,
  length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating
point variables.
Asymmetric Attributes
• Only presence (a nonzero attribute value) is regarded as
important
• Stored in sparse matrix form
• Examples:
– Words present in documents
– Courses taken by students
– Items present in customer transactions
• It can be either
– Asymmetric Binary
– Asymmetric Discrete
– Asymmetric Continuous
• Most students look very similar if they are compared based
on the courses that they don’t take
Types of Data Sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of
records, each of which consists of a fixed
set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multidimensional space, where each
dimension represents a distinct attribute
• Such a data set can be represented by an m-by-n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
• Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding
term occurs in the document.
Transaction Data
• A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of
products purchased by a customer during one
shopping trip constitute a transaction, while the
individual products that were purchased are the
items.
TID
Items
1
Bread, Coke, Milk
2
Beer, Bread
3
Beer, Coke, Diaper, Milk
4
Beer, Bread, Diaper, Milk
5
Coke, Diaper, Milk
Graph Data
• (1) Data with Relationships among objects
– Examples: Generic graph and HTML Links
(figure: a small generic graph with numeric edge labels)
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Graph Data
• (2) Data with Objects that are Graphs
– Substructure Mining is an important area
– E.g., chemical data: the benzene molecule, C6H6
Ordered Data
• (1) Sequential data: sequences of transactions
– E.g., people who buy DVD players buy DVDs later.
(figure: each element of the sequence is a set of items/events)
Ordered Data
• (2)Sequence data – no time stamps, but
order is still important. E.g. Genome data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• (3)Time Series data – series of some
measurements taken over time
– E.g., financial data
Ordered Data
• (4) Spatio-Temporal Data
Average Monthly
Temperature of
land and ocean
collected for a
variety of
geographical
locations
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Data Aggregation
• Combining two or more attributes (or
objects) into a single attribute (or object)
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale (or resolution)
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
• May remove noise or outliers
Sampling
• Sampling is the main technique employed for
data selection.
– It is often used for both the preliminary investigation of
the data and the final data analysis.
• Statisticians sample because obtaining the
entire set of data of interest is too expensive
or time consuming.
• Sampling is used in data mining because
processing the entire set of data of interest is
too expensive or time consuming.
Sampling …
• The key principle for effective sampling is
the following:
– using a sample will work almost as well as
using the entire data set, if the sample is
representative
– A sample is representative if it has
approximately the same property (of interest)
as the original set of data
Types of Sampling
• Simple Random Sampling
– Sampling without replacement
– Sampling with replacement
• Stratified sampling
Types of Sampling
• Stratified sampling
– Simple random sampling may have very poor
performance in the presence of skew
– Split the data into several partitions; then draw
random samples from each partition
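As a sketch, stratified sampling can be implemented in a few lines of Python (the function name and the toy data below are illustrative, not from the lecture): partition the records by a stratum label, then draw a simple random sample from each partition.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Split the data into strata, then draw a simple random sample
    (without replacement) of the same fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[label_of(r)].append(r)             # partition by stratum label
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Skewed data: 90 records of class "a", only 10 of class "b"
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
s = stratified_sample(data, label_of=lambda r: r[0], fraction=0.1)
# 9 records come from "a" and 1 from "b": the rare stratum is never lost
```

With plain simple random sampling at 10%, the rare class "b" could be missed entirely; stratifying guarantees it is represented.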
Curse of Dimensionality
• When dimensionality increases, data becomes increasingly sparse in
the space that it occupies
• Definitions of density and distance between points, which are critical
for clustering and outlier detection, become less meaningful
• The exponential growth of hypervolume as a function of dimensionality
[Bellman '61]
• For example, 100 evenly-spaced sample points cover a unit interval with
no more than 0.01 distance between points
• Sampling a 10-dimensional unit hypercube with a lattice with a
spacing of 0.01 needs 10^20 sample points
• The 10-dimensional hypercube can be said to be a factor of 10^18 "larger"
than the unit interval.
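The arithmetic behind this example is easy to check: a lattice with spacing 0.01 needs (1/0.01)^d = 100^d points in d dimensions (a quick illustrative computation):

```python
# A lattice with spacing 0.01 needs (1 / 0.01)^d = 100^d points in d dimensions
spacing = 0.01
points_1d = int(round(1 / spacing))   # 100 evenly spaced points cover [0, 1]
points_10d = points_1d ** 10          # lattice for the 10-dimensional unit hypercube
factor = points_10d // points_1d      # how much "larger" the hypercube is

print(points_1d)                # 100
print(points_10d == 10 ** 20)   # True
print(factor == 10 ** 18)       # True
```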
Dimensionality Reduction
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by
data mining algorithms (can be made sublinear)
– Allow data to be more easily visualized, (even if
not into 2 or 3 dimensions, pairwise combination
possibilities are greatly reduced)
– May help to eliminate irrelevant features or reduce
noise
– More interpretable models
Dimensionality Reduction
• Techniques:
– Principal Component Analysis (PCA)
– Locally Linear Embedding (LLE)
– Multidimensional Scaling (MDS)
– ISOMAP
Principal Component Analysis (PCA)
• Also named the discrete Karhunen-Loève transform
(or KLT, named after Kari Karhunen and Michel Loève)
• Also called the Hotelling transform (in honor of Harold
Hotelling)
• PCA is mathematically defined as an orthogonal linear
transformation that transforms the data to a new
coordinate system such that the greatest variance by
any projection of the data comes to lie on the first
coordinate (called the first principal component), the
second greatest variance on the second coordinate, and
so on.
• A line in a 3-D space is actually one-dimensional
Principal Component Analysis (PCA)
• Input: data matrix X [N × M]
• Output: set of transformed coordinates
• Algorithm:
(1) Calculate the empirical mean
(2) Find the covariance matrix; for each pair of attributes X, Y:

    C = (1 / (n − 1)) · Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)
Principal Component Analysis (PCA)
(3) Find the eigenvectors and eigenvalues of
the covariance matrix
(4) Rearrange the eigenvectors and eigenvalues
(5) Compute the cumulative energy content for
each eigenvector
Principal Component Analysis (PCA)
(6) Select a subset of the eigenvectors as basis
vectors
Save the first L columns of V as the M × L
matrix W:
The goal is to choose as small a value of L as
possible while achieving a reasonably high
value of g on a percentage basis.
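Steps (1)-(6) can be sketched with NumPy; this is a hedged illustration under the definitions above (the function name `pca` and the toy data are made up), not the course's reference implementation:

```python
import numpy as np

def pca(X, L):
    """PCA following the steps above: center, covariance, eigendecomposition,
    sort by decreasing eigenvalue, keep the first L eigenvectors as W."""
    mean = X.mean(axis=0)                   # (1) empirical mean
    C = np.cov(X - mean, rowvar=False)      # (2) covariance matrix
    evals, evecs = np.linalg.eigh(C)        # (3) eigenvalues and eigenvectors
    order = np.argsort(evals)[::-1]         # (4) rearrange by decreasing variance
    evals, V = evals[order], evecs[:, order]
    g = np.cumsum(evals) / evals.sum()      # (5) cumulative energy content
    W = V[:, :L]                            # (6) first L columns of V as basis
    return (X - mean) @ W, g

# Points lying almost on a line in 2-D: one component captures nearly everything
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=100)])
Y, g = pca(X, L=1)
# g[0] is close to 1, so L = 1 already achieves a high value of g
```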
MULTIDIMENSIONAL SCALING (MDS)
• Definition: MDS transforms a distance matrix into
a set of coordinates such that the (Euclidean)
distances derived from these coordinates
approximate the original distances as well as possible.
• Basic idea: transform the distance matrix into
a cross-product matrix and then find its
eigendecomposition
• Minimize the stress quantity:

    stress = sqrt( Σ_{ij} (d'_{ij} − d_{ij})² / Σ_{ij} d_{ij}² )
Isometric Feature Mapping (ISOMAP)
• Dashed blue line: Euclidean distance
• Solid blue line: Geodesic Distance (shortest
path distance if we are only allowed to “travel“
along the manifold)
ISOMAP
Step 1: Calculate (Euclidean) distances between all pairs of data
points (i, j) and store them in distance matrix D_X as D_X(i, j)
Step 2: Construct neighborhood graph by connecting two points i
and j
a) if they are closer than ε (ε-Isomap), D_X(i, j) < ε, or
b) if i is one of the K nearest neighbors of j (K-Isomap).
Let the "weight" of the edge between i and j be the distance
between them
Step 3: Compute shortest path between each pair of points (using
Dijkstra's shortest path algorithm for example), store the path
lengths in the matrix D_G as approximate geodesic distances
Step 4: Apply classical MDS to matrix D_G in order to find an
embedding of the data in d-dimensional Euclidean space
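Steps 1-3 can be sketched in plain Python as a K-Isomap variant (illustrative only: `geodesic_distances` is a made-up name, Floyd-Warshall replaces Dijkstra for brevity, and the final MDS step is omitted):

```python
import math

def geodesic_distances(points, k):
    """Steps 1-3 for K-Isomap: pairwise Euclidean distances (D_X),
    a K-nearest-neighbor graph, then all-pairs shortest paths (D_G).
    Floyd-Warshall is used instead of Dijkstra purely for brevity."""
    n = len(points)
    INF = float("inf")
    d = [[math.dist(p, q) for q in points] for p in points]      # Step 1: D_X
    g = [[INF] * n for _ in range(n)]                            # Step 2: graph
    for i in range(n):
        g[i][i] = 0.0
        for j in sorted(range(n), key=lambda j: d[i][j])[1:k + 1]:
            g[i][j] = g[j][i] = d[i][j]   # edge weight = distance to neighbor
    for m in range(n):                    # Step 3: shortest paths -> D_G
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g

# Four points on a line: the geodesic from end to end follows the graph hops
pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
DG = geodesic_distances(pts, k=1)
# DG[0][3] == 3.0; on a curved manifold the geodesic and Euclidean
# distances would differ
```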
Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
– duplicate much or all of the information
contained in one or more other attributes
– Example: (1) purchase price of a product and the
amount of sales tax paid, (2) city, state and zip code.
• Irrelevant features
– contain no information that is useful for the data mining
task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection Techniques
• Brute-force approach:
– Try all possible feature subsets as input to the data mining
algorithm (2^n − 1 subsets)
• Embedded approaches:
– Feature selection occurs naturally as part of the data
mining algorithm
• Filter approaches:
– Features are selected before data mining algorithm is run
• Wrapper approaches:
– Use the data mining algorithm as a black box to find best
subset of attributes
Feature Creation
• Create new attributes that can capture the
important information in a data set much
more efficiently than the original attributes
• Three general methodologies:
– Feature Extraction
• domain-specific
– Mapping Data to New Space
– Feature Construction
• combining features (density=mass/volume)
Mapping Data to a New Space
• Fourier transform
• Wavelet transform
(figures: two sine waves; two sine waves + noise; the frequency domain)
Discretization Without Using Class
Labels
(figures: original data discretized by equal interval width, equal
frequency, and K-means)
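Equal interval width and equal frequency discretization can be sketched as follows (function names are illustrative); note how an outlier distorts equal-width bins but not equal-frequency bins:

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]  # clamp max value

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (roughly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

data = [1, 2, 3, 4, 5, 6, 7, 100]        # one outlier
print(equal_width_bins(data, 4))          # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_frequency_bins(data, 4))      # [0, 0, 1, 1, 2, 2, 3, 3]
```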
Attribute Transformation
• A function that maps the entire set of
values of a given attribute to a new set of
replacement values such that each old
value can be identified with one of the new
values
– Simple functions: x^k, log(x), e^x, |x|
– Standardization and Normalization
Data Normalization
• Min-max normalization: to [new_min_A, new_max_A]

    v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A

– Ex. Let the price range of different products be $12 to $98,
normalized to [0.0, 1.0]. Then $73.60 is mapped to

    ((73.6 − 12) / (98 − 12)) · (1.0 − 0.0) + 0.0 = 0.716
Data Normalization
• Z-score normalization
(μ: mean, σ: standard deviation):

    v' = (v − μ) / σ

• Ex. Let μ = 54, σ = 16. Then (73.6 − 54) / 16 = 1.225
Data Normalization
• Normalization by decimal scaling:

    v' = v / 10^j

where j is the smallest integer such that Max(|v'|) < 1
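The three normalizations can be checked against the slides' worked numbers (a minimal sketch; function names are illustrative):

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73.6, 12, 98), 3))    # 0.716, as on the slide
print(round(z_score(73.6, 54, 16), 3))    # 1.225
print(decimal_scaling([-991, 48, 917]))   # [-0.991, 0.048, 0.917], j = 3
```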
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity
for Simple Attributes
p and q are the attribute values for two data objects.
Euclidean Distance
• Euclidean distance:

    dist(p, q) = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )

where n is the number of dimensions (attributes)
and p_k and q_k are, respectively, the k-th attributes
(components) of data objects p and q.
• Standardization is necessary if scales differ.
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
Minkowski Distance
• Minkowski distance is a generalization
of Euclidean distance:

    dist(p, q) = ( Σ_{k=1}^{n} |p_k − q_k|^r )^{1/r}

where r is a parameter, n is the number of
dimensions (attributes), and p_k and q_k are,
respectively, the k-th attributes (components) of
data objects p and q.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just
the number of bits that are different between two binary vectors
• r = 2. Euclidean distance (L2 norm)
• r → ∞. Chebyshev distance (L_max norm, L_∞ norm)
– This is the maximum difference between any component of the
vectors
• Do not confuse r with n; all these distances are defined
for all numbers of dimensions.
Minkowski Distance: Examples
• How many steps does
the King need to
move from one place
to another? –
Chebyshev Distance
• How many steps does
the Rook need to
move from one place
to another? –
L1 Norm
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0
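The matrix entries can be reproduced with a small helper (an illustrative sketch, using p1 = (0, 2) and p4 = (5, 1) from the point table):

```python
def minkowski(p, q, r):
    """Minkowski distance: r = 1 gives L1, r = 2 gives L2 (Euclidean),
    r = infinity gives the Chebyshev (L_max) distance."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))              # 6.0  (L1 entry for p1, p4)
print(round(minkowski(p1, p4, 2), 3))    # 5.099 (L2 entry)
print(minkowski(p1, p4, float("inf")))   # 5    (L-infinity entry)
```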
Mahalanobis Distance

    mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)^T

where Σ is the covariance matrix of the input data X:

    Σ_{j,k} = (1 / (n − 1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)

For the red points in the figure, the Euclidean distance is 14.7 and the
Mahalanobis distance is 6.
Mahalanobis Distance

Covariance matrix:
Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
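The example can be verified in plain Python, with the 2x2 matrix inverse written out explicitly (a sketch; `mahalanobis2` is a made-up name, and the slide's values are interpreted as squared Mahalanobis distances):

```python
def mahalanobis2(p, q, cov):
    """Squared Mahalanobis distance (p - q) inv(Sigma) (p - q)^T for 2-D
    points, with the 2x2 matrix inverse written out explicitly."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    return (dx * inv[0][0] + dy * inv[1][0]) * dx + \
           (dx * inv[0][1] + dy * inv[1][1]) * dy

cov = ((0.3, 0.2), (0.2, 0.3))
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(round(mahalanobis2(A, B, cov), 6))   # 5.0
print(round(mahalanobis2(A, C, cov), 6))   # 4.0
```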
Common Properties of a Distance
• Distances, such as the Euclidean distance,
have some well known properties.
1. (Positive definiteness) d(p, q) ≥ 0 for all p and q, and
d(p, q) = 0 only if p = q.
2. (Symmetry) d(p, q) = d(q, p) for all p and q.
3. (Triangle Inequality) d(p, r) ≤ d(p, q) + d(q, r) for all
points p, q, and r.
where d(p, q) is the distance (dissimilarity)
between points (data objects), p and q.
• A distance that satisfies these properties is a metric
Common Properties of a Similarity
• Similarities also have some well-known
properties.
1.s(p, q) = 1 (or maximum similarity) only if p = q.
2.s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between
points (data objects), p and q.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only
binary attributes
• Compute similarities using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 1-1 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
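The worked example can be verified directly (an illustrative sketch; the function name is made up):

```python
def binary_similarities(p, q):
    """SMC and Jaccard coefficients from the four match counts."""
    m01 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 1))
    m10 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 0))
    m00 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 0))
    m11 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 1))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(p, q))   # (0.7, 0.0)
```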
Cosine Similarity
• If d1 and d2 are two document vectors, then

    cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),

where • indicates the vector dot product and ||d|| is the length of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
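The example can be reproduced in a few lines (an illustrative sketch):

```python
import math

def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))   # 0.315
```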
Correlation
• Correlation measures the linear relationship
between objects
• To compute correlation, we standardize the
data objects, p and q, and then take their
dot product:

    p'_k = (p_k − mean(p)) / std(p)
    q'_k = (q_k − mean(q)) / std(q)
    correlation(p, q) = p' • q'
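A sketch of the standardize-then-dot-product computation; here the population standard deviation is used and the dot product is divided by n so the result lies in [-1, 1] (an assumption about the convention, which the slide leaves implicit):

```python
import math

def correlation(p, q):
    """Standardize p and q, then take the dot product; dividing by n keeps
    the result in [-1, 1] (population-std convention, assumed here)."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = math.sqrt(sum((x - mp) ** 2 for x in p) / n)
    sq = math.sqrt(sum((x - mq) ** 2 for x in q) / n)
    ps = [(x - mp) / sp for x in p]   # standardized p
    qs = [(x - mq) / sq for x in q]   # standardized q
    return sum(a * b for a, b in zip(ps, qs)) / n

print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0, perfectly linear
print(round(correlation([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0
```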
Using Weights to Combine Similarities
• May not want to treat all attributes the
same.
– Use weights w_k which are between 0 and 1
and sum to 1.
What is data exploration?
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing or
analysis
– Making use of humans’ abilities to recognize patterns
• People can recognize patterns not captured by data
analysis tools
• Related to the area of Exploratory Data Analysis
(EDA)
– Created by statistician John Tukey
– Chapter 1 of the NIST Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
A preliminary exploration of the data to
better understand its characteristics.
Data Exploration Techniques
• In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as
exploratory techniques
– In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
• We will focus on
– Summary statistics
– Visualization
– Online Analytical Processing (OLAP)
Iris Sample Data Set
• Many of the exploratory data techniques are
illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning
Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– Created by Douglas Fisher
– Three flower types (classes):
• Setosa
• Virginica
• Versicolour
– Four (non-class) attributes
• Sepal width and length
• Petal width and length
Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
Summary Statistics
• Summary statistics are numbers that
summarize properties of the data
– Summarized properties include frequency,
location and spread
•
Examples: location  mean
spread  standard deviation
– Most summary statistics can be calculated in a
single pass through the data
Measures of Spread
• Range is the difference between the max and min
• The variance or standard deviation is the most common
measure of the spread of a set of points.
• However, this is also sensitive to outliers, so that other
measures are often used.
Visualization
• Visualization is the conversion of data into a visual or
tabular format so that the characteristics of the data and
the relationships among data items or attributes can be
analyzed or reported.
• Visualization of data is one of the most powerful and
appealing techniques for data exploration.
– Humans have a well developed ability to analyze large
amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
• The following shows the Sea Surface
Temperature (SST) for July 1982
– 250,000 data points are summarized in a single figure
Selection
• Selection is the elimination or the de-emphasis of
certain objects and attributes
• Selection may involve choosing a subset of
attributes
– Dimensionality reduction is often used to reduce the
number of dimensions to two or three
– Alternatively, pairs of attributes can be considered
• Selection may also involve choosing a subset
of objects
– A region of the screen can only show so many points
– Can sample, but want to preserve points in sparse areas
Visualization Techniques: Histograms
• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of
objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
TwoDimensional Histograms
• Show the joint distribution of the values of
two attributes
• Example: petal width and petal length
Visualization Techniques
• Box Plots
– Another way of displaying the distribution of
data (especially percentiles)
(figure: a box plot annotated with an outlier and the 10th, 25th, 50th,
75th, and 90th percentiles)
Example of Box Plots
• Box plots can be used to compare
attributes
Pie Chart
Box Plots for individual classes
Empirical CDFs
Percentile Plots
Scatter Plot Array of Iris Attributes
Contour Plot Example: SST Dec, 1998
Celsius
Visualization of the Iris Data Matrix
standard
deviation
Visualization of the Iris Correlation Matrix
Parallel Coordinates Plots for Iris Data
Star Plots for Iris Data
Setosa
Versicolour
Virginica
Chernoff Faces for Iris Data
Setosa
Versicolour
Virginica
On-Line Analytical Processing (OLAP)
• Proposed by E. F. Codd, the father of the relational
database.
• Relational databases put data into tables, while OLAP
uses a multidimensional array representation.
– Such representations of data previously existed in
statistics and other fields
• There are a number of data analysis and data exploration
operations that are easier with such a data representation.
Creating a Multidimensional Array
• Two key steps in converting tabular data
into a multidimensional array.
– First, identify which attributes are to be the dimensions
and which attribute is to be the target attribute whose
values appear as entries in the multidimensional array.
– Second, find the value of each entry in the
multidimensional array by summing the values (of the
target attribute) of, or counting, all objects that have the
attribute values corresponding to that entry.
Example: Iris data
• We show how the attributes, petal length,
petal width, and species type can be
converted to a multidimensional array
– First, we discretized the petal width and length to have
categorical values: low, medium, and high
– We get the following table  note the count attribute
Example: Iris data (continued)
• Each unique tuple of petal width, petal
length, and species type identifies one
element of the array.
• This element is assigned the corresponding
count value.
• The figure illustrates
the result.
• All nonspecified
tuples are 0.
Example: Iris data (continued)
• Slices of the multidimensional array are
shown by the following crosstabulations
OLAP Operations: Data Cube
• The key operation of OLAP is the formation of a data
cube
• A data cube is a multidimensional representation of data,
together with all possible aggregates.
• By all possible aggregates, we mean the aggregates that
result by selecting a proper subset of the dimensions and
summing over all remaining dimensions.
• For example, if we choose the species type dimension of
the Iris data and sum over all other dimensions, the result
will be a onedimensional entry with three entries, each of
which gives the number of flowers of each type.
A Sample Data Cube
(figure: a three-dimensional data cube with dimensions Product {TV, VCR, PC},
Date {1Qtr, 2Qtr, 3Qtr, 4Qtr}, and Country {U.S.A., Canada, Mexico},
with sums along each dimension, e.g. total annual sales of TV in U.S.A.)
• Consider a data set that records the sales of products at a
number of company stores at various dates.
• This data can be represented
as a three-dimensional array
• There are 3 two-dimensional
aggregates (3 choose 2),
3 one-dimensional aggregates,
and 1 zero-dimensional
aggregate (the overall total)
Data Cube Example
one of the two-dimensional aggregates, along
with two of the one-dimensional
aggregates, and the overall total
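The "all possible aggregates" idea can be sketched with a toy version of this sales array (the data and function names are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

# Toy base data: (product, quarter, country) -> units sold
sales = {
    ("TV", "1Qtr", "USA"): 5, ("TV", "2Qtr", "USA"): 7,
    ("VCR", "1Qtr", "USA"): 2, ("TV", "1Qtr", "Canada"): 3,
}

def aggregate(cells, keep):
    """Sum over every dimension NOT listed in `keep` (dimension indices)."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

# All possible aggregates: pick a subset of the 3 dimensions, sum out the rest
cube = {keep: aggregate(sales, keep)
        for r in range(3) for keep in combinations(range(3), r)}
cube[(0, 1, 2)] = sales               # the base data itself (no aggregation)

print(cube[(0,)])   # per-product totals: {('TV',): 15, ('VCR',): 2}
print(cube[()])     # the overall total: {(): 17}
```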
Data Cube Example (continued)
OLAP Operations: Slicing and Dicing
• Slicing is selecting a group of cells from the
entire multidimensional array by specifying
a specific value for one or more
dimensions.
• Dicing involves selecting a subset of cells
by specifying a range of attribute values.
– This is equivalent to defining a subarray from
the complete array.
• In practice, both operations can also be
accompanied by aggregation over some
dimensions.
OLAP Operations: Roll-up and Drill-down
• Attribute values often have a hierarchical
structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state
(province, etc.), and city.
– Products can be divided into various categories, such
as clothing, electronics, and furniture.
OLAP Operations: Roll-up and Drill-down
• This hierarchical structure gives rise to the
rollup and drilldown operations.
– For sales data, we can aggregate (roll up) the
sales across all the dates in a month.
– Conversely, given a view of the data where the
time dimension is broken into months, we could
split the monthly sales totals (drill down) into
daily sales totals.
– Likewise, we can drill down or roll up on the
location or product ID attributes.
Data Mining
• Classification [Predictive]
• Association Rule Discovery [Descriptive]
• Clustering [Descriptive]
• Anomaly Detection [Predictive]
Illustrating Classification Task

Learn Model (from the training set), then Apply Model (to the test set).

Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Instance-based Learning
• Neural Networks
• Bayes Classification
• Support Vector Machines
• Ensemble Methods
Example of a Decision Tree

Model: decision tree learned from the training data
(splitting attributes: HOwn, MarSt, TaxInc)

HOwn?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES
Another Example of Decision Tree

MarSt?
  Married          -> NO
  Single, Divorced -> HOwn?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K  -> NO
                                 >= 80K -> YES

There could be more than one tree that
fits the same data!
Decision Tree Classification Task

Learn Model, then Apply Model: the decision tree is induced from the
same training set (Tid 1-10) and applied to the same test set
(Tid 11-15) shown earlier.
Apply Model to Test Data

Test Data:
Home Owner  Marital Status  Taxable Income  Cheat
No          Married         80K             ?

Following the tree: HOwn = No leads to the MarSt test, and
MarSt = Married leads to the leaf NO.
Assign Cheat to "No"
Decision Tree Classification Task

Learn Model, then Apply Model: again the decision tree is induced from
the training set (Tid 1-10) and applied to the test set (Tid 11-15)
shown earlier.
Hunt's Algorithm

Applied to the training data shown earlier (Tid 1-10: Home Owner,
Marital Status, Taxable Income, class Cheat), the tree grows step by step:

Step 1: a single leaf: Don't Cheat
Step 2: split on HOwn: Yes -> Don't Cheat; No -> Don't Cheat
Step 3: under HOwn = No, split on Marital Status:
        Married -> Don't Cheat; Single, Divorced -> Cheat
Step 4: under Single, Divorced, split on Taxable Income:
        < 80K -> Don't Cheat; >= 80K -> Cheat
Tree Induction
• Greedy strategy
– Split the records based on an attribute test
that optimizes a certain criterion
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct
values.
  CarType -> {Family | Sports | Luxury}
• Binary split: divides values into two subsets;
need to find optimal partitioning.
  CarType -> {Sports, Luxury} vs {Family}, OR
  CarType -> {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct
values.
  Size -> {Small | Medium | Large}
• Binary split: divides values into two subsets;
need to find optimal partitioning.
  Size -> {Small, Medium} vs {Large}, OR
  Size -> {Medium, Large} vs {Small}
• What about this split?
  Size -> {Small, Large} vs {Medium}
Splitting Based on Continuous Attributes
• Different ways of handling
– Discretization to form an ordinal categorical
attribute
• Static: discretize once at the beginning
• Dynamic: ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering
– Binary decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute intensive
How to determine the Best Split
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
58
How to determine the Best Split
Greedy approach:
– Nodes with homogeneous class distribution
are preferred
Need a measure of node impurity:
Nonhomogeneous,
High degree of impurity
Homogeneous,
Low degree of impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
59
How to Find the Best Split
(Figure: two candidate splits of the same node. Before splitting, the node has class counts C0: N00 and C1: N01, with impurity M0. Splitting on A (Yes/No) yields nodes N1 and N2 with counts (N10, N11) and (N20, N21) and combined weighted impurity M12. Splitting on B (Yes/No) yields nodes N3 and N4 with counts (N30, N31) and (N40, N41) and combined weighted impurity M34.)

Gain = M0 – M12 vs. M0 – M34: choose the split with the larger gain.
Measure of Impurity: GINI
Gini Index for a given node t:

GINI(t) = 1 − Σ_j [p(j | t)]²

(NOTE: p(j | t) is the relative frequency of class j at node t.)
– Maximum (1 − 1/n_c) when records are equally
distributed among all classes, implying least
interesting information
– Minimum (0.0) when all records belong to one class,
implying most interesting information

C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
60
Examples for computing GINI

GINI(t) = 1 − Σ_j [p(j | t)]²

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)² – (5/6)² = 0.278

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)² – (4/6)² = 0.444
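The node-level Gini computation above can be sketched in a few lines of plain Python; `counts` is a hypothetical list of per-class record counts at the node:

```python
# Gini index of a node from its per-class record counts (a minimal sketch).
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5
```

The four calls reproduce the four worked examples on this slide.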
Splitting Based on GINI
When a node p is split into k partitions (children), the
quality of the split is computed as

GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)

where n_i = number of records at child i, and
n = number of records at node p.
61
Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in
the dataset
Use the count matrix to make decisions

Multiway split:
CarType:  Family  Sports  Luxury
C1        1       2       1
C2        4       1       1
Gini = 0.393

Two-way split (find best partition of values):
CarType:  {Sports, Luxury}  {Family}
C1        3                 1
C2        2                 4
Gini = 0.400

CarType:  {Sports}  {Family, Luxury}
C1        2         2
C2        1         5
Gini = 0.419
Continuous Attributes: Computing Gini Index
Use Binary Decisions based on one
value
Several Choices for the splitting value
– Number of possible splitting values
= Number of distinct values
Each splitting value has a count matrix
associated with it
– Class counts in each of the
partitions, A < v and A ≥ v
Simple method to choose best v
– For each v, scan the database to
gather count matrix and compute
its Gini index
– Computationally Inefficient!
Repetition of work.
(Uses the same 10-record taxpayer table as before: Tid, Home Owner, Marital Status, Taxable Income, Cheat.)
62
Continuous Attributes: Computing Gini Index...
For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix
and computing gini index
– Choose the split position that has the least gini index
Sorted values (Taxable Income) with class labels:

Income:  60   70   75   85   90   95   100  120  125  220
Cheat:   No   No   No   Yes  Yes  Yes  No   No   No   No

Candidate split positions (midpoints) and resulting count matrices (≤ v | > v):

v:     55    65    72    80    87    92    97    110   122   172   230
Yes:   0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
No:    0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
Gini:  0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at v = 97 with Gini = 0.300.
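The sorted-scan idea above can be sketched in Python. `best_split` is a hypothetical helper name; it assumes distinct attribute values and tries the midpoint between each pair of adjacent sorted values:

```python
# One-pass search for the best Gini split on a continuous attribute
# (a sketch of the sorted-scan method, assuming distinct values).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    pairs = sorted(zip(values, labels))
    total = {}
    for c in labels:
        total[c] = total.get(c, 0) + 1
    left = {c: 0 for c in total}           # running counts below the cut
    best_cut, best_gini = None, 1.0
    n = len(pairs)
    for i in range(n - 1):
        left[pairs[i][1]] += 1
        cut = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint candidate
        right = {c: total[c] - left[c] for c in total}
        w = ((i + 1) / n) * gini(list(left.values())) \
            + ((n - i - 1) / n) * gini(list(right.values()))
        if w < best_gini:
            best_cut, best_gini = cut, w
    return best_cut, best_gini

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes']
cut, g = best_split(income, cheat)
print(cut, round(g, 3))  # 97.5 0.3
```

On the taxpayer data this recovers the slide's minimum Gini of 0.300 (the midpoint 97.5 corresponds to the split position 97 in the table).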
Alternative Splitting Criteria based on INFO
Entropy at a given node t:

Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)

(NOTE: p(j | t) is the relative frequency of class j at node t.)
– Measures homogeneity of a node.
Maximum (log₂ n_c) when records are equally distributed
among all classes, implying least information
Minimum (0.0) when all records belong to one class,
implying most information
– Entropy-based computations are similar to the
GINI index computations
63
Examples for computing Entropy

Entropy(t) = − Σ_j p(j | t) log₂ p(j | t)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log₂(1/6) – (5/6) log₂(5/6) = 0.65

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log₂(2/6) – (4/6) log₂(4/6) = 0.92
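A minimal sketch of the entropy computation, parallel to the Gini example (the `+ 0.0` just normalizes the `-0.0` produced by pure nodes):

```python
import math

# Entropy of a node from its per-class record counts (a minimal sketch).
def entropy(counts):
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs) + 0.0

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```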
Splitting Based on INFO...
Information Gain:

GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) Entropy(i)

Parent node p is split into k partitions;
n_i is the number of records in partition i
– Measures reduction in entropy achieved because of
the split. Choose the split that achieves the most reduction
(maximizes GAIN)
– Disadvantage: tends to prefer splits that result in a large
number of partitions, each being small but pure.
64
Splitting Based on INFO...
Gain Ratio:

GainRATIO_split = GAIN_split / SplitINFO, where SplitINFO = − Σ_{i=1}^{k} (n_i / n) log₂(n_i / n)

Parent node p is split into k partitions;
n_i is the number of records in partition i
– Adjusts Information Gain by the entropy of the
partitioning (SplitINFO). Higher-entropy partitioning
(large number of small partitions) is penalized!
– Designed to overcome the disadvantage of Information
Gain
Splitting Criteria based on Classification Error
Classification error at a node t:

Error(t) = 1 − max_i P(i | t)

Measures misclassification error made by a node.
Maximum (1 − 1/n_c) when records are equally distributed
among all classes, implying least interesting information
Minimum (0.0) when all records belong to one class, implying
most interesting information
65
Examples for Computing Error
Error(t) = 1 − max_i P(i | t)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 – max(0, 1) = 1 – 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3
Comparison among Splitting Criteria
For a 2class problem:
66
Stopping Criteria for Tree Induction
Stop expanding a node when all the records
belong to the same class
Stop when the number of records has fallen
below some minimum threshold
Early termination (to be discussed later)
Notes on Overfitting
Overfitting can be due to lack of representative
samples or due to some noise
Overfitting results in decision trees that are more
complex than necessary
Training error no longer provides a good estimate
of how well the tree will perform on previously
unseen records
Need new ways for estimating errors
67
Occam’s Razor
Given two models of similar generalization errors,
one should prefer the simpler model over the
more complex model
For complex models, there is a greater chance
that it was fitted accidentally by errors in data
Therefore, one should include model complexity
when evaluating a model
Model Evaluation
Metrics for Performance Evaluation
– How to evaluate the performance of a model?
Methods for Performance Evaluation
– How to obtain reliable estimates?
Methods for Model Comparison
– How to compare the relative performance
among competing models?
68
Metrics for Performance Evaluation
Focus on the predictive capability of a model
– Rather than how fast it takes to classify or
build models, scalability, etc.
Confusion Matrix:

                     PREDICTED CLASS
                     Class=Yes   Class=No
ACTUAL   Class=Yes   a (TP)      b (FN)
CLASS    Class=No    c (FP)      d (TN)

a: TP (true positive); b: FN (false negative);
c: FP (false positive); d: TN (true negative)
Metrics for Performance Evaluation…
Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
69
Limitation of Accuracy
Consider a 2class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
If model predicts everything to be class 0,
accuracy is 9990/10000 = 99.9 %
– Accuracy is misleading because model does
not detect any class 1 example
Cost Matrix
                     PREDICTED CLASS
C(i|j)               Class=Yes     Class=No
ACTUAL   Class=Yes   C(Yes|Yes)    C(No|Yes)
CLASS    Class=No    C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i
Example: cancer vs. non-cancer
70
Computing Cost of Classification
Cost Matrix (note the reward of −1 for a correct positive):

C(i|j)      PREDICTED +   PREDICTED −
ACTUAL +    −1            100
ACTUAL −    1             0

Model M1:
ACTUAL \ PREDICTED   +     −
+                    150   40
−                    60    250
Accuracy = 80%, Cost = 3910

Model M2:
ACTUAL \ PREDICTED   +     −
+                    250   45
−                    5     200
Accuracy = 90%, Cost = 4255
Cost vs Accuracy
Count matrix:
            Class=Yes  Class=No
Class=Yes   a          b
Class=No    c          d

Cost matrix:
            Class=Yes  Class=No
Class=Yes   p          q
Class=No    q          p

N = a + b + c + d
Accuracy = (a + d)/N
Cost = p (a + d) + q (b + c)
     = p (a + d) + q (N – a – d)
     = q N – (q – p)(a + d)
     = N [q – (q – p) × Accuracy]

Accuracy is proportional to cost if
1. C(Yes|No) = C(No|Yes) = q
2. C(Yes|Yes) = C(No|No) = p
71
CostSensitive Measures
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

F-measure is the harmonic mean between r and p
– Joint measure between r and p

Weighted Accuracy = (w₁ a + w₄ d) / (w₁ a + w₂ b + w₃ c + w₄ d)
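The confusion-matrix metrics above can be sketched directly from the four counts; the example numbers are hypothetical:

```python
# Accuracy, precision, recall, and F-measure from confusion-matrix
# counts a (TP), b (FN), c (FP), d (TN) — a minimal sketch.
def metrics(a, b, c, d):
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c)
    recall = a / (a + b)
    f = 2 * recall * precision / (recall + precision)  # harmonic mean
    return accuracy, precision, recall, f

acc, p, r, f = metrics(a=40, b=10, c=10, d=40)
print(round(acc, 2), round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8 0.8
```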
Methods of Estimation
Holdout
– Reserve 2/3 for training and 1/3 for testing
Random subsampling
– Repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k−1 partitions, test on the remaining one
– Leave-one-out: k = n
Stratified sampling
Bootstrap
– Sampling with replacement
72
Training and Testing
Natural performance measure for classification
problems: error rate
– Success: instance’s class is predicted correctly
– Error: instance’s class is predicted incorrectly
– Error rate: proportion of errors made over the
whole set of instances
Resubstitution error: error rate obtained from training
data
Resubstitution error is (hopelessly) optimistic!
Training and Testing
Test set: independent instances that have played no
part in formation of classifier
– Assumption: both training data and test data are
representative samples of the underlying problem
Test and training data may differ in nature
– Example: classifiers built using customer data
from two different towns A and B
– To estimate performance of classifier from town A
in completely new town, test it on data from B
– Assumption : The distribution of test samples are
similar to the training samples
73
Note on parameter tuning
It is important that the test data is not used in any way
to create the classifier
Some learning schemes operate in two stages:
– Stage 1: build the basic structure
– Stage 2: optimize parameter settings
The test data can’t be used for parameter tuning!
Proper procedure uses three sets: training data,
validation data, and test data
Validation data is used to optimize parameters
Generally, the larger the training data the better the
classifier (but returns diminish)
The larger the test data the more accurate the error
estimate
Holdout Procedure
Holdout procedure: method of splitting original data into
training and test set
– Dilemma: ideally both training set and test set should
be large!
What to do if the amount of data is limited?
– The holdout method reserves a certain amount for
testing and uses the remainder for training
Usually: one third for testing, the rest for training
– Problem: the samples might not be representative
– Example: class might be missing in the test data
Advanced version uses stratification
Ensures that each class is represented with
approximately equal proportions in both subsets
74
Holdout Procedure
Advanced version uses stratification
– Ensures that each class is represented with
approximately equal proportions in both subsets
Holdout estimate can be made more reliable by repeating
the process with different subsamples
– In each iteration, a certain proportion is randomly
selected for training (possibly with stratification)
– The error rates on the different iterations are
averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap
– Can we prevent overlapping?
Cross-validation
Cross-validation avoids overlapping test sets
– First step: split data into k subsets of equal size
– Second step: use each subset in turn for testing, the
remainder for training
This procedure is called k-fold cross-validation
Often the subsets are stratified before the cross-validation
is performed
The error estimates are averaged to yield an overall error
estimate
75
More on Cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten?
– Extensive experiments have shown that this is the best
choice to get an accurate estimate
– There is also some theoretical evidence for this
Stratification reduces the estimate’s variance
Even better: repeated stratified cross-validation
– E.g., ten-fold cross-validation is repeated ten times and
results are averaged (reduces the variance)
Leave-One-Out Cross-validation
Leave-One-Out: a particular form of cross-validation:
– Set number of folds to number of training instances
– i.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
Disadvantage of Leave-One-Out CV:
– stratification is not possible
– It guarantees a non-stratified sample because there is
only one instance in the test set!
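The k-fold procedure can be sketched without any library. `kfold_indices` and `train_and_error` are hypothetical helper names; the learner here is a dummy that reports a fixed 10% error per fold:

```python
# k-fold cross-validation sketch: split indices into k folds, hold each
# fold out for testing in turn, and average the per-fold error rates.
def kfold_indices(n, k):
    return [list(range(i, n, k)) for i in range(k)]   # round-robin split

def cross_validate(n, k, train_and_error):
    errors = []
    for test_fold in kfold_indices(n, k):
        held_out = set(test_fold)
        train_idx = [i for i in range(n) if i not in held_out]
        errors.append(train_and_error(train_idx, test_fold))
    return sum(errors) / k

# dummy learner: pretend a fixed 10% error on every fold
est = cross_validate(n=100, k=10, train_and_error=lambda tr, te: 0.1)
print(round(est, 3))  # 0.1
```

Each index lands in exactly one test fold, so the test sets never overlap, which is the point of the method.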
76
The bootstrap
CV uses sampling without replacement
– The same instance, once selected, can not be selected
again for a particular training/test set
The bootstrap uses sampling with replacement to form the
training set
– Sample a dataset of n instances n times with
replacement to form a new dataset of n instances
– Use this data as the training set
– Use the instances from the original dataset that don’t
occur in the new training set for testing
The 0.632 bootstrap
Also called the 0.632 bootstrap
– A particular instance has a probability of 1 – 1/n of not
being picked in one draw
– Thus its probability of ending up in the test data is:

(1 − 1/n)ⁿ ≈ e⁻¹ = 0.368

This means the training data will contain approximately
63.2% of the instances
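A quick numeric check of the 0.632 figure: the probability of an instance never appearing in a bootstrap sample of size n converges to e⁻¹:

```python
import math

# (1 - 1/n)^n approaches e^{-1} ≈ 0.368, so roughly 63.2% of the
# distinct instances end up in the bootstrap training set.
for n in (10, 100, 1000):
    print(n, round((1 - 1 / n) ** n, 3))

print(round(math.exp(-1), 3))  # 0.368
```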
77
Estimating error with the bootstrap
The error estimate on the test data will be very pessimistic
– Trained on just ~63% of the instances
Therefore, combine it with the resubstitution error:

err = 0.632 × e_test + 0.368 × e_training
The resubstitution error gets less weight than the error on
the test data
Repeat process several times with different replacement
samples; average the results
78
Predicting probabilities
Performance measure so far: success rate
Also called 0-1 loss function:
– 0 if prediction is correct
– 1 if prediction is incorrect
Most classifiers produce class probabilities
Depending on the application, we might want to check the
accuracy of the probability estimates
0-1 loss is not the right thing to use in those cases
Quadratic loss function
p₁ … p_k are probability estimates for an instance
c is the index of the instance’s actual class
a₁ … a_k = 0, except for a_c, which is 1
Quadratic loss is: Σ_j (p_j − a_j)²
Want to minimize its expectation over all instances
79
The kappa statistic
Two confusion matrices for a 3class problem:
actual predictor (left) vs. random predictor (right)
Number of successes: sum of entries in diagonal (D)
Kappa statistic:
κ = (D_observed − D_random) / (D_perfect − D_random),
i.e., it measures the relative improvement over a random predictor
Costsensitive Classification
Can take costs into account when making predictions
– Basic idea: only predict highcost class when very
confident about prediction
Given: predicted class probabilities
– Normally we just predict the most likely class
– Here, we should make the prediction that minimizes the
expected cost
– Expected cost: dot product of vector of class
probabilities and appropriate column in cost matrix
80
Costsensitive Learning
So far we haven't taken costs into account at training time
Most learning schemes do not perform costsensitive
learning (Homework 2)
They generate the same classifier no matter what costs are
assigned to the different classes
– Example: standard decision tree learner
Simple methods for costsensitive learning: (Homework 3)
– Resampling of instances according to costs
– Weighting of instances according to costs
Taking costs into account in the training procedure is an
algorithm-specific task.
Lift Charts
In practice, costs are rarely known
Decisions are usually made by comparing possible
scenarios
– Example: promotional mailout to 1,000,000 households
  Mail to all: 0.1% respond (1000)
  Data mining tool identifies a subset of 100,000 most promising: 0.4% of these respond (400); 40% of the responses for 10% of the cost may pay off
  Identify a subset of 400,000 most promising: 0.2% respond (800)
● A lift chart allows a visual comparison
81
Generating Lift Charts
Sort instances according to predicted probability
of being positive:
x axis is sample size
y axis is number of true positives
Sample Lift Chart
(Figure: lift curve; x-axis is sample size, y-axis is number of true positives.)
ROC (Receiver Operating Characteristic)
Developed in 1950s for signal detection theory to
analyze noisy signals
– Characterize the tradeoff between positive
hits and false alarms
ROC curve plots TP (on the yaxis) against FP
(on the xaxis)
Performance of each classifier represented as a
point on the ROC curve
– changing the threshold of algorithm, sample
distribution or cost matrix changes the location
of the point
ROC Curve
(TP,FP):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal
Diagonal line:
– Random guessing
– Below diagonal line:
prediction is opposite of
the true class
83
PrecisionRecall Graphs
Recall: The percentage of the total
relevant documents in a database
retrieved by your search.
– If you knew that there were 1000
relevant documents in a database
and your search retrieved 100 of
these relevant documents, your
recall would be 10%.
Precision: The percentage of relevant
documents in relation to the number of
documents retrieved.
– If your search retrieves 100
documents and 20 of these are
relevant, your precision is 20%.
Summary of the Plots
84
Evaluating Numeric Prediction
Difference: error measures
Actual target values: a1 a2 …an
Predicted target values: p1 p2 … pn
Most popular measure: mean squared error,
MSE = (1/n) Σᵢ (pᵢ − aᵢ)²
Performance measures
85
Test of Significance
Given two models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?
– How much confidence can we place on accuracy of
M1 and M2?
– Can the difference in performance measure be
explained as a result of random fluctuations in the test
set?
Confidence Interval for Accuracy
Consider a model that produces an accuracy of
80% when evaluated on 100 test instances:
– N = 100, acc = 0.8
– Let 1 − α = 0.95 (95% confidence)
– From the probability table, Z_{α/2} = 1.96

1 − α:  0.99  0.98  0.95  0.90
Z:      2.58  2.33  1.96  1.65

N:         50     100    500    1000   5000
p(lower):  0.670  0.711  0.763  0.774  0.789
p(upper):  0.888  0.866  0.833  0.824  0.811
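The interval endpoints in the table follow from the normal-approximation bound used in these slides, solving for p given the observed accuracy. A sketch (`acc_interval` is a hypothetical helper name; the upper bound rounds to 0.867 here, which the slide table lists as 0.866):

```python
import math

# Confidence interval for true accuracy p given observed acc on N
# instances and Z = Z_{alpha/2} (normal approximation).
def acc_interval(acc, n, z):
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

lo, hi = acc_interval(acc=0.8, n=100, z=1.96)
print(round(lo, 3), round(hi, 3))  # 0.711 0.867
```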
86
RuleBased Classifier
Classify records by using a collection of
“if…then…” rules
Rule: (Condition) → y
– where
Condition is a conjunction of attribute tests
y is the class label
– LHS: rule antecedent or condition
– RHS: rule consequent
– Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rulebased Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds
87
Application of RuleBased Classifier
A rule r covers an instance x if the attributes of
the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?
Rule Coverage and Accuracy
Coverage of a rule:
– Fraction of records
that satisfy the
antecedent of a rule
Accuracy of a rule:
– Fraction of records
that satisfy both the
antecedent and
consequent of a
rule
(Example, using the 10-record taxpayer table — Tid, Refund, Marital Status, Taxable Income, Class — shown earlier:)

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
88
Characteristics of RuleBased Classifier
Mutually exclusive rules
– Classifier contains mutually exclusive rules if
the rules are independent of each other
– Every record is covered by at most one rule
Exhaustive rules
– Classifier has exhaustive coverage if it
accounts for every possible combination of
attribute values
– Each record is covered by at least one rule
From Decision Trees To Rules
(Decision tree: Refund? Yes → NO. No → Marital Status? {Married} → NO; {Single, Divorced} → Taxable Income? < 80K → NO, > 80K → YES.)
Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the
tree
89
Rules Can Be Simplified
(Figure: the same decision tree as above, on Refund, Marital Status, and Taxable Income, shown next to the taxpayer table.)

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Effect of Rule Simplification
Rules are no longer mutually exclusive
– A record may trigger more than one rule
– Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive
– A record may not trigger any rules
– Solution?
Use a default class
90
Ordered Rule Set
Rules are rank ordered according to their priority
– An ordered rule set is known as a decision list
When a test record is presented to the classifier
– It is assigned to the class label of the highest ranked rule it has
triggered
– If none of the rules fired, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
Building Classification Rules
Direct Method:
Extract rules directly from data
e.g.: RIPPER, CN2, Holte’s 1R
Indirect Method:
Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
e.g: C4.5rules
91
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion
is met
Example of Sequential Covering
(Figures: (ii) Step 1 grows rule R1; (iii) Step 2: the positives covered by R1 are removed; (iv) Step 3 grows rule R2 on the remaining records.)
Aspects of Sequential Covering
Rule Growing
Instance Elimination
Rule Evaluation
Stopping Criterion
Rule Pruning
93
Rule Growing
Two common strategies: general-to-specific (start from an empty rule and add conjuncts) and specific-to-general (start from a specific rule and drop conjuncts)
Rule Growing (Examples)
CN2 Algorithm:
– Start from an empty conjunct: {}
– Add conjuncts that minimize the entropy measure: {A}, {A,B}, …
– Determine the rule consequent by taking the majority class of instances
covered by the rule
RIPPER Algorithm:
– Start from an empty rule: {} => class
– Add conjuncts that maximize FOIL’s information gain measure:
R0: {} => class (initial rule)
R1: {A} => class (rule after adding conjunct)
Gain(R0, R1) = t × [ log (p1/(p1+n1)) – log (p0/(p0+n0)) ]
where t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
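FOIL's gain can be sketched directly from the definition above (base-2 logs assumed; the counts in the call are hypothetical):

```python
import math

# FOIL's information gain for refining rule R0 into R1, as used by RIPPER:
# Gain = t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))).
def foil_gain(p0, n0, p1, n1, t):
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# e.g., R0 covers 100 positives / 400 negatives; after adding a conjunct,
# R1 covers 80 positives / 20 negatives, all 80 also covered by R0.
print(round(foil_gain(p0=100, n0=400, p1=80, n1=20, t=80), 2))  # 160.0
```

The precision jumps from 0.2 to 0.8 while still covering many positives, so the gain is large.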
94
Instance Elimination
Why do we need to
eliminate instances?
– Otherwise, the next rule is
identical to previous rule
Why do we remove
positive instances?
– Ensure that the next rule is
different
Why do we remove
negative instances?
– Prevent underestimating
accuracy of rule
– Compare rules R2 and R3
in the diagram
Rule Evaluation
Metrics:
– Laplace = (n_c + 1) / (n + k)
– M-estimate = (n_c + k·p) / (n + k)

where
n: number of instances covered by the rule
n_c: number of positive instances covered by the rule
k: number of classes
p: prior probability
95
Stopping Criterion and Rule Pruning
Stopping criterion
– Compute the gain
– If gain is not significant, discard the new rule
Rule Pruning
– Similar to postpruning of decision trees
– Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and
after pruning
If error improves, prune the conjunct
Indirect Methods
96
InstanceBased Classifiers
(Figure: a table of stored cases with attributes Atr1 … AtrN and class labels A/B/C, plus an unseen case whose class is unknown.)
• Store the training records
• Use the training records to predict the class label of unseen cases
NearestNeighbor Classifiers
Requires three things
– The set of stored records
– Distance metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute distance to the training records
– Identify the k nearest neighbors
– Use class labels of nearest neighbors to determine the
class label of the unknown record
(e.g., by taking majority vote)
97
1-Nearest-Neighbor: the decision boundary forms the
Voronoi diagram of the training records
Nearest Neighbor Classification
Compute distance between two points:
– Euclidean distance: d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )
Determine the class from the nearest neighbor list
– take the majority vote of class labels among
the k nearest neighbors
– Weigh the vote according to distance, e.g.,
weight factor w = 1/d²
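A minimal k-NN sketch with Euclidean distance and an unweighted majority vote (the training points and labels are hypothetical):

```python
import math
from collections import Counter

# k-nearest-neighbor classification: Euclidean distance plus a
# majority vote over the k closest stored records.
def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, query, k=3):
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), 'A'), ((1, 2), 'A'), ((6, 6), 'B'),
         ((7, 7), 'B'), ((2, 1), 'A')]
print(knn_classify(train, (1.5, 1.5), k=3))  # A
```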
98
Nearest neighbor Classification…
k-NN classifiers are lazy learners
– They do not build models explicitly
– Unlike eager learners such as decision tree
induction and rule-based systems
– Classifying unknown records is relatively
expensive
Bayes Classifier
A probabilistic framework
for solving classification
problems
Conditional probability:

P(C | A) = P(A, C) / P(A)
P(A | C) = P(A, C) / P(C)

Bayes theorem:

P(C | A) = P(A | C) P(C) / P(A)
99
Example of Bayes Theorem
Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
If a patient has stiff neck, what’s the probability
he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
Bayesian Classifiers
Consider each attribute and class label as random
variables
Given a record with attributes (A₁, A₂, …, Aₙ)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C | A₁, A₂, …, Aₙ)
Can we estimate P(C | A₁, A₂, …, Aₙ) directly from
data?
100
Bayesian Classifiers
Approach:
– compute the posterior probability P(C | A₁, A₂, …, Aₙ) for
all values of C using the Bayes theorem:

P(C | A₁ A₂ … Aₙ) = P(A₁ A₂ … Aₙ | C) P(C) / P(A₁ A₂ … Aₙ)

– Choose the value of C that maximizes
P(C | A₁, A₂, …, Aₙ)
– Equivalent to choosing the value of C that maximizes
P(A₁, A₂, …, Aₙ | C) P(C)
How to estimate P(A₁, A₂, …, Aₙ | C)?
Naïve Bayes Classifier
Assume independence among attributes Aᵢ when class is
given:
– P(A₁, A₂, …, Aₙ | Cⱼ) = P(A₁ | Cⱼ) P(A₂ | Cⱼ) … P(Aₙ | Cⱼ)
– Can estimate P(Aᵢ | Cⱼ) for all Aᵢ and Cⱼ.
– A new point is classified to Cⱼ if P(Cⱼ) Πᵢ P(Aᵢ | Cⱼ) is
maximal.
101
How to Estimate Probabilities from Data?
Class: P(C) = N_c / N
– e.g., P(No) = 7/10,
P(Yes) = 3/10
For discrete attributes:
P(Aᵢ | Cₖ) = |A_ik| / N_c
– where |A_ik| is the number of
instances having attribute value
Aᵢ and belonging to class Cₖ
– Examples:
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0

(Uses the taxpayer table: Tid, Refund, Marital Status, Taxable Income, Evade.)
How to Estimate Probabilities from Data?
For continuous attributes:
– Discretize the range into bins
one ordinal attribute per bin
violates independence assumption
– Two-way split: (A < v) or (A > v)
choose only one of the two splits as new attribute
– Probability density estimation:
Assume attribute follows a normal distribution
Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
Once probability distribution is known, can use it to
estimate the conditional probability P(A
i
c)
102
How to Estimate Probabilities from Data?
Normal distribution:

P(Aᵢ | cⱼ) = 1/√(2π σ²ᵢⱼ) · exp( −(Aᵢ − μᵢⱼ)² / (2 σ²ᵢⱼ) )

– One for each (Aᵢ, cⱼ) pair
For (Income, Class=No):
– If Class=No
sample mean = 110
sample variance = 2975

P(Income=120 | No) = 1/(√(2π) × 54.54) · exp( −(120 − 110)² / (2 × 2975) ) = 0.0072

(The taxpayer table is as before.)
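The density evaluation above checks out numerically; a minimal sketch using the slide's (Income, Class=No) parameters:

```python
import math

# Normal density used for a continuous attribute in naive Bayes.
def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(round(gaussian(120, mean=110, var=2975), 4))  # 0.0072
```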
Example of Naïve Bayes Classifier
Given a test record: X = (Refund=No, Marital Status=Married, Income=120K)

P(Refund=Yes | No) = 3/7
P(Refund=No | No) = 4/7
P(Refund=Yes | Yes) = 0
P(Refund=No | Yes) = 1
P(Marital Status=Single | No) = 2/7
P(Marital Status=Divorced | No) = 1/7
P(Marital Status=Married | No) = 4/7
P(Marital Status=Single | Yes) = 2/7
P(Marital Status=Divorced | Yes) = 1/7
P(Marital Status=Married | Yes) = 0
For taxable income:
If class=No: sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90, sample variance = 25

Naïve Bayes classifier:
P(X | Class=No) = P(Refund=No | No) × P(Married | No) × P(Income=120K | No)
= 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × P(Income=120K | Yes)
= 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes),
P(No | X) > P(Yes | X)
=> Class = No
103
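The two class-conditional products from the worked example can be reproduced directly:

```python
# Naive Bayes comparison for the test record
# X = (Refund=No, Status=Married, Income=120K), using the conditional
# probabilities estimated from the 10-record taxpayer table.
p_x_given_no = (4 / 7) * (4 / 7) * 0.0072   # Refund, Status, Income | No
p_x_given_yes = 1.0 * 0.0 * 1.2e-9          # P(Married | Yes) = 0 zeroes it

print(round(p_x_given_no, 4))  # 0.0024
print(p_x_given_yes)           # 0.0
```

The zero factor for P(Married | Yes) is exactly the problem the Laplace and m-estimate corrections on the next slide address.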
Naïve Bayes Classifier
If one of the conditional probabilities is zero, then
the entire expression becomes zero
Probability estimation:

Original:   P(Aᵢ | C) = N_ic / N_c
Laplace:    P(Aᵢ | C) = (N_ic + 1) / (N_c + c)
m-estimate: P(Aᵢ | C) = (N_ic + m·p) / (N_c + m)

c: number of classes
p: prior probability
m: parameter
Example of Naïve Bayes Classifier
Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals
python no no no no nonmammals
salmon no no yes no nonmammals
whale yes no yes no mammals
frog no no sometimes yes nonmammals
komodo no no no yes nonmammals
bat yes yes no yes mammals
pigeon no yes no yes nonmammals
cat yes no no yes mammals
leopard shark yes no yes no nonmammals
turtle no no sometimes yes nonmammals
penguin no no sometimes yes nonmammals
porcupine yes no no yes mammals
eel no no yes no nonmammals
salamander no no sometimes yes nonmammals
gila monster no no no yes nonmammals
platypus no no no yes mammals
owl no yes no yes nonmammals
dolphin yes no yes no mammals
eagle no yes no yes nonmammals
Give Birth Can Fly Live in Water Have Legs Class
yes no yes no ?
P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A|M)P(M) = 0.06 × 7/20 = 0.021
P(A|N)P(N) = 0.0042 × 13/20 = 0.0027

A: attributes
M: mammals
N: non-mammals

P(A|M)P(M) > P(A|N)P(N)
=> Mammals
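The products above can be verified directly (priors are 7 mammals and 13 non-mammals out of 20):

```python
# Class-conditional products for the test record
# (give birth=yes, can fly=no, live in water=yes, have legs=no)
p_a_m = (6/7) * (6/7) * (2/7) * (2/7)       # ≈ 0.06
p_a_n = (1/13) * (10/13) * (3/13) * (4/13)  # ≈ 0.0042
print(round(p_a_m * 7/20, 3), round(p_a_n * 13/20, 4))  # → 0.021 0.0027
```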
Artificial Neural Networks
PERCEPTRON (Rosenblatt, 1958)
Perceptron Training Rule
Gradient Descent
Incremental Gradient Descent
Sigmoid Unit
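The perceptron training rule updates each weight by Δw_i = η(t − o)x_i. A runnable sketch on the (linearly separable) logical AND function; the learning rate and epoch count are illustrative choices:

```python
def perceptron_train(samples, eta=0.1, epochs=20):
    """Perceptron training rule: w_i += eta * (t - o) * x_i,
    with the bias handled as weight w[0] on a constant input of 1."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in samples:
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0
            w[0] += eta * (t - o)
            for i in range(n):
                w[i + 1] += eta * (t - o) * x[i]
    return w

# Learn logical AND
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = perceptron_train(data)
predict = lambda x: 1 if w[0] + w[1] * x[0] + w[2] * x[1] > 0 else 0
print([predict(x) for x, _ in data])  # → [0, 0, 0, 1]
```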
Network Diagram
[Figure: a feed-forward network with n input nodes x_1, ..., x_n, one hidden layer of k nodes, and one output node y; weights w_ij connect the layers and each unit has a bias b_i.]
Inputs: x_i
Output: y
Weights: w_ij
Biases: b_i
Targets: t
# of Input Nodes: n
# of Hidden Layers: 1
# of Hidden Nodes: k
# of Output Nodes: 1

Cost over Q training examples:

C(w) = \frac{1}{2} \sum_{i=1}^{Q} \bigl(t_i - y(w, x_i)\bigr)^2
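Minimizing this squared-error cost by gradient descent means stepping along the negative gradient, which for a linear unit is Δw = η Σᵢ(tᵢ − yᵢ)xᵢ. A minimal batch sketch (the toy data, step size, and epoch count are illustrative):

```python
def gradient_descent(samples, eta=0.05, epochs=200):
    """Batch gradient descent for a linear unit y = w0 + w1*x,
    minimizing C(w) = 1/2 * sum_i (t_i - y_i)^2."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, t in samples:
            err = t - (w0 + w1 * x)  # dC/dw = -sum(err * x), so step by +eta*sum
            g0 += err
            g1 += err * x
        w0 += eta * g0
        w1 += eta * g1
    return w0, w1

# Fit t = 2x + 1 from four noiseless points
data = [(0, 1), (1, 3), (2, 5), (3, 7)]
w0, w1 = gradient_descent(data)
print(round(w0, 2), round(w1, 2))  # → 1.0 2.0
```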
Backpropagation
Support Vector Machines
Find the hyperplane that maximizes the margin => B1 is better than B2
Support Vector Machines
Decision boundary: \mathbf{w}\cdot\mathbf{x} + b = 0
Margin hyperplanes: \mathbf{w}\cdot\mathbf{x} + b = +1 and \mathbf{w}\cdot\mathbf{x} + b = -1

f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b \le -1 \end{cases}

\text{Margin} = \frac{2}{\|\mathbf{w}\|^2}
Support Vector Machines
We want to maximize: \text{Margin} = \frac{2}{\|\mathbf{w}\|^2}
– Which is equivalent to minimizing: L(w) = \frac{\|\mathbf{w}\|^2}{2}
– But subject to the following constraints:

f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \le -1 \end{cases}

This is a constrained optimization problem
– Numerical approaches to solve it (e.g., quadratic programming)
Support Vector Machines
What if the problem is not linearly separable?
Support Vector Machines
What if the problem is not linearly separable?
– Introduce slack variables
Need to minimize:

L(w) = \frac{\|\mathbf{w}\|^2}{2} + C \sum_{i=1}^{N} \xi_i^k

Subject to:

f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w}\cdot\mathbf{x}_i + b \le -1 + \xi_i \end{cases}

The slack variables ξ_i allow margin violations; C trades off margin width against training error.
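In practice this is solved with quadratic programming, but the same trade-off can be illustrated by subgradient descent on an unconstrained hinge-loss form of the objective. The 1-D toy data, λ, η, and epoch count below are illustrative choices, not part of the slides:

```python
def linear_svm(samples, lam=0.01, eta=0.01, epochs=500):
    """Subgradient descent on lam/2*||w||^2 + mean hinge loss,
    an unconstrained form of the soft-margin problem."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        gw, gb = lam * w, 0.0
        for x, y in samples:
            if y * (w * x + b) < 1:  # margin violation: hinge is active
                gw -= y * x / n
                gb -= y / n
        w -= eta * gw
        b -= eta * gb
    return w, b

# 1-D toy data: class -1 on the left, class +1 on the right
data = [(-3, -1), (-2, -1), (-1, -1), (1, 1), (2, 1), (3, 1)]
w, b = linear_svm(data)
print(all((1 if w * x + b > 0 else -1) == y for x, y in data))  # → True
```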
Nonlinear Support Vector Machines
What if decision boundary is not linear?
Summary
Maximizing the Margin
Linear SVM
Lagrange Multipliers
Linearly Nonseparable SVM
Nonlinear SVM – XOR problem
The Kernel Trick
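The XOR problem illustrates the kernel trick: a degree-2 polynomial kernel implicitly supplies product features such as x₁x₂, under which XOR becomes linearly separable. A sketch with an explicit feature map (the separating weights are chosen by hand for illustration):

```python
# XOR is not linearly separable in (x1, x2), but adding the product
# feature x1*x2 (what a degree-2 polynomial kernel supplies
# implicitly) makes it separable.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def phi(x):
    return (x[0], x[1], x[0] * x[1])  # explicit feature map

def predict(x):
    # In feature space, the plane f1 + f2 - 2*f3 = 0.5 separates XOR
    f1, f2, f3 = phi(x)
    return 1 if f1 + f2 - 2 * f3 > 0.5 else 0

print([predict(x) for x, _ in xor])  # → [0, 1, 1, 0]
```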
Ensemble Methods
[Figure: Learning phase — S training sets T_1, T_2, ..., T_S (different training sets and/or learning algorithms) produce base classifiers h_1, h_2, ..., h_S. Application phase — a test record (x, ?) is classified by the combined model h* = F(h_1, h_2, ..., h_S), producing (x, y*).]
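The two phases in the figure can be sketched with majority voting as the combiner F; the three threshold classifiers below are hypothetical stand-ins for models trained on different training sets, chosen so the vote corrects their individual errors (the true label is 1 for x ≥ 5):

```python
def majority_vote(classifiers, x):
    """h* = F(h_1, ..., h_S): combine base predictions by majority vote."""
    votes = [h(x) for h in classifiers]
    return max(set(votes), key=votes.count)

# Three imperfect base classifiers (true boundary is x >= 5)
h1 = lambda x: 1 if x >= 3 else 0  # errs on x = 3, 4
h2 = lambda x: 1 if x >= 5 else 0  # perfect
h3 = lambda x: 1 if x >= 6 else 0  # errs on x = 5

ensemble = [h1, h2, h3]
print([majority_vote(ensemble, x) for x in (2, 4, 5, 7)])  # → [0, 0, 1, 1]
```

On x = 4, h1 votes incorrectly, but the other two outvote it; this is the basic intuition behind combining classifiers whose errors are not identical.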
Ensemble Methods
Boosting
Bagging
Stacking
Error Correcting Output Coding
Random Forests
How to make an effective ensemble?
Two basic decisions when designing ensembles: