CPSC 7373: Artificial Intelligence
Lecture 6: Machine Learning
Jiang Bian, Fall 2012
University of Arkansas at Little Rock
Machine Learning
• ML is a branch of artificial intelligence
  – Take empirical data as input
  – And yield patterns or predictions thought to be features of the underlying mechanism that generated the data.
• Three frontiers for machine learning:
  – Data mining: using historical data to improve decisions
    • Medical records → medical knowledge
  – Software applications that we can’t program
    • Autonomous driving
    • Speech recognition
  – Self-learning programs
    • Google ads that learn user interests
Machine Learning
• Bayes networks:
  – Reasoning with known models
• Machine learning:
  – Learn models from data
    • Supervised learning
    • Unsupervised learning
Patient diagnosis
• Given:
  – 9714 patient records, each describing a pregnancy and birth
  – Each patient record contains 215 features
• Learn to predict:
  – Classes of future patients at high risk for Emergency Cesarean Section
Data mining result
• One of 18 learned rules:
  If No previous vaginal delivery, and
     Abnormal 2nd Trimester Ultrasound, and
     Malpresentation at admission
  Then Probability of Emergency C-Section is 0.6
  Over training data: 26/41 = .63
  Over test data: 12/20 = .60
Credit risk analysis
• Rules learned from synthesized data:
  If Other-Delinquent-Accounts > 2, and
     Number-Delinquent-Billing-Cycles > 1
  Then Profitable-Customer? = No
  [Deny Credit Card application]
  If Other-Delinquent-Accounts = 0, and
     (Income > $30k) OR (Years-of-Credit > 3)
  Then Profitable-Customer? = Yes
  [Accept Credit Card application]
Examples (cont.)
• Companies that are famous for using machine learning:
  – Google: web mining (PageRank, search engine, etc.)
  – Netflix: DVD recommendations
    • The Netflix Prize ($1 million) and the recommendation problem
  – Amazon: product placement
Self driving car
• Stanley (Stanford), DARPA Grand Challenge (2005 winner)
• https://www.youtube.com/watch?feature=player_embedded&v=Q1xFdQfq5Fk&noredirect=1#!
Taxonomy
• What is being learned?
  – Parameters (e.g., probabilities in the Bayes network)
  – Structure (e.g., the links in the Bayes network)
  – Hidden concepts/groups (e.g., groups of Netflix users)
• What from?
  – Supervised (e.g., labels)
  – Unsupervised (e.g., replacement principles to learn hidden concepts)
  – Reinforcement learning (e.g., try different actions and receive feedback from the environment)
• What for?
  – Prediction (e.g., stock market)
  – Diagnosis (e.g., to explain something)
  – Summarization (e.g., summarize a paper)
• How?
  – Passive/Active
  – Online/Offline
• Outputs?
  – Classification (discrete) vs. regression (continuous)
• Details?
  – Generative (model the general idea of the data) and discriminative (distinguish the data).
Supervised learning
• Each instance has a feature vector and a target label
  – $f(x_m) = y_m \Rightarrow f(x) = y$
Quiz
• Which function is preferable?
  – $f_a$ or $f_b$?
[Figure: data points in the x-y plane with two candidate functions, fa and fb]
Occam’s razor
• Everything else being equal, choose the less complex hypothesis (the one with fewer assumptions).
[Figure: error plotted against complexity, with curves for training data error and unknown data error; labels: FIT, Low Complexity, OVER FITTING]
Spam Detection
Dear Sir,
First, I must solicit your confidence in this transaction, this is by virtue
of its nature being utterly confidential and top secret …
TO BE REMOVED FROM FUTURE MAILLINGS, SIMPLY REPLY TO THIS
MESSAGE AND PUT “REMOVE” IN THE SUBJECT
99 MILLION EMAIL ADDRESSES FOR $99
OK, I know this is blatantly OT but I’m beginning to go insane. Had an
old Dell Dimension XPS sitting in the corner and decided to put it to
use. I know it was working pre being stuck in the corner, but when I
plugged it in, hit the power, nothing happened.
Spam Detection
[Diagram: Email → f(x)? → SPAM or HAM]
Bag Of Words (BOW)
e.g., “Hello, I will say hello”
Dictionary [hello, I, will, say]
  hello – 2
  I – 1
  will – 1
  say – 1
Dictionary [hello, good-bye]
  hello – 2
  good-bye – 0
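The counting above can be sketched in a few lines of Python. This is only an illustration: the function name `bag_of_words` and the tokenizer (lowercase, letters and hyphens only) are assumptions, not something the slides specify.

```python
import re
from collections import Counter

def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in text (word order is ignored)."""
    # Assumed tokenizer: lowercase the text and keep runs of letters and hyphens.
    tokens = re.findall(r"[a-z\-]+", text.lower())
    counts = Counter(tokens)
    # Words that never occur get a count of 0.
    return {word: counts[word.lower()] for word in dictionary}

print(bag_of_words("Hello, I will say hello", ["hello", "I", "will", "say"]))
print(bag_of_words("Hello, I will say hello", ["hello", "good-bye"]))
```

Note that the representation depends on the dictionary: with the second dictionary, every word except "hello" is simply dropped.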
Spam Detection
• SPAM
  – Offer is secret
  – Click secret link
  – Secret sports link
• HAM
  – Play sports today
  – Went play sports
  – Secret sports event
  – Sports is today
  – Sports costs money
Size of Vocabulary: ???
P(SPAM) = ???
Maximum Likelihood
• Training labels: S S S H H H H H
  – $P(S) = \pi$
• Encoded as 1 1 1 0 0 0 0 0:
  $P(y_i) = \pi$ if $y_i = S$, and $P(y_i) = 1 - \pi$ if $y_i = H$
• $P(\text{data}) = \prod_i P(y_i)$
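To make the step from the data likelihood to the estimate explicit, here is the standard derivation for the eight labels above (3 spam, 5 ham), assuming the labels are independent draws:

```latex
\begin{align*}
P(\text{data}) &= \prod_{i=1}^{8} P(y_i) = \pi^{3}(1-\pi)^{5} \\
\log P(\text{data}) &= 3\log\pi + 5\log(1-\pi) \\
\frac{d}{d\pi}\log P(\text{data}) &= \frac{3}{\pi} - \frac{5}{1-\pi} = 0
\quad\Longrightarrow\quad \pi = \frac{3}{8}
\end{align*}
```

The maximum likelihood estimate is therefore the empirical fraction of spam messages.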
Quiz
• Maximum Likelihood Solutions:
  – P(“SECRET” | SPAM) = ??
  – P(“SECRET” | HAM) = ??
Quiz
• Maximum Likelihood Solutions:
  – P(“SECRET” | SPAM) = 1/3
  – P(“SECRET” | HAM) = 1/15
Relationship to Bayes Networks
• We built a Bayes network whose parameters are estimated by supervised learning, using a maximum likelihood estimator on the training data.
• The Bayes network has at its root an unobservable binary variable called Spam. It has as many children as there are words in a message, and every word has an identical conditional distribution of word occurrence given the class (spam or not spam).
[Diagram: Spam → W1, W2, W3]
DICTIONARY HAS 12 WORDS:
OFFER, IS, SECRET, CLICK, SPORTS, …
How many parameters?
P(“SECRET” | SPAM) = 1/3
P(“SECRET” | HAM) = 1/15
SPAM Classification 1
• Message M = “SPORTS”
• P(SPAM | M) = ???
SPAM Classification 1
• Message M = “SPORTS”
• P(SPAM | M) = 3/18

$$P(\text{SPAM} \mid M) = \frac{P(M \mid \text{SPAM})\,P(\text{SPAM})}{P(M \mid \text{SPAM})\,P(\text{SPAM}) + P(M \mid \text{HAM})\,P(\text{HAM})}$$

$$P(\text{SPAM} \mid M) = \frac{\frac{1}{9} \cdot \frac{3}{8}}{\frac{1}{9} \cdot \frac{3}{8} + \frac{5}{15} \cdot \frac{5}{8}} = \frac{3}{18}$$
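The calculation above can be reproduced with exact fractions. The sketch below is a minimal, unsmoothed Naive Bayes classifier over the eight training messages; the names `word_likelihoods` and `p_spam_given` are illustrative, and the last two HAM messages are read as starting with "sports", which is an assumption consistent with the 12-word dictionary.

```python
from collections import Counter
from fractions import Fraction

SPAM = ["offer is secret", "click secret link", "secret sports link"]
HAM = ["play sports today", "went play sports", "secret sports event",
       "sports is today", "sports costs money"]

def word_likelihoods(messages):
    """Maximum likelihood P(word | class): word count over total words in the class."""
    counts = Counter(w for m in messages for w in m.split())
    total = sum(counts.values())
    return lambda w: Fraction(counts[w], total)

def p_spam_given(message):
    """Posterior P(SPAM | message) by Bayes rule, without smoothing."""
    p_w_spam, p_w_ham = word_likelihoods(SPAM), word_likelihoods(HAM)
    n = len(SPAM) + len(HAM)
    spam = Fraction(len(SPAM), n)  # prior P(SPAM) = 3/8
    ham = Fraction(len(HAM), n)    # prior P(HAM) = 5/8
    for w in message.lower().split():
        spam *= p_w_spam(w)
        ham *= p_w_ham(w)
    # Raises ZeroDivisionError if both products are zero (a word unseen in both classes).
    return spam / (spam + ham)

print(p_spam_given("sports"))  # 1/6, the reduced form of 3/18
```

Because each word probability is multiplied in independently, a single word with zero count in a class drives that class's score to zero.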
SPAM Classification 2
• M = “SECRET IS SECRET”
• P(SPAM | M) = ???
SPAM Classification 2
• M = “SECRET IS SECRET”
• P(SPAM | M) = 25/26 = 0.9615

$$P(\text{SPAM} \mid M) = \frac{P(M \mid \text{SPAM})\,P(\text{SPAM})}{P(M \mid \text{SPAM})\,P(\text{SPAM}) + P(M \mid \text{HAM})\,P(\text{HAM})}$$

$$P(\text{SPAM} \mid M) = \frac{\frac{1}{3} \cdot \frac{1}{9} \cdot \frac{1}{3} \cdot \frac{3}{8}}{\frac{1}{3} \cdot \frac{1}{9} \cdot \frac{1}{3} \cdot \frac{3}{8} + \frac{1}{15} \cdot \frac{1}{15} \cdot \frac{1}{15} \cdot \frac{5}{8}} = \frac{25}{26}$$
SPAM Classification 3
• M = “TODAY IS SECRET”
• P(SPAM | M) = ???
SPAM Classification 3
• M = “TODAY IS SECRET”
• P(SPAM | M) = 0

$$P(\text{SPAM} \mid M) = \frac{P(M \mid \text{SPAM})\,P(\text{SPAM})}{P(M \mid \text{SPAM})\,P(\text{SPAM}) + P(M \mid \text{HAM})\,P(\text{HAM})}$$

$$P(\text{SPAM} \mid M) = \frac{0 \cdot \frac{1}{9} \cdot \frac{1}{3} \cdot \frac{3}{8}}{0 \cdot \frac{1}{9} \cdot \frac{1}{3} \cdot \frac{3}{8} + \frac{2}{15} \cdot \frac{1}{15} \cdot \frac{1}{15} \cdot \frac{5}{8}} = 0$$

“TODAY” never occurs in the SPAM training messages, so its maximum likelihood probability is 0 and the whole product vanishes.
Laplace Smoothing
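• Maximum Likelihood estimation:
  – $P(x) = \frac{\text{count}(x)}{N}$
• LS(k)
  – $P(x) = \frac{\text{count}(x) + k}{N + k\,|x|}$, where $|x|$ is the number of values $x$ can take
• K = 1 [1 message, 1 spam] P(SPAM) = ???
• K = 1 [10 messages, 6 spam] P(SPAM) = ???
• K = 1 [100 messages, 60 spam] P(SPAM) = ???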
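Laplace Smoothing 2
• LS(k)
  – $P(x) = \frac{\text{count}(x) + k}{N + k\,|x|}$
• K = 1 [1 message, 1 spam]
  – $P(\text{SPAM}) = \frac{1 + 1}{1 + 2} = \frac{2}{3}$
• K = 1 [10 messages, 6 spam]
  – $P(\text{SPAM}) = \frac{6 + 1}{10 + 2} = \frac{7}{12}$
• K = 1 [100 messages, 60 spam]
  – $P(\text{SPAM}) = \frac{60 + 1}{100 + 2} = \frac{61}{102} = 0.5980$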
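The three cases can be checked with a small helper. The function name `laplace` and its parameter names are illustrative; `num_values` is the number of values the variable can take (2 here, spam or ham).

```python
from fractions import Fraction

def laplace(count, total, k, num_values):
    """Laplace-smoothed estimate: (count + k) / (total + k * num_values)."""
    return Fraction(count + k, total + k * num_values)

# Two classes (spam, ham), so num_values = 2.
for spam, messages in [(1, 1), (6, 10), (60, 100)]:
    p = laplace(spam, messages, k=1, num_values=2)
    print(f"{spam} spam out of {messages}: P(SPAM) = {p} = {float(p):.4f}")
```

As the number of messages grows, the smoothed estimate approaches the maximum likelihood estimate of 0.6.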
Laplace Smoothing 3
• K = 1
  – P(SPAM) = ???
  – P(HAM) = ???
Laplace Smoothing 4
• K = 1
  – $P(\text{SPAM}) = \frac{3 + 1}{8 + 2} = \frac{2}{5}$
  – $P(\text{HAM}) = \frac{5 + 1}{8 + 2} = \frac{3}{5}$
P(“TODAY” | SPAM) = ???
P(“TODAY” | HAM) = ???
Laplace Smoothing 4
• K = 1
  – $P(\text{“TODAY”} \mid \text{SPAM}) = \frac{0 + 1}{9 + 12} = \frac{1}{21}$
  – $P(\text{“TODAY”} \mid \text{HAM}) = \frac{2 + 1}{15 + 12} = \frac{3}{27} = \frac{1}{9}$
Laplace Smoothing 4
• M = “TODAY IS SECRET”
• P(SPAM | M) = ???
  – K = 1
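Laplace Smoothing 4
• M = “TODAY IS SECRET”
• P(SPAM | M) with K = 1:

$$P(\text{SPAM} \mid M) = \frac{P(M \mid \text{SPAM})\,P(\text{SPAM})}{P(M \mid \text{SPAM})\,P(\text{SPAM}) + P(M \mid \text{HAM})\,P(\text{HAM})}$$

$$P(\text{SPAM} \mid M) = \frac{\frac{1}{21} \cdot \frac{2}{21} \cdot \frac{4}{21} \cdot \frac{2}{5}}{\frac{1}{21} \cdot \frac{2}{21} \cdot \frac{4}{21} \cdot \frac{2}{5} + \frac{3}{27} \cdot \frac{2}{27} \cdot \frac{2}{27} \cdot \frac{3}{5}} \approx 0.4858$$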
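A smoothed posterior can be computed as sketched below. This is a minimal illustration, not a production filter: the name `smoothed_posterior` is hypothetical, and the last two HAM messages are assumed to start with "sports" so that the vocabulary has 12 words.

```python
from collections import Counter
from fractions import Fraction

SPAM = ["offer is secret", "click secret link", "secret sports link"]
HAM = ["play sports today", "went play sports", "secret sports event",
       "sports is today", "sports costs money"]
VOCAB = {w for m in SPAM + HAM for w in m.split()}  # 12 distinct words

def smoothed_posterior(message, k=1):
    """P(SPAM | message) with Laplace smoothing applied to priors and word likelihoods."""
    n = len(SPAM) + len(HAM)
    scores = []
    for docs in (SPAM, HAM):
        counts = Counter(w for m in docs for w in m.split())
        total = sum(counts.values())
        # Smoothed class prior over 2 classes.
        score = Fraction(len(docs) + k, n + 2 * k)
        for w in message.lower().split():
            # Smoothed word likelihood over len(VOCAB) possible words.
            score *= Fraction(counts[w] + k, total + k * len(VOCAB))
        scores.append(score)
    return scores[0] / sum(scores)

p = smoothed_posterior("today is secret")
print(p, round(float(p), 4))
```

With smoothing, the unseen word "today" no longer forces the spam probability to zero.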
Summary Naïve Bayes
• $x_1, x_2, x_3, \ldots \rightarrow y$
[Diagram: class node y with children x1, x2, x3]
Generative model:
• Bag-of-Words (BOW) model
• Maximum Likelihood estimation
• Laplace Smoothing
Advanced SPAM Filters
• Features:
  – Does the email come from a known spamming IP or computer?
  – Have you emailed this person before?
  – Have 1000 other people recently received the same message?
  – Is the email header consistent?
  – All caps?
  – Do the inline URLs point to the pages they claim to point to?
  – Are you addressed by your correct name?
• SPAM filters keep learning as people flag emails as spam, and of course spammers keep learning as well and keep trying to fool modern spam filters.
Overfitting Prevention
• Occam’s Razor:
  – There is a trade-off between how well we can fit the data and how smooth our learning algorithm is.
• How do we determine the k in Laplace smoothing?
• Cross-validation: split the training data into three parts
  – Train: 80%
  – CV: 10%
  – Test: 10%
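The split and the selection of k can be sketched as follows. Both function names are illustrative, and `fit` and `score` are hypothetical callables standing in for whatever learner and accuracy measure are being tuned (for example, Naive Bayes with Laplace parameter k).

```python
import random

def split_data(examples, seed=0):
    """Shuffle and split into 80% train, 10% cross-validation, 10% test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_cv = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_cv], items[n_train + n_cv:]

def choose_k(candidates, train, cv, fit, score):
    """Fit one model per candidate k on train; keep the k that scores best on cv."""
    return max(candidates, key=lambda k: score(fit(train, k), cv))

train, cv, test = split_data(range(100))
print(len(train), len(cv), len(test))
```

The test set is not touched while choosing k; it is used once at the end to report performance.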
Classification vs. Regression
• Supervised Learning
  – Classification:
    • To predict whether an email is SPAM or HAM
  – Regression:
    • To predict the temperature for tomorrow’s weather
Regression Example
• Given this data, a friend has a house of 1000 sq ft.
• How much should he ask?
• 200K?
• 275K?
• 300K?
Regression Example
Linear:
Maybe: 200K
Second order polynomial:
Maybe: 275K
Linear Regression
• Data
• We are looking for y = f(x)
  – n = 1, x is one-dimensional: $f(x) = w_1 x + w_0$
  – High-dimensional: w is a vector, $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + w_0$
Linear Regression
• Quiz:
  – $w_0$ = ??
  – $w_1$ = ??

  x    y
  3    0
  6   -3
  4   -1
  5   -2
Loss function
• Loss function:
  – The goal is to minimize the residual error after fitting the linear regression function as well as possible
  – Quadratic Loss/Error: $L = \sum_j (y_j - w_1 x_j - w_0)^2$
Minimize Quadratic Loss
• We are minimizing the quadratic loss, that is: $\min_{w_0, w_1} \sum_j (y_j - w_1 x_j - w_0)^2$
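For the one-dimensional case, setting the partial derivatives of the quadratic loss to zero gives the standard closed-form solution. The sketch below assumes $M$ data points $(x_j, y_j)$ and the linear model $f(x) = w_1 x + w_0$.

```latex
\begin{align*}
L &= \sum_{j=1}^{M} (y_j - w_1 x_j - w_0)^2 \\
\frac{\partial L}{\partial w_0} = 0 &\;\Longrightarrow\; w_0 = \frac{1}{M}\sum_j y_j - \frac{w_1}{M}\sum_j x_j \\
\frac{\partial L}{\partial w_1} = 0 &\;\Longrightarrow\; w_1 = \frac{M\sum_j x_j y_j - \sum_j x_j \sum_j y_j}{M\sum_j x_j^2 - \left(\sum_j x_j\right)^2}
\end{align*}
```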
Minimize Quadratic Loss
• Quiz:
  – $w_0$ = ??
  – $w_1$ = ??

  x    y
  3    0
  6   -3
  4   -1
  5   -2
Quiz
• Quiz:
  – $w_0$ = ??
  – $w_1$ = ??

  x    y
  2    2
  4    5
  6    5
  8    8

[Figure: scatter plot of the data; both axes run from 0 to 10]
Quiz
• Quiz:
  – $w_0$ = 0.5
  – $w_1$ = 0.9

  x    y
  2    2
  4    5
  6    5
  8    8

[Figure: scatter plot of the data; both axes run from 0 to 10]
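The quiz answer can be verified with the standard closed-form least-squares fit for one input variable. The function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w1 * x + w0 for one-dimensional x."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (m * sxy - sx * sy) / (m * sxx - sx ** 2)
    w0 = sy / m - w1 * sx / m
    return w0, w1

w0, w1 = fit_line([2, 4, 6, 8], [2, 5, 5, 8])
print(w0, w1)  # w0 = 0.5, w1 = 0.9, matching the quiz answer
```

The denominator is zero when all x values are identical, in which case no unique line exists.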
Problem with Linear Regression
[Figure: Temp plotted against Days]
Logistic Regression: $z = \frac{1}{1 + e^{-f(x)}}$, where f(x) is the linear function
Quiz: Range of z?
a. (0, 1)
b. (-1, 1)
c. (-1, 0)
d. (-2, 2)
e. None
Logistic Regression
Quiz: Range of z?
a. (0, 1)
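A quick numerical check of the range, assuming the usual logistic function applied to a linear output f(x); the function name `logistic` is illustrative.

```python
import math

def logistic(fx):
    """Map a linear output f(x) to z = 1 / (1 + exp(-f(x)))."""
    # Mathematically z lies in the open interval (0, 1); for very large |f(x)|
    # floating point can round to the endpoints or overflow in exp.
    return 1.0 / (1.0 + math.exp(-fx))

for fx in (-5.0, 0.0, 5.0):
    print(fx, logistic(fx))
```

Large negative inputs give values near 0, large positive inputs give values near 1, and f(x) = 0 gives exactly 0.5.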
Regularization
• Overfitting occurs when a model captures idiosyncrasies of the input data, rather than generalizing.
  – Too many parameters relative to the amount of training data
P = 1, L1 regularization
P = 2, L2 regularization
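One common way to write a regularized objective is shown below. It assumes the quadratic data loss from the linear regression slides; the weighting constant $\lambda$ is an assumption and does not appear on the slide.

```latex
\[
\text{Loss} = \underbrace{\sum_j (y_j - w_1 x_j - w_0)^2}_{\text{Loss(data)}}
            + \underbrace{\lambda \sum_i |w_i|^{p}}_{\text{Loss(parameters)}}
\]
```

With $p = 1$ the penalty is the sum of absolute weights (L1), and with $p = 2$ it is the sum of squared weights (L2).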
Minimize Complicated Loss Function
• A closed-form solution for minimizing a complicated loss function doesn’t always exist.
• We need to use an iterative method
  – Gradient Descent
[Figure: loss curve with points a, b, c marked; the gradient at a, b, c, and whether each is positive, about zero, or negative]
Quiz
[Figure: loss curve with points a, b, c marked]
Which gradient is the largest?
a?? b?? c?? equal?
Quiz
• Will gradient descent likely reach the global minimum?
[Figure: Loss plotted against w, with the global minimum marked]
Gradient Descent Implementation
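A minimal sketch of gradient descent, applied here to the quadratic loss of one-dimensional linear regression. The function name, the starting point (0, 0), the step size `alpha`, and the number of steps are all assumptions chosen for this small data set; a step size that is too large makes the iteration diverge.

```python
def gradient_descent(xs, ys, alpha=0.005, steps=5000):
    """Minimize sum((y - w1*x - w0)^2) over w0, w1 by gradient descent."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        # Partial derivatives of the loss with respect to w0 and w1.
        g0 = sum(-2 * (y - w1 * x - w0) for x, y in zip(xs, ys))
        g1 = sum(-2 * (y - w1 * x - w0) * x for x, y in zip(xs, ys))
        # Step against the gradient.
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1

w0, w1 = gradient_descent([2, 4, 6, 8], [2, 5, 5, 8])
print(round(w0, 4), round(w1, 4))  # close to w0 = 0.5, w1 = 0.9
```

The quadratic loss is convex, so here the iteration approaches the same solution as the closed form; for non-convex losses it may stop in a local minimum.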
Perceptron Algorithm
• The perceptron is an algorithm for supervised classification of an input into one of two possible outputs.
• It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector describing a given input.
• In the context of artificial neural networks, the perceptron algorithm is also termed the single-layer perceptron, to distinguish it from the case of a multilayer perceptron, which is a more complicated neural network.
• As a linear classifier, the (single-layer) perceptron is the simplest kind of feed-forward neural network.
Perceptron
• Start with a random guess for the weights
• Update the weights using the error between the target label and the prediction
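The sketch below uses the standard perceptron learning rule, which may differ in detail from the update on the original slide. The function names are illustrative, and the weights start at zero rather than at a random guess so that the run is repeatable.

```python
def predict(w, w0, x):
    """Linear threshold unit: 1 if w . x + w0 >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + w0 >= 0 else 0

def train_perceptron(data, alpha=1.0, epochs=100):
    """data is a list of (feature_vector, label) pairs with labels 0 or 1."""
    w, w0 = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            error = y - predict(w, w0, x)  # -1, 0, or +1
            if error:
                mistakes += 1
                # Move the weights toward (or away from) the misclassified example.
                w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
                w0 += alpha * error
        if mistakes == 0:  # every training example is classified correctly
            break
    return w, w0

# Logical AND is linearly separable, so the perceptron converges on it.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, w0 = train_perceptron(AND)
print([predict(w, w0, x) for x, _ in AND])
```

On data that is not linearly separable the loop simply stops after `epochs` passes without reaching zero mistakes.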
Basis of SVM
Q: Which linear separator would you prefer?
[Figure: three candidate linear separators a, b, c]
Basis of SVM
Q: Which linear separator would you prefer?
Answer: b)
[Figure: three candidate linear separators a, b, c]
The margin of the linear separator is the distance of the separator to the closest training example.
Maximum margin learning algorithms:
1) SVM
2) Boosting
SVM
• SVM derives a linear separator, and it takes the one that actually maximizes the margin
• By doing so it attains additional robustness over the perceptron.
• The problem of finding the margin-maximizing linear separator can be solved by a quadratic program, an optimization method for finding the linear separator that maximizes the margin.
SVM
Use linear techniques to solve nonlinear separation problems.
[Figure: data plotted on axes x1 and x2]
“Kernel trick”: introduce a new feature x3
Reference: “An Introduction to Kernel-Based Learning Algorithms”
k Nearest Neighbors
• Parametric: # of parameters independent of training set size.
• Non-parametric: # of parameters can grow with the training set size
1-Nearest Neighbor
kNN
• Learning: memorize all data
• Label New Example:
  – Find k Nearest Neighbors
  – Choose the majority class label as your final class label for the new example
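The two steps above can be sketched as follows, assuming Euclidean distance and a small made-up training set; the name `knn_predict` and the data points are illustrative only.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (point, label) pairs; return the majority label of the k nearest points."""
    # "Learning" is just keeping the training data; all work happens at query time.
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
         ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
print(knn_predict(train, (2, 2), k=3))
```

With two classes, an odd k avoids tied votes. Sorting the whole training set for every query is the simplest approach, and it is what becomes expensive on very large data sets.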
kNN Quiz
K=1, K=3, K=5, K=7, K=9
Problems of KNN
• Very large data sets:
  – k-d trees
• Very large feature spaces