# More Machine Learning

AI and Robotics

Oct 16, 2013


Linear Regression

Squared Error

L1 and L2 Regularization

Gradient Descent

Recall:

Key Components of Intelligent Agents

Representation Language: Graph, Bayes Nets

Inference Mechanism: A*, variable elimination, Gibbs sampling

Learning Mechanism: Maximum Likelihood, Laplace Smoothing, many more: linear regression, perceptron, k-Nearest Neighbor, …

-------------------------------------

Evaluation Metric: Likelihood, many more: squared error, 0-1 loss, conditional likelihood, precision/recall, …

Recall: Types of Learning

The techniques we have discussed so far are examples of a particular kind of learning:

Supervised: the training examples included the correct labels or outputs.

Vs. Unsupervised (or semi-supervised, or distantly-supervised, …): None (or some, or only part, …) of the labels in the training data are known.

Parameter Estimation: We only tried to learn the parameters in the BN, not the structure of the BN graph.

Vs. Structure learning: The BN graph is not given as an input, and the learning algorithm’s job is to figure out what the graph should look like.

The distinctions below aren’t actually about the learning algorithm itself, but rather about the type of model being learned:

Classification: the output is a discrete value, like Happy or not Happy, or Spam or Ham.

Vs. Regression: the output is a real number.

Generative: The model of the data represents a full joint distribution over all relevant variables.

Vs. Discriminative: The model assumes some fixed subset of the variables will always be “inputs” or “evidence”, and it creates a distribution for the remaining variables conditioned on the evidence variables.

Parametric vs. Nonparametric: I will explain this later.

We won’t talk much about structure learning, but we will cover some other kinds of learning (regression, unsupervised, discriminative, nonparametric, …) in later lectures.

Regression vs. Classification

Our NBC spam detector was a classifier: the output Y was one of two options, Ham or Spam.

More generally, classifiers give an output from a (usually small) finite (or countably infinite) set of options.

E.g., predicting who will win the presidency in the next election is a classification problem (finite set of possible outcomes: US citizens).

Regression models give a real number as output.

E.g., predicting what the temperature will be tomorrow is a regression problem. Any real number greater than or equal to 0 (Kelvin) is a possible outcome.

Quiz: regression vs. classification

For each prediction task below, determine whether regression or classification is more appropriate.

| Task | Regression or Classification? |
| --- | --- |
| Predict who will win the Super Bowl next year | |
| Predict the gender of a baby when it’s born | |
| Predict the weight of a child one year from now | |
| Predict the average life expectancy of all babies born today | |
| Predict the price of Apple, Inc.’s stock at the close of trading tomorrow | |
| Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow | |

Answers: regression vs. classification

For each prediction task below, determine whether regression or classification is more appropriate.

| Task | Regression or Classification? |
| --- | --- |
| Predict who will win the Super Bowl next year | C |
| Predict the gender of a baby when it’s born | C |
| Predict the weight of a child one year from now | R |
| Predict the average life expectancy of all babies born today | R |
| Predict the price of Apple, Inc.’s stock at the close of trading tomorrow | R |
| Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow | C |

Concrete Example

[Figure: scatter plot of House Price (\$) against Square Footage.]

Suppose I want to buy a house that’s 2000 square feet. Predict how much it will cost.

Prediction: \$175,000

More realistic data

[Figure: Reported Crime Statistics for U.S. Counties. Violent Crime per Capita plotted against Percentage of the population under the federal poverty level.]

Linear Regression

Suppose there are $N$ input variables, $X_1, \ldots, X_N$ (all real numbers).

A linear regression is a function that looks like this:

$$Y = w_0 + w_1 X_1 + w_2 X_2 + \cdots + w_N X_N$$

The $w_i$ variables are called weights or parameters. Each one is a real number.

The set of all functions that look like this (one function for each choice of weights $w_0$ through $w_N$) is called the Hypothesis Class for linear regression.
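To make the definition concrete, here is a minimal sketch of one element of the hypothesis class as a Python function. The name `predict` and the example weights and inputs are illustrative assumptions, not part of the lecture.

```python
from typing import Sequence


def predict(weights: Sequence[float], inputs: Sequence[float]) -> float:
    """Evaluate Y = w_0 + w_1*X_1 + ... + w_N*X_N.

    weights holds (w_0, ..., w_N); inputs holds (X_1, ..., X_N).
    """
    if len(weights) != len(inputs) + 1:
        raise ValueError("expected one more weight than inputs (w_0 is the intercept)")
    w0, rest = weights[0], weights[1:]
    return w0 + sum(w * x for w, x in zip(rest, inputs))


# Hypothetical weights and inputs, chosen only to show the arithmetic.
example = predict([1.0, 2.0, -3.0], [4.0, 5.0])  # 1 + 2*4 - 3*5
```

Each different `weights` list passed to `predict` corresponds to a different hypothesis in the class.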

Hypotheses

[Figure: House Price (\$) against Square Footage, with several candidate lines drawn over the data.]

In this example, there is only one input variable: $X_1$ is square footage.

The hypothesis class is all functions $Y = w_0 + w_1 \cdot (\text{square footage})$.

Several example elements of the hypothesis class are drawn:

$Y = 100 + 900 \cdot X_1$

$Y = 55100 + 900 \cdot X_1$

$Y = 80000 + 270 \cdot X_1$

Learning for Linear Regression

Linear regression tells us a whole set of possible functions to use for prediction.

How do we choose the best one from this set?

This is the learning problem for linear regression:

Input: a set of training examples, where each example contains a value for $(X_1, \ldots, X_N, Y)$

Output: a set of weights $(w_0, \ldots, w_N)$ for the “best-fitting” linear regression model.

Quiz: Learning for Linear Regression

| X | Y |
| --- | --- |
| 10 | 80 |
| 30 | 40 |
| 15 | 70 |
| 55 | -10 |

For the data above, what’s the best fit linear regression model?

Answer: Learning for Linear Regression

| X | Y |
| --- | --- |
| 10 | 80 |
| 30 | 40 |
| 15 | 70 |
| 55 | -10 |

For the data above, what’s the best fit linear regression model?

$$
\begin{aligned}
80 &= w_0 + w_1 \cdot 10 \\
40 &= w_0 + w_1 \cdot 30 \\
80 - 40 &= w_0 - w_0 + w_1 \cdot 10 - w_1 \cdot 30 \\
40 &= w_1 \cdot (-20) \\
-2 &= w_1 \\
80 &= w_0 + (-2) \cdot 10 \\
100 &= w_0
\end{aligned}
$$

$$Y = 100 + (-2) \cdot X$$
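The elimination steps above can be sketched in a few lines of Python. The helper name `line_through` is an assumption made for illustration; the sketch solves for the line through the first two data points and then checks that the remaining points lie on it.

```python
def line_through(p, q):
    """Return (w0, w1) for the line Y = w0 + w1*X through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("the two points need different X values")
    w1 = (y1 - y2) / (x1 - x2)   # subtract the two equations
    w0 = y1 - w1 * x1            # substitute w1 back into the first
    return w0, w1


data = [(10, 80), (30, 40), (15, 70), (55, -10)]
w0, w1 = line_through(data[0], data[1])

# Every point in the table should lie exactly on the same line.
fits_all = all(y == w0 + w1 * x for x, y in data)
```

This only works because the quiz data is perfectly linear; any two points with different X values would give the same line.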

Linear Regression with Noisy Data

[Figure: scatter plot of House Price (\$) against Square Footage, with noisy data.]

In the previous example, we could use only two points and find a line that passed through all of the remaining points.

In this example, points are only “approximately” linear. No single line passes through all points exactly. We’ll need a more complex algorithm to handle this.

Quadratic Loss (a.k.a. “Squared Error”)

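Let’s write our training data $D$ with this notation, one row per training example:

$$
D = \begin{array}{cccc|c}
X_{11} & X_{12} & \cdots & X_{1N} & Y_1 \\
X_{21} & X_{22} & \cdots & X_{2N} & Y_2 \\
\vdots & \vdots & & \vdots & \vdots \\
X_{M1} & X_{M2} & \cdots & X_{MN} & Y_M
\end{array}
$$

Define

$$\text{LOSS}(f, D) = \sum_{m=1}^{M} \left(Y_m - f(X_{m1}, \ldots, X_{mN})\right)^2 = \sum_{m=1}^{M} \left(Y_m - w_0 - w_1 X_{m1} - \cdots - w_N X_{mN}\right)^2$$

Intuitively, this is how much error the function makes on the training data.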
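As a rough illustration of the quadratic loss, the sketch below sums the squared prediction errors over a dataset. The function name `squared_loss` and the three-example dataset are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def squared_loss(weights, examples):
    """Sum of squared errors of Y = w_0 + w_1*X_1 + ... + w_N*X_N.

    weights is (w_0, ..., w_N); each example is ((X_1, ..., X_N), Y).
    """
    total = 0.0
    for inputs, y in examples:
        prediction = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
        total += (y - prediction) ** 2
    return total


# Hypothetical one-variable data, used only to illustrate the computation.
examples = [((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 8.0)]
loss_a = squared_loss((1.0, 2.0), examples)   # errors 0, 0, 1
loss_b = squared_loss((0.0, 2.0), examples)   # errors 1, 1, 2
```

The first weight setting makes less error on this data, so it has the smaller loss.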

Objective Function

The goal of a linear regression is to find the best linear function. We’ll say that “best” means the one with the least amount of quadratic loss.

Mathematically, we say we want $f^*$ that satisfies:

$$f^*(X_1, \ldots, X_N) = \underset{w_0, \ldots, w_N}{\operatorname{argmin}}\ \text{LOSS}(w_0 + w_1 X_1 + \cdots + w_N X_N,\ D)$$

We call LOSS the objective function for our training algorithm, since it’s the function we’re trying to minimize.
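One way to picture the argmin is to score a handful of candidate weight settings and keep the one with the lowest loss. The sketch below does this for the four-point dataset from the earlier quiz; the candidate list is a made-up assumption, and a real learner searches over all real-valued weights rather than a short list.

```python
def squared_loss(w0, w1, data):
    """Quadratic loss of the line Y = w0 + w1*X on (X, Y) pairs."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in data)


data = [(10, 80), (30, 40), (15, 70), (55, -10)]

# Hypothetical candidate hypotheses (w0, w1); a real learner searches all of them.
candidates = [(0.0, 1.0), (90.0, -1.5), (100.0, -2.0), (120.0, -3.0)]

# argmin over the candidates: the pair with the smallest loss wins.
best = min(candidates, key=lambda w: squared_loss(w[0], w[1], data))
```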

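Closed-form Solution for 1 input variable

$$f^*(X_1) = \underset{w_0, w_1}{\operatorname{argmin}}\ \text{LOSS}(w_0 + w_1 X_1,\ D)$$

To minimize the LOSS function, we’ll take the partial derivatives, and set them to zero. All sums below run over the training examples $m = 1, \ldots, M$.

$$\frac{\partial\, \text{LOSS}}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_{m} \left(Y_m - w_0 - w_1 X_{m1}\right)^2 = -2 \sum_{m} \left(Y_m - w_0 - w_1 X_{m1}\right) X_{m1}$$

Set this expression equal to zero:

$$
\begin{aligned}
\sum_{m} \left(X_{m1} Y_m - w_0 X_{m1} - w_1 X_{m1}^2\right) &= 0 \\
w_1 \sum_{m} X_{m1}^2 &= \sum_{m} X_{m1} Y_m - w_0 \sum_{m} X_{m1} \\
w_1 &= \frac{\sum_{m} X_{m1} Y_m - w_0 \sum_{m} X_{m1}}{\sum_{m} X_{m1}^2}
\end{aligned}
$$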

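Closed-form Solution for 1 input variable

$$f^*(X_1) = \underset{w_0, w_1}{\operatorname{argmin}}\ \text{LOSS}(w_0 + w_1 X_1,\ D)$$

To minimize the LOSS function, we’ll take the partial derivatives, and set them to zero. All sums below run over the training examples $m = 1, \ldots, M$.

$$\frac{\partial\, \text{LOSS}}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_{m} \left(Y_m - w_0 - w_1 X_{m1}\right)^2 = -2 \sum_{m} \left(Y_m - w_0 - w_1 X_{m1}\right)$$

Set this expression equal to zero:

$$
\begin{aligned}
\sum_{m} \left(Y_m - w_0 - w_1 X_{m1}\right) &= 0 \\
w_0 &= \frac{1}{M} \left(\sum_{m} Y_m - w_1 \sum_{m} X_{m1}\right)
\end{aligned}
$$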

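“Closed-form” Result

All sums run over the training examples $m = 1, \ldots, M$.

$$w_0 = \frac{1}{M} \left(\sum_{m} Y_m - w_1 \sum_{m} X_{m1}\right)$$

$$w_1 = \frac{\sum_{m} X_{m1} Y_m - w_0 \sum_{m} X_{m1}}{\sum_{m} X_{m1}^2}$$

Substituting for $w_0$ in the second equation gives:

$$
\begin{aligned}
w_1 &= \frac{\sum_{m} X_{m1} Y_m - \frac{1}{M} \left(\sum_{m} Y_m - w_1 \sum_{m} X_{m1}\right) \sum_{m} X_{m1}}{\sum_{m} X_{m1}^2} \\
w_1 &= \frac{\sum_{m} X_{m1} Y_m - \frac{1}{M} \sum_{m} Y_m \sum_{m} X_{m1} + \frac{w_1}{M} \left(\sum_{m} X_{m1}\right)^2}{\sum_{m} X_{m1}^2} \\
w_1 \cdot \frac{\sum_{m} X_{m1}^2 - \frac{1}{M} \left(\sum_{m} X_{m1}\right)^2}{\sum_{m} X_{m1}^2} &= \frac{\sum_{m} X_{m1} Y_m - \frac{1}{M} \sum_{m} X_{m1} \sum_{m} Y_m}{\sum_{m} X_{m1}^2} \\
w_1 &= \frac{\sum_{m} X_{m1} Y_m - \frac{1}{M} \sum_{m} X_{m1} \sum_{m} Y_m}{\sum_{m} X_{m1}^2 - \frac{1}{M} \left(\sum_{m} X_{m1}\right)^2}
\end{aligned}
$$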

Quiz: Learning for Linear Regression

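| X | Y |
| --- | --- |
| 10 | 80 |
| 30 | 40 |
| 15 | 70 |
| 55 | -10 |

Using the closed-form solution for Quadratic Loss, compute $w_0$ and $w_1$ for this dataset.

$$w_1 = \frac{\sum_{m} X_{m1} Y_m - \frac{1}{M} \sum_{m} X_{m1} \sum_{m} Y_m}{\sum_{m} X_{m1}^2 - \frac{1}{M} \left(\sum_{m} X_{m1}\right)^2}$$

$$w_0 = \frac{1}{M} \left(\sum_{m} Y_m - w_1 \sum_{m} X_{m1}\right)$$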

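Answer: Learning for Linear Regression

| X | Y |
| --- | --- |
| 10 | 80 |
| 30 | 40 |
| 15 | 70 |
| 55 | -10 |

Using the closed-form solution for Quadratic Loss, compute $w_0$ and $w_1$ for this dataset.

$$w_1 = \frac{\sum_{m} X_{m1} Y_m - \frac{1}{M} \sum_{m} X_{m1} \sum_{m} Y_m}{\sum_{m} X_{m1}^2 - \frac{1}{M} \left(\sum_{m} X_{m1}\right)^2} = \frac{800 + 1200 + 1050 - 550 - \frac{1}{4} \cdot 180 \cdot 110}{100 + 900 + 225 + 3025 - \frac{1}{4} \cdot 110^2} = \frac{-2450}{1225} = -2$$

$$w_0 = \frac{1}{M} \left(\sum_{m} Y_m - w_1 \sum_{m} X_{m1}\right) = \frac{1}{4} \cdot 180 - \frac{-2}{4} \cdot 110 = 100$$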

Note that w1, w0 match what we calculated before!
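The closed-form formulas translate directly into code. The sketch below is one possible implementation for a single input variable; the function name `closed_form_fit` is an assumption, and the data is the four-point table from the quiz.

```python
def closed_form_fit(data):
    """Least-squares (w0, w1) for Y = w0 + w1*X from a list of (X, Y) pairs."""
    m = len(data)
    sum_x = sum(x for x, _ in data)
    sum_y = sum(y for _, y in data)
    sum_xy = sum(x * y for x, y in data)
    sum_xx = sum(x * x for x, _ in data)

    denominator = sum_xx - sum_x ** 2 / m
    if denominator == 0:
        raise ValueError("all X values are identical; the slope is undefined")

    w1 = (sum_xy - sum_x * sum_y / m) / denominator
    w0 = (sum_y - w1 * sum_x) / m
    return w0, w1


data = [(10, 80), (30, 40), (15, 70), (55, -10)]
w0, w1 = closed_form_fit(data)
```

Unlike the two-point method, this function also works when the points are only approximately linear.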

Overfitting and Regularization

It is very common to use a technique called regularization to combat overfitting for linear methods.

Regularization changes the objective function for training by adding a penalty for the size of the weights:

$$\text{LOSS}(f, D) = \sum_{m=1}^{M} \left(Y_m - f(X_{m1}, \ldots, X_{mN})\right)^2 + \sum_{j} |w_j|^p$$

The second term is the parameter loss.

When p=1, this is called L1 regularization.

When p=2, this is called L2 regularization.

1 and 2 are by far the two most commonly-used values of p.
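A small sketch may help show how the penalty changes the objective. Two choices here are assumptions of the sketch rather than anything stated in the slides: the penalty is multiplied by a strength factor `lam` (set to 1 by default), and the intercept `w0` is penalized along with `w1`.

```python
def regularized_loss(w0, w1, data, p=2, lam=1.0):
    """Squared error plus lam * (|w0|**p + |w1|**p) for the line Y = w0 + w1*X.

    lam (penalty strength) is an assumption of this sketch, not from the slides.
    """
    squared_error = sum((y - (w0 + w1 * x)) ** 2 for x, y in data)
    penalty = abs(w0) ** p + abs(w1) ** p
    return squared_error + lam * penalty


data = [(10, 80), (30, 40), (15, 70), (55, -10)]

l1 = regularized_loss(100.0, -2.0, data, p=1)   # 0 + (100 + 2)
l2 = regularized_loss(100.0, -2.0, data, p=2)   # 0 + (10000 + 4)
```

Even though these weights fit the data perfectly, the regularized objective is no longer zero, so the minimizer is pulled toward smaller weights.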

Gradient Descent

For more complex loss functions, it is often NOT POSSIBLE to find closed-form solutions.

Instead, people resort to “iterative methods” that iteratively find better and better parameter estimates, until they converge to the best setting.

We’ll go over one example of this kind of method, called “gradient descent”.

Gradient Descent

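Gradient Descent Algorithm

Create weights $w_j^i$, $i \leftarrow 0$

1. $(w_0^0, w_1^0) \leftarrow$ some initial values (often zero)

2. While $|w_1^i - w_1^{i-1}| + |w_0^i - w_0^{i-1}| > \text{threshold}$:

   for each $j$: $w_j^{i+1} \leftarrow w_j^i - \alpha \dfrac{\partial L}{\partial w_j}$

   $i \leftarrow i + 1$

3. Return $(w_0^i, w_1^i)$

Here $\alpha$ is the learning rate. On the first pass, when $i = 0$, there is no previous iterate to compare against, so the update is always performed.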
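The algorithm above can be sketched in Python for the one-variable quadratic loss. Several details are assumptions made for this sketch: the dataset is hypothetical, the values of `alpha` and `threshold` are picked to suit that dataset, and `max_iters` is a safety cap that the slides do not mention.

```python
def gradient_descent(data, alpha=0.05, threshold=1e-10, max_iters=100_000):
    """Minimize the quadratic loss of Y = w0 + w1*X by gradient descent."""
    w0, w1 = 0.0, 0.0                     # step 1: initial values
    for _ in range(max_iters):            # safety cap, not part of the slides
        residuals = [y - (w0 + w1 * x) for x, y in data]
        grad_w0 = -2.0 * sum(residuals)
        grad_w1 = -2.0 * sum(r * x for r, (x, _) in zip(residuals, data))

        # step 2: update every weight from the same old values
        new_w0 = w0 - alpha * grad_w0
        new_w1 = w1 - alpha * grad_w1

        change = abs(new_w1 - w1) + abs(new_w0 - w0)
        w0, w1 = new_w0, new_w1
        if change <= threshold:           # weights stopped moving
            break
    return w0, w1                         # step 3


# Hypothetical data lying exactly on Y = 1 + 2*X.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w0, w1 = gradient_descent(data)
```

The learning rate matters: a value that is too large for the data makes the updates overshoot and diverge, while a very small one makes convergence slow.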

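Quiz: Gradient

Check the boxes that apply.

[Figure: plot of LOSS against w, with points a, b, and c marked.]

| | $\frac{\partial \text{Loss}}{\partial w}$ positive | About zero | $\frac{\partial \text{Loss}}{\partial w}$ negative |
| --- | --- | --- | --- |
| a | | | |
| b | | | |
| c | | | |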

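Answer: Gradient

Check the boxes that apply.

[Figure: plot of LOSS against w, with points a, b, and c marked.]

Choices for each point: $\frac{\partial \text{Loss}}{\partial w}$ positive, about zero, $\frac{\partial \text{Loss}}{\partial w}$ negative.

a: x

b: x

c: x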

Quiz: Gradient

Where is $\frac{\partial \text{Loss}}{\partial w}$ the largest?

a

b

c

Equal everywhere

[Figure: plot of LOSS against w, with points a, b, and c marked.]

Answer: Gradient

Where is $\frac{\partial \text{Loss}}{\partial w}$ the largest?

a

b

c

Equal everywhere

x

[Figure: plot of LOSS against w, with points a, b, and c marked.]

Quiz: Gradient Descent

Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w?

a

b

c

[Figure: plot of LOSS against w, with points a, b, and c marked.]

Answer: Gradient Descent

Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w?

a

b

c

x

[Figure: plot of LOSS against w, with points a, b, and c marked.]
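To see why the initialization matters, here is a small sketch that runs gradient descent on a loss with two valleys. The loss function is invented for illustration and is not the curve from the quiz figure; the step size and stopping threshold are likewise assumptions.

```python
def loss(w):
    """A made-up loss with two valleys; not the curve from the quiz figure."""
    return w ** 4 - 3 * w ** 2 + w


def gradient(w):
    """Derivative of the made-up loss with respect to w."""
    return 4 * w ** 3 - 6 * w + 1


def descend(w, alpha=0.01, threshold=1e-12, max_iters=100_000):
    """Gradient descent on a single parameter w."""
    for _ in range(max_iters):
        new_w = w - alpha * gradient(w)
        if abs(new_w - w) <= threshold:
            return new_w
        w = new_w
    return w


left = descend(-2.0)    # settles in the deeper valley
right = descend(2.0)    # settles in the shallower valley
```

Both runs stop where the gradient is about zero, but only the run started on the left reaches the global minimum; the other is stuck in a local minimum with a higher loss.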