Artificial Intelligence
Lecture 2
Dr. Bo Yuan, Professor
Department of Computer Science and Engineering
Shanghai Jiaotong University
boyuan@sjtu.edu.cn
Review of Lecture One
• Overview of AI
  – Knowledge-based rules in logics (expert systems, automata, …): Symbolism in logics
  – Kernel-based heuristics (neural networks, SVM, …): Connection for nonlinearity
  – Learning and inference (Bayesian, Markovian, …): To sparsely sample for convergence
  – Interactive and stochastic computing (uncertainty, heterogeneity): To overcome the limit of the Turing Machine
• Course Content
  – Focus mainly on learning and inference
  – Discuss current problems and research efforts
  – Perception and behavior (vision, robotics, NLP, bionics, …) not included
• Exam
  – Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
  – Course materials
Today's Content
• Overview of machine learning
• Linear regression
  – Gradient descent
  – Least squares fit
  – Stochastic gradient descent
  – The normal equation
• Applications
Basic Terminologies
• x = input variables/features
• y = output variables/target variables
• (x, y) = training example; the i-th training example = (x^(i), y^(i))
• m = number of training examples (1, …, m)
• n = number of input variables/features (0, …, n)
• h(x) = hypothesis/function/model that outputs the predictive value under a given input x
• θ = parameter/weight, which parameterizes the mapping of x to its predictive value, thus h_θ(x) = θ_0 + θ_1 x_1 + … + θ_n x_n
• We define x_0 = 1 (the intercept), thus able to use a matrix representation: h_θ(x) = Σ_{j=0}^{n} θ_j x_j = θ^T x
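The x_0 = 1 trick above can be sketched in numpy; the data and weights here are hypothetical, chosen only to show the matrix form h_θ(x) = θ^T x.

```python
import numpy as np

# Hypothetical data: 3 training examples, 2 input features each.
X = np.array([[2.0, 3.0],
              [1.0, 5.0],
              [4.0, 1.0]])
theta = np.array([0.5, 1.0, -1.0])   # theta_0 is the intercept weight

# Prepend x_0 = 1 to every example so the intercept folds into the product.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# h_theta(x^(i)) = theta^T x^(i), computed for all examples at once.
h = X1 @ theta
```

With the intercept column in place, every hypothesis evaluation is a single matrix-vector product.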
Gradient Descent
The cost function is defined as:
    J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2
Using the matrix of training samples, gradient descent is based on the partial derivatives with respect to θ:
    ∂J(θ)/∂θ_j = Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
The (batch) algorithm is therefore:
    Loop {
        θ_j := θ_j − α Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x_j^(i)
    } (for every j)
There is an alternative way to iterate, called stochastic gradient descent, which updates θ using one training example at a time:
    Loop over i {
        θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)   (for every j)
    }
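Both update rules can be sketched in a few lines of numpy. This is a minimal illustration, not an optimized implementation; the learning rate, iteration counts, and data are hypothetical choices.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=5000):
    """Batch LMS: each step sums the gradient over all m examples."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        err = X @ theta - y              # h_theta(x^(i)) - y^(i) for all i
        theta = theta - alpha * (X.T @ err)
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=2000):
    """Stochastic LMS: update theta after looking at a single example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            err = X[i] @ theta - y[i]
            theta = theta - alpha * err * X[i]
    return theta
```

On a small exactly-linear data set (e.g. y = 1 + 2x with an x_0 = 1 column), both routines converge to the same θ; stochastic descent simply reaches the neighborhood of the optimum using one example per update.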
Normal Equation
An explicit way to directly obtain θ.
The optimization problem solved by the normal equation: minimize J(θ) = (1/2)(Xθ − y)^T (Xθ − y).
We set the derivatives to zero and obtain the normal equations:
    X^T X θ = X^T y,   hence   θ = (X^T X)^{−1} X^T y
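The closed-form solution is one line of numpy. The data below are hypothetical (a noiseless line y = 1 + 2x), used only to show that solving the normal equations recovers θ without any iteration.

```python
import numpy as np

# Design matrix already includes the x_0 = 1 intercept column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Solve X^T X theta = X^T y directly (more stable than forming the inverse).
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the linear system is preferable to computing (X^T X)^{−1} explicitly; `np.linalg.lstsq` is another numerically robust option when X^T X is ill-conditioned.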
Today's Content
• Linear Regression
  – Locally Weighted Regression (an adaptive method)
• Probabilistic Interpretation
  – Maximum Likelihood Estimation vs. Least Squares (Gaussian distribution)
• Classification by Logistic Regression
  – LMS updating
  – A perceptron-based learning algorithm
Linear Regression
1. Number of features
2. Over-fitting and under-fitting issues
3. Feature selection problem (to be covered later)
4. Adaptive issue
Some definitions:
• Parametric learning (fixed set of θ, with n being constant)
• Non-parametric learning (number of θ grows with m, here linearly)
Locally Weighted Regression (Loess/Lowess regression), non-parametric:
• A bell-shaped weighting (not a Gaussian), e.g. w^(i) = exp(−(x^(i) − x)^2 / (2τ^2))
• Every prediction requires fitting on the entire training set for the given query input (high computational complexity)
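The per-query fitting can be sketched as follows. This is a minimal numpy sketch assuming one input feature plus an intercept column; the bandwidth τ and the function name `lowess_predict` are illustrative choices, not a standard API.

```python
import numpy as np

def lowess_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear regression at a single query point.

    Solves the weighted normal equations theta = (X^T W X)^{-1} X^T W y,
    refit from scratch for every query -- the cost noted on the slide.
    X is assumed to have an x_0 = 1 column followed by one feature.
    """
    # Bell-shaped weights centered on the query point.
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ np.array([1.0, x_query])
```

Because θ is recomputed for each query, prediction cost scales with the full training set, which is exactly why the method is called non-parametric and adaptive.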
Extension of Linear Regression
• Linear additive (straight line): x_1 = 1, x_2 = x
• Polynomial: x_1 = 1, x_2 = x, …, x_n = x^(n−1)
• Chebyshev orthogonal polynomial: x_1 = 1, x_2 = x, …, x_n = 2x·x_{n−1} − x_{n−2}
• Fourier trigonometric polynomial: x_1 = 0.5, followed by sin and cos of different frequencies of x
• Pairwise interaction: linear terms + x_{k1}·x_{k2} (k_1, k_2 = 1, …, n)
• …
• The central problem underlying these representations is whether or not the optimization processes for θ remain convex.
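The key point above is that each basis expansion keeps the model linear in θ, so least squares stays convex. A minimal numpy sketch with the polynomial basis (the target coefficients here are hypothetical):

```python
import numpy as np

def poly_features(x, degree):
    """Polynomial basis x_1 = 1, x_2 = x, ..., up to x^degree."""
    return np.vander(x, degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2     # hypothetical quadratic target

# The model is nonlinear in x but still linear in theta, so the
# normal equations apply unchanged.
X = poly_features(x, 2)
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

The same solve would work for Chebyshev, Fourier, or interaction features; only the construction of the design matrix changes.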
Probabilistic Interpretation
• Why Ordinary Least Squares (OLS)? Why not other power terms?
  – Assume y^(i) = θ^T x^(i) + ε^(i), where ε^(i) = random noise, ε^(i) ~ N(0, σ^2)
  – The PDF for a Gaussian is p(ε) = (1/(√(2π)σ)) exp(−ε^2/(2σ^2))
  – This implies that p(y^(i) | x^(i); θ) = (1/(√(2π)σ)) exp(−(y^(i) − θ^T x^(i))^2/(2σ^2))
• Why Gaussian for random variables? The Central Limit Theorem.
• Consider that the training data are stochastic
• Assume the ε^(i) are i.i.d. (independently, identically distributed)
  – Likelihood L(θ) = the probability of y given x, parameterized by θ:
    L(θ) = Π_{i=1}^{m} p(y^(i) | x^(i); θ)
• What is Maximum Likelihood Estimation (MLE)?
  – Choose parameters θ to maximize L(θ), so as to make the training data set as probable as possible
  – Likelihood L(θ) is a function of the parameters; probability is of the data
Maximum Likelihood (updated)
The Equivalence of MLE and OLS
Maximizing L(θ) is equivalent to maximizing the log-likelihood:
    ℓ(θ) = m log(1/(√(2π)σ)) − (1/σ^2) · (1/2) Σ_{i=1}^{m} (y^(i) − θ^T x^(i))^2
so maximizing ℓ(θ) is the same as minimizing (1/2) Σ_{i=1}^{m} (y^(i) − θ^T x^(i))^2 = J(θ)!
Sigmoid (Logistic) Function
    g(z) = 1/(1 + e^(−z)),   h_θ(x) = g(θ^T x)
Other functions that smoothly increase from 0 to 1 could also be used, but for a couple of good reasons (we will see next time with Generalized Linear Models) the choice of the logistic function is a natural one.
Recall the update rule (note the positive sign rather than negative, since we now ascend):
    θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)
Let's work with just one training example (x, y) and derive the gradient ascent rule.
One useful property of the logistic function:
    g′(z) = g(z)(1 − g(z))
Using this property, the gradient ascent rule for a single example becomes
    θ_j := θ_j + α (y − h_θ(x)) x_j
Identical in form to the least-squares (LMS) update again!
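The LMS-shaped update for logistic regression can be sketched directly. This is a minimal illustration on hypothetical, linearly separable data; the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, iters=1000):
    """Gradient *ascent* on the log-likelihood.

    The vectorized update theta += alpha * X^T (y - h) is the per-feature
    rule theta_j := theta_j + alpha * (y - h_theta(x)) * x_j, i.e. the
    same form as the LMS update for linear regression.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta = theta + alpha * (X.T @ (y - h))
    return theta
```

Despite the identical form, the hypothesis h_θ(x) = g(θ^T x) is now nonlinear, so this is a different algorithm from linear-regression LMS, a point the next lecture's GLM view makes precise.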