
Artificial Intelligence

Lecture 2

Dr. Bo Yuan, Professor

Department of Computer Science and Engineering

Shanghai Jiaotong University


boyuan@sjtu.edu.cn

Review of Lecture One


Overview of AI


Knowledge-based rules in logics (expert systems, automata, …): Symbolism in logics


Kernel-based heuristics (neural network, SVM, …): Connectionism for nonlinearity


Learning and inference (Bayesian, Markovian, …): To sparsely sample for convergence


Interactive and stochastic computing (uncertainty, heterogeneity): To overcome the limits of the Turing Machine



Course Content


Focus mainly on learning and inference


Discuss current problems and research efforts


Perception and behavior (vision, robotics, NLP, bionics, …) not included



Exam


Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)


Course materials

Today’s Content


Overview of machine learning


Linear regression


Gradient descent


Least-squares fit


Stochastic gradient descent


The normal equation


Applications


Basic Terminologies


x = Input variables/features

y = Output variables/target variables

(x, y) = Training example; the i-th training example is (x^(i), y^(i))


m = Number of training examples (1, …, m)

n = Number of input variables/features (0, …, n)


h(x) = Hypothesis/function/model that outputs the predicted value for a given input x


θ = Parameter/weight, which parameterizes the mapping of x to its predicted value, thus h_θ(x) = θ_0 + θ_1·x_1 + … + θ_n·x_n




We define x_0 = 1 (the intercept term), so we can use a compact matrix/vector representation:

h_θ(x) = Σ_{j=0}^{n} θ_j·x_j = θ^T x
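To make the notation concrete, here is a minimal NumPy sketch (the numbers and variable names are illustrative, not from the slides) that builds a design matrix with x_0 = 1 and evaluates h_θ(x) = θ^T x:

import numpy as np

# Toy training set: m = 4 examples, n = 1 input feature.
X_raw = np.array([[2104.0], [1600.0], [2400.0], [1416.0]])   # shape (m, n)
y = np.array([400.0, 330.0, 369.0, 232.0])                   # targets, shape (m,)

# Prepend the intercept feature x_0 = 1 to every example.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])         # shape (m, n + 1)

theta = np.zeros(X.shape[1])                                 # parameters θ_0 … θ_n

def h(theta, X):
    """Hypothesis h_θ(x) = θᵀx, evaluated for every row of X."""
    return X @ theta

print(h(theta, X))   # all zeros until θ is trained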








Gradient Descent

The cost function is defined as:

J(θ) = (1/2) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²

Using the matrix X (training inputs as rows, each with x_0 = 1) and the vector y of targets, the cost can be written with respect to θ as:

J(θ) = (1/2) · (Xθ − y)^T (Xθ − y)


Gradient descent is based on the partial derivatives with respect to θ:

∂J(θ)/∂θ_j = Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)


The (batch) algorithm is therefore: Loop {

θ_j := θ_j + α · Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) · x_j^(i)

} (for every j)

There is an alternative way to iterate, called stochastic gradient descent, which updates the parameters with one training example at a time:

Loop over i = 1, …, m { θ_j := θ_j + α · (y^(i) − h_θ(x^(i))) · x_j^(i) (for every j) }
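A minimal NumPy sketch of both update schemes, assuming the design matrix X already contains the x_0 = 1 column; the learning rate α is a placeholder and must be tuned to the scale of the features:

import numpy as np

def batch_gradient_descent(X, y, alpha=1e-7, iters=500):
    """Batch LMS updates: θ_j := θ_j + α Σ_i (y^(i) − h_θ(x^(i))) x_j^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = y - X @ theta           # (y^(i) − h_θ(x^(i))) for all i
        theta += alpha * (X.T @ residual)  # sum over the whole training set
    return theta

def stochastic_gradient_descent(X, y, alpha=1e-7, epochs=50):
    """Stochastic LMS updates: one training example per parameter update."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            residual = y[i] - X[i] @ theta
            theta += alpha * residual * X[i]
    return theta

Batch descent uses all m examples per step; the stochastic version updates after every example, which is noisier but typically reaches the vicinity of the minimum much faster on large training sets.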

Normal Equation

An explicit way to directly obtain θ

The Optimization Problem by the Normal Equation

We set the derivatives to zero and obtain the Normal Equations:

X^T X θ = X^T y,   hence   θ = (X^T X)^(−1) X^T y
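A one-function NumPy sketch of the closed-form solution; it assumes X^T X is invertible (np.linalg.lstsq is the numerically safer alternative):

import numpy as np

def normal_equation(X, y):
    """Solve XᵀXθ = Xᵀy directly (no iteration, no learning rate)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, but more robust when XᵀX is ill-conditioned:
# theta, *_ = np.linalg.lstsq(X, y, rcond=None)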

Today’s Content


Linear Regression


Locally Weighted Regression (an adaptive method)



Probabilistic Interpretation


Maximum Likelihood Estimation vs. Least Squares (Gaussian distribution)



Classification by Logistic Regression


LMS updating


A Perceptron-based Learning Algorithm


Linear Regression

1. Number of features

2. Over-fitting and under-fitting issues

3. Feature selection problem (to be covered later)

4. Adaptive issue


Some definitions:


Parametric learning (fixed set of θ, with n being constant)

Non-parametric learning (the number of θ grows linearly with m)


Locally-Weighted Regression (Loess/Lowess regression): non-parametric

A bell-shaped weighting (not a Gaussian)

Each prediction requires using the entire training data set to fit the model for the given query input (computational complexity); see the sketch below.
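A minimal sketch of locally weighted linear regression in NumPy; it assumes the commonly used bell-shaped weight w^(i) = exp(−‖x^(i) − x‖² / (2τ²)) with bandwidth τ, since the slides do not spell out the weighting formula:

import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    """Predict y at x_query by fitting a weighted least-squares line locally.

    X: (m, n+1) design matrix with x_0 = 1; y: (m,) targets;
    x_query: (n+1,) query point (also with leading 1); tau: bandwidth.
    """
    # Bell-shaped weights: points near the query dominate the fit.
    diff = X - x_query
    w = np.exp(-np.sum(diff**2, axis=1) / (2.0 * tau**2))   # shape (m,)
    W = np.diag(w)
    # Weighted normal equations: (XᵀWX) θ = XᵀWy (assumes XᵀWX is invertible)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta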


Extension of Linear Regression


Linear Additive (straight-line):

x_1 = 1, x_2 = x


Polynomial:

x_1 = 1, x_2 = x, …, x_n = x^(n−1)


Chebyshev Orthogonal Polynomial:

x_1 = 1, x_2 = x, …, x_n = 2x·x_{n−1} − x_{n−2}


Fourier Trigonometric Polynomial:

x_1 = 0.5, followed by sin and cos terms of different frequencies of x


Pairwise Interaction:

linear terms + x_{k1}·x_{k2} (k1, k2 = 1, …, n)






The central problem underlying these representations is whether or not the optimization process for θ remains convex; see the feature-construction sketch below.
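A small NumPy sketch (helper names are my own) that builds the polynomial and Fourier feature matrices listed above; because the model stays linear in θ, either matrix can be fed to ordinary least squares and the optimization remains convex:

import numpy as np

def polynomial_features(x, degree):
    """[1, x, x², …, x^(degree−1)] for each scalar sample in x."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**k for k in range(degree)])

def fourier_features(x, n_freq):
    """[0.5, sin(x), cos(x), sin(2x), cos(2x), …] for each scalar sample."""
    x = np.asarray(x, dtype=float)
    cols = [np.full_like(x, 0.5)]
    for k in range(1, n_freq + 1):
        cols.append(np.sin(k * x))
        cols.append(np.cos(k * x))
    return np.column_stack(cols)

# Either feature matrix Phi can be plugged into the normal equation unchanged:
# theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)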

Probabilistic Interpretation


Why Ordinary Least Squares (OLS)? Why not other power terms?

Assume y^(i) = θ^T x^(i) + ε^(i), where ε^(i) = random noise, ε^(i) ~ N(0, σ²)

The PDF for the Gaussian is p(ε^(i)) = (1 / (√(2π)·σ)) · exp(−(ε^(i))² / (2σ²))

This implies that p(y^(i) | x^(i); θ) = (1 / (√(2π)·σ)) · exp(−(y^(i) − θ^T x^(i))² / (2σ²))

Or, equivalently, y^(i) | x^(i); θ ~ N(θ^T x^(i), σ²)

Why Gaussian for the random noise? The Central Limit Theorem.


Consider that the training data are stochastic

Assume the ε^(i) are i.i.d. (independently and identically distributed)


Likelihood L(θ) = the probability of the data y given x, parameterized by θ (written out below):
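Reconstructed in LaTeX from the Gaussian noise assumption above (a standard derivation; the original formula image is missing):

\[
L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right)
          = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
            \exp\left(-\frac{\left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}}{2\sigma^{2}}\right)
\]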







What is Maximum Likelihood Estimation (MLE)?


Choose parameters θ to maximize the function L(θ), so as to make the training data set as probable as possible;

Likelihood L(θ) is a function of the parameters; it equals the probability of the data.



Maximum Likelihood (updated)

The Equivalence of MLE and OLS

Maximizing the log-likelihood ℓ(θ) = log L(θ) reduces to minimizing (1/2) Σ_{i=1}^{m} (y^(i) − θ^T x^(i))² = J(θ)!
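The missing algebra, reconstructed in LaTeX with the notation defined above (standard derivation):

\begin{align*}
\ell(\theta) &= \log L(\theta)
  = \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\left(-\frac{\left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}}{2\sigma^{2}}\right) \\
 &= m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    - \frac{1}{\sigma^{2}} \cdot \frac{1}{2} \sum_{i=1}^{m}
      \left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}
\end{align*}

Hence maximizing ℓ(θ) is the same as minimizing (1/2) Σ_{i=1}^{m} (y^(i) − θ^T x^(i))² = J(θ), independent of σ².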

Sigmoid (Logistic) Function

Other functions that smoothly increase from 0 to 1 could also be used, but for a couple of good reasons (we will see next time with Generalized Linear Models) the choice of the logistic function

g(z) = 1 / (1 + e^(−z)),   h_θ(x) = g(θ^T x)

is a natural one.

Recall the LMS rule θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i) (note the positive sign rather than negative).


Let's work with just one training example (x, y) and derive the gradient ascent rule:
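The derivation itself did not survive extraction; below is the standard reconstruction in LaTeX, assuming the binary model p(y = 1 | x; θ) = h_θ(x), p(y = 0 | x; θ) = 1 − h_θ(x), and using the logistic-derivative property stated two slides below:

\begin{align*}
\log L(\theta) &= y \log h_\theta(x) + (1 - y) \log\left(1 - h_\theta(x)\right) \\
\frac{\partial}{\partial \theta_j} \log L(\theta)
 &= \left(\frac{y}{g(\theta^{T} x)} - \frac{1 - y}{1 - g(\theta^{T} x)}\right)
    \frac{\partial}{\partial \theta_j} g(\theta^{T} x) \\
 &= \left(\frac{y}{g(\theta^{T} x)} - \frac{1 - y}{1 - g(\theta^{T} x)}\right)
    g(\theta^{T} x)\left(1 - g(\theta^{T} x)\right) x_j \\
 &= \left(y - h_\theta(x)\right) x_j
\end{align*}

The gradient ascent update is therefore θ_j := θ_j + α (y − h_θ(x)) x_j.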















One Useful Property of the Logistic Function
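The property referred to here, written out in LaTeX (a one-line reconstruction):

\[
g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}}
      = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
      = g(z)\left(1 - g(z)\right)
\]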

Identical to Least Squares Again? (The resulting update has the same form as the LMS rule, even though h_θ(x) = g(θ^T x) is now nonlinear; see the sketch below.)
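To close the lecture, a minimal Python sketch (illustrative, not the lecture's own code) of logistic regression trained by stochastic gradient ascent; the update mirrors the LMS rule even though the hypothesis is now the logistic function:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, epochs=100):
    """Stochastic gradient ascent on the log-likelihood of logistic regression.

    X: (m, n+1) design matrix with x_0 = 1; y: (m,) labels in {0, 1}.
    The update θ := θ + α (y^(i) − g(θᵀx^(i))) x^(i) mirrors the LMS rule.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            residual = y[i] - sigmoid(X[i] @ theta)
            theta += alpha * residual * X[i]
    return theta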