# Artificial Intelligence Lecture 2

Artificial Intelligence and Robotics

16 Oct 2013

Dr. Bo Yuan, Professor

Department of Computer Science and Engineering

Shanghai Jiaotong University

boyuan@sjtu.edu.cn

Review of Lecture One

Overview of AI

Knowledge-based rules in logics (expert systems, automata, …): Symbolism in logics

Kernel-based heuristics (neural networks, SVM, …): Connection for nonlinearity

Learning and inference (Bayesian, Markovian, …): To sparsely sample for convergence

Interactive and stochastic computing (uncertainty, heterogeneity): To overcome the limits of the Turing Machine

Course Content

Focus mainly on learning and inference

Discuss current problems and research efforts

Perception and behavior (vision, robotics, NLP, bionics, …) not included

Exam

Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)

Course materials

Today’s Content

Overview of machine learning

Linear regression

Least square fit

The normal equation

Applications

Basic Terminologies

x = input variables/features

y = output variables/target variables

(x, y) = a training example; the i-th training example = (x^(i), y^(i))

m = number of training examples (i = 1, …, m)

n = number of input variables/features (j = 0, …, n)

h(x) = hypothesis/function/model that outputs the predicted value for a given input x

θ = parameters/weights, which parameterize the mapping of x to its predicted value, thus

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$

We define x_0 = 1 (the intercept term), and are thus able to use a matrix representation:

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$

The cost function is defined as:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Using the matrix X, whose i-th row is (x^(i))^T, to represent the training samples with respect to θ:

$$J(\theta) = \frac{1}{2} \left( X\theta - \vec{y} \right)^T \left( X\theta - \vec{y} \right)$$
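The summation form and the matrix form of the cost function are the same quantity; a quick numerical check on made-up numbers (all values here are illustrative assumptions):

```python
import numpy as np

# Check (on made-up data) that the two forms of the cost function agree:
# J = 1/2 * sum_i (h(x_i) - y_i)^2 = 1/2 * (X theta - y)^T (X theta - y).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # rows are (x^(i))^T, x0 = 1
y = np.array([0.5, 1.0, 3.0])
theta = np.array([0.2, 1.1])

J_sum = 0.5 * sum((X[i] @ theta - y[i]) ** 2 for i in range(len(y)))
J_mat = 0.5 * (X @ theta - y) @ (X @ theta - y)
print(J_sum, J_mat)  # identical up to floating point
```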

Gradient descent is based on the partial derivatives with respect to θ:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

The (batch) algorithm is therefore: Loop {

$$\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

} (for every j)

There is an alternative way to iterate, called stochastic gradient descent, which updates θ using one training example at a time:

$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
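Both update rules can be sketched in a few lines of NumPy; the data, learning rate, and iteration counts below are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Made-up data: y is approximately 1 + 2x; first column of X is the intercept x0 = 1.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
y = 1.0 + 2.0 * X[:, 1] + 0.01 * rng.standard_normal(50)

alpha = 0.1  # learning rate

# Batch gradient descent: each step sums the error over all m examples
# (scaled by 1/m here for a stable step size).
theta = np.zeros(2)
for _ in range(2000):
    theta -= alpha * X.T @ (X @ theta - y) / len(y)

# Stochastic gradient descent: update theta after every single example.
theta_sgd = np.zeros(2)
for _ in range(200):
    for i in range(len(y)):
        theta_sgd += alpha * (y[i] - X[i] @ theta_sgd) * X[i]

print(theta, theta_sgd)  # both should land near [1, 2]
```

The batch rule follows the exact gradient; the stochastic rule is noisier per step but starts making progress immediately, which matters when m is large.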

Normal Equation

An explicit way to obtain θ directly

The optimization problem solved by the Normal Equation:

$$\nabla_\theta J(\theta) = X^T X \theta - X^T \vec{y}$$

We set the derivatives to zero, and obtain the normal equations:

$$X^T X \theta = X^T \vec{y} \quad\Longrightarrow\quad \theta = \left( X^T X \right)^{-1} X^T \vec{y}$$
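As a minimal sketch (with made-up, exactly linear data so the fit is exact), the normal equation is one linear solve:

```python
import numpy as np

# Solve X^T X theta = X^T y on data generated by y = 1 + 2x.
X = np.column_stack([np.ones(5), np.arange(5.0)])  # intercept column plus x
y = 1.0 + 2.0 * np.arange(5.0)

# np.linalg.solve is preferred over forming the explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [1., 2.]
```

No learning rate and no iteration, but the solve costs roughly O(n^3) in the number of features, so gradient descent wins when n is large.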

Today’s Content

Linear Regression

Locally Weighted Regression (an adaptive method)

Probabilistic Interpretation

Maximum Likelihood Estimation vs. Least Squares (Gaussian Distribution)

Classification by Logistic Regression

LMS Updating

A Perceptron-based Learning Algorithm

Linear Regression

1. Number of features

2. Over-fitting and under-fitting issues

3. Feature selection problem (to be covered later)

4. Some definitions:

Parametric learning (a fixed set of θ; n being constant)

Non-parametric learning (the number of θ grows linearly with m)

Locally-Weighted Regression (Loess/Lowess Regression) is non-parametric

A bell-shaped weighting (not a Gaussian density):

$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)$$

Every time, you need to use the entire training data set to fit for a given input in order to predict its output (high computational complexity)
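The scheme above can be sketched as follows; the function name and the bandwidth tau are illustrative assumptions:

```python
import numpy as np

# Locally weighted regression sketch: each prediction solves its own weighted
# least-squares problem, with bell-shaped weights
# w_i = exp(-(x_i - x)^2 / (2 tau^2)) centered on the query point.
def lwr_predict(X, y, x_query, tau=0.5):
    w = np.exp(-((X[:, 1] - x_query[1]) ** 2) / (2.0 * tau ** 2))
    W = np.diag(w)  # weight matrix for this particular query only
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

X = np.column_stack([np.ones(9), np.linspace(-2.0, 2.0, 9)])
y = X[:, 1] ** 2  # a nonlinear target that one global line cannot fit

pred = lwr_predict(X, y, np.array([1.0, 0.0]))
print(pred)  # a local linear fit around x = 0
```

Note that the whole training set is revisited for every query, which is exactly the computational cost the slide warns about.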

Extension of Linear Regression

Linear (straight-line): x_1 = 1, x_2 = x

Polynomial: x_1 = 1, x_2 = x, …, x_n = x^(n−1)

Chebyshev Orthogonal Polynomial: x_1 = 1, x_2 = x, …, x_n = 2x·x_(n−1) − x_(n−2)

Fourier Trigonometric Polynomial: x_1 = 0.5, followed by sin and cos of different frequencies of x

Pairwise Interaction: linear terms + x_(k1)·x_(k2) (k1, k2 = 1, …, n)

The central problem underlying these representations is whether or not the optimization process for θ remains convex.
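Whichever basis is chosen, the model stays linear in θ, so least squares over the expanded design matrix remains convex. A plain polynomial basis as a sketch (the data here are made up for illustration):

```python
import numpy as np

# Expand a scalar input into a degree-3 polynomial design matrix and fit by
# least squares; the problem is still linear (and convex) in theta.
x = np.linspace(-1.0, 1.0, 21)
X = np.column_stack([x ** k for k in range(4)])  # columns: 1, x, x^2, x^3
y = 1.0 - 2.0 * x + 3.0 * x ** 3                 # noiseless cubic target

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # recovers the generating coefficients [1, -2, 0, 3]
```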

Probabilistic Interpretation

Why Ordinary Least Squares (OLS)? Why not other power terms?

Assume

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$

The PDF for a Gaussian is

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$

This implies that

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$

Or, $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$, where $\epsilon^{(i)}$ = random noise, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$

Why Gaussian for random variables? The Central Limit Theorem.

Consider that the training data are stochastic

Assume the $\epsilon^{(i)}$ are i.i.d. (independently and identically distributed)
(independently identically distributed)

The likelihood L(θ) = the probability of the y given the x, parameterized by θ:

$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)$$

What is Maximum Likelihood Estimation (MLE)?

Choose the parameters θ to maximize the function L(θ), so as to make the training data set as probable as possible;

We speak of the likelihood L(θ) of the parameters, and of the probability of the data.

Maximum Likelihood (updated)

The Equivalence of MLE and OLS

Maximizing the log-likelihood ℓ(θ) = log L(θ) reduces to minimizing

$$\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 = J(\theta) \;!?$$
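Written out step by step, the log-likelihood under the Gaussian noise model makes the equivalence explicit (a standard derivation restating the equations above):

```latex
\begin{aligned}
\ell(\theta) &= \log L(\theta)
  = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \\
 &= m \log \frac{1}{\sqrt{2\pi}\,\sigma}
    - \frac{1}{\sigma^2} \cdot \frac{1}{2}
      \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
\end{aligned}
```

Since the first term and σ do not depend on θ, maximizing ℓ(θ) is exactly minimizing J(θ), i.e. ordinary least squares.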

Sigmoid (Logistic) Function

Other functions that smoothly increase from 0 to 1 can also be found, but for a couple of good reasons (we will see next time, with Generalized Linear Models) the choice of the logistic function is a natural one:

$$g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = g(\theta^T x)$$

Recall the gradient ascent update (note the positive sign rather than negative):

$$\theta_j := \theta_j + \alpha \frac{\partial}{\partial \theta_j} \ell(\theta)$$

Let's work with just one training example (x, y) to derive the rule:

One Useful Property of the Logistic Function

$$g'(z) = g(z)\left( 1 - g(z) \right)$$

Identical to Least Squares Again?

$$\theta_j := \theta_j + \alpha \left( y - h_\theta(x) \right) x_j$$
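The LMS-like form of the logistic update can be sketched directly; the toy dataset and step size here are made-up assumptions:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); its derivative is g'(z) = g(z) * (1 - g(z))
    return 1.0 / (1.0 + np.exp(-z))

# Gradient *ascent* on the log-likelihood (hence the plus sign). The update
# theta_j += alpha * (y - h(x)) * x_j has the same form as the LMS rule,
# except that h is now the sigmoid of theta^T x.
X = np.column_stack([np.ones(6), np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

alpha, theta = 0.1, np.zeros(2)
for _ in range(1000):
    theta += alpha * X.T @ (y - sigmoid(X @ theta))

labels = (sigmoid(X @ theta) > 0.5).astype(float)
print(theta, labels)  # the learned boundary separates the two classes
```

Despite the identical form, this is not least squares: h is nonlinear, and the rule comes from maximizing the Bernoulli log-likelihood, as the next lecture's Generalized Linear Models make precise.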