Foundations of Machine Learning
Lecture 1
Mehryar Mohri
Courant Institute and Google Research
mohricims.nyu.edu
Logistics
Prerequisites: basics in linear algebra, probability, and analysis of algorithms.

Workload: about 3-4 homework assignments + project (topic of your choice).

Mailing list: join as soon as possible.
Course Material
Textbook.

Slides: course web page.
http://www.cs.nyu.edu/~mohri/ml13/

Introduction to Machine Learning
Machine Learning
Definition: computational methods using experience to improve performance (typically to make accurate predictions).

Experience: data-driven task (thus statistics, probability).

Example: use height and weight to predict gender.

Computer science: need to design efficient and accurate algorithms, analysis of complexity, theoretical guarantees.
Examples of Learning Tasks
Optical character recognition.
Text or document classification, spam detection.
Morphological analysis, part-of-speech tagging,
statistical parsing.
Speech recognition, speech synthesis, speaker
verification.
Image recognition, face recognition.
Examples of Learning Tasks
Fraud detection (credit card, telephone), network
intrusion.
Games (chess, backgammon).
Unassisted control of a vehicle (robots, navigation).
Medical diagnosis.
Some Broad Areas of ML
Classification: assign a category to each object (e.g., document classification; note: the number of categories may be infinite in some difficult tasks).

Regression: predict a real value for each object (prediction of stock values, economic variables).

Ranking: order objects according to some criterion (relevant web pages returned by a search engine).
Some Broad Areas of ML
Clustering: partition data into 'homogeneous' regions (analysis of very large data sets).

Dimensionality reduction: find a lower-dimensional manifold preserving some properties of the data (computer vision).
Objectives of Machine Learning
Algorithms: design of efficient, accurate, and general learning algorithms to

deal with large-scale problems.

make accurate predictions (unseen examples).

handle a variety of different learning problems.

Theoretical questions:

what can be learned? Under what conditions?

what learning guarantees can be given?

what is the algorithmic complexity?
This Course
Algorithms: main mathematically well-studied ones (e.g., SVMs, kernel methods, boosting, on-line learning, reinforcement learning, learning automata).

Theoretical foundations:

analysis of algorithms.

generalization bounds.

Applications:

illustration of the use of these algorithms.
Topics
Probability, general bounds
PAC learning model, error bounds, VC-dimension, bounds on sample complexity
Support vector machines (SVMs), Perceptron
Kernel methods
Boosting, generalization error, margin
On-line learning, halving algorithm, weighted majority algorithm, mistake bounds
Ranking problems and algorithms
Empirical evaluation, confidence intervals, comparison of learning algorithms
Learning automata and transducers
Reinforcement learning
Definitions and Terminology
Example: an object, instance of the data used.
Features: the set of attributes, often represented as
a vector, associated to an example (e.g., height and
weight for gender prediction).
Labels: in classification, the category associated to an object (e.g., positive or negative in binary classification); in regression, a real value.

Training data: data used for training the learning algorithm (often labeled data).
Definitions and Terminology
Test data: data used for testing the learning algorithm (unlabeled data).
Unsupervised learning: no labeled data.
Supervised learning: uses labeled data.
Semi-supervised learning and transduction:
intermediate scenarios.
Example - SPAM Detection
Problem: classify each e-mail message as SPAM or non-SPAM (binary classification problem).

Potential data: large collection of SPAM and non-SPAM messages (labeled examples).
Example - SPAM Detection
Learning stages:

divide labeled collection into training and test data.

associate relevant features to examples (e.g., presence or absence of some sequences; importance of prior knowledge).

use training data and features to train machine learning algorithm.

predict labels of examples in test data, evaluate algorithm.
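As a concrete illustration (not from the slides), these four stages might look as follows in Python; the use of scikit-learn, the tiny message collection, and the word-count features are illustrative assumptions, not part of the course material.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical labeled collection: (message, label) pairs, label 1 = SPAM.
messages = ["win a free prize now", "meeting rescheduled to 3pm",
            "cheap loans, act now!!!", "lecture notes attached"]
labels = [1, 0, 1, 0]

# 1. Divide the labeled collection into training and test data.
train_msgs, test_msgs, y_train, y_test = train_test_split(
    messages, labels, test_size=0.5, stratify=labels, random_state=0)

# 2. Associate features to examples (here: word-count vectors).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_msgs)
X_test = vectorizer.transform(test_msgs)

# 3. Use the training data and features to train a learning algorithm.
classifier = MultinomialNB().fit(X_train, y_train)

# 4. Predict labels of the test examples and evaluate the algorithm.
print(accuracy_score(y_test, classifier.predict(X_test)))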
Generalization
Complex rules can be poor predictors:

very large hypothesis set.

very complex separation surfaces.
Generalization:

typically not minimizing error on the training set.

not memorization.
Probability Review
Probabilistic Model
Sample space $\Omega$: set of all outcomes or elementary events possible in a trial, e.g., casting a die or tossing a coin.

Event $A \subseteq \Omega$: subset of the sample space. The set of all events must be closed under complementation and countable union and intersection.

Probability distribution: mapping $\Pr$ from the set of all events to $[0, 1]$ such that $\Pr[\Omega] = 1$ and, for all mutually exclusive events,
$\Pr[A_1 \cup \dots \cup A_n] = \sum_{i=1}^{n} \Pr[A_i]$.
Random Variables
Definition: a random variable $X$ is a function $X\colon \Omega \to \mathbb{R}$ such that, for any interval $I$, the subset $\{A : X(A) \in I\}$ of the sample space is an event. Such a function is said to be measurable.

Example: the sum of the values obtained when casting a die.

Probability mass function of a random variable $X$: the function
$f\colon x \mapsto f(x) = \Pr[X = x]$.

Joint probability mass function of $X$ and $Y$: the function
$f\colon (x, y) \mapsto f(x, y) = \Pr[X = x \wedge Y = y]$.
Conditional Probability and Independence

Conditional probability of event $A$ given $B$:
$\Pr[A \mid B] = \dfrac{\Pr[A \cap B]}{\Pr[B]}$, when $\Pr[B] \neq 0$.

Independence: two events $A$ and $B$ are independent when
$\Pr[A \cap B] = \Pr[A]\,\Pr[B]$.
Equivalently, when $\Pr[A \mid B] = \Pr[A]$, with $\Pr[B] \neq 0$.
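As a small added illustration (not from the slides), both characterizations of independence can be checked by exhaustive enumeration of two fair dice; the choice of events is arbitrary.

from itertools import product
from fractions import Fraction

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = set(product(range(1, 7), repeat=2))
A = {w for w in outcomes if w[0] % 2 == 0}   # first die is even
B = {w for w in outcomes if w[1] >= 5}       # second die is 5 or 6

def pr(event):
    return Fraction(len(event), len(outcomes))

assert pr(A & B) == pr(A) * pr(B)    # Pr[A ∩ B] = Pr[A] Pr[B]
assert pr(A & B) / pr(B) == pr(A)    # equivalently, Pr[A | B] = Pr[A]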
Some Probability Formulae
Sum rule:
$\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B]$.

Union bound:
$\Pr\big[\bigcup_{i=1}^{n} A_i\big] \leq \sum_{i=1}^{n} \Pr[A_i]$.

Bayes formula:
$\Pr[X \mid Y] = \dfrac{\Pr[Y \mid X]\,\Pr[X]}{\Pr[Y]}$  $(\Pr[Y] \neq 0)$.
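A quick added sanity check of the sum rule and the union bound, by enumeration on a single fair die; the events A and B are arbitrary illustrative choices.

from fractions import Fraction

omega = set(range(1, 7))                 # outcomes of one fair die
A = {x for x in omega if x % 2 == 0}     # even outcome
B = {x for x in omega if x >= 4}         # outcome at least 4

def pr(event):
    return Fraction(len(event), len(omega))

# Sum rule holds with equality; the union bound drops the intersection term.
assert pr(A | B) == pr(A) + pr(B) - pr(A & B)
assert pr(A | B) <= pr(A) + pr(B)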
Application - Maximum a Posteriori
Formulation: hypothesis set $H$,
$\hat{h} = \operatorname{argmax}_{h \in H} \Pr[h \mid O] = \operatorname{argmax}_{h \in H} \dfrac{\Pr[O \mid h]\,\Pr[h]}{\Pr[O]} = \operatorname{argmax}_{h \in H} \Pr[O \mid h]\,\Pr[h]$.

Example: determine if a patient has a rare disease $d$, given a laboratory test $O$, with $H = \{d, nd\}$ and $O = \{pos, neg\}$.

With $\Pr[d] = .005$, $\Pr[pos \mid d] = .98$, $\Pr[neg \mid nd] = .95$, if the test is positive, what should be the diagnosis?

$\Pr[pos \mid d]\,\Pr[d] = .98 \times .005 = .0049$.
$\Pr[pos \mid nd]\,\Pr[nd] = (1 - .95) \times (1 - .005) = .04975 > .0049$.

Thus the maximum a posteriori diagnosis is $nd$: the disease is so rare that, despite the positive test, "no disease" remains the more probable hypothesis.
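The same comparison, spelled out in a few lines of Python (an added sketch mirroring the numbers above).

p_d = 0.005                              # Pr[d]: prior probability of the disease
p_pos_given_d = 0.98                     # Pr[pos | d]
p_neg_given_nd = 0.95                    # Pr[neg | nd]

# Unnormalized posteriors Pr[pos | h] Pr[h]; Pr[O] cancels in the argmax.
score_d = p_pos_given_d * p_d                    # 0.0049
score_nd = (1 - p_neg_given_nd) * (1 - p_d)      # 0.04975

print("diagnosis:", "d" if score_d > score_nd else "nd")   # -> nd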
More Probability Formulae
Chain rule:
$\Pr\big[\bigcap_{i=1}^{n} X_i\big] = \Pr[X_1]\,\Pr[X_2 \mid X_1]\,\Pr[X_3 \mid X_1 \cap X_2] \cdots \Pr\big[X_n \mid \bigcap_{i=1}^{n-1} X_i\big]$.

Theorem of total probability: assume that
$\Omega = A_1 \cup A_2 \cup \dots \cup A_n$, with $A_i \cap A_j = \emptyset$ for $i \neq j$;
then, for any event $B$,
$\Pr[B] = \sum_{i=1}^{n} \Pr[B \mid A_i]\,\Pr[A_i]$.
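An added enumeration check of the theorem of total probability for one fair die; the partition and the event B are arbitrary illustrative choices.

from fractions import Fraction

omega = set(range(1, 7))
partition = [{1, 2}, {3, 4}, {5, 6}]     # A_1, A_2, A_3: disjoint, cover omega
B = {2, 4, 6}                            # event: the outcome is even

def pr(event):
    return Fraction(len(event), len(omega))

def pr_cond(event, given):
    return Fraction(len(event & given), len(given))

assert pr(B) == sum(pr_cond(B, A) * pr(A) for A in partition)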
Expectation
Definition: the expectation (or mean) of a random variable $X$ is
$\operatorname{E}[X] = \sum_{x} x \Pr[X = x]$.

Properties:

linearity: $\operatorname{E}[aX + bY] = a\operatorname{E}[X] + b\operatorname{E}[Y]$.

if $X$ and $Y$ are independent, $\operatorname{E}[XY] = \operatorname{E}[X]\operatorname{E}[Y]$.
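Both properties can be verified exhaustively for two independent fair dice (an added sketch; the constants a and b are arbitrary).

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs (x, y)

def E(f):
    # exact expectation of f(X, Y) under the uniform distribution on outcomes
    return sum(Fraction(f(x, y)) for x, y in outcomes) / len(outcomes)

a, b = 2, 3
assert E(lambda x, y: a*x + b*y) == a * E(lambda x, y: x) + b * E(lambda x, y: y)
assert E(lambda x, y: x*y) == E(lambda x, y: x) * E(lambda x, y: y)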
Markov’s Inequality
Theorem: let $X$ be a non-negative random variable with $\operatorname{E}[X] < \infty$; then, for all $t > 0$,
$\Pr\big[X \geq t\operatorname{E}[X]\big] \leq \dfrac{1}{t}$.

Proof:
$\Pr[X \geq t\operatorname{E}[X]] = \sum_{x \geq t\operatorname{E}[X]} \Pr[X = x] \leq \sum_{x \geq t\operatorname{E}[X]} \dfrac{x}{t\operatorname{E}[X]} \Pr[X = x] \leq \sum_{x} \dfrac{x}{t\operatorname{E}[X]} \Pr[X = x] = \operatorname{E}\Big[\dfrac{X}{t\operatorname{E}[X]}\Big] = \dfrac{1}{t}$.
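An added empirical illustration: for a non-negative random variable, here assumed Exponential(1) so that E[X] = 1, the observed frequency of X ≥ t E[X] stays below the Markov bound 1/t.

import random

random.seed(0)
mean = 1.0                                    # E[X] for Exponential(1)
samples = [random.expovariate(1.0) for _ in range(100_000)]

t = 3.0
freq = sum(x >= t * mean for x in samples) / len(samples)
print(freq, "<=", 1 / t)                      # observed tail frequency vs. bound 1/t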
Borel-Cantelli Lemma
Lemma: let $(E_n)_{n \in \mathbb{N}}$ be a sequence of events such that $\sum_{n=0}^{\infty} \Pr[E_n] < \infty$. Then, almost surely, only finitely many of these events occur.

Proof: by Markov's inequality, for any $t > 0$,
$\Pr\Big[\sum_{n=0}^{\infty} 1_{E_n} \geq t\Big] \leq \dfrac{1}{t} \sum_{n=0}^{\infty} \Pr[E_n]$.
Letting $t \to \infty$ shows that $\sum_{n=0}^{\infty} 1_{E_n} < \infty$ almost surely.
Variance
Definition: the variance of a random variable $X$ is
$\operatorname{Var}[X] = \sigma_X^2 = \operatorname{E}\big[(X - \operatorname{E}[X])^2\big]$.
$\sigma_X$ is called the standard deviation of the random variable $X$.

Properties:

$\operatorname{Var}[aX] = a^2 \operatorname{Var}[X]$.

if $X$ and $Y$ are independent, $\operatorname{Var}[X + Y] = \operatorname{Var}[X] + \operatorname{Var}[Y]$.
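As with the expectation, both properties can be checked by enumeration over two independent fair dice (an added sketch; the constant a is arbitrary).

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs (x, y)

def E(f):
    return sum(Fraction(f(x, y)) for x, y in outcomes) / len(outcomes)

def Var(f):
    m = E(f)
    return E(lambda x, y: (f(x, y) - m) ** 2)

a = 3
assert Var(lambda x, y: a * x) == a**2 * Var(lambda x, y: x)
assert Var(lambda x, y: x + y) == Var(lambda x, y: x) + Var(lambda x, y: y)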
Chebyshev’s Inequality
Theorem: let $X$ be a random variable with $\operatorname{Var}[X] < \infty$; then, for all $t > 0$,
$\Pr\big[\,|X - \operatorname{E}[X]| \geq t\sigma_X\big] \leq \dfrac{1}{t^2}$.

Proof: observe that
$\Pr\big[\,|X - \operatorname{E}[X]| \geq t\sigma_X\big] = \Pr\big[(X - \operatorname{E}[X])^2 \geq t^2\sigma_X^2\big]$.
The result follows from Markov's inequality.
Application
Experiment: roll a pair of fair dice $n$ times; can we give a good estimate of the total value of the $n$ rolls?

Mean: $7n$, variance: $\frac{35}{6}n$; thus, by Chebyshev's inequality, the final sum will lie between
$7n - 10\sqrt{\tfrac{35}{6}n}$ and $7n + 10\sqrt{\tfrac{35}{6}n}$
in at least 99% of all experiments. The odds are better than 99 to 1 that the sum will be roughly between $6.976\,\mathrm{M}$ and $7.024\,\mathrm{M}$ after $1\,\mathrm{M}$ rolls.
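The interval above, evaluated numerically for n = 10^6 rolls (an added sketch reproducing the 6.976M and 7.024M figures).

from math import sqrt

n = 1_000_000
mean = 7 * n                     # E[sum of n rolls of a pair of dice]
std = sqrt(35 / 6 * n)           # standard deviation of the sum
half_width = 10 * std            # 10 standard deviations -> Chebyshev bound 1/10^2 = 1%

print(mean - half_width, mean + half_width)   # ~6.976e6 and ~7.024e6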
Weak Law of Large Numbers
Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$; then, for any $\epsilon > 0$,
$\lim_{n \to \infty} \Pr\big[\,|\overline{X}_n - \mu| \geq \epsilon\big] = 0$.

Proof: since the variables are independent,
$\operatorname{Var}[\overline{X}_n] = \sum_{i=1}^{n} \operatorname{Var}\Big[\dfrac{X_i}{n}\Big] = \dfrac{n\sigma^2}{n^2} = \dfrac{\sigma^2}{n}$.
Thus, by Chebyshev's inequality,
$\Pr\big[\,|\overline{X}_n - \mu| \geq \epsilon\big] \leq \dfrac{\sigma^2}{n\epsilon^2}$.
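An added simulation sketch of the statement, using fair die rolls as the i.i.d. variables (so that the common mean is mu = 3.5).

import random

random.seed(0)
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, sum(rolls) / n)     # the empirical mean approaches mu = 3.5 as n grows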