Foundations of Machine Learning
Lecture 1
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Logistics
Prerequisites
: basics in linear algebra, probability,
and analysis of algorithms.
Workload
: about 3-4 homework assignments and a project (topic of your choice).
Mailing list
: join as soon as possible.
Course Material
Textbook
: Introduction to Machine Learning.
Slides: course web page.
http://www.cs.nyu.edu/~mohri/ml13/
Machine Learning
Definition
: computational methods using experience to improve performance (typically to make accurate predictions).
Experience
: data-driven task (thus statistics, probability).
Example
: use height and weight to predict gender.
Computer science
: need to design efficient and accurate algorithms, analysis of complexity, theoretical guarantees.
Examples of Learning Tasks
Optical character recognition.
Text or document classification, spam detection.
Morphological analysis, part-of-speech tagging, statistical parsing.
Speech recognition, speech synthesis, speaker verification.
Image recognition, face recognition.
Examples of Learning Tasks
Fraud detection (credit card, telephone), network intrusion.
Games (chess, backgammon).
Unassisted control of a vehicle (robots, navigation).
Medical diagnosis.
Some Broad Areas of ML
Classification
: assign a category to each object (e.g., document classification; note: the number of categories may be infinite in some difficult tasks).
Regression
: predict a real value for each object (prediction of stock values, economic variables).
Ranking
: order objects according to some criterion (relevant web pages returned by a search engine).
Some Broad Areas of ML
Clustering
: partition data into "homogeneous" regions (analysis of very large data sets).
Dimensionality reduction
: find a lower-dimensional manifold preserving some properties of the data (computer vision).
Objectives of Machine Learning
Algorithms
: design of efficient, accurate, and general learning algorithms to
• deal with large-scale problems.
• make accurate predictions (on unseen examples).
• handle a variety of different learning problems.
Theoretical questions
:
• what can be learned? Under what conditions?
• what learning guarantees can be given?
• what is the algorithmic complexity?
This Course
Algorithms
: main mathematically well-studied ones (e.g., SVMs, kernel methods, boosting, online learning, reinforcement learning, learning automata).
Theoretical foundations
:
• analysis of algorithms.
• generalization bounds.
Applications
:
• illustration of the use of these algorithms.
Topics
Probability, general bounds
PAC learning model, error bounds, VC-dimension, bounds on sample complexity
Support vector machines (SVMs), Perceptron
Kernel methods
Boosting, generalization error, margin
Online learning, halving algorithm, weighted majority algorithm, mistake bounds
Ranking problems and algorithms
Empirical evaluation, confidence intervals, comparison of learning algorithms
Learning automata and transducers
Reinforcement learning
Definitions and Terminology
Example: an object, instance of the data used.
Features: the set of attributes, often represented as a vector, associated with an example (e.g., height and weight for gender prediction).
Labels: in classification, the category associated with an object (e.g., positive or negative in binary classification); in regression, a real value.
Training data: data used for training the learning algorithm (often labeled data).
Definitions and Terminology
Test data: data used for testing the learning algorithm (unlabeled data).
Unsupervised learning: no labeled data.
Supervised learning: uses labeled data.
Semi-supervised learning and transduction: intermediate scenarios.
Example - SPAM Detection
Problem
: classify each email message as SPAM or non-SPAM (binary classification problem).
Potential data
: large collection of SPAM and non-SPAM messages (labeled examples).
Example - SPAM Detection
Learning stages
(a minimal sketch in code follows below):
• divide the labeled collection into training and test data.
• associate relevant features to examples (e.g., presence or absence of some sequences; importance of
prior knowledge
).
• use the training data and features to train the machine learning algorithm.
• predict the labels of the examples in the test data and evaluate the algorithm.
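As an illustration, here is a minimal sketch of these four stages, assuming scikit-learn is available and using a tiny hypothetical corpus; the word-count features and logistic-regression classifier are merely stand-ins for the feature and algorithm choices discussed above.

```python
# A minimal sketch of the four learning stages, on a tiny hypothetical corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

messages = ["win cash now", "meeting at noon", "free prize inside",
            "lunch tomorrow?", "claim your reward", "project deadline"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = SPAM, 0 = non-SPAM

# Stage 1: divide the labeled collection into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=1 / 3, random_state=0)

# Stage 2: associate features to examples (here, word-count vectors).
vectorizer = CountVectorizer()
F_train = vectorizer.fit_transform(X_train)

# Stage 3: train the learning algorithm on the training features.
clf = LogisticRegression().fit(F_train, y_train)

# Stage 4: predict labels on the test data and evaluate.
print(clf.score(vectorizer.transform(X_test), y_test))
```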
Generalization
Complex rules can be poor predictors:
• very large hypothesis set.
• very complex separation surfaces.
Generalization:
• typically not minimizing error on the training set.
• not memorization.
Probability Review
Probabilistic Model
Sample space
: $\Omega$, set of all outcomes or
elementary events
possible in a trial, e.g., casting a die or tossing a coin.
Event
: subset of the sample space. The set of all events must be closed under complementation and countable union and intersection.
Probability distribution
: mapping $\Pr$ from the set of all events to $[0, 1]$ such that $\Pr[\Omega] = 1$ and, for all mutually exclusive events,
$$\Pr[A_1 \cup \cdots \cup A_n] = \sum_{i=1}^{n} \Pr[A_i].$$
Random Variables
Definition
: a
random variable
is a function $X \colon \Omega \to \mathbb{R}$ such that, for any interval $I$, the subset $\{A : X(A) \in I\}$ of the sample space is an event. Such a function is said to be
measurable
.
Example
: the sum of the values obtained when casting a die.
Probability mass function
of random variable $X$: the function $f \colon x \mapsto f(x) = \Pr[X = x]$.
Joint probability mass function
of $X$ and $Y$: the function $f \colon (x, y) \mapsto f(x, y) = \Pr[X = x \wedge Y = y]$.
Conditional Probability and Independence
Conditional probability
of event $A$ given $B$:
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}, \quad \text{when } \Pr[B] \neq 0.$$
Independence
: two events $A$ and $B$ are
independent
when
$$\Pr[A \cap B] = \Pr[A]\,\Pr[B].$$
Equivalently, when $\Pr[A \mid B] = \Pr[A]$, with $\Pr[B] \neq 0$.
Some Probability Formulae
Sum rule
:
$$\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B].$$
Union bound
:
$$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] \leq \sum_{i=1}^{n} \Pr[A_i].$$
Bayes formula
:
$$\Pr[X \mid Y] = \frac{\Pr[Y \mid X]\,\Pr[X]}{\Pr[Y]} \quad (\Pr[Y] \neq 0).$$
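As a quick sanity check, here is a small exact verification of the sum rule and union bound in Python, using one roll of a fair die with the hypothetical events A = "even value" and B = "value at least 5".

```python
# Exact check of the sum rule and union bound for one roll of a fair die,
# with events A = "even value" and B = "value >= 5" ("|" below is set union).
from fractions import Fraction

omega = set(range(1, 7))
A, B = {2, 4, 6}, {5, 6}

def prob(event):
    return Fraction(len(event), len(omega))  # uniform distribution over omega

assert prob(A | B) == prob(A) + prob(B) - prob(A & B)  # sum rule
assert prob(A | B) <= prob(A) + prob(B)                # union bound
print(prob(A | B))  # 2/3
```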
Application - Maximum a Posteriori
Formulation
: hypothesis set $H$,
$$\hat{h} = \operatorname*{argmax}_{h \in H} \Pr[h \mid O] = \operatorname*{argmax}_{h \in H} \frac{\Pr[O \mid h]\,\Pr[h]}{\Pr[O]} = \operatorname*{argmax}_{h \in H} \Pr[O \mid h]\,\Pr[h].$$
Example
: determine if a patient has a rare disease, given a laboratory test: $H = \{d, nd\}$, $O \in \{pos, neg\}$. With $\Pr[d] = .005$, $\Pr[pos \mid d] = .98$, and $\Pr[neg \mid nd] = .95$, if the test is positive, what should be the diagnosis?
$$\Pr[pos \mid d]\,\Pr[d] = .98 \times .005 = .0049.$$
$$\Pr[pos \mid nd]\,\Pr[nd] = (1 - .95) \times (1 - .005) = .04975 > .0049.$$
Thus, despite the positive test, the MAP diagnosis is no disease.
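The same comparison takes a few lines of Python, mirroring the numbers above (the dictionary names are just illustrative):

```python
# MAP diagnosis for a positive test: compare Pr[pos | h] Pr[h] across h in H.
priors = {"d": 0.005, "nd": 0.995}
likelihood_pos = {"d": 0.98, "nd": 1 - 0.95}  # Pr[pos | d], Pr[pos | nd]

scores = {h: likelihood_pos[h] * priors[h] for h in priors}
print(scores)                       # {'d': 0.0049, 'nd': ~0.04975}
print(max(scores, key=scores.get))  # 'nd': MAP favors "no disease"
```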
More Probability Formulae
Chain rule
:
$$\Pr\Big[\bigcap_{i=1}^{n} X_i\Big] = \Pr[X_1]\,\Pr[X_2 \mid X_1]\,\Pr[X_3 \mid X_1 \cap X_2] \cdots \Pr\Big[X_n \,\Big|\, \bigcap_{i=1}^{n-1} X_i\Big].$$
Theorem of total probability
: assume that $\Omega = A_1 \cup A_2 \cup \cdots \cup A_n$, with $A_i \cap A_j = \emptyset$ for $i \neq j$; then, for any event $B$,
$$\Pr[B] = \sum_{i=1}^{n} \Pr[B \mid A_i]\,\Pr[A_i].$$
Expectation
Definition
: the
expectation
(or
mean
) of a random variable $X$ is
$$\mathrm{E}[X] = \sum_{x} x \Pr[X = x].$$
Properties
:
• linearity: $\mathrm{E}[aX + bY] = a\,\mathrm{E}[X] + b\,\mathrm{E}[Y]$.
• if $X$ and $Y$ are independent, $\mathrm{E}[XY] = \mathrm{E}[X]\,\mathrm{E}[Y]$.
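For instance, the definition can be evaluated exactly for a fair die; a small sketch (the pmf dictionary is illustrative):

```python
# Expectation of a fair die via the definition E[X] = sum_x x Pr[X = x].
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair six-sided die
mean = sum(x * p for x, p in pmf.items())
print(mean)  # 7/2
```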
Markov’s Inequality
Theorem
: let $X$ be a non-negative random variable with $\mathrm{E}[X] < \infty$; then, for all $t > 0$,
$$\Pr[X \geq t\,\mathrm{E}[X]] \leq \frac{1}{t}.$$
Proof
:
$$\Pr[X \geq t\,\mathrm{E}[X]] = \sum_{x \geq t\,\mathrm{E}[X]} \Pr[X = x] \leq \sum_{x \geq t\,\mathrm{E}[X]} \frac{x}{t\,\mathrm{E}[X]} \Pr[X = x] \leq \sum_{x} \frac{x}{t\,\mathrm{E}[X]} \Pr[X = x] = \mathrm{E}\Big[\frac{X}{t\,\mathrm{E}[X]}\Big] = \frac{1}{t}.$$
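A quick Monte Carlo sanity check of the bound, using an exponential variable as a hypothetical non-negative example:

```python
# Monte Carlo check of Markov's inequality, Pr[X >= t E[X]] <= 1/t, for a
# non-negative variable: here, an exponential with mean E[X] = 1.
import random

random.seed(0)
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
mean = sum(samples) / n

for t in (2, 5, 10):
    empirical = sum(x >= t * mean for x in samples) / n
    print(f"t={t}: Pr[X >= t E[X]] ~ {empirical:.4f} <= 1/t = {1 / t:.4f}")
```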
Borel-Cantelli Lemma
Lemma
: let $(E_n)_{n \in \mathbb{N}}$ be a sequence of events such that $\sum_{n=0}^{\infty} \Pr[E_n] < \infty$. Then, almost surely, only finitely many of these events occur.
Proof
: by Markov's inequality, for any $t > 0$,
$$\Pr\Big[\sum_{n=0}^{\infty} 1_{E_n} \geq t\Big] \leq \frac{1}{t} \sum_{n=0}^{\infty} \Pr[E_n].$$
• Letting $t \to \infty$ shows that $\sum_{n=0}^{\infty} 1_{E_n} < \infty$ almost surely.
Variance
Definition
: the
variance
of a random variable $X$ is
$$\mathrm{Var}[X] = \sigma_X^2 = \mathrm{E}[(X - \mathrm{E}[X])^2].$$
$\sigma_X$ is called the
standard deviation
of the random variable $X$.
Properties
:
• $\mathrm{Var}[aX] = a^2\,\mathrm{Var}[X]$.
• if $X$ and $Y$ are independent, $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.
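Continuing the fair-die sketch from the expectation slide, the variance works out to 35/12 per die and, by the independence property, to 35/6 for a pair of dice; that value reappears in the dice application below.

```python
# Variance of a fair die, Var[X] = E[(X - E[X])^2] = 35/12, and of a pair of
# independent dice, Var[X + Y] = Var[X] + Var[Y] = 35/6.
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())               # 7/2
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # 35/12
print(var, 2 * var)  # 35/12 and 35/6
```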
Chebyshev’s Inequality
Theorem
: let $X$ be a random variable with $\mathrm{Var}[X] < \infty$; then, for all $t > 0$,
$$\Pr[|X - \mathrm{E}[X]| \geq t\,\sigma_X] \leq \frac{1}{t^2}.$$
Proof
: observe that
$$\Pr[|X - \mathrm{E}[X]| \geq t\,\sigma_X] = \Pr[(X - \mathrm{E}[X])^2 \geq t^2 \sigma_X^2].$$
The result follows from Markov's inequality.
Application
Experiment
: roll a pair of fair dice $n$ times; can we give a good estimate of the total value of the rolls?
Mean
: $7n$;
variance
: $\frac{35}{6}n$. Thus, by Chebyshev's inequality, the final sum will lie between $7n - 10\sqrt{\frac{35}{6}n}$ and $7n + 10\sqrt{\frac{35}{6}n}$ in at least
99%
of all experiments. The odds are better than
99
to
1
that the sum will be roughly between $6.976M$ and $7.024M$ after $1M$ rolls.
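A short sketch that computes this Chebyshev interval and tests it empirically (with a smaller $n$ for speed; the 99% guarantee corresponds to $t = 10$):

```python
# Chebyshev interval for the sum of n rolls of a pair of fair dice:
# mean 7n, variance (35/6) n; with t = 10 the sum lies within
# 7n +/- 10 * sqrt(35n/6) with probability at least 1 - 1/t^2 = 99%.
import math
import random

def chebyshev_interval(n):
    mean, sigma = 7 * n, math.sqrt(35 / 6 * n)
    return mean - 10 * sigma, mean + 10 * sigma

print(chebyshev_interval(10**6))  # ~ (6975848, 7024152): the 6.976M-7.024M range

# Empirical check with a smaller, faster n.
random.seed(0)
n, trials = 10_000, 200
lo, hi = chebyshev_interval(n)
inside = sum(
    lo <= sum(random.randint(1, 6) + random.randint(1, 6) for _ in range(n)) <= hi
    for _ in range(trials)
)
print(f"{inside}/{trials} sums inside the 10-sigma interval")  # typically all
```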
Weak Law of Large Numbers
Theorem
: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$; then, for any $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr[|\bar{X}_n - \mu| \geq \epsilon] = 0.$$
Proof
: since the variables are independent,
$$\mathrm{Var}[\bar{X}_n] = \sum_{i=1}^{n} \mathrm{Var}\Big[\frac{X_i}{n}\Big] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$
Thus, by Chebyshev's inequality,
$$\Pr[|\bar{X}_n - \mu| \geq \epsilon] \leq \frac{\sigma^2}{n\epsilon^2}.$$
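A simulation sketch of the law for fair-die rolls (a hypothetical instance with $\mu = 3.5$, $\sigma^2 = 35/12$): the empirical probability of a deviation of at least $\epsilon$ shrinks with $n$, always under the Chebyshev bound.

```python
# Weak law of large numbers for fair-die rolls: the empirical mean
# concentrates around mu = 3.5, with Pr[|mean_n - mu| >= eps] <= var/(n eps^2).
import random

random.seed(0)
mu, var, eps = 3.5, 35 / 12, 0.1
for n in (10, 100, 1_000, 10_000):
    trials = 1_000
    deviations = sum(
        abs(sum(random.randint(1, 6) for _ in range(n)) / n - mu) >= eps
        for _ in range(trials)
    )
    bound = min(1.0, var / (n * eps**2))
    print(f"n={n:>6}: empirical {deviations / trials:.3f} <= bound {bound:.3f}")
```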