# Machine Learning Lecture 1: Intro + Decision Trees

Artificial Intelligence and Robotics

15 Oct 2013

Moshe Koppel

Slides adapted from Tom Mitchell and from Dan Roth

Textbook: Machine Learning by Tom Mitchell (optional)

Slides will be posted (possibly only after lecture)

50% final exam; 50% HW (mostly final)

Very loosely: We have lots of data and wish to
automatically learn concept definitions in order to
determine if new examples belong to the concept
or not.

INTRODUCTION (CS 446, Fall '10)

## Supervised Learning

Given: examples (x, f(x)) of some unknown function f

Find: a good approximation of f

x provides some representation of the input. The process of mapping a
domain element into a representation is called Feature Extraction.
(Hard; ill-understood; important.)

- x ∈ {0,1}ⁿ or x ∈ ℝⁿ
- f(x) ∈ {−1, +1}: binary classification
- f(x) ∈ {1, 2, ..., k}: multi-class classification
- f(x) ∈ ℝ: regression
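As a concrete sketch of the feature-extraction step, the snippet below maps a raw domain element (an email's text) to a binary vector x ∈ {0,1}ⁿ. The vocabulary and the spam setting are invented for illustration; they are not part of the lecture.

```python
# Hypothetical feature extraction: map a raw string to x in {0,1}^n,
# where x[i] = 1 iff the i-th vocabulary word occurs in the text.
# The vocabulary below is made up for this sketch.

VOCAB = ["free", "meeting", "winner", "deadline", "prize"]

def extract_features(text):
    """Map a domain element (a string) to its representation x."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in VOCAB]

x = extract_features("You are a winner of a free prize")
# x is the representation handed to the learner; f(x) would be its label.
```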


## Supervised Learning: Examples

Disease diagnosis
- x: properties of a patient (symptoms, lab tests)
- f: the disease (or maybe: the recommended therapy)

Part-of-Speech tagging
- x: an English sentence (e.g., "The can will rust")
- f: the part of speech of a word in the sentence

Face recognition
- x: bitmap picture of a person's face
- f: the name of the person (or maybe: a property of the person)

Automatic steering
- x: bitmap picture of the road surface in front of the car
- f: degrees to turn the steering wheel


## A Learning Problem

y = f(x₁, x₂, x₃, x₄), where f is an unknown function.

| Example | x₁ | x₂ | x₃ | x₄ | y |
|---------|----|----|----|----|---|
| 1       | 0  | 0  | 1  | 0  | 0 |
| 2       | 0  | 1  | 0  | 0  | 0 |
| 3       | 0  | 0  | 1  | 1  | 1 |
| 4       | 1  | 0  | 0  | 1  | 1 |
| 5       | 0  | 1  | 1  | 0  | 0 |
| 6       | 1  | 1  | 0  | 0  | 0 |
| 7       | 0  | 1  | 0  | 1  | 0 |

Can you learn this function? What is it?
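One way to make the question concrete is to test a candidate hypothesis against the seven examples. The candidate below, h(x) = x₄ ∧ (x₁ ∨ x₃), is just one of many functions consistent with this data, not "the" answer the slide has in mind.

```python
# Checking one candidate hypothesis against the seven training examples.
# h is consistent with the data but by no means the only such function.

EXAMPLES = [  # (x1, x2, x3, x4, y)
    (0, 0, 1, 0, 0),
    (0, 1, 0, 0, 0),
    (0, 0, 1, 1, 1),
    (1, 0, 0, 1, 1),
    (0, 1, 1, 0, 0),
    (1, 1, 0, 0, 0),
    (0, 1, 0, 1, 0),
]

def h(x1, x2, x3, x4):
    # Candidate: y = x4 AND (x1 OR x3) -- an illustrative guess.
    return int(x4 and (x1 or x3))

consistent = all(h(x1, x2, x3, x4) == y for x1, x2, x3, x4, y in EXAMPLES)
```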


## Hypothesis Space

Complete ignorance: there are 2¹⁶ = 65536 possible functions over four
input features. We can't figure out which one is correct until we've
seen every possible input.

After seven examples we still have 2⁹ possibilities for f.

Is learning possible?

| x₁ | x₂ | x₃ | x₄ | y |
|----|----|----|----|---|
| 1  | 1  | 1  | 1  | ? |
| 0  | 0  | 0  | 0  | ? |
| 1  | 0  | 0  | 0  | ? |
| 1  | 0  | 1  | 1  | ? |
| 1  | 1  | 0  | 0  | 0 |
| 1  | 1  | 0  | 1  | ? |
| 1  | 0  | 1  | 0  | ? |
| 1  | 0  | 0  | 1  | 1 |
| 0  | 1  | 0  | 0  | 0 |
| 0  | 1  | 0  | 1  | 0 |
| 0  | 1  | 1  | 0  | 0 |
| 0  | 1  | 1  | 1  | ? |
| 0  | 0  | 1  | 1  | 1 |
| 0  | 0  | 1  | 0  | 0 |
| 0  | 0  | 0  | 1  | ? |
| 1  | 1  | 1  | 0  | ? |
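The counting argument can be verified by brute force: enumerate every Boolean function on four inputs as a 16-entry truth table and count those that agree with the seven labeled rows. Exactly 2⁹ = 512 survive, one free bit per unseen input.

```python
# Brute-force check of the hypothesis-space count: of the 2^16 Boolean
# functions on four inputs, exactly 2^9 = 512 are consistent with the
# seven labeled examples from the slide.
from itertools import product

INPUTS = list(product([0, 1], repeat=4))  # all 16 possible input rows
LABELED = {
    (0, 0, 1, 0): 0, (0, 1, 0, 0): 0, (0, 0, 1, 1): 1, (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0, (1, 1, 0, 0): 0, (0, 1, 0, 1): 0,
}

count = 0
for truth_table in product([0, 1], repeat=16):  # every Boolean function
    f = dict(zip(INPUTS, truth_table))
    if all(f[x] == y for x, y in LABELED.items()):
        count += 1
```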


## General Strategies for Machine Learning

- Develop limited hypothesis spaces
  - These serve to limit the expressivity of the target models
  - Decide (possibly unfairly) that not every function is possible
- Develop algorithms for finding a hypothesis in our hypothesis space that fits the data
  - And hope that it will generalize well


## Terminology

Target function (concept): the true function f: X → {1, 2, ..., K}.
The possible values of f, {1, 2, ..., K}, are the classes or class labels.

Concept: a Boolean target function. Examples for which f(x) = 1 are
positive examples; those for which f(x) = 0 are negative examples (instances).

Hypothesis: a proposed function h, believed to be similar to f.

Hypothesis space: the space of all hypotheses that can, in principle, be
output by the learning algorithm.

Classifier: a function h, the output of our learning algorithm.

Training examples: a set of examples of the form {(x, f(x))}.


## Representation Step: What's Good?

Learning problem: find a function that best separates the data.

- What function?
- What's best?
- (How to find it?)

A possibility: define the learning problem to be: find a (linear)
function that best separates the data. Linear = linear in the instance
space. x = data representation; w = the classifier:

y = sgn(wᵀx)
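A minimal sketch of classifying with such a linear separator follows. The weight vector is an arbitrary illustrative choice, not one learned from data.

```python
# Predicting with a linear separator y = sgn(w^T x).
# The weights below are arbitrary, chosen only to illustrate the form.

def sgn(z):
    return 1 if z > 0 else -1

def predict(w, x):
    """y = sgn(w . x): the linear classifier from the slide."""
    return sgn(sum(wi * xi for wi, xi in zip(w, x)))

w = [2.0, -1.0, 0.5]          # the classifier (illustrative values)
print(predict(w, [1, 0, 0]))  # w.x = 2.0  -> prints 1
print(predict(w, [0, 1, 0]))  # w.x = -1.0 -> prints -1
```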


## Expressivity

f(x) = sgn(x · w − θ) = sgn(Σᵢ₌₁ⁿ wᵢxᵢ − θ)

Many functions are linear:

- Conjunctions: y = x₁ ∧ x₃ ∧ x₅ is y = sgn(1·x₁ + 1·x₃ + 1·x₅ − 3)
- At least m of n: y = "at least 2 of {x₁, x₃, x₅}" is
  y = sgn(1·x₁ + 1·x₃ + 1·x₅ − 2)

Many functions are not:

- XOR: y = (x₁ ∧ x₂) ∨ (¬x₁ ∧ ¬x₂)
- Non-trivial DNF: y = (x₁ ∧ x₂) ∨ (x₃ ∧ x₄)
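The two encodings above can be checked exhaustively: over every Boolean setting of (x₁, x₃, x₅), the threshold unit with θ = 3 agrees with the conjunction, and with θ = 2 it agrees with "at least 2 of 3". (Here the positive side of sgn is taken as "sum − θ ≥ 0".)

```python
# Exhaustive check that conjunctions and m-of-n functions are linear
# threshold functions, as claimed on the slide.
from itertools import product

def threshold(xs, theta):
    # Output 1 when sum(x_i) - theta >= 0, else 0.
    return int(sum(xs) - theta >= 0)

for xs in product([0, 1], repeat=3):
    x1, x3, x5 = xs
    assert threshold(xs, 3) == int(x1 and x3 and x5)  # conjunction
    assert threshold(xs, 2) == int(sum(xs) >= 2)      # at least 2 of 3
ok = True
```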


## Exclusive-OR (XOR)

(x₁ ∧ x₂) ∨ (¬x₁ ∧ ¬x₂)

In general: a parity function. xᵢ ∈ {0,1};
f(x₁, x₂, ..., xₙ) = 1 iff Σxᵢ is even.

This function is not linearly separable.

(Figure: the four labeled points plotted in the x₁, x₂ plane.)
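The non-separability claim can be supported two ways at once: a brute-force search over a grid of weights and thresholds finds no linear threshold unit that reproduces 2-input parity, and the classic algebraic argument (sketched in the comments) shows none can exist.

```python
# No linear threshold unit h(x) = [w1*x1 + w2*x2 > theta] reproduces
# 2-input parity (f = 1 iff x1 + x2 is even).
from itertools import product

PARITY = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}

def separates(w1, w2, theta):
    return all(int(w1 * x1 + w2 * x2 > theta) == y
               for (x1, x2), y in PARITY.items())

grid = [i / 2 for i in range(-8, 9)]  # -4.0 .. 4.0 in steps of 0.5
found = any(separates(w1, w2, t) for w1, w2, t in product(grid, repeat=3))

# Why the search must fail: f(0,0)=1 forces 0 > theta; f(0,1)=f(1,0)=0
# force w1 <= theta and w2 <= theta, hence w1 + w2 <= 2*theta < theta
# (as theta < 0); but f(1,1)=1 requires w1 + w2 > theta. Contradiction.
```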


## A General Framework for Learning

Goal: predict an unobserved output value y ∈ Y based on an observed
input vector x ∈ X.

Estimate a functional relationship y ~ f(x) from a set {(x, y)ᵢ}, i = 1, ..., n.

Most relevant:
- Classification: y ∈ {0,1} (or y ∈ {1, 2, ..., k})
- (But within the same framework we can also talk about regression, y ∈ ℝ.)

What do we want f(x) to satisfy? We want to minimize the loss (risk):

L(f) = E_{X,Y}([f(x) ≠ y])

where E_{X,Y} denotes the expectation with respect to the true
distribution. Simply: the number of mistakes. [...] is an indicator
function.
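On a finite sample the expectation becomes an average, giving the empirical version of this risk: the fraction of mistakes. The toy hypothesis and sample below are invented for illustration.

```python
# Empirical counterpart of the risk L(f) = E[f(x) != y]: on a finite
# sample, the expectation becomes the fraction of mistakes.
# Hypothesis and sample are made up for this sketch.

def empirical_risk(h, sample):
    """Average of the 0/1 indicator [h(x) != y] over the sample."""
    return sum(int(h(x) != y) for x, y in sample) / len(sample)

h = lambda x: int(x >= 0)                            # toy threshold rule
sample = [(-2, 0), (-1, 0), (0, 1), (1, 1), (2, 0)]  # last point is misclassified
risk = empirical_risk(h, sample)                     # 1 mistake out of 5 -> 0.2
```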


## Summary: Key Issues in Machine Learning

Modeling
- How do we formulate application problems as machine learning problems? How do we represent the data?
- Learning protocols (where are the data and labels coming from?)

Representation
- What are good hypothesis spaces?
- Is there any rigorous way to find these? Any general approach?

Algorithms
- What are good algorithms?
- How do we define success?
- Generalization vs. overfitting
- The computational problem