
CSC 599: Computational
Scientific Discovery

Lecture 4: Machine
Learning and Model Search

Outline

Computational Reasoning in Science, cont'd

Computer Algebra

Bayesian nets

Brief introduction to Artificial Intelligence

Search space and search operators

Newell's model of intelligence

Brief introduction to Machine Learning

Error, precision and accuracy

Overfitting

Computational scientific discovery vs.
Machine Learning

CSD vs ML: The take-home message

Computer Algebra

Forget numbers!

Q:

Have an ungodly amount of algebra to do?

Physics, engineering

A:

Try a Computer Algebra System (CAS)!

For algebraic symbol manipulation

Examples:

Mathematica

Maple

(Compare: Numerical methods & stats packages)

Do "number crunching"

Examples:

Matlab, Mathematica

SAS, SPSS

Bayesian Networks

Idea

Complexity:

Lots of variables

Non-deterministic environment

Simplicity:

Patterns of influence between variables

Bayesian net encodes influence patterns

Example:

Variables:

a)

Prof assigns homework? (true or false)

b)

TA assigns homework? (true or false)

c)

Will your weekend be busy? (true or false)

Bayesian Networks (2)

Example:

pr = prof, ta = TA, b = busy

p(pr) = 0.6          p(-pr) = 0.4

p(ta|pr) = 0.1       p(-ta|pr) = 0.9
p(ta|-pr) = 0.9      p(-ta|-pr) = 0.1

p(b|ta,pr) = 0.99    p(-b|ta,pr) = 0.01
p(b|-ta,pr) = 0.8    p(-b|-ta,pr) = 0.2
p(b|ta,-pr) = 0.9    p(-b|ta,-pr) = 0.1
p(b|-ta,-pr) = 0.1   p(-b|-ta,-pr) = 0.9

Bayesian Networks (3)

P(pr=T | b=T)

= P(b=T, pr=T) / P(b=T)

= sum over ta of P(b=T, ta, pr=T) / sum over ta and pr of P(b=T, ta, pr)

= [ (0.99 * 0.1 * 0.6 = 0.0594, TTT) + (0.8 * 0.9 * 0.6 = 0.432, TFT) ] /

  [ 0.0594 (TTT) + 0.432 (TFT) + 0.324 (TTF) + 0.004 (TFF) ]

= 0.4914 / 0.8194 = 0.599707103
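The same posterior can be checked by brute-force enumeration over the joint distribution. A minimal Python sketch using the CPT values above (the variable and dictionary names are mine, not from the slides):

```python
# Enumeration inference on the 3-node net: pr -> ta, and (pr, ta) -> b.
# Computes P(pr=T | b=T) by summing out ta (numerator) and ta, pr (denominator).

p_pr = {True: 0.6, False: 0.4}
p_ta_given_pr = {True: {True: 0.1, False: 0.9},   # p(ta | pr)
                 False: {True: 0.9, False: 0.1}}  # p(ta | -pr)
p_b_true = {(True, True): 0.99, (False, True): 0.8,
            (True, False): 0.9, (False, False): 0.1}  # p(b=T | ta, pr), keyed (ta, pr)

def joint(pr, ta, b):
    """p(pr, ta, b) factored along the net's edges."""
    pb = p_b_true[(ta, pr)]
    return p_pr[pr] * p_ta_given_pr[pr][ta] * (pb if b else 1 - pb)

num = sum(joint(True, ta, True) for ta in (True, False))
den = sum(joint(pr, ta, True) for pr in (True, False) for ta in (True, False))
print(num / den)  # ~0.5997, matching the hand computation
```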

Bayesian Networks (4)

Q: That's a lot of work! Can't we get the network to simplify things?

A: Yes, d-separation!

Two sets of nodes X, Y are d-separated given Z if, at some middle node M on each path between them:

1. M is in Z and is the middle node of a chain: i --> M --> j

Intuition: if I know M, knowing i doesn't tell me any more about j.

2. M is in Z and is the middle node of a fork: i <-- M --> j

Intuition: if I know the common cause M, knowing the 1st result i doesn't tell me any more about the 2nd result j.

3. M is the middle node of a collider, i --> M <-- j, and M is NOT in Z (and neither are any of its descendants)

Intuition: if I did know i and the common result M, that would justify why I should not believe in j ("explaining away").
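The chain case (1) can be checked numerically on a toy chain i --> M --> j. The probabilities below are made up purely for illustration:

```python
# Toy check of the chain case: once M is known, i carries no extra
# information about j.  All numbers are illustrative, not from the slides.

p_i = {True: 0.3, False: 0.7}
p_M_given_i = {True: {True: 0.8, False: 0.2},
               False: {True: 0.4, False: 0.6}}
p_j_given_M = {True: {True: 0.9, False: 0.1},
               False: {True: 0.2, False: 0.8}}

def joint(i, M, j):
    """p(i, M, j) factored along the chain i -> M -> j."""
    return p_i[i] * p_M_given_i[i][M] * p_j_given_M[M][j]

def p_j_given_iM(i, M):
    """P(j=T | i, M) by normalizing the joint."""
    num = joint(i, M, True)
    return num / (num + joint(i, M, False))

# Conditioned on M=True, j's distribution is the same for both values of i:
print(p_j_given_iM(True, True), p_j_given_iM(False, True))
```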

An A.I. researcher's worldview

Problems are divided into

1.

Those solvable by "algorithms"

Algorithm = do these steps and you are guaranteed to get the answer in a "reasonable time"

Classic examples: searching and sorting

2.

Those that aren't

No way to guarantee you will get an answer (in
polynomial time)

Q: What do you do?

A: Search for one!

A.I. Worldview (2)

Example of an "A.I." problem: Chess

Can you guarantee that you will always win at chess?

Can you guarantee that you will (at least) never lose?

No
?

Well, that makes it interesting!

Compare with Tic-Tac-Toe

You can guarantee that you will never lose

(That's why only children play it)

A.I. Worldview (3)

A.I. paradigm for searching for a solution

Need to search for one:

States:

Configurations of the world

Operators:

Define legal transitions from one state to another

Example:

white knight g1 -> f3

white pawn c2 -> c4

A.I. Worldview (4)

State space (or search space)

Space of states reachable by operators from
initial state

A.I. Worldview (5)

Goal state

One or more states that have the configuration
that you want

In Chess: Checkmate!

A.I. Worldview (6)

A.I. pioneer Allen Newell's view of intelligence

A given level of "intelligence" is achievable with:

a) Lots of knowledge and little search (chess grandmaster)

b) Little knowledge and lots of search ("stupid" program)

c) Some knowledge and some search ("smart" program)

A.I. Worldview (7)

Idea

1. Start at the initial state

2. Apply operators to traverse the search space

3. Hope to arrive at a goal state

4. Issues:

How quickly can you find the answer? (time!)

How much memory do you need? (space!)

How good is your goal state?

Optimal = shortest path?

Optimal = least total arc cost?

A.I. Worldview (8)

Tools

Uninformed search

Depth-first

Breadth-first

Uniform cost (best-first where best = least cost so far)

Iterative-deepening depth-first

Informed search

Heuristic function tells "desirability" of each node

Greedy (best-first where best = least estimated cost to goal)

A* (best-first where best = uniform + greedy)

Search from:

Initial state to goal state(s)

Goal state to initial state(s)

Both directions
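As a concrete instance of one of these tools, here is a minimal uniform-cost search (best-first where best = least cost so far) on a small made-up weighted graph. The graph and function names are illustrative, not from the lecture:

```python
# Uniform-cost search: expand the cheapest frontier node first.
import heapq

def uniform_cost(graph, start, goal):
    """Return (cost, path) of a cheapest path from start to goal, or None."""
    frontier = [(0, start, [start])]   # (cost so far, node, path)
    seen = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, step in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(frontier, (cost + step, nxt, path + [nxt]))
    return None

# Edges as (neighbor, arc cost); 'a'->'c' directly costs 4, but via 'b' only 2.
graph = {'a': [('b', 1), ('c', 4)], 'b': [('c', 1), ('d', 5)], 'c': [('d', 1)]}
print(uniform_cost(graph, 'a', 'd'))  # (3, ['a', 'b', 'c', 'd'])
```

Greedy and A* differ only in the priority used for the heap (estimated cost to goal, or cost so far plus that estimate).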

Machine Learning and A.I.

ML goals

Find some data structure that permits better
performance on some set of problems

Prediction

Conciseness

Some combination thereof

ML methods?

They're "algorithms" (in the A.I. sense)!

1. Stuff in the data

2. Turn the crank

3. O(n^3) later, out comes the answer

ML example: Decision Tree learning

Task: Build a decision tree that predicts a class

Leaves = guessed class

Non-leaf nodes = tests on attribute variables

Each edge to a child represents one or more attribute values

ML example: Decision Tree learning (2)

Approach

Greedy search

1. Use information theory to find the best attribute to split the data on

2. Split the data on that attribute

3. Recursively continue until either:

a) No more attributes to split on (label with the majority class)

b) All instances are in the same class (label with that class)

ML example: Decision Tree learning (3)

A bit of information theory:

C_i = some class value to guess

S = some set of examples

freq(C_i, S) = how many C_i's are in S

size(S) = size of S

Intuition

k choices C_1 .. C_k

How much information is needed to specify one C_i from S?

Not many C_i's (freq near 0)? On average, few bits

Each occurrence costs more than 1 bit, but there are not many occurrences

Lots of C_i's (freq near size(S))? Not many bits

Each occurrence costs less than 1 bit (good default guess)

Some C_i's (freq near size(S)/2)? 1 bit

1 bit each, occurring about half the time

ML example: Decision Tree learning (4)

Probability of choosing one class value from the set:

freq(C_i, S) / size(S)

Information to specify one C_i in S:

-lg( freq(C_i, S) / size(S) ) bits

For expected information, multiply by the class proportions:

info(S) = - sum(i=1 to k): freq(C_i,S)/size(S) * lg( freq(C_i,S)/size(S) )
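The info(S) formula translates directly into code. A small sketch (the function name and count-list representation are mine):

```python
# info(S): expected bits to name the class of a random member of S,
# given the per-class counts freq(C_i, S).
from math import log2

def info(counts):
    total = sum(counts)
    # skip c == 0 terms: 0 * lg(0) is taken as 0
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(info([10, 0]))  # 0.0  (all one class: nothing to specify)
print(info([5, 5]))   # 1.0  (50-50 split: one full bit)
```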

ML example: Decision Tree learning (5)

Let's get an intuition:

Case 1: Every member of S is a C_1, none is a C_2

size(S) = 10, freq(C_1,S) = 10, freq(C_2,S) = 0

Therefore:

info(S) = - sum(i=1,2): freq(C_i,S)/size(S) * lg( freq(C_i,S)/size(S) )

= -[ (10/10) * lg(10/10) ] - [ (0/10) * lg(0/10) ]

= -0 - 0 = 0   (taking 0 * lg(0) = 0)

Intuition:

"If we know that we're dealing with S, then we know that all of its members are in C_1. No need to specify which is C_1 and which is C_2."

ML example: Decision Tree learning (6)

Let's get an intuition (cont'd):

Case 2: Half the members of S are C_1, half are C_2

size(S) = 10, freq(C_1,S) = 5, freq(C_2,S) = 5

Therefore:

info(S) = - sum(i=1,2): freq(C_i,S)/size(S) * lg( freq(C_i,S)/size(S) )

= -[ (5/10) * lg(5/10) ] - [ (5/10) * lg(5/10) ]

= -2 * (0.5 * -1) = 1

Intuition:

"If we know that we're dealing with S, then it's a 50-50 guess which members belong to C_1 and which to C_2. We need to specify which (no compression possible)."

ML example: Decision Tree learning (7)

Recall the plan: select the "best" attribute to partition on

"best" = best separator of the classes

Information gain for some attribute:

gain(attr)

= (ave. info needed to specify a class)
- (ave. info needed to specify a class after partitioning by attr)

= info(T) - info_attr(T)

When info_attr(T) is small, the classes are well separated (big gain!)

where:

n = number of attribute values

T_i = set where all members have the same attribute value v_i

info_attr(T) = sum(i=1 to n): size(T_i)/size(T) * info(T_i)
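gain(attr) can be computed from class counts alone. A sketch (function names are mine), using the outlook counts from the tennis example that follows: sunny 2 yes / 3 no, overcast 4 / 0, rain 3 / 2, with 9 yes / 5 no overall:

```python
# gain(attr) = info(T) - info_attr(T), computed from class counts.
from math import log2

def info(counts):
    """Expected bits to name a class, given per-class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(class_counts, partitions):
    """Information gain of splitting class_counts into the given partitions."""
    total = sum(class_counts)
    split_info = sum(sum(p) / total * info(p) for p in partitions)
    return info(class_counts) - split_info

g = gain([9, 5], [[2, 3], [4, 0], [3, 2]])  # outlook: sunny, overcast, rain
print(round(g, 3))  # ~0.247 unrounded; the slides' 0.246 uses rounded intermediates
```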

ML example: Decision Tree learning (8)

Example data (should we play tennis?)

Outlook    Temp  Humidity  Windy  PlayTennis?
sunny       75     70      true     yes
sunny       80     90      true     no
sunny       85     85      false    no
sunny       72     95      false    no
sunny       69     70      true     yes
overcast    72     90      true     yes
overcast    83     78      false    yes
overcast    64     65      true     yes
overcast    81     75      false    yes
rain        71     80      true     no
rain        65     70      true     no
rain        75     80      false    yes
rain        68     80      false    yes
rain        70     96      false    yes

ML example: Decision Tree learning (9)

info(PlayTennis):

= -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits

info_outlook(PlayTennis):

= 5/14 * ( -2/5 * lg(2/5) - 3/5 * lg(3/5) ) +
  4/14 * ( -4/4 * lg(4/4) - 0/4 * lg(0/4) ) +
  5/14 * ( -3/5 * lg(3/5) - 2/5 * lg(2/5) )

= 0.694 bits

gain(outlook) = 0.940 - 0.694 = 0.246 bits

ML example: Decision Tree learning (10)

info(PlayTennis):

= -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits

info_windy(PlayTennis):

= 6/14 * ( -3/6 * lg(3/6) - 3/6 * lg(3/6) ) +
  8/14 * ( -6/8 * lg(6/8) - 2/8 * lg(2/8) )

= 0.892 bits

gain(windy) = 0.940 - 0.892 = 0.048 bits

gain(outlook) > gain(windy)

Test on outlook!

ML example: Decision Tree learning (11)

Guarding against overfitting:

Cross-validation

We want to use all the data, but using test data to train is cheating

Split the data into k sets:

for (i = 0; i < k; i++)
{
    model = train_with_everything_but(i);
    test_with(model, i);
}
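The loop above, made runnable in Python with a stand-in "model" (the simplest possible learner: predict the majority class of the training labels; the function and variable names are mine):

```python
# k-fold cross-validation over a list of class labels.  The "model" here
# is a placeholder: it just memorizes the majority class of the training folds.
from collections import Counter

def k_fold_accuracy(labels, k):
    folds = [labels[i::k] for i in range(k)]   # k roughly-equal folds
    correct = total = 0
    for i in range(k):
        # train with everything but fold i
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        majority = Counter(train).most_common(1)[0][0]
        # test with fold i
        correct += sum(1 for y in folds[i] if y == majority)
        total += len(folds[i])
    return correct / total

labels = ['yes'] * 9 + ['no'] * 5   # the tennis PlayTennis column
print(k_fold_accuracy(labels, 5))
```

A real learner would replace the majority-class line with training a decision tree on the k-1 training folds.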

Tenets of Machine Learning

Choose appropriate:

Training experience

Ex: Good to have about equal number of cases of each
class, even if some classes are more probable in real
data

Think about how you'll test too!

Target function:

Decision tree? Neural Net?

Representation:

Ex: how much data:

Windy in {true,false} vs. wind_speed in mph

Learning algorithm:

Ex: Greedy search? Genetic algorithm? Back-propagation?

Our Tenets of Scientific
Discovery

1.

Play to computers' strengths:

1.

Speed

2.

Accuracy (fingers crossed)

3.

Don't get bored

Do exhaustive search!

Q: Hey, doesn't that ignore all that A.I. heuristic-function research?

2.

Use background knowledge

Predictive accuracy is not everything!

Revolutionary science ==> ?

What are the Differences?

1.

Background knowledge

CSD values background knowledge

ML considers background knowledge

What are the Differences? (cont)

2. The process of knowledge discovery

The ML process is iterative.

But the CSD process is iterative, and starts all over again.

1
. Exhaustive Search

Tell computers to consider everything!

Search the space systematically:

simplest --> increasingly more complex

Issues:

1.

How do you search systematically?

States: models

Initial state = simplest model

Goal state = solution model

Operators: Go from one model to marginally more
complex

What is "everything"?

Q: With floating-point values, every different coefficient could be a new model (x, x+dx, x+2dx, etc.)

A: Generate the next qualitative state, then use numerical methods to find the best coefficients in that state

2. Background knowledge as inductive bias (1)

Inductive bias is necessary:

N training cases

But the (N+1)st test case could be anything

We want to assume something

Inductive bias = what you've assumed

Common inductive biases in ML:

Minimal cross-validation error (e.g. decision tree learning)

Maximal conditional independence (Bayes nets)

Maximal margin between classes (support vector machines)

Minimal description length (Occam's razor)

Minimal feature usage (Ignore extraneous data)

Same class as nearest neighbor (Locality)

2. Background knowledge as inductive bias (2)

Biases we can add/refine in CSD

1.

Expressible in same language as paradigm?

Resist something "brand new":

Penalty for new objects

Penalty for new attributes

Penalty for new processes

Penalty for new relations/operations (?)

Penalty for new types of assertions (?)

2.

Uses same reasoning as done in paradigm

Penalty for new types of reasoning, even with old
assertions

Q: Does this mean we can never introduce a new thing?

Penalty for new objects: polywater

Polymer: a long molecule in a repetitive chain

Nikolai Fedyakin (1962, USSR)

H2O condensed in and forced through narrow quartz capillary tubes

Measured boiling point, freezing point and viscosity

Similar to syrup

Boris Derjaguin

Popularized the results (Moscow, then UK, 1966)

In the West:

Some could replicate the findings

Some could not

Penalty for new objects: polywater (2)

People were concerned about contamination of the H2O

But precautions were taken against this

Denis Rousseau (Bell Labs)

Did the same tests with his sweat

Easier to believe in an old thing (water + organic pollutants) than a new thing ("polywater")

Penalty for new things: Piltdown Man

Circa 1900: looking for early human fossils

Neanderthals in Germany (1863)

Cro-Magnon in France (1868)

England??

Charles Dawson (1912)

"I was given a skull by workmen at the Piltdown gravel pit"

Later, got skull fragments and a lower jaw

Excavating the Piltdown gravels: Dawson (right), Smith Woodward (center)

Penalty for new things: Piltdown Man (2)

Royal College of Surgeons (soon after discovery)

"Brain looks like modern man"

French paleontologist Marcellin Boule (1915)

"Jaw from ape"

American zoologist Gerrit Smith Miller (1915)

"Jaw from fossil ape"

German anatomist Franz Weidenreich (1923)

"Modern human cranium + orangutan jaw with filed teeth"

Oxford anthropologist Kenneth Page Oakley (1953)

"Skull is medieval human, lower jaw is Sarawak orangutan, teeth are fossil chimpanzee"

Penalty for new attributes: inertial vs. gravitational mass

Inertial mass:

Resistance to motion

m in F = ma

Active gravitational mass:

Ability to attract other masses

M in F = GMm/r^2

Passive gravitational mass:

Ability to be attracted by other masses

m in F = GMm/r^2

Penalty for new attributes (2)

Conceptually, they are three different types of mass

No experiment has ever distinguished between them

People since Newton have tried experiments

Assume they are all the same!

Penalty for new processes: cold fusion

Cold fusion: a novel combination of old processes: catalysis + fusion

Catalysis:

Hard: A + B -> D

Easier (C = catalyst):

A + C -> AC (activated catalyst)

B + AC -> ABC -> CD (the easier reaction)

CD -> C + D (catalyst ready to do another reaction)

Penalty for new processes: cold fusion (2)

Fusion: how it works

Get lots of energy fusing neutron-rich atoms

Need a lot of energy in to get more out

Penalty for new processes: cold fusion (3)

Fusion: Overcoming electrostatic force is hard:

Current technology: need a fission bomb to do it

This is the result:

Penalty for new processes: cold fusion (4)

Martin Fleischmann & Stanley Pons (1989)

"We can do fusion at room temperature!"

(No initiating nuclear bomb needed)

Electrolysis of heavy water (D2O)

"Excess heat" observed

Proposed mechanism:

Pd + D -> Pd-D

Pd-D + D -> D-Pd-D

D-Pd-D -> He-Pd + energy!

He-Pd -> He + Pd

Penalty for new processes: cold fusion (5)

Reported in the New York Times

Instantly a worldwide story among scientists

Replication:

Some can

Others can't

Results:

Energy:

Some get excess energy

Others claim they didn't calibrate/account for everything

Helium:

Not enough observed for the energy said to be produced

(there is background helium in the air)

Ramifications

1.

Science is conservative

Use the current paradigm to guide thinking

2.

Accuracy is not everything

An assertion has to "fit in" to the current model

Be explainable by model

Use same terms as model

ML and CSD?

From ML we can get:

Idea of learning as model search:

Training experience

Target function

Representation

Learning algorithm

Extra considerations for CSD:

Use computers' strengths:

Speed + Accuracy + Don't Get Bored

Simulation + Exhaustive search

Use of background knowledge

Downright conservative about introducing new terms

Not just iterative, never ends