
CSC 599: Computational Scientific Discovery

Lecture 4: Machine Learning and Model Search

Outline

- Computational Reasoning in Science, cont'd
  - Computer algebra
  - Bayesian nets
- Brief introduction to Artificial Intelligence
  - Search spaces and search operators
  - Newell's models of intelligence
- Brief introduction to Machine Learning
  - Error, precision and accuracy
  - Overfitting
- Computational scientific discovery vs. Machine Learning
  - Importance of sticking to the paradigm
  - CSD vs. ML: the take-home message

Computer Algebra

Forget numbers!

Q: Have an ungodly amount of algebra to do? (Physics, engineering)

A: Try a Computer Algebra System (CAS)!
- For algebraic symbol manipulation
- Examples: Mathematica, Maple

(Compare: numerical methods & stats packages)
- Do "number crunching"
- Examples: Matlab, Mathematica; SAS, SPSS
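To make "symbol manipulation" concrete, here is a minimal sketch using the open-source SymPy library (my illustration; the slides name Mathematica and Maple):

import sympy as sp

x = sp.symbols('x')
expr = (x + 1)**3

print(sp.expand(expr))        # x**3 + 3*x**2 + 3*x + 1: pure symbols, no data
print(sp.diff(expr, x))       # 3*(x + 1)**2
print(sp.integrate(expr, x))  # an antiderivative, returned symbolically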

Bayesian Networks

Idea:
- Complexity: lots of variables, non-deterministic environment
- Simplicity: patterns of influence between variables
- A Bayesian net encodes the influence patterns

Example variables:
a) Prof assigns homework? (true or false)
b) TA assigns homework? (true or false)
c) Will your weekend be busy? (true or false)


Bayesian Networks (2)

Example: pr = prof, ta = TA, b = busy

p(pr) = 0.6           p(-pr) = 0.4

p(ta|pr)  = 0.1       p(-ta|pr)  = 0.9
p(ta|-pr) = 0.9       p(-ta|-pr) = 0.1

p(b|ta,pr)   = 0.99   p(-b|ta,pr)   = 0.01
p(b|-ta,pr)  = 0.8    p(-b|-ta,pr)  = 0.2
p(b|ta,-pr)  = 0.9    p(-b|ta,-pr)  = 0.1
p(b|-ta,-pr) = 0.1    p(-b|-ta,-pr) = 0.9

Bayesian Networks (3)

P(pr=T | b=T)
  = P(b=T, pr=T) / P(b=T)
  = sum over ta of P(b=T, ta, pr=T) / sum over ta, pr of P(b=T, ta, pr)
  = [(0.99 * 0.1 * 0.6 = 0.0594, TTT) + (0.8 * 0.9 * 0.6 = 0.432, TFT)] /
    [0.0594 (TTT) + 0.432 (TFT) + 0.324 (TTF) + 0.004 (TFF)]
  = 0.4914 / 0.8194
  = 0.599707103

(Each triple labels the values of b, ta, pr.)
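A quick way to check this arithmetic is brute-force enumeration of the three-variable network; a minimal sketch (my code, not from the slides):

p_pr = {True: 0.6, False: 0.4}
p_ta = {True: {True: 0.1, False: 0.9},            # p(ta | pr); outer key is pr
        False: {True: 0.9, False: 0.1}}
p_b = {(True, True): 0.99, (False, True): 0.8,    # p(b=T | ta, pr); key is (ta, pr)
       (True, False): 0.9, (False, False): 0.1}

num = den = 0.0
for pr in (True, False):
    for ta in (True, False):
        joint = p_pr[pr] * p_ta[pr][ta] * p_b[(ta, pr)]  # p(b=T, ta, pr)
        den += joint
        if pr:
            num += joint

print(num / den)  # 0.5997..., matching the hand computation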

Bayesian Networks (4)

Q: That's a lot of work! Can't we get the network to simplify things?

A: Yes, d-separation!

Two sets of nodes X, Y are d-separated given Z if every path between them is blocked by some middle node M in one of three ways:

1. M is in Z and is the middle node of a chain: i --> M --> j
   Intuition: if I know M, knowing i doesn't tell me any more about j.

2. M is in Z and is the middle node of a fork: i <-- M --> j
   Intuition: if I know the common cause M, knowing the 1st result i doesn't tell me any more about the 2nd result j.

3. M is NOT in Z (and neither is any of its descendants), and M is the middle node of a collider: i --> M <-- j
   Intuition: if I knew i and the common result M, then i would already explain M, and that would justify why I should not raise my belief in j.

An A.I. researcher's worldview

Problems are divided into:

1. Those solvable by "algorithms"
   - Algorithm = do these steps and you are guaranteed to get the answer in a "reasonable" time
   - Classic examples: searching and sorting

2. Those that aren't
   - No way to guarantee you will get an answer (in polynomial time)

Q: What do you do?
A: Search for one!

A.I. Worldview (2)

Example of an "A.I." problem: Chess
- Can you guarantee that you will always win at chess?
- Can you guarantee that you will (at least) never lose?
- No? Well, that makes it interesting!

Compare with Tic-Tac-Toe:
- You can guarantee that you will never lose
- (That's why only children play it)


A.I. Worldview (3)

A.I. paradigm for searching for a solution
Remember: no "algorithm" for obtaining the answer. Need to search for one:

States:
- Configurations of the world

Operators:
- Define legal transitions from one state to another
- Example: white knight g1 -> f3, white pawn c2 -> c4

A.I. Worldview (4)

State space (or search space)
- The space of states reachable by operators from the initial state


A.I. Worldview (5)

Goal state
- One or more states that have the configuration that you want
- In chess: checkmate!

A.I. Worldview (6)

A.I. pioneer Allen Newell's view of intelligence

A given level of "intelligence" is achievable with:
a) Lots of knowledge and little search (chess grandmaster)
b) Little knowledge and lots of search ("stupid" program)
c) Some knowledge and some search ("smart" program)


A.I. Worldview (7)

Idea:
1. Start at the initial state
2. Apply operators to traverse the search space
3. Hope to arrive at a goal state
4. Issues:
   - How quickly can you find the answer? (time!)
   - How much memory do you need? (space!)
   - How good is your goal state?
     - Optimal = shortest path?
     - Optimal = least total arc cost?

A.I. Worldview (8)

Tools:

Uninformed search:
- Depth-1st
- Breadth-1st
- Uniform cost (best-1st where best = least cost so far)
- Iterative deepening depth-1st

Informed search:
- Heuristic function tells the "desirability" of each node
- Greedy (best-1st where best = least estimated cost to goal)
- A* (best-1st where best = uniform + greedy); see the sketch after this list

Search from:
- Initial state to goal state(s)
- Goal state to initial state(s)
- Both directions
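All of these best-first variants fit one generic loop; what changes is the priority used to order the frontier. A minimal sketch (my illustration, not from the slides):

import heapq
from itertools import count

def best_first(start, is_goal, successors, priority):
    # successors(state) -> iterable of (next_state, step_cost)
    # priority(state, cost_so_far) -> number ordering the frontier
    tie = count()  # tiebreaker so equal-priority entries never compare states
    frontier = [(priority(start, 0.0), next(tie), 0.0, start, [start])]
    explored = set()
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, cost
        if state in explored:
            continue
        explored.add(state)
        for nxt, step in successors(state):
            heapq.heappush(frontier, (priority(nxt, cost + step), next(tie),
                                      cost + step, nxt, path + [nxt]))
    return None, float('inf')

# Uniform cost: priority = lambda s, g: g
# Greedy:       priority = lambda s, g: h(s)
# A*:           priority = lambda s, g: g + h(s)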

Machine Learning and A.I.

ML goals:
- Find some data structure that permits better performance on some set of problems
  - Prediction
  - Conciseness
  - Some combination thereof

What about coefficient-finding numerical methods?
- They're "algorithms" (in the A.I. sense)!
  1. Stuff in the data
  2. Turn the crank
  3. O(n^3) later, out comes the answer

ML example: Decision Tree learning

Task: Build a decision tree that predicts a class
- Leaves = guessed class
- Non-leaf nodes = tests on attribute variables
- Each edge to a child represents one or more attribute values

ML example: Decision Tree learning (2)

Approach: greedy search (a compact sketch follows this list)

1. Use information theory to find the best attribute to split the data on
2. Split the data on that attribute
3. Recurse until either:
   a) No more attributes to split on (label with the majority class)
   b) All instances are in the same class (label with that class)
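A compact sketch of that recursion (my illustration; best_attribute is a hypothetical helper standing in for the information-gain step defined on the next slides):

from collections import Counter

def build_tree(examples, attrs, target):
    # examples: list of dicts; attrs: attribute names; target: class key
    classes = [ex[target] for ex in examples]
    if len(set(classes)) == 1:                # (b) all in the same class
        return classes[0]
    if not attrs:                             # (a) no attributes left
        return Counter(classes).most_common(1)[0][0]  # majority class
    best = best_attribute(examples, attrs, target)    # step 1 (info theory)
    subtrees = {}
    for value in set(ex[best] for ex in examples):    # step 2: split
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attrs if a != best]
        subtrees[value] = build_tree(subset, rest, target)  # step 3: recurse
    return (best, subtrees)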




ML example: Decision Tree learning (3)

A bit of information theory:
- Ci = some class value to guess
- S = some set of examples
- freq(Ci,S) = how many Ci's are in S
- size(S) = size of S

Intuition: with k choices C1 .. Ck, how much information is needed to specify one Ci from S?

- Not many Ci's (freq ~ 0)? On average, few bits: each occurrence costs more than 1 bit, but there aren't many occurrences
- Lots of Ci's (freq ~ size(S))? Not many bits: each occurrence costs less than 1 bit (good default guess)
- Some Ci's (freq ~ size(S)/2)? About 1 bit: about 1 bit each, occurring about half the time

ML example: Decision Tree learning (4)

- Probability of choosing one class value from the set: freq(Ci,S)/size(S)

- Information to specify one Ci in S: -lg( freq(Ci,S)/size(S) ) bits

- For the expected information, multiply by the class proportions:

  info(S) = - sum(i=1 to k): freq(Ci,S)/size(S) * lg( freq(Ci,S)/size(S) )
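info(S) is just Shannon entropy over the class frequencies; a minimal sketch (my code, assuming S is a list of class labels so freq and size come from counting):

from math import log2
from collections import Counter

def info(S):
    # Counter drops zero-count classes, matching the 0 * lg(0) = 0 convention
    n = len(S)
    return sum(-(c / n) * log2(c / n) for c in Counter(S).values())

# The two cases worked through on the next slides:
print(info(['c1'] * 10))              # 0.0 bits: all one class
print(info(['c1'] * 5 + ['c2'] * 5))  # 1.0 bits: 50-50 split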


ML example: Decision Tree learning (5)

Let's get an intuition:

Case 1: every member of S is a C1, none is a C2
size(S) = 10, freq(C1,S) = 10, freq(C2,S) = 0

Therefore:
info(S) = - sum(i=1,2): freq(Ci,S)/size(S) * lg( freq(Ci,S)/size(S) )
        = - [ (10/10) * lg(10/10) ] - [ (0/10) * lg(0/10) ]
        = -0 - 0 = 0

(taking 0 * lg(0) = 0)

Intuition: "If we know that we're dealing with S, then we know that all of its members are in C1. No need to specify which is C1 and which is C2."

ML example: Decision Tree learning (6)

Let's get an intuition (cont'd):

Case 2: half the members of S are C1, half are C2
size(S) = 10, freq(C1,S) = 5, freq(C2,S) = 5

Therefore:
info(S) = - sum(i=1,2): freq(Ci,S)/size(S) * lg( freq(Ci,S)/size(S) )
        = - [ (5/10) * lg(5/10) ] - [ (5/10) * lg(5/10) ]
        = -2 * (0.5 * -1) = 1

Intuition: "If we know that we're dealing with S, then it's a 50-50 guess which members belong to C1 and which to C2. We need to specify which (no compression possible)."

ML example: Decision Tree learning (7)

Recall the plan: select the "best" attribute to partition on, where "best" = best separates the classes.

Information gain for some attribute:
gain(attr) = (ave. info needed to specify a class)
           - (ave. info needed to specify a class after partitioning by attr)
           = info(T) - info_attr(T)

When info_attr(T) is small, the classes are well separated (big gain!)

where:
n = number of attribute values
Ti = the subset whose members all have the same attribute value vi
info_attr(T) = sum(i=1,n): size(Ti)/size(T) * info(Ti)
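In code, on top of the info(S) sketch above (again my illustration; T is assumed to be a list of example dicts and target the class key):

def info_attr(T, attr, target):
    n = len(T)
    total = 0.0
    for v in set(ex[attr] for ex in T):                 # one Ti per value
        Ti = [ex[target] for ex in T if ex[attr] == v]
        total += len(Ti) / n * info(Ti)
    return total

def gain(T, attr, target):
    return info([ex[target] for ex in T]) - info_attr(T, attr, target)

# On the 14-row tennis table that follows, gain(T, 'Outlook', 'PlayTennis')
# comes out to about 0.246 bits, matching the hand computation two slides on.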


ML example: Decision Tree learning (8)

Example data (should we play tennis?)

Outlook    Temp  Humidity  Windy  PlayTennis?
sunny      75    70        true   yes
sunny      80    90        true   no
sunny      85    85        false  no
sunny      72    95        false  no
sunny      69    70        false  yes
overcast   72    90        true   yes
overcast   83    78        false  yes
overcast   64    65        true   yes
overcast   81    75        false  yes
rain       71    80        true   no
rain       65    70        true   no
rain       75    80        false  yes
rain       68    80        false  yes
rain       70    96        false  yes

ML example: Decision Tree learning (9)

info(PlayTennis)
  = -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits

info_outlook(PlayTennis)
  = 5/14 * ( -2/5 * lg(2/5) - 3/5 * lg(3/5) )     [sunny: 2 yes, 3 no]
  + 4/14 * ( -4/4 * lg(4/4) - 0/4 * lg(0/4) )     [overcast: 4 yes, 0 no]
  + 5/14 * ( -3/5 * lg(3/5) - 2/5 * lg(2/5) )     [rain: 3 yes, 2 no]
  = 0.694 bits

gain(outlook) = 0.940 - 0.694 = 0.246 bits

ML example: Decision Tree learning (10)

info(PlayTennis)
  = -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits

info_windy(PlayTennis)
  = 6/14 * ( -3/6 * lg(3/6) - 3/6 * lg(3/6) )     [true: 3 yes, 3 no]
  + 8/14 * ( -6/8 * lg(6/8) - 2/8 * lg(2/8) )     [false: 6 yes, 2 no]
  = 0.892 bits

gain(windy) = 0.940 - 0.892 = 0.048 bits

gain(outlook) > gain(windy), so test on outlook!

ML example: Decision Tree learning (11)

Guarding against overfitting: cross-validation

We want to use all the data, but using test data to train is cheating. So split the data into k sets:

for (i = 0; i < k; i++)
{
    model = train_with_everything_but(i);
    test_with(model, i);
}
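The same loop as runnable Python (a sketch; train and evaluate are stand-ins for whatever learner and error metric you use):

def cross_validate(data, k, train, evaluate):
    folds = [data[i::k] for i in range(k)]        # k roughly equal subsets
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = train(training)                   # never sees fold i
        scores.append(evaluate(model, held_out))  # tested only on fold i
    return sum(scores) / k                        # average over all k folds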

Tenets of Machine Learning

Choose appropriate:

Training experience:
- Ex: good to have about equal numbers of cases of each class, even if some classes are more probable in real data
- Think about how you'll test, too!

Target function:
- Decision tree? Neural net?

Representation:
- Ex: how much data: Windy in {true, false} vs. wind_speed in mph

Learning algorithm:
- Ex: greedy search? Genetic algorithm? Backpropagation?

Our Tenets of Scientific Discovery

1. Play to computers' strengths:
   1. Speed
   2. Accuracy (fingers crossed)
   3. They don't get bored
   So do exhaustive search!
   Q: Hey, doesn't that ignore all that AI heuristic-function research?

2. Use background knowledge
   - Predictive accuracy is not everything!
   - Normal science ==> dominant paradigm
   - Revolutionary science ==> ?

What are the Differences?

1. Background knowledge
   - CSD values background knowledge
   - ML considers background knowledge

What are the Differences? (cont)

2. The process of knowledge discovery
   - The ML process is iterative
   - But the CSD process is iterative and starts all over again

1. Exhaustive Search

Tell computers to consider everything!
- Search the space systematically
- Simplest --> increasingly more complex

Issues:

1. How do you search systematically?
   - States: models
   - Initial state = simplest model
   - Goal state = solution model
   - Operators: go from one model to a marginally more complex one

2. What is "everything"?
   Q: With floating-point values, every different coefficient could be a new model (x, x+dx, x+2dx, etc.)
   A: Generate the next qualitative state, then use numerical methods to find the best coefficients in that state (see the sketch below)
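A minimal sketch of that split between qualitative search and numerical fitting (my illustration; the particular model forms and the mean-squared-error scoring are assumptions):

import numpy as np
from scipy.optimize import curve_fit

FORMS = [                                   # qualitative states, simplest first
    ('y = a',          lambda x, a:    np.full_like(x, a)),
    ('y = a*x',        lambda x, a:    a * x),
    ('y = a*x + b',    lambda x, a, b: a * x + b),
    ('y = a*x**2 + b', lambda x, a, b: a * x**2 + b),
]

def search_models(x, y):
    best = None
    for name, f in FORMS:                   # exhaustive walk over forms
        try:
            params, _ = curve_fit(f, x, y)  # numerical coefficient fitting
        except RuntimeError:
            continue                        # fit failed to converge
        err = float(np.mean((f(x, *params) - y) ** 2))
        if best is None or err < best[0]:
            best = (err, name, params)
    return best                             # (error, form name, coefficients)

# A real CSD system would also penalize complexity, so a more complex form
# must earn its extra coefficients rather than win on raw error alone.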

2. Background knowledge as inductive bias (1)

Inductive bias is necessary:
- N training cases, but the (N+1)st test case could be anything
- Want to assume something about the target function
- Inductive bias = what you've assumed

Common inductive biases in ML:
- Minimal cross-validation error (e.g., decision-tree learning)
- Maximal conditional independence (Bayes nets)
- Maximal margin between classes (support vector machines)
- Minimal description length (Occam's razor)
- Minimal feature usage (ignore extraneous data)
- Same class as nearest neighbor (locality)


2. Background knowledge as inductive bias (2)

Biases we can add/refine in CSD:

1. Is it expressible in the same language as the paradigm?
   Re-use paradigm elements instead of inventing something "brand new":
   - Penalty for new objects
   - Penalty for new attributes
   - Penalty for new processes
   - Penalty for new relations/operations (?)
   - Penalty for new types of assertions (?)

2. Does it use the same reasoning as done in the paradigm?
   - Penalty for new types of reasoning, even with old assertions

Q: Does this mean we can never introduce a new thing?

Penalty for new objects: polywater

Polymer: a long molecule in a repetitive chain

Nikolai Fedyakin (1962, USSR):
- H2O condensed in and forced through narrow quartz capillary tubes
- Measured boiling point, freezing point and viscosity
- Similar to syrup

Boris Derjaguin:
- Popularized the results (Moscow, then UK, 1966)

In the West:
- Some could replicate the findings
- Some could not

Penalty for new objects: polywater (2)

People were concerned with contamination of the H2O, but precautions were taken against this.

Denis Rousseau (Bell Labs):
- Did the same tests with his sweat
- It had the same properties as "polywater"

Easier to believe in an old thing (water + organic pollutants) than a new thing ("polywater").


Penalty for new things: Piltdown Man

Circa 1900: looking for early human fossils
- Neanderthals in Germany (1863)
- Cro-Magnon in France (1868)
- What about England??

Charles Dawson (1912):
- "I was given a skull by workmen at the Piltdown gravel pit"
- Later, got skull fragments and a lower jaw

[Photo: excavating the Piltdown gravels; Dawson (right), Smith Woodward (center)]

Penalty for new things: Piltdown Man (2)

Royal College of Surgeons (soon after discovery):
- "Brain looks like modern man"

French paleontologist Marcellin Boule (1915):
- "Jaw is from an ape"

American zoologist Gerrit Smith Miller (1915):
- "Jaw is from a fossil ape"

German anatomist Franz Weidenreich (1923):
- "Modern human cranium + orangutan jaw with filed teeth"

Oxford anthropologist Kenneth Page Oakley (1953):
- "Skull is medieval human, lower jaw is Sarawak orangutan, teeth are fossil chimpanzee"

Penalty for new attributes:
Inertial vs. gravitational mass

Inertial mass:
- Resistance to motion
- The m in F = ma

Active gravitational mass:
- Ability to attract other masses
- The M in F = GMm/r^2

Passive gravitational mass:
- Ability to be attracted by other masses
- The m in F = GMm/r^2

Penalty for new attributes (2)

Conceptually they are three different types of mass.
- No experiment has ever distinguished between them
- People from Newton on have tried experiments

So assume they are all the same!

Penalty for new processes: cold fusion

Cold fusion: a novel combo of old processes: catalysis + fusion

Catalysis:
- Hard: A + B -> D
- Easier (C = catalyst):
  A + C -> AC (activated catalyst)
  B + AC -> ABC (ready to go)
  ABC -> CD (easier reaction)
  CD -> C + D (catalyst ready to do another reaction)


Penalty for new processes: cold fusion (2)

Fusion: how it works
- Get lots of energy fusing neutron-rich atoms
- Need a lot of energy in to get more out

Penalty for new processes: cold fusion (3)

Fusion: overcoming the electrostatic force is hard
- Current technology: need a fission bomb to do it
- [Photo: the result]


Penalty for new processes: cold fusion (4)

Martin Fleischmann & Stanley Pons (1989):
- "We can do fusion at room temperature!"
- (No initiating nuclear bomb needed)
- Electrolysis of heavy water (D2O)
- "Excess heat" observed

Proposed mechanism (palladium is the catalyst):
  Pd + D -> Pd-D
  Pd-D + D -> D-Pd-D
  D-Pd-D -> He-Pd + energy!
  He-Pd -> He + Pd

Penalty for new processes: cold fusion (5)

Reported in the New York Times; instantly a worldwide story among scientists.

Replication:
- Some can
- Others can't

Results:
- Energy: some get excess energy; others claim the excess comes from failing to calibrate/account for everything
- Helium: not enough observed for the energy said to be produced (and there is background helium in the air)


Ramifications

1.

Science is conservative

Use the current paradigm to guide thinking


2.

Accuracy is not everything

Assertion‏has‏to‏“fit‏in”‏current‏model


Be explainable by model


Use same terms as model

ML and CSD?

From ML we can get the idea of learning as model search:
- Training experience
- Target function
- Representation
- Learning algorithm

Extra considerations for CSD:
- Use computers' strengths: speed + accuracy + don't get bored; simulation + exhaustive search
- Use background knowledge: downright conservative about introducing new terms
- Not just iterative: it never ends