Machine Learning Overview CS 260 (11/20/08)

Doug Fisher 1
Machine Learning Overview
CS 260
(11/20/08)
Doug Fisher
[Diagram: Environment, Learning Element, Knowledge Base, and Performance Element connected by arrows, annotated with "training or time or …" and performance "quality".]

• ML Research Perspectives: Tools for agents or models of agents
• Arrows Everywhere: batch or online (incremental), empirical or analytic,
  passive or active (budgeted)
• Coupling: the picture shows "file coupling", which makes for an easy picture, but
  tighter coupling is possible/probable
Adapted from Dietterich, 1984
Doug Fisher 2
Machine Learning Overview
Empirical, Supervised Learning
Example: Naïve Bayesian Classifiers
Subclass: Supervised Rule Induction
Example: Decision tree induction
Example: Brute-force induction of decision rules
Empirical, Unsupervised Learning
Unsupervised Rule Induction
Association Rule Learning
Bayesian Network Learning
Clustering
Analytical Learning
Explanation-Based Learning
Empirical/Analytic Hybrids
Doug Fisher 3
[Diagram relating Machine Learning to neighboring fields: Statistics, Psychology, Data Mining, and Databases. "We will be covering" marks the focus of this overview.]
Doug Fisher 4
Batch, Empirical, Passive Supervised Learning (BEPSL)

[Diagram: Environment, Learning Element, Knowledge Base, and Performance Element, annotated with "training set size …" and "classification/prediction accuracy (or cost or …)".]

Given: a set of classified objects (a training data set)
Find: a classifier (for predicting class membership of unclassified data – a test set)
A training set:

Index  V1   V2   V3   . . . .  Vm   C
1      v11  v21  v32           vm2  c1
2      v12  v22  v32           vm1  c2
. . . .
n      v12  v21  v31           vm2  c1

Feature-vector BEPSL: given an unclassified vector
   v11  v21  v32  …  vm2  c?
predict its class membership, e.g.,
   v11  v21  v32  …  vm2  c1

Variants?
Doug Fisher 5
Example: Naive Bayesian Classifier
Given a vector V = {v11, v22, v31, . . . ., vm2, c?}
Compute:
P(c1 | v11, v22, v31, …, vm2) proportional to
   P(v11 | v22, v31, …, vm2, c1) P(v22 | v31, …, vm2, c1) …. P(vm2 | c1) P(c1)
which equals (under the assumption that the Vi's are independent conditioned on C)
   P(v11 | c1) P(v22 | c1) P(v31 | c1) …. P(vm2 | c1) P(c1)

P(c2 | v11, v22, v31, …, vm2) proportional to
   P(v11 | v22, v31, …, vm2, c2) P(v22 | v31, …, vm2, c2) …. P(vm2 | c2) P(c2)
which equals (under the assumption that the Vi's are independent conditioned on C)
   P(v11 | c2) P(v22 | c2) P(v31 | c2) …. P(vm2 | c2) P(c2)
Classify V as in c1 or c2, whichever yields higher probability
Doug Fisher 6
Given a vector V = {1, -1, 0, . . . ., 1}
Compute:
P(-1| 1, -1, 0,…, 1) as P(1|-1)P(-1|-1)P(0|-1)….P(1|-1)P(-1)
P(1| 1, -1, 0,…, 1) as P(1|1)P(-1|1)P(0|1)….P(1|1)P(1)
Classify V as -1 or 1 (i.e., c1 or c2), whichever yields the higher probability
Doug Fisher 7
Learning a Naïve Bayesian Classifier.
View probabilities as proportions computed over training set.
P(v11|c1)P(v22|c1)P(v31|c1)….P(vm2|c1)P(c1)
= [v11,c1]/[c1] * [v22,c1]/[c1] * [v31,c1]/[c1] *…* [vm2,c1]/[c1] * [c1]/[]
where [conditions] is the number of objects/rows in the training set that satisfy all
the conditions. So [v11,c1] is the number of training data that are members of c1
and have V1=v11, [c1] is the number of training objects in c1, [] is the total number
of training objects.
Learning, in this case, is a matter of counting the number of rows in the training data
in which various conditions are satisfied. What conditions? Each class/variable-value
pair, each class, and the total number of rows.
Doug Fisher 8
V1 V2 V3 ………… Vm
c1 [v11,c1] [v21,c1] [v31,c1] ………… [vm1,c1]
[v12,c1] [v22,c1] [v32,c1] ………… [vm2,c1]
. . .
c2 [v11,c2] [v21,c2] [v31,c2] ………… [vm1,c2]
[v12,c2] [v22,c2] [v32,c2] ………… [vm2,c2]
. . .
[c1]
[c2]
[v11] [v21] [v31] [vm1]
[v12] [v22] [v32] [vm2]
[ ]
Consider a (multidimensional) array of ints, whose dimensions are the number of classes
and the number of Vi values, and estimate P(vij|ck) as ([vij,ck]+1) / ([ck]+2), and
P(ck) as ([ck]+1) / ([]+2).
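To make the counting concrete, here is a minimal Python sketch (not from the original slides; the data layout and all names are illustrative) that accumulates the counts and classifies a vector with the smoothed estimates above:

from collections import defaultdict

def learn_counts(training_set):
    """training_set: list of (feature_tuple, class_label) pairs."""
    value_class = defaultdict(int)   # [vij, ck]: rows with Vi = vij and class ck
    class_count = defaultdict(int)   # [ck]
    total = 0                        # []
    for features, c in training_set:
        class_count[c] += 1
        total += 1
        for i, v in enumerate(features):
            value_class[(i, v, c)] += 1
    return value_class, class_count, total

def classify(features, value_class, class_count, total):
    best_class, best_score = None, -1.0
    for c, n_c in class_count.items():
        score = (n_c + 1) / (total + 2)                              # P(ck)
        for i, v in enumerate(features):
            score *= (value_class[(i, v, c)] + 1) / (n_c + 2)        # P(vij|ck)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Example with the -1/1 class coding of the previous slide (hypothetical data):
data = [((1, -1, 0, 1), -1), ((1, 1, 0, 1), 1), ((-1, -1, 1, 1), -1)]
vc, cc, n = learn_counts(data)
print(classify((1, -1, 0, 1), vc, cc, n))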
Doug Fisher 9
Supervised Rule Induction
Sample decision rules:
IF (STAGE = 1) AND (TMSB4X > 10145.5) THEN (RISK = Low) [cov][err]
IF (Age >= 61) AND (BloodLoss >=100) AND (BPVar >= 16.73)
THEN (ExtendedPhase1Recovery = Yes) [cov][err]
Classifiers: decision trees
and decision lists
Decision tree:
              STAGE
         =1           =3
      TMSB4X         OAS2
  <=10145.5  >10145.5    <=132.3  >132.3
    High       Low         Low     High

Decision list:
  STAGE=1 AND TMSB4X <= 10145.5   High
  STAGE=1 AND TMSB4X >  10145.5   Low
  STAGE=3 AND OAS2 <= 132.3       Low
  otherwise                       High
Doug Fisher 10
Example: Decision tree classifiers
[Decision tree diagram: internal nodes test SciFi, Terror, Romance, Ebert, Siskel, Spouse, and BigStar, with branches labeled -1 and 1; leaves are Rent-it or ~Rent-it.]

Datum to classify: [ SciFi = -1, Terror = 1, Romance = -1, Ebert = 1, Siskel = 1, …. ]
Doug Fisher 11
[Decision tree diagram: internal nodes test SciFi, Terror, Romance, Ebert, Siskel, BigHit, and BigStar, with branches labeled -1 and 1; leaves are Rent-it or ~Rent-it.]

Datum to classify: [ SciFi = -1, Terror = 1, Romance = 0, Ebert = 1, Siskel = 1, …. ]
Doug Fisher 12
[Decision tree diagram: internal nodes test SciFi, Terror, Romance, Ebert, Siskel, BigHit, and BigStar, with branches labeled -1 and 1; leaves are Rent-it or ~Rent-it.]

Datum to classify: [ SciFi = -1, Terror = 1, Romance = -1, Ebert = 1, Siskel = 1, …. ]
Doug Fisher 13
[Decision tree diagram: internal nodes test SciFi, Terror, Romance, Ebert, Siskel, BigHit, and BigStar, with branches labeled -1 and 1; leaves are Rent-it or ~Rent-it.]

Datum to classify: [ SciFi = -1, Terror = 1, Romance = -1, Ebert = 1, Siskel = 1, …. ]
Doug Fisher 14
Learning a decision tree.
The standard greedy (hill-climbing) approach
(Top-Down Induction of Decision Trees)
Node TDIDT(Set Data, int (*TerminateFn)(Set), Variable (*SelectFn)(Set)) {
    if ((*TerminateFn)(Data)) return ClassNode(Data);
    BestVariable = (*SelectFn)(Data);
    return TestNode(BestVariable,
        /* v1 branch */ TDIDT({d | d in Data and Value(BestVariable, d) = v1}),
        /* v2 branch */ TDIDT({d | d in Data and Value(BestVariable, d) = v2}));
}
This is not the only way to learn a decision tree !!
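For illustration, here is a small Python sketch of the same greedy recursion (not the slides' code; terminate_fn and select_fn are supplied by the caller, and majority_class is a helper introduced here). Data are (vector, label) pairs.

def tdidt(data, terminate_fn, select_fn):
    if terminate_fn(data):
        return ("leaf", majority_class(data))            # ClassNode(Data)
    best = select_fn(data)                               # index of the best test attribute
    children = {}
    for value in {d[0][best] for d in data}:             # one branch per observed value
        subset = [d for d in data if d[0][best] == value]
        children[value] = tdidt(subset, terminate_fn, select_fn)
    return ("test", best, children)                      # TestNode(BestVariable)

def majority_class(data):
    labels = [c for _, c in data]
    return max(set(labels), key=labels.count)

# e.g., terminate when a subset is pure or cannot be split further:
# terminate_fn = lambda d: len({c for _, c in d}) <= 1 or len({x for x, _ in d}) <= 1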
Doug Fisher 15
Data:    V1  V2  V3  V4   C
          1  -1   1   1   c1
         -1   1  -1   1   c1   (or –1)
         -1  -1  -1   1   c1
         -1  -1   1  -1   c1
         -1   1   1  -1   c2   (or 1)
         -1  -1   1   1   c2
         -1   1  -1  -1   c2
          1   1   1  -1   c2

Best attribute: V4

                         V4
       -1 branch (4 data)            1 branch (4 data)
  TDIDT([-1-11-1c1, -111-1c2,    TDIDT([1-111c1, -11-11c1,
         -11-1-1c2, 111-1c2])           -1-1-11c1, -1-111c2])

Assume the left branch always corresponds to -1 and the right branch to 1; the integers
under a test node are the number of data sent down the left and right branches,
respectively.

The test attribute might actually be represented by an index into the datum vector
(e.g., instead of V4, a 3 might be given here, indicating a test of location 3 of a
datum vector indexed from 0 to 3). A Datum, d, has a dVector and a dClass; the input to
TDIDT is a Set of Datum.
Doug Fisher 16
BestAttribute (for the V4 = -1 subset): V2. The split sends 1 datum down the V2 = -1
branch, TDIDT([-1-11-1c1]), and 3 data down the V2 = 1 branch,
TDIDT([-111-1c2, -11-1-1c2, 111-1c2]); the V4 = 1 subset,
TDIDT([1-111c1, -11-11c1, -1-1-11c1, -1-111c2]), is still pending.
Doug Fisher 17
The V2 = -1 call, TDIDT([-1-11-1c1]), terminates as leaf C1 (i.e., class -1) with counts
[0 1]: the number of data at the leaf in C1 (right entry) and not in C1 (left entry).
Doug Fisher 18
The V2 = 1 call, TDIDT([-111-1c2, -11-1-1c2, 111-1c2]), terminates as leaf C2 (i.e.,
class 1) with counts [0 3].
Doug Fisher 19
Recursing on the V4 = 1 subset, TDIDT([1-111c1, -11-11c1, -1-1-11c1, -1-111c2]),
BestAttribute: V3.
Doug Fisher 20
The split on V3 sends 2 data down each branch: the V3 = -1 call,
TDIDT([-11-11c1, -1-1-11c1]), terminates as leaf C1 with counts [0 2], while the
V3 = 1 call, TDIDT([1-111c1, -1-111c2]), is still pending.
Doug Fisher 21
Recursing on the V3 = 1 subset, TDIDT([1-111c1, -1-111c2]), BestAttribute: V1.
Doug Fisher 22
The split on V1 sends 1 datum down each branch: TDIDT([-1-111c2]) terminates as leaf C2
with counts [0 1], and TDIDT([1-111c1]) terminates as leaf C1 with counts [0 1].

The final tree: V4 at the root; its -1 branch tests V2 (leaf C1 [0 1] under V2 = -1 and
leaf C2 [0 3] under V2 = 1); its 1 branch tests V3, whose -1 branch is leaf C1 [0 2] and
whose 1 branch tests V1 (leaf C2 [0 1] under V1 = -1 and leaf C1 [0 1] under V1 = 1).
In general, it might appear that the left integer
field of a leaf will always be 0, but some
termination functions allow “non-pure” leaves
(e.g., no split changes the class distribution
significantly).
Doug Fisher 23
Selecting the best divisive attribute (SelectFn):

Choose the attribute Vi that minimizes

   Σ_j P(Vi = vij) Σ_k P(Ck | Vi = vij) | log P(Ck | Vi = vij) |

where | log P(Ck | Vi = vij) | is the number of bits necessary to encode Ck conditioned
on Vi = vij; the inner sum is the expected number of bits necessary to encode C
membership conditioned on Vi = vij; and the whole expression is the expected number of
bits necessary to encode C conditioned on knowledge of Vi's value.

Treat 0 * log 0 as 0, else a runtime error will be generated (log 0 is undefined).
Doug Fisher 24
Selecting the best divisive attribute (alternate): choose the attribute Vi that maximizes

   Σ_j P(Vi = vij) Σ_k P(Ck | Vi = vij)^2

The big picture on attribute selection:
• if Vi and C are statistically independent, value Vi least ("minimally")
• if each value of Vi is associated with exactly one C, value Vi most ("maximally")
• most cases fall somewhere in between
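As a sketch (assuming feature-vector data as (vector, label) pairs; not the slides' code), the two selection measures might be computed as follows:

import math

def expected_code_length(data, i):
    """Sum_j P(Vi=vij) * Sum_k P(Ck|Vi=vij) * |log P(Ck|Vi=vij)| -- minimize this."""
    n = len(data)
    total = 0.0
    for v in {d[0][i] for d in data}:
        subset = [c for x, c in data if x[i] == v]
        p_v = len(subset) / n
        for k in set(subset):                      # only classes present, so 0*log 0 never arises
            p_k_given_v = subset.count(k) / len(subset)
            total += p_v * p_k_given_v * abs(math.log2(p_k_given_v))
    return total

def expected_squared_purity(data, i):
    """Sum_j P(Vi=vij) * Sum_k P(Ck|Vi=vij)^2 -- maximize this."""
    n = len(data)
    total = 0.0
    for v in {d[0][i] for d in data}:
        subset = [c for x, c in data if x[i] == v]
        p_v = len(subset) / n
        total += p_v * sum((subset.count(k) / len(subset)) ** 2 for k in set(subset))
    return total

A SelectFn could then return, e.g., min(range(m), key=lambda i: expected_code_length(data, i)) over the m attribute indices.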
Doug Fisher 25
Problems:
DT1:
suppose that you want to use DT induction to build a classifier that predicts (natural-) tree-height
from gps-location (assume insignificant error), tree-base-width, tree-species, soil type, watershed-
condition, etc. Describe the problems encountered when using TDIDT as described.
DT2:
Suppose that you have data defined over attributes A1 through A100. An oracle knows (you
don't, which is why you are using DT induction) that the data labeled C1 are exactly
those satisfying A37 ≡ A91, and that the remaining data are labeled C2 (i.e., C2 is
defined as an exclusive-or of A37 and A91). The remaining variables (other than A37 and
A91) have "random" values. Describe the problems encountered when using TDIDT as described.
DT3:
Suppose data defined over attributes A1 through AM. An oracle knows that each and only data
that satisfy the conjunction Λ
i=1
Ai=aij are members of class C1 – all data not satisfying the
conjunction are members of C2. Describe the behavior of TDIDT as N increases.
DTProject:
Data mining and machine learning. Consider adapting TDIDT to run on massive data sets
(of the form below, with N rows) that reside in a DBMS. If you want your TDIDT variant
to minimize page reads/writes, would you store the data in row-major or column-major
format in a single many-page table? Does it matter? Different tables?

Index  V1   V2   V3   . . . .  Vm   C
1      v11  v21  v32           vm2  c1
2      v12  v22  v32           vm1  c2
. . . .
n      v12  v21  v31           vm2  c1
Doug Fisher 26
Supervised Rule Induction
Sample decision rules:
IF (STAGE = 1) AND (TMSB4X > 10145.5) THEN (RISK = Low) [cov][err]
IF (Age >= 61) AND (BloodLoss >=100) AND (BPVar >= 16.73)
THEN (ExtendedPhase1Recovery = Yes) [cov][err]
Classifiers: decision trees
and decision lists
Decision tree:
              STAGE
         =1           =3
      TMSB4X         OAS2
  <=10145.5  >10145.5    <=132.3  >132.3
    High       Low         Low     High

Decision list:
  STAGE=1 AND TMSB4X <= 10145.5   High
  STAGE=1 AND TMSB4X >  10145.5   Low
  STAGE=3 AND OAS2 <= 132.3       Low
  otherwise                       High
Doug Fisher 27
Empirical, Supervised Learning
Example: Naïve Bayesian Classifiers
Subclass: Supervised Rule Induction
Example: Decision tree induction
Example: Brute-force induction of decision rules
Empirical, Unsupervised Learning
Unsupervised Rule Induction
Association Rule Learning
Bayesian Network Learning
Clustering
Analytical Learning
Explanation-Based Learning
Empirical/Analytic Hybrids
Doug Fisher 28
Data and knowledge representation issues
• attribute types – both independent attributes and
dependent attribute(s)
• missing, noisy, irrelevant attributes
• relational (structured, graphical) data
• bias
Performance issues
• classification, prediction, cost, probability
distributions over outcome space
• probably approximately correct learning
Knowledge representation and learning
principles
• bias
• recursive decomposition
• finding “simpler models” in higher dimensional spaces
• inherently low dimensional data (that appears high dimensional)
[Scatter plot of "+" and "–" labeled points, illustrating the last two bullets: a simple model in a higher dimensional space, and data that appears high dimensional but is inherently low dimensional.]
Doug Fisher 29
Empirical, Supervised Learning
Example: Naïve Bayesian Classifiers
Subclass: Supervised Rule Induction
Example: Decision tree induction
Example: Brute-force induction of decision rules
Empirical, Unsupervised Learning
Unsupervised Rule Induction
Association Rule Learning
Bayesian Network Learning
Clustering
Analytical Learning
Explanation-Based Learning
Empirical/Analytic Hybrids
Doug Fisher 30
Batch, Empirical, Passive UNSupervised Learning (BEPUL)

[Diagram: Environment, Learning Element, Knowledge Base, and Performance Element, annotated with "training set size …" and "pattern-completion accuracies (or costs or …)".]

Given: a set of fully-described data (a training data set)
Find: a predictor (for pattern-completion of incomplete data – a test set)
A training set:

Index  V1   V2   V3   . . . .  Vm   C
1      v11  v21  v32           vm2  c1
2      v12  v22  v32           vm1  c2
. . . .
n      v12  v21  v31           vm2  c1

Feature-vector BEPUL: given an incomplete vector
   v11  v2?  v32  …  vm?  c?
fill in the missing values, e.g.,
   v11  v21  v32  …  vm2  c1

Variants?
Doug Fisher 31
Example: Unsupervised rule induction of Association Rules
In a nutshell: run brute-force rule discovery for all possible consequents, not simply
single variable values (e.g., V1=v12), but consequents that are conjunctions of variable
values (e.g., V1=v12 & V4=v42 & V5=v51).

Retain rules A -> C such that P(A & C) >= T1 and P(C|A) >= T2. These thresholds enable
pruning of the search space (A and C are themselves conjunctions).

Problem: a plethora of rules, most uninteresting, are produced.
Solutions: Organize/prune rules by
a) interestingness (e.g., A -> C is interesting if P(A, C) >> P(A)P(C) or << P(A)P(C))
b) confidence (a confidence interval around coverage and/or accuracy)
c) support for a top-level goal
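A minimal sketch of the brute-force search with the two thresholds (illustrative only; the transaction and item representations are assumptions, and no attempt is made at Apriori-style efficiency):

from itertools import combinations

def association_rules(transactions, t1, t2, max_size=3):
    """transactions: list of sets of items; returns (antecedent, consequent, support, confidence)."""
    n = len(transactions)
    items = set().union(*transactions)
    rules = []
    for size in range(2, max_size + 1):
        for itemset in combinations(sorted(items), size):
            covered = sum(1 for t in transactions if set(itemset) <= t)
            support = covered / n
            if support < t1:
                continue                      # prune: no rule over this itemset can reach support T1
            for k in range(1, size):
                for antecedent in combinations(itemset, k):
                    a_count = sum(1 for t in transactions if set(antecedent) <= t)
                    confidence = covered / a_count
                    if confidence >= t2:
                        consequent = tuple(i for i in itemset if i not in antecedent)
                        rules.append((antecedent, consequent, support, confidence))
    return rules

# e.g., association_rules([{"V1=v12", "V4=v42"}, {"V1=v12", "V4=v42", "V5=v51"}], 0.5, 0.8)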
Doug Fisher 32
[Bayesian network diagram over v1 … v5, with arcs v1 -> v3, v2 -> v4, v3 -> v4, v3 -> v5, and a probability table at each node:]
   v1: P(v1)
   v2: P(v2)
   v3: P(v3|v1), P(v3|~v1)
   v4: P(v4|v2,v3), P(v4|v2,~v3), P(v4|~v2,v3), P(v4|~v2,~v3)
   v5: P(v5|v3), P(v5|~v3)
Components of a Bayesian Network: a topology (graph) that qualitatively displays the
conditional independencies, and probability tables at each node
Semantics of graphical component: for each variable, v, v is independent of all
of its non-descendents conditioned on its parents
A Bayesian Network is a graphical representation of a joint probability distribution
with (conditional) independence relationships made explicit
Example (Empirical, Unsupervised): Learning Bayesian Networks
Doug Fisher 33
Recall the chain rule. Assume each Vi is a binary-valued variable (T or F).

Under one factorization ordering:
P(v1 and v2 and ~v3 and v4 and ~v5)
 = P(v1)P(v2|v1)P(~v3|v1,v2)P(v4|v1,v2,~v3)P(~v5|v1,v2,~v3,v4)
(the successive partial products are P(v1,v2), P(v1,v2,~v3), P(v1,v2,~v3,v4), and
P(v1,v2,~v3,v4,~v5))

Under an alternative ordering:
P(v1 and v2 and ~v3 and v4 and ~v5)
 = P(v4)P(v2|v4)P(~v3|v4,v2)P(v1|v4,v2,~v3)P(~v5|v4,v2,~v3,v1)
Doug Fisher 34
P(v1 and v2 and ~v3 and v4 and ~v5)
 = P(v1)P(v2|v1)P(~v3|v1,v2)P(v4|v1,v2,~v3)P(~v5|v1,v2,~v3,v4)     (a factorization ordering)

Assume the following conditional independencies:

P(v1)
P(v2|v1) = P(v2)                               (v2 independent of v1)
   and P(v2|~v1) = P(v2), P(~v2|v1) = P(~v2), P(~v2|~v1) = P(~v2)
P(~v3|v1,v2) = P(~v3|v1)                       (v3 independent of v2 conditioned on v1)
   and P(~v3|v1,~v2) = P(~v3|v1), P(~v3|~v1,v2) = P(~v3|~v1), P(~v3|~v1,~v2) = P(~v3|~v1),
       P(v3|v1,v2) = P(v3|v1), P(v3|v1,~v2) = P(v3|v1), P(v3|~v1,v2) = P(v3|~v1),
       P(v3|~v1,~v2) = P(v3|~v1)
P(v4|v1,v2,~v3) = P(v4|v2,~v3)
   and ……
P(~v5|v1,v2,~v3,v4) = P(~v5|~v3)
   and …..
Doug Fisher 35
P(v1 and v2 and ~v3 and v4 and ~v5)
 = P(v1)P(v2|v1)P(~v3|v1,v2)P(v4|v1,v2,~v3)P(~v5|v1,v2,~v3,v4)
 = P(v1)P(v2)P(~v3|v1)P(v4|v2,~v3)P(~v5|~v3)

How many probabilities need be stored?

P(v1), P(~v1):
   2 probabilities (actually only one, since P(~v1) = 1 – P(v1))

P(v2|v1) = P(v2), and P(v2|~v1) = P(v2), P(~v2|v1) = P(~v2), P(~v2|~v1) = P(~v2):
   2 probabilities (or 1) instead of 4 (or 2)

P(~v3|v1,v2) = P(~v3|v1), and P(~v3|v1,~v2) = P(~v3|v1), P(~v3|~v1,v2) = P(~v3|~v1),
P(~v3|~v1,~v2) = P(~v3|~v1), P(v3|v1,v2) = P(v3|v1), P(v3|v1,~v2) = P(v3|v1),
P(v3|~v1,v2) = P(v3|~v1), P(v3|~v1,~v2) = P(v3|~v1), with P(~v3|v1) = 1 – P(v3|v1):
   4 probabilities (or 2) instead of 8 (or 4)

P(v4|v1,v2,~v3) = P(v4|v2,~v3), and ……:
   8 probabilities (or 4) instead of 16 (or 8)

P(~v5|v1,v2,~v3,v4) = P(~v5|~v3), and …..:
   4 probabilities (or 2) instead of 32 (or 16)
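As a sketch with made-up numbers (all values are invented for illustration; P(v1) = 0.75 echoes the value used on a later slide), evaluating the factored joint for this network might look like the following, storing only the "True" entry of each table:

p_v1 = 0.75                                  # P(v1); P(~v1) = 1 - 0.75
p_v2 = 0.4                                   # P(v2)
p_v3_given_v1 = {True: 0.9, False: 0.3}      # P(v3 | v1), P(v3 | ~v1)
p_v4_given_v2_v3 = {(True, True): 0.8, (True, False): 0.5,
                    (False, True): 0.6, (False, False): 0.1}
p_v5_given_v3 = {True: 0.7, False: 0.2}      # P(v5 | v3), P(v5 | ~v3)

def joint(v1, v2, v3, v4, v5):
    p = (p_v1 if v1 else 1 - p_v1)
    p *= (p_v2 if v2 else 1 - p_v2)
    p3 = p_v3_given_v1[v1]
    p *= (p3 if v3 else 1 - p3)
    p4 = p_v4_given_v2_v3[(v2, v3)]
    p *= (p4 if v4 else 1 - p4)
    p5 = p_v5_given_v3[v3]
    p *= (p5 if v5 else 1 - p5)
    return p

print(joint(True, True, False, True, False))   # P(v1, v2, ~v3, v4, ~v5)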
Doug Fisher 36
For a particular factorization ordering, construct a Bayesian network as follows:

v1 is a "root":   P(v1) = 0.75,  P(~v1) = 0.25 = 1 – P(v1)

v2 is the second variable in the ordering. In general, if v2 is independent of a subset
of its predecessors (possibly the empty set) in the ordering conditioned on a disjoint
subset of the predecessors (possibly all of them), then the latter subset becomes its
parents; if that latter subset is empty, then v2 is a "root". Since P(v2|v1) = P(v2), …,
v2 is also a root:

   v1: P(v1)      v2: P(v2)
Doug Fisher 37
v3 is the third variable in the ordering. Since P(v3|v1,v2) = P(v3|v1), …, v3 becomes a
child of v1:

   v1: P(v1)      v2: P(v2)      v3: P(v3|v1), P(v3|~v1)
   with P(~v3|v1) = 1 – P(v3|v1) and P(~v3|~v1) = 1 – P(v3|~v1)

Since P(v4|v1,v2,v3) = P(v4|v2,v3), …, v4 becomes a child of v2 and v3:

   v4: P(v4|v2,v3), P(v4|v2,~v3), P(v4|~v2,v3), P(v4|~v2,~v3)
   with P(~v4|v2,v3) = 1 – P(v4|v2,v3), …
Doug Fisher 38
Since P(v5|v1,v2,v3,v4) = P(v5|v3), …, v5 becomes a child of v3, completing the network:

   v1: P(v1)
   v2: P(v2)
   v3 (parent v1):      P(v3|v1), P(v3|~v1)
   v4 (parents v2, v3): P(v4|v2,v3), P(v4|v2,~v3), P(v4|~v2,v3), P(v4|~v2,~v3)
   v5 (parent v3):      P(v5|v3), P(v5|~v3)
Components of a Bayesian Network: a topology (graph) that qualitatively displays the
conditional independencies, and probability tables at each node
Semantics of graphical component: for each variable, v, v is independent of all
of its non-descendents conditioned on its parents
Doug Fisher 39
Where does knowledge of conditional independence come from?
a) From data. Consider congressional voting records. Suppose that we have data on House
votes (and political party). Suppose variables are ordered Party, Immigration, StarWars, ….

Party: P(Republican) = 0.52   (226/435 Republicans, 209/435 Democrats)

To determine the relationship between Party and Immigration, we count:

Actual Counts                      Predicted Counts (if Immigration and Party independent)
             Immigration                         Immigration
             Yes     No                          Yes     No
Republican    17    209            Republican     92    134
Democrat     160     49            Democrat       85    124

e.g., Predicted(Republican, Yes) = P(Rep) * P(Yes) * 435 = 0.52 * (17+160)/435 * 435 ≈ 92

Very different distributions – conclude dependent
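A small sketch of that comparison (counts taken from the table above; the code names are illustrative):

actual = {("Rep", "Yes"): 17, ("Rep", "No"): 209,
          ("Dem", "Yes"): 160, ("Dem", "No"): 49}
n = sum(actual.values())                               # 435

party_marginal = {p: sum(c for (pp, _), c in actual.items() if pp == p) for p in ("Rep", "Dem")}
vote_marginal = {v: sum(c for (_, vv), c in actual.items() if vv == v) for v in ("Yes", "No")}

# Predicted counts under independence: P(Party) * P(Vote) * n
predicted = {(p, v): party_marginal[p] / n * vote_marginal[v] / n * n
             for p in ("Rep", "Dem") for v in ("Yes", "No")}
# predicted[("Rep", "Yes")] is about 92 versus an actual count of 17: very different,
# so conclude that Party and Immigration are dependent.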
Doug Fisher 40
Party P(Republican) = 0.52 (226/435 Republicans
209/435 Democrats)
Actual Counts
             Immigration
             Yes     No
Republican    17    209
Democrat     160     49

Immigration: P(Yes|Rep) = 0.075 (= 17/226),  P(Yes|Dem) = 0.765
Consider StarWars
Is StarWars independent of Party and Immigration?
(i.e., is P(StarWars|Party, Immigration) approx equal P(StarWars)
for all combinations of variable values?)
if yes, then stop and make StarWars a “root”, else continue
Is StarWars independent of Immigration conditioned on Party?
   if yes, then stop and make StarWars a child of Party, else continue
Is StarWars independent of Party conditioned on Immigration?
   if yes, then stop and make StarWars a child of Immigration, else continue
Make StarWars a child of both Party and Immigration
Doug Fisher 41
Party: P(Republican) = 0.52   (226/435 Republicans, 209/435 Democrats)

Actual Counts
             Immigration        StarWars
             Yes     No         Yes     No
Republican    17    209         219      7
Democrat     160     49          24    185

Immigration: P(Yes|Rep) = 0.075 (= 17/226),  P(Yes|Dem) = 0.765

Consider StarWars. Is StarWars independent of Party and Immigration?

Actual Counts (StarWars Yes/No within each Party, Immigration cell)
                 Immigration = Yes      Immigration = No
                 SW=Yes    SW=No        SW=Yes    SW=No
Republican          14        3           205        4
Democrat             8      152            16       33

Predicted Counts (if StarWars were independent of Party and Immigration),
e.g., Predicted(Rep, Imm=Yes, SW=Yes) = P(Rep & Imm=Yes) * P(SW=Yes) * 435:
                 Immigration = Yes      Immigration = No
                 SW=Yes    SW=No        SW=Yes    SW=No
Republican         9.5       7.5          117       92
Democrat            89        71           27       22

Different – not independent
Doug Fisher 42
Further tests might indicate the network Party -> Immigration, Party -> StarWars;
i.e., Immigration and StarWars are independent conditioned on Party.
Doug Fisher 43
Where does knowledge of conditional independence come from?
b) “First principles”
For example, suppose that the grounds keeper sets sprinkler timers
to a fixed schedule that depends on the season (Summer, Winter,
Spring, Fall), and suppose that the probability that it rains or not
is dependent on season. We might write:
This model might differ from one in which a homeowner manually
turns on a sprinkler
[Two small network diagrams over Season, Rains, and Sprinkler: one for the grounds keeper's fixed schedule, one for the manual-sprinkler scenario.]
Doug Fisher 44
Limitations of Bayesian Networks
• Little work with continuous variables
Doug Fisher 45
Example (Empirical, Unsupervised): Clustering
Given data (vectors of variable values)
Compute a partition (clusters) of the vectors, such that vectors within
a cluster tend to be similar, and vectors across clusters tend to be
dissimilar
For example,

       V1     V2     V3     V4   …………..   VM
1      0.3    0.7    0.1   -0.2  ………….   -0.5
2      0.4    0.8    0.01   0.1  ………….   -0.4
…………………..
N-1   -0.3    0.1    1.01   0.8  ………….    1.3
N     -0.5    0.03   1.1    0.9  ………….    0.9

A partition into two clusters: {1, 2, …} and {…, N-1, N}.
Doug Fisher 46
Cluster summary representations (e.g., the centroid)
       V1     V2     V3     V4   …………..   VM
1      0.3    0.7    0.1   -0.2  ………….   -0.5
2      0.4    0.8    0.01   0.1  ………….   -0.4
…………………..
N-1   -0.3    0.1    1.01   0.8  ………….    1.3
N     -0.5    0.03   1.1    0.9  ………….    0.9

Cluster C1 = {1, 2, …}       centroid:   0.35   0.75   0.05  -0.05  ….  –0.45
Cluster C2 = {…, N-1, N}     centroid:  -0.4    0.05   1.05   0.85  ….   1.1
Doug Fisher 47
Using summary representations for inference
Cluster C1 = {1, 2, …}       centroid:   0.35   0.75   0.05  -0.05  ….  –0.45
Cluster C2 = {…, N-1, N}     centroid:  -0.4    0.05   1.05   0.85  ….   1.1

An incomplete vector    0.5    ?     0.01  -0.12  ….    ?
is matched to the closest centroid (C1) and completed as
                        0.5   0.75   0.01  -0.12  ….  –0.45
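A sketch of this pattern-completion step (illustrative; the centroid values are those above, truncated to five attributes):

def complete(query, centroids):
    """query: list with None for missing values; centroids: dict name -> vector."""
    def distance(q, c):
        return sum((qv - cv) ** 2 for qv, cv in zip(q, c) if qv is not None)
    best = min(centroids, key=lambda name: distance(query, centroids[name]))
    return [cv if qv is None else qv for qv, cv in zip(query, centroids[best])]

centroids = {"C1": [0.35, 0.75, 0.05, -0.05, -0.45],
             "C2": [-0.4, 0.05, 1.05, 0.85, 1.1]}
print(complete([0.5, None, 0.01, -0.12, None], centroids))
# -> [0.5, 0.75, 0.01, -0.12, -0.45]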
Doug Fisher 48
K-means
Clustering K-Means (Data, K) {
ClusterCentroids = K randomly selected vectors from Data
for each d in Data
assign d to cluster with closest centroid
do {
compute new cluster centroids
for each d in Data
assign d to cluster with closest centroid
} while NOT termination condition
}
“closest”: Euclidean distance
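A runnable Python sketch of the pseudocode above (illustrative; data are a list of numeric tuples, and a fixed iteration count stands in for the termination condition):

import random

def k_means(data, k, iterations=100):
    centroids = random.sample(data, k)                       # K randomly selected vectors
    for _ in range(iterations):                              # simple termination condition
        clusters = [[] for _ in range(k)]
        for d in data:                                        # assign d to closest centroid
            closest = min(range(k),
                          key=lambda j: sum((a - b) ** 2 for a, b in zip(d, centroids[j])))
            clusters[closest].append(d)
        centroids = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]         # compute new cluster centroids
    return clusters, centroids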
Doug Fisher 49
Empirical, Supervised Learning
Example: Naïve Bayesian Classifiers
Subclass: Supervised Rule Induction
Example: Decision tree induction
Example: Brute-force induction of decision rules
Empirical, Unsupervised Learning
Unsupervised Rule Induction
Association Rule Learning
Bayesian Network Learning
Clustering
Analytical Learning
Explanation-Based Learning
Empirical/Analytic Hybrids
Doug Fisher 50
Batch, Empirical, Passive Sequential Learning (BEPSeL)

[Diagram: Environment, Learning Element, Knowledge Base, and Performance Element, annotated with "training set size …" and ""reward" (or costs or …)".]

Given: a set of fully-described sequences (i.e., each run to some "termination" state) – a training data set
Find: an action-taker (for generating subsequent state(s) in an incomplete sequence – a test set)

A training set:

Index
1     s1   s2   s5   sm      t1
2     s2   s3   sm-3         t5
. . . .
n     s1   s3   s7           t3

Primitive-State BEPSeL: given an incomplete sequence
   si  si+1  si+2  …  si+m
generate its continuation, e.g.,
   si  si+1  si+2  …  si+m  si+m+1

Variants?
Doug Fisher 51
Data and knowledge representation issues
• Uncertainty in effects of actions?
• Do states have internal description?
What kind -- features or relational?
• Only rewards at “terminal” states?
Performance issues
• goals or more generalized rewards
Learning principles
• Bridging nondeterministic regions versus frequent associations
  [small diagram over a sequence of states from state N-i to state N]
• Tension between exploitation and exploration
Doug Fisher 52
Analog of Finite Automata

Give an FA that recognizes strings of 0s and 1s where the 3rd symbol from the left
(third from the first symbol) is a '1'.

Give an FA that recognizes strings of 0s and 1s where the 3rd symbol from the right
(third from the last symbol) is a '1'.

[FA diagrams: a small FA for the first problem; a nondeterministic FA for the second,
whose equivalent deterministic FA requires on the order of 2^3 states.]

The analogy: a nondeterministic FA is converted into a deterministic FA by a translation
procedure, much as an AI program is transformed by speedup machine learning.
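As a small sketch (illustrative, and not the requested automata themselves), the two properties can be checked directly; the DFA-style loop for the first shows why it needs only a handful of states, while a DFA for the second must remember the last three symbols seen (hence on the order of 2^3 states):

def third_from_left_is_one(s):
    # DFA with states 0, 1, 2 (symbols read so far), then 'accept' or 'reject'
    state = 0
    for ch in s:
        if state in (0, 1):
            state += 1
        elif state == 2:
            state = "accept" if ch == "1" else "reject"
        # 'accept' and 'reject' absorb all further symbols
    return state == "accept"

def third_from_right_is_one(s):
    # A DFA for this must remember the last 3 symbols: roughly 2^3 states.
    return len(s) >= 3 and s[-3] == "1"

print(third_from_left_is_one("011010"))    # True: the 3rd symbol from the left is '1'
print(third_from_right_is_one("011010"))   # False: the 3rd symbol from the right is '0'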
Doug Fisher 53
Learning macros: Given a plan, generalize the plan so that the generalized plan
can be applied in a greater number of situations
Objective: reusing previously-developed generalized plans (aka macro-operators)
will reduce the cost (improve the “speed”) of subsequent planning
[Blocks-world figure. Start State: A on B, B on C. GoalSpec: B on A.]

Plan:             Unstack(A,B)   Putdown(A)   Unstack(B,C)   Stack(B,A)
(Generalize) ->   Unstack(?x1,?y1)   Putdown(?x1)   Unstack(?y1,?z1)   Stack(?y1,?x1)
Doug Fisher 54
[Blocks-world figure. Start State: A on B, B on C. GoalSpec: B on A.]

Concrete plan:   Unstack(A,B)   Putdown(A)   Unstack(B,C)   Stack(B,A)

Each step is first given its own variables:
                 Unstack(?x1,?y1)   Putdown(?x2)   Unstack(?y2,?z1)   Stack(?y3,?x3)

[Each step is annotated with its preconditions and effects (On, Clear, Holding, OnTab,
and Handemp literals); crossed-out literals mark conditions matched/consumed between steps.]
Doug Fisher 55
Learning macros:
[Blocks-world figure. Start State: A on B, B on C. GoalSpec: B on A.]

Concrete plan:   Unstack(A,B)   Putdown(A)   Unstack(B,C)   Stack(B,A)
Variablized:     Unstack(?x1,?y1)   Putdown(?x2)   Unstack(?y2,?z1)   Stack(?y3,?x3)

[Per-step precondition/effect annotations as before.] Matching the annotations across
steps yields the substitutions {?y3/?y2} and {?x3/?x2}.
Doug Fisher 56
Learning macros:
[Blocks-world figure. Start State: A on B, B on C. GoalSpec: B on A.]

Concrete plan:   Unstack(A,B)   Putdown(A)   Unstack(B,C)   Stack(B,A)
After applying {?y3/?y2} and {?x3/?x2}:
                 Unstack(?x1,?y1)   Putdown(?x2)   Unstack(?y2,?z1)   Stack(?y2,?x2)

[Per-step precondition/effect annotations, updated with the substitutions.]
Doug Fisher 57
Learning macros:
[Blocks-world figure. Start State: A on B, B on C. GoalSpec: B on A.]

Current sequence:   Unstack(?x1,?y1)   Putdown(?x2)   Unstack(?y2,?z1)   Stack(?y2,?x2)
A further match yields {?y2/?y1}; applying it:
                    Unstack(?x1,?y1)   Putdown(?x2)   Unstack(?y1,?z1)   Stack(?y1,?x2)

[Per-step precondition/effect annotations, updated with the substitution.]
Doug Fisher 58
Learning macros:
[Blocks-world figure. Start State: A on B, B on C. GoalSpec: B on A.]

Current sequence:   Unstack(?x1,?y1)   Putdown(?x2)   Unstack(?y1,?z1)   Stack(?y1,?x2)
A further match yields {?x2/?x1}; applying it:
                    Unstack(?x1,?y1)   Putdown(?x1)   Unstack(?y1,?z1)   Stack(?y1,?x1)

[Per-step precondition/effect annotations, updated with the substitution.]
Doug Fisher 59
Learning macros:
[Blocks-world figure. Start State: A on B, B on C. GoalSpec: B on A.]

The fully unified macro, Macrop(?x1, ?y1, ?z1):
    Unstack(?x1,?y1)   Putdown(?x1)   Unstack(?y1,?z1)   Stack(?y1,?x1)

Aggregate preconditions:    On(?x1,?y1), On(?y1,?z1), Clear(?x1), Handemp()
Resulting state literals:   OnTab(?x1), Clear(?y1), Clear(?z1), On(?y1,?x1)
(crossed-out literals in the original figure mark conditions consumed along the way)
Doug Fisher 60
The intent of this form of learning is typically to speed up problem
solving by reducing the effective depth of search
Issues, variants, etc with analytic learning in problem solving
• The problem solving fan effect – the more operators that are learned
the greater the branching factor during search !!
• Redundant search in failure case
• These costs can outweigh the benefits
Solutions focus on
• retaining “high utility” rules (macros)
• improving the selectivity of operator application
Doug Fisher 61
Analytic Learning Problems
AL1: Specify procedures for automatically constructing the PRE, ADD, and DEL
lists of a macro operator
AL2: Suggest ways that you might break up the macro operator used in the
example into smaller operators, each of which might be “applicable” in
a larger class of problems
AL3: What factors do you include in your definition of “applicability”?
AL4: Do any of these factors include "costs" that should be taken into
account in evaluating the utility of a macro operator?
AL5: What are the pitfalls in generalizing by replacing constants by variables?
Doug Fisher 62
Empirical, Supervised Learning
Example: Naïve Bayesian Classifiers
Subclass: Supervised Rule Induction
Example: Decision tree induction
Example: Brute-force induction of decision rules
Empirical, Unsupervised Learning
Unsupervised Rule Induction
Association Rule Learning
Bayesian Network Learning
Clustering
Analytical Learning
Explanation-Based Learning
Empirical/Analytic Hybrids
Lots and lots