Machine Learning
based on
Attribute Interactions
Aleks Jakulin
Advised by Acad. Prof. Dr. Ivan Bratko
2003

2005
Learning =
Modelling
Utility =

Loss
MODEL
Learning Algorithm
Data
Hypothesis
Space
B
{
A
:
“
A
bounds
B
”
The fixed data sample
restricts the model to
be consistent with it.

data
shapes the
model

model
is made of
possible hypotheses

model
is generated by an
algorithm

utility
is the goal of a
model
Our Assumptions about Models
•
Probabilistic Utility:
logarithmic loss
(alternatives: classification accuracy, Brier
score, RMSE)
•
Probabilistic Hypotheses:
multinomial
distribution, mixture of Gaussians
(alternatives: classification trees, linear
models)
•
Algorithm:
maximum likelihood (greedy),
Bayesian integration (exhaustive)
•
Data:
instances + attributes
Expected Minimum Loss = Entropy
C
Entropy
given
C
’s empirical probability distribution (
p
= [0.2, 0.8]).
A
H(A)
Information
which came with
the knowledge of
A
I(A;C)=H(A)+H(C)

H(AC)
Mutual information
or information gain

How much have
A
and
C
in common?
H(CA) = H(C)

I(A;C)
Conditional entropy

Remaining uncertainty
in
C
after learning
A
.
H
(
AC
)
Joint entropy
The diagram is a visualization of a probabilistic model P(
A
,
C
)
2

Way Interactions
•
Probabilistic models take the form of
P
(
A
,
B
)
•
We have two models:
–
Interaction allowed
:
P
Y
(
a
,
b
) :=
F
(
a
,
b
)
–
Interaction disallowed
:
P
N
(
a
,
b
) :=
P
(
a
)
P
(
b
) =
F
(
a
)
G
(
b
)
•
The error that
P
N
makes when approximating
P
Y
:
D
(
P
Y

P
N
) := E
x
~
Py
[L(
x
,
P
N
)] =
I
(
A
;
B
)
(mutual information)
•
Also applies for predictive models:
•
Also applies for Pearson’s correlation coefficient:
P
is a bivariate Gaussian,
obtained via max. likelihood
Rajski’s Distance
•
The attributes that have more in common can be
visualized as closer in some imaginary
Euclidean space.
•
How to avoid the influence of many/few

valued
attributes? (Complex attributes seem to have
more in common.)
•
Rajski’s distance:
•
This is a metric (e.g.: the triangle inequality)
Interactions between
US Senators
dark:
strong interaction
,
high mutual information
light: weak interaction
low mutual information
Interaction matrix
A Taxonomy of
Machine Learning Algorithms
CMC dataset
Interaction dendrogram
3

Way Interactions
C
B
A
label
attribute
attribute
importance of attribute
B
importance of attribute
A
3

Way Interaction:
What is common to
A
,
B
and
C
together;
and cannot be inferred from any subset of attributes.
attribute correlation
2

Way Interactions
Interaction Information
I(A;B;C) :=
I(AB;C)

I(B
;
C)

I(A;C)
= I(B;CA)

I(B;C)
= I(A;CB)

I(A;C)
(Partial) history of
independent
reinventions:
Quastler ‘53 (Info. Theory in Biology)

measure of specificity
McGill ‘54 (Psychometrika)

interaction information
Han ‘80 (Information & Control)

multiple mutual information
Yeung ‘91 (IEEE Trans. On Inf. Theory)

mutual information
Grabisch&Roubens ‘99 (I. J. of Game Theory)

Banzhaf interaction index
Matsuda ‘00 (Physical Review E)

higher

order mutual inf.
Brenner et al. ‘00 (Neural Computation)

average synergy
Dem
šar
’02 (A thesis in machine learning)

relative information gain
Bell ‘03 (NIPS02, ICA2003)

co

information
Jakulin ‘02

interaction gain
How informative are A and B together?
Interaction
Dendrogram
Useful attributes
Useless attributes
farming
soil
vegetation
In classification tasks
we are only interested in
those interactions that
involve the label
Interaction Graph
•
The Titanic data set
–
Label
: survived?
–
Attributes
: describe the
passenger or crew member
•
2

way interactions:
–
Sex
then
Class
;
Age
not as
important
•
3

way interactions:
–
negative
:
‘Crew’
dummy is
wholly contained within
‘Class’
;
‘Sex’
largely explains the death
rate among the crew.
–
positive
:
•
Children from the first and
second class were prioritized.
•
Men from the second class
mostly died (third class men and
the crew were better off)
•
Female crew members had
good odds of survival.
blue
:
redundancy, negative int.
red
:
synergy, positive int.
An Interaction Drilled
Data for ~600 people
What’s the loss assuming no
interaction between
eyes
in
hair
?
Area corresponds to probability
:
•
black square
:
actual probability
•
colored square: predicted
probability
Colors encode the type of error.
The more saturated the color, the
more “significant” the error. Codes:
•
blue
:
overestimate
•
red
:
underestimate
•
white
:
correct estimate
KL

d:
0.178
Rules = Constraints
•
Rule 1:
Blonde hair is
connected with
blue or green
eyes.
•
Rule 2:
Black hair is
connected with
brown eyes.
KL

d:
0.045
KL

d:
0.134
KL

d:
0.0
22
Both rules
:
KL

d:
0.178
No interaction
:
ADULT/CENSUS
Attribute Value Taxonomies
Interactions can also be computed between pairs
of attribute (or label) values. This way, we can
structure attributes with many values (e.g.,
Cartesian products
☺
).
Attribute Selection with Interactions
•
2

way interactions
I
(
A
;
Y
)
are the staple of
attribute selection
–
Examples: information gain, Gini ratio, etc.
–
Myopia! We ignore both positive and negative
interactions.
•
Compare this with controlled 2

way interactions:
I
(
A
;
Y  B,C,D,E,
…)
–
Examples: Relief, regression coefficients
–
We have to build a model on all attributes anyway,
making many assumptions… What does it buy us?
–
We add another attribute, and the usefulness of a
previous attribute is reduced?
Attribute Subset Selection with NBC
The calibration of the classifier (expected likelihood of
an instance’s label) first improves then deteriorates
as we add attributes. The optimal number is ~8
attributes. The first few attributes are important, the
rest is noise?
Attribute Subset Selection with NBC
NO! We sorted the attributes from the worst to
the best. It is some of the
best
attributes that
ruin
the performance! Why? NBC gets
confused by redundancies.
Accounting for Redundancies
At each step, we pick the next best attribute,
accounting for the attributes
already
in the
model:
–
Fleuret’s procedure:
–
Our procedure:
Example:
the naïve
Bayesian
Classifier
myopic
→
↑
Interaction

proof
Predicting with Interactions
•
Interactions are meaningful self

contained views of the
data.
•
Can we use these views for prediction?
•
It’s easy if the views do not overlap: we just multiply
them together, and normalize:
P
(
a,b
)
P
(
c
)
P
(
d,e,f
)
•
If they do overlap:
•
In a general overlap situation, Kikuchi approximation
efficiently handles the intersections between interactions,
and intersections

of

intersections.
•
Algorithm: select interactions, use Kikuchi approximation
to fuse them into a joint prediction, use this to classify.
Interaction
Models
•
Transparent and intuitive
•
Efficient
•
Quick
•
Can be improved by replacing
Kikuchi with Conditional MaxEnt,
and Cartesian product with
something better.
Summary of the Talk
•
Interactions are a good metaphor for
understanding models and data. They can be a
part of the hypothesis space, but do not have to.
•
Probability is crucial for real

world problems.
•
Watch your assumptions (utility, model,
algorithm, data)
•
Information theory provides solid notation.
•
The Bayesian approach to modelling is very
robust (naïve Bayes and Bayes nets are
not
Bayesian approaches)
Summary of Contributions
Practice
•
A number of novel
visualization methods.
•
A heuristic for efficient
non

myopic attribute
selection.
•
An interaction

centered
machine learning method,
Kikuchi

Bayes
•
A family of Bayesian
priors for consistent
modelling with
interactions.
Theory
•
A meta

model of machine
learning.
•
A formal definition of a
k

way interaction,
independent of the utility
and hypothesis space.
•
A thorough historic
overview of related work.
•
A novel view on
interaction significance
tests.
Comments 0
Log in to post a comment