Machine Learning Based on Attribute Interactions


Aleks Jakulin
Advised by Acad. Prof. Dr. Ivan Bratko
2003-2005

Learning = Modelling

Utility = -Loss

[Diagram: a learning algorithm produces a MODEL from data, within a hypothesis space; the fixed data sample restricts the model to be consistent with it.]

- data shapes the model
- model is made of possible hypotheses
- model is generated by an algorithm
- utility is the goal of a model

Our Assumptions about Models

- Probabilistic utility: logarithmic loss (alternatives: classification accuracy, Brier score, RMSE)
- Probabilistic hypotheses: multinomial distribution, mixture of Gaussians (alternatives: classification trees, linear models)
- Algorithm: maximum likelihood (greedy), Bayesian integration (exhaustive)
- Data: instances + attributes

Expected Minimum Loss = Entropy

- H(C): entropy, given C's empirical probability distribution (p = [0.2, 0.8])
- H(A): the information that came with the knowledge of A
- I(A;C) = H(A) + H(C) - H(AC): mutual information, or information gain; how much A and C have in common
- H(C|A) = H(C) - I(A;C): conditional entropy; the remaining uncertainty in C after learning A
- H(AC): joint entropy

The diagram is a visualization of a probabilistic model P(A, C).
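
As a worked illustration (not from the slides), these quantities can be computed directly from an empirical joint probability table. The joint values below are hypothetical, chosen only so that the marginal of C matches the slide's p = [0.2, 0.8].

```python
# Minimal sketch: entropy, mutual information and conditional entropy
# from an empirical joint probability table (hypothetical numbers).
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries skipped)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution P(A, C); rows index A, columns index C.
P_AC = np.array([[0.05, 0.45],
                 [0.15, 0.35]])

P_A = P_AC.sum(axis=1)          # marginal P(A)
P_C = P_AC.sum(axis=0)          # marginal P(C) = [0.2, 0.8]

H_A  = entropy(P_A)
H_C  = entropy(P_C)
H_AC = entropy(P_AC.ravel())    # joint entropy H(AC)

I_AC = H_A + H_C - H_AC         # mutual information I(A;C)
H_C_given_A = H_C - I_AC        # conditional entropy H(C|A)

print(I_AC, H_C_given_A)
```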

2-Way Interactions

- Probabilistic models take the form P(A, B)
- We have two models:
  - Interaction allowed: P_Y(a, b) := F(a, b)
  - Interaction disallowed: P_N(a, b) := P(a)P(b) = F(a)G(b)
- The error that P_N makes when approximating P_Y:
  D(P_Y || P_N) := E_{x ~ P_Y}[L(x, P_N)] = I(A; B)   (mutual information)
- The same holds for predictive models.
- It also applies to Pearson's correlation coefficient, where P is a bivariate Gaussian obtained via maximum likelihood.
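
As a quick numerical check (not on the slide), the identity D(P_Y || P_N) = I(A;B) can be verified for any discrete joint distribution; the table below is hypothetical.

```python
# Verify that the KL divergence from the independence model equals I(A;B).
import numpy as np

P_Y = np.array([[0.30, 0.10],
                [0.05, 0.55]])                       # hypothetical P_Y(a, b)
P_N = np.outer(P_Y.sum(axis=1), P_Y.sum(axis=0))     # P(a) P(b)

kl = np.sum(P_Y * np.log2(P_Y / P_N))                # D(P_Y || P_N)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

mi = H(P_Y.sum(axis=1)) + H(P_Y.sum(axis=0)) - H(P_Y.ravel())   # I(A;B)

assert np.isclose(kl, mi)
```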

Rajski’s Distance

- Attributes that have more in common can be visualized as closer together in an imaginary Euclidean space.
- How do we avoid the influence of many-valued vs. few-valued attributes? (Complex attributes seem to have more in common.)
- Rajski's distance: d(A, B) = 1 - I(A; B) / H(A, B)
- This is a metric (e.g., it satisfies the triangle inequality).
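
A small sketch of the distance for a pair of discrete attributes, using the definition d(A,B) = 1 - I(A;B)/H(A,B) given above:

```python
# Sketch of Rajski's distance for a discrete joint distribution:
# d = 0 for identical attributes, d = 1 for independent ones.
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def rajski_distance(P_AB):
    """P_AB: 2-D array with the joint probabilities of attributes A and B."""
    H_A  = H(P_AB.sum(axis=1))
    H_B  = H(P_AB.sum(axis=0))
    H_AB = H(P_AB.ravel())
    I_AB = H_A + H_B - H_AB
    return 1.0 - I_AB / H_AB if H_AB > 0 else 0.0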



Interactions between US Senators

- dark: strong interaction, high mutual information
- light: weak interaction, low mutual information

[Figure: interaction matrix]

A Taxonomy of Machine Learning Algorithms

[Figure: interaction dendrogram, CMC dataset]
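
One plausible way (not necessarily the thesis code) to produce such a dendrogram is to hierarchically cluster the attributes on their pairwise Rajski distances, e.g. with SciPy's agglomerative clustering:

```python
# Build an interaction dendrogram by clustering attributes on a precomputed
# square matrix of pairwise Rajski distances (see rajski_distance above).
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def interaction_dendrogram(dist_matrix, labels):
    """dist_matrix: square matrix of Rajski distances between attributes."""
    condensed = squareform(dist_matrix, checks=False)   # condensed form for linkage
    Z = linkage(condensed, method='average')            # agglomerative clustering
    return dendrogram(Z, labels=labels, no_plot=True)
```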

3-Way Interactions

[Diagram: label C and attributes A, B. The pairwise overlaps are the 2-way interactions: the importance of attribute A, the importance of attribute B, and the attribute correlation; the central overlap is the 3-way interaction.]

3-Way Interaction: what is common to A, B and C together, and cannot be inferred from any subset of the attributes.

Interaction Information

How informative are A and B together?

I(A;B;C) := I(AB;C) - I(B;C) - I(A;C)
          = I(B;C|A) - I(B;C)
          = I(A;C|B) - I(A;C)

(Partial) history of independent reinventions:

- Quastler '53 (Info. Theory in Biology): measure of specificity
- McGill '54 (Psychometrika): interaction information
- Han '80 (Information & Control): multiple mutual information
- Yeung '91 (IEEE Trans. on Inf. Theory): mutual information
- Grabisch & Roubens '99 (I. J. of Game Theory): Banzhaf interaction index
- Matsuda '00 (Physical Review E): higher-order mutual inf.
- Brenner et al. '00 (Neural Computation): average synergy
- Demšar '02 (a thesis in machine learning): relative information gain
- Bell '03 (NIPS02, ICA2003): co-information
- Jakulin '02: interaction gain
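
As an illustration (not from the slides), the definition above can be evaluated directly from entropies of the joint distribution; expanding I(AB;C) - I(B;C) - I(A;C) gives the entropy form used below.

```python
# Interaction information I(A;B;C) for three discrete variables,
# computed from the entropies of a 3-D joint probability array.
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def interaction_information(P):
    """P: 3-D array of joint probabilities P(a, b, c)."""
    H_A   = H(P.sum(axis=(1, 2)))
    H_B   = H(P.sum(axis=(0, 2)))
    H_C   = H(P.sum(axis=(0, 1)))
    H_AB  = H(P.sum(axis=2).ravel())
    H_AC  = H(P.sum(axis=1).ravel())
    H_BC  = H(P.sum(axis=0).ravel())
    H_ABC = H(P.ravel())
    # I(AB;C) - I(B;C) - I(A;C), rewritten in terms of joint entropies.
    # Positive values indicate synergy, negative values redundancy.
    return (H_AB + H_AC + H_BC) - (H_A + H_B + H_C) - H_ABC
```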

Interaction Dendrogram

[Figure: dendrogram with attributes such as farming, soil and vegetation, split into useful and useless attributes.]

In classification tasks we are only interested in those interactions that involve the label.

Interaction Graph

- The Titanic data set
- Label: survived?
- Attributes: describe the passenger or crew member
- 2-way interactions: Sex, then Class; Age is not as important
- 3-way interactions:
  - negative: the 'Crew' dummy is wholly contained within 'Class'; 'Sex' largely explains the death rate among the crew
  - positive:
    - Children from the first and second class were prioritized.
    - Men from the second class mostly died (third-class men and the crew were better off).
    - Female crew members had good odds of survival.
- blue: redundancy, negative interaction
- red: synergy, positive interaction

An Interaction Drilled

- Data for ~600 people
- What is the loss if we assume no interaction between eyes and hair?
- Area corresponds to probability:
  - black square: actual probability
  - colored square: predicted probability
- Colors encode the type of error; the more saturated the color, the more "significant" the error:
  - blue: overestimate
  - red: underestimate
  - white: correct estimate
- KL-divergence: 0.178
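
A minimal sketch, with made-up counts, of the quantities this visualization encodes: the actual cell probabilities, the no-interaction predictions, the signed per-cell error that determines the colour, and the total KL divergence.

```python
# Per-cell comparison of the actual joint with the no-interaction model.
import numpy as np

counts = np.array([[60, 30, 10],      # stand-in hair x eye-colour table
                   [20, 80, 40],
                   [10, 50, 100]])
P = counts / counts.sum()                        # actual P(hair, eyes)
Q = np.outer(P.sum(axis=1), P.sum(axis=0))       # predicted P(hair)P(eyes)

signed_error = Q - P      # > 0: overestimate (blue), < 0: underestimate (red)
kl_d = np.sum(P * np.log2(P / Q))                # loss of the no-interaction model
```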

Rules = Constraints

- Rule 1: blonde hair is connected with blue or green eyes.
- Rule 2: black hair is connected with brown eyes.

KL-divergences of the resulting models (figure panels):
- no interaction: 0.178
- each single rule: 0.134 and 0.045
- both rules: 0.022

Attribute Value Taxonomies

Interactions can also be computed between pairs of attribute (or label) values. This way we can structure attributes with many values (e.g., Cartesian products).

[Figure: attribute value taxonomy, ADULT/CENSUS data set]

Attribute Selection with Interactions

- 2-way interactions I(A; Y) are the staple of attribute selection.
  - Examples: information gain, Gini ratio, etc.
  - Myopia! We ignore both positive and negative interactions.
- Compare this with controlled 2-way interactions: I(A; Y | B, C, D, E, …)
  - Examples: Relief, regression coefficients
  - We have to build a model on all attributes anyway, making many assumptions. What does that buy us?
  - We add another attribute, and the usefulness of a previous attribute is reduced?

Attribute Subset Selection with NBC

The calibration of the classifier (the expected likelihood of an instance's label) first improves and then deteriorates as we add attributes. The optimal number is about 8 attributes. Are the first few attributes important and the rest just noise?

Attribute Subset Selection with NBC

No! We sorted the attributes from the worst to the best. It is some of the best attributes that ruin the performance. Why? NBC gets confused by redundancies.

Accounting for Redundancies

At each step we pick the next best attribute, accounting for the attributes already in the model:

- Fleuret's procedure: [formula not shown]
- Our procedure: [formula not shown]

Example: the naïve Bayesian classifier, comparing myopic and interaction-proof attribute selection.
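
The two selection formulas on the slide are not reproduced in the text, so the following is only a hedged sketch of a greedy, redundancy-aware loop in the spirit of Fleuret's conditional-mutual-information criterion (pick the attribute whose worst-case informativeness I(A;Y|B), over the attributes B already selected, is largest). The names `mi` and `cond_mi` are assumed estimator functions supplied by the caller, not identifiers from the thesis.

```python
# Greedy, redundancy-aware attribute selection (sketch, not the thesis code).
def select_attributes(data, attributes, label, k, mi, cond_mi):
    """mi(data, a, y) -> I(A;Y); cond_mi(data, a, y, b) -> I(A;Y|B)."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        def score(a):
            if not selected:
                return mi(data, a, label)                      # plain I(A;Y)
            # worst-case conditional informativeness given chosen attributes
            return min(cond_mi(data, a, label, b) for b in selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```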

Predicting with Interactions

- Interactions are meaningful, self-contained views of the data.
- Can we use these views for prediction?
- It is easy if the views do not overlap: we just multiply them together and normalize, e.g. P(a,b)P(c)P(d,e,f).
- If they do overlap:
  - In a general overlap situation, the Kikuchi approximation efficiently handles the intersections between interactions, and the intersections of intersections.
- Algorithm: select interactions, use the Kikuchi approximation to fuse them into a joint prediction, and use this to classify.
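
For the non-overlapping case described above, a minimal sketch of fusing disjoint views into a class posterior might look as follows; `fuse_views` and its arguments are illustrative names, not the thesis's Kikuchi-Bayes implementation, which is what the overlapping case requires.

```python
# Fuse non-overlapping interaction views into a class posterior by
# multiplying their per-class contributions and renormalizing.
import numpy as np

def fuse_views(prior, views):
    """prior: P(y) with all entries > 0; views: list of arrays P(y | observed
    values of one disjoint attribute subset), one array per interaction."""
    posterior = prior.copy()
    for v in views:
        posterior *= v / prior        # each view updates the prior once
    return posterior / posterior.sum()
```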

Interaction Models

- Transparent and intuitive
- Efficient
- Quick
- Can be improved by replacing Kikuchi with conditional MaxEnt, and the Cartesian product with something better.

Summary of the Talk

- Interactions are a good metaphor for understanding models and data. They can be a part of the hypothesis space, but do not have to be.
- Probability is crucial for real-world problems.
- Watch your assumptions (utility, model, algorithm, data).
- Information theory provides solid notation.
- The Bayesian approach to modelling is very robust (naïve Bayes and Bayes nets are not Bayesian approaches).

Summary of Contributions

Practice
- A number of novel visualization methods.
- A heuristic for efficient non-myopic attribute selection.
- An interaction-centered machine learning method, Kikuchi-Bayes.
- A family of Bayesian priors for consistent modelling with interactions.

Theory
- A meta-model of machine learning.
- A formal definition of a k-way interaction, independent of the utility and hypothesis space.
- A thorough historical overview of related work.
- A novel view of interaction significance tests.