
Factor Analysis Using Delta-Rule Wake-Sleep Learning

Presented by Shaun Savani and Bill
Wicker

The problem


Basic assumption of factor analysis:



Variables that significantly correlate
with each other do so because they are
measuring the same "thing".


The problem:



What is the "thing" that correlated
variables are measuring in common?

The Solution


Factor analysis seeks to discover whether
the observed variables can be
explained largely in terms of a much
smaller number of variables called
factors.

An Intercorrelation Matrix

      X1    X2    X3    X4    X5    X6    X7    X8    X9
X1  1.00  0.80  0.70  0.95  0.01  0.20  0.18  0.16  0.03
X2        1.00  0.63  0.75  0.08  0.11  0.13  0.04  0.09
X3              1.00  0.84  0.02  0.12  0.07  0.15  0.05
X4                    1.00  0.01  0.11  0.06  0.02  0.13
X5                          1.00  0.93  0.02  0.05  0.03
X6                                1.00  0.11  0.09  0.02
X7                                      1.00  0.95  0.90
X8                                            1.00  0.93
X9                                                  1.00

Patterns of
intercorrelation


Variables 1, 2, 3 & 4 correlate highly with
each other, but not with the rest of the
variables


Variables 5 & 6 correlate highly with each
other, but not with the rest of the variables


Variables 7, 8, & 9 correlate highly with each
other, but not with the rest of the variables

Deduction


The nine variables seem to be
measuring 3 "things" or underlying
factors.


What are these three factors?


To what extent does each variable
measure each of these three factors?

In a nutshell


The purpose of factor analysis is to
reduce multiple variables to a smaller
number of underlying factors that
are being measured by the variables.
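This reduction can be seen in simulated data: when blocks of observed variables are driven by shared hidden factors, the block structure shows up directly in the intercorrelation matrix. A minimal sketch (all names and values below are illustrative, not the slides' data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Three hidden factors, each "measured" by a block of observed variables.
f1, f2, f3 = rng.normal(size=(3, n))
noise = rng.normal(size=(9, n)) * 0.3

X = np.vstack([
    f1 + noise[0], f1 + noise[1], f1 + noise[2], f1 + noise[3],  # X1..X4
    f2 + noise[4], f2 + noise[5],                                # X5, X6
    f3 + noise[6], f3 + noise[7], f3 + noise[8],                 # X7..X9
])

R = np.corrcoef(X)  # 9x9 intercorrelation matrix
print(np.round(R, 2))
```

Variables within a block correlate strongly with each other and only weakly with the rest, reproducing the pattern in the matrix above.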

Applications


Psychology and social sciences


Given an observed behavior, do
multiple measures of the behavior
reduce to a few underlying factors?


Useful in the construction of behavioral
theories


Example


Given criminal data, can delinquency be
attributed to a few underlying factors?


Criminal Profile Example


Sentence (sentence)


Number of prior convictions (pr_conv)


Intelligence (iq)


Drug dependency (dr_score)


Chronological age (age)


Age at 1st arrest (age_firs)


Time to case disposition (tm_disp)


Pre-trial jail time (jail_tm)


Time served on sentence (tm_serv)


Educational equivalency (educ_eqv)


Level of work skill (skl_indx)

Intercorrelation Matrix

What Do the Extracted
Factors Measure?



The key to determining what the
factors measure is the factor
loadings


For example


Which variables load (correlate) highest
on Factor I and low on the other two
factors?

What is a Factor
Loading?


A factor loading is the correlation between a
variable and a factor that has been extracted
from the data.


Example



Factor loadings for variable X1.


Interpretation


Variable X1 is highly correlated with Factor I,
but negligibly correlated with Factors II and
III

Variables    Factor I    Factor II    Factor III
X1           0.932       0.013        0.250
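Since a loading is just a correlation, it can be computed directly once factor scores are available. A small sketch with illustrative numbers (not the slides' data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

factor = rng.normal(size=n)                    # scores on an extracted factor
x1 = 0.9 * factor + 0.4 * rng.normal(size=n)   # a variable driven by it

# The factor loading is simply the correlation between X1 and the factor.
loading = np.corrcoef(x1, factor)[0, 1]
print(round(loading, 3))
```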

Factor 1

Factor I

sentence (.933)

tm_serv (.907)

age (.853)

jail_tm (.659)

age_firs (-.581)

pr_conv (.548)

dr_score (.404)

Naming Factor I




What do these seven variables have in common,
particularly those that have the highest loading on
Factor I?


Degree of criminality? Career criminal history?

Factor 2

Factor II

educ_eqv (.935)

skl_indx (.887)

iq (.808)
Naming Factor II



Educational level or job skill level?

Factor 3


Factor III


tm_disp (.896)


Naming Factor III



This factor accounts for less variance in
the observed variables, since only one
variable loaded highest on it.


More variables would need to load on this
factor for it to have a meaningful
interpretation.

Cattell's Scree Plot


How many underlying factors should be extracted?

Cattell's Scree Plot


One way to view the diminishing
utility of each additional factor


Plots the explanatory importance of the
factors with respect to the variables


Cattell's Scree Plot
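A scree plot graphs the eigenvalues of the intercorrelation matrix in decreasing order; the point where the curve levels off (the "elbow") suggests how many factors to retain. A minimal sketch with an illustrative 3x3 matrix:

```python
import numpy as np

# Correlation matrix with one dominant block (illustrative values).
R = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
])

# A scree plot graphs these eigenvalues against factor number;
# one large eigenvalue here suggests one strong factor.
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
print(np.round(eigvals, 3))
```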

How many factors should
there be?


The fewer factors the simpler the
theory; the more factors the better
the theory fits the data.

Occam’s razor

Occam’s razor:

Prefer the simplest hypothesis that (approximately) fits
the data.

too simple hypothesis → large bias

desired hypothesis

too complex hypothesis → large variance

© Barbara Hammer, 2003

Summary of factor
analysis


The 11 variables were reduced to 3 factors


These three factors account for 67.44%
of the covariance among the variables


Factor I appears to measure criminal
history


Factor II appears to measure
educational/skill level


Factor III is ambiguous at best

Methods of Factor
Analysis



A variety of methods have been developed to extract
hidden factors from an intercorrelation matrix.


Principal components method (probably the most commonly
used method)


Maximum likelihood method (a commonly used method)


Principal axis method, also known as common factor analysis


Unweighted least-squares method


Generalized least squares method


Alpha method


Image factoring


Wake-Sleep Learning of a Factor
Analysis Model


How would Barbara Hammer’s neurons
perform factor analysis?...

Factor analysis using the
EM algorithm

However,


“The matrix operations involved do
not appear particularly plausible as a
model of learning in the brain.”
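For comparison, those matrix operations amount to alternating E- and M-steps. A minimal sketch of EM for a single-factor model (synthetic data; all names are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 6

# Synthetic data from x = g*y + noise, with y ~ N(0,1) and diagonal noise.
g_true = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
X = np.outer(rng.normal(size=n), g_true) + rng.normal(size=(n, p)) * 0.5

g = rng.normal(size=p) * 0.1   # generative weights
psi = np.ones(p)               # diagonal noise variances

for _ in range(100):
    # E-step: the posterior of y given x is Gaussian with shared variance V.
    V = 1.0 / (1.0 + np.sum(g * g / psi))
    m = V * (X @ (g / psi))             # posterior means, one per data point
    # M-step: re-estimate weights and noise variances in closed form.
    Eyy = np.sum(m * m) + n * V         # sum of E[y^2 | x_i]
    g = (X.T @ m) / Eyy
    psi = np.mean(X**2, axis=0) - g * (X.T @ m) / n

# Estimated weights should align with the true factor direction (up to sign).
cos = abs(g @ g_true) / (np.linalg.norm(g) * np.linalg.norm(g_true))
print(round(float(cos), 3))
```

The matrix products over the whole data set in each step are exactly the kind of global computation the quote calls biologically implausible.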

Wake-sleep approach


How do the rest of us learn?


“…To obtain a learning procedure of
greater potential neurobiological
interest”


A plausible method of modeling
image-processing in the cortex

Wake-Sleep Algorithm


Purpose


To provide a learning procedure of greater
potential neurobiological interest than the EM method


Produce conditional variances using a
recognition model trained with a
generative model


Alternate using “wake” and “sleep” phases

Wake-Sleep Learning


Wake phase


Using the visible variables, x, randomly
choose values for the hidden factors, y,
based on the current recognition model


Update the parameters of the generative model
to make this case more probable.

Wake-Sleep Learning


Sleep phase


Randomly choose values for the hidden factors,
y, and based on the current generative model,
randomly select “fantasy” values for the visible
variables, x.


Update the parameters of the recognition
model to make the fantasy case more likely

Helmholtz Model for a
Single Hidden Factor


Connections of the generative model are shown
using solid lines, those of the recognition
model using dashed lines. The weights for
these connections are given by the generative
weights g and the recognition weights r.

Generative Model

x = μ + g y + ε

μ = vector of overall means (assumed 0 here)

g = vector of generative weights

y = hidden factor

ε = noise, Gaussian with diagonal covariance matrix
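Sampling from this generative model is one line per term (parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

mu = np.zeros(3)                  # overall means (assumed 0)
g = np.array([0.9, 0.7, 0.5])     # generative weights (illustrative)
tau = np.array([0.3, 0.4, 0.5])   # noise std devs (diagonal covariance)

y = rng.normal()                  # hidden factor, y ~ N(0, 1)
eps = rng.normal(size=3) * tau    # Gaussian noise with diagonal covariance
x = mu + g * y + eps              # visible variables
print(x)
```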

Delta Rule


η is a small positive learning-rate
parameter.

E.g. 0.0002


α is a learning-rate parameter that
is slightly less than one.

E.g. 0.999

Recognition Model

y = r^T x + ν

y = hidden factor

r = vector of recognition weights

x = vector of visible variables

ν = Gaussian noise with mean 0 and variance σ²

Delta Rule


η is a small positive learning-rate
parameter.

E.g. 0.0002


α is a learning-rate parameter that
is slightly less than one.

E.g. 0.999
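Putting the two phases together, and assuming delta-rule updates of the form g ← g + η·y·(x − g·y) in the wake phase and r ← r + η·x·(y − rᵀx) in the sleep phase, with the variances tracked as α-weighted running averages of squared residuals, a sketch of the whole procedure looks like this (a larger η and far fewer presentations than the slides' settings so the sketch runs quickly; the data and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 6

# Synthetic training set from a true single-factor model.
g_true = np.array([0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
data = np.outer(rng.normal(size=n), g_true) + rng.normal(size=(n, p)) * 0.3

eta, alpha = 0.005, 0.995          # larger eta than the slides' 0.0002
g = rng.normal(size=p) * 0.1       # generative weights
tau2 = np.ones(p)                  # generative (per-variable) variances
r = rng.normal(size=p) * 0.1       # recognition weights
sigma2 = 1.0                       # recognition variance

for t in range(50000):
    x = data[t % n]
    # Wake phase: sample y from the recognition model, update generation.
    y = r @ x + rng.normal() * np.sqrt(sigma2)
    resid = x - g * y
    g = g + eta * y * resid                       # delta rule for g
    tau2 = alpha * tau2 + (1 - alpha) * resid**2  # running avg of resid^2
    # Sleep phase: fantasize (y, x) from the generative model, update recognition.
    y_f = rng.normal()
    x_f = g * y_f + rng.normal(size=p) * np.sqrt(tau2)
    err = y_f - r @ x_f
    r = r + eta * x_f * err                       # delta rule for r
    sigma2 = alpha * sigma2 + (1 - alpha) * err**2

# The learned generative weights should align with the true factor direction.
cos = abs(g @ g_true) / (np.linalg.norm(g) * np.linalg.norm(g_true))
print(round(float(cos), 2))
```

Every update uses only locally available quantities (the current x, y, and residual), which is what makes the scheme more neurobiologically plausible than EM's whole-data-set matrix operations.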

Experiment

Wake-sleep learning of a single-factor model for
six variables. The graph shows the progress of
the generative weights over the course of two
million presentations of input vectors, drawn
sequentially from a training set of size 500,
with learning parameters of η = .0002 and
α = .999 used for both the wake and the sleep
phases.

Helmholtz Model for
Multiple Hidden Factors

Everitt Crime Data


When the wake-sleep algorithm was applied to
the Everitt Crime Data (1984), some
discrepancies were found.


When a correlation-inducing connection was
present, the runs found the maximum likelihood
estimates 9 out of 15 times.


When the correlation-inducing connection was
not present, the runs found the maximum
likelihood estimates in only 1 of the 15 runs.

Everitt Crime Data


One of the runs using a correlation-inducing
connection that did not find the maximum
likelihood estimates.

Everitt Crime Data


One of the runs using a correlation-inducing
connection that did find the maximum
likelihood estimates.

Everitt Crime Data


The run that found the maximum likelihood
estimates without using a correlation-inducing
connection.

Improvements


Reduce the learning rate for the
generative parameter


This resulted in eight of eight runs
being successful with a correlation-inducing
connection present, but no
successes without the connection.

Improvements


Impose a constraint that prevents
the generative variances from falling
below 0.01


This resulted in eight of eight runs
being successful with a correlation-inducing
connection present, and three
successes without the connection.

Bias


When the data is not normalized, but rather
uses a bias, the wake-sleep algorithm “fails
rather spectacularly.”


The generative variances become large, which
results in the recognition variances
becoming large as well.


All weights and variances diverge due to
feedback.


This was corrected when a very low learning rate was used

Theoretical guarantees?


Is Wake-Sleep guaranteed to always
converge to the correct factor
analysis model?


When the learning rate is small enough,
the wake-sleep algorithm generally
results in locally maximizing the
likelihood.

Theoretical guarantees?


Discrepancies seen in the
experiments


due to a “false” basin of attraction?


an effect of finite learning rates?


“A proof of convergence may not be
possible at all”



Theoretical guarantees?


“Factor analysis is probably better
implemented on a computer using …
EM”

Relevance


Why use the Wake-Sleep algorithm if EM
is accepted to converge, but Wake-Sleep
isn’t completely guaranteed?


“…To obtain a learning procedure of greater
potential neurobiological interest”


A plausible method of modeling
image-processing in the cortex

Factor analysis and you


The structure of biological systems is
often hierarchical.


Rather than observing optical signals as a list
of billions of pixel variables, as a machine vision
system would, the brain’s image-processing
facilities are more likely to focus on certain
salient aspects of the field of vision.


An attractive mating partner


A bus about to run you over

Factor analysis and you


Factor analysis shows potential for
modeling such biological processes as
vision.