Factor Analysis Using Delta-Rule Wake-Sleep Learning
Presented by Shaun Savani and Bill Wicker
The problem
• Basic assumption of factor analysis:
  – Variables that significantly correlate with each other do so because they are measuring the same "thing".
• The problem:
  – What is the "thing" that correlated variables are measuring in common?
The Solution
• Factor analysis seeks to discover whether the observed variables can be explained largely in terms of a much smaller number of variables called factors.
An Intercorrelation Matrix

       X1    X2    X3    X4    X5    X6    X7    X8    X9
X1   1.00  0.80  0.70  0.95  0.01  0.20  0.18  0.16  0.03
X2         1.00  0.63  0.75  0.08  0.11  0.13  0.04  0.09
X3               1.00  0.84  0.02  0.12  0.07  0.15  0.05
X4                     1.00  0.01  0.11  0.06  0.02  0.13
X5                           1.00  0.93  0.02  0.05  0.03
X6                                 1.00  0.11  0.09  0.02
X7                                       1.00  0.95  0.90
X8                                             1.00  0.93
X9                                                   1.00

(Only the upper triangle is shown; the matrix is symmetric.)
Patterns of intercorrelation
• Variables 1, 2, 3 & 4 correlate highly with each other, but not with the rest of the variables
• Variables 5 & 6 correlate highly with each other, but not with the rest of the variables
• Variables 7, 8 & 9 correlate highly with each other, but not with the rest of the variables
Deduction
• The nine variables seem to be measuring 3 "things" or underlying factors.
  – What are these three factors?
  – To what extent does each variable measure each of these three factors?
In a nutshell
• The purpose of factor analysis is to reduce multiple variables to a smaller number of underlying factors that are being measured by the variables.
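This reduction can be sketched numerically. Below, the illustrative 9×9 intercorrelation matrix from the earlier slide is eigendecomposed (the principal components method discussed later); the three large eigenvalues correspond to the three underlying factors, and the loadings give each variable's correlation with each factor. This is our own sketch, not code from the presentation:

```python
import numpy as np

# Upper-triangular rows of the illustrative intercorrelation matrix
# from the earlier slide (row i holds correlations of X_{i+1} with
# X_{i+1} .. X_9).
rows = [
    [1.00, 0.80, 0.70, 0.95, 0.01, 0.20, 0.18, 0.16, 0.03],
    [1.00, 0.63, 0.75, 0.08, 0.11, 0.13, 0.04, 0.09],
    [1.00, 0.84, 0.02, 0.12, 0.07, 0.15, 0.05],
    [1.00, 0.01, 0.11, 0.06, 0.02, 0.13],
    [1.00, 0.93, 0.02, 0.05, 0.03],
    [1.00, 0.11, 0.09, 0.02],
    [1.00, 0.95, 0.90],
    [1.00, 0.93],
    [1.00],
]

R = np.zeros((9, 9))
for i, row in enumerate(rows):
    R[i, i:] = row
R = R + R.T - np.diag(np.diag(R))   # mirror to make R symmetric

# Eigendecomposition: each large eigenvalue marks one underlying factor.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: correlation of each variable with each retained factor.
loadings = eigvecs[:, :3] * np.sqrt(eigvals[:3])

print("eigenvalues:", np.round(eigvals, 2))
print("variables loading > 0.5 on factor 1:",
      np.where(np.abs(loadings[:, 0]) > 0.5)[0] + 1)
```

Only three eigenvalues are substantially larger than the rest, and the variables that load heavily on the first factor are exactly the X1–X4 block seen in the correlation pattern.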
Applications
• Psychology and the social sciences
• Given an observed behavior, do multiple measures of the behavior reduce to a few underlying factors?
  – Useful in the construction of behavioral theories
• Example
  – Given criminal data, can delinquency be attributed to a few underlying factors?
Criminal Profile Example
• Sentence (sentence)
• Number of prior convictions (pr_conv)
• Intelligence (iq)
• Drug dependency (dr_score)
• Chronological age (age)
• Age at 1st arrest (age_firs)
• Time to case disposition (tm_disp)
• Pre-trial jail time (jail_tm)
• Time served on sentence (tm_serv)
• Educational equivalency (educ_eqv)
• Level of work skill (skl_indx)
Intercorrelation Matrix
What Do the Extracted Factors Measure?
• The key to determining what the factors measure is the factor loadings
• For example
  – Which variables load (correlate) highest on Factor I and low on the other two factors?
What is a Factor Loading?
• A factor loading is the correlation between a variable and a factor that has been extracted from the data.
• Example
  – Factor loadings for variable X1:

    Variables   Factor I   Factor II   Factor III
    X1            0.932      0.013       0.250

• Interpretation
  – Variable X1 is highly correlated with Factor I, but negligibly correlated with Factors II and III
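As a quick numerical illustration of this definition (our own sketch, with made-up numbers): simulate a factor y and a variable constructed to load 0.9 on it, and check that the loading is simply their correlation:

```python
import numpy as np

rng = np.random.default_rng(3)

# A factor loading is corr(variable, factor): simulate a factor y and a
# variable x1 constructed to load 0.9 on it (illustrative numbers only).
n = 100_000
y = rng.standard_normal(n)                                    # the factor
x1 = 0.9 * y + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)   # unit variance

loading = np.corrcoef(x1, y)[0, 1]
print("loading of x1 on the factor:", round(loading, 3))
```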
Factor 1
• Factor I
  – sentence (.933)    tm_serv (.907)
  – age (.853)    jail_tm (.659)
  – age_firs (-.581)    pr_conv (.548)
  – dr_score (.404)
• Naming Factor I
  – What do these seven variables have in common, particularly those that have the highest loadings on Factor I?
  – Degree of criminality? Career criminal history?
Factor 2
• Factor II
  – educ_eqv (.935)    skl_indx (.887)
  – iq (.808)
• Naming Factor II
  – Educational level or job skill level?
Factor 3
• Factor III
  – tm_disp (.896)
• Naming Factor III
  – This factor accounts for less variance in the observed variables, since only one variable loaded highest on it.
• More variables would need to load on this factor for it to have relevant meaning.
Cattell's Scree Plot
• How many underlying factors should be extracted?
Cattell's Scree Plot
• One way to view the diminishing utility of each additional factor
• Plots the explanatory importance of the factors with respect to the variables
Cattell's Scree Plot
How many factors should there be?
• The fewer the factors, the simpler the theory; the more factors, the better the theory fits the data.
Occam’s razor
Occam’s razor: Prefer the simplest hypothesis that (approximately) fits the data.
[Figure: model-complexity trade-off. A too-simple hypothesis has large bias; a too-complex hypothesis has large variance; the desired hypothesis lies in between.]
© Barbara Hammer inc. 2003
Summary of factor analysis
• The 11 variables were reduced to 3 factors
• These three factors account for 67.44% of the covariance among the variables
• Factor I appears to measure criminal history
• Factor II appears to measure educational/skill level
• Factor III is ambiguous at best
Methods of Factor Analysis
• A variety of methods have been developed to extract hidden factors from an intercorrelation matrix.
  – Principal components method (probably the most commonly used method)
  – Maximum likelihood method (a commonly used method)
  – Principal axis method, also known as common factor analysis
  – Unweighted least-squares method
  – Generalized least-squares method
  – Alpha method
  – Image factoring
  – Wake-Sleep
Learning a Factor Analysis model
• How would Barbara Hammer’s neurons perform factor analysis?
Factor analysis using the EM algorithm
• However, “The matrix operations involved do not appear particularly plausible as a model of learning in the brain.”
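For reference, those matrix operations can be sketched for a single-factor model. The E-step/M-step forms below follow the standard EM derivation for factor analysis; the data and variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a true single-factor model: x = g_true * y + noise.
d, n = 6, 2000
g_true = np.array([2.0, 1.5, 1.0, 0.8, 0.5, 0.3])
y = rng.standard_normal(n)
X = y[:, None] * g_true + rng.standard_normal((n, d)) * np.sqrt(0.5)

S = X.T @ X / n                      # sample covariance (zero means assumed)

def loglik(g, psi):
    """Log-likelihood of x ~ N(0, g g^T + diag(psi))."""
    C = np.outer(g, g) + np.diag(psi)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C, S)))

g = rng.standard_normal(d) * 0.1     # generative weights
psi = np.ones(d)                     # diagonal noise variances
lls = []
for _ in range(100):
    lls.append(loglik(g, psi))
    # E-step: posterior moments of the hidden factor for each case.
    prec = 1.0 + g @ (g / psi)       # posterior precision of y given x
    Ey = (X @ (g / psi)) / prec      # E[y | x]
    Ey2 = 1.0 / prec + Ey**2         # E[y^2 | x]
    # M-step: closed-form updates for weights and noise variances.
    g = (X.T @ Ey) / Ey2.sum()
    psi = np.mean(X**2 - X * Ey[:, None] * g, axis=0)

print("learned |g|:", np.round(np.abs(g), 2))
```

The matrix inversions and batch averages here are exactly what the quote objects to as a model of the brain; each iteration provably never decreases the likelihood.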
Wake-sleep approach
• How do the rest of us learn?
• “…to obtain a learning procedure of greater potential neurobiological interest”
• A plausible method of modeling image processing in the cortex
Wake-Sleep Algorithm
• Purpose
  – To provide greater potential neurobiological interest than the EM method
• Produces conditional variances using a recognition model trained together with a generative model
• Alternates between “wake” and “sleep” phases
Wake-Sleep Learning
• Wake phase
  – Using the visible variables, x, randomly choose values for the hidden factors, y, based on the current recognition model
  – Update the generative model to make this case more probable
Wake-Sleep Learning
• Sleep phase
  – Randomly choose values for the hidden factors, y, and, based on the current generative model, randomly select “fantasy” values for the visible variables, x
  – Update the parameters of the recognition model to make the fantasy case more likely
Helmholtz Model for a Single Hidden Factor
• Connections of the generative model are shown using solid lines, those of the recognition model using dashed lines. The weights for these connections are given by the generative weights g and the recognition weights r.
Generative Model
x = μ + g y + ε
  μ = vector of overall means (assumed 0 here)
  g = vector of generative weights
  y = hidden factor
  ε = noise, Gaussian with a diagonal covariance matrix
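A quick numerical check of this generative model (our own sketch, with made-up parameter values): draw many x vectors and confirm their covariance equals g gᵀ plus the diagonal noise covariance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Generative model for a single hidden factor: x = mu + g*y + eps,
# with mu = 0, y ~ N(0,1), eps ~ N(0, diag(psi)).
g = np.array([1.0, -0.5, 2.0])
psi = np.array([0.2, 0.3, 0.1])

n = 200_000
y = rng.standard_normal(n)
eps = rng.standard_normal((n, 3)) * np.sqrt(psi)
X = y[:, None] * g + eps

# The implied covariance of x is g g^T + diag(psi).
C_model = np.outer(g, g) + np.diag(psi)
C_sample = np.cov(X, rowvar=False)
print("sample minus model covariance:")
print(np.round(C_sample - C_model, 3))
```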
Delta Rule
• Where η is a small positive learning-rate parameter
  – e.g., 0.0002
• Where α is a learning-rate parameter that is slightly less than one
  – e.g., 0.999
Recognition Model
y = rᵀx + ν
  y = hidden factor
  r = vector of recognition weights
  x = vector of visible variables
  ν = Gaussian noise with mean 0 and variance σ²
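Putting the two models and the delta rule together, the whole wake-sleep procedure can be sketched as below. This is our own illustrative reconstruction: the exact update equations from the slides are not reproduced here, so we assume the generic delta rule (weight += η × input × prediction error) for both models, variance estimates exponentially decayed with α, and the 0.01 variance floor suggested later under Improvements:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic training set from a true single-factor model.
d, n = 3, 500
g_true = np.array([2.0, 1.0, 0.5])
X = (rng.standard_normal(n)[:, None] * g_true
     + rng.standard_normal((n, d)) * np.sqrt(0.5))

eta, alpha = 0.01, 0.99              # learning rates (the slides use .0002 / .999)
g = rng.standard_normal(d) * 0.1     # generative weights
sig2_g = np.ones(d)                  # generative (per-variable) noise variances
r = rng.standard_normal(d) * 0.1     # recognition weights
sig2_r = 1.0                         # recognition noise variance

for t in range(50_000):
    # Wake phase: infer a hidden factor with the recognition model,
    # then delta-rule the generative model toward the observed x.
    x = X[t % n]
    y = r @ x + rng.standard_normal() * np.sqrt(sig2_r)
    err = x - g * y                          # generative prediction error
    g = g + eta * y * err                    # delta rule
    sig2_g = np.maximum(alpha * sig2_g + (1 - alpha) * err**2, 0.01)

    # Sleep phase: dream a "fantasy" (y, x) pair from the generative
    # model, then delta-rule the recognition model toward that y.
    y_f = rng.standard_normal()
    x_f = g * y_f + rng.standard_normal(d) * np.sqrt(sig2_g)
    err_f = y_f - r @ x_f                    # recognition prediction error
    r = r + eta * x_f * err_f                # delta rule
    sig2_r = max(alpha * sig2_r + (1 - alpha) * err_f**2, 0.01)

print("learned |g|:", np.round(np.abs(g), 2), "  true g:", g_true)
```

Note that each update uses only locally available quantities (the current activities and errors), which is the whole point of the approach compared with EM's matrix operations. The sign of the recovered factor is arbitrary, as in any factor analysis.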
Experiment
Wake-sleep learning of a single-factor model for six variables. The graph shows the progress of the generative weights over the course of two million presentations of input vectors, drawn sequentially from a training set of size 500, with learning parameters of η = .0002 and α = .999 being used for both the wake and the sleep phases.
Helmholtz Model for Multiple Hidden Factors
Everitt Crime Data
• When the wake-sleep algorithm was applied to the Everitt crime data (1984), some discrepancies were found.
• When a correlation-inducing connection was present, the algorithm found the maximum likelihood estimates in 9 of 15 runs
• When the correlation-inducing connection was not present, the algorithm found the maximum likelihood estimates in only 1 of 15 runs
Everitt Crime Data
• One of the runs using a correlation-inducing connection that did not find the maximum likelihood estimates.
Everitt Crime Data
• One of the runs using a correlation-inducing connection that did find the maximum likelihood estimates.
Everitt Crime Data
• The run that found the maximum likelihood estimates without using a correlation-inducing connection.
Improvements
• Reduce the learning rate for the generative parameters
  – This resulted in eight of eight runs being successful with a correlation-inducing connection present, but no successes without the connection.
Improvements
• Impose a constraint that prevents the generative variances from falling below 0.01
  – This resulted in eight of eight runs being successful with a correlation-inducing connection present, and three successes without the connection.
Bias
• When the data are not normalized, but rather a bias is used, the wake-sleep algorithm “fails rather spectacularly.”
• The generative variances become large, which causes the recognition variances to become large as well.
• All weights and variances diverged due to feedback.
  – Corrected when a very low learning rate was used
Theoretical guarantees?
• Is Wake-Sleep guaranteed to always converge to the correct factor analysis model?
  – When the learning rate is small enough, the wake-sleep algorithm generally results in locally maximizing the likelihood.
Theoretical guarantees?
• Discrepancies seen in the experiments
  – Due to a “false” basin of attraction?
  – An effect of finite learning rates?
• “A proof of convergence may not be possible at all”
Theoretical guarantees?
• “Factor analysis is probably better implemented on a computer using … EM”
Relevance
• Why use the Wake-Sleep algorithm if EM is accepted to converge, but Wake-Sleep isn’t completely guaranteed?
  – “…to obtain a learning procedure of greater potential neurobiological interest”
• A plausible method of modeling image processing in the cortex
Factor analysis and you
• The structure of biological systems is often hierarchical.
  – Rather than observing optical signals as a list of billions of pixel variables, as a machine vision system would, the brain’s image-processing facilities are more likely to focus on certain salient aspects of the field of vision.
    • An attractive mating partner
    • A bus about to run you over
Factor analysis and you
• Factor analysis shows potential for modeling such biological processes as vision.