# slides - Institute of Computer Science

Artificial Intelligence and Robotics

19 Oct 2013

Factor Analysis Using Delta-Rule Wake-Sleep Learning

Presented by Shaun Savani and Bill Wicker

The problem

Basic assumption of factor analysis:

Variables that significantly correlate
with each other do so because they are
measuring the same "thing".

The problem:

What is the "thing" that correlated
variables are measuring in common?

The Solution

Factor Analysis seeks to discover if the observed variables can be explained largely in terms of a much smaller number of variables called factors.

An Intercorrelation Matrix

|    | X1   | X2   | X3   | X4   | X5   | X6   | X7   | X8   | X9   |
|----|------|------|------|------|------|------|------|------|------|
| X1 | 1.00 | 0.80 | 0.70 | 0.95 | 0.01 | 0.20 | 0.18 | 0.16 | 0.03 |
| X2 |      | 1.00 | 0.63 | 0.75 | 0.08 | 0.11 | 0.13 | 0.04 | 0.09 |
| X3 |      |      | 1.00 | 0.84 | 0.02 | 0.12 | 0.07 | 0.15 | 0.05 |
| X4 |      |      |      | 1.00 | 0.01 | 0.11 | 0.06 | 0.02 | 0.13 |
| X5 |      |      |      |      | 1.00 | 0.93 | 0.02 | 0.05 | 0.03 |
| X6 |      |      |      |      |      | 1.00 | 0.11 | 0.09 | 0.02 |
| X7 |      |      |      |      |      |      | 1.00 | 0.95 | 0.90 |
| X8 |      |      |      |      |      |      |      | 1.00 | 0.93 |
| X9 |      |      |      |      |      |      |      |      | 1.00 |

(Only the upper triangle is shown; the matrix is symmetric.)

Patterns of intercorrelation

Variables 1, 2, 3 & 4 correlate highly with
each other, but not with the rest of the
variables

Variables 5 & 6 correlate highly with each
other, but not with the rest of the variables

Variables 7, 8, & 9 correlate highly with each
other, but not with the rest of the variables

Deduction

The nine variables seem to be
measuring 3 "things" or underlying
factors.

What are these three factors?

To what extent does each variable
measure each of these three factors?

In a nutshell

The purpose of factor analysis is to
reduce multiple variables to a lesser
number of underlying factors that
are being measured by the variables.
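The reduction described above can be seen numerically: variables driven by the same hidden factor correlate strongly with each other and only weakly with the rest. A minimal sketch with synthetic data (the loadings and block structure here are illustrative assumptions, not the deck's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Three hidden factors (the "things" being measured).
f = rng.normal(size=(n, 3))

# Nine observed variables: 1-4 load on factor 0, 5-6 on factor 1,
# 7-9 on factor 2, each with a little independent noise.
loadings = np.zeros((3, 9))
loadings[0, 0:4] = [0.9, 0.8, 0.7, 0.9]
loadings[1, 4:6] = [0.9, 0.9]
loadings[2, 6:9] = [0.9, 0.9, 0.9]
x = f @ loadings + 0.3 * rng.normal(size=(n, 9))

# 9x9 intercorrelation matrix, like the slide's table.
r = np.corrcoef(x, rowvar=False)

# Within-block correlations come out high; cross-block near zero.
print(round(r[0, 1], 2), round(r[0, 4], 2))
```

The block pattern in `r` mirrors the slide's table: high entries within {X1..X4}, {X5, X6}, {X7..X9}, and near-zero entries between the blocks.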

Applications

Psychology and social sciences

Given an observed behavior, do
multiple measures of the behavior
reduce to a few underlying factors?

Useful in the construction of behavioral
theories

Example

Given criminal data, can delinquency be
attributed to a few underlying factors?

Criminal Profile Example

Sentence (sentence)

Number of prior convictions (pr_conv)

Intelligence (iq)

Drug dependency (dr_score)

Chronological age (age)

Age at 1st arrest (age_firs)

Time to case disposition (tm_disp)

Pre-trial jail time (jail_tm)

Time served on sentence (tm_serv)

Educational equivalency (educ_eqv)

Level of work skill (skl_indx)

Intercorrelation Matrix

What Do the Extracted Factors Measure?

The key to determining what the factors measure is the factor loadings.

For example: which variables load high on Factor I and low on the other two factors?

What is a Factor Loading?

A factor loading is the correlation between a variable and a factor that has been extracted from the data.

Example

Interpretation

Variable X1 is highly correlated with Factor I,
but negligibly correlated with Factors II and
III

| Variables | Factor I | Factor II | Factor III |
|-----------|----------|-----------|------------|
| X1        | 0.932    | 0.013     | 0.250      |
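Since a loading is just a variable-factor correlation, it can be computed directly. A minimal sketch with synthetic, illustrative data (the numbers below are assumptions, not the deck's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

factor = rng.normal(size=n)                    # an extracted factor's scores
x1 = 0.9 * factor + 0.4 * rng.normal(size=n)   # X1 mostly measures this factor
other = rng.normal(size=n)                     # a factor X1 does not measure

# A factor loading is the correlation between variable and factor.
load_1 = np.corrcoef(x1, factor)[0, 1]
load_other = np.corrcoef(x1, other)[0, 1]
print(round(load_1, 2), round(load_other, 2))
```

Like X1 in the table above, the variable loads highly on the factor it measures and negligibly on the one it does not.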

Factor I

- sentence (.933)
- tm_serv (.907)
- age (.853)
- jail_tm (.659)
- age_firs (-.581)
- pr_conv (.548)
- dr_score (.404)

Naming Factor I

What do these seven variables, which load highly on Factor I, have in common?

Degree of criminality? Career criminal history?

Factor II

- educ_eqv (.935)
- skl_indx (.887)
- iq (.808)

Naming Factor II

Educational level or job skill level?

Factor III

- tm_disp (.896)

Naming Factor III

This factor accounts for less variance in the observed variables, since only one variable loads highly on it; that may be too little for the factor to have relevant meaning.

Cattell's Scree Plot

How many underlying factors should be extracted?

One way to view the diminishing explanatory importance of the factors: it plots the explanatory importance of the factors with respect to the variables.


How many factors should there be?

The fewer factors the simpler the
theory; the more factors the better
the theory fits the data.
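The deck's scree plot itself is not reproduced, but the idea can be checked against the 9-variable intercorrelation matrix from the earlier slide: the eigenvalues drop off sharply after the third factor. A sketch in Python:

```python
import numpy as np

# Upper triangle of the 9x9 intercorrelation matrix from the earlier slide.
upper = [
    [1.00, 0.80, 0.70, 0.95, 0.01, 0.20, 0.18, 0.16, 0.03],
    [0.00, 1.00, 0.63, 0.75, 0.08, 0.11, 0.13, 0.04, 0.09],
    [0.00, 0.00, 1.00, 0.84, 0.02, 0.12, 0.07, 0.15, 0.05],
    [0.00, 0.00, 0.00, 1.00, 0.01, 0.11, 0.06, 0.02, 0.13],
    [0.00, 0.00, 0.00, 0.00, 1.00, 0.93, 0.02, 0.05, 0.03],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.11, 0.09, 0.02],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.95, 0.90],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.93],
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
]
u = np.array(upper)
r = u + u.T - np.eye(9)  # symmetrize (diagonal was counted twice)

# A scree plot graphs these eigenvalues against factor rank.
eigvals = np.sort(np.linalg.eigvalsh(r))[::-1]
print(np.round(eigvals, 2))
```

The first three eigenvalues dominate and the "scree" flattens out after them, pointing to three factors (the Kaiser criterion would likewise keep only eigenvalues above 1).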

Occam’s razor

Occam’s razor: prefer the simplest hypothesis that (approximately) fits the data.

[Figure: bias-variance tradeoff. A too-simple hypothesis has large bias; a too-complex hypothesis has large variance; the desired hypothesis lies between the two. (Barbara Hammer, 2003)]

Summary of factor analysis

The 11 variables were reduced to 3 factors

These three factors account for 67.44%
of the covariance among the variables

Factor I appears to measure criminal history

Factor II appears to measure
educational/skill level

Factor III is ambiguous at best

Methods of Factor Analysis

A variety of methods have been developed to extract
hidden factors from an intercorrelation matrix.

Principal components method (probably the most commonly used method)

Maximum likelihood method (a commonly used method)

Principal axis method, also known as common factor analysis

Unweighted least-squares method

Generalized least squares method

Alpha method

Image factoring

Wake-Sleep Learning a Factor Analysis Model

How would Barbara Hammer’s neurons
perform factor analysis?...

Factor analysis using the EM algorithm

Factor analysis is classically fit with the EM algorithm. However,

“The matrix operations involved do not appear particularly plausible as a model of learning in the brain.”

Wake-sleep approach

How do the rest of us learn?

“…To obtain a learning procedure of
greater potential neurobiological
interest”

A plausible method of modeling image-processing in the cortex

Wake-Sleep Algorithm

Purpose

To provide greater potential neurobiological
interest than the EM method

Produce conditional variances using a
recognition model trained with a
generative model

Alternate using “wake” and “sleep” phases

Wake-Sleep Learning

Wake phase

Using the visible variables, x, randomly choose values for the hidden factors, y, based on the current recognition model.

Update the generative model to make this case more probable.

Wake-Sleep Learning

Sleep phase

Randomly choose values for the hidden factors, y, and, based on the current generative model, randomly select “fantasy” values for the visible variables, x.

Update the parameters of the recognition model to make the fantasy case more likely.

Helmholtz Model for a Single Hidden Factor

Connections of the generative model are shown using solid lines, those of the recognition model using dashed lines. The weights for these connections are given by the generative weights g and the recognition weights r.
Generative Model

x = μ + g y + ε

- μ = vector of overall means (assumed 0 here)
- g = vector of generative weights
- y = hidden factor
- ε = noise, Gaussian with diagonal covariance matrix

Delta Rule

η is a small positive learning rate parameter, e.g. 0.0002. α is a learning rate parameter that is slightly less than one, e.g. 0.999.
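The update equations on this slide were rendered as images and did not survive extraction. A hedged reconstruction from the surrounding definitions (the delta-rule form and the symbol τⱼ² for the generative noise variances are our assumptions): after the wake phase samples a factor value y for a data vector x,

```latex
g_j \leftarrow g_j + \eta \,(x_j - g_j\, y)\, y,
\qquad
\tau_j^2 \leftarrow \alpha\, \tau_j^2 + (1-\alpha)\,(x_j - g_j\, y)^2
```

Each generative weight moves to reduce the prediction error x_j − g_j y, while the variances track the squared residuals as an exponential moving average with decay α.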

Recognition Model

y = rᵀx + ν

- y = hidden factor
- r = vector of recognition weights
- x = vector of visible variables
- ν = Gaussian noise with mean 0 and variance σ²

Delta Rule

η is a small positive learning rate parameter, e.g. 0.0002. α is a learning rate parameter that is slightly less than one, e.g. 0.999.
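The update equations on this slide were likewise images that did not survive extraction. A hedged reconstruction in delta-rule form (our assumption), applied after a fantasy pair (y, x) is generated in the sleep phase:

```latex
r \leftarrow r + \eta\,(y - r^{\top} x)\, x,
\qquad
\sigma^2 \leftarrow \alpha\,\sigma^2 + (1-\alpha)\,(y - r^{\top} x)^2
```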

Experiment

Wake-sleep learning of a single-factor model for six variables. The graph shows the progress of the generative weights over the course of two million presentations of input vectors, drawn sequentially from a training set of size 500, with learning parameters of η = .0002 and α = .999 used for both the wake and the sleep phases.
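The graph for this experiment is not reproduced. As a rough, scaled-down sketch of the same kind of run (three variables instead of six, a larger learning rate, far fewer presentations, a delta-rule form of the updates, and a 0.01 floor on the variances, all of which are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: one hidden factor, three visible variables.
g_true = np.array([1.0, 0.8, 0.6])
data = g_true * rng.normal(size=(500, 1)) + 0.1 * rng.normal(size=(500, 3))

eta, alpha = 0.003, 0.99      # cruder settings than the slide's, for speed
g = 0.1 * rng.normal(size=3)  # generative weights
tau2 = np.ones(3)             # generative noise variances
r = 0.1 * rng.normal(size=3)  # recognition weights
sigma2 = 1.0                  # recognition noise variance

for t in range(30000):
    # Wake phase: present a training case sequentially, sample y from the
    # recognition model, then delta-rule update of the generative parameters.
    x = data[t % 500]
    y = r @ x + np.sqrt(sigma2) * rng.normal()
    g += eta * (x - g * y) * y
    tau2 = np.maximum(alpha * tau2 + (1 - alpha) * (x - g * y) ** 2, 0.01)

    # Sleep phase: fantasize (y, x) from the generative model,
    # then delta-rule update of the recognition parameters.
    y = rng.normal()
    x = g * y + np.sqrt(tau2) * rng.normal(size=3)
    err = y - r @ x
    r += eta * err * x
    sigma2 = max(alpha * sigma2 + (1 - alpha) * err ** 2, 0.01)

# Up to an overall sign, the learned generative weights should line up
# with the direction used to generate the data.
cos = abs(g @ g_true) / (np.linalg.norm(g) * np.linalg.norm(g_true))
print(round(cos, 2))
```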

Helmholtz Model for Multiple Hidden Factors

Everitt Crime Data

When the wake-sleep algorithm was applied to the Everitt Crime Data (1984), some discrepancies were found.

When a correlation-inducing connection was present, the algorithm found the maximum likelihood estimates in 9 of 15 runs.

When the correlation-inducing connection was not present, the algorithm found the maximum likelihood estimates in only 1 of the 15 runs.

Everitt Crime Data

One of the runs using a correlation-inducing connection that did not find the maximum likelihood estimates.

Everitt Crime Data

One of the runs using a correlation-inducing connection that did find the maximum likelihood estimates.

Everitt Crime Data

The run that found the maximum likelihood estimates without using a correlation-inducing connection.

Improvements

Reduce the learning rate for the
generative parameter

This resulted in eight of eight runs being successful with a correlation-inducing connection present, but no successes without the connection.

Improvements

Impose a constraint that prevents
the generative variances from falling
below 0.01

This resulted in eight of eight runs being successful with a correlation-inducing connection present, and three successes without the connection.

Bias

When the data is not normalized, but a bias is used instead, the wake-sleep algorithm “fails rather spectacularly.”

The generative variances become large, which causes the recognition variances to become large as well.

All weights and variances diverged due to feedback.

This was corrected when a very low learning rate was used.

Theoretical guarantees?

Is wake-sleep guaranteed to always converge to the correct factor analysis model?

When the learning rate is small enough, the wake-sleep algorithm generally results in locally maximizing the likelihood.

Theoretical guarantees?

Discrepancies seen in the experiments:

- due to a “false” basin of attraction?
- an effect of finite learning rates?

“A proof of convergence may not be possible at all”

Theoretical guarantees?

“Factor analysis is probably better
implemented on a computer using …
EM”

Relevance

Why use the wake-sleep algorithm if EM is accepted to converge, but wake-sleep isn’t completely guaranteed?

“…To obtain a learning procedure of greater potential neurobiological interest”

A plausible method of modeling image-processing in the cortex

Factor analysis and you

The structure of biological systems is
often hierarchical.

Rather than observing optical signals as a list of billions of pixel variables, as a machine vision system would, the brain’s image-processing facilities are more likely to focus on certain salient aspects of the field of vision.

An attractive mating partner

A bus about to run you over

Factor analysis and you

Factor analysis shows potential for
modeling such biological processes as
vision.