
DATA MINING

Artificial Neural Networks

Alexey Minin, Jass 2006

The ANN forms its output itself, according to the information presented at its input. We have to choose some functional; after we have found this functional, we have to minimize it. This is the main task: according to this functional, the weights of the network will be changed. In practice, adaptive networks code the input information in the most compact way, subject, of course, to some predefined requirements.


Learning without a teacher: introduction

Reducing the dimension of the data with minimal loss

Learning without a teacher: redundancy of data

The length of the data description: $D = d \cdot b$

$d$, the dimension of the data = the number of components of the input vector

$b$, the capacity of the data = the number of bits defining the possible variety of all values
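As a sanity check of this bookkeeping (a trivial sketch with toy numbers of my own choosing, not from the slides):

```python
d = 8       # dimension: components in the input vector
b = 8       # capacity: bits per component (2**8 = 256 possible values)
D = d * b   # length of the data description, in bits
print(D)    # 64 bits to describe one input vector
```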
Two ways of coding (reducing) the information:

- reducing the variety of the data by detecting prototypes: clustering and quantization;
- finding independent features: reducing the dimension.

Two ways to reduce the data



Reducing the dimension allows us to describe the data with fewer components.

Clustering allows us to reduce the variety of the data, reducing the number of bits we need to describe the data.


We can unite both types of algorithms: we can use Kohonen maps, in which the prototypes are arranged in a space of low dimension. For example, the input data can be mapped onto a 2-dimensional grid of prototypes in such a way that you can visualize the data you have.
Main idea: the neuron as indicator

(Figure: a single linear neuron with inputs $x_1, \ldots, x_d$ and weights $w_j$.)

A neuron has one output and is trained on $d$-dimensional data. Let's say that the activation function is linear. The output therefore is a linear combination of its inputs:

$$y = \sum_{j=1}^{d} w_j x_j$$
The amplitude after the training is finished can serve as an indicator for the data, showing whether or not the data corresponds to the training patterns.

Hebb training algorithm

$$\Delta w_j = \eta\, y\, x_j$$

According to Hebb: if we reformulate the task as an optimization task, we obtain the property of such a neuron and the rule for defining the functional we have to minimize:

$$\Delta \mathbf{w} \propto -\frac{\partial E}{\partial \mathbf{w}}, \qquad E(\mathbf{w}) = -\frac{1}{2}\, y^2 = -\frac{1}{2} \left( \mathbf{w} \cdot \mathbf{x} \right)^2$$
NB! If we want to reach the minimum of E, then the output amplitude will grow to infinity: the Hebb rule alone leads to unlimited growth of the weights.
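A minimal NumPy sketch (my illustration, not from the slides) showing this divergence: iterating the plain Hebb rule on fixed zero-mean data makes the weight amplitude grow without bound.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))     # 500 training patterns, d = 3
w = 0.01 * rng.normal(size=3)     # small initial weights
eta = 0.01                        # learning rate

for epoch in range(5):
    for x in X:
        y = w @ x                 # linear neuron: y = sum_j w_j x_j
        w += eta * y * x          # Hebb update: dw_j = eta * y * x_j
    print(epoch, np.linalg.norm(w))   # |w| keeps growing, epoch after epoch
```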

Oja training rule

$$\Delta w_j = \eta\, y \left( x_j - y\, w_j \right)$$

The interfering term was added to stop the unlimited growth of the weights.

Oja's rule maximizes the sensitivity of the neuron's output at a limited amplitude of the weights. It is easy to convince oneself of this by equating the average change of the weights to zero and then multiplying the right-hand side of the equality by $\mathbf{w}$. We see that in equilibrium

$$\left\langle y^2 \right\rangle \left( 1 - \left| \mathbf{w} \right|^2 \right) = 0, \qquad \left| \mathbf{w} \right| = 1.$$

Thus, the weights of the trained neuron are located on a hypersphere: with Oja training, the weight vector settles onto the hypersphere $|\mathbf{w}| = 1$, in the direction maximizing the projection of the input vectors.
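The same sketch with Oja's correction term (again my illustration): the weight norm now stabilizes at 1, and w lines up with the first principal direction of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
# zero-mean data: variance 9 along the first axis, 1 along the second
X = rng.normal(size=(2000, 2)) * np.array([3.0, 1.0])
w = 0.01 * rng.normal(size=2)
eta = 0.005

for x in X:
    y = w @ x
    w += eta * y * (x - y * w)    # Oja update: dw_j = eta * y * (x_j - y * w_j)

print(np.linalg.norm(w))          # ~1.0: the weights sit on the unit hypersphere
print(w)                          # ~(+-1, 0): the first principal direction
```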
SUMMARY:

The neuron is trying to reproduce the values of its inputs from its known output. This means that it is trying to maximize the sensitivity of its output: such neuron-indicators compress the multi-dimensional input information this way.


Oja training rule

NB! The output of the Oja output layer is a linear combination of the principal components. If you want to obtain the principal components themselves, you should change the sum over all outputs to a sum over the first $i$ outputs only:

$$\Delta w_{ij} = \eta\, y_i \left( x_j - \sum_{k=1}^{i} y_k w_{kj} \right)$$
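A sketch of this modified rule (my illustration; the k ≤ i form is widely known as Sanger's rule). Each neuron subtracts only the part of the input already explained by the neurons before it, so the rows of W converge to the individual principal components; replacing the truncated sum with the full `y @ W` gives the plain Oja layer rule that appears later in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# zero-mean data with variances 9, 4, 1 along the coordinate axes
X = rng.normal(size=(4000, 3)) * np.array([3.0, 2.0, 1.0])
W = 0.01 * rng.normal(size=(2, 3))   # m = 2 output neurons, d = 3 inputs
eta = 0.002

for epoch in range(3):
    for x in X:
        y = W @ x                            # y_i = sum_j W_ij x_j
        for i in range(len(W)):
            x_hat = y[: i + 1] @ W[: i + 1]  # reconstruction from outputs k <= i
            W[i] += eta * y[i] * (x - x_hat) # Sanger update for neuron i

print(np.round(W, 2))   # rows ~ (+-1, 0, 0) and (0, +-1, 0): the top two PCs
```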

The analysis of main components

Let's say that we have $d$-dimensional data $\mathbf{x}$ and we are training $m$ linear neurons:

$$y_i = \sum_{j=1}^{d} w_{ij}\, x_j = \mathbf{w}_i \cdot \mathbf{x}, \qquad i = 1, \ldots, m$$
THE TASK IS: we want the amplitudes of all the output neurons to be independent indicators, fully reflecting the information about the multi-dimensional data we have.


The requirement: the neurons must interact somehow (if we train them independently, we will receive the same result for all of them).

In the simple case:

Let's take a perceptron with linear neurons in the hidden layer, in which the number of inputs equals the number of outputs, and the weights with the same indexes in both layers are the same. Let's try to teach the ANN to reproduce the input on the output: $x_1, \ldots, x_d \;\to\; y \;\to\; \tilde{x}_1, \ldots, \tilde{x}_d$. The training rule therefore is

$$\Delta \mathbf{w} = \eta\, y \left( \mathbf{x} - \tilde{\mathbf{x}} \right) = \eta\, y \left( \mathbf{x} - y\, \mathbf{w} \right).$$

Looks like the Oja training rule!

Self-training layer:

In our formulation, the training of a separate neuron tries to reproduce the inputs from its output. Generalizing this observation, it is logical to suggest a rule according to which the values of the inputs are restored from the whole output information. Proceeding this way, we get the Oja training rule for a one-layer network:

$$\Delta w_{ij} = \eta\, y_i \left( x_j - \tilde{x}_j \right), \qquad \tilde{x}_j = \sum_{k} y_k w_{kj}$$
The hidden layer of such an ANN, just like the Oja layer, performs an optimal coding of the input data: it contains the maximum variety of the data under the existing restrictions.

Example:

Let's change the activation function to a sigmoid in the training rule:

$$\Delta w_{ij} = \eta\, f(y_i) \left( x_j - \sum_{k} f(y_k)\, w_{kj} \right)$$
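One update step of this nonlinear rule, as I read the slide (a sketch; the choice of f is an assumption, any smooth squashing function such as tanh or the logistic sigmoid fits the formula):

```python
import numpy as np

def nonlinear_oja_step(W, x, eta=0.01, f=np.tanh):
    """One update of dW_ij = eta * f(y_i) * (x_j - sum_k f(y_k) * W_kj)."""
    y = W @ x                           # linear responses of the layer
    fy = f(y)                           # squashed outputs
    x_hat = fy @ W                      # reconstruction from nonlinear outputs
    W += eta * np.outer(fy, x - x_hat)  # in-place weight update
    return W
```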








This brings a new property (Oja, et al., 1991). Such an algorithm, in particular, was used for the decomposition of signals mixed in an unknown way (i.e., blind signal separation). We face this task, for example, when we want to separate a human voice from noise.


Competition of neurons: the winner takes all

The outputs of the competitive layer are $y_i = \sum_{j=1}^{d} w_{ij} x_j$. The winner is the neuron $i^*$ with the maximal response, $y_{i^*} = \max_k y_k$:

$$i^*: \quad \mathbf{w}_{i^*} \cdot \mathbf{x} \;\ge\; \mathbf{w}_i \cdot \mathbf{x} \quad \text{for all } i$$
Base algorithm

Only the winner is trained; the weights of all other neurons remain constant. The winner ($i^*$ = the number of the winning neuron) is the neuron with the maximum response:

$$y_{i^*} = 1, \qquad y_i = 0 \ \ (i \ne i^*),$$

and the training of the winner is

$$\mathbf{w}_{i^*}(t+1) = \mathbf{w}_{i^*}(t) + \eta \left( \mathbf{x} - \mathbf{w}_{i^*}(t) \right).$$
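A compact sketch of this winner-take-all training (my illustration; I pick the winner by distance to the prototype, the form used on the Kohonen slide below, which matches the maximum-response rule when the weights are normalized):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: three clouds of points around three centers
centers = np.array([[0.0, 5.0], [5.0, 0.0], [-4.0, -4.0]])
X = np.concatenate([c + rng.normal(size=(300, 2)) for c in centers])
rng.shuffle(X)

W = X[:3].copy()    # three competing prototypes, seeded from the data
eta = 0.05

for x in X:
    i_star = np.argmin(np.linalg.norm(x - W, axis=1))  # the winner
    W[i_star] += eta * (x - W[i_star])                 # only the winner is trained

print(np.round(W, 1))   # the prototypes end up near the three cluster centers
```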
The winner does not take all

One variant of updating the base training rule of a competitive layer consists in training not only the winner neuron but also its "neighbors", though at a smaller rate. Such an approach, "pulling up" the neurons nearest to the winner, is applied in topographic Kohonen maps.


The winner is the prototype nearest to the input:

$$i^*: \quad \left| \mathbf{x} - \mathbf{w}_{i^*} \right| = \min_i \left| \mathbf{x} - \mathbf{w}_i \right|$$

and all prototypes are updated with a strength set by the neighborhood function $\Lambda$:

$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta\, \Lambda\!\left( \left| i - i^* \right|, t \right) \left( \mathbf{x}(t) - \mathbf{w}_i(t) \right)$$

The neighborhood function is equal to unity for the winner neuron with index $i^*$ and gradually falls off with distance from the winner.

Training à la Kohonen resembles stretching an elastic grid of prototypes over the data cloud of the training sample.







For example, a Gaussian neighborhood function (with width $\sigma$):

$$\Lambda(a) = \exp\left( -\frac{a^2}{2\sigma^2} \right)$$
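A minimal one-dimensional Kohonen map sketch (my illustration; the grid size, decay schedules, and Gaussian width are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(3000, 2))     # the data cloud

m = 10                                         # prototypes on a 1-D grid
W = rng.uniform(-0.1, 0.1, size=(m, 2))
grid = np.arange(m)                            # grid coordinate of each neuron

for t, x in enumerate(X):
    frac = 1.0 - t / len(X)
    eta = 0.5 * frac                           # decaying learning rate
    sigma = 0.5 + 4.0 * frac                   # shrinking neighborhood width
    i_star = np.argmin(np.linalg.norm(x - W, axis=1))           # winner
    lam = np.exp(-((grid - i_star) ** 2) / (2.0 * sigma ** 2))  # neighborhood
    W += eta * lam[:, None] * (x - W)          # pull the winner and its neighbors

print(np.round(W, 2))   # an ordered "elastic" chain stretched over the data
```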

Schematic representation of a self-organizing network

Methodology of self-organizing maps

Neurons in the target layer are ordered and correspond to the cells of a two-dimensional map, which can be colored by the principle of affinity of attributes.

A convenient tool for data visualization is the coloring of topographic maps, similar to how it is done on ordinary geographical maps. Each attribute of the data generates its own coloring of the map cells: by the average value of this attribute over the data points that fall into the given cell.

Visualization: a topographic map induced by the i-th component of the input data.

Having collected together the maps of all the attributes that interest us, we obtain a topographic atlas, giving an integrated representation of the structure of the multivariate data.

Classified SOM for the NASDAQ 100 index for the period from 10-Nov-1997 till 27-Aug-2001


Complexity of the algorithm

When is it better to use a reduction of the dimension, and when a quantization of the input information?

Reducing the dimension:

- Number of operations: $C_1 \sim P\, W^2$, where $P$ is the number of training patterns and $W = d\,m$ is the number of synaptic weights of a one-layer ANN with $d$ inputs and $m$ output neurons.
- Compression coefficient: $K = d/m$, hence $C_1 \sim P\, d^4 / K^2$.

Quantization:

- Compression coefficient: $K = d\,b / \log_2 m$ ($b$ is the capacity of the data), i.e. $m = 2^{db/K}$ prototypes.
- Number of operations: $C_2 \sim P\, W \sim P\, d\, 2^{db/K}$.

With the same compression coefficient:

$$\frac{C_2}{C_1} \sim \frac{K^2\, 2^{db/K}}{d^3}$$


JPEG example

The image is divided into blocks of $8 \times 8$ pixels, which become the input vectors we want to compress: in our case $d = 8 \times 8 = 64$. Let's suppose that the image contains $2^8 = 256$ gradations of gray (the accuracy of the represented data), so $b = 8$.

But if $d = 64 \times 64$, then $K > 10^3$.
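A quick numeric check of the comparison (my sketch; it simply evaluates the ratio $C_2/C_1 \sim K^2\, 2^{db/K} / d^3$ as reconstructed above):

```python
def cost_ratio(d, b, K):
    """C2/C1 ~ K**2 * 2**(d*b/K) / d**3: quantization vs. dimension reduction."""
    return K**2 * 2 ** (d * b / K) / d**3

# JPEG-like blocks: d = 8*8 = 64 pixels, b = 8 bits (256 gray levels)
print(cost_ratio(64, 8, 128))        # ~1: the two approaches cost about the same

# one 64x64 image as a single input vector
print(cost_ratio(64 * 64, 8, 1000))  # >> 1: quantization is far more expensive
print(cost_ratio(64 * 64, 8, 3000))  # < 1: cheaper only once K grows past ~10^3
```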

Any questions?