Cellular Neural Networks Training and Implementation on a GPU Cluster

Bipul Luitel, Cameron Johnson & Sinchan Roychowdhury
Real-Time Power and Intelligent Systems Lab, Missouri S&T, Rolla, MO
CS387 Spring 2011, April 26, 2011

Outline

- Cellular Neural Networks
  - Architecture
  - Application: Wide Area Monitoring
  - Training a neural network using particle swarm optimization
- PSO
  - Introduction
  - Implementation
  - Parallelization: concurrencies in CNN and PSO
- GPU Computing in RTPIS Lab
  - Architecture of the RTPIS GPU cluster
  - Some timing comparisons
- Discussion

Cellular Neural Networks

- Cellular Neural Networks (CNN): a class of neural network architectures distinct from the feedforward and feedback architectures.
- Traditionally, each element of the network (a cell) is a simple computational unit.
- A different variation places a neural network (NN) on each cell, with the cells connected to each other in different fashions depending on the problem.
  - Each cell is a feedforward or feedback neural network.
  - The output of one NN (cell) may be connected to the input(s) of one or more neighboring cells, forming an iterative feedback loop during training (see the sketch below).
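To make the cell-to-cell wiring concrete, here is a minimal sketch in C of one time step of such a CNN, assuming a 1-D ring of cells where each cell is a small feedforward NN fed by its own external input plus the previous outputs of its two neighbors. The layer sizes, ring neighborhood, and names are illustrative assumptions, not the architecture used in the talk.

/* Sketch: a CNN whose cells are small feedforward NNs; each cell reads the
 * previous outputs of its ring neighbors, so feedback is iterated over steps.
 * (Hypothetical structure -- the talk's CNN is problem dependent.) */
#include <math.h>

#define CELLS   4      /* number of cells in the CNN            */
#define INPUTS  3      /* external input + two neighbor outputs */
#define HIDDEN  5      /* hidden neurons per cell               */

typedef struct {
    double w_ih[HIDDEN][INPUTS + 1]; /* input->hidden weights (+bias)  */
    double w_ho[HIDDEN + 1];         /* hidden->output weights (+bias) */
} Cell;

/* One feedforward pass through a single cell. */
double cell_forward(const Cell *c, const double in[INPUTS])
{
    double out = c->w_ho[HIDDEN];                   /* output bias */
    for (int j = 0; j < HIDDEN; j++) {
        double s = c->w_ih[j][INPUTS];              /* hidden bias */
        for (int i = 0; i < INPUTS; i++)
            s += c->w_ih[j][i] * in[i];
        out += c->w_ho[j] * tanh(s);
    }
    return tanh(out);
}

/* One time step of the whole CNN: every cell uses its neighbors'
 * outputs from the previous step, closing the iterative feedback loop. */
void cnn_step(const Cell cells[CELLS], const double ext[CELLS],
              const double prev_out[CELLS], double next_out[CELLS])
{
    for (int k = 0; k < CELLS; k++) {
        double in[INPUTS] = {
            ext[k],                              /* external measurement  */
            prev_out[(k + CELLS - 1) % CELLS],   /* left neighbor output  */
            prev_out[(k + 1) % CELLS]            /* right neighbor output */
        };
        next_out[k] = cell_forward(&cells[k], in);
    }
}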

Application: Wide Area Monitoring in Power Systems

- The smart grid is a complex, distributed cyber-physical power and energy system
  - Communication, computation and control
- Remote monitoring of distributed systems is necessary to assess the status of the system
- Wide area monitoring
  - Monitor the status of generators (speed deviation)
  - Monitor the status of buses (bus voltage)
  - Assist in predictive state estimation in order to provide real-time control
  - Becomes a challenge when the network is large, with many parameters to be monitored
- Generator speed deviation prediction using cellular neural networks
Implementation of CNN-based WAM

(Figures: Wide Area Monitoring System for the Two-Area Four-Machine System; Wide Area Monitoring System for the Twelve-Bus System.)
Implementation

- The architecture of the CNN can be problem dependent.

Training

- Training of the CMLP is a challenge due to the iterative feedback.

(Flowchart: initialize weights; then, for each training sample and for each cell, train the neural network.)
Training

- Training in parallel using particle swarm optimization.


Particle Swarm Optimization

- Population-based search algorithm
  - Many points tested
  - A form of guided search
- Sound familiar?
  - GAs are also population-based, but model Darwinian evolution
  - PSO instead assumes a flock or swarm
- Goals are scattered about a search space
  - More searchers means they are found faster

PSO Algorithm

- Define the objective function
- Identify the independent variables
  - These make up the coordinate system of the search hyperspace
- Initialize a population of n particles with random coordinates and velocities
- Find their fitnesses
  - Record each personal best
  - Record the global best
- Update velocities and locations (see the sketch below)
- Find the new fitnesses
  - Update the personal bests
  - Update the global best
- Run until a termination condition is reached
  - Minimum acceptable fitness reached
  - Maximum number of fitness tests reached
  - Maximum allotted real-world time passed

(Illustration: particles scattered over the x-y search space with fitness values f = 5, 3, 0, 1, 0, 1, 6, 0.)
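The velocity and position updates listed above follow the standard PSO equations. Below is a minimal C sketch of one particle's update step; the inertia weight w and acceleration constants c1 and c2 are common illustrative defaults, not values taken from the talk.

/* Standard PSO velocity/position update for one particle. */
#include <stdlib.h>

#define DIM 10   /* dimensions = number of parameters being optimized */

static double rand01(void) { return (double)rand() / RAND_MAX; }

void pso_update(double x[DIM], double v[DIM],
                const double pbest[DIM], const double gbest[DIM])
{
    const double w = 0.72, c1 = 1.49, c2 = 1.49;   /* illustrative constants */
    for (int d = 0; d < DIM; d++) {
        v[d] = w * v[d]
             + c1 * rand01() * (pbest[d] - x[d])   /* pull toward personal best */
             + c2 * rand01() * (gbest[d] - x[d]);  /* pull toward global best   */
        x[d] += v[d];
    }
}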

Parallel PSO Implementation

- Obvious concurrency: each particle on a node
  - Each particle's fitness is independent
  - Particles can update velocities and positions concurrently
- What's the catch?
  - Communication overhead
  - Fitness information has to be shared to determine the new gbest (see the sketch below)
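One common way to pay that communication cost is a single collective operation per iteration. The sketch below assumes one particle per MPI rank and uses MPI_Allreduce with MPI_MINLOC to find which rank holds the best (lowest) fitness, then broadcasts that particle's position; this is an illustration of the idea, not necessarily how the authors implemented it.

/* Sketch: determine the new gbest across ranks, one particle per rank. */
#include <mpi.h>

void share_gbest(double my_fit, const double my_x[], int dim,
                 double *gbest_fit, double gbest_x[])
{
    struct { double fit; int rank; } local, best;
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    local.fit  = my_fit;
    local.rank = my_rank;
    /* Find the minimum fitness and the rank that owns it. */
    MPI_Allreduce(&local, &best, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);

    *gbest_fit = best.fit;
    if (my_rank == best.rank) {
        for (int d = 0; d < dim; d++) gbest_x[d] = my_x[d];
    }
    /* Broadcast the winning position from the owning rank to everyone. */
    MPI_Bcast(gbest_x, dim, MPI_DOUBLE, best.rank, MPI_COMM_WORLD);
}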

PSO Topology and Cluster Topology

- How do the particles communicate? Consider your hardware:
  - Particles on the same node (if #particles > #nodes) can have a full connection
  - Particles on adjacent nodes can communicate in O(1)
  - Particles on further nodes require more hops
- Local best: the best amongst all neighbors (see the sketch below)

(Diagram of communication topologies: the star topology yields global bests; the wheel topology requires multiple steps to find the global best; the ring topology finds local bests.)
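For a ring topology, each particle only needs the personal bests of its two neighbors, which maps directly onto the O(1) adjacent-node communication mentioned above. A sketch, again assuming one particle per MPI rank and a fixed weight-vector length DIM; the packing scheme is an illustrative assumption, not the talk's exact setup.

/* Sketch: ring-topology local best.  Each rank packs [fitness, position]
 * into one buffer and swaps it with its two ring neighbors only. */
#include <mpi.h>
#include <string.h>

#define DIM 10                      /* number of weights per particle */

void ring_local_best(double pbest_fit, const double pbest_x[DIM],
                     double *lbest_fit, double lbest_x[DIM])
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank + size - 1) % size;
    int right = (rank + 1) % size;

    double mine[DIM + 1], from_left[DIM + 1], from_right[DIM + 1];
    mine[0] = pbest_fit;
    memcpy(&mine[1], pbest_x, DIM * sizeof(double));

    /* Adjacent-node communication: one exchange in each ring direction. */
    MPI_Sendrecv(mine, DIM + 1, MPI_DOUBLE, right, 0,
                 from_left, DIM + 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(mine, DIM + 1, MPI_DOUBLE, left, 1,
                 from_right, DIM + 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Local best = best (lowest error) of self and the two neighbors. */
    double *best = mine;
    if (from_left[0]  < best[0]) best = from_left;
    if (from_right[0] < best[0]) best = from_right;
    *lbest_fit = best[0];
    memcpy(lbest_x, &best[1], DIM * sizeof(double));
}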

Training a CNN with a PSO Using Parallel Processing

- Concurrency: each particle of the PSO
  - Implement the entire CNN on each node, and treat each node as a particle
- Concurrency: the NNs that make up the CNN
  - Implement the CNN in parallel
  - One NN per node; the NNs operate independently during a single time step
  - Communication between NNs is sparse
  - Have each candidate weight set tested sequentially on the node holding a given NN
- With arbitrary nodes? Take advantage of both concurrencies!
  - Each node receives one cell
  - A CNN takes c nodes, where c is the number of cells
  - For n particles, the PSO then uses n x c nodes (see the sketch below)

(Diagrams: particles P1-P16 mapped one per node, and the CNN's cells NN1-NN16 mapped one per node.)
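With n x c ranks, the particle and cell served by a given rank follow from integer division, and the ranks belonging to one particle can be grouped into their own communicator for the CNN's cell-to-cell exchanges. A small sketch; the layout (cell index = rank modulo c) and the value of n_cells are assumptions for illustration.

/* Sketch: with n*c MPI ranks, rank r hosts cell (r % c) of particle (r / c). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, n_cells = 8;              /* c: cells per CNN (illustrative)      */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int particle = rank / n_cells;      /* which PSO particle this rank serves  */
    int cell     = rank % n_cells;      /* which NN of that particle's CNN      */

    /* Split the world so each particle's cells can exchange CNN outputs among
       themselves, separately from the PSO-level (gbest) communication. */
    MPI_Comm cnn_comm;
    MPI_Comm_split(MPI_COMM_WORLD, particle, cell, &cnn_comm);

    printf("rank %d -> particle P%d, cell NN%d\n", rank, particle + 1, cell + 1);

    MPI_Comm_free(&cnn_comm);
    MPI_Finalize();
    return 0;
}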


Architecture of the RTPIS Cluster

Hardware configuration:
- Nodes: 1 (master) + 16 (compute nodes) = 17
- CPU: 17 (nodes) x 2 (CPUs/node) x 4 (cores/CPU) x 2 (threads/core) = 272 hardware threads
- GPU: 16 (nodes) x 2 (NVIDIA Tesla C2050 GPUs/node) x 448 (CUDA cores/GPU) = 14,336 CUDA cores
- Memory: 17 (nodes) x 12 GB = 204 GB
- Storage: 2 x 500 GB (OS) + 10 x 2 TB (master) + 16 x 500 GB (nodes) = 29 TB

Software:
- Operating system: OpenSUSE 11 Linux
- Others: Torque, Maui scheduler, CUDA toolkit/libraries, GNU compilers, C/C++ with MPI libraries, MATLAB Distributed Computing Server

GPU Computing - MATLAB

- GPU computing is built into MATLAB R2010b
- Run part of the code on the GPU from MATLAB
  - Use the gpuArray or arrayfun commands in MATLAB
  - Use compiled CUDA code as a PTX file from MATLAB
- Only useful when the computing time on the CPU exceeds the communication time for transferring variables between the CPU and the GPU.



MATLAB Commands for Working on the GPU

- Create a variable on the CPU and move it to the GPU:
  gpuVar = gpuArray(cpuVar);
- Create a variable on the GPU directly:
  gpuVar = parallel.gpu.GPUArray.zeros(5,10);
- Any function that uses GPU variables is performed on the GPU:
  gpuVar = fft(gpuVar);
  gpuVar = abs(gpuVar1*gpuVar2);
- Use arrayfun to perform an operation on the GPU:
  gpuVar = arrayfun(@min, gpuVar1);
  gpuVar = arrayfun(@customFn, gpuVar1, cpuVar);
  cpuVar = arrayfun(@customFn, cpuVar1, cpuVar2, cpuVar3);
- Move a variable from the GPU to the CPU:
  cpuVar = gather(gpuVar);

Parallel Programming Introduction

- MPI in C vs. MATLAB parallel computing:

  C (MPI)                                                          MATLAB (PCT)
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);                            labindex
  MPI_Bcast(buffer, count, datatype, root, comm);                  labBroadcast(source, value), labBroadcast(source)
  MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);   gop(@function, value)
  MPI_Barrier(comm);                                               labBarrier;

(Diagram: the MATLAB Job Manager distributes work from MATLAB PCT clients to multiple labs.)

Examples - Implementation: Sequential Neural Network Training Using PSO

(Flowchart: initialization; then, for each iteration and for each particle P1-P16, fitness evaluation, position and velocity update, and gbest update.)

- Dimensions = number of weights
- Fitness = F(NN output) (see the sketch below)
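Here "Fitness = F(NN output)" typically means the network's error over the training data when the particle's coordinates are loaded as the weights. A minimal C sketch with a single hidden layer; the layer sizes and the choice of mean squared error are illustrative assumptions rather than the talk's exact setup.

/* Sketch: PSO fitness for NN training = mean squared error of the network
 * when the particle's position vector is interpreted as the weight vector. */
#include <math.h>

#define N_IN   2
#define N_HID  4
#define N_W    ((N_IN + 1) * N_HID + (N_HID + 1))   /* dimensions = #weights */

static double nn_output(const double w[N_W], const double in[N_IN])
{
    const double *w_ih = w;                          /* input->hidden (+bias)  */
    const double *w_ho = w + (N_IN + 1) * N_HID;     /* hidden->output (+bias) */
    double out = w_ho[N_HID];                        /* output bias            */
    for (int j = 0; j < N_HID; j++) {
        double s = w_ih[j * (N_IN + 1) + N_IN];      /* hidden bias            */
        for (int i = 0; i < N_IN; i++)
            s += w_ih[j * (N_IN + 1) + i] * in[i];
        out += w_ho[j] * tanh(s);
    }
    return out;
}

/* Fitness of one particle: MSE over the training set (lower is better). */
double fitness(const double particle[N_W],
               const double inputs[][N_IN], const double targets[], int n)
{
    double mse = 0.0;
    for (int s = 0; s < n; s++) {
        double e = nn_output(particle, inputs[s]) - targets[s];
        mse += e * e;
    }
    return mse / n;
}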

Examples - Implementation: Parallel Neural Network Training Using PSO

(Flowchart: initialization; then, for each iteration, particles P1-P16 evaluate their fitness and update position and velocity concurrently; the nodes synchronize and exchange x, v, pbest_pos and pbest_fit; the gbest update then produces gbest_pos and gbest_fit.)

Examples - Comparison of Platforms

Time vs. number of iterations:

  Iterations            1000     2000     4000     8000     10000
  Parallel MATLAB       18.48    36.83    73.9     143.5    181.5
  Sequential MATLAB     145.98   287.9    573.4    1163.2   1445.5
  Parallel MATLAB GPU   41.13    52.29    102.87   202.58   248.45
  MPI C                 0.028    0.055    0.112    0.209    0.26
  Sequential C          0.18     0.43     0.93     1.88     2.55

MSE vs. number of iterations (rows in the same platform order as the timing table):

  Iterations            1000      2000      4000      8000      10000
  Parallel MATLAB       4.69E-05  3.07E-06  8.85E-05  8.70E-06  1.24E-05
  Sequential MATLAB     7.30E-04  1.76E-04  1.02E-04  1.20E-04  1.40E-05
  Parallel MATLAB GPU   6.00E-05  3.07E-06  8.85E-05  8.74E-06  1.24E-05
  MPI C                 1.65E-04  1.20E-05  6.40E-05  5.00E-05  3.90E-05
  Sequential C          2.50E-04  2.30E-04  5.90E-05  2.30E-05  1.00E-05

(Charts: time vs. iterations for Sequential C vs. MPI C, and for Sequential MATLAB vs. Parallel MATLAB.)
Speedup:

  Iterations             1000       2000       4000       8000       10000
  MATLAB speedup         7.899351   7.816997   7.759134   8.105923   7.964187
  C speedup              6.428571   7.818182   8.303571   8.995215   9.807692
  MATLAB vs. C speedup   660        669.6364   659.8214   686.6029   698.0769

Parallel Programming

- Computation vs. communication
- Data parallelization vs. task parallelization

(Diagram: processor, local cache, local memory, shared memory, message passing.)

Discussion

Thank you!