
Automatic Parallelization of
Simulation Code from Equation
Based Simulation Languages

Peter Aronsson,

Industrial phd student, PELAB SaS IDA

Based on Licentiate presentation & CPC'03 presentation


Outline

Introduction

Related work on Scheduling & Clustering

Parallelization Tool

Contributions

Results

Conclusion & Future Work


Introduction

Modelica

Object Oriented, Equation Based, Modeling Language

Modelica enables modeling and simulation of large and complex multi-domain systems

Large need for parallel computation:

To decrease the time of executing simulations

To make large models possible to simulate at all

To meet hard real-time demands in hardware-in-the-loop simulations


Examples of large complex
systems in Modelica


Modelica Example - DCMotor


Modelica example

model DCMotor
  import Modelica.Electrical.Analog.Basic.*;
  import Modelica.Electrical.Sources.StepVoltage;
  import Modelica.Mechanics.Rotational.Inertia;
  Resistor R1(R=10);
  Inductor I1(L=0.1);
  EMF emf(k=5.4);
  Ground ground;
  StepVoltage step(V=10);
  Inertia load;
equation
  connect(R1.n, I1.p);
  connect(I1.n, emf.p);
  connect(emf.n, ground.p);
  connect(emf.flange_b, load.flange_a);
  connect(step.p, R1.p);
  connect(step.n, ground.p);
end DCMotor;


Example

Flat set of Equations

R1.v = -R1.n.v + R1.p.v
0 = R1.n.i + R1.p.i
R1.i = R1.p.i
R1.i*R1.R = R1.v
I1.v = -I1.n.v + I1.p.v
0 = I1.n.i + I1.p.i
I1.i = I1.p.i
I1.L*der(I1.i) = I1.v
emf.v = -emf.n.v + emf.p.v
0 = emf.n.i + emf.p.i
emf.i = emf.p.i
emf.w = der(emf.flange_b.phi)
emf.k*emf.w = emf.v
emf.flange_b.tau = -emf.i*emf.k
ground.p.v = 0
step.v = -step.n.v + step.p.v
0 = step.n.i + step.p.i
step.i = step.p.i
step.signalSource.outPort.signal[1] = (if time < step.signalSource.p_startTime[1] then 0 else step.signalSource.p_height[1]) + step.signalSource.p_offset[1]
R1.n.v = I1.p.v
I1.p.i + R1.n.i = 0
I1.n.v = emf.p.v
emf.p.i + I1.n.i = 0
emf.n.v = step.n.v
step.n.v = ground.p.v
emf.n.i + ground.p.i + step.n.i = 0
emf.flange_b.phi = load.flange_a.phi
emf.flange_b.tau + load.flange_a.tau = 0
step.p.v = R1.p.v
R1.p.i + step.p.i = 0
load.flange_b.tau = 0
step.signalSource.y = step.signalSource.outPort.signal


(Figure: plot of simulation result)


Directed Acyclic Graph (DAG)

G = (V, E, t, c)

V: set of nodes, representing computational tasks

E: set of edges, representing communication of data

t(v): execution cost for node v

c(i,j): communication cost for edge (i,j)

Referred to as the delay model (macro dataflow model)


(Figure: example task graph with eight nodes, annotated with execution costs per node and communication costs per edge)

Multiprocessor Scheduling Problem

Determine for each task:

Starting time

Processor assignment (P1, ..., PN)

Goal: minimize execution time, given

Precedence constraints

Execution cost

Communication cost

Algorithms in literature:

List Scheduling approaches (ERT, FLB)

Critical Path scheduling approaches (TDS, MCP)

Categories: fixed no. of processors, fixed c and/or t, ...


Granularity

Granularity g = min(t(v)) / max(c(i,j))

Affects scheduling result

E.g. TDS works best for high values of g, i.e. low communication cost

Solutions:

Clustering algorithms

IDEA: build clusters of nodes, where nodes in the same cluster are executed on the same processor

Merging algorithms

Merge tasks to increase computational cost.
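The granularity measure above can be computed directly from a task graph's cost annotations. A minimal sketch in Python; the dictionaries and node names are made-up illustrations, not data from the thesis:

```python
# Granularity g = min execution cost / max communication cost.
# A low g (communication expensive relative to computation) is what
# motivates the clustering and merging algorithms on this slide.

def granularity(exec_cost, comm_cost):
    """exec_cost: {node: t(v)}, comm_cost: {(i, j): c(i, j)}."""
    return min(exec_cost.values()) / max(comm_cost.values())

# Hypothetical task graph: four tasks, three edges.
t = {1: 2, 2: 2, 3: 1, 4: 1}
c = {(1, 2): 10, (1, 3): 5, (2, 4): 10}

g = granularity(t, c)  # min t = 1, max c = 10 -> g = 0.1
```

With g = 0.1 here, a TDS-style algorithm would perform poorly, which is exactly the regime the clustering/merging solutions target.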


Algorithms

Build clusters of nodes such that parallel time decreases

PT(n) = tlevel(n)+blevel(n)

By zeroing edges, i.e. putting several nodes into the same cluster => zero communication cost

Literature:

Sarkar's internalization alg., Yang's DSC alg.

Transform the Task Graph by merging nodes

Literature: E.g. Grain Packing alg.
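The PT(n) = tlevel(n) + blevel(n) metric and the effect of zeroing an edge can be made concrete with a small sketch. The graph, costs, and names below are invented for illustration (plain recursion, no memoization, since the example is tiny):

```python
# tlevel(n): longest-path cost from an entry node to n, excluding t(n).
# blevel(n): longest-path cost from n to an exit node, including t(n).
# Parallel Time = max over n of tlevel(n) + blevel(n).

t = {'a': 2, 'b': 3, 'c': 1}            # execution costs
edges = {('a', 'b'): 10, ('a', 'c'): 5}  # communication costs

def tlevel(n):
    preds = [(i, w) for (i, j), w in edges.items() if j == n]
    return max((tlevel(i) + t[i] + w for i, w in preds), default=0)

def blevel(n):
    succs = [(j, w) for (i, j), w in edges.items() if i == n]
    return t[n] + (max(w + blevel(j) for j, w in succs) if succs else 0)

def parallel_time():
    return max(tlevel(n) + blevel(n) for n in t)

pt_before = parallel_time()  # critical path a -> b: 2 + 10 + 3 = 15
edges[('a', 'b')] = 0        # "zero" the edge: a and b go in one cluster
pt_after = parallel_time()   # critical path now a -> c: 2 + 5 + 1 = 8
```

Zeroing the expensive edge drops the parallel time from 15 to 8, which is the basic move both Sarkar's and Yang's algorithms build on.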


Clustering vs. Merging

(Figure: the example task graph shown under clustering, where intra-cluster communication costs are zeroed, and under merging, where node groups such as {3,6} and {2,5,6} become single tasks)


DSC algorithm

1. Initially, put each node in a separate cluster.

2. Merge clusters as long as Parallel Time does not increase.

Low complexity: O((n+e) log n)

Used in ObjectMath (PELAB)
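The core DSC merge loop can be sketched as a toy greedy pass: tentatively zero each edge (most expensive first) and keep the change only if the parallel time does not increase. Real DSC scans nodes in priority order and is far more refined; the graph and names here are invented, and the sketch ignores that tasks in one cluster execute sequentially:

```python
def parallel_time(t, edges):
    def blevel(n):
        succs = [(j, w) for (i, j), w in edges.items() if i == n]
        return t[n] + (max(w + blevel(j) for j, w in succs) if succs else 0)
    def tlevel(n):
        preds = [(i, w) for (i, j), w in edges.items() if j == n]
        return max((tlevel(i) + t[i] + w for i, w in preds), default=0)
    return max(tlevel(n) + blevel(n) for n in t)

def cluster(t, edges):
    edges = dict(edges)
    for e in sorted(edges, key=edges.get, reverse=True):  # costly edges first
        pt = parallel_time(t, edges)
        saved = edges[e]
        edges[e] = 0                       # tentatively zero the edge
        if parallel_time(t, edges) > pt:   # undo if parallel time increased
            edges[e] = saved
    return edges

t = {'a': 2, 'b': 3, 'c': 1}
edges = {('a', 'b'): 10, ('a', 'c'): 5}
clustered = cluster(t, edges)  # both edges zeroed: PT drops from 15 to 5
```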


Modelica Compilation

(Figure: compilation pipeline: Modelica model (.mo) → Modelica semantics → Flat Modelica (.mof) → equation system (DAE) → optimized rhs calculations + numerical solver → C code)

Structure of simulation code:

for (t = 0; t < stopTime; t += stepSize) {
  x_dot[t+1] = f(x_dot[t], x[t], t);
  x[t+1] = ODESolver(x_dot[t+1]);
}
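The loop above is slide pseudocode. A minimal runnable analogue, using explicit Euler in place of the ODE solver; the scalar test system x' = -x, the function names, and the step sizes are all assumptions made for illustration:

```python
# The rhs calculation (f) is the part the thesis parallelizes; the solver
# step consumes its result each iteration.

def f(x, t):
    return -x          # right-hand side of the illustrative ODE x' = -x

def simulate(x0, stop_time, step_size):
    x, t = x0, 0.0
    while t < stop_time:
        x_dot = f(x, t)             # rhs calculation
        x = x + step_size * x_dot   # one explicit Euler step ("ODESolver")
        t += step_size
    return x

x_end = simulate(1.0, 1.0, 0.001)   # approximates exp(-1) ~ 0.368
```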


Optimizations on equations

Simplification of equations

E.g. a=b, b=c => eliminate b

BLT transformation, i.e. topological sorting into strongly
connected components

(BLT = Block Lower Triangular form)

Index reduction. The index is the number of times an equation needs to be differentiated in order to solve the equation system.

Mixed Mode /Inline Integration, methods of optimizing
equations by reducing size of equation systems

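The BLT transformation amounts to finding strongly connected components (algebraic loops) in the equation/variable dependency graph and emitting them dependencies-first. A compact sketch using Tarjan's algorithm, which conveniently emits SCCs in exactly that order; the dependency dict and variable names are made up for the example:

```python
def sccs(graph):
    """Tarjan's algorithm; returns SCCs with dependencies emitted first."""
    index, low, on_stack, stack, out, counter = {}, {}, set(), [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            out.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return out

# x depends on nothing; y and z depend on each other (an algebraic loop);
# w depends on y. Edges point from an equation's variable to what it needs.
deps = {'x': [], 'y': ['z', 'x'], 'z': ['y'], 'w': ['y']}
blocks = sccs(deps)  # [['x'], ['z', 'y'], ['w']] -- solvable in this order
```

Each returned block is one BLT block: singletons are solved directly, multi-node blocks like {y, z} become the linear or non-linear subsystems mentioned on the next slide.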


Generated C Code Content

Assignment statements

Arithmetic expressions (+, -, *, /), if-expressions

Function calls

Standard Math functions

Sin, Cos, Log

Modelica Functions

User defined, side effect free

External Modelica Functions

In External lib, written in Fortran or C

Call function for solving subsystems of equations

Linear or non-linear

Example Application

Robot simulation has 27 000 lines of generated C code


Parallelization Tool Overview

(Figure: the Modelica compiler translates the model (.mo) into C code; a C compiler links it with the solver lib into the sequential executable, while the Parallelizer produces parallel C code that a C compiler links with the solver lib and MPI lib into the parallel executable)


Parallelization Tool Internal
Structure

(Figure: sequential C code → Parser → Builder (with Symbol Table) → Scheduler → Code Generator → parallel C code, with Debug & Statistics output alongside)


First graph: corresponds to individual arithmetic operations, assignments, function calls and variable definitions in the C code

Second graph: a coarser task graph built by grouping nodes of the first graph

Example:

(Figure: expression-level graph over +, -, *, /, foo and variables a, b, c, d, and the corresponding coarser graph with merged tasks such as {+,-,*}, {+,*}, {foo}, {/,-})


Investigated
Scheduling
Algorithms

In the Parallelization Tool:

Pre-Clustering Method

ERT, DSC, TDS

In the experimental framework (Mathematica):

Graph Rewrite Systems


Method 1: Pre-Clustering algorithm

buildCluster(n: node, l: list of nodes, size: Integer)

Add n to a new cluster

Until size(cluster) = size, add:

Children of n

Children with in-degree one into the cluster

Siblings of n

Parents of n

Arbitrary nodes
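A hedged sketch of the cluster-growing step: grow a cluster around a seed node, preferring children, then siblings, then parents, until the requested size is reached. The signature is simplified (the list-of-nodes argument and the in-degree-one special case are omitted), and the graph is invented:

```python
def build_cluster(n, preds, succs, size):
    """Grow a cluster of at most `size` nodes around seed node n."""
    cluster = {n}
    def children(v): return succs.get(v, [])
    def parents(v):  return preds.get(v, [])
    def siblings(v): return [c for p in parents(v) for c in children(p)]
    # Candidate categories in decreasing priority, as on the slide.
    for candidates in (children(n), siblings(n), parents(n)):
        for v in candidates:
            if len(cluster) >= size:
                return cluster
            cluster.add(v)
    return cluster

# Tiny hypothetical DAG: a -> {b, c}, d -> {c, e}.
succs = {'a': ['b', 'c'], 'd': ['c', 'e']}
preds = {'b': ['a'], 'c': ['a', 'd'], 'e': ['d']}
c1 = build_cluster('a', preds, succs, 3)  # {'a', 'b', 'c'}
```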


Managing cycles

When adding a node to a cluster, the resulting graph might have cycles

The resulting graph when clustering a and b is cyclic, since {a,b} can be reached from c

Resulting graph is not a DAG

Cannot use standard scheduling algorithms

(Figure: example graph over nodes a-e where clustering a and b creates a cycle through c)


Pre Clustering Results

Did not produce Speedup

Introduced far too many dependencies

Sequentialized the schedule

Conclusion:

Need task duplication in such algorithm to succeed


Method 2: Full Task Duplication (FTD)

For each node n with successor(n) = {}:

Put all pred(n) in one cluster

Repeat for all nodes in the cluster

Rationale: if the depth of the graph is limited, task duplication will be kept at a reasonable level and cluster sizes reasonably small.

Works well when communication cost >> execution
cost


Merging clusters

1.
Merge clusters with load balancing strategy,
without increasing maximum cluster size

2.
Merge clusters with greatest number of
common nodes

Repeat (2) until number of processors
requirement is met
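Step (2) can be sketched as follows: task duplication makes clusters overlap, so repeatedly merge the pair of clusters sharing the most nodes until the processor-count requirement is met. The data is illustrative, not the tool's actual implementation:

```python
from itertools import combinations

def merge_by_overlap(clusters, n_proc):
    """Merge the two most-overlapping clusters until n_proc remain."""
    clusters = [set(c) for c in clusters]
    while len(clusters) > n_proc:
        # Pick the pair with the largest number of common (duplicated) nodes.
        a, b = max(combinations(clusters, 2), key=lambda p: len(p[0] & p[1]))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)   # merged cluster; duplicates collapse
    return clusters

cs = [{'1', '2'}, {'2', '3'}, {'4'}]
merged = merge_by_overlap(cs, 2)  # merges {'1','2'} and {'2','3'}
```

Merging the overlapping pair first is attractive because shared (duplicated) tasks need to be computed only once in the merged cluster.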


Computed measurements

Execution cost of largest cluster +
communication cost

Measured speedup

Executed on a PC Linux cluster with SCI network interface, using SCAMPI


Robot Example Computed
Speedup

Mixed Mode / Inline Integration

(Figure: computed speedup vs. number of processors for communication cost c = 10, 100, 1000, shown with MM/II and without MM/II)


Thermofluid pipe executed on PC
Cluster

Pressurewavedemo in Thermofluid package
50 discretization points

(Figure: measured speedup vs. number of processors, 1-16)

Thermofluid pipe executed on PC
Cluster

Pressurewavedemo in Thermofluid package
100 discretization points

(Figure: measured speedup vs. number of processors, 1-16)

Task Merging using Graph Rewrite Systems (GRS)

Idea:

A set of simple rules to transform a
task graph to increase its granularity (and
decrease Parallel Time)

Use top level (and bottom level) as metric:

Parallel Time = max tlevel + max blevel


Rule 1

Merging a single child with only one parent.

Motivation: the merge does not decrease the amount of parallelism in the task graph, and granularity can possibly increase.

(Figure: parent p and its single child c merged into p')
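Rule 1 is simple enough to sketch concretely: if c is p's only child and p is c's only parent, fuse them into one task whose cost is t(p) + t(c) and drop the connecting edge. The graph representation and names are assumptions for the example:

```python
def merge_single_child(t, edges, p, c):
    """Apply Rule 1: merge c into p when each is the other's only neighbor."""
    only_child  = [j for (i, j) in edges if i == p] == [c]
    only_parent = [i for (i, j) in edges if j == c] == [p]
    if not (only_child and only_parent):
        return t, edges                    # rule does not apply
    t2 = {v: w for v, w in t.items() if v != c}
    t2[p] = t[p] + t[c]                    # merged task p'
    e2 = {}
    for (i, j), w in edges.items():
        if (i, j) == (p, c):
            continue                       # internal edge disappears
        e2[(p if i == c else i, p if j == c else j)] = w  # redirect c's edges
    return t2, e2

t = {'p': 2, 'c': 3, 'x': 1}
edges = {('p', 'c'): 10, ('c', 'x'): 5}
t2, e2 = merge_single_child(t, edges, 'p', 'c')
# t2 == {'p': 5, 'x': 1}, e2 == {('p', 'x'): 5}
```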


Rule 2

Merge all parents of a node together with the node itself.

Motivation: if the top level does not increase by the merge, the resulting task will increase in size, potentially increasing granularity.

(Figure: parents p1 ... pn and node c merged into c')


Rule 3

Duplicate the parent and merge it into each child node.

Motivation: as long as each child's tlevel does not increase, duplicating p into the child will reduce the number of nodes and increase granularity.

(Figure: parent p duplicated and merged into each of its children c1 ... cn)


Rule 4

Merge siblings into a single node, as long as a parameterized maximum execution cost is not exceeded.

Motivation: this rule can be useful if several small predecessor nodes exist alongside a larger predecessor node which prevents a complete merge. Does not guarantee a decrease of PT.

(Figure: small sibling parents p1 ... pk of c merged into p'; the remaining parents pk+1 ... pn are left unchanged)


Results

Example: Modelica simulation code

Small example from the mechanical domain

At the expression level, originating from 84 equations & variables


(Figure: results for B=1, L=1)


(Figure: results for B=1, L=10)

(Figure: results for B=1, L=100)


Conclusions

Pre Clustering approach did not work well
for the fine grained task graphs produced by
our parallelization tool

FTD Method:

Works reasonably well for some examples

However, in general:

Need for better scheduling/clustering
algorithms for fine grained task graphs


Conclusions (2)

Simple delay model may not be enough

More advanced models require more complex scheduling and clustering algorithms

Simulation code from equation based models:

Hard to extract parallelism from

Need new optimization methods on DAEs or ODEs to increase parallelism


Conclusions (3): GRS

A task merging algorithm using GRS has been proposed

Four rules with simple patterns => fast pattern matching

Can easily be integrated into existing scheduling tools

Delay model with bandwidth & latency

Merging criterion: decrease Parallel Time (PT), by decreasing tlevel

Tested on examples from simulation code


Future Work

Designing and Implementing Better
Scheduling and Clustering Algorithms

That work well for low granularity values (fine-grained task graphs)

Try larger examples

Test on different architectures

Shared Memory machines

Dual processor machines


Future Work (2)

Heterogeneous multiprocessor systems

Mixed DSP, RISC, CISC processors, etc.

Enhancing Modelica language with data
parallelism

e.g. parallel loops, vector operations

Parallelize e.g. combined PDE and ODE
problems in Modelica.

Using e.g. ScaLAPACK for solving subsystems of linear equations. How to integrate it into scheduling algorithms?