
Automatic Parallelization of Simulation Code from Equation Based Simulation Languages

Peter Aronsson
Industrial PhD student, PELAB, SaS, IDA
Linköping University, Sweden

Based on the Licentiate presentation & the CPC'03 presentation

Outline

- Introduction
- Task Graphs
- Related Work on Scheduling & Clustering
- Parallelization Tool
- Contributions
- Results
- Conclusion & Future Work

Introduction

- Modelica: an object oriented, equation based modeling language
- Modelica enables modeling and simulation of large and complex multi-domain systems
- There is a large need for parallel computation:
  - to decrease the execution time of simulations
  - to make large models possible to simulate at all
  - to meet hard real-time demands in hardware-in-the-loop simulations

Examples of Large Complex Systems in Modelica

[Figure: example system models]

Modelica Example: DCMotor

[Figure: DC motor component diagram]

Modelica example

model DCMotor
  import Modelica.Electrical.Analog.Basic.*;
  import Modelica.Electrical.Sources.StepVoltage;
  Resistor R1(R=10);
  Inductor I1(L=0.1);
  EMF emf(k=5.4);
  Ground ground;
  StepVoltage step(V=10);
  Modelica.Mechanics.Rotational.Inertia load(J=2.25);
equation
  connect(R1.n, I1.p);
  connect(I1.n, emf.p);
  connect(emf.n, ground.p);
  connect(emf.flange_b, load.flange_a);
  connect(step.p, R1.p);
  connect(step.n, ground.p);
end DCMotor;

Example: Flat Set of Equations

R1.v = -R1.n.v + R1.p.v;  0 = R1.n.i + R1.p.i;  R1.i = R1.p.i;  R1.i*R1.R = R1.v
I1.v = -I1.n.v + I1.p.v;  0 = I1.n.i + I1.p.i;  I1.i = I1.p.i;  I1.L*I1.der(i) = I1.v
emf.v = -emf.n.v + emf.p.v;  0 = emf.n.i + emf.p.i;  emf.i = emf.p.i
emf.w = emf.flange_b.der(phi);  emf.k*emf.w = emf.v
emf.flange_b.tau = -emf.i*emf.k;  ground.p.v = 0;  step.v = -step.n.v + step.p.v
0 = step.n.i + step.p.i;  step.i = step.p.i
step.signalSource.outPort.signal[1] = (if time < step.signalSource.p_startTime[1] then 0 else step.signalSource.p_height[1]) + step.signalSource.p_offset[1]
step.v = step.signalSource.outPort.signal[1];  load.flange_a.phi = load.phi
load.flange_b.phi = load.phi;  load.w = load.der(phi)
load.a = load.der(w);  load.a*load.J = load.flange_a.tau + load.flange_b.tau
R1.n.v = I1.p.v;  I1.p.i + R1.n.i = 0
I1.n.v = emf.p.v;  emf.p.i + I1.n.i = 0;  emf.n.v = step.n.v;  step.n.v = ground.p.v
emf.n.i + ground.p.i + step.n.i = 0;  emf.flange_b.phi = load.flange_a.phi
emf.flange_b.tau + load.flange_a.tau = 0;  step.p.v = R1.p.v
R1.p.i + step.p.i = 0;  load.flange_b.tau = 0
step.signalSource.y = step.signalSource.outPort.signal

Plot of Simulation Result

[Plot: load.flange_a.tau and load.w versus simulation time]

Task Graphs

- Directed Acyclic Graph (DAG): G = (V, E, t, c)
  - V: set of nodes, representing computational tasks
  - E: set of edges, representing communication of data between tasks
  - t(v): execution cost for node v
  - c(i,j): communication cost for edge (i,j)
- Referred to as the delay model (macro dataflow model)
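To make the delay model concrete, here is a minimal Python sketch (not from the original slides; the graph, costs and helper names are invented, and the same toy graph is reused in the sketches that follow):

# Delay model as plain Python data: t(v) per node, c(i,j) per edge.
t = {1: 2, 2: 1, 3: 2, 4: 1}                        # t(v): execution cost of task v
c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}  # c(i,j): communication cost of edge (i,j)

def preds(v):
    return [i for (i, j) in c if j == v]            # tasks sending data to v

def succs(v):
    return [j for (i, j) in c if i == v]            # tasks receiving data from v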

Small Task Graph Example

[Figure: an eight-node task graph with execution costs between 1 and 3 and communication costs of 5 or 10]

Task Scheduling Algorithms

- Multiprocessor Scheduling Problem: for each task, assign
  - a starting time
  - a processor assignment (P1, ..., PN)
- Goal: minimize execution time, given
  - precedence constraints
  - execution costs
  - communication costs
- Algorithms in the literature:
  - List scheduling approaches (ERT, FLB)
  - Critical path scheduling approaches (TDS, MCP)
  - Categories: fixed number of processors, fixed c and/or t, ...
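As a concrete illustration of list scheduling, here is a small sketch in the spirit of ERT; it is not the tool's implementation, and the processor count and toy graph are invented:

# Greedy list scheduling: repeatedly pick a ready task and place it where it
# can start earliest; c(i,j) is paid only across processors. Illustrative only.
t = {1: 2, 2: 1, 3: 2, 4: 1}
c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}
NPROC = 2

def preds(v):
    return [i for (i, j) in c if j == v]

finish, proc = {}, {}               # per task: finish time, processor
free = [0.0] * NPROC                # per processor: next free time
unscheduled = set(t)
while unscheduled:
    ready = [v for v in unscheduled if all(p in finish for p in preds(v))]
    best = None
    for v in ready:
        for p in range(NPROC):
            # data from a parent on another processor arrives c(i,v) later
            arrival = max([finish[i] + (0 if proc[i] == p else c[(i, v)])
                           for i in preds(v)] or [0.0])
            start = max(arrival, free[p])
            if best is None or start < best[0]:
                best = (start, v, p)
    start, v, p = best
    finish[v], proc[v] = start + t[v], p
    free[p] = finish[v]
    unscheduled.remove(v)

print(max(finish.values()))         # makespan: 6 here, all on one processor

On this toy graph communication dominates, so the scheduler keeps everything on one processor; that fine grained behaviour is exactly what the following slides address.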

Granularity

- Granularity g = min(t(v)) / max(c(i,j))
- Affects the scheduling result
  - E.g. TDS works best for high values of g, i.e. low communication cost
- Solutions:
  - Clustering algorithms
    - Idea: build clusters of nodes, where nodes in the same cluster are executed on the same processor
  - Merging algorithms
    - Merge tasks to increase computational cost
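In code the measure is a one-liner; here it is applied to the toy graph from the earlier sketch:

# Granularity g = min(t(v)) / max(c(i,j)) of the toy delay-model graph.
t = {1: 2, 2: 1, 3: 2, 4: 1}
c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}

g = min(t.values()) / max(c.values())
print(g)    # 0.1: communication dominates computation, a fine grained graph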

Task Clustering/Merging Algorithms

- Task Clustering Problem:
  - Build clusters of nodes such that the parallel time decreases, where PT(n) = tlevel(n) + blevel(n)
  - By zeroing edges, i.e. putting several nodes into the same cluster => zero communication cost
  - Literature: Sarkar's internalization algorithm, Yang's DSC algorithm
- Task Merging Problem:
  - Transform the task graph by merging nodes
  - Literature: e.g. the Grain Packing algorithm
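A sketch of tlevel, blevel and PT on the toy graph from above, including the effect of zeroing an edge (illustrative only):

# tlevel(v): longest path ending at v, excluding t(v).
# blevel(v): longest path starting at v, including t(v).
# PT = max over v of tlevel(v) + blevel(v).
t = {1: 2, 2: 1, 3: 2, 4: 1}
c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}

def preds(v):
    return [i for (i, j) in c if j == v]

def succs(v):
    return [j for (i, j) in c if i == v]

def tlevel(v):
    return max([tlevel(i) + t[i] + c[(i, v)] for i in preds(v)] or [0])

def blevel(v):
    return t[v] + max([c[(v, j)] + blevel(j) for j in succs(v)] or [0])

print(max(tlevel(v) + blevel(v) for v in t))   # PT = 25 before clustering

c[(1, 3)] = 0   # zero the heaviest edge: put tasks 1 and 3 in one cluster
print(max(tlevel(v) + blevel(v) for v in t))   # PT = 15 after zeroing

Note that zeroing alone ignores that tasks in one cluster execute sequentially; the DSC sketch further below models that as well.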

Clustering vs. Merging

[Figure: the example task graph shown twice. In the clustered graph the nodes remain but edges inside a cluster get zero communication cost. In the merged graph nodes are combined into larger tasks, with some tasks duplicated (e.g. merged nodes {3,6} and {2,5,6}), leaving fewer nodes and edges.]

DSC Algorithm

1. Initially, put each node in a separate cluster.
2. Traverse the task graph, merging clusters as long as the parallel time does not increase.

- Low complexity: O((n+e) log n)
- Previously used by Andersson in ObjectMath (PELAB)
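A heavily simplified sketch of the idea (not Yang's actual algorithm): try to zero edges in order of decreasing cost and keep a merge only if the parallel time does not increase, where tasks in one cluster run sequentially on one processor. Graph and numbers are the invented toy ones:

t = {1: 2, 2: 1, 3: 2, 4: 1}
c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}

def parallel_time(cluster):
    # Tasks sharing a cluster run sequentially; intra-cluster edges cost 0.
    finish, free = {}, {}
    for v in sorted(t):      # node ids happen to be a topological order here
        ready = max([finish[i] + (0 if cluster[i] == cluster[v] else c[(i, v)])
                     for (i, j) in c if j == v] or [0])
        start = max(ready, free.get(cluster[v], 0))
        finish[v] = start + t[v]
        free[cluster[v]] = finish[v]
    return max(finish.values())

cluster = {v: v for v in t}                    # initially one cluster per node
for (i, j) in sorted(c, key=c.get, reverse=True):
    merged = {v: cluster[i] if cluster[v] == cluster[j] else cluster[v]
              for v in cluster}                # tentative merge of two clusters
    if parallel_time(merged) <= parallel_time(cluster):
        cluster = merged

print(cluster, parallel_time(cluster))         # all in one cluster: PT = 6

On this fine grained toy graph every merge is accepted and the result is fully sequential, mirroring the sequentialization problem discussed later in the presentation.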

Modelica Compilation

[Figure: example model diagram and the compilation pipeline]
Modelica model (.mo) → [Modelica semantics] → Flat Modelica (.mof) → Equation system (DAE) → [Optimization] → RHS calculations → C code → Numerical solver

Structure of simulation code:

for (t = 0; t < stopTime; t += stepSize) {
  x_dot[t+1] = f(x_dot[t], x[t], t);
  x[t+1] = ODESolver(x_dot[t+1]);
}
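The driver structure can be sketched as follows; this is an illustrative Python rendition (the real tool emits C), assuming a forward Euler step as the "ODESolver" and a toy right-hand side:

# Illustrative sketch of the simulation driver structure. Forward Euler
# stands in for ODESolver; f is a toy right-hand side, x' = -x.

def f(x, time):
    return [-xi for xi in x]

def simulate(x0, stop_time, step_size):
    x, time = list(x0), 0.0
    while time < stop_time:
        x_dot = f(x, time)                       # RHS calculations
        x = [xi + step_size * di                 # ODESolver: one Euler step
             for xi, di in zip(x, x_dot)]
        time += step_size
    return x

print(simulate([1.0], 1.0, 0.01))                # ~0.366, close to exp(-1)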

Optimizations on Equations

- Simplification of equations, e.g. a = b, b = c: eliminate b
- BLT transformation, i.e. topological sorting into strongly connected components (BLT = Block Lower Triangular form)
- Index reduction: the index is how many times an equation needs to be differentiated in order to solve the equation system
- Mixed Mode / Inline Integration: methods of optimizing equations by reducing the size of equation systems

[Figure: incidence matrix over variables a-e in block lower triangular form]
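The BLT step amounts to finding strongly connected components of the equation dependency graph and emitting them in dependency order; below is a small illustrative sketch using Tarjan's algorithm on an invented four-equation system:

from itertools import count

# eq -> equations it depends on (whose solved variables it uses); invented.
deps = {'e1': [], 'e2': ['e1'], 'e3': ['e4'], 'e4': ['e3', 'e2']}

def blt(deps):
    # Tarjan's SCC algorithm; emits blocks in the order they must be solved.
    index, low, onstack, stack, blocks = {}, {}, set(), [], []
    counter = count()

    def visit(v):
        index[v] = low[v] = next(counter)
        stack.append(v)
        onstack.add(v)
        for w in deps[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in onstack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:             # v is the root of an SCC
            block = []
            while True:
                w = stack.pop()
                onstack.discard(w)
                block.append(w)
                if w == v:
                    break
            blocks.append(block)

    for v in deps:
        if v not in index:
            visit(v)
    return blocks

print(blt(deps))   # [['e1'], ['e2'], ['e4', 'e3']]: e3 and e4 form one block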

Generated C Code Content

- Assignment statements
- Arithmetic expressions (+, -, *, /), if-expressions
- Function calls:
  - Standard math functions (sin, cos, log)
  - Modelica functions: user defined, side effect free
  - External Modelica functions: in external libraries, written in Fortran or C
  - Calls to functions for solving subsystems of equations (linear or non-linear)
- Example application: a robot simulation has 27,000 lines of generated C code

Parallelization Tool Overview

[Figure: tool pipeline]
- Sequential path: Model (.mo) → Modelica Compiler → C code → C compiler → sequential executable
- Parallel path: C code → Parallelizer → parallel C code → C compiler (with solver lib and MPI lib) → parallel executable

Parallelization Tool Internal Structure

[Figure: internal pipeline]
Sequential C code → Parser → Task Graph Builder → Scheduler → Code Generator → parallel C code, with a shared Symbol Table and a Debug & Statistics module alongside.

Task Graph Building

- First graph: corresponds to individual arithmetic operations, assignments, function calls and variable definitions in the C code
- Second graph: clusters of tasks from the first task graph

[Figure: example expression DAG over variables a-d with +, -, *, / and a call to foo, clustered into merged tasks such as {+,-,*}, {+,*}, {foo} and {/,-}]
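How the fine grained first graph could be built can be sketched as follows; Python's ast module and the toy statements stand in for the tool's actual C-code front end, and all names are invented:

import ast

# One task node per operation or call; an edge from each value definition
# to each of its uses. Handles names, binary operations and calls only.
code = "tmp1 = a + b\ntmp2 = tmp1 * c\nx = foo(tmp2, a - d)"

nodes, edges, def_node = [], [], {}

def walk(e):
    # Returns the id of the task (or the input variable) producing e's value.
    if isinstance(e, ast.Name):
        return def_node.get(e.id, e.id)
    n = len(nodes)
    if isinstance(e, ast.BinOp):
        nodes.append(type(e.op).__name__)      # Add, Sub, Mult, ...
        for child in (e.left, e.right):
            edges.append((walk(child), n))
    elif isinstance(e, ast.Call):
        nodes.append(e.func.id)                # function call task, e.g. foo
        for arg in e.args:
            edges.append((walk(arg), n))
    return n

for stmt in ast.parse(code).body:
    def_node[stmt.targets[0].id] = walk(stmt.value)

print(nodes)   # ['Add', 'Mult', 'foo', 'Sub']
print(edges)   # sources mix input names ('a', 'b', ...) and task ids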

Investigated Scheduling Algorithms

- In the parallelization tool:
  - TDS (Task Duplication Scheduling algorithm)
  - Pre-clustering method
  - Full task duplication method
- In the experimental framework (Mathematica):
  - ERT
  - DSC
  - TDS
  - Full task duplication method
  - Task merging approaches (graph rewrite systems)

Method 1: Pre-Clustering Algorithm

- buildCluster(n: node, l: list of nodes, size: Integer), sketched below:
  - Adds n to a new cluster
  - Repeatedly adds nodes until size(cluster) = size, preferring in order:
    - children of n
    - children with in-degree one into the cluster
    - siblings of n
    - parents of n
    - arbitrary nodes
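A sketch of buildCluster under those priorities, on the toy graph from earlier; the in-degree refinement is omitted and the tie breaking is invented:

c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 4): 10}

def preds(v):
    return [i for (i, j) in c if j == v]

def succs(v):
    return [j for (i, j) in c if i == v]

def build_cluster(n, unclustered, size):
    cluster = [n]
    siblings = [s for p in preds(n) for s in succs(p) if s != n]
    # candidate order: children, then siblings, then parents, then the rest
    for cand in succs(n) + siblings + preds(n) + sorted(unclustered):
        if len(cluster) >= size:
            break
        if cand in unclustered and cand not in cluster:
            cluster.append(cand)
    return cluster

print(build_cluster(2, {1, 3, 4}, 3))   # [2, 4, 3]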

Managing Cycles

- When adding a node to a cluster, the resulting graph might have cycles
- Example: the graph that results from clustering a and b is cyclic, since {a,b} can be reached from c
- The resulting graph is not a DAG, so standard scheduling algorithms cannot be used

[Figure: five-node example (a-e) where clustering a and b creates a cycle through c]

Pre-Clustering Results

- Did not produce speedup
- Introduced far too many dependencies in the resulting task graph
- Sequentialized the schedule
- Conclusion: for fine grained task graphs, such an algorithm needs task duplication to succeed

Method 2: Full Task Duplication

- For each node n with successor(n) = {}:
  - Put all pred(n) in one cluster
  - Repeat for all nodes in the cluster
- Rationale: if the depth of the graph is limited, task duplication is kept at a reasonable level and cluster sizes stay reasonably small
- Works well when communication cost >> execution cost
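A sketch of the method on an invented five-node graph: each sink node becomes a cluster containing the sink and duplicated copies of all its transitive predecessors, so no cluster waits on another:

c = {(1, 2): 5, (1, 3): 10, (2, 4): 5, (3, 5): 10}
all_nodes = {1, 2, 3, 4, 5}

def preds(v):
    return [i for (i, j) in c if j == v]

def succs(v):
    return [j for (i, j) in c if i == v]

def ancestors(v):
    found = set()
    for p in preds(v):
        found |= {p} | ancestors(p)
    return found

# one cluster per sink node (no successors), with all predecessors duplicated
clusters = {v: {v} | ancestors(v) for v in all_nodes if not succs(v)}
print(clusters)   # {4: {1, 2, 4}, 5: {1, 3, 5}} -- task 1 is duplicated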

Full Task Duplication (2)

- Merging clusters:
  1. Merge clusters with a load balancing strategy, without increasing the maximum cluster size
  2. Merge the clusters with the greatest number of common nodes
- Repeat step 2 until the required number of processors is met

Full Task Duplication Results

- Computed measurements: execution cost of the largest cluster + communication cost
- Measured speedup: executed on a Linux PC cluster with an SCI network interface, using SCAMPI

Robot Example: Computed Speedup

- Mixed Mode / Inline Integration

[Charts: computed speedup versus number of processors for communication costs c = 10, 100 and 1000, with and without MM/II; in both cases the speedup stays below 2]

Thermofluid Pipe Executed on PC Cluster

- Pressurewavedemo in the Thermofluid package, 50 discretization points

[Chart: measured speedup versus number of processors (1-16)]

Thermofluid Pipe Executed on PC Cluster

- Pressurewavedemo in the Thermofluid package, 100 discretization points

[Chart: measured speedup versus number of processors (1-16), reaching roughly 2.5-3]

Task Merging Using GRS

- Idea: a set of simple rules that transform a task graph to increase its granularity (and decrease parallel time)
- Use top level (and bottom level) as the metric:
  Parallel Time = max tlevel + max blevel

Rule 1

- Merge a single child with its only parent
- Motivation: the merge does not decrease the amount of parallelism in the task graph, and granularity can possibly increase

[Figure: parent p and child c merged into p']

Rule 2

- Merge all parents of a node together with the node itself
- Motivation: if the top level does not increase by the merge, the resulting task increases in size, potentially increasing granularity

[Figure: parents p1...pn merged with child c into c']

Rule 3

- Duplicate a parent and merge it into each child node
- Motivation: as long as each child's tlevel does not increase, duplicating p into the children reduces the number of nodes and increases granularity

[Figure: parent p duplicated and merged into children c1...cn, giving c1'...cn']

Rule 4

- Merge siblings into a single node, as long as a parameterized maximum execution cost is not exceeded
- Motivation: this rule can be useful if several small predecessor nodes exist alongside a larger predecessor node that prevents a complete merge. It does not guarantee a decrease of PT.

[Figure: small parents p1...pk merged into p', leaving parents pk+1...pn and the child c]
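To illustrate the rewrite-rule style, here is a sketch of rule 1 applied exhaustively (merge a node with its only child when the child has no other parent); the chain and costs are invented:

t = {1: 2, 2: 1, 3: 2, 4: 1}
c = {(1, 2): 5, (2, 3): 5, (3, 4): 10}     # a simple chain 1 -> 2 -> 3 -> 4

def preds(v):
    return [i for (i, j) in c if j == v]

def succs(v):
    return [j for (i, j) in c if i == v]

changed = True
while changed:
    changed = False
    for (p, ch) in list(c):
        if succs(p) == [ch] and preds(ch) == [p]:    # rule 1 pattern
            t[p] += t[ch]                            # merged execution cost
            for j in succs(ch):                      # child's out-edges move up
                c[(p, j)] = c.pop((ch, j))
            del c[(p, ch)], t[ch]
            changed = True
            break                                    # rescan after each rewrite

print(t, c)   # {1: 6} {}: the chain collapses into a single task

Because each pattern only inspects a node's immediate neighbourhood, matching stays cheap, which is what makes the GRS formulation fast in practice.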

Results

- Example: a task graph from Modelica simulation code
- A small example from the mechanical domain
- About 100 nodes built at expression level, originating from 84 equations and variables

Result: Task Merging Example

- B = 1, L = 1

[Figure: merged task graph for bandwidth B = 1, latency L = 1]

Result: Task Merging Example

- B = 1, L = 10
- B = 1, L = 100

[Figures: merged task graphs for latencies L = 10 and L = 100]

Conclusions

- The pre-clustering approach did not work well for the fine grained task graphs produced by our parallelization tool
- The FTD method works reasonably well for some examples
- However, in general: better scheduling/clustering algorithms are needed for fine grained task graphs

Conclusions (2)

- The simple delay model may not be enough
  - More advanced models require more complex scheduling and clustering algorithms
- Simulation code from equation based models is hard to extract parallelism from
  - New optimization methods on DAEs or ODEs are needed to increase parallelism

Conclusions: Task Merging Using GRS

- A task merging algorithm using GRS has been proposed
- Four rules with simple patterns => fast pattern matching
- Can easily be integrated into existing scheduling tools
- Successfully merges tasks, considering
  - bandwidth & latency
  - task duplication
- Merging criterion: decrease parallel time, by decreasing tlevel (PT)
- Tested on examples from simulation code

Future Work

- Designing and implementing better scheduling and clustering algorithms
  - Support for more advanced task graph models
  - Work better for high granularity values
- Try larger examples
- Test on different architectures
  - Shared memory machines
  - Dual processor machines

Future Work (2)

- Heterogeneous multiprocessor systems
  - Mixed DSP processors, RISC, CISC, etc.
- Enhancing the Modelica language with data parallelism
  - E.g. parallel loops, vector operations
- Parallelizing e.g. combined PDE and ODE problems in Modelica
- Using e.g. ScaLAPACK for solving subsystems of linear equations: how should it be integrated into scheduling algorithms?