


MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION
USING SIMICS



A Project

Presented to the faculty of the Department of Computer Science

California State University, Sacramento

Submitted in partial satisfaction of
the requirements for the degree of

MASTER OF SCIENCE

in

Computer Science

by

Sandra Guija

FALL 2012























© 2012

Sandra Guija

ALL RIGHTS RESERVED




MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION
USING SIMICS

A Project

by

Sandra Guija

Approved by:

__________________________________, Committee Chair
Nikrouz Faroughi, Ph.D.

__________________________________, Second Reader
William Mitchell, Ph.D.

____________________________
Date














Student: Sandra Guija

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

__________________________, Graduate Coordinator          ___________________
Nikrouz Faroughi                                            Date

Department of Computer Science



Abstract

of

MESSAGE PASSING MULTIPROCESSING SYSTEM SIMULATION
USING SIMICS

by

Sandra Guija

Parallel processing uses multiple processors to solve a large computational problem. The two main multiprocessing programming models are shared memory and message passing. In the latter model, processes communicate by exchanging messages over the network.

The project consisted of two parts:

1) to investigate the performance of a multithreaded matrix multiplication program, and

2) to create a user guide for how to set up a message passing multiprocessor simulation environment using Simics, including MPI (Message Passing Interface) installation, OS craff file creation, the addition of memory caches, and the use of Python scripts.





The simulation results and performance analysis indicate that, as the matrix size and the number of processing nodes increase, the number of bytes communicated and the number of packets grow at a faster rate than the rate at which the processing time per node decreases.








_______________________, Committee Chair
Nikrouz Faroughi, Ph.D.

_______________________
Date






ACKNOWLEDGEMENTS

"Success is the ability to go from one failure to another with no loss of enthusiasm." [Sir Winston Churchill]

To God, who gave me strength, enthusiasm, and health to complete my project.

To my husband Allan, who said, "you can do this": I would like to thank him for being there for me. I would like to thank my parents Lucho and Normita and my sister Giovana for their love and support despite the distance.

I would like to thank Dr. Nikrouz Faroughi for his guidance during this project, his knowledge, time, and constant feedback. I would also like to thank Dr. William Mitchell, who was kind enough to be my second reader.

I would like to thank these special people: Cara for her always-sincere advice, Tom Pratt for his kindness, dedication, patience, and time, Sandhya for being helpful, and my manager Jay Rossi and my co-workers for their support. I truly believe their help has had a significant and positive impact on my project.






TABLE OF CONTENTS

Acknowledgments
List of Tables
List of Figures

Chapter

1. INTRODUCTION
   1.1 Shared Memory Multiprocessor Architecture
   1.2 Message Passing Multiprocessor Architecture
   1.3 Project Overview

2. MESSAGE PASSING SYSTEM MODELING
   2.1 Simics Overview
   2.2 Message Passing Interface
      2.2.1 MPICH2
      2.2.2 Open MPI
   2.3 MPI Overview
      2.3.1 Beowulf Cluster and MPI Cluster
      2.3.2 MPI Network Simulation
   2.4 Simulation of Matrix Multiplication

3. SIMULATION RESULTS AND PERFORMANCE ANALYSIS
   3.1 Simulation Parameters
   3.2 Data Analysis

4. CONCLUSION

Appendix A. Simics Script to Run a 16-node MPI Network Simulation
Appendix B. MPI Program for Matrix Multiplication
Appendix C. Simics Script to Add L1 and L2 Cache Memories
Appendix D. Python Script to Collect Simulation Data
Appendix E. SSH-Agent Script
Appendix F. User Guide
Appendix G. Simulation Data

Bibliography





LIST OF TABLES

Table 1. MPI_Init and MPI_Finalize Functions
Table 2. MPI_Comm Functions
Table 3. MPI Send and Receive Functions
Table 4. MPI Broadcast Function
Table 5. Configuration Information
Table 6. Processing Time and Network Traffic Data Collected
Table 7. Processing Time, Total Bytes and Number of Packets Ratios
Table 8. Time before the start of 1st slave






LIST OF FIGURES

Figure 1. Shared memory multiprocessor interconnected via bus
Figure 2. Scalable Shared Memory Multiprocessor
Figure 3. Processing Time per node
Figure 4. Time before the start of 1st slave
Figure 5. Total Bytes per node
Figure 6. Number of Packets per node
Figure 7. Processing Time Ratio
Figure 8. Bytes Ratio
Figure 9. Number of Packets Ratio












Chapter 1

INTRODUCTION

A parallel computer is a "collection of processing elements that communicate and cooperate to solve large problems fast" [Almasi and Gottlieb, Highly Parallel Computing, 1989].

Parallel computing is the main approach to processing massive data and solving complex problems. It is used in a wide range of applications, including galaxy formation, weather forecasting, quantum physics, climate research, manufacturing processes, chemical reactions, and planetary movements.

Parallel processing means dividing a workload into subtasks and completing the subtasks concurrently. To achieve that, communication between the processing elements is required.

Parallel programming models, such as Shared Address Space (SAS) and Message Passing (MP), define how a set of parallel processes communicate, share information, and coordinate their activities [1].







1.1 Shared Memory Multiprocessor Architecture

In this model, multiple processes access a shared memory space using standard load and store instructions. Each thread/process accesses a portion of the shared data address space. The threads communicate with each other by reading and writing shared variables. Synchronization functions are used to prevent two threads from updating the same shared variable at the same time and to let the threads coordinate their activities.

A shared memory system is implemented using a bus or an interconnection network to interconnect the processors. Figure 1 illustrates a bus-based multiprocessor system, called UMA (Uniform Memory Access) because all memory accesses have the same latency. A NUMA (Non-Uniform Memory Access) multiprocessor, on the other hand, is designed by distributing the shared memory space among the different processors, as illustrated in Figure 2. The processors are interconnected using an interconnection network, making the architecture scalable.

Figure 1 Shared memory multiprocessor interconnected via bus [1]








Figure 2 Scalable Shared Memory Multiprocessor [1]


1.2 Message Passing Multiprocessor Architecture

In a message passing system, processes communicate by sending and receiving messages through the network. To send a message, a processor executes a system call to request the operating system to send the message to a destination process.

A common message passing system is a cluster network. A message passing architecture diagram is similar to that shown for NUMA in Figure 2, except that each processor can only access its own memory and can send and receive data to and from other processors.


1.3 Project Overview

Chapter 2 covers the tools and concepts used to model a message passing system. Chapter 3 describes the simulation data collection and analysis, and Chapter 4 presents the conclusion and future work. Appendix A presents the Simics script to start a 16-node MPI network simulation. Appendix B includes an MPI program for matrix multiplication. Appendix C presents the Simics script to add L1 and L2 caches to the simulated machines. Appendix D presents the Python script to collect the processing time and network traffic data from Simics. Appendix E presents the SSH-Agent script. Appendix F contains a step-by-step user guide to configure and simulate a message passing system model using Simics. Appendix G contains the collected simulation data.








Chapter 2

MESSAGE PASSING SYSTEM MODELING

This chapter presents a description of the Simics simulation environment, MPI, and the multithreaded message passing matrix multiplication program.

2.1 Simics Overview

Simics is a complete machine simulator that models all the hardware components found in a typical computer system. It is used by software developers to simulate any target hardware, from a single processor to large and complex systems [2]. Simics facilitates an integration and testing environment for software by providing the same experience as a real hardware system.

Simics provides a user-friendly interface with many tools and options. Among the many products and features of Simics are the craff utility, SimicsFS, Ethernet networking, and scripting with Python. These are the main Simics functionalities used in this project for the simulation of a message passing multiprocessor system.

Each processor is modeled as a stand-alone processing node with its own copy of the OS. The craff utility allows users to create an operating system image from a simulated machine and use it to simulate multiple identical nodes. This utility saves significant time: only one target machine is set up with all the software and configuration features, and that machine is then replicated to the remaining nodes. SimicsFS allows users to copy files from a host directory to a simulated node. Ethernet networking provides network connectivity among the simulated machines inside one Simics session. Scripting with Python is very simple and can be used to access system configuration parameters, invoke command line functions, define hap events, and interface with Simics API functions. The primary use of hap and Simics API functions in this project is for collecting simulation data.


2.2 Message Passing Interface

The Message Passing Interface (MPI) is a standard message-passing library specification developed to enable practical, portable, efficient, and flexible message passing programs [4]. The MPI standardization process commenced in 1992: a group of researchers from academia and industry worked together, exploiting the most advantageous features of the existing message passing systems. The MPI standard consists of two publications: MPI-1 (1994) and MPI-2 (1996). MPI-2 is mainly additions and extensions to MPI-1.

The MPI standard includes point-to-point communication, collective operations, process groups, communication contexts, process topologies, and interfaces supported in FORTRAN, C, and C++.

Processes/threads communicate by calling MPI library routines to send and receive messages to other processes. All programs using MPI require the mpi.h header file to make MPI library calls. MPI includes over one hundred different functions. The first MPI function that an MPI-based message passing program must call is MPI_Init, which initializes the MPI execution environment. The last function is MPI_Finalize, which terminates the MPI execution. Both functions are called once during a program execution. Table 1 shows the declaration and description of the MPI_Init and MPI_Finalize functions.


Table 1 MPI_Init and MPI_Finalize Functions

MPI_Init(int *argc, char ***argv)
- First MPI function called in a program.
- Some of the common arguments taken from the command line are the number of processes, specified hosts or a list of hosts, a hostfile (text file with the hosts specified), and the directory of the program.
- Initializes MPI variables and forms the MPI_COMM_WORLD communicator.
- Opens TCP connections.

MPI_Finalize()
- Terminates the MPI execution environment.
- Called last by all processes.
- Closes TCP connections.
- Cleans up.


The two basic concepts for programming with MPI are groups and communicators. A group is an ordered set of processes, where each process has its own rank number. A communicator determines the scope and the "communication universe" in which a point-to-point or collective operation is to operate. Each communicator is associated with a group [3].

MPI_COMM_WORLD is a communicator defined by MPI that refers to all the processes. Groups and communicators are dynamic objects that may get created and destroyed during program execution. MPI provides the flexibility to create groups and communicators for applications that require communication among a selected subgroup of processes. MPI_Comm_size and MPI_Comm_rank are the most commonly used communicator functions in an MPI program. MPI_Comm_size determines the size of the group, that is, the number of processes associated with a communicator. MPI_Comm_rank determines the rank of the calling process in the communicator.

The matrix multiplication MPI program uses MPI_COMM_WORLD as the communicator. Table 2 shows the declaration and description of the MPI_Comm functions.

Table 2 MPI_Comm Functions

MPI_Comm_size(MPI_Comm comm, int *size)
- Determines the number of processes within a communicator.
- In this study the MPI_Comm argument is MPI_COMM_WORLD.

MPI_Comm_rank(MPI_Comm comm, int *rank)
- Returns the process identifier (rank) of the process that invokes it.
- Rank is an integer between 0 and size-1.
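The following minimal C program is a sketch included here only for illustration (it is not part of the project code; the project program is listed in Appendix B). It shows how the functions in Tables 1 and 2 fit together: every MPI program initializes MPI, queries the communicator size and its own rank, and finalizes before exiting.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* initialize the MPI execution environment */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes in MPI_COMM_WORLD    */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process: 0 .. size-1        */

        printf("Process %d of %d\n", rank, size);

        MPI_Finalize();                        /* terminate the MPI execution environment  */
        return 0;
    }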


In MPI, point-to-point communication is fundamental for send and receive operations. MPI defines two models of communication: blocking and non-blocking. The non-blocking functions return immediately, even if the communication has not finished yet, while the blocking functions do not return until the communication is finished. Using non-blocking functions allows communication and computation to proceed simultaneously.

For this study, we use the asynchronous, non-blocking MPI_Isend function and the blocking MPI_Recv function. Table 3 shows the declaration and description of the MPI_Isend and MPI_Recv functions.

Table 3 MPI Send and Receive Functions

MPI_Isend(void *buffer, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
- Sends a message.
- A non-blocking MPI call: the computation can proceed immediately, allowing communication and computation to proceed concurrently.
- MPI supports messages with all the basic datatypes.

MPI_Recv(void *buffer, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- Receives a message.
- The count argument indicates the maximum length of a message.
- The tag argument must match between sender and receiver.
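As a sketch of how these two calls pair up (the variable names below are illustrative and are not taken from the project program), a sender posts a non-blocking MPI_Isend and later completes it with MPI_Wait, while the receiver uses the blocking MPI_Recv:

    int value = 42;        /* illustrative payload */
    MPI_Request request;
    MPI_Status status;

    if (rank == 0) {
        /* post the send and keep computing; complete it later with MPI_Wait */
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
        /* ... work that overlaps with the transfer can go here ... */
        MPI_Wait(&request, &status);
    } else if (rank == 1) {
        /* blocking receive: returns only after the message has arrived */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }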


Many applications require communication between two or more processes. MPI includes collective communication operations that involve the participation of all processes in a communicator. Broadcast is one of the most common collective operations and is the one used in this study. Broadcast is defined as MPI_Bcast and is used by one process, the root, to send a message to all the members of the communicator. Table 4 shows the declaration and description of the MPI_Bcast function.

Table 4 MPI Broadcast Function

MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int master, MPI_Comm comm)
- Broadcasts a message from the process with rank "master" (the root) to all other processes of the communicator.
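This is how the Appendix B program distributes matrix B: rank 0 supplies the buffer contents, and the same call fills the buffer on every other rank. The corresponding fragment (with the buffer size written out for the 12x12 matrix of the listing) is:

    /* on rank 0, mat_b already holds matrix B; on all other ranks it is filled by the call */
    MPI_Bcast(&mat_b, 12 * 12, MPI_DOUBLE, 0, MPI_COMM_WORLD);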









The MPI standard is a set of functions and capabilities that any implementation of the message-passing library must follow. The two leading open source implementations of MPI are MPICH2 and Open MPI. Both implementations are available for different versions of Unix, Linux, Mac OS X, and MS Windows.

2.2.1 MPICH2

MPICH2 is a broadly used MPI implementation developed at Argonne National Laboratory (ANL) and Mississippi State University (MSU). MPICH2 is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard (both MPI-1 and MPI-2). The "CH" comes from "Chameleon", the portability layer used in the original MPICH; the founder of MPICH developed the Chameleon parallel programming library.

MPICH2 uses an external process manager that spawns and manages parallel jobs. The process manager communicates with MPICH2 through PMI (the process management interface). MPD, which starts an mpd daemon on each of the worker nodes to manage all the MPI processes, used to be the default process manager for MPICH2. Starting with version 1.3, Hydra, a more robust and reliable process manager, is the default MPICH2 process manager.







2.2.2 Open MPI

Open MPI evolved from the merger of three established MPI implementations, FT-MPI, LA-MPI, and LAM/MPI, plus contributions from PACX-MPI. Open MPI was developed using the best practices of these established MPI implementations. Open MPI runs using the Open Run-Time Environment (ORTE). ORTE is open source software developed to support distributed high-performance applications and transparent scalability. ORTE starts MPI jobs and provides some status information to the upper-layer Open MPI [5].

The Open MPI project's goal is to work with and for the supercomputing community to support MPI implementations for a large number and variety of systems. The K computer, a supercomputer produced by Fujitsu that is currently the world's second fastest supercomputer, uses a Tofu-optimized MPI based on Open MPI.

MPICH2 and Open MPI are the most common MPI implementations used by supercomputers. Open MPI does not require the use of a separate process manager, which makes installation, configuration, and execution simpler. Open MPI is the MPI implementation used in this project.







2.3 MPI Overview

2.3.1 Beowulf Cluster and MPI Cluster

The Beowulf Project started in 1994 at NASA's Goddard Space Flight Center. A result of this research was the Beowulf cluster system, a scalable combination of hardware and software that provides a sophisticated and robust environment to support a wide range of applications [6]. The name "Beowulf" comes from the mythical Old-English hero with extraordinary strength who defeats the monster Grendel. The motivation to develop a cost-efficient solution makes a Beowulf cluster attainable to anyone. The three required components are a collection of stand-alone computers networked together, an open source operating system such as Linux, and a message passing interface or PVM (Parallel Virtual Machine) implementation. Based on that, the components selected for this project are 4, 8, and 16 Pentium PCs with Fedora 5, TCP/IP network connectivity, and an Open MPI implementation.

A Beowulf cluster is known as an "MPI cluster" when MPI is used for communication and coordination between the processing nodes.


2.3.2 MPI Network Simulation

One early consideration when setting up an MPI network is the choice of file system: whether or not to set up a Network File System (NFS). NFS is a protocol that facilitates access to files over the network as if they were local. With NFS, a folder containing the Open MPI program can be shared from the master node to all the other slave nodes. However, NFS can become a bottleneck when all the nodes use the NFS shared directory. NFS is not used in this project; instead, Open MPI is installed on the local drive of each node.

A second consideration is setting up the secure SSH protocol on the master node. MPI uses SSH to communicate among the nodes. Simics Tango targets are loaded with OpenSSH, a widely used implementation of the SSH protocol, configured with password protection. Because Open MPI relies on OpenSSH at execution time, additional commands are run to ensure a connection without a password.
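In practice this means switching to key-based authentication: a key pair is generated with ssh-keygen, the public key is appended to ~/.ssh/authorized_keys on each node, and the private key is loaded into ssh-agent so that no password or passphrase prompt appears at execution time. (This is a general outline of the OpenSSH mechanism; the exact commands used in this project are given in the user guide of Appendix F, and the ssh-agent startup script is reproduced in Appendix E.)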


The last setting to be performed in this simulation is to define a username with the same user ID on each node, with the same directory path to access common files. Now that all the required components have been introduced, the MPI matrix multiplication program is described next.



2.4 Simulation of Matrix Multiplication

In this project, Simics scripts are used to configure and simulate 4-, 8-, or 16-node message passing systems. Each node is configured as a complete single-processor system. Each node also includes a copy of the executable matrix multiplication code. A file listing the hostnames of all the nodes must be created.

When entering the execution command, two arguments are passed: 1) the number of processes "np", to specify how many processors to use, and 2) a file that includes the names of the processing nodes. However, in this project, the node names are not referenced in the program explicitly; only the node IDs (also called ranks) 0, 1, 2, etc. are referenced.
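With Open MPI, this execution command is an mpirun invocation of the general form mpirun -np 16 --hostfile <hostfile> ./<program>, where the hostfile and program names here are placeholders; the exact command used on the simulated cluster is covered in the user guide of Appendix F.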


One of the nodes is the master node, which coordinates the task of multiplying two matrices A and B. The master partitions matrix A among the different (slave) nodes and then broadcasts matrix B to all the slaves. Each slave node multiplies its portion of matrix A with matrix B and sends the results to the master, which combines the results to produce the final product of A and B.
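For the 12x12 matrices of the Appendix B listing running on the 4-node system (one master and three slaves), for example, each slave receives 12 / 3 = 4 rows of matrix A, receives all of matrix B via the broadcast, and returns a 4x12 block of the result.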






Chapter 3

SIMULATION RESULTS AND PERFORMANCE ANALYSIS

As described in the previous chapter, to model a message passing system simulation, Open MPI was installed and configured on a Simics target machine. This chapter presents the simulation results and performance analysis of running a message passing matrix multiplication program.


3.1 Simulation Parameters

The matrix multiplication program is executed in three simulated MP systems, with 4, 8, and 16 processing nodes. The nodes are identical in terms of processor type, clock frequency, and the size of the main and cache memories. Table 5 displays the configuration data of each node. The master and slave nodes are interconnected by Ethernet with the MTU (Maximum Transmission Unit) set to the default value of 1500 bytes.

The nodes are independent and each includes a copy of the test program. Seven different matrix sizes were used in the simulation.

Table 5 Configuration Information

Nodes | Cores | MHz  | Main Memory | L1 Data Cache | L1 Instruction Cache | L2 Cache
4     | 1     | 2000 | 1 GB        | 32 KB         | 32 KB                | 256 KB
8     | 1     | 2000 | 1 GB        | 32 KB         | 32 KB                | 256 KB
16    | 1     | 2000 | 1 GB        | 32 KB         | 32 KB                | 256 KB






3.2 Data Analysis

Figure 3 shows the average processing time per node using 100x100, 200x200, etc., matrices. As expected, the average processing time per node decreases as the number of nodes increases. Also as expected, as the matrix size increases, the average processing time per node in each system also increases.

In the 4-node system, the average processing time increases linearly as the matrix size increases. On the other hand, in the 8-node and 16-node systems, the increase in the average processing time per node is not linear as the matrix size increases. In the 8-node system, when the matrix size reaches 800x800 the average processing time climbs from 34.23 to 98.80. In the case of the 16-node system, when the matrix size reaches 1000x1000 the average processing time per node climbs from 24.59 to 94.48. This jump in the average processing times is due to the increased delay from the time the program starts running until the first slave starts multiplying its portion of the matrix, as illustrated in Figure 4. In the 8-node and 16-node systems, the delay before the first slave node starts jumps when the matrix size is 800x800 and 1000x1000, respectively. One can conclude that the communication delay increases at a higher rate as the matrix size and the number of processing nodes increase.





Figure 3 Processing Time per node







Figure 4 Time before the start of 1st slave









Figure 5 shows the average of the total bytes communicated per node; as expected, the larger the matrix size, the larger the number of bytes transmitted. This increase is proportional to the number of elements in each matrix. For example, in the 16-node system, the number of transmitted bytes for the 500x500 matrix is 2,284,426 and for the 1000x1000 matrix is 9,097,049, a ratio of 3.98, which is approximately equal to the ratio of the number of elements in the two matrices.
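Spelling out the arithmetic: 9,097,049 / 2,284,426 ≈ 3.98, while the element-count ratio is (1000 x 1000) / (500 x 500) = 4, so the traffic grows essentially in proportion to the number of matrix elements.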


Figure 5 Total Bytes per node

Figure 6 shows the average number of packets per node; as expected, the larger the matrix size, the larger the number of packets incurred during the program execution. In general, as the matrix size increases, there are more packets per node when there are fewer nodes, because each node must receive a bigger section of matrix A when there are fewer nodes.

Figure 7 through Figure 9 illustrate the processing time, the number of bytes communicated, and the number of packets of the 8-node and 16-node systems as compared with those of the 4-node system. While the ratios of the number of bytes communicated and of the number of packets between 8 vs. 4 and 16 vs. 4 nodes remain the same as the matrix size increases, the 16-node system has the least processing time per node. However, the ratios of the 8 vs. 4 and 16 vs. 4 processing times per node decrease as the matrix size becomes larger.





Figure 6 Number of Packets per node







Figure 7 Processing Time Ratio






Figure 8 Bytes Ratio



Figure 9 Number of Packets Ratio


Chapter 4

CONCLUSION

This project simulated a message passing multiprocessor system using Simics. Using an MPI matrix multiplication program, processing time and network traffic information were collected to evaluate the performance of three separate systems: 4-node, 8-node, and 16-node. Several iterations of Simics simulations were performed to study the performance by varying the matrix size. The results indicate that as the matrix size gets larger and there are more processing nodes, there is a rapid increase in the processing time per node. However, the average processing time per node is lower when there are more nodes.

This project serves as base research for future projects. Further studies may include performance analysis of a different problem. Other studies may include the simulation of alternative interconnection networks in Simics; for example, this can be done with multiple Ethernet connections per node to implement a hypercube interconnection network.






APPENDIX A. Simics Script to Run a 16-node MPI Network Simulation

if not defined create_network {$create_network = "yes"}
if not defined disk_image {$disk_image = "tango-openmpi.craff"}

load-module std-components
load-module eth-links

$host_name = "master"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:31"
$ip_address = "10.10.0.13"
$host_name = "slave1"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:32"
$ip_address = "10.10.0.14"
$host_name = "slave2"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:33"
$ip_address = "10.10.0.15"
$host_name = "slave3"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:34"
$ip_address = "10.10.0.16"
$host_name = "slave4"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:35"
$ip_address = "10.10.0.17"
$host_name = "slave5"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:36"
$ip_address = "10.10.0.18"
$host_name = "slave6"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:37"
$ip_address = "10.10.0.19"
$host_name = "slave7"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:38"
$ip_address = "10.10.0.20"
$host_name = "slave8"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:39"
$ip_address = "10.10.0.21"
$host_name = "slave9"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:40"
$ip_address = "10.10.0.22"
$host_name = "slave10"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:41"
$ip_address = "10.10.0.23"
$host_name = "slave11"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:42"
$ip_address = "10.10.0.24"
$host_name = "slave12"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:43"
$ip_address = "10.10.0.25"
$host_name = "slave13"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:44"
$ip_address = "10.10.0.26"
$host_name = "slave14"
run-command-file "%script%/tango-common.simics"

$mac_address = "10:10:10:10:10:45"
$ip_address = "10.10.0.27"
$host_name = "slave15"
run-command-file "%script%/tango-common.simics"

set-memory-limit 980




APPENDIX B. MPI Program for Matrix Multiplication

The matrix multiplication MPI program was found on the Internet at the following website URL: http://www.daniweb.com/software-development/c/code/334470/matrix-multiplication-using-mpi-parallel-programming-approach. A request was submitted to Viraj Brian Wijesuriya, the author of the code, asking authorization to use his code in this study. Below are the screenshots of the email requesting and authorizing permission to use the matrix multiplication MPI program.

[Screenshot: Email sent to request permission to use the Matrix Multiplication Program using MPI.]

[Screenshot: Email received from Viraj Brian Wijesuriya granting authorization to use his program.]




A Simics MAGIC(n) function has been added to the matrix multiplication program to insert a breakpoint that invokes a callback function to collect simulation data. MAGIC(1) and MAGIC(2) are executed by the master node to dump the start and end processing times and to start and stop the network traffic capture. MAGIC(3) and MAGIC(4) are executed by each slave to dump its start and end processing times.



/***********************************************************************
 * Matrix Multiplication Program using MPI.
 * Viraj Brian Wijesuriya - University of Colombo School of Computing, Sri Lanka.
 * Works with any type of two matrixes [A], [B] which could be multiplied to produce
 * a matrix [c].
 * Master process initializes the multiplication operands, distributes the multiplication
 * operation to worker processes and reduces the worker results to construct the final
 * output.
 ***********************************************************************/
#include <stdio.h>
#include <mpi.h>
#include <magic-instruction.h>                 //part of Simics SW

#define NUM_ROWS_A 12                          //rows of input [A]
#define NUM_COLUMNS_A 12                       //columns of input [A]
#define NUM_ROWS_B 12                          //rows of input [B]
#define NUM_COLUMNS_B 12                       //columns of input [B]
#define MASTER_TO_SLAVE_TAG 1                  //tag for messages sent from master to slaves
#define SLAVE_TO_MASTER_TAG 4                  //tag for messages sent from slaves to master

void makeAB();                                 //makes the [A] and [B] matrixes
void printArray();                             //print the content of output matrix [C];

int rank;                                      //process rank
int size;                                      //number of processes
int i, j, k;                                   //helper variables
double mat_a[NUM_ROWS_A][NUM_COLUMNS_A];       //declare input [A]
double mat_b[NUM_ROWS_B][NUM_COLUMNS_B];       //declare input [B]
double mat_result[NUM_ROWS_A][NUM_COLUMNS_B];  //declare output [C]
double start_time;                             //hold start time
double end_time;                               //hold end time
int low_bound;                                 //low bound of the number of rows of [A] allocated to a slave
int upper_bound;                               //upper bound of the number of rows of [A] allocated to a slave
int portion;                                   //portion of the number of rows of [A] allocated to a slave
MPI_Status status;                             //store status of an MPI_Recv
MPI_Request request;                           //capture request of an MPI_Isend

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);                    //initialize MPI operations
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      //get the rank
    MPI_Comm_size(MPI_COMM_WORLD, &size);      //get number of processes

    /* master initializes work */
    if (rank == 0) {
        MAGIC(1);
        makeAB();
        start_time = MPI_Wtime();
        for (i = 1; i < size; i++) {                //for each slave other than the master
            portion = (NUM_ROWS_A / (size - 1));    //calculate portion without master
            low_bound = (i - 1) * portion;
            if (((i + 1) == size) && ((NUM_ROWS_A % (size - 1)) != 0)) {
                //if rows of [A] cannot be equally divided among slaves
                upper_bound = NUM_ROWS_A;           //last slave gets all the remaining rows
            } else {
                //rows of [A] are equally divisible among slaves
                upper_bound = low_bound + portion;
            }
            //send the low bound first without blocking, to the intended slave
            MPI_Isend(&low_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &request);
            //next send the upper bound without blocking, to the intended slave
            MPI_Isend(&upper_bound, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &request);
            //finally send the allocated row portion of [A] without blocking, to the intended slave
            MPI_Isend(&mat_a[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_A, MPI_DOUBLE, i,
                      MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &request);
        }
    }

    //broadcast [B] to all the slaves
    MPI_Bcast(&mat_b, NUM_ROWS_B * NUM_COLUMNS_B, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* work done by slaves */
    if (rank > 0) {
        MAGIC(3);
        //receive low bound from the master
        MPI_Recv(&low_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD, &status);
        //next receive upper bound from the master
        MPI_Recv(&upper_bound, 1, MPI_INT, 0, MASTER_TO_SLAVE_TAG + 1, MPI_COMM_WORLD, &status);
        //finally receive row portion of [A] to be processed from the master
        MPI_Recv(&mat_a[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_A, MPI_DOUBLE, 0,
                 MASTER_TO_SLAVE_TAG + 2, MPI_COMM_WORLD, &status);

        for (i = low_bound; i < upper_bound; i++) {      //iterate through a given set of rows of [A]
            for (j = 0; j < NUM_COLUMNS_B; j++) {        //iterate through columns of [B]
                for (k = 0; k < NUM_ROWS_B; k++) {       //iterate through rows of [B]
                    mat_result[i][j] += (mat_a[i][k] * mat_b[k][j]);
                }
            }
        }

        //send back the low bound first without blocking, to the master
        MPI_Isend(&low_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &request);
        //send the upper bound next without blocking, to the master
        MPI_Isend(&upper_bound, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG + 1, MPI_COMM_WORLD, &request);
        //finally send the processed portion of data without blocking, to the master
        MPI_Isend(&mat_result[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_B, MPI_DOUBLE, 0,
                  SLAVE_TO_MASTER_TAG + 2, MPI_COMM_WORLD, &request);
        MAGIC(4);
    }

    /* master gathers processed work */
    if (rank == 0) {
        for (i = 1; i < size; i++) {           //until all slaves have handed back the processed data
            //receive low bound from a slave
            MPI_Recv(&low_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &status);
            //receive upper bound from a slave
            MPI_Recv(&upper_bound, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG + 1, MPI_COMM_WORLD, &status);
            //receive processed data from a slave
            MPI_Recv(&mat_result[low_bound][0], (upper_bound - low_bound) * NUM_COLUMNS_B, MPI_DOUBLE, i,
                     SLAVE_TO_MASTER_TAG + 2, MPI_COMM_WORLD, &status);
        }
        printArray();
        end_time = MPI_Wtime();
        printf("\nRunning Time = %f\n\n", end_time - start_time);
    }

    MPI_Finalize();                            //finalize MPI operations
    MAGIC(2);
    return 0;
}

void makeAB()
{
    for (i = 0; i < NUM_ROWS_A; i++) {
        for (j = 0; j < NUM_COLUMNS_A; j++) {
            mat_a[i][j] = i + j;
        }
    }
    for (i = 0; i < NUM_ROWS_B; i++) {
        for (j = 0; j < NUM_COLUMNS_B; j++) {
            mat_b[i][j] = i * j;
        }
    }
}

void printArray()
{
    for (i = 0; i < NUM_ROWS_A; i++) {
        printf("\n");
        for (j = 0; j < NUM_COLUMNS_B; j++)
            printf("%8.2f ", mat_result[i][j]);
    }
    printf("Done.\n");
    end_time = MPI_Wtime();
    printf("\nRunning Time = %f\n\n", end_time - start_time);
}

APPENDIX C. Simics Script to Add L1 and L2 Cache Memories

This script adds L1 and L2 cache memories to each simulated machine in a 4-node network simulation. Each processor has a 32 KB write-through L1 data cache, a 32 KB L1 instruction cache, and a 256 KB L2 cache with a write-back policy. Instruction and data accesses are separated by id-splitters and are sent to the respective caches. The splitter allows correctly aligned accesses to go through and splits the incorrectly aligned ones into two accesses. The transaction staller (trans-staller) simulates main memory latency [11].


##Add L1 and L2 caches to Master Node
## Transaction staller to represent memory latency. Stall instructions 239 cycles to simulate memory latency
@master_staller = pre_conf_object("master_staller", "trans-staller", stall_time = 239) ##Latency of (L2 + RAM) in CPU cycles

## Master core
@master_cpu0 = conf.master.motherboard.processor0.core[0][0]

## L2 cache (l2c0) for cpu0: 256KB with write-back
@master_l2c0 = pre_conf_object("master_l2c0", "g-cache")
@master_l2c0.cpus = master_cpu0
@master_l2c0.config_line_number = 4096
@master_l2c0.config_line_size = 64
@master_l2c0.config_assoc = 8
@master_l2c0.config_virtual_index = 0
@master_l2c0.config_virtual_tag = 0
@master_l2c0.config_write_back = 1
@master_l2c0.config_write_allocate = 1
@master_l2c0.config_replacement_policy = 'lru'
@master_l2c0.penalty_read = 37        ##Stall penalty (in cycles) for any incoming read transaction
@master_l2c0.penalty_write = 37       ##Stall penalty (in cycles) for any incoming write transaction
@master_l2c0.penalty_read_next = 22   ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@master_l2c0.penalty_write_next = 22  ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@master_l2c0.timing_model = master_staller

## L1 Instruction Cache (ic0): 32KB
@master_ic0 = pre_conf_object("master_ic0", "g-cache")
@master_ic0.cpus = master_cpu0
@master_ic0.config_line_number = 512
@master_ic0.config_line_size = 64     ##64 blocks. Implies 512 lines
@master_ic0.config_assoc = 8
@master_ic0.config_virtual_index = 0
@master_ic0.config_virtual_tag = 0
@master_ic0.config_write_back = 0
@master_ic0.config_write_allocate = 0
@master_ic0.config_replacement_policy = 'lru'
@master_ic0.penalty_read = 12         ##Stall penalty (in cycles) for any incoming read transaction
@master_ic0.penalty_write = 12        ##Stall penalty (in cycles) for any incoming write transaction
@master_ic0.penalty_read_next = 9     ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@master_ic0.penalty_write_next = 9    ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@master_ic0.timing_model = master_l2c0

## L1 Data Cache (dc0): 32KB write-through
@master_dc0 = pre_conf_object("master_dc0", "g-cache")
@master_dc0.cpus = master_cpu0
@master_dc0.config_line_number = 512
@master_dc0.config_line_size = 64     ##64 blocks. Implies 512 lines
@master_dc0.config_assoc = 8
@master_dc0.config_virtual_index = 0
@master_dc0.config_virtual_tag = 0
@master_dc0.config_write_back = 0
@master_dc0.config_write_allocate = 0
@master_dc0.config_replacement_policy = 'lru'
@master_dc0.penalty_read = 12         ##Stall penalty (in cycles) for any incoming read transaction
@master_dc0.penalty_write = 12        ##Stall penalty (in cycles) for any incoming write transaction
@master_dc0.penalty_read_next = 9     ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@master_dc0.penalty_write_next = 9    ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@master_dc0.timing_model = master_l2c0

# Transaction splitter for L1 instruction cache for master_cpu0
@master_ts_i0 = pre_conf_object("master_ts_i0", "trans-splitter")
@master_ts_i0.cache = master_ic0
@master_ts_i0.timing_model = master_ic0
@master_ts_i0.next_cache_line_size = 64

# Transaction splitter for L1 data cache for master_cpu0
@master_ts_d0 = pre_conf_object("master_ts_d0", "trans-splitter")
@master_ts_d0.cache = master_dc0
@master_ts_d0.timing_model = master_dc0
@master_ts_d0.next_cache_line_size = 64

# ID splitter for L1 cache for master_cpu0
@master_id0 = pre_conf_object("master_id0", "id-splitter")
@master_id0.ibranch = master_ts_i0
@master_id0.dbranch = master_ts_d0

#Add Configuration
@SIM_add_configuration([master_staller, master_l2c0, master_ic0, master_dc0, master_ts_i0, master_ts_d0, master_id0], None);
@master_cpu0.physical_memory.timing_model = conf.master_id0

#End of master


##Add L1 and L2 caches to slave1 Node
## Transaction staller to represent memory latency. Stall instructions 239 cycles to simulate memory latency
@slave1_staller = pre_conf_object("slave1_staller", "trans-staller", stall_time = 239) ##Latency of (L2 + RAM) in CPU cycles

## Slave1 core
@slave1_cpu0 = conf.slave1.motherboard.processor0.core[0][0]

## L2 cache (l2c0) for cpu0: 256KB with write-back
@slave1_l2c0 = pre_conf_object("slave1_l2c0", "g-cache")
@slave1_l2c0.cpus = slave1_cpu0
@slave1_l2c0.config_line_number = 4096
@slave1_l2c0.config_line_size = 64
@slave1_l2c0.config_assoc = 8
@slave1_l2c0.config_virtual_index = 0
@slave1_l2c0.config_virtual_tag = 0
@slave1_l2c0.config_write_back = 1
@slave1_l2c0.config_write_allocate = 1
@slave1_l2c0.config_replacement_policy = 'lru'
@slave1_l2c0.penalty_read = 37        ##Stall penalty (in cycles) for any incoming read transaction
@slave1_l2c0.penalty_write = 37       ##Stall penalty (in cycles) for any incoming write transaction
@slave1_l2c0.penalty_read_next = 22   ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave1_l2c0.penalty_write_next = 22  ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave1_l2c0.timing_model = slave1_staller

## L1 Instruction Cache (ic0): 32KB
@slave1_ic0 = pre_conf_object("slave1_ic0", "g-cache")
@slave1_ic0.cpus = slave1_cpu0
@slave1_ic0.config_line_number = 512
@slave1_ic0.config_line_size = 64     ##64 blocks. Implies 512 lines
@slave1_ic0.config_assoc = 8
@slave1_ic0.config_virtual_index = 0
@slave1_ic0.config_virtual_tag = 0
@slave1_ic0.config_write_back = 0
@slave1_ic0.config_write_allocate = 0
@slave1_ic0.config_replacement_policy = 'lru'
@slave1_ic0.penalty_read = 12         ##Stall penalty (in cycles) for any incoming read transaction
@slave1_ic0.penalty_write = 12        ##Stall penalty (in cycles) for any incoming write transaction
@slave1_ic0.penalty_read_next = 9     ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave1_ic0.penalty_write_next = 9    ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave1_ic0.timing_model = slave1_l2c0

## L1 Data Cache (dc0): 32KB write-through
@slave1_dc0 = pre_conf_object("slave1_dc0", "g-cache")
@slave1_dc0.cpus = slave1_cpu0
@slave1_dc0.config_line_number = 512
@slave1_dc0.config_line_size = 64     ##64 blocks. Implies 512 lines
@slave1_dc0.config_assoc = 8
@slave1_dc0.config_virtual_index = 0
@slave1_dc0.config_virtual_tag = 0
@slave1_dc0.config_write_back = 0
@slave1_dc0.config_write_allocate = 0
@slave1_dc0.config_replacement_policy = 'lru'
@slave1_dc0.penalty_read = 12         ##Stall penalty (in cycles) for any incoming read transaction
@slave1_dc0.penalty_write = 12        ##Stall penalty (in cycles) for any incoming write transaction
@slave1_dc0.penalty_read_next = 9     ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave1_dc0.penalty_write_next = 9    ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave1_dc0.timing_model = slave1_l2c0

# Transaction splitter for L1 instruction cache for slave1_cpu0
@slave1_ts_i0 = pre_conf_object("slave1_ts_i0", "trans-splitter")
@slave1_ts_i0.cache = slave1_ic0
@slave1_ts_i0.timing_model = slave1_ic0
@slave1_ts_i0.next_cache_line_size = 64

# Transaction splitter for L1 data cache for slave1_cpu0
@slave1_ts_d0 = pre_conf_object("slave1_ts_d0", "trans-splitter")
@slave1_ts_d0.cache = slave1_dc0
@slave1_ts_d0.timing_model = slave1_dc0
@slave1_ts_d0.next_cache_line_size = 64

# ID splitter for L1 cache for slave1_cpu0
@slave1_id0 = pre_conf_object("slave1_id0", "id-splitter")
@slave1_id0.ibranch = slave1_ts_i0
@slave1_id0.dbranch = slave1_ts_d0

#Add Configuration
@SIM_add_configuration([slave1_staller, slave1_l2c0, slave1_ic0, slave1_dc0, slave1_ts_i0, slave1_ts_d0, slave1_id0], None);
@slave1_cpu0.physical_memory.timing_model = conf.slave1_id0

#End of slave1


##Add L1 and L2 caches to slave2 Node
## Transaction staller to represent memory latency. Stall instructions 239 cycles to simulate memory latency
@slave2_staller = pre_conf_object("slave2_staller", "trans-staller", stall_time = 239) ##Latency of (L2 + RAM) in CPU cycles

## Slave2 core
@slave2_cpu0 = conf.slave2.motherboard.processor0.core[0][0]

## L2 cache (l2c0) for cpu0: 256KB with write-back
@slave2_l2c0 = pre_conf_object("slave2_l2c0", "g-cache")
@slave2_l2c0.cpus = slave2_cpu0
@slave2_l2c0.config_line_number = 4096
@slave2_l2c0.config_line_size = 64
@slave2_l2c0.config_assoc = 8
@slave2_l2c0.config_virtual_index = 0
@slave2_l2c0.config_virtual_tag = 0
@slave2_l2c0.config_write_back = 1
@slave2_l2c0.config_write_allocate = 1
@slave2_l2c0.config_replacement_policy = 'lru'
@slave2_l2c0.penalty_read = 37        ##Stall penalty (in cycles) for any incoming read transaction
@slave2_l2c0.penalty_write = 37       ##Stall penalty (in cycles) for any incoming write transaction
@slave2_l2c0.penalty_read_next = 22   ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_l2c0.penalty_write_next = 22  ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_l2c0.timing_model = slave2_staller

## L1 Instruction Cache (ic0): 32KB
@slave2_ic0 = pre_conf_object("slave2_ic0", "g-cache")
@slave2_ic0.cpus = slave2_cpu0
@slave2_ic0.config_line_number = 512
@slave2_ic0.config_line_size = 64     ##64 blocks. Implies 512 lines
@slave2_ic0.config_assoc = 8
@slave2_ic0.config_virtual_index = 0
@slave2_ic0.config_virtual_tag = 0
@slave2_ic0.config_write_back = 0
@slave2_ic0.config_write_allocate = 0
@slave2_ic0.config_replacement_policy = 'lru'
@slave2_ic0.penalty_read = 12         ##Stall penalty (in cycles) for any incoming read transaction
@slave2_ic0.penalty_write = 12        ##Stall penalty (in cycles) for any incoming write transaction
@slave2_ic0.penalty_read_next = 9     ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_ic0.penalty_write_next = 9    ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_ic0.timing_model = slave2_l2c0

## L1 Data Cache (dc0): 32KB write-through
@slave2_dc0 = pre_conf_object("slave2_dc0", "g-cache")
@slave2_dc0.cpus = slave2_cpu0
@slave2_dc0.config_line_number = 512
@slave2_dc0.config_line_size = 64     ##64 blocks. Implies 512 lines
@slave2_dc0.config_assoc = 8
@slave2_dc0.config_virtual_index = 0
@slave2_dc0.config_virtual_tag = 0
@slave2_dc0.config_write_back = 0
@slave2_dc0.config_write_allocate = 0
@slave2_dc0.config_replacement_policy = 'lru'
@slave2_dc0.penalty_read = 12         ##Stall penalty (in cycles) for any incoming read transaction
@slave2_dc0.penalty_write = 12        ##Stall penalty (in cycles) for any incoming write transaction
@slave2_dc0.penalty_read_next = 9     ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_dc0.penalty_write_next = 9    ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave2_dc0.timing_model = slave2_l2c0

# Transaction splitter for L1 instruction cache for slave2_cpu0
@slave2_ts_i0 = pre_conf_object("slave2_ts_i0", "trans-splitter")
@slave2_ts_i0.cache = slave2_ic0
@slave2_ts_i0.timing_model = slave2_ic0
@slave2_ts_i0.next_cache_line_size = 64

# Transaction splitter for L1 data cache for slave2_cpu0
@slave2_ts_d0 = pre_conf_object("slave2_ts_d0", "trans-splitter")
@slave2_ts_d0.cache = slave2_dc0
@slave2_ts_d0.timing_model = slave2_dc0
@slave2_ts_d0.next_cache_line_size = 64

# ID splitter for L1 cache for slave2_cpu0
@slave2_id0 = pre_conf_object("slave2_id0", "id-splitter")
@slave2_id0.ibranch = slave2_ts_i0
@slave2_id0.dbranch = slave2_ts_d0

#Add Configuration
@SIM_add_configuration([slave2_staller, slave2_l2c0, slave2_ic0, slave2_dc0, slave2_ts_i0, slave2_ts_d0, slave2_id0], None);
@slave2_cpu0.physical_memory.timing_model = conf.slave2_id0

#End of slave2


##Add L1 and L2 caches to slave3 Node
## Transaction staller to represent memory latency. Stall instructions 239 cycles to simulate memory latency
@slave3_staller = pre_conf_object("slave3_staller", "trans-staller", stall_time = 239) ##Latency of (L2 + RAM) in CPU cycles

## Slave3 core
@slave3_cpu0 = conf.slave3.motherboard.processor0.core[0][0]

## L2 cache (l2c0) for cpu0: 256KB with write-back
@slave3_l2c0 = pre_conf_object("slave3_l2c0", "g-cache")
@slave3_l2c0.cpus = slave3_cpu0
@slave3_l2c0.config_line_number = 4096
@slave3_l2c0.config_line_size = 64
@slave3_l2c0.config_assoc = 8
@slave3_l2c0.config_virtual_index = 0
@slave3_l2c0.config_virtual_tag = 0
@slave3_l2c0.config_write_back = 1
@slave3_l2c0.config_write_allocate = 1
@slave3_l2c0.config_replacement_policy = 'lru'
@slave3_l2c0.penalty_read = 37        ##Stall penalty (in cycles) for any incoming read transaction
@slave3_l2c0.penalty_write = 37       ##Stall penalty (in cycles) for any incoming write transaction
@slave3_l2c0.penalty_read_next = 22   ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_l2c0.penalty_write_next = 22  ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_l2c0.timing_model = slave3_staller

## L1 Instruction Cache (ic0): 32KB
@slave3_ic0 = pre_conf_object("slave3_ic0", "g-cache")
@slave3_ic0.cpus = slave3_cpu0
@slave3_ic0.config_line_number = 512
@slave3_ic0.config_line_size = 64     ##64 blocks. Implies 512 lines
@slave3_ic0.config_assoc = 8
@slave3_ic0.config_virtual_index = 0
@slave3_ic0.config_virtual_tag = 0
@slave3_ic0.config_write_back = 0
@slave3_ic0.config_write_allocate = 0
@slave3_ic0.config_replacement_policy = 'lru'
@slave3_ic0.penalty_read = 12         ##Stall penalty (in cycles) for any incoming read transaction
@slave3_ic0.penalty_write = 12        ##Stall penalty (in cycles) for any incoming write transaction
@slave3_ic0.penalty_read_next = 9     ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_ic0.penalty_write_next = 9    ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_ic0.timing_model = slave3_l2c0

## L1 Data Cache (dc0): 32KB write-through
@slave3_dc0 = pre_conf_object("slave3_dc0", "g-cache")
@slave3_dc0.cpus = slave3_cpu0
@slave3_dc0.config_line_number = 512
@slave3_dc0.config_line_size = 64     ##64 blocks. Implies 512 lines
@slave3_dc0.config_assoc = 8
@slave3_dc0.config_virtual_index = 0
@slave3_dc0.config_virtual_tag = 0
@slave3_dc0.config_write_back = 0
@slave3_dc0.config_write_allocate = 0
@slave3_dc0.config_replacement_policy = 'lru'
@slave3_dc0.penalty_read = 12         ##Stall penalty (in cycles) for any incoming read transaction
@slave3_dc0.penalty_write = 12        ##Stall penalty (in cycles) for any incoming write transaction
@slave3_dc0.penalty_read_next = 9     ##Stall penalty (in cycles) for a read transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_dc0.penalty_write_next = 9    ##Stall penalty (in cycles) for a write transaction issued by the cache to the next level cache. Rounding error, value should be 7.
@slave3_dc0.timing_model = slave3_l2c0

# Transaction splitter for L1 instruction cache for slave3_cpu0
@slave3_ts_i0 = pre_conf_object("slave3_ts_i0", "trans-splitter")
@slave3_ts_i0.cache = slave3_ic0
@slave3_ts_i0.timing_model = slave3_ic0
@slave3_ts_i0.next_cache_line_size = 64

# Transaction splitter for L1 data cache for slave3_cpu0
@slave3_ts_d0 = pre_conf_object("slave3_ts_d0", "trans-splitter")
@slave3_ts_d0.cache = slave3_dc0
@slave3_ts_d0.timing_model = slave3_dc0
@slave3_ts_d0.next_cache_line_size = 64

# ID splitter for L1 cache for slave3_cpu0
@slave3_id0 = pre_conf_object("slave3_id0", "id-splitter")
@slave3_id0.ibranch = slave3_ts_i0
@slave3_id0.dbranch = slave3_ts_d0

#Add Configuration
@SIM_add_configuration([slave3_staller, slave3_l2c0, slave3_ic0, slave3_dc0, slave3_ts_i0, slave3_ts_d0, slave3_id0], None);
@slave3_cpu0.physical_memory.timing_model = conf.slave3_id0

#End of slave3



APPENDIX D. Python Script to Collect Simulation Data

This script defines a hap callback function, which is called by the magic instructions included in the matrix multiplication program. The script uses the Simics API to get the CPU time and to run the command that starts and stops capturing the network traffic.

Python script to collect processor and network traffic statistics (matrix_100.py):

from cli import *
from simics import *

def hap_callback(user_arg, cpu, arg):
    if arg == 1:
        print "cpu name: ", cpu.name
        print "Start at= ", SIM_time(cpu)
        SIM_run_alone(run_command, "ethernet_switch0.pcap-dump matrix_100.txt")
    if arg == 2:
        print "cpu name: ", cpu.name
        print "Start at= ", SIM_time(cpu)
    if arg == 3:
        print "cpu name: ", cpu.name
        print "End at= ", SIM_time(cpu)
    if arg == 4:
        print "cpu name: ", cpu.name
        print "End at= ", SIM_time(cpu)
        SIM_run_alone(run_command, "ethernet_switch0.pcap-dump -stop")

SIM_hap_add_callback("Core_Magic_Instruction", hap_callback, None)





APPENDIX E. SSH-Agent Script

The ssh-agent script was found on the Internet at the following website URL: http://www.cygwin.com/ml/cygwin/2001-06/msg00537.html. This script is added to the Linux shell startup file of the MPI user.

A request was submitted to Joseph Reagle, the author of the code, asking authorization to use his script to automate ssh-agent at login time. Below are the screenshots of the email requesting and authorizing permission to use the ssh-agent script.

[Screenshot: Email sent to request permission to use the ssh-agent script.]

[Screenshot: Email received from Joseph Reagle granting authorization to use his script.]






The .bash_profile file contains the ssh-agent script, which is executed at login time. In addition, both .bash_profile and .bashrc include the lines that add the Open MPI libraries and executables to the user's path.



.bash_profile File

# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
# User specific environment and startup programs
SSH_ENV="$HOME/.ssh/environment"
function start_agent {
    echo "Initialising new SSH agent..."
    /usr/bin/ssh-agent | sed 's/^echo/#echo/' > "${SSH_ENV}"
    echo succeeded
    chmod 600 "${SSH_ENV}"
    . "${SSH_ENV}" > /dev/null
    /usr/bin/ssh-add;
}
# Source SSH settings, if applicable
if [ -f "${SSH_ENV}" ]; then
    . "${SSH_ENV}" > /dev/null
    #ps ${SSH_AGENT_PID} doesn't work under cygwin
    ps -ef | grep ${SSH_AGENT_PID} | grep ssh-agent$ > /dev/null || {
        start_agent;
    }
else
    start_agent;
fi
PATH=$PATH:$HOME/bin
export PATH=/home/mpiu/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/mpiu/openmpi/lib:$LD_LIBRARY_PATH
export PATH


.bashrc File

# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
# User specific aliases and functions
export PATH=/home/mpiu/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/mpiu/openmpi/lib:$LD_LIBRARY_PATH



APPENDIX F. User Guide

MP Simulation System Using Simics: User's Guide

This user guide describes the steps to install Open MPI in a Simics target simulated machine. This installation is used as a craff file to open several simulated machines connected through the network inside one Simics session. The guide has been prepared to avoid extra steps by automating configuration and settings that can be reused repeatedly.

Table of Contents

I. INSTALL SIMICS
II. SIMICS SUPPORT
III. NETWORK SIMULATION IN SIMICS
IV. REQUIRED COMPONENTS AND PREREQUISITES
V. OPEN MPI INSTALLATION AND CONFIGURATION
VI. CREATING A NEW CRAFF FILE
VII. STARTING MPI NETWORK SIMULATION WITH SIMICS SCRIPTS
VIII. RUNNING MPI PROGRAMS

I. INSTALL SIMICS

1. Download Simics files

Go to https://www.simics.net/pub/simics/4.6_wzl263/

You can return to this link after your first installation to check for new versions and repeat the steps below. A newer version of Simics will be installed inside the Simics directory in a new, separate directory. You will need to update the Simics icon to access the newer version.

Download the following packages based on your operating system:

Simics Base: simics-pkg-1000
This is the base product package that contains Simics Hindsight, Simics Accelerator, Simics Ethernet Networking, Simics Analyzer, and other functionalities. Other packages are optional add-on products.

x86-440BX Target: simics-pkg-2018
This package will allow users to model various PC systems based on