Parallel and Distributed Computing (ICTS 6301)

Programming Assignment #1

Strassen Matrix Multiplication

Report

Lamiya El_Saedi 220093158


1. Describe the theoretical problem setting, the algorithm, the performance figures and expectations.

Algorithm:

We want to calculate the matrix product C = AB, where the matrices A and B are of type 2^n x 2^n.

We partition A, B and C into equally sized block matrices

    A = | A11  A12 |     B = | B11  B12 |     C = | C11  C12 |
        | A21  A22 |         | B21  B22 |         | C21  C22 |

with blocks of size 2^(n-1) x 2^(n-1), so that

    C11 = A11*B11 + A12*B21
    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21
    C22 = A21*B12 + A22*B22

With this construction we have not reduced the number of multiplications. We still need 8 multiplications to calculate the Cij matrices, the same number of multiplications we need when using standard matrix multiplication.









Now comes the important part. We define new matrices

    M1 = (A11 + A22) * (B11 + B22)
    M2 = (A21 + A22) * B11
    M3 = A11 * (B12 - B22)
    M4 = A22 * (B21 - B11)
    M5 = (A11 + A12) * B22
    M6 = (A21 - A11) * (B11 + B12)
    M7 = (A12 - A22) * (B21 + B22)

which are then used to express the Cij in terms of the Mk. Because of our definition of the Mk we can eliminate one matrix multiplication and reduce the number of multiplications to 7 (one multiplication for each Mk) and express the Cij as

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

We iterate this division process n times until the submatrices degenerate into numbers (elements of the ring R).

Practical implementations of Strassen's algorithm switch to standard methods of matrix multiplication for small enough submatrices, for which those methods are more efficient. The particular crossover point at which Strassen's algorithm becomes more efficient depends on the specific implementation and hardware. It has been estimated that Strassen's algorithm is faster for matrices with widths from 32 to 128 for optimized implementations. [1]
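To make the recursion and the crossover concrete, here is a minimal serial sketch in C++ (my own illustration, not the MPI code of the appendix); the names Matrix, naiveMul, add, sub and the cutoff value 64 are assumptions made only for this example.

#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Standard O(n^3) multiplication, used below the crossover size.
static Matrix naiveMul(const Matrix& A, const Matrix& B) {
    int n = A.size();
    Matrix C(n, std::vector<int>(n, 0));
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

static Matrix add(const Matrix& X, const Matrix& Y) {
    int n = X.size();
    Matrix R(n, std::vector<int>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) R[i][j] = X[i][j] + Y[i][j];
    return R;
}

static Matrix sub(const Matrix& X, const Matrix& Y) {
    int n = X.size();
    Matrix R(n, std::vector<int>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) R[i][j] = X[i][j] - Y[i][j];
    return R;
}

// One level of Strassen recursion; n must be a power of two.
Matrix strassen(const Matrix& A, const Matrix& B, int cutoff = 64) {
    int n = A.size();
    if (n <= cutoff) return naiveMul(A, B);   // crossover to the standard method
    int h = n / 2;

    // Split A and B into their four quadrants.
    Matrix A11(h, std::vector<int>(h)), A12 = A11, A21 = A11, A22 = A11;
    Matrix B11 = A11, B12 = A11, B21 = A11, B22 = A11;
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
            A11[i][j] = A[i][j];     A12[i][j] = A[i][j + h];
            A21[i][j] = A[i + h][j]; A22[i][j] = A[i + h][j + h];
            B11[i][j] = B[i][j];     B12[i][j] = B[i][j + h];
            B21[i][j] = B[i + h][j]; B22[i][j] = B[i + h][j + h];
        }

    // The seven products M1..M7 defined above.
    Matrix M1 = strassen(add(A11, A22), add(B11, B22), cutoff);
    Matrix M2 = strassen(add(A21, A22), B11, cutoff);
    Matrix M3 = strassen(A11, sub(B12, B22), cutoff);
    Matrix M4 = strassen(A22, sub(B21, B11), cutoff);
    Matrix M5 = strassen(add(A11, A12), B22, cutoff);
    Matrix M6 = strassen(sub(A21, A11), add(B11, B12), cutoff);
    Matrix M7 = strassen(sub(A12, A22), add(B21, B22), cutoff);

    // Recombine the quadrants of C from the Mk.
    Matrix C(n, std::vector<int>(n));
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
            C[i][j]         = M1[i][j] + M4[i][j] - M5[i][j] + M7[i][j]; // C11
            C[i][j + h]     = M3[i][j] + M5[i][j];                       // C12
            C[i + h][j]     = M2[i][j] + M4[i][j];                       // C21
            C[i + h][j + h] = M1[i][j] - M2[i][j] + M3[i][j] + M6[i][j]; // C22
        }
    return C;
}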

2. Describe the parallel implementation - the logical structure of the MPI machine you have in mind, data structures and other related issues.

Logical serial structure:

In the beginning I tried to implement the above algorithm in a serial way, and I used a dynamic algorithm to apply the structure. The structure I'll use is as follows:


Step 1:

Store the result of each addition operation that appears in the M equations in another matrix TA of size n x n, in certain steps.

I have this matrix (the original matrix):

    A11   A12
    A21   A22

TA matrix:

    A11+A22   A11+A12
    A21+A22

And store the subtractions in another matrix SA of size n x n, also in certain steps:

SA matrix:

    A12-A22
    A21-A11

(Note: I build the same matrices for B, so I also have TB and SB.)
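As an illustration of Step 1, the following is a minimal sketch (my own, not the report's add128*/sub128* routines; the exact placement of the blocks inside TA and SA is an assumption) that packs the three block sums of A into one scratch matrix TA and the two block differences into SA:

const int N = 4;                       // small example size; the report uses 128
int A[N][N];                           // original matrix
int TA[N][N] = {0}, SA[N][N] = {0};    // scratch matrices holding sums/differences

// Pack the block sums and differences of A into "certain steps" of TA and SA.
void buildTAandSA() {
    int h = N / 2;
    for (int i = 0; i < h; ++i)
        for (int j = 0; j < h; ++j) {
            TA[i][j]     = A[i][j] + A[i + h][j + h];         // A11 + A22
            TA[i][j + h] = A[i][j] + A[i][j + h];             // A11 + A12
            TA[i + h][j] = A[i + h][j] + A[i + h][j + h];     // A21 + A22
            SA[i][j]     = A[i][j + h] - A[i + h][j + h];     // A12 - A22
            SA[i][j + h] = A[i + h][j] - A[i][j];             // A21 - A11
        }
}
// The same routine applied to B gives TB and SB.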

Step 2: Then I deal with the new matrices TA, TB, SA and SB to compute the M equations.

(Note: I divide the original matrix into four blocks by applying the above technique at each level of the matrix, and I apply Strassen's equations at each level until I reach a 2x2 matrix; at that point I can apply Strassen's equations to multiply single elements.)

Step 3: To produce the result matrix C I do the following:

For example, if I have a 4x4 matrix, I divide it into 4 blocks, each of which is a 2x2 matrix. I apply the structure above to obtain the matrices TA, TB, SA and SB. Now I need to multiply (A11+A22)*(B11+B22), but each side of this product is a 2x2 matrix. So, again, I multiply the elements by applying Strassen's equations, which gives a 2x2 matrix as a result, and I store that result in another matrix. (Note: I apply this procedure to all seven products.)

Finally, I have another seven 2x2 matrices. I combine these matrices with the function finalc(), which applies the formulas

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

Therefore, I have a C result matrix of 4x4.




I apply the idea from this example to a matrix of 1024x1024.

Parallel structure idea:

As a result, there are no dependencies in my work and I can apply the same idea in parallel. I only insert some modifications in order to broadcast the result matrices: I wrote one function to convert a two-dimensional array into a one-dimensional array, and another function to convert a one-dimensional array back into a two-dimensional array. I wrote these two functions to convert matrices of size 64x64.

In simple words, I distribute the processes over the last level of the M computations, as sketched below. Therefore, if I have one process, it performs all seven equations without any need to broadcast the result matrices. If I have two processes, one of them takes three equations and the other takes four; each process applies the owner-computes rule for all the computation in its equations.

If I have four processes, three of them take two equations each and the last one takes one equation. If I have six processes, five of them take one equation each and the last one takes two equations.

Note: in the runs with 2 to 6 processes, process zero collects the results and computes the final C matrix.

If I have eight processes, seven of them take one equation each, and the last one collects the results and computes the final C result.
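The mapping of the seven M equations onto the processes can be sketched as follows. This is a minimal MPI illustration of the owner-computes distribution described above, not the actual appendix code; the round-robin rule k % workers is an assumption, chosen because it reproduces the counts listed for p = 2, 4, 6 and 8.

#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    // With 8 processes one rank is reserved purely for collecting the results.
    int workers = (p == 8) ? 7 : p;

    for (int k = 0; k < 7; ++k)              // the seven Strassen equations M1..M7
        if (rank < workers && k % workers == rank)
            printf("rank %d computes M%d\n", rank, k + 1);

    MPI_Finalize();
    return 0;
}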

3. Table(s) of results for serial program, then varying number of processors p (p = 1, 2, 4, 6, and 8), corresponding speedup and efficiency figures.

These results were taken by running on the cluster.

128x128, times in seconds:

                          Serial     1 proc      2 procs     4 procs     6 procs     8 procs
    Time                  0.109      0.085248    0.048349    0.013826    0.0136      0.001687
    Speedup (Ts/Tp)       -          1.278624    2.254453    7.883735    8.014588    64.59629
    Efficiency (S/p)      -          1.278624    1.127226    1.970934    1.335765    8.074537
    Cost (p*Tp)           -          0.085137    0.096558    0.057201    0.081622    0.013405
    Overhead (p*Tp - Ts)  -          -0.02386    -0.01244    -0.0518     -0.02738    -0.0956

Table (1): parallel run time, serial run time and the derived measures for a 128x128 matrix.
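As an example of how the derived rows are obtained, take the 2-process column: S = Ts/Tp = 0.109/0.048349 ≈ 2.254 and E = S/p = 2.254/2 ≈ 1.127, matching the table; the cost row is p*Tp and the overhead row is p*Tp - Ts, e.g. 0.096558 - 0.109 ≈ -0.01244 for p = 2.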



4. Present plots of the speedup (show also the ideal speedup on the plot).


Graph (1): the relation between the parallel time Tp and the number of processes p.

Graph (2): the relation between the speedup S and the number of processes p.

Graph (3): the relation between the number of processes p and the overhead time To.

Graph (4): the relation between the efficiency E and the number of processes p.

Graph (5): the relation between the cost p*Tp, the serial time Ts, and the number of processes p.



5. Analyze the speedup and efficiency results in comparison with the theoretical expectations.

1) The speedup increases when the number of processes increases.

2) When the number of processes increases, the efficiency also increases.

3) The highest speedup is at p = 8, and the lowest at p = 1.

4) The maximum fine grain is 8 processes, because after this number the efficiency decreases.

5) The smallest amount of overhead is at p = 8 and the greatest amount of overhead is at p = 2. At p = 8 there is no dependency and no interaction or communication between the processes; at p = 2 the overhead may be the greatest because each process has to do a large number of computations.

6) Each of the process counts is cost optimal, because the cost is less than Ts (a worked check follows this list).
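Worked check for point 6, using the values of Table (1): for p = 8 the cost is p*Tp = 0.013405, which is well below Ts = 0.109, and the same comparison holds for every other process count; this is why all the overhead values (p*Tp - Ts) in the table are negative.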




6. Add observations and comments on how the particular computer architecture influences the parallel performance. Pay attention to how to balance properly the workload per workstation and the number of processors used.

1) Observation: I think the best way to turn a serial problem into a parallel one is, whenever you can, to separate the computation so that there is no dependency or interaction between the individual computations. You can then build your parallel program safely by giving every process one or more computations (or by grouping the computations that are related to each other), so that each process is responsible for its own internal computation. Finally, gather the results in one process to produce the final result.

2) You don't need a cluster to test your work, because you can imagine the parallel situation on your own PC as a natural way of thinking in parallel.

3) When you run a parallel program that uses more than one process, you should try to make every process stop at the same time, and not leave some processes still working. You can do this by giving process zero the largest amount of computation, or by giving it the last computation and assigning the collection of the results to the last process.




7. Conclusions, ideas for possible optimizations

1) If you want a good parallel run time you must stop the screen saver and not do any other job on the computer at that time, otherwise the execution may be stopped before it completes.

2) Don't print any computation results on the output screen, because printing takes a long time and affects the execution time.

3) The best platform for running parallel programs is a multicore computer, or a cluster of computers.

4) The safe way to handle large matrices is to allocate them with pointers rather than statically, to avoid stack overflow problems (see the sketch after this list).

5) From graphs (1)-(5) and table (1), the best situation for solving the serial Strassen algorithm in parallel is mapping onto 8 processes, because it has the minimum interaction, idle time and communication (overhead).

6) I think the suitable way to convert a serial algorithm to a parallel one without overhead is to give the whole serial algorithm to one process and change the problem size, or to fix the efficiency and change both the problem size and the number of processes.
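As a sketch of conclusion 4 (my own illustration; the appendix code still uses static global arrays): allocate the large matrices on the heap so that, for example, a 1024x1024 int matrix does not overflow the stack.

#include <vector>

int main() {
    const int n = 1024;
    // One contiguous heap allocation per matrix, indexed as A[i*n + j].
    std::vector<int> A(n * n), B(n * n), C(n * n, 0);

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            A[i * n + j] = i + 1;   // same initialization as in the appendix code
            B[i * n + j] = i + 1;
        }
    return 0;
}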










Appendix



1- Main program in 8 processes:


int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

    if (numtasks == 8) {
        double start, end;
        start = MPI_Wtime();

        // initialize the inputs A128 and B128 and clear the work matrices
        for (int i = 0; i < size128; i++)
            for (int j = 0; j < size128; j++)
            {
                A128[i][j] = i + 1;
                B128[i][j] = i + 1;
                C128[i][j] = 0;
                TA128[i][j] = 0;
                TB128[i][j] = 0;
                SA128[i][j] = 0;
                SB128[i][j] = 0;
            }

        source = 0;
        sendcount = 1;
        recvcount = 1;

        strass128(A128, B128, 0, 0, size128, size128);

        // mpi_broadcast: every worker rank broadcasts its flattened 64x64
        // result, one element per call, with itself as the root
        for (int i = 0; i < size64 * size64; i++)
        {
            if (rank == 1) {
                MPI_Bcast(&g1[i], sendcount, MPI_INT, 1, MPI_COMM_WORLD);
            }
            if (rank == 2) {
                MPI_Bcast(&g2[i], sendcount, MPI_INT, 2, MPI_COMM_WORLD);
            }
            if (rank == 3) {
                MPI_Bcast(&g3[i], sendcount, MPI_INT, 3, MPI_COMM_WORLD);
            }
            if (rank == 4) {
                MPI_Bcast(&g4[i], sendcount, MPI_INT, 4, MPI_COMM_WORLD);
            }
            if (rank == 5) {
                MPI_Bcast(&g5[i], sendcount, MPI_INT, 5, MPI_COMM_WORLD);
            }
            if (rank == 6) {
                MPI_Bcast(&g6[i], sendcount, MPI_INT, 6, MPI_COMM_WORLD);
            }
            if (rank == 0) {
                MPI_Bcast(&g7[i], sendcount, MPI_INT, 0, MPI_COMM_WORLD);
            }
        }

        // summation: rank 7 unpacks the received results back to 2D
        // and assembles the final C matrix
        if (rank == 7)
        {
            conv1281to2(gp18, g1, 64);
            conv1281to2(gp28, g2, 64);
            conv1281to2(gp38, g3, 64);
            conv1281to2(gp48, g4, 64);
            conv1281to2(gp58, g5, 64);
            conv1281to2(gp68, g6, 64);
            //conv1281to2(gp78, g7, 64);

            finalgc128(C128);

            cout << endl << "C=\n";
            //wrc128(C128);
        }

        end = MPI_Wtime();
        double ftime;
        ftime = (end - start);
        cout << endl << "parallel8 time = " << ftime << endl;
    }
    else
        printf("Must specify %d processors. Terminating.\n", SIZE);

    MPI_Finalize();
}


////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////


2- strass128() in parallel:


void strass128(int A[size128][size128], int B[size128][size128], int ai, int aj, int bi, int bj)
{
    // form the block sums TA128, TB128 and block differences SA128, SB128
    add128A11A22(A, TA128, 0, 0, size128/2, size128/2);
    add128A11A12(A, TA128, 0, size128/2);
    add128A21A22(A, TA128, 0, size128/2);
    add128A11A22(B, TB128, 0, 0, size128/2, size128/2);
    add128A11A12(B, TB128, 0, size128/2);
    add128A21A22(B, TB128, 0, size128/2);
    sub128A12A22(A, SA128, 0, size128);
    sub128A21A11(A, SA128, 0, size128/2);
    sub128A12A22(B, SB128, 0, size128);
    sub128A21A11(B, SB128, 0, size128/2);

    // owner-computes: each rank evaluates its assigned 64x64 Strassen product,
    // combines it with finalgc64() and flattens it for broadcasting
    if (rank == 1) {
        strass64(TA128, TB128, 0, 0, 63, 63, 0, 0, 63, 63);
        finalgc64(gp18);
        conv1282to1(g1, gp18);
        tim++;
        cout << "\ntim= " << tim;
        // cout<<endl<<"\ngp1=\n";
        // wrc4(gp1);
    }
    if (rank == 2) {
        strass64(TA128, B128, 64, 0, 127, 63, 0, 0, 63, 63);
        finalgc64(gp28);
        conv1282to1(g2, gp28);
        tim++;
        cout << "\ntim= " << tim;
    }
    // cout<<endl<<"\ngp2=\n";
    // wrc4(gp2);
    if (rank == 3) {
        strass64(A128, SB128, 0, 0, 63, 63, 0, 0, 63, 63);
        finalgc64(gp38);
        conv1282to1(g3, gp38);
        tim++;
        cout << "\ntim= " << tim;
    }
    // cout<<endl<<"\ngp3=\n";
    // wrc4(gp3);
    if (rank == 4) {
        strass64(A128, SB128, 64, 64, 127, 127, 64, 0, 127, 63);
        finalgc64(gp48);
        conv1282to1(g4, gp48);
        tim++;
        cout << "\ntim= " << tim;
    }
    // cout<<endl<<"gp4=\n";
    // wrc4(gp4);
    if (rank == 5) {
        strass64(TA128, B128, 0, 64, 63, 127, 64, 64, 127, 127);
        finalgc64(gp58);
        conv1282to1(g5, gp58);
        tim++;
        cout << "\ntim= " << tim;
    }
    // cout<<endl<<"gp5=\n";
    // wrc4(gp5);
    if (rank == 6) {
        strass64(SA128, TB128, 64, 0, 127, 63, 0, 64, 63, 127);
        finalgc64(gp68);
        conv1282to1(g6, gp68);
        tim++;
        cout << "\ntim= " << tim;
    }
    if (rank == 0) {
        // cout<<endl<<"gp6=\n";
        // wrc4(gp6);
        strass64(SA128, TB128, 0, 0, 63, 63, 64, 0, 127, 63);
        finalgc64(gp78);
        conv1282to1(g7, gp78);
        tim++;
        cout << "\ntim= " << tim;
    }
    // cout<<endl<<"gp7=\n";
    // wrc4(gp7);
}

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

3- convert from 2 dim to 1 dim:


// flatten a 64x64 matrix row by row into a 1D buffer for sending
void conv642to1(int send[size64*size64], int A[][size64])
{
    int v = 0;
    for (int i = 0; i < size64; i++)
        for (int j = 0; j < size64; j++)
        {
            send[v] = A[i][j];
            // sendbuf2[v]=B[i][j];
            v++;
        }
}

/////////////////////////////////////////////////////////////////////


4- convert from 1 dim to 2 dim:


// unpack a flat buffer of si*si elements back into a 2D matrix
void conv641to2(int A[][size64], int res[size64*size64], int si)
{
    int c = 0;
    int j = 0;
    for (int i = 0; i < si*si; i++)
        if (j < si)
        {
            A[c][j] = res[i];
            j++;
            //B[c][j]=recvbuf2[i];
        }
        else {
            j = 0;
            c++;
            A[c][j] = res[i];
            j++;
        }
}

/////////////////////////////////////////////////////////////////////

5- the computation of 2x2 matrix:


// the seven Strassen products M1..M7 for a 2x2 block, stored in P[0]..P[6]
void processall(int A[][s], int B[][s], int ai, int aj, int aii, int ajj,
                int bi, int bj, int bii, int bjj)
{
    P[0] = ((A[ai][aj] + A[aii][ajj]) * (B[bi][bj] + B[bii][bjj]));   // M1
    P[1] = ((A[aii][aj] + A[aii][ajj]) * B[bi][bj]);                  // M2
    P[2] = ((B[bi][bjj] - B[bii][bjj]) * A[ai][aj]);                  // M3
    P[3] = (A[aii][ajj] * (B[bii][bj] - B[bi][bj]));                  // M4
    P[4] = ((A[ai][aj] + A[ai][ajj]) * B[bii][bjj]);                  // M5
    P[5] = ((A[aii][aj] - A[ai][aj]) * (B[bi][bj] + B[bi][bjj]));     // M6
    P[6] = ((A[ai][ajj] - A[aii][ajj]) * (B[bii][bj] + B[bii][bjj])); // M7
}

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////









6- finalc() function:


void finalgc128(int c[][size128])
{   //C11;
    int i, j;
    int k = 0;
    int r;
    for (i = 0; i < 64; i++)
    {
        r = 0;
        for (j = 0; j < 64; j++)
        {
            c[i][j] = gp18[k][r] + gp48[k][r] - gp58[k][r] + gp78[k][r];
            r++;
        }
        k++;
    }
    //C12;
    for (i = 0; i < 64; i++)
    {
        k = 0;
        for (j = 64; j < size128; j++)
        {
            c[i][j] = gp38[i][k] + gp58[i][k];
            k++;
        }
    }
    //C21;
    k = 0;
    for (i = 64; i < 128; i++)
    {
        for (j = 0; j < 64; j++)
            c[i][j] = gp28[k][j] + gp48[k][j];
        k++;
    }
    //C22;
    k = 0;
    for (i = 64; i < 128; i++)
    {
        r = 0;
        for (j = 64; j < 128; j++)
        {
            c[i][j] = gp18[k][r] + gp38[k][r] - gp28[k][r] + gp68[k][r];
            r++;
        }
        k++;
    }
}

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////












7- some of the output screen:


a) Part of the C matrix (128x128) and the parallel time with 8 processes.

Problem 1) Whenever I come to do something, the program has already finished and the same result occurs.

Problem 2) There is a huge difference in execution time between running without printing the output and running with printing the output.

On the cluster: