Parallelizing C Programs
Using Cilk

Mahdi Javadi


Cilk Language


Cilk is a language for multithreaded parallel programming based on C.

The programmer does not have to worry about scheduling the computation to run efficiently; the Cilk runtime takes care of that.

There are three additional keywords: cilk, spawn and sync.


Example: Fibonacci

Serial C version:

    int fib(int n)
    {
        int x, y;
        if (n < 2) return n;
        x = fib(n - 1);
        y = fib(n - 2);
        return x + y;
    }

Cilk version:

    cilk int fib(int n)
    {
        int x, y;
        if (n < 2) return n;
        x = spawn fib(n - 1);   /* may run in parallel with the next spawn */
        y = spawn fib(n - 2);
        sync;                   /* wait for both spawned calls to finish */
        return x + y;
    }
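A minimal driver for fib.cilk might look like the sketch below (MIT Cilk syntax; the argument handling is an assumption added here, not part of the original slides, and it assumes the runtime removes its own options such as --nproc from argv):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of a driver: assumes the Cilk runtime strips its own
     * options (such as --nproc 4) before main sees argv. */
    cilk int main(int argc, char *argv[])
    {
        int n, result;
        if (argc < 2) {
            fprintf(stderr, "usage: fib <n>\n");
            return 1;
        }
        n = atoi(argv[1]);
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }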

Performance Measures


T_P = execution time on P processors.

T_1 is called work.

T_∞ is called span.

Obvious lower bounds:

    T_P ≥ T_1 / P
    T_P ≥ T_∞

p = T_1 / T_∞ is called parallelism.

Using more than p processors makes little sense.
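As an illustration (the numbers are made up, not from the slides): if T_1 = 120 s and T_∞ = 3 s, then the parallelism is p = T_1 / T_∞ = 40, and on P = 4 processors the lower bound is T_4 ≥ T_1 / P = 30 s.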

Cilk Compiler


The file extension should be “.cilk”.

Example:

    > cilkc -O3 fib.cilk -o fib

To find the 30th Fibonacci number using 4 CPUs:

    > fib --nproc 4 30

To collect timings of each processor and compute the span (not efficient):

    > cilkc -cilk-profile -cilk-span -O3 fib.cilk -o fib




Example: Matrix Multiplication


Suppose we want to multiply two n by n matrices, C = A . B.

We can formulate the problem recursively by splitting each matrix into four (n/2) by (n/2) blocks:

    ( C11  C12 )     ( A11  A12 )     ( B11  B12 )
    (          )  =  (          )  .  (          )
    ( C21  C22 )     ( A21  A22 )     ( B21  B22 )

    ( C11  C12 )     ( A11 B11 + A12 B21    A11 B12 + A12 B22 )
    (          )  =  (                                        )
    ( C21  C22 )     ( A21 B11 + A22 B21    A21 B12 + A22 B22 )

i.e. one n by n matrix multiplication reduces to 8 multiplications and 4 additions of (n/2) by (n/2) submatrices.
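In an actual implementation the submatrices need not be copied; one common representation (an assumption here, since the slides do not show the data layout) keeps the full matrix row-major with a leading dimension ld and passes a pointer to the top-left element of each quadrant:

    #include <stddef.h>

    /* Sketch: pointer to the (i,j) quadrant (i, j in {0,1}) of an
     * n-by-n matrix stored row-major with leading dimension ld. */
    static double *quadrant(double *M, int ld, int n, int i, int j)
    {
        return M + (size_t)i * (n / 2) * ld + (size_t)j * (n / 2);
    }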

Multiplication Procedure

Mult(C, A, B, n)            /* C = A . B;  T is an n by n temporary */

    if (n == 1)
        C[1,1] = A[1,1] . B[1,1];
    else
    {
        spawn Mult(C11, A11, B11, n/2);
        spawn Mult(C12, A11, B12, n/2);
        spawn Mult(C21, A21, B11, n/2);
        spawn Mult(C22, A21, B12, n/2);
        spawn Mult(T11, A12, B21, n/2);
        spawn Mult(T12, A12, B22, n/2);
        spawn Mult(T21, A22, B21, n/2);
        spawn Mult(T22, A22, B22, n/2);
        sync;
        Add(C, T, n);       /* C = C + T */
    }

Addition Procedure

Add(C, T, n)                /* C = C + T */

    if (n == 1)
        C[1,1] = C[1,1] + T[1,1];
    else
    {
        spawn Add(C11, T11, n/2);
        spawn Add(C12, T12, n/2);
        spawn Add(C21, T21, n/2);
        spawn Add(C22, T22, n/2);
        sync;
    }

T_1 (work) for addition = O(n^2).

T_∞ (span) for addition = O(log(n)).
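A concrete Cilk version of Add might look like the sketch below (the flat row-major layout with leading dimension ld and the serial base case are assumptions made here for illustration; the slides recurse all the way down to n = 1):

    cilk void Add(double *C, double *T, int ld, int n)
    {
        if (n <= 16) {
            /* serial base case: add element-wise (cut-off chosen arbitrarily) */
            int i, j;
            for (i = 0; i < n; i++)
                for (j = 0; j < n; j++)
                    C[i * ld + j] += T[i * ld + j];
        } else {
            int h = n / 2;
            spawn Add(C,              T,              ld, h);   /* C11 += T11 */
            spawn Add(C + h,          T + h,          ld, h);   /* C12 += T12 */
            spawn Add(C + h * ld,     T + h * ld,     ld, h);   /* C21 += T21 */
            spawn Add(C + h * ld + h, T + h * ld + h, ld, h);   /* C22 += T22 */
            sync;
        }
    }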

Complexity of Multiplication


We know that matrix multiplication is O(n^3), hence T_1 (work) for multiplication = O(n^3).

T_∞ (span): the spawned multiplications run in parallel and the final Add has span O(log(n)), so

    M_∞(n) = M_∞(n/2) + O(log(n)) = O(log^2(n)).

p = T_1 / T_∞ = O(n^3) / O(log^2(n)).

To multiply 1000 by 1000 matrices: p ≈ 10^9 / 100 = 10^7 (a lot of CPUs!).

Discrete Fourier Transform

Sequential version:

    DFT(n, w, p, ...)
        ...
        t = w^2 mod p;
        DFT(n/2, t, p, ...);
        DFT(n/2, t, p, ...);

        w1 = 1;
        for (i = 0; i < n/2; i++)
        {
            a[i] = ...
            w1 = w1 . w mod p;
        }

Cilk version:

    cilk DFT(n, w, p, ...)
        ...
        t = w^2 mod p;
        spawn DFT(n/2, t, p, ...);
        spawn DFT(n/2, t, p, ...);
        sync;

        spawn ParCom(n, a, p, 1, ...);

    cilk ParCom(n, a, p, m, ...)
        if (n <= 512) ...
        spawn ParCom(n/2, a, p, 1, ...);
        m' = m . w^(n/2) mod p;
        spawn ParCom(n/2, a + n/2, p, m', ...);
        sync;


Complexity of ParCom


The sequential combining does n/2 multiplications.

T_∞ (span) for ParCom:

    T_∞(n) = T_∞(n/2) + O(log(n))
    T_∞(n) = O(log^2(n)).

p = O(n / log^2(n)).
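The O(log(n)) term in the recurrence is consistent with computing w^(n/2) mod p by repeated squaring; a standard routine is sketched below (a generic sketch, not taken from the course code; the names and 64-bit types are assumptions):

    #include <stdint.h>

    /* Modular exponentiation by repeated squaring: returns w^e mod p
     * using O(log e) multiplications. Assumes p < 2^32 so each product
     * fits in 64 bits. */
    static uint64_t pow_mod(uint64_t w, uint64_t e, uint64_t p)
    {
        uint64_t r = 1 % p;
        w %= p;
        while (e > 0) {
            if (e & 1)
                r = (r * w) % p;
            w = (w * w) % p;
            e >>= 1;
        }
        return r;
    }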





We run the FFT on “stan”, which has 4 CPUs.

Thus parallelism much larger than 4 cannot be exploited, so we cut off the parallelism at some level of recursion (the n <= 512 base case in ParCom, below which the combining runs sequentially) to reduce spawn overhead and speed up the program.
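The same cut-off idea applies to the earlier Fibonacci example; a sketch (the threshold of 20 and the helper fib_serial are illustrative choices, not from the slides):

    /* Below the threshold, fall back to a plain C recursion so that
     * small subproblems do not pay the spawn overhead. */
    int fib_serial(int n)
    {
        return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
    }

    cilk int fib(int n)
    {
        int x, y;
        if (n < 20)            /* illustrative cut-off */
            return fib_serial(n);
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return x + y;
    }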

Timings

    # processors    Par time (ms)    Speedup
    4               32837            3.77
    3               44315            2.79
    2               66262            1.87
    1               124006           0.998

Sequential FFT: 123789 ms
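Speedup here is the sequential time divided by the parallel time, e.g. 123789 / 32837 ≈ 3.77 on 4 processors; the slight slowdown on 1 processor (0.998) is presumably the small overhead of the Cilk spawns.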