# Parallelizing C Programs

1 Dec 2013

Using Cilk

Cilk Language

Cilk is a language for multithreaded parallel programming based on C.

The programmer does not need to worry about scheduling the computation to run efficiently; the Cilk runtime system takes care of it.

Cilk extends C with three basic keywords: cilk, spawn and sync.

Example: Fibonacci

    int fib (int n)
    {
        int x, y;
        if (n < 2) return n;
        x = fib (n - 1);
        y = fib (n - 2);
        return x + y;
    }

    cilk int fib (int n)
    {
        int x, y;
        if (n < 2) return n;
        x = spawn fib (n - 1);
        y = spawn fib (n - 2);
        sync;
        return x + y;
    }
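For readers without a Cilk compiler, the spawn/sync pattern can be approximated in plain C with OpenMP tasks. This is only an analogy, not how Cilk itself is implemented: `task` stands in for spawn and `taskwait` for sync. The function names and the cutoff value of 20 are assumptions made for this sketch.

```c
#include <assert.h>

/* Plain serial version, used below the cutoff. */
static long fib_serial(int n)
{
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

/* Task version: "task" plays the role of spawn, "taskwait" of sync. */
static long fib_task(int n)
{
    long x, y;
    if (n < 20) return fib_serial(n);   /* hypothetical cutoff */
    #pragma omp task shared(x)
    x = fib_task(n - 1);
    #pragma omp task shared(y)
    y = fib_task(n - 2);
    #pragma omp taskwait
    return x + y;
}

/* Entry point: one thread creates the root task, the team executes tasks. */
long fib(int n)
{
    long result;
    assert(n >= 0);
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}
```

Built with an OpenMP flag such as `-fopenmp` the tasks may run in parallel; without it the pragmas are ignored and the code runs as the serial program, which mirrors the way a Cilk program reduces to C when its keywords are removed.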

Performance Measures

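$T_P$ = execution time on $P$ processors.

$T_1$, the execution time on one processor, is called work.

$T_\infty$, the execution time with an unlimited number of processors, is called span.

Obvious lower bounds:

$T_P \ge T_1 / P$

$T_P \ge T_\infty$

$p = T_1 / T_\infty$ is called parallelism.

Using more than $p$ processors makes little sense.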
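The two lower bounds and the parallelism can be combined into a small calculator. The following C sketch uses function names invented for illustration; it only restates the formulas above and says nothing about what a real scheduler achieves.

```c
#include <assert.h>

/* Parallelism p = T1 / Tinf. */
double parallelism(double t1, double tinf)
{
    assert(t1 > 0.0 && tinf > 0.0);
    return t1 / tinf;
}

/* Smallest T_P allowed by both lower bounds: max(T1 / P, Tinf). */
double tp_lower_bound(double t1, double tinf, int procs)
{
    assert(procs >= 1);
    double work_bound = t1 / procs;
    return work_bound > tinf ? work_bound : tinf;
}

/* Largest possible speedup T1 / T_P on procs processors. */
double max_speedup(double t1, double tinf, int procs)
{
    return t1 / tp_lower_bound(t1, tinf, procs);
}
```

With a work of 1000 and a span of 10 the parallelism is 100. On 4 processors the bound on the running time is 250, but on 200 processors it stays at 10, so the speedup cannot exceed 100 however many processors are added.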

Cilk Compiler

The file extension should be “.cilk”.

Example:

    > cilkc -O3 fib.cilk -o fib

To find the 30th Fibonacci number using 4 CPUs:

    > fib --nproc 4 30

To collect timings of each processor and compute the span (not efficient):

    > cilkc -cilk-profile -cilk-span -O3 fib.cilk -o fib

Example: Matrix Multiplication

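Suppose we want to multiply two $n$ by $n$ matrices:

$$
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\cdot
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
$$

We can recursively formulate the problem:

$$
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\
A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}
\end{pmatrix}
$$

i.e. one $n$ by $n$ matrix multiplication reduces to 8 multiplications and 4 additions of $(n/2)$ by $(n/2)$ submatrices.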

Multiplication Procedure

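    Mult(C, A, B, n)
        if (n = 1) C[1,1] = A[1,1] . B[1,1]
        else
        {
            spawn Mult(C11, A11, B11, n/2);
            spawn Mult(C12, A11, B12, n/2);
            spawn Mult(C21, A21, B11, n/2);
            spawn Mult(C22, A21, B12, n/2);
            spawn Mult(T11, A12, B21, n/2);
            spawn Mult(T12, A12, B22, n/2);
            spawn Mult(T21, A22, B21, n/2);
            spawn Mult(T22, A22, B22, n/2);
            sync;
            spawn Add(C, T, n);
        }

    Add(C, T, n)
        if (n = 1) C[1,1] = C[1,1] + T[1,1];
        else
        {
            spawn Add(C11, T11, n/2);
            spawn Add(C12, T12, n/2);
            spawn Add(C21, T21, n/2);
            spawn Add(C22, T22, n/2);
            sync;
        }

$T_1$ (work) for addition = $O(n^2)$.

$T_\infty$ (span) for addition = $O(\log(n))$.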

Complexity of Multiplication

The standard matrix multiplication algorithm is $O(n^3)$, hence $T_1$ (work) for multiplication = $O(n^3)$.

$T_\infty$ (span) for multiplication: $M_\infty(n) = M_\infty(n/2) + O(\log(n)) = O(\log^2(n))$.

$p = T_1 / T_\infty = O(n^3) / O(\log^2(n))$.

To multiply 1000 by 1000 matrices: $p \approx 10^7$ (a lot of CPUs!)
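The span bound can be checked by unrolling the recurrence. The derivation below is a sketch that assumes $n$ is a power of two, takes logarithms base 2, ignores constant factors, and reads the $O(\log n)$ term as the span of the matrix addition:

```latex
\begin{align*}
M_\infty(n) &= M_\infty(n/2) + O(\log n) \\
            &= O(\log n) + O(\log(n/2)) + O(\log(n/4)) + \dots + O(1) \\
            &= O\!\left(\sum_{k=1}^{\log_2 n} k\right) = O(\log^2 n), \\[4pt]
p = \frac{T_1}{T_\infty} &= \frac{O(n^3)}{O(\log^2 n)}, \qquad
n = 1000:\quad \frac{1000^3}{(\log_2 1000)^2} \approx \frac{10^9}{10^2} = 10^7 .
\end{align*}
```

Since $\log_2 1000 \approx 9.97$, the denominator is close to 100, which gives the figure of roughly $10^7$.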

Discrete Fourier Transform

    DFT(n, w, p, …)
        ...
        t = w^2 mod p
        DFT(n/2, t, p, …);
        DFT(n/2, t, p, …);
        w1 = 1;
        for (i = 0; i < n/2; i++)
        {
            a[i] = …
            w1 = w1 . w mod p;
        }

    cilk DFT(n, w, p, …)
        ...
        t = w^2 mod p
        spawn DFT(n/2, t, p, …);
        spawn DFT(n/2, t, p, …);
        sync;
        spawn ParCom(n, a, p, 1, …);

    cilk ParCom(n, a, p, m, …)
        if (n <= 512) …
        spawn ParCom(n/2, a, p, m, …);
        m' = m . w^(n/2) mod p;
        spawn ParCom(n/2, a + n/2, p, m', …);
        sync;
Complexity of ParCom

The sequential combining does $n/2$ multiplications.

$T_\infty$ (span) for ParCom:

$T_\infty(n) = T_\infty(n/2) + O(\log(n))$, so $T_\infty(n) = O(\log^2(n))$.

$p = O(n / \log^2(n))$.

We run FFT on “stan”, which has 4 CPUs.

Thus $p > 4$ does not make sense, so we cut off the parallelism at some level of recursion to speed up the program.

Timings

| # processors | Parallel time (ms) | Speedup |
|---|---|---|
| 4 | 32837 | 3.77 |
| 3 | 44315 | 2.79 |
| 2 | 66262 | 1.87 |
| 1 | 124006 | 0.998 |

Sequential FFT: 123789 ms
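The speedup figures are the sequential time divided by the parallel time. A small C sketch of that arithmetic follows; the function names are invented for illustration, and the efficiency measure (speedup per processor) is an addition that the slides do not report.

```c
#include <assert.h>

/* Speedup of a parallel run relative to the sequential program. */
double speedup(double seq_ms, double par_ms)
{
    assert(seq_ms > 0.0 && par_ms > 0.0);
    return seq_ms / par_ms;
}

/* Efficiency: speedup per processor (1.0 would be perfect scaling). */
double efficiency(double seq_ms, double par_ms, int procs)
{
    assert(procs >= 1);
    return speedup(seq_ms, par_ms) / procs;
}
```

For the 4-processor run, `speedup(123789, 32837)` comes to about 3.77, which corresponds to an efficiency of roughly 0.94.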