Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems

AACEC 2010, Heraklion, Crete, Greece

Jakob Siegel¹, Oreste Villa², Sriram Krishnamoorthy², Antonino Tumeo² and Xiaoming Li¹

¹ University of Delaware
² Pacific Northwest National Laboratory

September 24th, 2010

Overview

Introduction
Cluster level
Node level
Results
Conclusion
Future Work


Sparse Matrix-Matrix Multiply - Challenges

The efficient implementation of sparse matrix-matrix multiplications on HPC systems poses several challenges:

Large size of input matrices
  e.g. 10^6 × 10^6 with 30×10^6 nonzero elements
  Compressed representation
  Partitioning
Density of the output matrices
Load balancing
  Large differences in density and computation times

Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection, available online at: http://www.cise.ufl.edu/davis/sparse.

Sparse Matrix-Matrix Multiply

Cross-cluster implementation:
  Partitioning
  Data distribution
  Load balancing
  Communication/scaling
  Result handling

In-node implementation:
  Multiple efficient SpGEMM algorithms
  CPU/GPU implementation
  Double buffering
  Exploiting heterogeneity



Sparse Matrix-Matrix Multiply - Cluster level

Blocking
  Block size depends on the sparsity of the input matrices and the number of processing elements:
  NumOfBlocksX × NumOfBlocksY >> NumOfProcessingElements
  (a small sketch of this sizing rule follows below)

Data layout
  What format and order to use to allow for easy and fast access

Communication and storage implemented using Global Arrays (GA)
  GA offers a set of primitives for non-blocking operations and for contiguous and non-contiguous data transfers.
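To make the sizing rule above concrete, here is a minimal host-side sketch. It is an illustration only: the matrix dimension, cluster size and oversubscription factor are assumed values, and a real choice would also take the sparsity pattern of the inputs into account.

```
// Illustration only: picking a tile grid that satisfies
// NumOfBlocksX * NumOfBlocksY >> NumOfProcessingElements.
#include <cmath>
#include <cstdio>

int main() {
    const long n       = 1000000;  // matrix dimension, e.g. 10^6 x 10^6 (assumed)
    const int  P       = 128;      // number of processing elements (assumed)
    const int  oversub = 16;       // want at least ~16 tasks per processing element (assumed)

    // Smallest square tile grid with at least oversub * P blocks.
    int  blocksPerDim = (int)std::ceil(std::sqrt((double)oversub * P));
    long blockSize    = (n + blocksPerDim - 1) / blocksPerDim;

    printf("%d x %d blocks, each about %ld x %ld\n",
           blocksPerDim, blocksPerDim, blockSize, blockSize);
    return 0;
}
```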

Sparse Matrix-Matrix Multiply - Data representation and Tiling

C = A × B (blocked matrices A, B, C)

Blocked matrix representation: each block is stored in CSR* form (a small sketch building these arrays follows below).

Example block:

 1 -1  0  0  0
 0  5  0  0  0
 0  0  4  6  0
-2  0  2  7  0
 0  0  0  0  5

data (1, -1, 5, 4, 6, -2, 2, 7, 5)
col  (0, 1, 1, 2, 3, 0, 2, 3, 4)
row  (0, 2, 3, 5, 8, 9)

*CSR: Compressed Sparse Row
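As a concreteness check of the CSR layout above, this small host-side sketch (not from the slides) builds the data/col/row arrays for the 5×5 example block and prints them; the output matches the arrays shown on the slide.

```
// Build CSR arrays (data, col, row) for the example block above.
#include <cstdio>
#include <vector>

int main() {
    const int n = 5;
    const double dense[5][5] = {
        { 1, -1, 0, 0, 0},
        { 0,  5, 0, 0, 0},
        { 0,  0, 4, 6, 0},
        {-2,  0, 2, 7, 0},
        { 0,  0, 0, 0, 5}};

    std::vector<double> data;      // nonzero values, row by row
    std::vector<int>    col;       // column index of each nonzero
    std::vector<int>    row(1, 0); // row[i] = index of the first nonzero of row i

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j)
            if (dense[i][j] != 0.0) { data.push_back(dense[i][j]); col.push_back(j); }
        row.push_back((int)data.size());
    }

    // Prints: data(1 -1 5 4 6 -2 2 7 5), col(0 1 1 2 3 0 2 3 4), row(0 2 3 5 8 9)
    for (double v : data) printf("%g ", v);
    printf("\n");
    for (int c : col) printf("%d ", c);
    printf("\n");
    for (int r : row) printf("%d ", r);
    printf("\n");
    return 0;
}
```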

Sparse Matrix-Matrix Multiply - Data representation and Tiling

Matrix A (C = A × B):

The single CSR tiles are stored serialized into the GA space (the data, column and row arrays of e.g. Tile 0, Tile 2, ... laid out one after another; a sketch follows below).

Tile sizes and offsets are stored in a 2D array.

Tiles with 0 nonzero elements are not represented in the GA dataset.
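A hedged sketch of the serialization idea described above (an illustration, not the authors' code): non-empty CSR tiles are appended to one flat buffer, standing in for the GA space, while a per-tile table records offset and size; empty tiles get no storage.

```
// Pack non-empty CSR tiles into a flat buffer with a 2D-style offset/size table.
#include <cstdio>
#include <vector>

struct CsrTile { std::vector<double> data; std::vector<int> col, row; };
struct TileInfo { long offset = -1; long nnz = 0; };   // offset < 0 marks an empty tile

int main() {
    const int nTiles = 2;                               // purely illustrative tiling
    std::vector<CsrTile> tiles(nTiles);
    tiles[0] = { {1, -1, 5}, {0, 1, 1}, {0, 2, 3} };    // a small non-empty tile
    // tiles[1] stays empty and will not be serialized

    std::vector<double>   flatData;                     // serialized values of all tiles
    std::vector<TileInfo> table(nTiles);                // per-tile offset/size table

    for (int t = 0; t < nTiles; ++t) {
        if (tiles[t].data.empty()) continue;            // empty tiles are skipped entirely
        table[t].offset = (long)flatData.size();
        table[t].nnz    = (long)tiles[t].data.size();
        flatData.insert(flatData.end(), tiles[t].data.begin(), tiles[t].data.end());
        // (the col/row arrays would be packed alongside in the real layout)
    }

    for (int t = 0; t < nTiles; ++t)
        printf("tile %d: offset=%ld nnz=%ld\n", t, table[t].offset, table[t].nnz);
    return 0;
}
```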





Sparse Matrix-Matrix Multiply - Data representation and Tiling

Matrix B:

Tiles are serialized in a transposed way.

Depending on the algorithm used to calculate the single tiles, the data in the tiles themselves can be stored transposed or not transposed (a conversion sketch follows below).

For the Gustavson algorithm the representation of the data in the tiles themselves is not transposed.

Example tile, not transposed:          and transposed:
  1 -1  0  0  0                          1  0  0 -2  0
  0  5  0  0  0                         -1  5  0  0  0
  0  0  4  6  0                          0  0  4  2  0
 -2  0  2  7  0                          0  0  6  7  0
  0  0  0  0  5                          0  0  0  0  5
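One way to read "transposed" tile storage is keeping the CSR form of the tile's transpose (equivalently, the CSC form of the original). The following sketch is an assumption for illustration rather than the authors' layout: it converts the example tile's CSR arrays into that transposed form.

```
// Convert the CSR arrays of an n x n tile into the CSR form of its transpose (CSC of the original).
#include <cstdio>
#include <vector>

int main() {
    const int n = 5;
    // CSR of the non-transposed example tile from the slide
    std::vector<double> data = {1, -1, 5, 4, 6, -2, 2, 7, 5};
    std::vector<int>    col  = {0, 1, 1, 2, 3, 0, 2, 3, 4};
    std::vector<int>    row  = {0, 2, 3, 5, 8, 9};

    // Count nonzeros per column, then prefix-sum to get the transposed row pointer.
    std::vector<int> trow(n + 1, 0);
    for (int c : col) ++trow[c + 1];
    for (int j = 0; j < n; ++j) trow[j + 1] += trow[j];

    std::vector<double> tdata(data.size());
    std::vector<int>    tcol(data.size());
    std::vector<int>    next(trow.begin(), trow.end() - 1);  // insertion cursor per column
    for (int i = 0; i < n; ++i)
        for (int k = row[i]; k < row[i + 1]; ++k) {
            int p = next[col[k]]++;
            tdata[p] = data[k];   // the value moves to row col[k] of the transpose
            tcol[p]  = i;         // its column in the transpose is the original row
        }

    // Prints the data array of the transposed tile: 1 -2 -1 5 4 2 6 7 5
    for (double v : tdata) printf("%g ", v);
    printf("\n");
    return 0;
}
```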

Sparse Matrix-Matrix Multiply - Tasking and Data Movement

Each block in C represents a task (tasks 0, 1, 2, ..., 8 in the example tiling).

Nodes grab tasks, and the additional data they need, when they have computational power available (see the task-grabbing sketch below).

Results are stored locally.

The metadata of the result blocks in each node is distributed to determine the offsets of the tiles in the GA space.

Tiles are put into the GA space in the right order.

[Figure: the tile grid of C numbered 0-8 and the tasks dynamically assigned to nodes 0 to N-1.]
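A minimal sketch of the task-grabbing pattern described above. The real framework coordinates this across nodes (for example through a shared counter held in the GA space); here a std::atomic counter and threads stand in for nodes purely to keep the example self-contained, so everything below is an assumption for illustration.

```
// Dynamic task grabbing from a shared counter: workers pull C blocks until none are left.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> nextTask{0};

void worker(int id, int numTasks) {
    for (;;) {
        int t = nextTask.fetch_add(1);        // grab the next unprocessed C block
        if (t >= numTasks) break;             // no tasks left
        // fetch the stripe of A and the stripe of B needed for block t, multiply, store locally
        printf("worker %d computes C block %d\n", id, t);
    }
}

int main() {
    const int numTasks = 9;                   // e.g. a 3x3 tiling of C
    std::vector<std::thread> nodes;
    for (int id = 0; id < 4; ++id) nodes.emplace_back(worker, id, numTasks);
    for (auto& n : nodes) n.join();
    return 0;
}
```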

Sparse Matrix-Matrix Multiply - Tasking and Data Movement

Each node fetches the data needed by the task it is going to handle:

E.g. here, for task/tile 5 the node has to load the data of stripes s_a = 1 of A and s_b = 0 of B.

[Figure: C = A × B with the stripes 0 to S_a - 1 of A and 0 to S_b - 1 of B; the highlighted tile 5 of C requires one stripe of A and one stripe of B.]
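A small sketch of the data-fetch rule above: once a task is identified with a block (i, j) of C, the node needs block row i of A (stripe s_a) and block column j of B (stripe s_b). The helper name and the example coordinates are illustrative assumptions; the slide's own tile numbering is not given in the text.

```
// Map a C block's coordinates to the stripes of A and B it requires.
#include <cstdio>

struct Stripes { int sa, sb; };

Stripes stripesForBlock(int i, int j) {   // (i, j) = block coordinates of the C tile
    return { i, j };                      // s_a = block row, s_b = block column
}

int main() {
    Stripes s = stripesForBlock(1, 0);    // e.g. the C block in block row 1, block column 0
    printf("needs stripe s_a = %d of A and stripe s_b = %d of B\n", s.sa, s.sb);
    return 0;
}
```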



Sparse Matrix-Matrix Multiply - Gustavson

The algorithm is based on the equation

    $c_{i*} = \sum_{v:\, a_{iv} \neq 0} a_{iv}\, b_{v*}$   for $0 \le i < p$

i.e. the i-th row of C is a linear combination of the rows v of B for which a_iv is nonzero, where A has the dimensions p × q and B has the dimensions q × r.

Example (C = A × B, rows and columns indexed from 0):

A =
  2  3  0  0  0  0
  0 -1  0  2  3  0
  0  0 -3  1  0  0
  0  0  2  3  0  0
  1  0  0  2  2  0
  0  0  0  2 -1  4

  CSR: data (2, 3, -1, 2, 3, -3, 1, 2, 3, 1, 2, 2, 2, -1, 4)
       col  (0, 1, 1, 3, 4, 2, 3, 2, 3, 0, 3, 4, 3, 4, 5)
       row  (0, 2, 5, 7, 9, 12, 15)

B =
  1 -1  0  0  0
  0  5  0  0  0
  0  0  4  6  0
 -2  0  0  7 -4
  0  1  0  0  5
  0  0  0  1  2

  CSR: data (1, -1, 5, 4, 6, -2, 7, -4, 1, 5, 1, 2)
       col  (0, 1, 1, 2, 3, 0, 3, 4, 1, 4, 3, 4)
       row  (0, 2, 3, 5, 8, 10, 12)

Worked example for row i = 1 of C: the nonzeros of row 1 of A are a_1,1 = -1, a_1,3 = 2 and a_1,4 = 3, so starting from the zero row (0 0 0 0 0) the row c_1* is accumulated in three steps:

  i = 1, v = 1:  c_1* = (-1) · b_1*   = ( 0 -5  0  0  0)
  i = 1, v = 3:  c_1* += 2 · b_3*     = (-4 -5  0 14 -8)
  i = 1, v = 4:  c_1* += 3 · b_4*     = (-4 -2  0 14  7)
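A textbook-style sketch of the Gustavson row-by-row product on CSR inputs (not the authors' tuned implementation): row i of C is accumulated in a dense buffer from the rows v of B with a_iv ≠ 0 and then compressed, exactly as in the worked example above.

```
// Gustavson SpGEMM on CSR inputs, run on the example matrices from the slide.
#include <cstdio>
#include <vector>

struct Csr {
    int rows, cols;
    std::vector<double> data;
    std::vector<int> col, row;
};

Csr spgemm(const Csr& A, const Csr& B) {
    Csr C{A.rows, B.cols, {}, {}, {0}};
    std::vector<double> acc(B.cols, 0.0);            // dense accumulator for one row of C
    for (int i = 0; i < A.rows; ++i) {
        for (int k = A.row[i]; k < A.row[i + 1]; ++k) {
            int v = A.col[k];                         // a_iv is nonzero
            for (int kb = B.row[v]; kb < B.row[v + 1]; ++kb)
                acc[B.col[kb]] += A.data[k] * B.data[kb];   // c_i* += a_iv * b_v*
        }
        for (int j = 0; j < B.cols; ++j)              // compress the dense row into CSR
            if (acc[j] != 0.0) { C.data.push_back(acc[j]); C.col.push_back(j); acc[j] = 0.0; }
        C.row.push_back((int)C.data.size());
    }
    return C;
}

int main() {
    // The 6x6 A and 6x5 B from the slide, in the CSR form shown above.
    Csr A{6, 6, {2,3,-1,2,3,-3,1,2,3,1,2,2,2,-1,4},
                {0,1,1,3,4,2,3,2,3,0,3,4,3,4,5},
                {0,2,5,7,9,12,15}};
    Csr B{6, 5, {1,-1,5,4,6,-2,7,-4,1,5,1,2},
                {0,1,1,2,3,0,3,4,1,4,3,4},
                {0,2,3,5,8,10,12}};
    Csr C = spgemm(A, B);
    // Row 1 of C should come out as (-4, -2, 0, 14, 7), as in the worked example.
    for (int k = C.row[1]; k < C.row[2]; ++k) printf("c[1][%d] = %g\n", C.col[k], C.data[k]);
    return 0;
}
```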



Sparse Matrix-Matrix Multiply - Gustavson

In the CUDA implementation:

Each result row c_i* is handled by the 16 threads of a half warp (1/2W).

For each nonzero element a_iv in A, one 1/2W performs the multiplications with row b_v* in parallel.

The results are kept in dense form until all calculations are complete.

Then the results get compressed on the device.



[Figure: the dense intermediate result C on the device; each row of C (e.g. row 0 = (2 13 0 0 0), row 1 = (-4 -2 0 14 7), ...) is accumulated by one half-warp (half-warp 0, half-warp 1, half-warp 2, ...) before being compressed back to sparse form.]
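A strongly simplified CUDA sketch of the half-warp scheme described above (an illustration under stated assumptions, not the authors' kernel): each half warp owns one row of C and scatters a_iv · b_v* into a dense row; the compression of the dense result back to CSR is omitted. It runs on the small Gustavson example as a check. atomicAdd on double is used for safety and needs compute capability 6.0+.

```
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void gustavson_half_warp(int rowsA, int colsB,
                                    const double* Adata, const int* Acol, const int* Arow,
                                    const double* Bdata, const int* Bcol, const int* Brow,
                                    double* Cdense)            // rowsA x colsB, zero-initialized
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int i    = tid / 16;                                       // row of C owned by this half warp
    int lane = tid % 16;                                       // position inside the half warp
    if (i >= rowsA) return;

    for (int k = Arow[i]; k < Arow[i + 1]; ++k) {              // every nonzero a_iv of row i
        int    v   = Acol[k];
        double aiv = Adata[k];
        for (int kb = Brow[v] + lane; kb < Brow[v + 1]; kb += 16)
            atomicAdd(&Cdense[i * colsB + Bcol[kb]], aiv * Bdata[kb]);  // the 16 lanes cover b_v*
    }
}

int main() {
    // The 6x6 A and 6x5 B of the Gustavson example, in the CSR form shown earlier.
    const int rowsA = 6, colsB = 5;
    double hAdata[] = {2,3,-1,2,3,-3,1,2,3,1,2,2,2,-1,4};
    int    hAcol[]  = {0,1,1,3,4,2,3,2,3,0,3,4,3,4,5};
    int    hArow[]  = {0,2,5,7,9,12,15};
    double hBdata[] = {1,-1,5,4,6,-2,7,-4,1,5,1,2};
    int    hBcol[]  = {0,1,1,2,3,0,3,4,1,4,3,4};
    int    hBrow[]  = {0,2,3,5,8,10,12};

    double *Adata, *Bdata, *Cdense; int *Acol, *Arow, *Bcol, *Brow;
    cudaMallocManaged(&Adata, sizeof(hAdata));  cudaMallocManaged(&Acol, sizeof(hAcol));
    cudaMallocManaged(&Arow,  sizeof(hArow));   cudaMallocManaged(&Bdata, sizeof(hBdata));
    cudaMallocManaged(&Bcol,  sizeof(hBcol));   cudaMallocManaged(&Brow, sizeof(hBrow));
    cudaMallocManaged(&Cdense, rowsA * colsB * sizeof(double));
    memcpy(Adata, hAdata, sizeof(hAdata));  memcpy(Acol, hAcol, sizeof(hAcol));
    memcpy(Arow,  hArow,  sizeof(hArow));   memcpy(Bdata, hBdata, sizeof(hBdata));
    memcpy(Bcol,  hBcol,  sizeof(hBcol));   memcpy(Brow, hBrow, sizeof(hBrow));
    cudaMemset(Cdense, 0, rowsA * colsB * sizeof(double));

    gustavson_half_warp<<<1, 16 * rowsA>>>(rowsA, colsB, Adata, Acol, Arow, Bdata, Bcol, Brow, Cdense);
    cudaDeviceSynchronize();

    for (int j = 0; j < colsB; ++j) printf("%g ", Cdense[1 * colsB + j]);  // expect: -4 -2 0 14 7
    printf("\n");
    return 0;
}
```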




Sparse Matrix-Matrix Multiply - Case Study

Midsize matrix from the University of Florida Sparse Matrix Collection*:

  2D/3D problem
  Size 72,000 × 72,000
  28,715,634 nonzeros
  Blocked into 5041 tiles
  Multiplying the matrix with itself

*http://www.cise.ufl.edu/davis/sparse

[Figure: density plot of the matrix; darker colors represent higher densities of nonzero elements.]

Sparse Matrix-Matrix Multiply - Results

Scaling of SpGEMM with the different approaches.

[Figure: execution time in sec over the number of nodes (1, 2, 4, 8, 16) for the Static, LB-Hom and LB-Het approaches.]
Sparse Matrix-Matrix Multiply - Results

[Figure: number of tasks executed by each node (node ids 0-15) for the Static, LB-Hom and LB-Het approaches.]

[Figure: time in sec to complete all assigned tasks per process (node ids 0-15, 7 processes per node) for the Static, LB-Hom and LB-Het approaches.]
Sparse Matrix-Matrix Multiply - Results

Even inside a node where different compute elements are used, the load balancing mechanism still performs well: the processes using the CUDA devices complete almost 5x more tasks than the pure CPU processes.

[Figure: tasks per core in one of the nodes, and time in sec to complete all assigned tasks for each processor, for the Static (CPU0-CPU6), LB-Hom (CPU0-CPU6) and LB-Het (CUDA0, CUDA1, CPU0-CPU4) configurations.]


Sparse Matrix-Matrix Multiply - Conclusion

We presented a parallel framework using a co-design approach which takes into account characteristics of:
  The selected application (here SpGEMM)
  The underlying hardware (a heterogeneous cluster)

The difficulties of using static partitioning approaches show that a global load balancing method is needed.

Different optimized implementations of the Gustavson algorithm are presented and are used depending on the available compute element.

For the selected case study, optimal load balancing with uniform computation time across all processing elements is achieved.


Future Work

General tasking framework for heterogeneous GPU clusters:
  More general task definition
  More flexibility in input and output data definition
  Exploring limits imposed on tasks by a heterogeneous system
  Feedback loop during execution that allows more efficient assignment of tasks
  Introducing heterogeneous execution on GPU and CPU in one process/core
  Locality-aware task queue(s) and work stealing
  Task reinsertion or generation at the node level

Thank you


Questions?
