Alex Pothen

Purdue University

CSCAPES Institute

www.cs.purdue.edu/homes/apothen/


Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State)



CSC’11 Workshop

May 2011



Multithreaded Algorithms for
Graph Coloring



References

Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin, Halappanavar, and Pothen, 40 pp., submitted to Parallel Computing.

New multithreaded ordering and coloring algorithms for multicore architectures. Patwary, Gebremedhin, Pothen, 12 pp., EuroPar 2011.

Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen, and Patwary, 32 pp., submitted to TOMS.

Distributed Memory Parallel Algorithms for Matching and Coloring. Catalyurek, Dobrian, Gebremedhin, Halappanavar, and Pothen, 10 pp., IPDPS Workshop PCO, 2011.

[Diagram labels: Graph, Architecture, Algorithm, Latency Tolerance, Performance]

Outline

The many-core and multi-threaded world
  Intel Nehalem
  Sun Niagara
  Cray XMT
A case study on multithreaded graph coloring
  An Iterative and Speculative Coloring Algorithm
  A Dataflow algorithm
  RMAT graphs: ER, G, and B
  Experimental results
Conclusions

Architectural Features

Proc.          Threads/Core  Cores/Socket  Threads  Cache      Clock    Multithreading, Other Detail
Intel Nehalem  2             4             16       Shared L3  2.5 GHz  Simultaneous; cache coherence protocol
Sun Niagara 2  8             2             128      Shared L2  1.2 GHz  Simultaneous
Cray XMT       128           128 procs.    16,384   None       500 MHz  Interleaved; fine-grained synchronization

O(|E|)-time implementations possible for all four
B = max back degree over entire sequence
B+1 colors suffice to color G.
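The B+1 bound quoted on this slide can be checked on a small example. The sketch below is illustrative only: it assumes "back degree" means the number of a vertex's neighbors that appear earlier in the ordering, and the function names and the toy graph are hypothetical, not taken from the talk.

```python
def max_back_degree(adj, order):
    """Largest number of neighbors that precede a vertex in `order`.

    Assumes a non-empty graph given as dict: vertex -> list of neighbors.
    """
    position = {v: i for i, v in enumerate(order)}
    return max(
        sum(1 for w in adj[v] if position[w] < position[v])
        for v in order
    )


def greedy_color_count(adj, order):
    """Number of colors used by greedy coloring in the given order."""
    color = {}
    for v in order:
        # Only already-colored (earlier) neighbors constrain v.
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return max(color.values()) + 1


# Toy graph (hypothetical): a 5-cycle 0-1-2-3-4 with the chord 0-2.
adj = {0: [1, 4, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 4], 4: [3, 0]}
order = [0, 1, 2, 3, 4]

B = max_back_degree(adj, order)
colors_used = greedy_color_count(adj, order)
assert colors_used <= B + 1
```

The bound holds because, when a vertex is colored, at most B of its neighbors already carry a color, so one of the first B+1 colors is always free.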

Multithreaded: Iterative Greedy Algorithm

[Figure: Forbidden Colors array; labels v, V, c_i]

Multi-threaded: Data Flow Algorithm

[Figure: Forbidden Colors array; labels v, V, c_i]

RMAT Graphs

R-MAT: Recursive MATrix method

Experiments
  RMAT-ER (0.25, 0.25, 0.25, 0.25)
  RMAT-G (0.45, 0.15, 0.15, 0.25)
  RMAT-B (0.55, 0.15, 0.15, 0.15)

Chakrabarti, D. and Faloutsos, C. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38, 1.
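To make the four-number parameter tuples concrete, here is a minimal sketch of an R-MAT style generator. It assumes the usual recursive scheme: at each of `scale` levels one quadrant of the adjacency matrix is chosen with probabilities (a, b, c, d). Dropping self-loops and duplicate edges and treating edges as undirected are choices made for this sketch, and the function names are hypothetical.

```python
import random


def rmat_edge(scale, probs, rng):
    """Pick one (row, col) cell of a 2**scale x 2**scale adjacency matrix."""
    a, b, c, _d = probs
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass                # top-left quadrant
        elif r < a + b:
            col |= 1            # top-right quadrant
        elif r < a + b + c:
            row |= 1            # bottom-left quadrant
        else:
            row |= 1            # bottom-right quadrant
            col |= 1
    return row, col


def rmat_graph(scale, num_edges, probs, seed=0):
    """Return a set of distinct undirected edges (u, v) with u < v.

    num_edges must not exceed the number of possible edges,
    otherwise the loop below never terminates.
    """
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_edges:
        u, v = rmat_edge(scale, probs, rng)
        if u != v:
            edges.add((min(u, v), max(u, v)))
    return edges


# Parameter tuples as listed on the slide.
RMAT_ER = (0.25, 0.25, 0.25, 0.25)
RMAT_G = (0.45, 0.15, 0.15, 0.25)
RMAT_B = (0.55, 0.15, 0.15, 0.15)

small = rmat_graph(scale=4, num_edges=20, probs=RMAT_G, seed=1)
```

With equal probabilities every cell is equally likely, which gives an Erdős–Rényi-like graph; the more skewed the tuple, the more edges concentrate on a few low-numbered vertices.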





RMAT Graphs

[Figure with labels a, b, c, d]

Nehalem: Strong Scaling (Niagara)

[Figure panels: RMAT-ER, RMAT-G, RMAT-B]

Cray XMT: Strong and Weak Scaling

[Figure panels: Iter-G, Iter-B, DF-G, DF-B]

Comparing Three Platforms

[Figure panels: a) ER, c) Good, e) Bad]

No. Colors in Parallel Algorithms

[Figure panels: a) ER, b) G, c) B]

Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem)

[Figure panels: SL Ordering, Relaxed SL Ordering]

Our contributions: Multithreaded Coloring

Massive multithreading
  Can tolerate memory latency for graphs/sparse matrices
  Dataflow algorithms are easier to implement than distributed-memory versions
  Thread concurrency ameliorates the lack of caches and the lower clock speeds
  Thread parallelism can be exploited at fine grain if supported by lightweight synchronization
  Graph structure critically influences performance

Many-core machines
  Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port to different machines
  Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores)
  Decompose into tasks at a finer grain than the distributed-memory version, and relax synchronization to enhance concurrency
  Will form nodes of Peta- and Exa-scale machines, so single node performance studies are needed


Multi-threaded Parallelism

Memory access times determine performance
By issuing multiple threads, mask memory latency if a ready thread is available when a functional unit becomes free
Interleaved vs. Simultaneous multithreading (IMT or SMT)

[Figure from Robert Golla, Sun; axis: Time]

Multi-core: Sun Niagara 2

Two 8-core sockets, 8 hw threads per core
1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks
Simultaneous multithreading
Two threads from a core can be issued in a cycle
Shallow pipeline


Multicore: Intel Nehalem

Two quad-core sockets, 2.5 GHz
Two hyperthreads per core support SMT
Off-chip data latency 106 cycles
Advanced architectural features: cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, out-of-order execution

Massive Multithreading: Cray XMT

Latency tolerance via massive multi-threading
Context switch between threads in a single clock cycle
Global address space, hashed to memory banks to reduce hot-spots
No cache or local memory, average latency 600 cycles
Memory request doesn't stall processor
Other threads work while the request is fulfilled
Light-weight, word-level synchronization (full/empty bits)

Notes:
  500 MHz clock
  128 hardware thread streams/proc.
  Interleaved multithreading

Multithreaded Algorithms for Graph Coloring

We developed two kinds of multithreaded algorithms for graph coloring:
  An iterative, coarse-grained method for generic shared-memory architectures
  A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grain synchronization, such as the Cray XMT
Benchmarked the algorithms on three systems:
  Cray XMT, Sun Niagara 2 and Intel Nehalem
Excellent speedup observed on all three platforms
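The dataflow idea can be pictured with ordinary threads. The sketch below is an assumption-laden illustration, not the XMT implementation: it assumes each vertex waits until all of its lower-numbered neighbors are colored and then takes the smallest free color, and it uses one `threading.Event` per vertex as a stand-in for a hardware full/empty bit. The function name and toy graph are hypothetical.

```python
import threading


def dataflow_coloring(adj):
    """Color vertices 0..n-1; adj[v] is the list of neighbors of v."""
    n = len(adj)
    color = [None] * n
    # ready[v] plays the role of the full/empty bit on color[v].
    ready = [threading.Event() for _ in range(n)]

    def color_vertex(v):
        used = set()
        for w in adj[v]:
            if w < v:
                ready[w].wait()      # block until color[w] is "full"
                used.add(color[w])
        c = 0
        while c in used:
            c += 1
        color[v] = c
        ready[v].set()               # mark color[v] as "full"

    # One thread per vertex: fine for a toy graph, not for large inputs.
    threads = [threading.Thread(target=color_vertex, args=(v,))
               for v in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return color


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant vertex 3.
toy = [[1, 2], [0, 2], [0, 1, 3], [2]]
toy_colors = dataflow_coloring(toy)
```

Because every wait points from a higher-numbered vertex to a lower-numbered one, the dependencies cannot form a cycle, so the sketch cannot deadlock. It also never produces a conflict, since the result matches serial greedy coloring in vertex-number order.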


Coloring Algorithms

Greedy coloring algorithms

Distance-k, star, and acyclic coloring are NP-hard
Approximating coloring to within $O(n^{1-\epsilon})$ is NP-hard for any $\epsilon > 0$

GREEDY(G = (V, E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine colors forbidden to v_i
    Assign v_i the smallest permissible color
  end-for



A greedy heuristic usually gives a near-optimal solution
The key is to find good orderings for coloring, and many have been developed
Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Comput. 29:1042-1072, 2007.
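The GREEDY pseudocode on this slide translates almost line for line into code. The following is a minimal serial sketch, assuming the graph is a dictionary from vertex to neighbor list. The "stamped" forbidden-colors table is one common way to avoid clearing the table for every vertex. It is used here as an assumption, and the names are hypothetical.

```python
def greedy_coloring(adj, order=None):
    """Serial greedy distance-1 coloring.

    adj:   dict mapping each vertex to an iterable of its neighbors
    order: visiting order without repeats (defaults to the dict's order)
    """
    if order is None:
        order = list(adj)
    color = {}
    # forbidden[c] == v means color c is used by a neighbor of v.
    # Stamping with the current vertex avoids resetting the table.
    forbidden = {}
    for v in order:
        for w in adj[v]:
            if w in color:
                forbidden[color[w]] = v
        c = 0
        while forbidden.get(c) == v:
            c += 1
        color[v] = c            # smallest permissible color
    return color


# Toy graph (hypothetical): a triangle 0-1-2 with a pendant vertex 3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coloring = greedy_coloring(graph)
```

Passing a different `order` changes the number of colors used, which is why the choice of ordering matters so much for greedy coloring.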

Distance-1 Coloring, Greedy Alg.

[Figure: example graphs with labels a, v]

Many-core greedy coloring

Given a graph, parallelize greedy coloring on many-core machines such that speedup is attained and the number of colors is roughly the same as in the serial algorithm
Difficult task since greedy is inherently sequential, computation is small relative to communication, and data accesses are irregular
D1 coloring: approaches based on Luby's parallel algorithm for maximal independent set had limited success
Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm on shared memory machines
  Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution
  Number of conflicts bounded, so this approach yields an effective algorithm
Extended to distance-2 coloring by Gebremedhin, Manne and Pothen (2002)
We adapt this approach to implement the greedy algorithm for many-core computing
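The speculate-then-repair pattern described on this slide can be sketched with a thread pool. This is a rough illustration under stated assumptions, not the authors' code: it assumes each round colors the pending vertices without locks, then detects conflicts and re-queues the higher-numbered endpoint of every conflicting edge. The function name, the chunking scheme and the thread count are hypothetical. Python threads will not show real speedup, so the sketch only demonstrates the control flow.

```python
from concurrent.futures import ThreadPoolExecutor


def iterative_coloring(adj, num_threads=4):
    """adj[v] lists the neighbors of v for vertices 0..n-1 (symmetric)."""
    n = len(adj)
    color = [-1] * n            # shared array, accessed without locks
    work = list(range(n))       # vertices still to be (re)colored

    def speculate(chunk):
        # Phase 1: color tentatively; neighbor colors may be stale.
        for v in chunk:
            used = {color[w] for w in adj[v]}
            c = 0
            while c in used:
                c += 1
            color[v] = c

    def detect(chunk):
        # Phase 2: the higher-numbered endpoint of a conflict is redone.
        return [v for v in chunk
                if any(color[w] == color[v] and w < v for w in adj[v])]

    rounds = 0
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while work:
            chunks = [work[i::num_threads] for i in range(num_threads)]
            list(pool.map(speculate, chunks))    # acts as a barrier
            conflicts = pool.map(detect, chunks)
            work = sorted(v for part in conflicts for v in part)
            rounds += 1
    return color, rounds


# Toy graph (hypothetical): a ring of 8 vertices.
ring = [[(v - 1) % 8, (v + 1) % 8] for v in range(8)]
ring_colors, ring_rounds = iterative_coloring(ring)
```

The loop terminates because the lowest-numbered pending vertex can never lose a conflict, so at least one vertex is finalized in every round.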

Parallel Coloring

Parallel Coloring: Speculation

[Figure: example graphs with labels a, v, w]

Experimental results

Cray XMT: RMAT-G with $2^{24}, \ldots, 2^{27}$ vertices and 134M, …, 1B edges

[Figure panels: Iterative, Dataflow]

Experimental results

Niagara 2

[Figure: Iterative]

Performance gain from doubling threads on a core = gain from doubling cores!

Experimental results

All Platforms

[Figure: RMAT-G with $2^{24}$ = 16M vertices and 134M edges]
[Figure: RMAT-B, $2^{24}$ vertices, 134M edges]

Iterative Greedy Coloring: Multithreaded Algorithm

Adj(v), color(w), forbidden(v): d(v) reads each
forbidden(v): d(v) writes
Adj(v), color(w): d(v) reads each

Experimental results

All Platforms

[Figure: RMAT-G with $2^{24}$ = 16M vertices and 134M edges]

Tentative Conclusions, Future Work


Future Plans: Multithreaded Coloring

Massive multithreading
  Microbenchmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, functional unit limitations…
  Develop a performance model of the computation
  Experiment with other graph classes
  Consider new algorithmic paradigms

Many-core machines
  Four items as above
  Ordering for coloring: Archetype of a problem for computing a sequential ordering in a parallel environment (Mostofa Patwary and Assefaw Gebremedhin)
  Extend to nodes of Peta-scale machines, so single node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5

Thanks

Rob Bisseling, Erik Boman, Ümit Çatalyürek, Karen Devine, Florin Dobrian, John Feo, Assefaw Gebremedhin, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo, Jean Utke

Further reading

www.cscapes.org

Gebremedhin and Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice and Experience, 12: 1131-1146, 2000.

Gebremedhin, Manne and Pothen, Parallel distance-k coloring algorithms for numerical optimization, Lecture Notes in Computer Science, 2400: 912-921, 2002.

Bozdag, Gebremedhin, Manne, Boman and Catalyurek, A framework for scalable greedy coloring on distributed-memory parallel computers, J. Parallel Distrib. Comput. 68(4):515-535, 2008.

Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, Multi-threaded algorithms for graph coloring, Preprint, Aug. 2010.