Alex Pothen
Purdue University
CSCAPES Institute
www.cs.purdue.edu/homes/apothen/
Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State)
CSC’11 Workshop
May 2011
Multithreaded Algorithms for Graph Coloring
References
Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, 40 pp., Submitted to Parallel Computing.
New multithreaded ordering and coloring algorithms for multicore architectures. Patwary, Gebremedhin, Pothen, 12 pp., EuroPar 2011.
Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen, and Patwary, 32 pp., Submitted to TOMS.
Distributed Memory Parallel Algorithms for Matching and Coloring. Catalyurek, Dobrian, Gebremedhin, Halappanavar, and Pothen, 10 pp., IPDPS Workshop PCO, 2011.
[Figure: Graph, Architecture, Algorithm, Latency Tolerance, Performance]
Outline
The many-core and multi-threaded world
◦ Intel Nehalem
◦ Sun Niagara
◦ Cray XMT
A case study on multithreaded graph coloring
◦ An Iterative and Speculative Coloring Algorithm
◦ A Dataflow algorithm
RMAT graphs: ER, G, and B
Experimental results
Conclusions
Architectural Features
O(E)-time implementations possible for all four
• B = max back degree over entire seq.
• B+1 colors suffice to color G.
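The B+1 bound can be checked on a small example. The sketch below is an illustration under assumptions, not code from the talk: the graph is a dict of adjacency sets, the back degree of a vertex is taken to be the number of its neighbors that come earlier in the ordering, and `back_degree`, `greedy_color` and the example graph are hypothetical.

```python
def back_degree(adj, order):
    """Return B, the largest number of earlier-ordered neighbors of any vertex."""
    position = {v: i for i, v in enumerate(order)}
    return max(
        (sum(1 for w in adj[v] if position[w] < position[v]) for v in order),
        default=0,
    )


def greedy_color(adj, order):
    """Color vertices in the given order with the smallest permissible color."""
    color = {}
    for v in order:
        # Only already-colored (earlier) neighbors can forbid a color.
        forbidden = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
    return color


# Hypothetical example graph: a 5-cycle with one chord (0, 2).
adj = {0: {1, 2, 4}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {0, 3}}
order = [0, 1, 2, 3, 4]

B = back_degree(adj, order)
colors = greedy_color(adj, order)
num_colors = max(colors.values()) + 1
print("B =", B, "colors used =", num_colors)
```

Each vertex sees at most B colored neighbors when it is processed, so one of the colors 0..B is always free, which is why the greedy pass never needs more than B+1 colors.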
Proc. | Threads/Core | Cores/Socket | Threads | Cache | Clock | Multithreading, Other Detail
Intel Nehalem | 2 | 4 | 16 | Shared L3 | 2.5 GHz | Simultaneous, cache coherence protocol
Sun Niagara 2 | 8 | 2 | 128 | Shared L2 | 1.2 GHz | Simultaneous
Cray XMT | 128 | 128 Procs. | 16,384 | None | 500 MHz | Interleaved, fine-grained synchronization
Multithreaded: Iterative Greedy Algorithm
[Figure: Forbidden Colors array for vertex v]
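The slide's algorithm listing did not survive extraction, so the following is only a sketch of an iterative, speculative scheme of the kind the title names, under stated assumptions: vertices are 0..n-1, `adj` is a list of neighbor lists without self-loops, Python threads stand in for hardware threads, and `iterative_coloring` and its helpers are hypothetical names. Each round colors the pending vertices in parallel without locking, then detects conflicts in parallel and queues one endpoint of every conflicting edge for the next round.

```python
from concurrent.futures import ThreadPoolExecutor


def smallest_free_color(v, adj, color):
    """Smallest color not currently used by any neighbor of v (-1 = uncolored)."""
    forbidden = {color[w] for w in adj[v]}
    c = 0
    while c in forbidden:
        c += 1
    return c


def iterative_coloring(adj, num_threads=4):
    n = len(adj)
    color = [-1] * n
    work = list(range(n))  # vertices still to be (re)colored
    rounds = 0
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while work:
            rounds += 1

            # Phase 1: speculative coloring, no synchronization between threads.
            def tentative(v):
                color[v] = smallest_free_color(v, adj, color)

            list(pool.map(tentative, work))  # consuming the map acts as a barrier

            # Phase 2: conflict detection; the lower-numbered endpoint is redone.
            def in_conflict(v):
                return any(color[w] == color[v] and w > v for w in adj[v])

            flags = list(pool.map(in_conflict, work))
            work = [v for v, bad in zip(work, flags) if bad]
    return color, rounds


# Hypothetical example: a 12-vertex ring with a chord to the opposite vertex.
n = 12
example_adj = [[(v - 1) % n, (v + 1) % n, (v + n // 2) % n] for v in range(n)]
example_color, example_rounds = iterative_coloring(example_adj)
print(example_color, "rounds:", example_rounds)
```

The highest-numbered pending vertex is never queued again, so the work list shrinks every round and the loop terminates with a proper coloring.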
Multi-threaded: Data Flow Algorithm
[Figure: Forbidden Colors array for vertex v]
RMAT Graphs
R-MAT: Recursive MATrix method
Experiments
◦ RMAT-ER (0.25, 0.25, 0.25, 0.25)
◦ RMAT-G (0.45, 0.15, 0.15, 0.25)
◦ RMAT-B (0.55, 0.15, 0.15, 0.15)
Chakrabarti, D. and Faloutsos, C. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38, 1.
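To make the three parameter sets concrete, here is a minimal R-MAT edge generator. It is a sketch under assumptions: the four probabilities are read as (a, b, c, d) for the top-left, top-right, bottom-left and bottom-right quadrants of the adjacency matrix, duplicate edges and self-loops are left in, and `rmat_edges` is a hypothetical name rather than the generator used in the experiments.

```python
import random


def rmat_edges(scale, num_edges, probs, seed=0):
    """Generate num_edges (row, col) pairs on 2**scale vertices by R-MAT recursion."""
    a, b, c, _d = probs
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _ in range(scale):  # pick one quadrant per bit of the vertex ids
            r = rng.random()
            row <<= 1
            col <<= 1
            if r < a:
                pass  # top-left: both bits 0
            elif r < a + b:
                col |= 1  # top-right
            elif r < a + b + c:
                row |= 1  # bottom-left
            else:
                row |= 1  # bottom-right
                col |= 1
        edges.append((row, col))
    return edges


RMAT_ER = (0.25, 0.25, 0.25, 0.25)
RMAT_G = (0.45, 0.15, 0.15, 0.25)
RMAT_B = (0.55, 0.15, 0.15, 0.15)

sample = rmat_edges(scale=10, num_edges=5000, probs=RMAT_B, seed=1)
print(sample[:5])
```

With equal probabilities every matrix cell is equally likely, which gives the Erdős-Rényi-like ER class; skewing mass toward one quadrant concentrates edges on low-numbered vertices and produces the heavier-tailed degree distributions of G and B.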
RMAT Graphs
[Figure: adjacency matrix partitioned into quadrants a, b, c, d]
Nehalem: Strong Scaling (Niagara)
[Figure: panels RMAT-ER, RMAT-G, RMAT-B]
Cray XMT: Strong and Weak Scaling
[Figure: panels Iter-G, Iter-B, DF-G, DF-B]
Comparing Three Platforms
[Figure: panels a) ER, c) Good, e) Bad]
No. Colors in Parallel Algorithms
[Figure: panels a) ER, b) G, c) B]
Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem)
[Figure: panels SL Ordering, Relaxed SL Ordering]
Our contributions: Multithreaded Coloring
Massive multithreading
◦ Can tolerate memory latency for graphs/sparse matrices
◦ Dataflow algorithms easier to implement than distributed memory versions
◦ Thread concurrency ameliorates lack of caches and lower clock speeds
◦ Thread parallelism can be exploited at fine grain if supported by lightweight synchronization
◦ Graph structure critically influences performance
Many-core machines
◦ Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port to different machines
◦ Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores)
◦ Decomposition into tasks at a finer grain than distributed-memory version, and relax synchronization to enhance concurrency
◦ Will form nodes of Peta- and Exa-scale machines, so single node performance studies are needed
Multi-threaded Parallelism
• Memory access times determine performance
• By issuing multiple threads, mask memory latency if a ready thread is available when a functional unit becomes free
• Interleaved vs. Simultaneous multithreading (IMT or SMT)
[Figure from Robert Golla, Sun; axis: Time]
Multi-core: Sun Niagara 2
• Two 8-core sockets
• 8 hw threads per core
• 1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks
• Simultaneous multithreading
• Two threads from a core can be issued in a cycle
• Shallow pipeline
Multicore: Intel Nehalem
• Two quad-core sockets, 2.5 GHz
• Two hyperthreads per core support SMT
• Off-chip data latency 106 cycles
• Advanced architectural features: cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, out-of-order execution
Massive Multithreading: Cray XMT
Latency tolerance via massive multi-threading
◦ Context switch between threads in a single clock cycle
◦ Global address space, hashed to memory banks to reduce hot-spots
◦ No cache or local memory, average latency 600 cycles
Memory request doesn't stall processor
◦ Other threads work while the request is fulfilled
Light-weight, word-level synchronization (full/empty bits)
Notes:
◦ 500 MHz clock
◦ 128 hardware thread streams/proc.
◦ Interleaved multithreading
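A back-of-the-envelope model shows why 128 streams can cover a 600-cycle latency. This is a simple saturation model assumed here for illustration, not a measurement or a formula from the talk: each thread alternates `work_cycles` of computation with one memory request of `latency_cycles`, and the processor stays busy whenever some thread is ready. Only the 600-cycle latency and the 128 streams come from the slide; the work-per-request values are hypothetical.

```python
import math


def utilization(threads, work_cycles, latency_cycles):
    """Fraction of cycles the processor is busy under the saturation model."""
    return min(1.0, threads * work_cycles / (work_cycles + latency_cycles))


def threads_to_saturate(work_cycles, latency_cycles):
    """Fewest threads for which the model predicts a fully busy processor."""
    return math.ceil((work_cycles + latency_cycles) / work_cycles)


LATENCY = 600  # average memory latency in cycles (from the slide)
STREAMS = 128  # hardware thread streams per processor (from the slide)

for work in (1, 5, 20):  # hypothetical cycles of work per memory request
    print(work, threads_to_saturate(work, LATENCY),
          round(utilization(STREAMS, work, LATENCY), 3))
```

Under this model, a code that does only one cycle of work per memory reference cannot keep the processor busy even with all 128 streams, while a few cycles of work per reference is enough, which is consistent with graph structure and access pattern mattering for performance.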
Multithreaded Algorithms for Graph Coloring
◦ We developed two kinds of multithreaded algorithms for graph coloring:
  • An iterative, coarse-grained method for generic shared-memory architectures
  • A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grain synchronization, such as the Cray XMT
◦ Benchmarked the algorithms on three systems: Cray XMT, Sun Niagara 2 and Intel Nehalem
◦ Excellent speedup observed on all three platforms
Coloring Algorithms
Greedy coloring algorithms
Distance-k, star, and acyclic coloring are NP-hard
Approximating coloring to within O(n^{1-ε}) is NP-hard for any ε > 0
GREEDY(G=(V,E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine colors forbidden to v_i
    Assign v_i the smallest permissible color
  end-for
A greedy heuristic usually gives a near-optimal solution
The key is to find good orderings for coloring, and many have been developed
Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Comput. 29:1042-1072, 2007.
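The GREEDY procedure translates almost line for line into code. The version below is a sketch, assuming vertices 0..n-1 and `adj` as a list of neighbor lists; the stamped `forbidden` array is one common way to find forbidden colors without clearing any state between vertices, and `greedy_coloring` is a hypothetical name.

```python
def greedy_coloring(adj, order=None):
    """Color vertices in the given order; returns a list of colors (0-based)."""
    n = len(adj)
    if order is None:
        order = range(n)  # natural ordering
    color = [-1] * n
    max_degree = max((len(neighbors) for neighbors in adj), default=0)
    # forbidden[c] == v means color c is taken by some neighbor of v.
    forbidden = [-1] * (max_degree + 1)
    for v in order:
        for w in adj[v]:
            if color[w] != -1:
                forbidden[color[w]] = v  # stamp with v, so no reset is needed
        c = 0
        while forbidden[c] == v:
            c += 1
        color[v] = c  # smallest permissible color
    return color


# Hypothetical example: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
small_adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
small_color = greedy_coloring(small_adj)
print(small_color)
```

Each vertex touches only its own adjacency list, and the scan for a free color stops after at most deg(v)+1 steps, so the whole pass takes time proportional to the number of vertices plus edges.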
Distance-1 Coloring, Greedy Alg.
[Figure: example graph with vertices labeled a and v]
Many-core greedy coloring
Given a graph, parallelize greedy coloring on many-core machines such that speedup is attained and the number of colors is roughly the same as in serial
Difficult task since greedy is inherently sequential, computation is small relative to communication, and data accesses are irregular
D1 coloring:
Approaches based on Luby's parallel algorithm for maximal independent set had limited success
Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm on shared memory machines
◦ Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution
◦ Number of conflicts bounded, so this approach yields an effective algorithm
◦ Extended to distance-2 coloring by Gebremedhin, Manne and Pothen (2002)
We adapt this approach to implement the greedy algorithm for many-core computing
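The three ingredients named above (speculative coloring, randomized partitioning, serial conflict resolution) can be simulated deterministically in a single thread. The sketch below is an illustration under assumptions, not the published implementation: `p` simulated processors each color one vertex per lockstep step, seeing only colors from earlier steps, and `speculative_coloring` is a hypothetical name.

```python
import random


def smallest_free_color(v, adj, color):
    """Smallest color not currently used by any neighbor of v (-1 = uncolored)."""
    forbidden = {color[w] for w in adj[v]}
    c = 0
    while c in forbidden:
        c += 1
    return c


def speculative_coloring(adj, p, seed=0):
    n = len(adj)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # randomized partitioning
    blocks = [order[i::p] for i in range(p)]  # one block per simulated processor
    color = [-1] * n

    # Phase 1: speculative coloring in lockstep; choices within a step are
    # made before any of them is written, so they cannot see each other.
    for step in range(max(len(b) for b in blocks)):
        batch = [b[step] for b in blocks if step < len(b)]
        choices = [smallest_free_color(v, adj, color) for v in batch]
        for v, c in zip(batch, choices):
            color[v] = c

    # Phase 2: conflict detection; keep one endpoint of each conflicting edge.
    conflicts = sorted({min(v, w) for v in range(n) for w in adj[v]
                        if color[v] == color[w]})

    # Phase 3: serial conflict resolution.
    for v in conflicts:
        color[v] = smallest_free_color(v, adj, color)
    return color, len(conflicts)


# Hypothetical worst case: the complete graph K4 with four processors.
k4 = [[w for w in range(4) if w != v] for v in range(4)]
k4_color, k4_conflicts = speculative_coloring(k4, p=4)
print(k4_color, k4_conflicts)
```

With `p=1` the simulation degenerates to serial greedy and reports no conflicts; on sparse graphs with a random partition, two adjacent vertices rarely land in the same step, which is the intuition behind the bounded number of conflicts.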
Parallel Coloring
Parallel Coloring: Speculation
[Figure: example graph with vertices labeled a, v, and w]
Experimental results
Cray XMT: RMAT-G with 2^24, …, 2^27 vertices and 134M, …, 1B edges
[Figure: panels Iterative, Dataflow]
Experimental results
Niagara 2, Iterative
Performance gain from doubling the threads on a core = gain from doubling the cores!
Experimental results
All Platforms
[Figure: panels RMAT-G with 2^24 = 16M vertices and 134M edges; RMAT-B with 2^24 vertices and 134M edges]
Iterative Greedy Coloring:
Multithreaded Algorithm
Adj(v), color(w), forbidden(v): d(v) reads each
forbidden(v): d(v) writes
Adj(v), color(w): d(v) reads each
Experimental results
All Platforms
[Figure: RMAT-G with 2^24 = 16M vertices and 134M edges]
Tentative Conclusions, Future Work
Future Plans: Multithreaded Coloring
Massive multithreading
◦ Microbenchmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, functional unit limitations…
◦ Develop a performance model of the computation
◦ Experiment with other graph classes
◦ Consider new algorithmic paradigms
Many-core machines
◦ Four items as above
◦ Ordering for coloring: Archetype of a problem for computing a sequential ordering in a parallel environment (Mostofa Patwary and Assefaw Gebremedhin)
◦ Extend to nodes of Peta-scale machines, so single node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5
Thanks
Rob Bisseling, Erik Boman, Ümit Çatalyürek, Karen Devine, Florin Dobrian, John Feo, Assefaw Gebremedhin, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo, Jean Utke
Further reading
www.cscapes.org
Gebremedhin and Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice and Experience, 12: 1131-1146, 2000.
Gebremedhin, Manne and Pothen, Parallel distance-k coloring algorithms for numerical optimization, Lecture Notes in Computer Science, 2400: 912-921, 2002.
Bozdag, Gebremedhin, Manne, Boman and Catalyurek, A framework for scalable greedy coloring on distributed-memory parallel computers, J. Parallel Distrib. Comput., 68(4): 515-535, 2008.
Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, Multi-threaded algorithms for graph coloring, Preprint, Aug. 2010.