Better Speedups for Parallel Max-Flow


George C. Caragea, Uzi Vishkin

Dept. of Computer Science
University of Maryland, College Park, USA

June 4th, 2011

Experience with an Easy-to-Program Parallel Architecture

XMT (eXplicit Multi-Threading) Platform
- Design goal: easy-to-program many-core architecture
- PRAM-based design, PRAM-On-Chip programming
- Ease of programming demonstrated by order-of-magnitude ease of teaching/learning
- 64-processor hardware, compiler, 20+ papers, 9 grad degrees, 6 US patents
- Only one previous single-application paper (Dascal et al., 1999)

Parallel Max-Flow results
- [IPDPS 2010] 2.5x speedup vs. serial using CUDA
- [Caragea and Vishkin, SPAA 2011] up to 108.3x speedup vs. serial using XMT
- 3-page paper

2

How to publish application papers on an easy-to-program platform?

Reward game is skewed
- Easier to publish on “hard-to-program” platforms
  - Remember STI Cell?
- Application papers for easy-to-program architectures are considered “boring”
  - Even when they show good results

Recipe for academic publication:
- Take a simple application (e.g. Breadth-First Search on a graph)
- Implement it on the latest (difficult-to-program) parallel architecture
- Discuss challenges and work-arounds


3

Parallel Programming Today

4

Current Parallel Programming
- High-friction navigation by implementation [walk/crawl]
- Initial program (1 week) begins trial & error tuning (½ year; architecture dependent)

PRAM-On-Chip Programming
- Low-friction navigation: mental design and analysis [fly]
- No need to crawl
- Identify the most efficient algorithm
- Advance to an efficient implementation

A high-school student comparing the parallel programming approaches:

“I was motivated to solve all the XMT programming assignments we got, since I had to cope with solving the algorithmic problems themselves, which I enjoy doing. In contrast, I did not see the point of programming other parallel systems available to us at school, since too much of the programming was effort getting around the way the systems were engineered, and this was not fun”

5

Maximum Flow in Networks

- Extensively studied problem (formal statement after this slide)
  - Numerous algorithms and implementations (general graphs)
- Application domains
  - Network analysis
  - Airline scheduling
  - Image processing
  - DNA sequence alignment
- Parallel Max-Flow algorithms and implementations
  - Paper has an overview
  - SMPs and GPUs
  - Difficult to obtain good speedups vs. serial
    - e.g. 2.5x for a hybrid CPU-GPU solution

6
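For reference (standard background, not taken from the slide): given a directed graph G = (V, E) with edge capacities c(u, v) ≥ 0, a source s, and a sink t, the maximum-flow problem asks for a flow f that maximizes the net flow out of s while respecting capacities and conservation:

\[
\max_f \;\sum_{v:(s,v)\in E} f(s,v) \;-\; \sum_{u:(u,s)\in E} f(u,s)
\quad\text{s.t.}\quad
0 \le f(u,v) \le c(u,v)\ \ \forall (u,v)\in E,
\qquad
\sum_{u:(u,v)\in E} f(u,v) \;=\; \sum_{w:(v,w)\in E} f(v,w)\ \ \forall v\in V\setminus\{s,t\}.
\]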

XMT Max-Flow Parallel Solution

- First stage: identify/design a parallel algorithm
  - [Shiloach, Vishkin 1982] designed an O(n² log n) time, O(nm) space PRAM algorithm
  - [Goldberg, Tarjan 1988] introduced distance labels into S-V: the Push-Relabel algorithm with O(m) space complexity
  - [Anderson, Setubal 1992] observed poor practical performance for G-T, augmented it with an S-V-inspired Global Relabeling heuristic
  - Solution: hybrid SV-GT PRAM algorithm
- Second stage: write the PRAM-On-Chip implementation (see the sketch after this slide)
  - Relax PRAM lock-step synchrony by grouping several PRAM steps in an XMT spawn block
  - Insert synchronization points (barriers) where needed for correctness
  - Maintain an active node set instead of polling all graph nodes for work
  - Use hardware-supported atomic operations to simplify reductions

7
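To make the two stages above concrete, here is a minimal serial C sketch of the Goldberg-Tarjan push-relabel core (distance labels, push and relabel operations, FIFO active-node set) that the hybrid SV-GT algorithm builds on. This is a sketch under stated assumptions, not the authors' XMTC code: the global relabeling heuristic is omitted, the graph is a hypothetical 6-node textbook instance stored as an adjacency matrix, and the comments only indicate where the XMT version would apply spawn blocks, barriers, and atomics.

/* push_relabel_sketch.c -- illustrative only, not the XMT implementation */
#include <stdio.h>

#define N    6            /* nodes in the toy instance (0 = source, 5 = sink) */
#define INF  1000000000
#define QCAP 4096         /* generous FIFO bound for this tiny instance       */

static int cap[N][N];     /* residual capacities                              */
static int height[N];     /* distance labels d(v), as in Goldberg-Tarjan      */
static int excess[N];     /* excess flow e(v)                                 */
static int fifo[QCAP], qh, qt;   /* FIFO of active nodes                      */
static int queued[N];

static void activate(int v, int s, int t) {
    /* A node is active if it carries excess and is neither source nor sink. */
    if (!queued[v] && excess[v] > 0 && v != s && v != t) {
        queued[v] = 1;
        fifo[qt++] = v;
    }
}

static int max_flow(int s, int t) {
    /* Initialization: d(s) = n, saturate every edge leaving the source.     */
    height[s] = N;
    for (int v = 0; v < N; v++) {
        if (cap[s][v] > 0) {
            excess[v] += cap[s][v];
            excess[s] -= cap[s][v];
            cap[v][s] += cap[s][v];
            cap[s][v] = 0;
            activate(v, s, t);
        }
    }
    /* Serial main loop. On XMT, one "pulse" would discharge all currently
     * active nodes inside a spawn block, barriers would separate pulses,
     * and the next active set would be built with fetch-and-add style
     * atomics instead of this FIFO.                                          */
    while (qh < qt) {
        int u = fifo[qh++];
        queued[u] = 0;
        while (excess[u] > 0) {
            int lowest = INF;       /* min neighbor label over residual edges */
            int pushed = 0;
            for (int v = 0; v < N && excess[u] > 0; v++) {
                if (cap[u][v] <= 0) continue;
                if (height[u] == height[v] + 1) {
                    /* Push: send as much excess as the residual edge allows. */
                    int d = excess[u] < cap[u][v] ? excess[u] : cap[u][v];
                    cap[u][v] -= d;  cap[v][u] += d;
                    excess[u] -= d;  excess[v] += d;
                    activate(v, s, t);
                    pushed = 1;
                } else if (height[v] + 1 < lowest) {
                    lowest = height[v] + 1;
                }
            }
            if (excess[u] > 0 && !pushed) {
                if (lowest == INF) break;    /* no residual edge out of u      */
                height[u] = lowest;          /* Relabel                        */
            }
        }
    }
    return excess[t];                        /* value of the maximum flow      */
}

int main(void) {
    /* Toy capacities (hypothetical 6-node textbook example).                 */
    cap[0][1] = 16; cap[0][2] = 13; cap[1][2] = 10; cap[2][1] = 4;
    cap[1][3] = 12; cap[2][4] = 14; cap[3][2] = 9;  cap[4][3] = 7;
    cap[3][5] = 20; cap[4][5] = 4;
    printf("max flow = %d\n", max_flow(0, 5));       /* expected: 23           */
    return 0;
}

In the parallel version the slide describes, the FIFO disappears: each pulse discharges all currently active nodes concurrently inside a spawn block, barriers separate pulses, and hardware atomic operations both accumulate excess safely and append newly activated nodes to the next active set.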

Input Graph Families

- Performance is highly dependent on the structure of the graph
- Graph structures proposed in the DIMACS challenge [DIMACS90]
  - Used by virtually every Max-Flow publication

Dataset     Description                      Nodes     Edges
ADG         Acyclic Dense Graph              1200      719400
RLG         Washington Random Level Graph    131074    391168
RMF-WIDE    GenRMF Wide Graph                8192      23040
RMF-LONG    GenRMF Long Graph                8192      22464
RANDOM      Random Graph                     65536     96759

8

Speed-Up Results

- Compared to the “best serial implementation”, running on a recent x86 processor [Goldberg2006]
- Clock cycle count speedups:
  Speedup = ClockCycles(SerialMaxflow on x86) / ClockCycles(ParallelMaxflow on XMT)
- Two XMT configurations:
  - XMT.64: 64-core FPGA prototype
  - XMT.1024: 1024-core configuration on the cycle-accurate simulator XMTSim
- Speedups: 1.56x to 108.3x for XMT.1024

Clock-cycle speedups vs. the serial baseline (from the results chart):

Dataset     XMT.1024   XMT.64
ADG         7.95       2.83
RLG-WIDE    16.19      1.70
RMF-WIDE    1.76       1.09
RMF-LONG    1.56       0.88
RANDOM      108.33     8.10

9

Conclusion

- XMT aims at being an easy-to-program, general-purpose architecture
- Performance improvements on hard-to-parallelize applications like Max-Flow
- Ease of programming: shown by order-of-magnitude improvement in ease of teaching/learning
  - Achieved difficult speedups at a much earlier developmental stage (10th graders in HS versus graduate students)
  - UCSB/UMD experiment, Middle School, Magnet HS, Inner City HS, freshmen course, UIUC/UMD experiment: J. Sys. & SW '08, SIGCSE '10, EduPar '11
- Current stage of the XMT project: develop more complex applications beyond benchmarks
  - Max-Flow is a step in that direction
  - More are needed
- Without an easy-to-program many-core architecture, rejection of parallelism by mainstream programmers is all but certain
- Affirmative action: drive more researchers to work and seek publications on easy-to-program architectures
  - This work should not be dismissed as ‘too easy’



Thank you!

10