The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?

Krste Asanovic, Ras Bodik, Eric Brewer, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, Dave Patterson, Koushik Sen, David Wessel, and Kathy Yelick

UC Berkeley Par Lab

June 23, 2010


The Transition to Multicore

[Figure: Sequential App Performance]

P.S. Multicore Revolution Could Fail

John Hennessy, President, Stanford University:
  “…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.”
  “A Conversation with Hennessy & Patterson,” ACM Queue Magazine, 1/07.

100% failure rate of Parallel Computer Companies
  Ardent, Convex, Encore, Inmos (Transputer), MasPar, nCUBE, Kendall Square Research, Sequent, Tandem, Thinking Machines

What if IT goes from a growth industry to a replacement industry?
  If SW can’t effectively use 32, 64, ... cores per chip
  => SW no faster on new computer
  => Only buy if computer wears out
  or some people buy cheaper PC (netbook)

Need a Fresh Approach to Parallelism

Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism
  Krste Asanovic, Eric Brewer, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Dave Patterson, Koushik Sen, Kathy Yelick, …
  Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
Tried to learn from successes in high-performance computing (LBNL) and parallel embedded (BWRC)
Led to “Berkeley View” Tech. Report 12/2006 and new Parallel Computing Laboratory (“Par Lab”)
2008: after open competition, Intel/MS award $10M
Goal: Productive, Efficient, Correct, Portable SW for 100+ cores & scale as cores increase every 2 years (!)

Need a Fresh Approach to Parallelism

Past parallel projects often dominated by hardware/architecture
  This is the one true way to build computers: software must adapt to this breakthrough
  ILLIAC IV, Thinking Machines CM-2, Transputer, Kendall Square KSR-1, Silicon Graphics Origin 2000 …
Or sometimes by programming language
  This is the one true way to write programs: hardware must adapt to this breakthrough
  Id, Backus Functional Language FP, Occam, Linda, High Performance Fortran, Chapel, X10, Fortress …
Apps usually an afterthought

Par Lab’s original “bets”

Let compelling applications drive research agenda
Software platform: data center + mobile client
Identify common programming patterns
Productivity versus efficiency programmers
Autotuning and software synthesis
Build in correctness + power/performance diagnostics
OS/Architecture support applications, provide primitives not pre-packaged solutions
FPGA simulation of new parallel architectures: RAMP
Co-located integrated collaborative center

Above all, no preconceived big idea - see what works driven by application needs.


Par Lab Research Overview

[Figure: Par Lab research stack. Goal: easy to write portable code that runs efficiently on manycore. Box labels: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Design Patterns/Motifs; Productivity Languages, Parallel Libraries, Parallel Frameworks, Selective Embedded JIT Specialization; Efficiency Languages, Efficiency Language Compilers, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives; Correctness (Dynamic Checking, Debugging with Replay, Directed Testing); Diagnosing Power/Performance; Legacy OS, OS Libraries & Services, Hypervisor; Multicore/GPGPU, ParLab Manycore/RAMP]

Par Lab, ~2 years in

How are our bets working out?
What big new ideas are emerging?
Where are we going in next 3 years?

Dominant Application Platforms

Data Center or Cloud (“Server”)
Laptop/Handheld (“Mobile Client”)
Both together (“Server+Client”)
  ParLab-RADLab/AMPLab collaborations
Par Lab focuses on mobile clients
  But many technologies apply to data center

Shift in Key Platforms:
  ‘90s: Desktop/workstation
  ‘00s: Laptops
  ‘10s: Smartphone/Tablet/TV + Cloud
  Implications? Smaller clients, fewer cores/client? More need for intra-task parallelism in data centers?

Music and Audio Applications (David Wessel)

Musicians have an insatiable appetite for computation + real-time demands
  More channels, instruments, more processing, more interaction!
  Jitter-free latency must be low (<5 ms)
  Must be reliable (No clicks!)
Novel Instruments & User Interfaces
  New composition and performance systems beyond keyboards
  Input devices for Laptop/Handheld
Enhanced sound delivery systems using large microphone and speaker arrays
Music Information Retrieval (MIR) systems for client & cloud


Health Application: Stroke Treatment (Tony Keaveny)

Stroke treatment time-critical, need supercomputer performance in hospital
Image scan, blood flow analysis, then with stroke, then simulate blood thinner
200,000 cases per year in US
  20% not treated since too long after
  Potentially, 30% of those benefit

Recent progress: Build simplified 1.5D “Strokeflow” model as test case for ParLab SEJITS approach.



Content-Based Image Retrieval (Kurt Keutzer)

[Figure: query by example over an image database of 1000’s of images; a similarity metric yields candidate results, and relevance feedback refines them into the final result]

Built around key characteristics of personal databases
  Very large number of pictures (>5K)
  Non-labeled images
  Many pictures of few people
  Complex pictures including people, events, places, and objects

SVM & Damascene code: ~10-100x speedups over existing implementations, 100s of downloads. Project expanded to greater set of machine vision applications.


Robust Speech Recognition (Nelson Morgan)

Meeting Diarist
  Laptops/Handhelds at meeting coordinate to create speaker-identified, partially transcribed text diary of meeting
Use cortically-inspired manystream spatio-temporal features to tolerate noise

Progress: Parallelized all main ASR components
Next: Parallelization of diarization code, SEJITS

Parallel Browser (Ras Bodik)

Original goal: Desktop-quality browsing on handhelds (enabled by 4G networks, better output devices)
Now: Better development environment for new browser-based applications
Language for layout and behavior
Specifying CSS layout semantics
Layout engines via synthesis and autotuning
Efficient parallel parsing components

Highlights: Identified 7 major browser-based app classes and their needs. Developing very high-level (constraints-based) layout programming model to help productivity.

More Applications: “Beating down our doors!”

New external application collaborators:
  Computer Vision (Jitendra Malik, Thomas Brox)
  Computational Finance (Matthew Dixon @UCD)
  Speech (Dorothea Kolossa @TU Berlin)
  Natural Language Translation (Dan Klein)
  Programming multitouch interfaces (Maneesh Agrawala)
  Protein Docking (Henry Gabb, Intel)
  Pediatric MRI (Michael Lustig, Shreyas Vasanawala @Stanford)


Fast Pediatric MRI

Pediatric MRI is difficult
  Children cannot keep still or hold breath
  Must put children under anesthesia for long exams: risky & costly
Need techniques to accelerate MRI acquisition (sample & multiple sensors)
Reconstruction must also be fast, or time saved in acquisition is lost in compute
  Current reconstruction time: 2 hours
  Non-starter for clinical use
Mark Murphy (Par Lab) starts on reconstruction 9/09: Pick SW Patterns, Good SW Architecture, Good Algorithms, Autotuning
Now 1 minute (100X): Fast enough for radiologist to decide
Starting 3/10: in use for clinical study by Dr. Shreyas Vasanawala at Lucile Packard Children's Hospital at Stanford


Types of Programming (or “types of programmer”)

Domain-Level (No CS courses)
  Example languages: Max/MSP, SQL, CSS/Flash/Silverlight, Matlab, Excel, Rails
  Example activities: Builds app with DSL and/or by customizing app framework
Productivity-Level (Some CS courses)
  Example languages: Python/Ruby/Lua, Scala, Haskell/OCaml/F#
  Example activities: Uses programming frameworks, writes application frameworks (or apps)
Efficiency-Level (MS in CS)
  Example languages: C/C++/FORTRAN, assembler, Java/C#
  Example activities: Uses hardware/OS primitives, builds programming frameworks (or apps)
Hardware/OS
  Example activities: Provides hardware primitives and OS services

Where to make parallelism visible?

Not in a Domain-Specific Language
  Should focus on making domain experts productive
  Too many domains, new domains, multiple domains/app
Not in a new general-purpose parallel language
  An oxymoron?
  Won’t get adopted.
  Most big applications written in >1 language.
Par Lab: Pattern-based software components
  Components use any language or parallel prog. model
  Pattern-specific compilation to attain efficiency
  Flexible, efficient composition of separately developed software components




Motifs common across applications

[Figure: App 1, App 2, and App 3 built on shared Berkeley View Motifs (“Dwarfs”) such as Dense, Sparse, and Graph Trav.]

How do compelling apps relate to 12 motifs?

[Figure: Motif (nee “Dwarf”) Popularity heat map (Red Hot to Blue Cool)]

“Our” Pattern Language (OPL-2010) (Kurt Keutzer, Tim Mattson)

Applications

Structural Patterns: Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer, Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph

Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo

Concurrent Algorithm Strategy Patterns: Task-Parallelism, Divide and Conquer, Data-Parallelism, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation

Implementation Strategy Patterns
  Program structure: SPMD, Data-Par/index-space, Fork/Join, Actors, Loop-Par., Task-Queue
  Data structure: Distributed-Array, Shared-Data, Shared-Queue, Shared-map, Partitioned Graph

Parallel Execution Patterns: MIMD, SIMD, Thread-Pool, Task-Graph, Transactions

Concurrency Foundation constructs (not expressed as patterns): Thread creation/destruction, Process creation/destruction, Message-Passing, Collective-Comm., Transactional memory, Point-To-Point-Sync. (mutual exclusion), collective sync. (barrier), Memory sync/fence

[Figure annotations: A = M x V; Refine Towards Implementation]

Highlight: ~20 apps architected with patterns, patterns-to-frameworks talk later today. Wide uptake, 2nd ParaPLOP Workshop

Structural Patterns describe Component Composition

Structural patterns describe how components are assembled, e.g. Map-Reduce, Agent-and-Repository, Static-task-graph
Our belief: Any large software application must have a comprehensible software architecture describable as a hierarchy of patterns => hierarchy of components
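
As a minimal sketch of how a structural pattern assembles components, the Python example below composes two separately written components (a mapper and a reducer) with Map-Reduce. The names `map_reduce`, `count_words`, and `merge_counts` are hypothetical and do not come from the Par Lab code base; a standard-library thread pool stands in for whatever parallel runtime a real framework would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_reduce(mapper, reducer, chunks, initial, max_workers=4):
    """Structural pattern: run `mapper` on each chunk independently,
    then fold the partial results together with `reducer`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(mapper, chunks))  # map phase: independent tasks
    return reduce(reducer, partials, initial)      # reduce phase: combine results


def count_words(line):
    """Component 1 (hypothetical): count the words in one line."""
    return Counter(line.split())


def merge_counts(a, b):
    """Component 2 (hypothetical): merge two partial counts."""
    return a + b


lines = ["map reduce map", "reduce pipe filter", "map"]
totals = map_reduce(count_words, merge_counts, lines, Counter())
print(totals["map"], totals["reduce"], totals["filter"])
```

The pattern itself (`map_reduce`) knows nothing about words; swapping in a different mapper and reducer reuses the same composition.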

Mapping Patterns to Hardware

[Figure: App 1, App 2, and App 3 map through patterns (Dense, Sparse, Graph Trav.) onto hardware platforms (Multicore, GPU, “Cloud”)]

Only a few types of hardware platform

High-level pattern constrains space of reasonable low-level mappings

(Insert latest OPL chart showing path)

“Stovepipes”: pattern-specific and platform-specific compilers

[Figure: App 1, App 2, and App 3 use patterns (Dense, Sparse, Graph Trav.), each compiled directly to a platform (Multicore, GPU, “Cloud”)]

Allow maximum efficiency and expressibility in stovepipes by avoiding mandatory intermediary layers

Autotuning for Code Generation

Problem: generating optimized code is like searching for needle in haystack; use computers rather than humans

[Figure: search space for block sizes (dense matrix); axes are block dimensions, temperature is speed]

[Figure: performance chart with legend entries serial reference, OpenMP Comparison, Auto-parallelization, Auto-NUMA, Auto-tuning]

Auto-tuners approach: program generates optimized code and data structures for a “motif” (~kernel)
ParLab autotuners for stencils (e.g., images), sparse matrices, particle/mesh, collectives (e.g., “reduce”)
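
The block-size search described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `blocked_matmul` and `autotune` are hypothetical names, the kernel is pure Python rather than generated low-level code, and the search is exhaustive over a handful of candidates, whereas a real autotuner generates code variants and prunes a much larger space.

```python
import time


def blocked_matmul(a, b, n, bs):
    """Multiply two n x n matrices (lists of lists) using bs x bs blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + bs, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(kernel, args, candidates, repeats=3):
    """Time the kernel for every candidate block size.
    Returns (fastest candidate, {candidate: best time in seconds})."""
    timings = {}
    for bs in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*args, bs)
            best = min(best, time.perf_counter() - start)
        timings[bs] = best
    return min(timings, key=timings.get), timings


n = 48
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
best_bs, timings = autotune(blocked_matmul, (a, b, n), [4, 8, 16, 48])
print("fastest block size on this machine:", best_bs)
```

The winning block size depends on the machine, which is the point of the slide: the search is cheap for a computer to repeat on each platform and tedious for a human.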



SEJITS: “Selective, Embedded, Just-In-Time Specialization”

Use modern high-level scripting language (Python, Ruby) for productivity programming
Use language facilities to embed “specializers” that map high-level pattern to efficient low-level code at (develop/install/run)-time
Specializers can incorporate autotuners to generate tuned efficiency-level code for given platform
Efficiency programmers incrementally add specializers for (Pattern-Target) pairs. Fall back to scripting runtime if no specializer exists.
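
The lookup-and-fallback behavior in the last bullet can be sketched with Python decorators. Everything here is a hypothetical illustration, not the Par Lab implementation: the registry `SPECIALIZERS`, the decorators `specializer_for` and `specialize`, and the target names are invented, and the "specialized" variant is just a thread-pool version of map rather than generated and compiled low-level code.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical registry: (pattern, target) -> function that builds a variant.
SPECIALIZERS = {}


def specializer_for(pattern, target):
    """Register a specializer for one (pattern, target) pair."""
    def register(make_variant):
        SPECIALIZERS[(pattern, target)] = make_variant
        return make_variant
    return register


def specialize(pattern, target):
    """Mark a per-element kernel as an instance of `pattern` for `target`."""
    def decorate(kernel):
        def fallback(xs):
            # No specializer: run the kernel in the scripting runtime.
            return [kernel(x) for x in xs]

        state = {}

        def wrapper(xs):
            if "variant" not in state:  # decide once, on first call
                make_variant = SPECIALIZERS.get((pattern, target))
                state["variant"] = make_variant(kernel) if make_variant else fallback
                state["specialized"] = make_variant is not None
            return state["variant"](xs)

        wrapper.state = state
        return wrapper
    return decorate


@specializer_for("map", "threads")
def map_with_threads(kernel):
    # Stand-in for code generation: a real specializer would emit low-level code.
    def variant(xs):
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(kernel, xs))
    return variant


@specialize("map", "threads")
def square(x):
    return x * x


@specialize("map", "gpu")  # no specializer registered for this pair
def cube(x):
    return x * x * x


print(square([1, 2, 3]), square.state["specialized"])
print(cube([1, 2, 3]), cube.state["specialized"])
```

Both functions return correct results; only `square` runs through a specialized variant, which mirrors the idea that the productivity-level program stays valid whether or not a specializer exists.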


[Figure: SEJITS flow for a productivity app (.py) containing f(), @g(), and @h(). Labels: PLL Interp, SEJITS, Specializer, .c, cc/ld, .so, $ (cache), OS/HW. Callouts: Selective, Embedded, JIT, Specialization]

SEJITS makes tuning decisions per-function (not per-app)

SEJITS Main Ideas

1. Specializer == pattern-specific compiler
     exploit pattern-specific strategies that may not generalize
     target specific hardware per pattern
2. Can happen at runtime
3. Productivity Level Language (PLL) program always valid even without SEJITS support
     vs. incompatibly extended syntax; inspired by Dom. Spec. Embed. Lang. vs. DSL argument
4. Specializers can be written in PLL

Producing Software vs. Producing An Answer

SEJITS delivers adaptive parallel software
  HW variation: # cores, caches, DRAM, etc.
  runtime variation: resource availability (OS+HW)
  SEJITS is a highly productive way to produce exactly the code variants you need
SEJITS makes research code productive
  Exploit full libraries, tools, etc. of PLL
  Performance competitive with ELL code => run non-toy experiments
  Develop specializer to target new HW feature => Test designs with real apps

Correctness, Testing, and Debugging

Productivity Layer
  no low level data races, but non-determinism could still be present
  specify and verify semantic determinism and atomicity, i.e. there is no bug due to parallelism
  separates parallel correctness from functional correctness
Efficiency Layer
  data races, deadlocks, memory model related bugs, atomicity violations, livelocks
  Active testing: combines static and dynamic analyses so rare concurrency bugs discovered quickly and precisely
Debugging
  Record and replay
  Simplify a buggy concurrent trace
    reduce number of context switches
    reduce length of the trace
  Concurrent Breakpoints

Highlight:
  ACM SIGSOFT Distinguished Paper Awards at ICSE 09 and FSE 09
  IFIP TC2 Manfred Paul Best Paper Award at ICSE 10

Key Culprit: Nondeterminism

Determinism key to parallel correctness
  Same input ==> semantically same output
  Parallelism is wrong if some schedules give a correct answer while others don’t
Lightweight spec of parallel correctness
  Independent of functional specification
  Can effectively test deterministic specs
New Approach: Automatically infer deterministic specifications by observing sample program runs
  Result: Recovered previous manual specifications for most benchmarks
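
A small Python sketch can make "same input ==> semantically same output" concrete. It is an illustration only: schedules are imitated by rotating the order in which tasks run, not by controlling real threads, and `run_with_schedule`, `check_determinism`, and the two sample programs are hypothetical names rather than the tools cited in this talk.

```python
from collections import Counter


def run_with_schedule(tasks, shift):
    """Imitate one schedule: run the tasks in a rotated order."""
    order = tasks[shift:] + tasks[:shift]
    return [task() for task in order]


def check_determinism(program, same, schedules):
    """True if every schedule's output is semantically equal to the first."""
    baseline = program(schedules[0])
    return all(same(baseline, program(s)) for s in schedules[1:])


TASKS = [lambda n=n: n * n for n in range(4)]


def collect_squares(shift):
    """Results arrive in schedule order; the collection as a whole is what matters."""
    return run_with_schedule(TASKS, shift)


def last_writer_wins(shift):
    """Buggy program: the answer is whichever task happened to finish last."""
    return run_with_schedule(TASKS, shift)[-1]


def same_multiset(x, y):
    """Semantic equality for an unordered collection of results."""
    return Counter(x) == Counter(y)


def identical(x, y):
    return x == y


schedules = [0, 1, 2, 3]
print(check_determinism(collect_squares, same_multiset, schedules))
print(check_determinism(collect_squares, identical, schedules))
print(check_determinism(last_writer_wins, identical, schedules))
```

The first program is deterministic under the semantic comparison even though its raw output order varies; the second gives different answers under different schedules, which is the kind of parallelism bug the slide describes. Neither check says anything about whether the squares themselves are the right function, which is why the spec is independent of functional correctness.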

Tessellation OS: Space-Time Partitioning + 2-Level Scheduling

1st level: OS determines coarse-grain allocation of resources to jobs over space and time
2nd level: Application schedules component tasks onto available “harts” (hardware thread contexts) using Lithe

[Figure: space-time partitioning. Tasks in Address Space A and Address Space B are scheduled at the 2nd level above the Tessellation Kernel (Partition Support), which runs on CPUs with L1 caches, an L1 interconnect, L2 banks, DRAM, and a DRAM & I/O interconnect]

Prototype ROS running on RAMP Gold and Nehalem-x86
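
The two scheduling levels described above can be sketched as two small Python functions. This is a toy model under stated assumptions: the first-come allocation policy, the round-robin placement, and the names `first_level` and `second_level` are hypothetical, and nothing here uses the Tessellation or Lithe interfaces.

```python
def first_level(total_harts, requests):
    """OS level: coarse-grain allocation of harts to jobs.
    Hypothetical policy: grant each job up to its request, in order,
    until the harts run out."""
    grants, free = {}, total_harts
    for job, wanted in requests.items():
        grants[job] = min(wanted, free)
        free -= grants[job]
    return grants


def second_level(tasks, harts):
    """Application level: place this job's own tasks onto the harts it
    was granted, round-robin. Returns {hart index: [tasks]}."""
    if harts == 0:
        return {}
    placement = {h: [] for h in range(harts)}
    for i, task in enumerate(tasks):
        placement[i % harts].append(task)
    return placement


requests = {"audio": 2, "browser": 4, "vision": 4}
grants = first_level(8, requests)
plan = second_level(["decode", "mix", "filter", "output", "meter"], grants["audio"])
print(grants)
print(plan)
```

The OS never sees individual tasks and the application never sees other jobs' harts, which is the separation the two-level design is after.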

Software must be adaptive

Environment-adaptive
  Number of cores, speed of cores, soft errors, multiprogramming, battery life
  SPMD/Static load-balancing/static mapping, things of the past
Input-adaptive
  Degree of parallelism depends on input parameters
Adaptation managed by:
  OS resource manager (“how many resources best for this app?”)
  Real-time app (“how many resources to meet deadline?”)
  Quality-adjustable app (“what’s possible with these resources?”)
Need hardware measurement facilities


Par Lab “Multi-Paradigm” Architecture

Single “Fat” ILP-focused Tile Control Processor
Multiple “Thin” Lane Control Processors embedded in vector-thread lane

[Figure: a tile contains a core with a Fat Tile Control Processor (ILP), L1I$ and L1D$, several Vector-Thread Lanes each with a Thin Scalar Control Proc., and a Tile-Private L2U$; tiles connect to a Shareable L3$/LL$]

Tile Control Processor, Lane Control Processor, and Vector-Thread microthreads all run the same ISA, but microarchs optimized for different forms of parallelism

Par Lab Architecture research using a “superset” architecture that can model various forms of scalar + vector machine, and various forms of memory hierarchy and interconnect.

Hardware Measurement: This we believe

Parallel HW/SW must support performance-portable parallel software
If you expect programmers to continue “Moore’s Law” by doubling amount of portable parallelism in programs every 2 years*, need hardware measurement for them to see how well they are doing
  During development inside an IDE
  During runtime so that app, resource scheduler, and OS can see and adapt

* Shekhar Borkar, Intel, CNET News, 2009

RAMP Gold

Rapid accurate simulation of manycore architectural ideas using FPGAs
Initial version models 64 cores of SPARC v8 with shared memory system on $750 board
Hardware FPU, MMU, boots OS.

                     Cost            Performance (MIPS)   Simulations per day
Software Simulator   $2,000          0.1 - 1              1
RAMP Gold            $2,000 + $750   50 - 100             100

Highlight: RAMP Gold in production use (ISCA/HotPar/DAC papers based on RAMP Gold results).

Co-located Collaborative Center Approach

Continual collaboration sometimes hard work, but very beneficial. Many ideas emerge from, or are honed by, cross-subproject discussions.
  E.g. Eigensolvers for Damascene application
  E.g. Audio framework built on Lithe
  E.g. Sparse-matrix language for library development
Integration with other components/layers forces real implementations not just paper prototypes
  Whole stack demo on RAMP Gold
Can’t see how to have made this much progress otherwise


Par Lab Stats

16 faculty, 62 PhD students, 4 postdocs
125+ papers
  Par Lab cover story, CACM, Oct. ‘09
  “The Trouble with Multicore,” IEEE Spectrum, July ‘10
Workshops / Summer Courses @ UCB
  UCB “Bootcamp”: 196 attend 2008, 397 in 2009
  Next is August 16-18, 2010; see Par Lab Web Site
2 Founding Companies: Intel and Microsoft
6 Affiliates: National Instruments, NEC, Nokia, NVIDIA, Oracle, and Samsung

Par Lab Summary

Multicore challenge hardest for CS in 50 years: Performance “Moore’s Law” up to programmers!
  Unveil 2X parallelism/program every 2 years
Software platform: mobile client + cloud
Identify common programming patterns to reveal parallelism and use good software architecture
Target productivity versus efficiency programmers
Autotuning & Selective Embedded JIT Specialization
Active Testing
OS/Architecture support isolation, measurement
FPGA simulation of new parallel architectures: RAMP
No preconceived silver bullet; let apps decide
Already many apps with significant speedup



Backup Slides & References

Armbrust, M., A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, “A View of Cloud Computing,” Communications of the ACM, 53:3, March 2010.

Asanović, K., R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, K. Yelick, “A View of the Parallel Computing Landscape,” Communications of the ACM, 52:10, October 2009.

J. Burnim and K. Sen. Asserting and checking determinism for multithreaded programs. Communications of the ACM, 53(6), June 2010.

Catanzaro, B., A. Fox, K. Keutzer, D. Patterson, B-Y. Su, M. Snir, K. Olukotun, P. Hanrahan, and H. Chafi, “Ubiquitous Parallel Computing from Berkeley, Illinois and Stanford,” IEEE Micro, March/April 2010.

Catanzaro, B., S. Kamil, Y. Lee, K. Asanović, J. Demmel, K. Keutzer, J. Shalf, K. Yelick, and A. Fox, “SEJITS: Getting Productivity and Performance with Selective Embedded JIT Specialization,” 1st Workshop on Programmable Models for Emerging Architecture at the 18th Int’l Conf. on Parallel Architectures and Compilation Techniques, Raleigh, North Carolina, November 2009.

J. A. Colmenares, S. Bird, H. Cook, P. Pearce, D. Zhu, J. Shalf, S. Hofmeyr, K. Asanovic, and J. Kubiatowicz. Resource management in the Tessellation manycore OS. HotPar’10, Berkeley, CA, USA, June 2010.

Tan, Z., A. Waterman, S. Bird, H. Cook, K. Asanović, and D. Patterson, “A Case for FAME: FPGA Architecture Model Execution,” Proc. ISCA, June 2010.

Par Lab Apps

What are the compelling future workloads?
  Need apps of future vs. legacy to drive agenda
  Improve research even if not the real killer apps
Computer Vision: Segment-Based Object Recognition, Poselet-Based Human Detection
Health: MRI Reconstruction, Stroke Simulation
Music: 3D Enhancer, Hearing Aid, Novel UI
Speech: Automatic Meeting Diary
Video Games: Analysis of Smoke 2.0 Demo
Computational Finance: Value-at-Risk Estimation, Crank-Nicolson Option Pricing
Parallel Browser: Layout, Scripting Language


Developing Parallel Software

Conventional: Measure and recode slow pieces with more threads
Par Lab: Find right SW architecture that reveals parallelism

[Figure: development flow chart with boxes Application Specification; SW Arch. to Identify Parallelism; Thought Experiment: Map SW Arch. to HW Arch.; Write / Debug Code; Performance profile; branches Not fast enough and Fast enough; Ship it]

Correctness, Testing, and Debugging

Active testing: combines static and dynamic analyses so that rare concurrency bugs are discovered quickly and precisely
  Burnim, Sen “Asserting and Checking Determinism for Multithreaded Programs”
    ACM SIGSOFT Distinguished Paper Award + CACM Research Highlight Invitation
  Naik, Park, Sen, Gay “Effective Static Deadlock Detection”
    ACM SIGSOFT Distinguished Paper Award
Actively control scheduler to force potentially buggy schedules: Data races, Atomicity Violations, Deadlocks
Found parallel bugs in real production OSS code: Apache Commons Collections, Java Collections Framework, Java Swing GUI framework, and Java Database Connectivity (JDBC)




SHOT Functional Requirements

Standardized Hardware Operation Tracker: SHOT
Low latency reads so deployed in production code
Can be read by OS and by user apps
To be used by virtual machines, must be able to save and restore as part of context switch
Since some counters are per core, SW must read all counters as if on same clock edge
Don’t need to be perfect counts, just consistent: accuracy ± 1% OK

Minimum SHOT Architecture

1. Global real time clock (vs. count clock cycles)
     Since clock rate varies due to Dynamic Voltage and Frequency Scaling (DVFS)
     ~ 100 MHz (fast enough for apps)
2. Count number of instructions retired per core
     Measure computation throughput
3. Count off-chip memory traffic (incl. prefetching)
     Key to performance and energy
4. Standard so apps and OS can rely on them
     Standardized Hardware Measurement as important as IEEE Floating Point Standard?
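
To show how software might use the three minimum counters, here is a hedged Python sketch that turns two consistent snapshots into rates. The `Snapshot` type, the field names, the choice of bytes as the traffic unit, and the sample numbers are all assumptions made for illustration; only the roughly 100 MHz real-time clock comes from the slide.

```python
from dataclasses import dataclass

CLOCK_HZ = 100_000_000  # ~100 MHz global real-time clock, per the slide


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical reading of the counters, taken as if on the same clock edge."""
    clock_ticks: int           # global real-time clock, independent of DVFS
    instructions_retired: int  # for one core
    offchip_bytes: int         # off-chip memory traffic (unit assumed to be bytes)


def rates(before, after):
    """Convert two snapshots into throughput and memory-traffic rates."""
    seconds = (after.clock_ticks - before.clock_ticks) / CLOCK_HZ
    return {
        "seconds": seconds,
        "instr_per_sec": (after.instructions_retired - before.instructions_retired) / seconds,
        "bytes_per_sec": (after.offchip_bytes - before.offchip_bytes) / seconds,
    }


before = Snapshot(1_000_000, 5_000_000, 2_000_000)          # made-up sample values
after = Snapshot(51_000_000, 1_005_000_000, 402_000_000)    # made-up sample values
r = rates(before, after)
print(r)
```

Dividing by real-time ticks rather than core cycles keeps the rates meaningful when DVFS changes the clock frequency between the two readings.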





Make productivity programmers efficient, and efficiency programmers productive?

Autotuning problem: The search space is large, taking a lot of cycles and a long time to explore
  Search Full Parameter Space: More than 180 Days
Using machine learning + few performance counters to democratize autotuning
  12 minutes to find solution
  As good as or even beats the expert-designed autotuner!
    -1% and 16% for a 7-pt Stencil
    -2% and 15% for a 27-pt Stencil
    18% and 50% for dense matrix
Enables even greater range of optimizations than we imagined

Why might we succeed this time?

No Killer Microprocessor to Save Programmers
  No one is building a faster serial microprocessor
  For programs to go faster, SW must use parallel HW
New Metrics for Success vs. Linear Speedup
  Real Time Latency/Responsiveness and/or MIPS/Joule
  Just need some new killer parallel apps vs. all legacy SW must achieve linear speedup
Necessity: All the Wood Behind One Arrow
  Whole industry committed, so more working on it
  If future growth of IT depends on faster processing at same price (vs. lowering costs like NetBook)

Why might we succeed this time?

Multicore Synergy with Cloud Computing
  Cloud Computing apps parallel even if client not parallel
  Manycore is cost-reduction, not radical SW disruption
Vitality of Open Source Software
  OSS community more quickly embraces advances?
Single-Chip Multiprocessors Enable Innovation
  Enables inventions that were impractical or uneconomical when multiprocessors were 100s chips
FPGA prototypes shorten HW/SW cycle
  Fast enough to run whole SW stack, can change every day vs. every 4 to 5 years when do chips

Gesture-Enhanced User Interface

Using human gestures, motion, body pose to control a video game interface
  Natal-like interface using built-in laptop webcam, not add-on stereo-infrared
We are collaborating with Jitendra Malik, world-leader in Computer Vision on "Poselets"-based human detection and pose estimation
  Algorithmic improvements, GPU implementation: 30x speedup
  Detection running near real-time using a $5 webcam
Enrich the Windows user experience
  Using tracking body pose, allow user to interact with a 3D world: richer than touch-screen provides

Recent Results: Vision Acceleration

Bryan Catanzaro: Parallelizing Computer Vision (image segmentation)
Problem: Malik’s highest quality algorithm was 5.5 minutes / image on new PC
Good SW architecture + talk within Par Lab on to use new algorithms, data structures
  Bor-Yiing Su, Yunsup Lee, Narayanan Sundaram, Mark Murphy, Kurt Keutzer, Jim Demmel, Sam Williams
Current result: 1.8 seconds / image on manycore
  ~ 150X speedup
  Factor of 10 quantitative change is a qualitative change
Malik: “This will revolutionize computer vision.”

RAMP Blue, July 2007

1008 modified MicroBlaze cores
90MHz
RTL directly mapped to FPGA
Runs UPC version of NAS parallel benchmarks.
Message-passing cluster
No MMU
Requires lots of hardware
  21 BEE2 boards / 84 FPGAs
Difficult to modify
High clock rate but low IPC, particularly if want to model timing

Alexander’s Pattern Language

Christopher Alexander’s approach to (civil) architecture:
  A pattern is a generalizable solution to a recurring problem.
  "Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice." Page x, A Pattern Language, Christopher Alexander
Alexander’s 253 (civil) architectural patterns range from the creation of cities (2. distribution of towns) to particular building problems (232. roof cap)
A pattern language is an organized way of tackling an architectural problem using patterns
Main limitation:
  It’s about civil not software architecture!!!

SEJITS Prototypes

Copperhead (Bryan Catanzaro, Michael Garland)
  Pattern/target: data parallel programming/GPU
  PSC finds opportunities to unroll loops, fuse operations, avoid data structure conversions, ....
Asp* + PySKI (Shoaib Kamil, Erin Carson et al.)
  Patterns: captured by OO classes & higher-order methods; structured grid, lambda application (map), sparse matrix, others to follow
  Target: x86 multicore with OpenMP, or similar
LL (Gilad Arnold, Ras Bodík et al.)
  Domain-specific language + compiler
  Patterns: sparse matrix linear algebra

* “Asp is SEJITS for Python”

One Decade of SAME (Software Architecture Model Execution)

            Median Instructions Simulated/Benchmark   Median #Cores   Median Instructions Simulated/Core
ISCA 1998   267M                                      1               267M
ISCA 2008   825M                                      16              100M

[“A Case for FAME”, ISCA 2010]

Dimensions in FAME (FPGA Architecture Model Execution)

Direct: One target cycle executed in one FPGA host cycle
Decoupled: One target cycle takes one or more FPGA cycles

Full RTL: Complete RTL of target machine modeled
Abstract RTL: Partial/simplified RTL, split functional/timing

Host single-threaded: One target model per host pipeline
Host multi-threaded: Multiple target models per host pipeline


Host Multithreading

[Figure: a target model with CPU 1 through CPU 4 mapped onto a multithreaded emulation engine (on FPGA): a single hardware pipeline (I$, IR, X, Y, D$) with multiple copies of CPU state (four PCs, four GPR files)]

Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
Hides emulation latencies (e.g., communicating across FPGAs)
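
The idea of one pipeline serving several copies of CPU state can be sketched functionally in Python. This is a toy model under stated assumptions: `TargetCPU`, `step`, `run_host_multithreaded`, and the one-instruction "addi" programs are hypothetical, and the sketch models only functional behavior, with no timing, caches, or FPGA detail.

```python
class TargetCPU:
    """Architectural state for one modeled CPU: a PC and a small register file."""
    def __init__(self):
        self.pc = 0
        self.regs = [0] * 4


def step(cpu, program):
    """One pass through the shared pipeline on behalf of one target CPU."""
    op, rd, imm = program[cpu.pc % len(program)]
    if op == "addi":
        cpu.regs[rd] += imm
    cpu.pc += 1


def run_host_multithreaded(programs, target_cycles):
    """Interleave all target CPUs round-robin through a single pipeline.
    Each target cycle costs one host cycle per modeled CPU."""
    cpus = [TargetCPU() for _ in programs]
    host_cycles = 0
    for _ in range(target_cycles):
        for cpu, program in zip(cpus, programs):  # switch CPU state every host cycle
            step(cpu, program)
            host_cycles += 1
    return cpus, host_cycles


# Four target CPUs, each looping over one hypothetical instruction.
programs = [[("addi", 0, k + 1)] for k in range(4)]
cpus, host_cycles = run_host_multithreaded(programs, target_cycles=5)
print([cpu.regs[0] for cpu in cpus], host_cycles)
```

Only the per-CPU state is replicated; the `step` logic exists once, which is the resource saving the slide refers to. The interleaving also gives each target CPU several host cycles between its own instructions, the slack a real engine would use to hide emulation latencies.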

Composable Resource Management (Lithe + OS)

[Figure: App 1 (Module 1, Module 2, Module 3) and App 2 sit on the Lithe User-Level Scheduling ABI, above Tessellation OS, above Hardware Cores]

Resource Management for Real-Time

[Figure: stack of Hardware Cores, Tessellation OS, and Lithe User-Level Scheduling ABI, with App 1 (Module 1, Module 2) and a separate Real-Time Module 3]