
#197 MAPLD 2004

CARMA: A Comprehensive Management Framework for High-Performance Reconfigurable Computing

Ian A. Troxel, Aju M. Jacob, Alan D. George, Raj Subramaniyan, and Matthew A. Radlinski

High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering
University of Florida
Gainesville, FL


CARMA Motivation


Key missing pieces in RC for HPC:
- Dynamic RC fabric discovery and management
- Coherent multitasking, multi-user environment
- Robust job scheduling and management
- Design for fault tolerance and scalability
- Heterogeneous system support
- Device-independent programming model
- Debug and system health monitoring
- System performance monitoring into the RC fabric
- Increased RC device and system usability

Our proposed Comprehensive Approach to Reconfigurable Management Architecture (CARMA) attempts to unify existing technologies as well as fill in the missing pieces.

(Image: "Holy Fire" by Alex Grey)


CARMA Framework Overview


CARMA seeks to integrate:
- Graphical user interface
- Flexible programming model
- COTS application mapper(s): Handel-C, Impulse-C, Viva, System Generator, etc.
- Graph-based job description: DAGMan, Condensed Graphs, etc.
- Robust management tool
  - Distributed, scalable job scheduling
  - Checkpointing, rollback and recovery
  - Distributed configuration management
- Multilevel monitoring service (GEMS)
  - Networks, hosts, and boards
  - Monitoring down into the RC fabric
- Device-independent middleware API
  - Multiple types of RC boards: PCI (many), network-attached, Pilchard
  - Multiple high-speed networks: SCI, Myrinet, GigE, InfiniBand, etc.

(Figure: CARMA node overview — user interface, algorithm mapping, RC cluster management, performance monitoring, and middleware API layered over the COTS processor, RC fabric API, and RC fabric, with control and data networks to other nodes.)

Application Mapper Evaluation


Evaluating on the basis of ease of use, performance, hardware device independence, programming model, parallelization support, resource targeting, network support, stand-alone mapping, etc.

C-based tools:
- Celoxica SDK (Handel-C)
  - Provides access to in-house boards: ADM-XRC (x1), Tarari (x4), RC1000 (x4)
  - Good deal of success after lessons learned
  - Hardware-design focused
- Impulse Accelerated Technologies Impulse-C
  - Provides an option for hardware independence
  - Built upon the open-source Streams-C from LANL
  - Supports ANSI-standard C

Graphical tools:
- StarBridge Systems Viva
- Nallatech Fuse / DIMEtalk
- Annapolis Micro Systems CoreFire

Xilinx ISE compulsory:
- Evaluating the role of JBits, System Generator, and XHWIF

Evaluations still ongoing; the programming model is a fundamental issue to be addressed.

(Image: Streams-C, c/o LANL)


CARMA Interface


Simple graphical user interface:
- Preliminary basis for the graphical user interface via the Simple Web Interface Link Library (SWILL) from the University of Chicago*
- User view for authentication and job submission/status
- Administration view for system status and maintenance

Applications supported:
- Single or multiple tasks per job (via CARMA DAGs**)
- CARMA-registered (via the CARMA API and DAGs) or not; registration provides security and fault tolerance
- Sequential and parallel (hand-coded or via MPI)

C-based application mappers supported:
- CARMA middleware API provides architecture independence
- Any code that can link to the CARMA API library can be executed (Handel-C and the ADM-XRC API tested to date)
- Bit files must be registered with the CARMA Configuration Manager (CM)
- All other mappers can use "not CARMA registered" mode
- Plans for linking Streams/Impulse-C, System Generator, et al.

* http://systems.cs.uchicago.edu/swill/
** Similar to Condor DAGs
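Since any code able to link against the CARMA API library can be executed, the sketch below shows roughly what a CARMA-registered host task might look like. The carma_* names are illustrative assumptions (the slides do not give the actual API), and trivial software stubs stand in for the middleware so the sketch compiles and runs; a real task would link against the CARMA library and reach the board through the BIM.

```c
/* Hedged sketch of a host task written against a CARMA-like middleware API.
 * The carma_* names are assumptions, not the real CARMA API; software stubs
 * stand in for the library so the example runs anywhere. */
#include <stdio.h>

static int carma_configure(const char *bitfile)   /* would ask the CM/BIM */
{
    printf("requesting registered bit file %s\n", bitfile);
    return 0;
}

static int carma_execute(int in)                   /* would run on the RC board */
{
    return in + 1;                                 /* stand-in for AddOne.bit */
}

int main(void)
{
    /* Bit files must be registered with the CARMA Configuration Manager;
     * an unregistered mapper would instead run in "not CARMA registered" mode. */
    if (carma_configure("AddOne.bit") != 0)
        return 1;

    int result = carma_execute(41);
    printf("AddOne(41) = %d\n", result);
    return 0;
}
```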


CARMA User Interface


CARMA Job Manager (JM)


Prototyping effort (CARMA interoperability):
- Completed first version of the CARMA JM
- Task-based execution via Condor-like DAGs
- Separate processes and message queues for fault tolerance
- Checkpointing enabled, with rollback in progress
- Links to all other CARMA components
- Fully distributed multi-node operation with job/task migration
- Links to the CARMA monitor and GEMS to make scheduling decisions
- Tradeoff studies and analyses underway

External extensions to COTS tools (COTS plug and play):
- Expand upon preliminary work at GWU/GMU*
- Striving for a "plug and play" approach to the JM
- CARMA Monitor provides board info (via ELIM)
- Working to link to the CARMA CM
- Tradeoff studies and analyses underway
- Integration of other CARMA components in progress

(Figure c/o GWU/GMU)

* Kris Gaj, Tarek El-Ghazawi, et al., "Effective Utilization and Reconfiguration of Distributed Hardware Resources Using Job Management Systems," Reconfigurable Architectures Workshop 2003, Nice, France, April 2003.

(Figure: CARMA DAG example — tasks Hyper.1 through Hyper.5 with numbered dependency edges among File1, File2, and stdout.)
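The slides describe task-based execution via Condor-like DAGs; as a concrete illustration, the sketch below captures the Parallel Add Test from the verification slide as a tiny in-memory task graph. The struct layout is an assumption made for illustration, not the actual CARMA DAG format.

```c
/* Sketch of how a CARMA DAG (similar to a Condor DAG, per the slides) might
 * be described in data; the struct layout is an illustrative assumption. */
#include <stdio.h>

typedef struct task {
    const char *name;            /* executable or bit file                    */
    int is_rc;                   /* 1 = RC (bit file) task, 0 = CPU-only task */
    int ndeps;
    const struct task *deps[4];  /* parent tasks that must finish first       */
} task_t;

int main(void)
{
    /* Parallel Add Test from the verification slide: a CPU-only adder
     * feeding an RC "add one" task. */
    task_t add    = { "ADD.exe",    0, 0, { 0 } };
    task_t addone = { "AddOne.bit", 1, 1, { &add } };
    task_t *dag[] = { &add, &addone };

    for (unsigned i = 0; i < sizeof dag / sizeof dag[0]; i++) {
        printf("task %-12s %s, depends on:", dag[i]->name,
               dag[i]->is_rc ? "(RC)" : "(CPU)");
        for (int d = 0; d < dag[i]->ndeps; d++)
            printf(" %s", dag[i]->deps[d]->name);
        printf(dag[i]->ndeps ? "\n" : " nothing\n");
    }
    return 0;
}
```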


CARMA CM Design


Builds upon previous design concepts*

Execution Manager (EM):
- Forks tasks from the JM and returns results to the JM
- Requests and releases configurations

Configuration Manager (CM):
- Manages configuration transport and caching
- Loads and unloads configurations via the BIM

Board Interface Module (BIM):
- Provides board independence
- Allows for configuration temporal locality benefits

Communication Module:
- Handles all inter-node communication

(Figure: CARMA CM design on a node — the Execution Manager, Configuration Manager, and Communication module use inter-process communication locally and the control network for file transfers to remote nodes; the CM spawns a BIM for each RC board and uses the CARMA Board Interface Language and board-specific APIs to configure the boards.)
Board Interface Module (BIM):
- Configures and interfaces with a diverse set of RC boards
  - Numerous PCI-based boards
  - Various interfaces for network-attached RC
- Instantiated at startup
- Provides hardware independence to higher layers
- Separate BIM for each supported board
- Simple standard interface to boards for remote nodes
- Enhances security by authenticating data and configurations
* U. of Glasgow (Rage), Imperial College in the UK, U. Washington, among others
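As an illustration of the board independence the BIM provides, the sketch below models a BIM as a per-board function table that upper layers call without touching any vendor API. The table layout and the dummy board are assumptions for illustration, not the CARMA Board Interface Language.

```c
/* Sketch of board independence via a per-board function table (one BIM per
 * supported board type); names and the dummy board are illustrative only. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *board;                               /* e.g. "RC1000"     */
    int (*configure)(const char *bitfile);           /* load a bit file   */
    int (*run)(const void *in, void *out, size_t n); /* exchange data     */
} bim_ops_t;

/* A software-only dummy board so the sketch runs anywhere; a real BIM would
 * wrap a vendor API (RC1000, ADM-XRC, Tarari, ...) behind the same table. */
static int dummy_configure(const char *bitfile)
{
    printf("[dummy BIM] configuring with %s\n", bitfile);
    return 0;
}
static int dummy_run(const void *in, void *out, size_t n)
{
    memcpy(out, in, n);            /* pretend the fabric processed the data */
    return 0;
}
static const bim_ops_t dummy_bim = { "dummy", dummy_configure, dummy_run };

/* Upper layers (EM/CM) only ever see the table, never a board-specific API. */
int main(void)
{
    int in = 7, out = 0;
    const bim_ops_t *bim = &dummy_bim;   /* the CM picks the BIM per board */

    bim->configure("AddOne.bit");
    bim->run(&in, &out, sizeof in);
    printf("board %s returned %d\n", bim->board, out);
    return 0;
}
```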


Distributed CM Management Schemes

(Figure: Four distributed CM management schemes. Master-Worker (MW): jobs are submitted "centrally"; a global JM and global resource manager dispatch tasks and states to local resource monitors and collect results and statistics, keeping a global view of the system at all times. Client-Server (CS): jobs are submitted locally; a global resource manager houses the configurations and keeps a global view, exchanging tasks, configurations, requests, and statistics with local JMs and resource managers. Client-Broker (CB): jobs are submitted locally; the server brokers configurations, returning configuration pointers while the configurations themselves move between nodes. Simple Peer-to-Peer (SPP): jobs are submitted locally; peers exchange configuration requests and configurations directly.)

Note: More in-depth results for the distributed CM appeared at ERSA'04.


CM System Recommendations

System Constraint | System Size (number of nodes)
                  | < 8     | 8 to 32                  | 32 to 512                 | 512 to 1024                | 1024 to 4096
Latency bound     | Flat CS | CS over CS, group size 4 | SPP over CS, group size 4 | SPP over SPP, group size 8 | SPP over SPP, group size 16
Bandwidth bound*  | Flat CS | CS over CS, group size 4 | CS over CS, group size 8  | SPP over CS, group size 8  | SPP over CS, group size 8
Best overall      | Flat CS | CS over CS, group size 4 | SPP over CS, group size 4 | SPP over CS, group size 8  | SPP over CS, group size 8

Conclusions:
- The CARMA CM design imposes very little overhead on the system
- A hierarchical scheme is needed to scale to systems of thousands of nodes (traditional MW will not work)
- Multiple servers for the CS scheme do not reduce the server bottleneck for system sizes greater than 32
- SPP over CS (group size 8) gives the best overall performance for systems larger than 512 nodes

Scalability projected up to 4096 nodes:
- Performed an analytic scalability analysis based on 16-node experimental results
  - Dual 2.4 GHz Xeons and a Tarari CPX2100 HPC board in a 64/66 PCI slot
  - Gigabit Ethernet and 5.3 Gbps Scalable Coherent Interface (SCI) as the control and data networks, respectively
- A flat system of 4096 nodes has very high completion times (~5 minutes for SPP and ~83 hours for CS)
- A layered hierarchy is needed for reasonable completion times (~2.5 seconds for SPP over SPP at 4096 nodes)
- CS reduces network traffic by sacrificing response time, while SPP improves response time by increasing network utilization

* Schemes with completion latency values greater than 5 seconds are excluded.


CARMA Monitoring Services


Monitoring service:
- Statistics Collector
  - Gathers local and remote information
  - Updates GEMS* and local values
- Query Processor
  - Processes task scheduling requests from the JM
  - Maintains local information
- Round-Robin Database (RRD)
  - Compact way to store performance logs
  - Supports a simple query interface
- CARMA Diagnostic
  - System watchdog alerts based on defined heuristics of failure conditions
  - Provides system monitoring and debug

Initial monitor version is complete:
- Studying FPGA monitoring options
- Increasing the scheduling options
- Tradeoff studies and analyses underway

Initial CARMA Monitor Parameters:
A) Stats from the JM, ExMan, ConMan, BIM, and board
   - Dynamic statistics (push or pull)
   - Static statistics (pull)
B) Stats from remote nodes via GEMS
C) The Stat Collector passes info to the RRD from local and remote modules via the Query Processor
D) The JM queries the RRD for resource information to make scheduling decisions
E) The CARMA diagnostic tool performs system administration, debug, and optimization

(Figure: monitoring data flow among the JM, ConMan, ExMan, BIMs, FPGA boards, Stat Collector, GEMS, RRD, Query Processor, and CARMA Diagnostic, annotated with points A through E.)
* Gossip-Enabled Monitoring Service (GEMS), developed by the HCS Lab for robust, scalable, multilevel monitoring of resource health and performance. For more info see http://www.hcs.ufl.edu/gems
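As a rough illustration of the Round-Robin Database idea (a compact, fixed-size store for performance logs with a simple query interface), here is a minimal sketch assuming one circular buffer per monitored statistic; it is not the actual CARMA monitor code.

```c
/* Minimal sketch of a round-robin performance log, assuming one fixed-size
 * circular buffer per monitored resource; names and layout are illustrative. */
#include <stdio.h>
#include <time.h>

#define RRD_SLOTS 64    /* assumed history depth per statistic */

typedef struct {
    time_t when;
    double value;        /* e.g. board utilization, queue length */
} rrd_sample_t;

typedef struct {
    rrd_sample_t slot[RRD_SLOTS];
    int head;            /* next slot to overwrite          */
    int count;           /* number of valid samples so far  */
} rrd_t;

/* Statistics Collector side: overwrite the oldest slot (constant memory). */
static void rrd_update(rrd_t *r, double value)
{
    r->slot[r->head].when = time(NULL);
    r->slot[r->head].value = value;
    r->head = (r->head + 1) % RRD_SLOTS;
    if (r->count < RRD_SLOTS) r->count++;
}

/* Query Processor side: a simple query the JM might use for scheduling. */
static double rrd_mean(const rrd_t *r)
{
    double sum = 0.0;
    for (int i = 0; i < r->count; i++)
        sum += r->slot[i].value;
    return r->count ? sum / r->count : 0.0;
}

int main(void)
{
    rrd_t board_util = {0};
    rrd_update(&board_util, 0.30);   /* sample pushed by a BIM */
    rrd_update(&board_util, 0.55);
    printf("mean board utilization: %.2f\n", rrd_mean(&board_util));
    return 0;
}
```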


CARMA End-to-End Service Description


Functionality demonstrated to date


Graphical user interface


Job/task scheduling based on board requirements and
configuration temporal locality


Parallel and serial jobs


CARMA registered and non
-
registered tasks


Remote execution and result retrieval


Configuration caching and management


Mixed RC and “CPU
-
only” tasks


Heterogeneous board execution (3 types thus far)


System and RC device monitoring


Inter
-
node communication via SCI or TCP/IP/GigE


Fault
-
tolerant design


Processes can be restarted while running



Virtually no system impact from CARMA overhead
despite use of unoptimized code


Less than 5MB RAM per node


Less than 0.1% processor utilization on a 2.4 GHz Xeon
server


Less than 200 Kbps network utilization

CARMA Execution Stages:
1) User submits a job
2) JM performs a task schedule request and the monitor replies with the execution location
3) JM forwards tasks to the local or remote ExMan
4) If a task requires an RC board, ExMan sends a configuration request to the local CM
5) The CM finds the file and configures the board
6) The user's task is forked (runs on the processor)
7) Users access RC boards via the BIM
8) Task results are forwarded to the originating JM
9) Job results are forwarded to the originating user
Note: all modules update the monitor.

(Figure: CARMA node showing the UI, JM, ExMan, CM, BIM, microprocessor, monitor, and RC fabric, annotated with the numbered execution stages.)
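The nine stages above can be traced with a toy, single-process walkthrough; every function in the sketch is an illustrative stand-in (not a CARMA module), and the printed lines simply mirror the numbered stages.

```c
/* Toy walk-through of the nine execution stages as a single-process
 * simulation; every function here is an illustrative stand-in, not CARMA. */
#include <stdio.h>

typedef struct { const char *name; int needs_rc; } task_t;

static const char *monitor_schedule(const task_t *t)    /* stage 2 */
{
    /* The real JM queries the monitor/RRD; here we pick a fixed node. */
    return t->needs_rc ? "node-with-RC1000" : "local-node";
}

static void exman_run(const task_t *t, const char *node)
{
    printf("3) task %s forwarded to ExMan on %s\n", t->name, node);
    if (t->needs_rc) {
        printf("4) ExMan requests configuration from the local CM\n");
        printf("5) CM finds %s and configures the board via the BIM\n", t->name);
    }
    printf("6) user's task forked on the processor\n");
    if (t->needs_rc)
        printf("7) task accesses the RC board through the BIM\n");
    printf("8) task result forwarded to the originating JM\n");
}

int main(void)
{
    task_t tasks[] = { { "ADD.exe", 0 }, { "AddOne.bit", 1 } };

    printf("1) user submits job\n");
    for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
        const char *node = monitor_schedule(&tasks[i]);
        printf("2) monitor replies: run %s on %s\n", tasks[i].name, node);
        exman_run(&tasks[i], node);
    }
    printf("9) job results forwarded to the originating user\n");
    return 0;
}
```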

CARMA Framework Verification


Several test jobs executed concurrently:
- Parallel Add Test, composed of:
  - ADD.exe, a "CPU-only" task to add two numbers
  - AddOne.bit, an RC task to increment an input value
- Parallel N-Queens Test, composed of:
  - ADD.exe, a "CPU-only" task to add two numbers
  - NQueens.bit, an RC1000 task to calculate a subset of the total number of solutions for an N×N board
- Parallel Sieve of Eratosthenes (on Tarari)
- Parallel Monte Carlo Pi Generator (on Tarari)
- Blowfish encrypt/decrypt (on ADM-XRC)

(Figure: DAGs for the Par. Add Test and the N-Queens Test, built from ADD.exe, AddOne.bit, and NQueens.bit tasks with example input/output values.)

These simple applications were used to test CARMA's functionality, while CARMA's services have wider applicability to problems of greater size and complexity.

(Figure: example system setup — Xeon servers each running the JM, ExMan, CM, BIM, and monitor with a Tarari, RC1000, or ADM-XRC board, plus a Xeon server hosting the CM configuration store and GEMS; the SWILL interface serves users, SCI carries configuration files, and TCP/IP carries requests, tasks, and results.)

CARMA Framework Example Setup


Parallel N-Queens example:
- 8 boards of two types (RC1000 and Tarari)
- Job written in Handel-C SDK2 and is CARMA-registered
- MPI communication between nodes over GigE
- Tasks begin "synchronized" (suggest an MPI barrier)
- Result collected at the "head" node (not hard-coded)
- Final result passed to the user interface

(Figure: eight Xeon servers — four with Tarari boards and four with RC1000 boards — connected to the user interface.)
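Below is a minimal sketch of the MPI coordination described above, assuming the barrier/reduce pattern suggested on the slide: ranks start at a barrier, each counts its share of N-Queens solutions (work done by NQueens.bit on the RC boards in the real setup; a small software routine stands in here), and the result is reduced at the head rank. Everything other than the MPI calls is an illustrative assumption, not CARMA or Handel-C code.

```c
/* Sketch of the parallel N-Queens harness: barrier, per-rank partial counts,
 * reduction at the head node. The counting routine is a software stand-in
 * for the RC task. */
#include <mpi.h>
#include <stdio.h>

/* Standard bitmask backtracking; counts solutions below the given row. */
static long solve(int n, int row, unsigned cols, unsigned d1, unsigned d2)
{
    if (row == n)
        return 1;
    long count = 0;
    for (int c = 0; c < n; c++) {
        unsigned bit = 1u << c;
        if ((cols | d1 | d2) & bit)
            continue;
        count += solve(n, row + 1, cols | bit, (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

/* Solutions whose first-row queen sits in column `col` (one node's share). */
static long count_first_col(int n, int col)
{
    unsigned bit = 1u << col;
    return solve(n, 1, bit, bit << 1, bit >> 1);
}

int main(int argc, char **argv)
{
    int rank, size, n = 12;
    long local = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);               /* tasks begin "synchronized" */

    for (int col = rank; col < n; col += size) /* this node's subset of work */
        local += count_first_col(n, col);

    /* Collect the result at the "head" node (rank 0 here), then hand it on. */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d-Queens solutions: %ld\n", n, total);

    MPI_Finalize();
    return 0;
}
```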


Simple Scheduling Example


Four jobs requested on the local JM:
- First job deployed
  - Configuration requested and job begun
- Second job deployed
  - Passed to the remote JM
  - Configuration requested and job begun
- Third job deployed
  - Configuration found preloaded, so job queued
- Fourth job deployed
  - Run-time decision to be made


(Figure: two Xeon/RC1000 nodes and a configuration-store node with GEMS, as in the verification setup, connected by SCI for configuration files and TCP/IP for requests, tasks, and results. The job queue holds one Par. Add Test and three NQueens Tests; AddOne.bit and NQueens.bit are the configurations involved, and the processing order of the fourth job is the open run-time question.)
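The third and fourth jobs above hinge on configuration temporal locality. The sketch below shows one way such a placement decision could look, preferring a node whose board already holds the required bit file and falling back to the shortest queue; the node structure and tie-breaking rule are assumptions for illustration, not the CARMA JM's actual policy.

```c
/* Sketch of a temporal-locality-aware placement decision; illustrative only. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *loaded_bitfile;   /* what the board currently holds   */
    int queued_tasks;             /* work already queued on this node */
} node_t;

/* Prefer a node whose board already holds the configuration (avoids a
 * reconfiguration); break ties, or fall back, on the shortest queue. */
static const node_t *place(const node_t *nodes, int n, const char *bitfile)
{
    const node_t *best = NULL;
    for (int i = 0; i < n; i++) {
        int hit  = nodes[i].loaded_bitfile &&
                   strcmp(nodes[i].loaded_bitfile, bitfile) == 0;
        int bhit = best && best->loaded_bitfile &&
                   strcmp(best->loaded_bitfile, bitfile) == 0;
        if (!best || (hit && !bhit) ||
            (hit == bhit && nodes[i].queued_tasks < best->queued_tasks))
            best = &nodes[i];
    }
    return best;
}

int main(void)
{
    node_t nodes[] = { { "node0", "AddOne.bit",  1 },
                       { "node1", "NQueens.bit", 2 } };

    /* Third job: NQueens.bit is already loaded on node1, so queue it there
     * even though node1 has more work; the fourth job would force the
     * run-time tradeoff between waiting and reconfiguring node0. */
    printf("NQueens job -> %s\n", place(nodes, 2, "NQueens.bit")->name);
    return 0;
}
```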


Conclusions


First working version of CARMA complete and tested

Numerous features supported:
- Simple GUI front-end interface
- Coherent multitasking, multi-user environment
- Dynamic RC fabric discovery and management
- Robust job scheduling and management
- Fault-tolerant and scalable services by design
- Performance monitoring down into the RC fabric
- Heterogeneous board support with hardware independence
- Linking to COTS job management service

Initial testing shows the framework to be sound, with very little overhead imposed upon the system.


Future Work and Acknowledgements


Continue to fill in additional CARMA features:
- Include support for other boards, application mappers, and languages
- Complete the JM rollback feature and finish linkage to LSF
- Include broker and caching mechanisms for the peer-to-peer distributed CM scheme
- Include more intelligent scheduling algorithms (e.g., Last Release Time)
- Expand RC device monitoring and include debug and optimization mechanisms
- Enhance security, including secure data transfer and authentication
- Deploy on a large-scale test facility

Develop CARMA instantiations for other RC domains:
- Distributed shared-memory machines with RC (e.g., SGI Altix)
- Embedded RC systems (e.g., satellite/aircraft systems, munitions)

We wish to thank the following for supporting this research:
- Department of Defense
- Xilinx
- Celoxica
- Alpha Data
- Tarari
- Key vendors of our HPC cluster resources (Intel, AMD, Cisco, Nortel)