Integrating GPUs into Condor

1

Integrating GPUs into Condor




Timothy Blattner

Marquette University

Milwaukee, WI


April 22, 2009

2

Outline


Background and Vision


Graphics Cards


Condor Approach


Problems


Conclusions and Future Work

3

Graphics cards


Powerful


NVIDIA Tesla C1060


240 massively parallel processing cores


4 GB GDDR3


CUDA Capable


~933 gigaflops (peak, single precision)


~$1,300


Cheap


NVIDIA 9800 GT


112 massively parallel processing cores


512 MB GDDR3


CUDA Capable


~$120

4

Vision and Focus


Pool of computers containing graphics cards, managed by
Condor


Provide users the ability to utilize graphics cards identified
by Condor

[Figure: a Central Manager and three pool machines, each marked "?"]

5

Opportunities

Resources may already be there


Majority of machines have graphics cards in them


GPU resources sit idle while Condor runs on the CPU


Similar work


GPUGRID.net


Distributed computing project using NVIDIA graphics
cards for all-atom molecular simulations of proteins


Uses GPU-enabled BOINC client


6

Prototype Implementation


Linux only


Script queries operating system and graphics card


Hawkeye Cron job manager runs script


Script outputs graphics card information into ClassAd format


A compiled binary queries NVIDIA cards for more specific information
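
The reporting step can be sketched in a few lines. This is a minimal illustration only, not the prototype's actual script: the function name `format_classad` and the dictionary keys `name`, `cuda`, and `mem_bytes` are hypothetical, and the GPU description is hard-coded rather than queried from the driver. It shows how card information might be turned into `Name = value` ClassAd lines.

```python
def format_classad(gpus):
    """Render a list of GPU descriptions as 'Name = value' ClassAd lines.

    Each entry in `gpus` is a dict with the hypothetical keys
    'name' (str), 'cuda' (bool), and 'mem_bytes' (int).
    """
    lines = [f"HasGpu = {bool(gpus)}", f"NGpu = {len(gpus)}"]
    for i, gpu in enumerate(gpus):
        # String values are quoted; booleans and integers are not.
        lines.append(f'Gpu{i} = "{gpu["name"]}"')
        lines.append(f"Gpu{i}CudaCapable = {gpu['cuda']}")
        lines.append(f"Gpu{i}Mem = {gpu['mem_bytes']}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hard-coded example card; a real script would query the system instead.
    print(format_classad(
        [{"name": "Quadro FX 3700", "cuda": True, "mem_bytes": 536150016}]
    ))
```

A machine without a card would yield only `HasGpu = False` and `NGpu = 0`, so jobs that require a GPU would not match it.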

7

Graphics Card Architecture

8

Graphics card APIs


Favor general purpose computations



CUDA (NVIDIA)



Brook (ATI)



OpenCL (Khronos Group)

9

CUDA Programming Model


Kernels are functions run on the device (GPU)



Host (CPU) code invokes kernels and determines


Number of threads


Thread block structure for organizing threads



Kernel invocations are asynchronous


Control returns to the CPU immediately


CUDA provides synchronization primitives


Some CUDA calls (e.g., memory allocation) are synchronous
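
The host-side choice of thread count and block structure often reduces to simple arithmetic. The sketch below assumes a one-dimensional launch with one thread per data element, which is a common pattern but only one of many possible layouts; the function name `launch_config` is hypothetical, and the default of 512 threads per block is taken from the `Gpu0ThreadsPerBlock` value reported for the Quadro FX 3700.

```python
def launch_config(n_elements, threads_per_block=512):
    """Return (blocks, threads_per_block) for a 1-D launch.

    Uses ceiling division so that blocks * threads_per_block >= n_elements.
    The last block may contain threads with no element to process, so a
    kernel launched this way needs a bounds check on its thread index.
    """
    if n_elements <= 0 or threads_per_block <= 0:
        raise ValueError("both arguments must be positive")
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block
```

For 1,000 elements this gives 2 blocks of 512 threads, leaving 24 threads idle in the second block.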

10

Hawkeye Cron Job Manager


Provides mechanism for collecting, storing, and using
information about computers


Periodically executes specified program(s)


Program outputs in form of ClassAd


Outputs are added to machine's ClassAd

11

Hawkeye Implementation


Added to local configuration file


Runs script every minute


Condor user must be granted graphics card privileges in
order to query the card

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST), UPDATEGPU
STARTD_CRON_UPDATEGPU_EXECUTABLE = gpu.sh
STARTD_CRON_UPDATEGPU_PERIOD = 1m
STARTD_CRON_UPDATEGPU_MODE = Periodic
STARTD_CRON_UPDATEGPU_KILL = True

12

Script Output


HasGpu = True


NGpu = 1


Gpu0 = "Quadro FX 3700"


Gpu0CudaCapable = True


Gpu0_Major = 1


Gpu0_Minor = 1


Gpu0Mem = 536150016


Gpu0Procs = 14


Gpu0Cores = 112


Gpu0ShareMem = 16384


Gpu0ThreadsPerBlock = 512


Gpu0ClockRate = 1.24


HasCuda = True
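
For testing or debugging, output like this can be read back into a dictionary. The following is a rough sketch that handles only the simple `Name = value` lines shown on this slide (booleans, quoted strings, integers, and floats); it is not a full ClassAd parser, and the function name `parse_classad` is hypothetical.

```python
def parse_classad(text):
    """Parse simple 'Name = value' lines into a dict of Python values."""
    attrs = {}
    for line in text.splitlines():
        name, sep, raw = line.partition("=")
        name, raw = name.strip(), raw.strip()
        if not sep or not name:
            continue  # skip blank lines and anything that is not an assignment
        if raw.lower() in ("true", "false"):
            attrs[name] = raw.lower() == "true"
        elif len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            attrs[name] = raw[1:-1]  # strip the surrounding quotes
        else:
            try:
                attrs[name] = int(raw)
            except ValueError:
                try:
                    attrs[name] = float(raw)
                except ValueError:
                    attrs[name] = raw  # keep unrecognized values as text
    return attrs
```

With the values above, `parse_classad(...)["Gpu0Cores"]` would be the integer 112 and `["Gpu0"]` the string `Quadro FX 3700`.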



13

Job Submission


Users can submit jobs with GPU requirements into Condor


Portable across Linux distributions

Universe = vanilla
Executable = tests/CudaJob
Initialdir = gpuJobs
Requirements = (HasGpu == true) && (Gpu0CudaCapable == true)
Log = gpu_test.log
Error = gpu_test.stderr
Output = gpu_test.stdout
Queue

condor_submit gpu_job.submit
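
To make the effect of the Requirements line concrete, here is a simplified sketch of the matching it implies. It hard-codes the single expression from the submit file rather than evaluating ClassAd expressions in general, and the names `meets_requirements`, `matching_machines`, and the example pool are hypothetical. A machine that does not advertise an attribute is treated as not matching.

```python
def meets_requirements(machine_ad):
    """Mirror of: (HasGpu == true) && (Gpu0CudaCapable == true).

    A missing attribute yields None from .get(), so the machine
    is rejected rather than matched.
    """
    return (machine_ad.get("HasGpu") is True
            and machine_ad.get("Gpu0CudaCapable") is True)


def matching_machines(pool):
    """Return the names of machines in `pool` whose ads satisfy the job."""
    return [name for name, ad in pool.items() if meets_requirements(ad)]


if __name__ == "__main__":
    # Hypothetical three-machine pool.
    pool = {
        "node1": {"HasGpu": True, "Gpu0CudaCapable": True},
        "node2": {"HasGpu": True, "Gpu0CudaCapable": False},
        "node3": {},  # no GPU attributes advertised
    }
    print(matching_machines(pool))
```

Only `node1` satisfies both conditions in this example pool.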

14

Access Control


The /dev/nvidiactl and /dev/nvidia* devices must be readable and
writable by the user the job runs as


Could be


Nobody, open access


Controlled by Unix group, containing limited users


Integrated more directly with Condor user control, slot
users
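
Whichever policy is chosen, it can be useful to verify that the account running the job can actually open the device files. The sketch below is one possible check, assuming the device paths named above; the function name `inaccessible_devices` is hypothetical. It uses `os.access`, which tests permissions for the real user ID of the calling process.

```python
import glob
import os


def inaccessible_devices(pattern="/dev/nvidia*"):
    """Return paths matching `pattern` that the current user cannot
    both read and write. The default pattern also covers /dev/nvidiactl.
    """
    return [
        path
        for path in sorted(glob.glob(pattern))
        if not os.access(path, os.R_OK | os.W_OK)
    ]


if __name__ == "__main__":
    blocked = inaccessible_devices()
    if blocked:
        print("No read/write access to:", ", ".join(blocked))
    else:
        print("All matching devices are accessible (or none exist).")
```

An empty result does not distinguish between "all devices accessible" and "no devices present", so a production check would likely test for the presence of the device files as well.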

15

Problems


Preemption


Jobs running in GPU kernel cannot be interrupted
reliably by Unix signals



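Watchdog timer


When the card also drives a display, a job whose GPU kernel runs
longer than about 5 seconds is killed


A solution: install the general-purpose compute card as a secondary
card that does not drive the primary display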



Memory Security


Malicious users, interrupting a job between GPU kernel
calls, have the opportunity to overwrite or copy GPU
memory

16

Summary


Condor based approach for advertising GPU resources



Linux-based prototype implementation



Can access available GPUs


Works best on dedicated machines, with no need for
preemption



Current Limitations


Doesn’t report GPU usage


Lack of preemption


Limited OS and video card support



17

Future Work


Create benchmark and testing suite



Handle preemption


Investigate how watchdog works



GPU usage reporting



Integrate memory protection



Support more operating systems


Windows and Mac OS X



Support alternative architectures and APIs


Brook and OpenCL

18

Questions?


Contact:

timothy.blattner@marquette.edu

craig.struble@marquette.edu

https://sourceforge.net/projects/condorgpu/