Supporting GPUs in a Volunteer Computing System



David P. Anderson

Space Sciences Laboratory

University of California, Berkeley


Abstract:


1. Introduction


Graphics Processing Units (GPUs) are coprocessors used in personal computers and game consoles to render 3-D graphics. They are relevant to high-performance computing for several reasons:



- Speed: a current high-end GPU (the NVIDIA GTX280) has a peak performance of roughly 1 trillion single-precision floating-point operations per second. This is roughly 50 times the speed of a current high-end CPU (the Intel Core i7). The speed of both CPUs and GPUs has increased roughly exponentially over time; however, the doubling time of GPUs is much shorter than that of CPUs (9 months versus 18 months).



- Programmability: a number of systems are now available that allow GPUs to be programmed fairly easily, and that provide libraries of scientific functions such as FFT. Examples include CUDA, OpenCL, and Brook+.

- Energy efficiency: the watts/FLOP ratio of GPUs is about 10x lower than that of CPUs.

GPU vendors have developed rack-mounted multi-GPU units targeted at scientific computing. However, consumer GPU products are cheaper and far more prevalent. The NVIDIA GTX280 sells for about $400, and there are roughly 100 million GPUs in home computers around the world. Viewed as a unit, these GPUs have a peak performance of about 20 ExaFLOPS: 10,000 times the power of the fastest supercomputer.

Making the 100 million GPUs in home computers available to computational science applications requires a mechanism by which computer owners can donate the use of their computing resources to research projects. This arrangement, called volunteer computing, was pioneered in the late 1990s by projects such as distributed.net, GIMPS, SETI@home, and Folding@home, and is now widely used by research projects in many areas of science. Currently roughly 500,000 people and 1 million computers participate in volunteer computing projects.

The dominant software platform for volunteer computing is BOINC [ref]. Using BOINC, scientists can create volunteer computing projects, and computer owners can donate computing resources to these projects. BOINC allows and encourages volunteers to participate in multiple projects. To do so, they download and run a client program (available for all common operating systems) and "attach" the client to the desired projects.

In its original form, BOINC assumed that all applications use 1 CPU and no other processing resources. We decided to modify BOINC to support a much wider range of processing resources: GPUs, other coprocessors (such as the Cell processor in the Sony PlayStation 3), multiple CPUs, and arbitrary combinations of these. This required major changes to the BOINC client, the server, and the protocol by which they communicate.

The general goals of this redesign include:

- Transparency: use the full processing resources of a volunteer's computer, including GPUs, with no action or even awareness required of the volunteer.

- Heterogeneity: accommodate diversity of GPU manufacturer, model, RAM size, driver version, and multiplicity.

- Generality: allow projects to offer alternative versions of a given application (e.g., one for GPU, one for multicore CPU, one for single CPU), and to change this assortment frequently.

This paper describes the design decisions we encountered in this redesign, and the solutions we adopted. Section 2 describes relevant aspects of the original (pre-GPU) version of BOINC, and Section 3 describes the changes made to accommodate GPUs. Section 4 describes the use of GPUs by several projects.

2. BOINC overview


At the highest level, BOINC involves volunteers (who provide computing resources) and projects (which use those resources). Volunteers run a BOINC client program on their computers, and can attach the client to any set of projects. Each attachment is associated with an account on the project; a volunteer with several computers will typically attach them to one account. Each attachment has a corresponding resource share; bottleneck resources are divided among projects in proportion to their resource shares.

Figure 1: Volunteers can participate in multiple projects and control resource allocation.


Projects are autonomous; each operates its own server, which maintains a database of accounts and jobs. As of June 2009 there were 60 projects, with 330,000 active volunteers and 575,000 active hosts. These hosts provide an average of over 2 PetaFLOPS.

The client periodically makes a scheduler transaction to each of its attached projects. The request message includes:

- Account credentials.

- A list of "platforms" supported by the host. A platform is a combination of OS and processor type, e.g. Windows/x86. Some hosts support multiple platforms; for example, a Windows x64 host can generally execute Windows/x86 applications as well. The list is ordered by decreasing expected performance: e.g., 64-bit apps are assumed to run faster than 32-bit apps.

- A description of the host's CPU, memory, and disk space.

- A description of queued and in-progress jobs.

- A work request consisting of the number N of idle CPUs, and the number S of CPU-seconds requested.

- A list of completed jobs.


The reply message includes:

- A list of programs to download. Each program can consist of multiple files; these files are not included in the reply message, but are described by one or more URLs from which they can be downloaded.

- A list of jobs to execute. Each is associated with a particular program, and can have arbitrary numbers of input and output files.


On receipt of the reply, the client downloads input and application files, executes the jobs, and uploads the output files. Eventually it makes another scheduler request, and this cycle repeats indefinitely.
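To make the message structure concrete, here is a minimal sketch of the logical contents of the two messages, written as Python data classes. The field names are ours for illustration; the actual messages are XML and use different tags.

    # Sketch of the logical contents of a scheduler request and reply.
    # Field names are illustrative; the real messages are XML.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class SchedulerRequest:
        credentials: str              # account credentials
        platforms: List[str]          # ordered by decreasing expected performance
        host_info: Dict[str, str]     # CPU, memory, and disk description
        jobs_on_host: List[str]       # queued and in-progress jobs
        n_idle_cpus: int              # N: number of idle CPUs
        cpu_seconds_requested: float  # S: CPU-seconds of work requested
        completed_jobs: List[str] = field(default_factory=list)

    @dataclass
    class SchedulerReply:
        # program files are not embedded in the reply; they are listed
        # as URLs from which the client downloads them
        program_urls: Dict[str, List[str]]  # program name -> file URLs
        jobs: List[dict]                    # each references a program plus
                                            # its input and output files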

All communication (scheduler transactions, file downloads, and file uploads) is performed by client-initiated HTTP operations, accommodating the fact that many clients are behind firewalls that allow only outgoing HTTP traffic.

BOINC must tolerate version skew. The client does not automatically update itself (a design decision, made to avoid a real or perceived security vulnerability). Hence the volunteer computers run a range of client versions, dating back several years. Similarly, the projects run various versions of the server software. The scheduler transaction messages are in XML format. All extensions to the protocol are backwards-compatible, so that old clients can communicate with new servers and vice versa.


2.1 The BOINC server


The volunteer computing host population is diverse in terms of operating system (Windows, Linux, Mac OS X) and processor type (x86, x64, PowerPC, SPARC). BOINC supports mechanisms (such as virtual machines and interpreted languages) that mask this diversity. However, most projects simply compile native-mode versions of their applications for all of the major platforms.

A program compiled for a particular platform is called an app version. A collection of functionally equivalent app versions is called an application. When a job is submitted, it is associated with an application, not an app version. It can be assigned to any host for which an app version is available.


A job has the following attributes:

- A latency bound B. If a job is sent to a host at time T, it must be completed and reported by time T+B; if not, it "times out" and an instance of the job is sent to another host.

- An estimate of the number of FLOPS used by the job, used to estimate its run time on a given host.

- An upper bound on the number of FLOPS, used to abort runaway jobs.

- Upper bounds on the amount of RAM and disk space used by the job.


The BOINC scheduler supports a range of policies and parameters [X]. All the policies have the same basic structure. They examine a sequence of jobs. For each job J:

- If no app version is available for any of the host's platforms, skip J.

- If the host has insufficient memory or disk space, skip J.

- If, based on the host's current workload (including new jobs) and J's estimated run time, J will not be completed within its delay bound, skip J.

- Otherwise, add J to the list of jobs to send to the host.

This is repeated until the host's work request is satisfied (i.e., N jobs are sent, one per idle CPU, and their total estimated run time is at least S seconds).
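The following sketch shows the shape of this selection loop in Python; the field names and the simple run-time estimate are assumptions of ours, not BOINC's actual data structures.

    # Sketch of the scheduler's job-selection loop described above.
    def select_jobs(host, candidates, n_idle_cpus, seconds_requested):
        selected, total_runtime = [], 0.0
        for job in candidates:
            # skip J if no app version exists for any of the host's platforms
            if not any(p in host.platforms for p in job.app_platforms):
                continue
            # skip J if the host has insufficient memory or disk space
            if job.ram_bound > host.ram or job.disk_bound > host.free_disk:
                continue
            # skip J if, given the current workload plus already-selected
            # jobs, it would miss its delay (latency) bound
            est = job.flop_estimate / host.flops_per_sec
            if host.queued_seconds + total_runtime + est > job.latency_bound:
                continue
            selected.append(job)
            total_runtime += est
            # stop once the work request is satisfied
            if len(selected) >= n_idle_cpus and total_runtime >= seconds_requested:
                break
        return selected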


2.2 The BOINC client


The BOINC client maintains a "work buffer" of multiple jobs, typically several hours or days of work. This buffer shrinks as jobs are completed. When it falls below a certain hysteresis point, the client replenishes it by issuing a scheduler transaction. There are several reasons for using this approach instead of processing jobs synchronously:



- Some hosts are connected only occasionally to the Internet. The work buffer prevents such hosts from running out of work while disconnected.

- The frequency of scheduler transactions is determined by the buffer size, rather than by job size. This is important for limiting the server load on projects with large numbers (hundreds of thousands or millions) of attached hosts.

- If the client is attached to several projects, the work buffer typically contains a job from each project, and the client time-slices between them. The resulting variety reassures the volunteer that they are contributing to all of the projects.


The BOINC client contains two mechanisms that are relevant to this paper. First, the job scheduling policy decides, from among the currently queued jobs, which to execute at a given time. This policy normally does round-robin time-slicing between jobs from different projects, thus providing maximum variety to the volunteer. However, the client periodically estimates the completion times of jobs under this policy; if any job is in danger of missing its deadline, it is run to completion.

Second, the job fetch policy decides when to contact a scheduler, which one to contact, and how much work to ask for. It is responsible for a) maintaining the work buffer within its hysteresis limits, and b) enforcing resource shares. The latter is accomplished by maintaining a processing debt for each project. This constantly increases in proportion to the project's resource share, and decreases as CPU time is allocated to the project. When more work is needed, it is requested from the project whose processing debt is largest.
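A minimal sketch of this debt bookkeeping follows. The update rule and units are our simplification; BOINC's actual accounting differs in detail.

    # Sketch of per-project "processing debt" bookkeeping (principle only).
    class DebtTracker:
        def __init__(self, shares):
            # shares: project -> resource share (positive weights)
            total = sum(shares.values())
            self.fraction = {p: s / total for p, s in shares.items()}
            self.debt = {p: 0.0 for p in shares}

        def update(self, elapsed, cpu_granted):
            # elapsed: wall-clock seconds since the last update
            # cpu_granted: project -> CPU seconds allocated in that interval
            for p in self.debt:
                self.debt[p] += elapsed * self.fraction[p]  # grows with share
                self.debt[p] -= cpu_granted.get(p, 0.0)     # shrinks as CPU is used

        def project_to_ask(self):
            # more work is requested from the project with the largest debt
            return max(self.debt, key=self.debt.get)

    # Example: with equal shares, a project that received no CPU time
    # accumulates debt fastest and is asked for work first.
    tracker = DebtTracker({"A": 50, "B": 50})
    tracker.update(3600, {"A": 3600})   # A got the whole CPU for an hour
    assert tracker.project_to_ask() == "B"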

2.3 The anonymous platform mechanism


BOINC provides a mechanism called anonymous platform that lets volunteers compile their own applications, rather than obtaining applications from projects. This accommodates volunteers who

- want to inspect source code for security reasons;

- have unusual computer types not supported by the project;

- want to run versions that are optimized (perhaps for a specific processor type) relative to the version distributed by the project.


To use this mechanism, volunteers create an XML-format application description file describing the application versions that are available on their host. This information is conveyed to the server, which sends the host jobs only for applications for which the host has a version.


3. Supporting GPUs in BOINC


3.1 Modeling processing resources


How should the processing resources of a given host be represented in its scheduler requests, and how should the processing resource requirements of a job be described in scheduler replies? These questions presented several design decisions.

Suppose a host has several NVIDIA GPUs. In general, these GPUs may differ in various ways (speed, video RAM, compute capabilities, and so on). Some of them may be able to handle the jobs of a particular project, others not. One approach would be to model the GPUs as separate resources; scheduler requests would contain the specifications of each GPU, and scheduler replies would link each job to the particular GPU or GPUs it is to use.

We chose not to take this approach, because a) the static association of job and GPU instance can lead to unnecessarily idle GPUs, and b) it leads to great software complexity. Instead, we chose a model in which the NVIDIA GPUs on a given host are represented as N identical and interchangeable instances. We achieve this by identifying the most "capable" GPU on the host (based on its speed, RAM, etc.) and ignoring the GPUs that are not roughly similar to this one. The set of remaining GPUs is modeled as N instances of the most capable one. This approach has the disadvantage of not using the less capable GPUs on a given host, but this is not a significant loss.
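A sketch of this coalescing step follows. The capability score and the similarity threshold are our assumptions; the text specifies only that the ranking is based on speed, RAM, etc., and that dissimilar GPUs are ignored.

    # Sketch of coalescing a host's GPUs into N interchangeable instances.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Gpu:
        peak_flops: float   # estimated peak speed
        video_ram: int      # bytes

    def capability(g: Gpu) -> float:
        return g.peak_flops          # crude score; the real ranking may differ

    def coalesce(gpus: List[Gpu], threshold: float = 0.5) -> Tuple[Gpu, int]:
        best = max(gpus, key=capability)
        # count the GPUs "roughly similar" to the best one; ignore the rest
        n = sum(1 for g in gpus if capability(g) >= threshold * capability(best))
        return best, n   # the host is reported as N instances of `best`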

A host's processing resources, then, are represented by:

- A description of its CPUs and memory (as in the original BOINC).

- Zero or more "coprocessor types", each of which has an instance count and a type-dependent XML-format description of an instance (in the case of NVIDIA GPUs this includes memory size, registers per block, warp size, threads per block, clock rate, and processor count).


Currently the only supported coprocessor type is NVIDIA GPUs. We plan to support ATI GPUs and
Cell SPEs as well.

We now turn to the second question: how should the processing resource requirements of a job on a particular host be represented? We chose to represent this host usage information as follows:

- The average number (possibly non-integer) of CPUs used by the job.

- For each coprocessor type, the number (integer) of instances used by the job.

- The expected FLOPS achieved when running the job.

We chose to associate host usage information not with individual jobs, but with the app versions contained in scheduler replies. We will explain in Section 3.4 how the information is determined.
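In sketch form, the host usage record and the run-time estimate it yields look like this (names are ours):

    # Sketch of per-app-version "host usage" information.
    from dataclasses import dataclass
    from typing import Dict

    @dataclass
    class HostUsage:
        avg_ncpus: float                   # possibly non-integer, e.g. 0.2
        coproc_instances: Dict[str, int]   # e.g. {"NVIDIA": 1}
        expected_flops: float              # expected speed on this host

    def estimated_runtime(job_flop_estimate: float, usage: HostUsage) -> float:
        # a job's estimated run time follows directly from its FLOP count
        # and the expected FLOPS of the app version on this host
        return job_flop_estimate / usage.expected_flops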

This model is general in the sense that it supports applications that use arbitrary combinations of CPU and coprocessor resources. However, it is limited in two ways:

- It does not support applications whose behavior changes over the course of a job.

- It requires an a priori estimate of how many CPUs the job will use on a given host, rather than dynamically observing its behavior; for many parallel applications it is difficult to make an accurate estimate.


Removing these limitations would make it impossible to accurately estimate the completion times of a set of jobs, which would undermine the effectiveness of BOINC's scheduling mechanisms in general.

We can now summarize the changes to the scheduler request RPC protocol:

- Scheduler requests now include a description of the host's processing resources, as described above. For each resource type (CPU and coprocessors) the request includes a) the duration (instance-seconds) being requested, and b) the number N of idle instances; enough jobs to use N instances simultaneously should be returned. A client can therefore request work for certain resources and not others.

- Scheduler replies now contain host usage information with each app version, and each job is associated with an app version. A given reply can contain two versions of the same application, e.g. a CPU version and a GPU version. Each job is linked to a version.


3.2 GPU-enabling the BOINC scheduler


In the new version of BOINC, a project may have several different versions of a given application for a given platform. For example, for the Windows/x86 platform there might be a single-CPU version, a version for NVIDIA GPUs, and a multithreaded version for multicore CPUs. Furthermore, each of these versions may have a range of possible behaviors related to resource usage. For example, the multithreaded version may be able to use different numbers of threads, depending on the number of available CPU cores. Typically the behavior can be specified by a command-line argument.

Given a scheduler request as described above, the scheduler must now decide, for each job that it sends:

- Which application version to use. A given scheduler reply might include jobs linked to different versions of the same application, e.g. to utilize both the CPUs and the GPUs of the requesting host.

- Command-line arguments specifying the desired resource behavior.

- The expected resource usage (i.e. the number of CPUs and GPUs).

- The estimated run time.


This decision may be complex and application-specific, so its logic is placed in a project-supplied application planning function that is linked with the BOINC scheduler. This function takes as input a host description and an application version. It returns:

- Whether the application version can run on the host, and if so:

- What CPU and coprocessor resources it will use.

- What FLOPS it is expected to attain.

- What command-line argument should be passed to it.


For GPU apps, this function can enforce constraints on GPU speed, video RAM size, and driver
version number.

For a multiprocessor application, this function can embody knowledge of its speedup function; for a given host the logic would select the optimum number of threads to use, supply a command-line argument instructing the application accordingly, and estimate the corresponding FLOPS.

The default behavior of the function (for compatibility with existing app versions) is that the
resource usage is 1 CPU, and the FLOPS estimate is the benchmark speed of the CPU.

In handling a given request, the scheduler maintains a memoized function that maps an application A to the "best" version V = B(A) to use for the given host. This is determined as follows:

- Enumerate all versions V of A for platforms listed in the request.

- Call the application planning function to get each version's resource usage.

- If the planning function indicates that the host can't handle V, skip V.

- If the version uses resources for which no work is being requested, skip V.

- Of the remaining versions, B(A) is the one whose FLOPS estimate is highest.


When the scheduler assigns a job to the host, it knows how many instances of each resource type the job will use, and how many instance-seconds. It updates the corresponding fields of the request message. If at any point the scheduler sees that B(A) uses a resource for which the request fields are now zero, it re-evaluates B(A).
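Putting these pieces together, here is a hypothetical sketch of the memoized B(A) computation. The planning function body, field names, and request bookkeeping are illustrative assumptions; in particular, we treat each version as either a "GPU version" or a "CPU version" for the resource-request test, as in the example that follows.

    # Sketch of the memoized "best version" map B(A).
    def plan(host, version):
        # Stand-in for the project-supplied application planning function:
        # returns a usage dict if the host can run `version`, else None.
        if version.uses_gpu:
            gpu = host.gpu
            if gpu is None or gpu.ram < version.min_gpu_ram \
               or gpu.driver_version < version.min_driver:
                return None
            # GPU version: small CPU fraction, one GPU, GPU-level FLOPS
            return {"ncpus": 0.2, "ngpus": 1, "flops": gpu.peak_flops}
        # default (compatibility) case: 1 CPU at the CPU benchmark speed
        return {"ncpus": 1.0, "ngpus": 0, "flops": host.cpu_benchmark_flops}

    def best_version(app, host, request, cache):
        if app in cache:
            return cache[app]
        best = None
        for v in app.versions:
            if v.platform not in host.platforms:
                continue
            usage = plan(host, v)
            if usage is None:
                continue                  # host can't handle this version
            # skip versions whose resource has no outstanding work request
            if usage["ngpus"] > 0 and request["gpu_seconds"] <= 0:
                continue
            if usage["ngpus"] == 0 and request["cpu_seconds"] <= 0:
                continue
            if best is None or usage["flops"] > best[1]["flops"]:
                best = (v, usage)
        cache[app] = best
        return best

As jobs are assigned, the scheduler subtracts their instance-seconds from the request fields; when a field used by a cached B(A) reaches zero, that cache entry is discarded and recomputed, producing the GPU-to-CPU switchover in the example below.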

As an example, suppose that a project has both CPU and GPU application versions for Win/x86, and that a Win/x86 client requests one minute each of CPU and GPU work. Then initially B(A) will be the GPU version, and the scheduler will assign one or more jobs using the GPU version, satisfying the GPU request. At this point (assuming the GPU version uses a small CPU fraction) there will still be a nonzero CPU request; the scheduler will change B(A) to the CPU version, and send sufficient jobs using the CPU version to satisfy the request.

SOMETHING ABOUT PER-JOB MIN #PROCS, SCORE-BASED SCHED ETC.


3.3 GPU-enabling the BOINC client


In running GPU applications on volunteer PCs, we need to take into account:

- Video RAM (the local memory used by GPU apps) is not paged. Thus a preempted GPU application cannot have its state saved by the operating system; to preempt a GPU job, the client must ask it to quit and wait for it to exit.

- The runtime systems in current GPUs do not provide preemptive multitasking. Kernels run to completion. Thus, if a GPU application is running (even if its CPU part runs at low priority) the user will typically see some slowdown of GUI operations.

The BOINC client was modified in several ways to accommodate GPUs. First, it detects the presence of GPUs, together with their hardware characteristics (clock rate, video RAM, number of processors) and their driver software version; this is reported to the server.

Second, the work-fetch policy, which decides when to fetch work and which project to fetch it from, was redesigned. Recall from Section 2 that volunteers can associate a "resource share" with each project, determining the fraction of resources that it receives. The notion of resource share has been generalized to apply to the entirety of a host's processing resources, rather than to the resources separately, with the goal that the division of FLOPS among projects should match the resource shares as closely as possible.

For example, suppose a host is attached to projects A and B with equal shares, the host has a GPU that is twice as fast as its CPU, and both projects have a GPU app but only project A has a CPU app. Then project A will get 100% of the CPU and 25% of the GPU, while project B will get 75% of the GPU.

shortfall / RR simulation / debt

overall debt =

if a GPU is idle
if a CPU is idle
if a GPU has shortfall
if a CPU has shortfall
...

How to update debt
How to estimate if a project has jobs for a resource

Using each job's host usage data, this mechanism estimates the queue length for each resource. For example, if the current work queue will keep the CPU busy for 20 hours and the GPU busy for 5 hours, and the network connect interval is 8 hours, then the client needs to fetch at least 3 hours of GPU work. The work-fetch mechanism also keeps track of which projects have work for which resource types, and uses exponential backoff to avoid repeatedly asking a project for a type of work it currently doesn't have.
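The per-resource shortfall computation implied by this example is simple (the function name is ours):

    # Per-resource shortfall, as in the example above.
    def shortfall_hours(queued_hours, connect_interval_hours):
        # extra work needed to keep the resource busy until the next connection
        return max(0.0, connect_interval_hours - queued_hours)

    # CPU queued for 20 h, GPU for 5 h, network connect interval 8 h:
    assert shortfall_hours(20, 8) == 0.0   # no CPU work needed
    assert shortfall_hours(5, 8) == 3.0    # fetch at least 3 h of GPU work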

Third, the client's job scheduling policy, which selects a set of jobs to run, has been generalized to use host usage data. It gives priority to GPU jobs, schedules them in earliest-deadline-first order to minimize preemption, and passes them a command-line argument indicating which GPU instances to use. It runs GPU jobs at normal operating system priority; the CPU portion of a GPU job typically uses little time, and if it runs at low priority the entire job runs inefficiently.
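A hypothetical sketch of this policy follows; the job fields and the device-selection flag are illustrative assumptions, not the client's actual interface.

    # Sketch of the client's GPU job scheduling: GPU jobs take priority,
    # run earliest-deadline-first, and are told which GPU instances to use.
    def schedule_gpu_jobs(jobs, n_gpu_instances):
        gpu_jobs = sorted((j for j in jobs if j.ngpus > 0),
                          key=lambda j: j.deadline)   # EDF minimizes preemption
        running, next_free = [], 0
        for job in gpu_jobs:
            if next_free + job.ngpus > n_gpu_instances:
                break                                 # no free instances left
            devices = range(next_free, next_free + job.ngpus)
            job.cmdline = "--device " + ",".join(str(d) for d in devices)
            next_free += job.ngpus
            running.append(job)   # GPU jobs run at normal OS priority
        return running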

GPU jobs: always preempt by quit.
Preempting GPU jobs: need to wait for exit.
User preferences: don't use the GPU while the computer is in use.


3.4 GPU-enabling the anonymous platform mechanism


We extended the anonymous platform mechanism (Section 2.3) to handle coprocessor and multiprocessor applications. The application description file can now optionally specify the coprocessor and CPU requirements of each app version, and its FLOPS estimate: the same "host usage" information returned by the scheduler's application planning function.

The host usage information is used by the client for job scheduling and work fetch, as described in Section 3.3. It is also passed to the scheduler, which uses it, in lieu of the information returned by the application planning function, to calculate resource usage and estimate job run times.

If the anonymous platform mechanism is in use for a project, it overrides the "resource backoff" scheme.

The anonymous platform mechanism is useful for projects with open-source applications. It has been used in two different ways:

- SETI@home offered an NVIDIA GPU version of one of its applications. Volunteers optimized the application, achieving a speedup of over 20% over the stock version.

- Milkyway@home offered only a CPU application. Volunteers modified the application to run on ATI GPUs, achieving a speedup of about 100X over the CPU version.

In both cases the modified application will eventually become a standard version offered by the project's server; the anonymous platform mechanism allows volunteers to develop and debug the new versions, and to verify that their results match those of the "stock" versions.


4. Case studies


In SETI@home:

table: fraction of active hosts with NVIDIA GPUs
tables of GPU types, RAM sizes, est. FLOPS, multiplicities
fraction of work done by GPUs
elapsed time for GPU jobs vs. CPU

GPUgrid.net:

goal: min latency, not max throughput
max # outstanding jobs per GPU

5. Conclusion


Future work: homogeneous redundancy; job-size matching (since completion times are skewed).


References


Local Scheduling for Volunteer Computing. David P. Anderson and John McLeod VII. Workshop on Large-Scale, Volatile Desktop Grids (PCGrid 2007), held in conjunction with the IEEE International Parallel & Distributed Processing Symposium (IPDPS), March 30, 2007, Long Beach, CA.

High-Performance Task Distribution for Volunteer Computing. David P. Anderson, Eric Korpela, and Rom Walton. First IEEE International Conference on e-Science and Grid Technologies, 5-8 December 2005, Melbourne.

Homogeneous Redundancy: a Technique to Ensure Integrity of Molecular Simulation Results Using Public Computing. M. Taufer, D. Anderson, P. Cicotti, and C.L. Brooks III. 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), Heterogeneous Computing Workshop, April 4, 2005, Denver, CO.