Automatic Intra-Task Parallelization of
OpenCL Programs on Multi-GPU Systems
(Thesis Draft)
Department of Computer Science
National Chiao Tung University
Advisor: Dr. Yi-Ping You
Student: Vincent Sung
November 2013
Abstract

Open Computing Language (OpenCL), which enables GPUs to be programmed for general-purpose computation, has promoted GPU systems to be widely applicable platforms for parallel computing. In recent years, from cloud computing environments to consumer/entry-level platforms, multiple GPU devices in one system have become more common. However, the OpenCL programming model requires programmers to explicitly identify and schedule computations and communications among multiple GPU devices. A legacy OpenCL program, which is usually written for systems with only a single GPU device, does not directly benefit from multiple computing resources. Furthermore, resource management in such a system, such as workload distribution and synchronization among discrete memory spaces, increases programmers' burden. In this thesis, we propose a runtime abstraction layer, ViCL, which acts as a high-level agent between programmers and the vendor's OpenCL runtime system to untangle these problems, thereby providing backward compatibility for legacy programs and leading to more productive development of OpenCL programs. ViCL comprises a static analyzer and a runtime system. The static analyzer extracts dependency information between the kernel functions of an OpenCL program. The runtime system provides a front-end library and a runtime manager: the front-end library presents the concept of a single virtual device to programmers and delivers OpenCL API calls as commands to the runtime manager, which schedules commands to appropriate GPU devices according to the statically analyzed dependency information so as to automatically maximize the parallelism of OpenCL applications. The experimental results show that the mean abstraction overhead of ViCL with a single device is about 5%, with the main overhead incurred by data communication between applications and the runtime system. ViCL also makes legacy OpenCL applications scalable: the mean efficiency when running on one-GPU, two-GPU, and four-GPU platforms was 95%, 92%, and 87%, respectively.
Keywords:
OpenCL; GPGPU abstraction; GPGPU task scheduling; Memory management; Program flow analysis; Heterogeneous multi-core systems
Contents

Abstract
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Thesis Organization

2 Background
2.1 Introduction to OpenCL
2.1.1 Anatomy of OpenCL
2.1.2 Platform Model
2.1.3 Execution Model
Context
Program and Kernel
Command-Queues
Index Space
2.1.4 Memory Model
2.1.5 Programming Model
2.2 OpenCL Application Phases
2.2.1 Devices Setup Phase
2.2.2 Objects Allocation Phase
2.2.3 Data Initialization Phase
2.2.4 Kernel Issuing Phase
2.2.5 Retrieving Phase
2.2.6 Cleanup Phase

3 System Design and Implementation
3.1 Case Study: A Parallelizable Example
3.2 ViCL System Overview
3.3 Runtime Manager
3.3.1 Software Architecture
Front-end Library and Interpret Threads
Kernel Objects Initializer
Memory Pool
Dispatching Thread
Overlord Threads
3.3.2 Execution Flow
Prologue
Device Setup
Objects Allocation
Data Initialization
Kernel Issuing
Overview
Kernel Dependency Graph
Selection Policy and Dispatching Algorithm
Balanced Degree
3.4 Static Analyzer
3.4.1 Kernel Read/Write Dependency Hints
3.4.2 Control Data Flow Graph
3.5 Related Issues of Implementation
3.5.1 Drawbacks on Using Separated Model
3.5.2 Dependencies among Kernel Issuing
3.5.3 Unsupported Feature in OpenCL
3.5.4 Optimization for Partial Reference

4 Experimental Results
4.1 Environment
4.1.1 Platform Setup
4.1.2 Test Pattern
4.2 Evaluation
4.2.1 Result
4.2.2 Analysis & Discussion
Efficiency
4.2.3 Improvement
4.2.4 Multiple OpenCL Platforms

5 Related Work
Intra-task scheduling
Inter-task scheduling
Memory management

6 Conclusion and Future Work
6.1 Summary
6.2 Future Work

Bibliography

List of Figures

2.1 OpenCL platform model. (adapted from [19])
2.2 An example of NDRange index space. (adapted from [19])
2.3 Conceptual OpenCL device architecture with processing elements, compute units and devices. (adapted from [19])
2.4 Relation graph of OpenCL objects for multiple platforms.
2.5 Standard phases of an OpenCL application.
3.1 The 3MM-Family Examples.
3.2 ViCL framework overview. The Read/Write Behavior Hints can be generated stand-alone at static time or at runtime before enqueuing kernel objects into command-queues.
3.3 The finite automata of memory objects on the host.
3.4 The finite automata of memory objects on devices.
3.5 (a) Kernel abstractions for kernel add and kernel MM, both with two inputs and a single output. (b) Kernel node after argument setting. Boxes are kernel nodes; circles represent memory objects.
3.6 Example graphs for Ext-3MM: (a) data flow graph, (b) kernel dependency graph. Boxes are kernel nodes; circles represent memory objects.
3.7 Example schedule for the kDG in Ext-3MM.
3.8 Balanced Degree working example. Green nodes are ready to be scheduled; the purple node is a Split Kernel; the red node is a Retrieve Kernel.
3.9 Example CDFG. Circle nodes are memory objects; rectangles are basic blocks.
3.10 The shaded grids are referenced regions. Assume the references attempt to alter the values in the referenced regions; the black region is the overlap region. The grids inside a rectangle stand for a region of data.
4.1 Normalized scores relative to the vendor runtime.
4.2 Vendor version and ViCL comparison for 3MM.
4.3 Breakdown on the Rodinia group.
4.4 Result of streamcluster. The unit of measurement is seconds.
4.5 Breakdown charts for MCXCL, in seconds.
4.6 Breakdown charts for clsurf, in seconds. In the vendor column, Init is the cost of runtime initialization and the green block is I/O, which takes the place of the callback and Pro/Epo tuples in the ViCL column. The overlapping effect in the callback tuples is obvious; Figure 4.7 illustrates this concept.
4.7 Overlapped execution in the current implementation due to the prefetched implementation.
4.8 ViCLx2-dis runs on two separate devices; ViCLx2-int is the default configuration.
4.9 Comparison of the optimizations for clsurf with 4 devices, in seconds. naïve applies the expected flow; prefetch is the default result, as ViCLx4 in Figure 4.1; shmalloc applies shmalloc() only; fullopt applies both optimizations.
4.10 3MM comparison for different device configurations. Note: (X:Y) assigns dev0 as X and dev1 as Y under the ViCL configuration.

List of Tables

2.1 Memory region allocation and memory access capabilities.
3.1 Memory region allocation and memory access capabilities.
4.1 Summary detail for test patterns.
4.2 Configuration for each test instance.
4.3 Raw scores for test patterns on the vendor runtime and ViCLxN, where N is the number of devices.
4.4 Detailed representation of the tuples in the breakdown charts.
4.5 Efficiency for test patterns.
Chapter 1
Introduction
1.1 Motivation
OpenCL [7], the Open Computing Language, which makes GPUs suitable for general-purpose computation, has promoted GPGPU systems, also known as heterogeneous multi-core systems, to a major application domain in parallel computing. Heterogeneous multi-core systems have spread across multiple classes of platforms, from consumer-level personal computers in which an x86-based CPU coordinates with GPGPUs to cloud and scientific computing clusters.

In recent years, multiple GPGPU devices in one system have become more common. Recent architectures such as AMD Fusion (HSA) [3] and Intel Haswell [5] integrate an additional on-chip GPGPU to improve parallel computation while avoiding the relatively long transfer latency incurred by offloading parallel work to a discrete GPGPU. Computational farms and clouds in the field of supercomputing [11] are equipped with multiple GPGPUs not only to improve computational throughput but also to improve performance per watt (GFLOPS/TFLOPS per watt) [10].
In the software view, task-level parallelism comes in two forms: inter-task parallelism and intra-task parallelism. The former is the degree to which concurrent tasks can be issued, and the latter is the degree to which a single task can be parallelized. When multiple devices cohabit on the same board, the additional computing power can serve more parallel tasks and thus improves inter-task parallelism. Intra-task parallelism, on the other hand, requires further effort to utilize multiple devices for a single task under the current OpenCL programming model. That is, without careful modification, additional devices do not benefit existing applications.
The OpenCL programming model is a flavor of distributed computing with centralized management between the host, a process on the CPU, and workhorses such as GPUs or any other OpenCL-compatible devices. Programmers apply suitable APIs to manage the discrete computational resources. For portability across devices, programmers are required to design software mechanisms that adapt to diverse combinations of runtime environments. The OpenCL memory model exposes discrete memory spaces: each device has its own memory space, which requires programmers to coordinate transfers between multiple memory spaces. Because of the long transfer latency between memory spaces, efficient transfer management is also a necessary concern. Furthermore, apart from the increasing complexity of software maintenance for additional devices under the existing programming model, managing the issue of multiple kernels onto multiple devices is yet another scheduling problem with dependency analysis and synchronization issues. Hence we propose and implement a runtime abstraction layer, ViCL, acting as a high-level agent between the user and the OpenCL runtime to untangle these problems while providing backward compatibility and more productive OpenCL development.
1.2 Thesis Organization
In this thesis, we propose and implement ViCL, an OpenCL runtime system accompanied by a static analysis tool for OpenCL kernel code, to automatically parallelize task issuing and to manage multiple memory spaces. By providing a single virtual device and simplifying memory management to hide the existence of multiple devices, existing applications that were designed for a single device can directly gain performance if they are parallelizable. The single virtual device mechanism also lets programmers invest more in algorithm design and implementation instead of tedious management of multiple computational resources.
The remainder of this thesis is organized as follows. Chapter 2 introduces the basics of OpenCL, focusing on the memory model from a software design perspective. Chapter 3 presents the design of ViCL and its implementation issues. The performance evaluation is presented in Chapter 4. Chapter 5 surveys related work. Conclusions and future work are drawn in Chapter 6.
Chapter 2
Background
In this chapter, we introduce the OpenCL framework and the standard programming phases of an OpenCL application. We also indicate the candidate phases for resource management; the details of our implementation are left to Chapter 3, and related work is discussed in Chapter 5.
2.1 Introduction to OpenCL
OpenCL (Open Computing Language) is an open industry standard for general-purpose programming of heterogeneous systems. OpenCL is a centralized management framework that includes a set of APIs, libraries, and a runtime system on the host, plus a kernel language, providing a unified programming environment in which software developers can offload work to heterogeneous processing devices such as CPUs, GPUs, DSPs, and Cell/B.E. processors. With OpenCL, programmers can write portable code efficiently while the hardware-related details are handled by the OpenCL runtime environment. The Khronos Group released the OpenCL 1.0 specification in December 2008. The current version, OpenCL 1.2 [19], was announced in November 2011.
2.1.1 Anatomy of OpenCL
The OpenCL framework is divided into four models: the platform model, the memory model, the execution model, and the programming model. In this section we introduce the basic definitions of and relations between these models; detailed information can be obtained from the OpenCL specification. OpenCL provides an object-oriented concept in which every object can only be regarded as an abstract handle whose behavior depends on the particular OpenCL implementation [1, 2, 4, 6].
2.1.2 Platform Model
The platform is the basic layer in the OpenCL framework. Figure 2.1 defines a platform model with a single-platform environment. The model contains one host connected to one or more compute devices. Each device may have numerous compute units (CUs), and a compute unit is a group of processing elements (PEs). A processing element is the smallest basic compute instance on a device. A whole system may contain multiple platforms; a platform cannot explicitly coordinate with other platforms.
An OpenCL application behaves like a traditional cluster system with a host-client model: the application acts as the host that dispatches jobs to the compute devices. The jobs are processed by the processing elements within a device, and the computation results are transferred back to the host after execution completes. The processing elements process the jobs in a single-instruction-multiple-data (SIMD) or single-program-multiple-data (SPMD) manner. The locations of data and programs are managed by the host, as described in the following section.
2.1.3 Execution Model
The execution of an OpenCL program is divided into two parts: the control procedures on the host and the kernel executions on OpenCL devices. The host defines a context for the execution of kernels and creates command-queues to control and allocate resources. The host is able to specify index spaces to locate resources for certain devices at the stage of issuing jobs. The following paragraphs cover these terms in detail.

Figure 2.1: OpenCL platform model. (adapted from [19])
Context
A context is the essential record for any OpenCL application; it is used by the OpenCL runtime for managing command-queues, memory objects, program objects, and kernel objects. A context can be associated with one or more devices from the same platform for executing kernels. Objects are not explicitly shared between different contexts, which implies that objects are not directly shared between platforms.
Program and Kernel
Program objects are built either from OpenCL C source code in text files or by loading pre-built binaries. For portability, programmers are recommended to write a program offline and have it compiled automatically by the runtime system, much like the OpenGL shader programming mechanism, so it can be adapted and optimized for the underlying devices. For proprietary concerns, the pre-built binary method makes it possible to ship programs without revealing implementation details.
A program object is composed of a list of kernel objects and the subroutines that are invoked during kernel executions. Kernel objects are the abstraction of jobs issued by the host and contain the argument information of the represented function. Kernel objects must be created after the program executable has been built.
Command-Queues
Command-queues are the passages through which commands are passed to devices. There are three types of commands: kernel issuing commands, memory I/O commands, and synchronization commands. Each device must have at least one command-queue to receive commands from the host. It is also possible to create multiple command-queues on the same device under the same context in order to operate on commands in parallel; in this scenario, the programmer should manually insert synchronization commands to guarantee execution order.
Index Space
The index space supported in OpenCL is divided into a three-level hierarchy: NDRange, work-group, and work-item. An NDRange is an N-dimensional index space, where N is 1 to 3. An NDRange is composed of work-groups. Each work-group contains several work-items, which are the most fundamental executing elements of a kernel. The work-items in a given work-group execute concurrently on the processing elements of a single compute unit. A work-item is identified by a unique global identifier (ID). Each work-group is assigned a unique work-group ID, and each work-item is assigned a unique local ID within its work-group. Work-groups are assigned IDs using an approach similar to that used for work-item global IDs. According to these identifiers, a work-item can identify itself either by its global ID or by the combination of its local ID and work-group ID.
Figure 2.2: An example of NDRange index space. (adapted from [19])
An example of NDRange index space relationships, adapted from the OpenCL specification, is shown in Figure 2.2. This is a two-dimensional index space in which we define the size of the NDRange (G_x, G_y), the size of each work-group (S_x, S_y), and the global ID offset (F_x, F_y). The total number of work-items is the product of G_x and G_y. The size of each work-group is the product of S_x and S_y. The global ID (g_x, g_y) is defined as the combination of the work-group ID (w_x, w_y), the local ID (s_x, s_y), and the global ID offset (F_x, F_y):

(g_x, g_y) = (w_x · S_x + s_x + F_x, w_y · S_y + s_y + F_y)

The number of work-groups can be computed as:

(W_x, W_y) = (G_x / S_x, G_y / S_y)

The work-group ID can be computed from a global ID and the work-group size as:

(W_x, W_y) = ((g_x − s_x − F_x) / S_x, (g_y − s_y − F_y) / S_y)
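As an illustration of these relations (not taken from the thesis' examples), the following OpenCL C sketch lets every work-item recompute its global ID from its work-group ID, local ID, and the global offset; the buffer name check is made up for this example.

__kernel void id_relation(__global int *check)
{
    /* Built-in work-item functions corresponding to the symbols above. */
    size_t gx = get_global_id(0);       /* g_x */
    size_t wx = get_group_id(0);        /* w_x */
    size_t sx = get_local_id(0);        /* s_x */
    size_t Sx = get_local_size(0);      /* S_x */
    size_t Fx = get_global_offset(0);   /* F_x (OpenCL 1.1 and later) */

    /* g_x = w_x * S_x + s_x + F_x should hold for every work-item. */
    check[gx - Fx] = (gx == wx * Sx + sx + Fx) ? 1 : 0;
}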
                      Global        Constant      Local         Private
  Host     allocate   Dynamic       Dynamic       Dynamic       Dynamic
           access     Read/Write    Read/Write    No            No
  Devices  allocate   No            Static        Static        Static
           access     Read/Write    Read-Only     Read/Write    Read/Write

Table 2.1: Memory region allocation and memory access capabilities.
A wide range of programming models can be mapped onto this execution model. OpenCL explicitly supports the data-parallel and task-parallel programming models. It is possible to assign different combinations of index spaces to process partial regions of memory objects in each round of kernel execution if necessary.
2.1.4 Memory Model
There are four distinct memory regions: global, constant, local, and private memory. Global memory is accessible to all work-items, and its contents remain until the host explicitly deallocates the memory object. Constant memory is a region of global memory that remains constant during kernel execution. Local memory can be shared by all work-items in a work-group, and private memory can only be used by a single work-item. Data in both local and private memory only last for the duration of a kernel execution.
OpenCL uses a relaxed-consistency memory model. Every memory object is shared, subject to the region limits listed in Table 2.1, under the same context. There are no guarantees of memory consistency between different work-groups and compute units. Consistency for memory objects shared between enqueued commands is only guaranteed at synchronization points. The host uses OpenCL APIs to create memory objects in global memory and to enqueue memory commands that manipulate these memory objects. Data transfers between the host and devices are done by explicitly copying data or by mapping regions of a memory object into the host memory space. The relationship between the memory regions and the platform model is described in Figure 2.3.
Figure 2.3: Conceptual OpenCL device architecture with processing elements, compute units and devices. (adapted from [19])
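To make the four regions concrete, the following illustrative OpenCL C kernel (not from the thesis; the argument names are made up) touches each address space listed in Table 2.1.

__kernel void regions_demo(__global float *data,      /* global: created by the host, read/write */
                           __constant float *scale,   /* constant: read-only during kernel execution */
                           __local float *partial)    /* local: shared within one work-group */
{
    int lid = get_local_id(0);
    float tmp = data[get_global_id(0)];                /* private: an automatic per-work-item variable */

    partial[lid] = tmp * scale[0];
    barrier(CLK_LOCAL_MEM_FENCE);                      /* local/private data do not outlive the kernel */

    data[get_global_id(0)] = partial[lid];
}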
2.1.5 Programming Model
The OpenCL execution model supports the data-parallel and task-parallel programming models. In the data-parallel programming model, a sequence of instructions is applied to multiple elements of data. The index space defined in the OpenCL execution model is used to locate the region of a memory object on which a single work-item performs its computation. Programmers can define the total number of work-items so that each work-group locally accesses a slice of the data region, or assign a single work-item to access a collection of data.
The OpenCL task-parallel programming model is one in which a single instance of a kernel is executed independently of any index space. Users can exploit parallelism via three methods: using vector data types implemented by the device, enqueuing multiple tasks, or enqueuing native kernels developed with a programming model orthogonal to OpenCL.
Synchronization occurs in OpenCL in two situations. In the intra-kernel scenario, among work-items in a single work-group, a work-group barrier maintains consistency. For inter-kernel commands in the same context that are issued onto different command-queues or onto an out-of-order command-queue, programmers can use command-queue barriers and event objects to perform task-level synchronization.

Figure 2.4: Relation graph of OpenCL objects for multiple platforms.
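As a small host-side fragment (buffer, kernel, and queue names are hypothetical), an event object can order two commands that were enqueued on different command-queues of the same context:

size_t gsize = 1024;    /* hypothetical problem size */
cl_event done_a;
/* kernel_b must not start before kernel_a has finished. */
clEnqueueNDRangeKernel(queue0, kernel_a, 1, NULL, &gsize, NULL, 0, NULL, &done_a);
clEnqueueNDRangeKernel(queue1, kernel_b, 1, NULL, &gsize, NULL, 1, &done_a, NULL);
clReleaseEvent(done_a);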
2.2 OpenCL Application Phases
In this section, we introduce the standard phases of an OpenCL application, together with the hints required for using multiple devices. These phases establish heterogeneous programming under the OpenCL framework through the provided APIs. Figure 2.4 is an example relation graph between objects for an OpenCL application that attempts to utilize two platform instances. The following paragraphs introduce the build-up phases behind that chart.
2.2.1 Devices Setup Phase
The first step in Figure 2.5 is to select suitable devices for the computation. Since devices are bundled with a certain platform, the platform should be queried and chosen first. After getting handles to the desired devices, contexts can be associated with them. Since command-queues are the bridges from devices to the host through context records, only a valid context can bind command-queues to the corresponding devices.
Figure 2.5: Standard phases of an OpenCL application.
OpenCL aims at portability, so the devices setup phase is basic but tedious for programmers. Traditionally, programmers tend to write fixed code that binds the first queried device and platform by default instead of selecting the most suitable combination. Furthermore, when attempting to use multiple devices from different platforms, programmers must set up the contexts individually, since a context and its bound objects do not cross platforms.
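A minimal sketch of this setup work (error handling omitted; all names are generic) that enumerates every platform and every device, rather than hard-coding the first pair, might look as follows.

#include <CL/cl.h>
#include <stdlib.h>

void setup_all_devices(void)
{
    cl_uint nplat;
    clGetPlatformIDs(0, NULL, &nplat);
    cl_platform_id *plats = malloc(nplat * sizeof(cl_platform_id));
    clGetPlatformIDs(nplat, plats, NULL);

    for (cl_uint p = 0; p < nplat; ++p) {
        cl_uint ndev;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 0, NULL, &ndev);
        cl_device_id *devs = malloc(ndev * sizeof(cl_device_id));
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, ndev, devs, NULL);

        /* A context (and the objects bound to it) may only span devices of this one platform. */
        cl_context ctx = clCreateContext(NULL, ndev, devs, NULL, NULL, NULL);
        /* ... create one command-queue per device, build programs, allocate buffers ... */
        clReleaseContext(ctx);
        free(devs);
    }
    free(plats);
}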
2.2.2 Objects Allocation Phase
As mentioned in the last section, objects are bound to a certain context. Since objects are not interchangeable or convertible by default, sharing them requires further manipulation beyond direct API calls. Program and kernel objects are associated with devices; with the runtime compilation mechanism, the created program binaries may not be compatible across different combinations of devices and target platforms. Memory objects do not require binding to a certain device at allocation time, but they are bound to the assigned context.
2.2.3 Data Initialization Phase
The data initialization phase is a synchronization point for a given memory object. The programmer must manually initialize input data from the host memory space into the memory space of the labor devices by using the appropriate APIs.
The programmer can also directly copy one buffer to another, such as memory buffer A to memory buffer B under the same context, without considering which device's memory space holds the valid data; the synchronization is handled automatically by the default scheduling policies or by a programmer-defined order expressed with event objects.
In the multiple-context case, if memory buffer A and memory buffer B belong to context0 and context1 respectively, the programmer must manually enqueue a two-step transfer: a first transfer API call from buffer A to a temporary storage area preallocated in host memory, and a second transfer API call from that temporary storage to buffer B.
Although sharing objects across multiple contexts rarely happens in most applications, synchronizing parallel tasks that share the same memory objects creates demands on temporary storage. This issue is discussed in the next paragraphs.
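A sketch of the two-step transfer described above, assuming bufA lives in context0 (reachable through queue0), bufB lives in context1 (queue1), and both are size bytes:

#include <CL/cl.h>
#include <stdlib.h>

void copy_across_contexts(cl_command_queue queue0, cl_mem bufA,
                          cl_command_queue queue1, cl_mem bufB, size_t size)
{
    void *tmp = malloc(size);   /* preallocated host-side staging storage */
    /* Step 1: device (context0) -> host. */
    clEnqueueReadBuffer(queue0, bufA, CL_TRUE, 0, size, tmp, 0, NULL, NULL);
    /* Step 2: host -> device (context1). */
    clEnqueueWriteBuffer(queue1, bufB, CL_TRUE, 0, size, tmp, 0, NULL, NULL);
    free(tmp);
}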
2.2.4 Kernel Issuing Phase
The kernel issuing phase can be divided into two parts: argument setting and enqueuing the kernel execution. The arguments specify the memory objects used by the kernel execution. Each kernel argument is set with the clSetKernelArg API, and the last value set on a given argument slot is the one in effect when the kernel execution is enqueued. As a consequence, issuing the same kernel object to multiple command-queues in parallel requires careful coding if multiple threads attempt to set different sets of memory objects on the same kernel.
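One common way around this race, sketched below with hypothetical names, is to give every issuing thread its own kernel object created from the same program, because the argument state lives inside the cl_kernel object itself.

cl_int err;
/* Thread-private kernel object; argument settings no longer race with other threads. */
cl_kernel k_local = clCreateKernel(program, "matrixmul_CL", &err);
/* ... set all six arguments on k_local, exactly as in Listing 3.2 ... */
clEnqueueNDRangeKernel(my_queue, k_local, 2, NULL, gsize, lsize, 0, NULL, NULL);
clReleaseKernel(k_local);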
Furthermore, two different kernel executions may have a dependency through a memory object. In this case, either manually copying certain memory objects to break the dependency or carefully ordering the kernel issues must be considered. The dependency between memory objects is predictable if a read/write restriction is specified when the memory objects are created; otherwise, it requires data analysis of the executed kernels.
2.2.5 Retrieving Phase
The retrieving phase is a synchronization point for certain memory objects. There are two retrieving modes: task flushing and real retrieving. Both modes lead to updates of memory objects, since the transitions are carried out by kernel executions. The task flushing mode confirms that the jobs committed to the command-queues have completed, while in the real retrieving mode, as in the data initialization phase, the programmer uses an API to read memory objects back and update the host-side copies.
2.2.6 Cleanup Phase
This is the final synchronization point. This phase is responsible for freeing objects and reclaiming the resources back to the system.
In short, the data initialization, kernel issuing, and retrieving phases form the main life cycle of an OpenCL application. The transitions are straightforward for a single kernel execution on one device; the complexity increases with multiple kernels, especially when targeting them at multiple devices.
In this work we propose an OpenCL resource management system, ViCL, on top of vendor-provided runtime systems. By intercepting OpenCL APIs from the host program and statically analyzing the kernel code, which is written in OpenCL C, we can generate suitable schedules onto multiple devices in order to let applications that were designed for a single device benefit from multiple devices, and to ease programmers' efforts in managing additional computation resources. The design and implementation issues are discussed in Chapter 3.
Chapter 3
System Design and Implementation
In this chapter, the software architecture and design details of ViCL are described. We first provide two examples, 3-Matrix-Multiplication and Extended-3-Matrix-Multiplication, to help explain the flow of the system in section 3.1. The ViCL overview is introduced in section 3.2. There are two parts in the ViCL framework: the runtime manager is covered in section 3.3, and the prerequisite static analyzer is explained in section 3.4.
3.1 Case Study: A Parallelizable Example
Figure 3.1(a) shows 3-Matrix-Multiplication (3MM), a helpful example for both our system flow and our motivation. The Extended-3-Matrix-Multiplication (Ext-3MM) in Figure 3.1(b) is based on 3MM with an additional matrix addition before the matrix multiplications. The equation contains one matrix addition and three matrix multiplications: (A × B), (C × D), and the last multiplication on the results of (A × B) and (C × D). After the first matrix addition, the two intermediate multiplications can be done concurrently, and the final multiplication is the joint point. Hence, the maximum parallelism in these two examples is 2. Let T_origin be the time cost of the normal computation of 3MM and assume that the matrices are of the same size:

T_origin = Time_matrixmul(A,B) + Time_matrixmul(C,D) + Time_matrixmul(A×B, C×D) → 3 · Time_matrixmul
Figure 3.1: The 3MM-Family Examples. (a) 3MM: given four matrices A, B, C, and D of compatible sizes (with n > m), compute R = (A × B) × (C × D). (b) Ext-3MM: first compute B = X + Y, then R = (A × B) × (C × D).
The realistic time cost T_max for a fully improved 3MM is derived as:

T_max = (Time_matrixmul(A,B) + Time_matrixmul(C,D)) / 2 + Time_matrixmul(A×B, C×D) → 2 · Time_matrixmul

By Amdahl's law, the maximum improvement Imp_Max is:

Imp_Max = T_origin / T_max = 1.5
Listing 3.1 provides the OpenCL kernel code required for the 3MM family. The standard host controlling code of 3MM for a single device is listed in Listing 3.2. In order to support multiple devices, the host controlling code would have to be modified in multiple sections. Additionally, if the devices are provided by different platforms, memory objects such as dev_A, dev_B, and so on must be manually created and managed so that they are associated with the context of each device, and the synchronization between devices also falls on the programmer's shoulders. The amount of code may grow roughly N times longer, where N is the number of labor devices.
This work aims to automatically support multiple devices starting from existing single-device code, competing with manual designs or easing the design effort for multi-device versions.
__kernel void matrixmul_CL(__global float *mata,
                           __global float *matb,
                           __global float *matc, int m, int p, int n) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    float value = 0;
    for (int k = 0; k < p; k++)
        value += mata[i*p + k] * matb[k*n + j];
    matc[i*n + j] = value;
}

__kernel void matrixadd_CL(__global float *mata,
                           __global float *matb,
                           __global float *matc, int width) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    matc[i*width + j] = mata[i*width + j] + matb[i*width + j];
}

Listing 3.1: 3MM-Family OpenCL kernel code.
/* Device Setup & Object Initialization */
clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 1, device_id, &ret_num_devices);
/* Create OpenCL context */
context = clCreateContext(NULL, 1, device_id, NULL, NULL, &ret);
/* Create Command Queue */
command_queue = clCreateCommandQueue(context, device_id[0], 0, &ret);
/* Create Memory Buffers */
dev_A = clCreateBuffer(context, CL_MEM_READ_ONLY, host_A.total_size, NULL, &ret);
dev_B = clCreateBuffer(context, CL_MEM_READ_ONLY, host_B.total_size, NULL, &ret);
dev_C = clCreateBuffer(context, CL_MEM_READ_ONLY, host_C.total_size, NULL, &ret);
dev_D = clCreateBuffer(context, CL_MEM_READ_ONLY, host_D.total_size, NULL, &ret);
dev_A_B = clCreateBuffer(context, CL_MEM_READ_WRITE, host_A_B.total_size, NULL, &ret);
dev_C_D = clCreateBuffer(context, CL_MEM_READ_WRITE, host_C_D.total_size, NULL, &ret);
dev_R = clCreateBuffer(context, CL_MEM_WRITE_ONLY, host_R.total_size, NULL, &ret);
/* Create Kernel Program from the source */
program = clCreateProgramWithSource(context, 1, (const char **)&source_str,
                                    (const size_t *)&source_size, &ret);
/* Build Kernel Program */
ret = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
/* Create OpenCL Kernel */
kernel = clCreateKernel(program, "matrixmul_CL", &ret);
local_item_size[0] = local_item_size[1] = 1;
clEnqueueWriteBuffer(command_queue, dev_A, CL_TRUE, 0, host_A.total_size, host_A.val, 0, NULL, NULL);
clEnqueueWriteBuffer(command_queue, dev_B, CL_TRUE, 0, host_B.total_size, host_B.val, 0, NULL, NULL);
clEnqueueWriteBuffer(command_queue, dev_C, CL_TRUE, 0, host_C.total_size, host_C.val, 0, NULL, NULL);
clEnqueueWriteBuffer(command_queue, dev_D, CL_TRUE, 0, host_D.total_size, host_D.val, 0, NULL, NULL);
/* Execute OpenCL Kernel */
/* A x B */
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&dev_A);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&dev_B);
clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&dev_A_B);
clSetKernelArg(kernel, 3, sizeof(int), (void *)&host_A.col_size);
clSetKernelArg(kernel, 4, sizeof(int), (void *)&host_A.row_size);
clSetKernelArg(kernel, 5, sizeof(int), (void *)&host_B.row_size);
global_item_size[0] = host_A.col_size;
global_item_size[1] = host_B.row_size;
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_item_size, local_item_size, 0, NULL, NULL);
/* C x D */
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&dev_C);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&dev_D);
clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&dev_C_D);
clSetKernelArg(kernel, 3, sizeof(int), (void *)&host_C.col_size);
clSetKernelArg(kernel, 4, sizeof(int), (void *)&host_C.row_size);
clSetKernelArg(kernel, 5, sizeof(int), (void *)&host_D.row_size);
global_item_size[0] = host_C_D.col_size;
global_item_size[1] = host_C_D.row_size;
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_item_size, local_item_size, 0, NULL, NULL);
/* (A x B) x (C x D) */
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&dev_A_B);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&dev_C_D);
clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&dev_R);
clSetKernelArg(kernel, 3, sizeof(int), (void *)&host_A_B.col_size);
clSetKernelArg(kernel, 4, sizeof(int), (void *)&host_A_B.row_size);
clSetKernelArg(kernel, 5, sizeof(int), (void *)&host_C_D.row_size);
global_item_size[0] = host_A_B.col_size;
global_item_size[1] = host_C_D.row_size;
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_item_size, local_item_size, 0, NULL, NULL);
/* Retrieve Result R */
clEnqueueReadBuffer(command_queue, dev_R, CL_TRUE, 0, host_R.total_size, host_R.val, 0, NULL, NULL);

Listing 3.2: 3MM host controlling code.
3.2 ViCL System Overview
ViCL is composed of two parts: a static analyzer and a runtime manager. Figure 3.2 gives the system overview. The static analyzer focuses on the arguments of kernel code (OpenCL C functions and subroutines), aiming to judge the read/write behavior on memory objects for each pass through a kernel, and saves the result as a dependency hint. The data flow between multiple kernels can then be derived at runtime (the Program Initializer in Figure 3.2) by rebuilding the dependency hints from the enqueued kernel objects.
The runtime manager is a client-server model composed of two parts: the client side is the OpenCL front-end library, and the server side is an additional management layer on top of the vendor-dependent OpenCL runtime implementations. The front-end library intercepts OpenCL API calls in the application and sends them to the runtime manager via the inter-process communication (IPC) mechanisms provided by the underlying system. The runtime manager acts as a server process that serves all the processes invoking OpenCL APIs in the whole system.

Figure 3.2: ViCL framework overview. The Read/Write Behavior Hints can be generated stand-alone at static time or at runtime before enqueuing kernel objects into command-queues.
The client-server model aims to support better resource utilization for inter-task parallelism. Intra-task scheduling is also affected by other tasks through data-aware and device-loading-aware scheduling policies. In this work, we focus on intra-task parallelism. A basic balanced scheduling policy that supports multiple processes is implemented, but further scheduling issues for inter-task parallelism are left untouched.
ViCL reveals a single virtual platform with a single-device image to programmers. The virtual configuration, or the concept of an abstract device, not only trivially supports existing applications written for a single device but also hides the complexity of utilizing the underlying system information behind automatic arbitration.
3.3 Runtime Manager
In this section, the runtime framework of ViCL is described, including the software architecture, the execution flow, and the relationships among the software components.
3.3.1 Software Architecture
Figure 3.2 presents the architecture of the ViCL runtime components in this work, which includes a front-end library linked into application processes and a server process: an integrated runtime manager with the Memory Pool, the Task Dispatching Thread, the Kernel Objects Initializer, and the Overlord Threads. The functionality of each component is described as follows:
Front-end Library and Interpret Threads
The front-end library is a wrapper-style library that redirects OpenCL API calls to the runtime manager. It is implemented as a replacement for the vendor-dependent libOpenCL.so and is responsible for the Device Setup Phase of section 2.2. The current implementation performs the OpenCL Platform Layer APIs directly in the application space in order to create the virtual image without having to query the details of the underlying hardware configuration. Other classes of APIs are passed to the runtime manager to get the corresponding results. All forwarded API calls first undergo basic verification according to the OpenCL specification, such as null-pointer or integer range checking, before any further computation.
Currently the IPC connection between the runtime manager and the application uses a Unix domain socket. It is implemented as a replaceable module, so it could be ported to a remote OpenCL computing server using BSD sockets or to a virtual machine system using Virtio. The IPC connection mimics the OpenCL context; that is, only when a context is created in the application does the runtime manager get ready to serve the context owner. Multiple IPC connections per application are allowed, just as multiple contexts are viable in the original flow.
The Interpret Thread (IT) is the agent of a context. Each IPC connection invokes a unique IT to serve it in the runtime manager. The IT processes the incoming API calls from the application, including creating the command-queues toward the Dispatching Thread, enqueuing kernel executions, and performing valid operations on memory objects. An IT is blocked when it attempts to perform an invalid operation; the validity of operations is explained in the Memory Pool section.
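As a purely conceptual sketch (not the actual ViCL code), a front-end entry point might forward a call over the per-context Unix domain socket roughly as follows; the wire format, ctx_fd, and vicl_wait_reply are all made-up names.

#include <CL/cl.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical wire format and helpers; the real ViCL protocol is not listed in the thesis text. */
enum { VICL_CMD_FLUSH = 42 };
struct vicl_cmd { int opcode; char payload[64]; };
extern int ctx_fd;                       /* socket opened when the context was created */
extern cl_int vicl_wait_reply(int fd);   /* blocks until the runtime manager answers */

/* The front-end library re-implements the standard entry points. */
cl_int clFlush(cl_command_queue queue)
{
    struct vicl_cmd c = { .opcode = VICL_CMD_FLUSH };
    memcpy(c.payload, &queue, sizeof queue);   /* abstract handle, not a vendor handle */
    send(ctx_fd, &c, sizeof c, 0);             /* deliver to this context's Interpret Thread */
    return vicl_wait_reply(ctx_fd);
}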
Kernel Objects Initializer
The runtime manager maintains a predefined archive on the filesystem, CLProgramBank, that stores the Kernel R/W Behavior Hints and previously executed OpenCL program binaries, which are dumped via clGetProgramInfo through the vendor-provided runtime, in order to speed up the initialization process in future rounds. The Kernel Objects Initializer loads files from CLProgramBank when program objects are created. In this section, the Kernel R/W Behavior Hints are assumed to be ready; this prerequisite is deferred to section 3.4.
The current implementation requires the programmer to use the clCreateProgramWithSource API to initialize program objects, since the search keys into CLProgramBank for the hints and program binaries are generated from hashed tags of the input kernel sources. The generated program objects contain the dependency hints for the inherited kernels and the vendor-dependent program binaries for the underlying hardware. Since the target device for a given kernel is undefined at the creation of the kernel object extracted from the program object, an additional kernel abstraction is required to participate in the Dispatching Thread. The kernel abstraction is configured into a device-dependent kernel object after the binding information has been generated.
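For illustration only, the CLProgramBank lookup key could be derived from a hash of the source handed to clCreateProgramWithSource; the FNV-1a hash and the file-name scheme below are assumptions, not the format ViCL actually uses.

#include <stdint.h>
#include <stdio.h>

/* Hash the kernel source into a tag that names the cached hint/binary files. */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

void bank_paths(const char *kernel_source, char *hint_path, char *bin_path, size_t len)
{
    uint64_t tag = fnv1a(kernel_source);
    snprintf(hint_path, len, "CLProgramBank/%016llx.hints", (unsigned long long)tag);
    snprintf(bin_path,  len, "CLProgramBank/%016llx.bin",   (unsigned long long)tag);
}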
Memory Pool
Since the target working device for a given memory object is unknown at the allocation phase under ViCL, the Memory Pool is provided as an abstract device memory space where programmers can temporarily store data to initialize memory objects with host data. This introduces an additional, unavoidable transfer between the memory spaces of the application and the runtime manager, but it makes it more flexible to relocate memory objects between multiple devices. Furthermore, the manual API calls for transferring memory objects between different devices become an automatic transaction for which the runtime manager is responsible for the synchronization. Currently the abstract device memory is implemented with the Unix shared memory mechanism to minimize the overhead of the additional transfer between the application and the runtime manager; the shmid represents the handle of a memory object in the application's semantics.
The Memory Pool is a set of memory objects; each member points to an abstract memory object with N+1 tuples of memory blocks and state information for N+1 memory spaces, where N is the number of devices and the additional one is the host state. The memory block on the host is the region that the application regards as abstract device memory. The memory blocks on devices refer to handles created by the vendor-provided runtime. There are five states for a memory object on the host side: EMPTY, ALLOC, READY, INUSE, and STALE.
The application uses API calls to operate on memory objects, and the Interpret Thread (IT) acts as the host agent that carries out those calls. The valid operations are listed below, and the state diagram is illustrated in Figure 3.3.
- The state starts and ends at EMPTY, before the object is created and after it is deallocated.
- Creation with host initialization leads directly to READY.
- Creation without host initialization leads first to ALLOC; once data have been written, the state changes to READY. ALLOC also represents the state in which the application is updating the data region. The object is locked by the application while in ALLOC, and the resource manager blocks if it attempts to operate on it.
- The IT can directly operate on an object only in state READY.
- When the IT enqueues a kernel execution, it checks the dependency hints. If the memory object is marked as a write victim, the state changes to INUSE.
- If the IT attempts to operate on an INUSE or STALE memory object, the IT is blocked and sends a request to the Dispatching Thread to issue the currently enqueued kernels or to find the latest data among the devices' memory spaces to restore synchronization.
- After the synchronization is done and before the IT is unblocked, the memory object either goes back to READY or remains STALE. It remains STALE only when the object is not used in the current semantics; this policy ensures fewer transfers from devices to the host.
Figure 3.3: The finite automata of memory objects on the host.
Symmetrically, there are four states for a memory object on each device: EMPTY, READY, INUSE, and STALE. The absence of ALLOC is hidden by the vendor runtime, since that transition is performed by directly calling the vendor-provided API. The valid operations are listed below, and the state diagram is illustrated in Figure 3.4; a small code sketch of both automata follows the figure.
- The state starts and ends at EMPTY, before the object is created and after it is deallocated.
- After creation, the object is READY.
- If the host or another device updates the memory object, the state becomes STALE.
- INUSE is a transitional written state between the issue of a kernel and its completion. In this state the memory object is locked, and only this device can operate on it.
- The transition from STALE to READY happens when this device needs to use the memory object; it must follow the synchronization steps to find the latest data among the memory spaces and update the data on this device.
Figure 3.4: The finite automata of memory objects on devices.
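A minimal sketch of the bookkeeping these two automata imply, with one host tuple plus one tuple per device; the field names and MAX_DEVICES are illustrative, not ViCL's actual structures.

#include <stddef.h>

#define MAX_DEVICES 4                    /* assumed upper bound for this sketch */

enum mem_state { EMPTY, ALLOC, READY, INUSE, STALE };  /* ALLOC exists on the host side only */

struct mem_space {
    enum mem_state state;
    void *handle;      /* host side: shared-memory block (shmid); device side: cl_mem handle */
};

struct pool_entry {                      /* one abstract memory object in the Memory Pool */
    size_t size;
    struct mem_space host;               /* the "+1" host tuple */
    struct mem_space dev[MAX_DEVICES];   /* N device tuples */
};

/* Example transition: a device that needs a STALE copy must resynchronize first. */
static void make_ready_on(struct pool_entry *m, int d)
{
    if (m->dev[d].state == STALE) {
        /* ... locate the latest copy among host/devices and transfer it to device d ... */
        m->dev[d].state = READY;
    }
}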
Dispatching Thread
The Dispatching Thread (DT) comes into service when there is a blocked Interpret Thread (IT). The DT commits enqueued kernels onto particular devices according to a schedule. The schedule is generated from the Kernel Dependency Graph (kDG), a simplification of the data flow graph that is derived from the dependency hints of the enqueued kernels. When no blocked IT exists, the DT goes into the sleeping state. The details of the scheduling policies and schedule generation are left to section 3.3.2.
In the original flow, the issuing timing is implementation dependent; the policy of vendor implementations is to issue as soon as possible once kernels are enqueued in command-queues. ViCL implicitly postpones kernel issuing until an Interpret Thread is blocked by a synchronization request, as discussed in the Memory Pool section. This design aims to expose a larger number of enqueued kernels, increasing the opportunity for parallelism among them and improving computation throughput.
The current implementation uses a single thread to serve all the requests from ITs; this improves device utilization through centralized task dispatching without complex thread communication. To keep the scheduling mechanism small and fast, graph generation is offloaded to the ITs before they enter the blocked state.
Overlord Threads
The Overlord Threads (OTs) are the final frontier interfacing with the vendor runtime. Each OT binds to a real device with vendor-generated object handles. The OT follows the committed schedule to reconstruct the abstract objects into vendor-compatible object handles and complete the device computation. When a task is done, the OT must unblock the requesting Interpret Thread so it can return the computed results to the application. Since the Overlord Threads technically use different OpenCL contexts, the synchronization between OTs and the unblocking of ITs is currently done through additional procedures written with the aid of the pthread library.
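A conceptual sketch of an Overlord Thread's main loop is shown below. The structures, bcq_pop(), and translate_kernel() are assumptions introduced for illustration; only the OpenCL and pthread calls are real APIs.

#include <pthread.h>
#include <CL/cl.h>

struct it_sync   { pthread_mutex_t lock; pthread_cond_t cond; int done; };
struct sched_cmd { cl_uint work_dim; size_t gsize[3], lsize[3]; struct it_sync *it; /* ...kernel, args... */ };
struct ot_ctx    { cl_command_queue queue; void *bcq; };

struct sched_cmd *bcq_pop(void *bcq);                              /* assumed: blocking dequeue from the BCQ */
cl_kernel translate_kernel(struct ot_ctx *, struct sched_cmd *);   /* assumed: abstract -> vendor kernel */

void *overlord_main(void *arg)
{
    struct ot_ctx *ot = arg;
    for (;;) {
        struct sched_cmd *cmd = bcq_pop(ot->bcq);    /* wait for the Dispatching Thread to push work */
        cl_kernel k = translate_kernel(ot, cmd);     /* bind the abstract kernel and its arguments to this device */
        clEnqueueNDRangeKernel(ot->queue, k, cmd->work_dim, NULL,
                               cmd->gsize, cmd->lsize, 0, NULL, NULL);
        clFinish(ot->queue);                         /* wait until the device has completed the work */

        pthread_mutex_lock(&cmd->it->lock);          /* unblock the requesting Interpret Thread */
        cmd->it->done = 1;
        pthread_cond_signal(&cmd->it->cond);
        pthread_mutex_unlock(&cmd->it->lock);
    }
    return NULL;
}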
3.3.2 Execution Flow
In this subsection, the execution flow of ViCL is described, including the detailed design of the dispatching policy. The traversal follows the same OpenCL application phases as in section 2.2.
Prologue
Initially, the ViCL runtime manager must start up before any OpenCL requests from applications in the system. ViCL starts up with a complete setup of the system environment: Interpret Threads in a thread-pool manner, an Overlord Thread for each device, Memory Pool initialization, a prepared virtual configuration for future queries, and a spawned socket listening for requests from OpenCL applications. Note that the Dispatching Thread generates schedules and pushes scheduled kernels to the Overlord Threads, which invoke the vendor-provided APIs to process the commands; these communications need implicit queues. The current implementation provides back-end command-queues (BCQs) to meet this need, in contrast to the explicit front-end command-queues (FCQs) of OpenCL. Each OT binds to one BCQ to receive scheduled commands.
Device Setup
ViCL provides a single virtual platform and device combination. The true setup is completed before the application is started. All APIs that query platform and device information return a virtual configuration that is derived from the most conservative configuration of the underlying devices. ViCL's obligation is to keep the application behavior consistent with the original flow for backward compatibility and correctness. After creating the context object bound to the virtual device, the application is connected with the runtime manager and can issue OpenCL computing requests.
Objects Allocation
The only difference between ViCL and a bare OpenCL runtime in this phase is the additional abstraction of objects. The idea is to create additional keys as handles in the application space and use these keys to map to the corresponding high-level abstract objects in the runtime manager. The high-level objects are translated into real handles provided by the vendor-dependent runtime at final issue time, under the charge of the Overlord Threads.
The need for these abstractions has been explained earlier. For example, the abstraction of a memory object uses the shmid of a shared memory segment as the handle, because the binding device has not yet been decided when the memory object is created.
Data Initialization
The ViCL runtime manager allocates the Memory Pool to abstract the device memory of OpenCL. The transfer to physical device memory is deferred until the binding information has been generated by the Dispatching Thread. Thus, data initialization copies data directly into the Memory Pool from the application space through the shared memory mechanism.
Kernel Issuing
In this section, the kernel issuing and dispatching policies are covered.
Overview This phase involves all types of threads in the ViCL runtime manager. The Interpret Thread sets the kernel arguments and enqueues kernels into the front-end command-queues (FCQs). The Dispatching Thread extracts kernels from the FCQs and generates a schedule onto the back-end command-queues (BCQs). The Overlord Threads read commands from the BCQs and execute the real computations.
Kernel Dependency Graph ViCL uses a kernel dependency graph (kDG) to generate the dispatching schedule. In a kDG, a node is a single kernel execution and an edge represents a kernel dependency. The dependencies may result from data dependencies or from kernel execution orders specified by the programmer.
The kDG is a simplified data flow graph (DFG). There are two kinds of nodes in a DFG: data nodes, representing memory objects, and operation nodes, representing kernels. There are only two possible kinds of edges: an edge from a data node to an operation node indicates that the kernel reads the memory object, and an edge from an operation node to a data node indicates that the kernel writes its result to the memory object. ViCL builds the DFG from the Kernel Read/Write Dependency Hints of the enqueued kernels. The hints accompany the kernel abstraction, and the argument settings are applied to the kernel abstraction to form a set consisting of a kernel node and its dependent memory objects. Figure 3.5 illustrates the kernel abstractions and the kernel node for a single kernel execution in Ext-3MM. Note that the input and output edges of a kernel abstraction represent reads and writes of the __global region of device memory; the other regions listed in Table 2.1 are not considered here, because those scopes are not valid outside of a kernel execution. The seq number below each box in Figure 3.5(b) represents the enqueued order of that kernel.

Figure 3.5: (a) Kernel abstractions for kernel add and kernel MM, both with two inputs and a single output. (b) Kernel node after argument setting. Boxes are kernel nodes; circles represent memory objects.
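As an illustration of how such edges can be derived from the hints (read-after-write only; the structures below are made up and are not the ViCL implementation), a kernel that reads a memory object depends on the kernel that last wrote it:

#include <stddef.h>

#define MAX_ARGS 16

struct kernel_node {
    int seq;                               /* enqueued order */
    int n_read, n_write;
    int reads[MAX_ARGS], writes[MAX_ARGS]; /* memory-object ids taken from the R/W hints */
    struct kernel_node *succ[MAX_ARGS];    /* outgoing kDG edges */
    int n_succ;
};

/* last_writer[m] remembers which kernel node last wrote memory object m. */
void add_to_kdg(struct kernel_node *k, struct kernel_node **last_writer)
{
    for (int i = 0; i < k->n_read; ++i) {
        struct kernel_node *w = last_writer[k->reads[i]];
        if (w && w != k)
            w->succ[w->n_succ++] = k;      /* read-after-write dependency edge */
    }
    for (int i = 0; i < k->n_write; ++i)
        last_writer[k->writes[i]] = k;     /* this node now holds the latest copy */
}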
Selection Policy and Dispatching Algorithm As mentioned in section 3.3.1, Interpret Threads (ITs) trigger the Dispatching Thread (DT) when an IT encounters an invalid operation on a memory object. The DT selects one of the pending ITs to serve; the current selection policy is first-in-first-out. Many other selection policies could be applied, and related research on selection policies is left to Chapter 6. The DT is in the sleeping state when there are no pending ITs.
The DT dequeues all the kernels from the FCQ of the selected IT and generates the kDG to schedule kernels onto devices. Each scheduled kernel is pushed onto the corresponding back-end command-queue. In the Ext-3MM example, four kernel executions are enqueued. Figure 3.6 illustrates the DFG and the derived kDG for the Ext-3MM example.
A kernel node in the kDG can have two special types: Split Kernel and Retrieve Kernel (see the labels around the kernels in Figure 3.6(b)). A Split Kernel represents a synchronization point before the kernel graph splits into multiple concurrent flows. The kDG is cut at a Split Kernel: succeeding kernels are ignored for the moment, and the DT finds other schedulable kernels in topological order in the kDG. The kernel schedule is pushed to the back-end command-queues first, before the remaining kernels in the kDG are scheduled. A Retrieve Kernel represents the returning point of the OpenCL computation; it can be regarded as the last kernel that modifies the memory object that the IT last requested. Note that the returning point is a signal to wake the pending IT so that it can proceed; the rest of the kernels are still processed, overlapping with the awakened IT's cycles. The detailed algorithm describing the DT's behavior is given in Algorithm 1 and Procedure SCHED. Note that K' is the set of scheduled kernel nodes with their schedule information, Load is a loading vector with one entry per q ∈ Q, where Q is the set of back-end command-queues, and Limit is the maximum capacity allowed in the loading vector.

Figure 3.6: Example graphs for Ext-3MM: (a) data flow graph, (b) kernel dependency graph. Boxes are kernel nodes; circles represent memory objects.
Algorithm 1: Dispatching Thread
Input: T, a set of SLEEP ITs, each of which is associated with a kernel dependency graph; Q, a set of command queues attached on OTs.
Output: A parallelized dispatching schedule pushed to Q.
 1  while true do
 2    if T ≠ ∅ then
 3      Load ← 0⃗;   /* Load is a loading vector of size |Q|, and each element corresponds to the loading of an individual queue q ∈ Q. */
 4      K' ← ∅;
 5      Limit ← ∞;
 6      foreach T_i ∈ T do
 7        if |K'| ≠ 0 then
 8          Limit ← MAX(Load)†;
 9        end
10        K_i ← set of kernels in the kernel dependency graph that is associated with T_i;
11        K' ← K' ∪ SCHED(K_i, Q, Load, Limit);
12        if BALANCED(Load)‡ then
13          break;
14        end
15      end
16      foreach k_n ∈ K' do
17        Commit k_n to q_j ∈ Q, according to schedule information;
18        if k_n is a Split Kernel then
19          foreach successor k_x of k_n do
20            Remove edge from k_n to k_x §;
21          end
22        end
23        if k_n is a Retrieve Kernel then
24          Commit retrieve command;
25        end
26      end
27    else
28      SLEEP¶;
29    end
30  end

† MAX(vector) returns the maximum value of the given vector.
‡ BALANCED(vector) returns whether the coefficient of variation of the values in the given vector is less than 0.2.
§ Note that the dependency information from k_n to k_x should be reserved separately for later scheduling.
¶ SLEEP is a thread state, which would be signaled by other threads.
Function SCHED(K, Q, Load, Limit)
Input: K, a set of kernels to be scheduled; Q, a set of command-queues attached to OTs; Load, a loading vector corresponding to each q ∈ Q; Limit, the maximum capacity of the loading vector.
Output: K′, the set of kernels that are scheduled.

 1  K′ ← ∅;
 2  foreach k_i ∈ K, in topological order do
 3    AQ ← ∅;
 4    foreach k_j ∈ K that is a predecessor of k_i do
 5      AQ ← AQ ∪ {q_n};  /* q_n is provided by the previous schedule information of k_j */
 6    end
 7    if AQ = ∅ then
 8      AQ ← Q;
 9    end
10    LQ ← MINLoad(AQ);  /* MINLoad() returns the set of minimum-load queues */
11    q_m ← MINTransfer(LQ);  /* MINTransfer() returns the first queue with the minimum cost of data transfer; it returns NULL if |LQ| = 0 */
12    while (q_m ≠ NULL) ∧ (Load[q_m] + 1 > Limit) do
13      LQ ← LQ \ {q_m};
14      q_m ← MINTransfer(LQ);
15    end
16    if q_m ≠ NULL then
17      Mark the schedule information of k_i with q_m;
18      Load[q_m]++;
19      K′ ← K′ ∪ {k_i};
20      if k_i is not a Split Kernel then
21        foreach successor k_x of k_i do
22          Remove the edge from k_i to k_x;
23        end
24      end
25    else
26      break;
27    end
28  end
29  return K′;
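To make the queue-selection step concrete, the following is a simplified sketch of SCHED's inner loop under stated assumptions: kernels are visited in topological order, the MINTransfer tie-break and the Split/Retrieve edge bookkeeping are omitted, and all names and types are illustrative rather than ViCL's internals.

#include <cstddef>
#include <vector>

struct KernelNode {
    std::vector<int> pred_queues;  // queues assigned to already-scheduled predecessors
    int assigned_queue = -1;
};

// Assign each kernel to a back-end queue, never exceeding 'limit' per queue.
// Returns the indices of the kernels scheduled in this round.
std::vector<size_t> sched(std::vector<KernelNode> &topo,
                          std::vector<int> &load, int limit) {
    std::vector<size_t> scheduled;
    for (size_t i = 0; i < topo.size(); ++i) {
        KernelNode &k = topo[i];
        // AQ: queues used by predecessors, or every queue if there are none.
        std::vector<int> aq = k.pred_queues;
        if (aq.empty())
            for (int q = 0; q < static_cast<int>(load.size()); ++q)
                aq.push_back(q);
        // Pick the least-loaded feasible queue (MINLoad; MINTransfer omitted).
        int best = -1;
        for (int q : aq)
            if (load[q] < limit && (best < 0 || load[q] < load[best]))
                best = q;
        if (best < 0) break;          // no feasible queue: stop this round
        k.assigned_queue = best;
        ++load[best];
        scheduled.push_back(i);
    }
    return scheduled;
}

The Limit parameter is what allows Algorithm 1 to cap the load contributed by later ITs at the maximum load already produced, so that pulled-in work only fills otherwise idle slots.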
Figure 3.7: Example schedule for the kDG in Ext-3MM.

Figure 3.7 shows the generated schedule for Ext-3MM. Kernel seq0 is scheduled to dev0. Since kernel seq0 is the Split Kernel, the kDG is divided into two rounds when it is transformed into the schedule. After round 1, kernel seq0 has been pushed into the back-end command-queue of dev0, and the remaining graph is ready to be transformed into the second round of the schedule. Kernel seq1 and kernel seq2 can be executed concurrently and are scheduled on dev0 and dev1, respectively, as expected. Finally, kernel seq3 merges the results of the two kernels above. Since the transfer amounts for the inputs of kernel seq3 are equal, kernel seq3 is scheduled on dev1, resulting in a balanced workload on both devices.
Balanced Degree The expressions on line 6 to line 15 in Algorithm 1 form the scheduling part of the Dispatching Thread. The loop terminates either on the trivial condition that no unscheduled ITs remain for the DT, or when the BALANCED condition on line 12 is satisfied. That branch is taken if the Balanced Degree is less than 0.2, a common threshold for regarding a given population as uniform [22]; that is, the schedule distributes the kernels uniformly over the devices.

The Balanced Degree represents how evenly the currently computed schedule loads the devices. The current implementation measures this degree with the coefficient of variation over a population composed of the loading of each individual device and of all devices together.
The coefficient of variation (CV) is defined as

C_v = σ / μ

where σ is the standard deviation and μ is the mean of the given population. The CV expresses the extent of variability in relation to the mean of the population: a smaller coefficient of variation indicates smaller differences between the elements of the population, which here corresponds to balanced loads on the devices. The Balanced Degree is evaluated as follows.
The computed schedule can be regarded as a rectangle of M × N unit squares, where M is the maximum value in Load and N is the number of devices with non-zero loading (up to the number of devices in the whole system). Each unit square in the rectangle represents a scheduled slot for a kernel, and the schedule fills kernels into the slots in top-down order. Figure 3.8 illustrates this rectangle; in Figure 3.8(b), M = 4 and N = 4. The rectangle can be divided into N small rectangles of size M, each belonging to the corresponding device. The utilization of device i is defined as

util_i = load_i / M

and the total utilization is defined as

util_total = (Σ_{i=1..N} load_i) / (M × N)

Finally, the population P for the Balanced Degree consists of the N values util_i together with util_total, and

Balanced Degree = coefficient of variation of P.
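A minimal sketch of this computation follows, assuming for simplicity that every entry of the loading vector belongs to a device with non-zero loading, so that N is simply the vector length.

#include <algorithm>
#include <cmath>
#include <vector>

double balanced_degree(const std::vector<int> &load) {
    int m = *std::max_element(load.begin(), load.end());   // M: maximum loading
    if (m == 0) return 0.0;
    std::vector<double> pop;                                // util_i values
    double sum = 0.0;
    for (int l : load) { pop.push_back(static_cast<double>(l) / m); sum += l; }
    pop.push_back(sum / (static_cast<double>(m) * load.size()));  // util_total
    // Coefficient of variation: standard deviation divided by the mean.
    double mean = 0.0;
    for (double u : pop) mean += u;
    mean /= pop.size();
    double var = 0.0;
    for (double u : pop) var += (u - mean) * (u - mean);
    var /= pop.size();
    return std::sqrt(var) / mean;
}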
Figure 3.8: Balanced Degree working example. (a) Initially, the DT schedules the gray IT first. (b) The Split Kernel has been scheduled. (c) The other kernels in topological order have been scheduled; the Balanced Degree is 0.348, so the DT attempts to pull kernels from the indigo IT. (d) The available kernels from the indigo IT have been scheduled; the Balanced Degree is 0.196. Green nodes are ready to be scheduled, the purple node is a Split Kernel, and the red node is a Retrieve Kernel.
The Balanced Degree gives the DT an opportunity to pull work from succeeding pending ITs in order to fully utilize the current batch of scheduled tasks on the devices. Figure 3.8 illustrates an example of the contribution of the Balanced Degree. Assume two pending ITs are running on a four-device configuration. The DT schedules kernels from the gray IT up to kernel 6, which is a Split Kernel (Figure 3.8(b)), and then finds other kernels that can be scheduled in topological order (Figure 3.8(c)). The DT next selects the indigo IT and fills kernels into the less utilized devices to promote balanced utilization across the devices (Figure 3.8(d)). Note that kernel 4 of the indigo IT cannot be scheduled in this round: it depends on kernel 3, and scheduling it would place it on Dev1, which would break the maximum loading M and is therefore an invalid move.
3.4 Static Analyzer

The static analyzer reads the source code of the kernels and generates Kernel Read/Write Dependency Hints for each kernel. The source code is written in OpenCL C, which is based on C99 with extensions such as vector types of various lengths and memory region qualifiers (listed in Table 2.1). In this work, the dependency result is derived from a Control Data Flow Graph (CDFG) [12], which combines the data flow and the control flow of the source code.
3.4.1 Kernel Read/Write Dependency Hints

The hints describe the read/write behavior of certain kernel arguments. Since only arguments with the __global qualifier have a scope that crosses kernels, the static analyzer focuses on arguments with the __global qualifier. The usage of the hints has been explained in the preceding sections.

The static analyzer in ViCL differs slightly from the traditional approach, because kernel arguments involve blocks of data instead of single entities. The penalty of transferring blocks of data is the bottleneck of a heterogeneous multi-core system, and a kernel may not need to reference a full block of data, resulting in a partial reference. It is therefore worthwhile to transfer as small a fraction of the data as possible throughout the computation. The static analyzer provides reference range analysis at kernel scope. Since the reference range may not be resolvable at static time, for example because of control flow and dynamic reference values, the static analyzer generates reference range expressions accompanying the arguments, which can be evaluated at runtime to minimize the data transferred.
3.4.2 Control Data Flow Graph

The data flow graph is sufficient to determine the read/write behavior of given source code. In order to preserve the reference expressions for runtime evaluation, however, the control flow graph is also required at runtime. The Control Data Flow Graph (CDFG) is the combination of the data flow and control flow graphs: the main graph consists of the basic blocks and transitions from the control flow graph, and each block contains a local simplified data flow graph whose nodes represent data objects and whose edges are dependencies.

The Kernel Read/Write Dependency Hints provide the default behavior, which assumes that the kernel traverses all basic blocks of the control flow; the CDFG is left for runtime evaluation. The default behavior is the most conservative way to evaluate the read/write dependencies of arguments, since it simply regards a block of data as a single entity. Evaluation on the CDFG may yield less redundant data transfer, because it gradually narrows the reference range inside memory objects to a smaller coverage of the data.
__kernel void matrixmul_CL(__global float *mata,
                           __global float *matb,
                           __global float *matc,
                           __global float *matd,
                           int m, int p, int n) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    float value = 0;

    for (int k = 0; k < p; k++)
        value += mata[i * p + k] * matb[k * n + j];
    if (n % 2 == 0)
        matc[i * n + j] = value;
    else
        matd[i * n + j] = value;
}
Listing 3.3: Modified MM listing.
The current implementation utilizes static information from Clang to build the CDFG. Listing 3.3 is an example of a modified matrix multiplication in which runtime-provided information for the argument n controls where the result is stored. Figure 3.9 illustrates the CDFG of this example. The default behavior and the reference expressions for the four arguments with the __global qualifier are listed in Table 3.1. Note that the reference expressions are evaluated at runtime once the control flow is decided. Work-item built-in functions such as get_global_id() are evaluated at runtime with the ranges configured by the arguments of the clEnqueueNDRangeKernel API call. For example, assume work_dim is set to 2 and global_work_size is set to {256, 256}. The reference range for mata is derived as

i · p + k  →  get_global_id(0) · p + k  →  [0, 255] · p + [0, p − 1]

where [0, 255] and [0, p − 1] are the ranges from 0 to 255 and from 0 to p − 1, respectively, including the boundary values. Although this is a simple example, the detailed evaluation method is based on the run-time buffer access range analysis in [21].
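For illustration, here is a small interval-arithmetic sketch of how such a reference range expression could be evaluated once the NDRange configuration and the scalar argument p are known at runtime; the representation is an assumption, not ViCL's implementation.

struct Range { long lo, hi; };                    // closed interval [lo, hi]

Range add(Range a, Range b)  { return {a.lo + b.lo, a.hi + b.hi}; }
Range scale(Range a, long c) { return {a.lo * c, a.hi * c}; }     // assumes c >= 0

// i ranges over get_global_id(0) in [0, global_work_size[0] - 1];
// k ranges over the loop bounds [0, p - 1].
Range mata_range(long global_size0, long p) {
    Range i = {0, global_size0 - 1};
    Range k = {0, p - 1};
    return add(scale(i, p), k);                   // [0, 255]*p + [0, p-1]
}

With global_work_size[0] = 256 and p = 2048, this evaluates to [0, 255 · 2048 + 2047] = [0, 524287], i.e. only the first 256 rows of mata would need to be transferred in that configuration.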
Argument | R/W Behavior | Reference Expression
mata | READ | i · p + k
matb | READ | k · n + j
matc | WRITE | i · n + j
matd | WRITE | i · n + j

Table 3.1: Default R/W behavior and reference expressions for the __global arguments of Listing 3.3.
Finally, the full reference range for a given memory object is the union of the ranges along the runtime flow. For example, when the runtime determines that the n % 2 == 0 branch is taken, the flow traverses the CDFG in the order B7, B6, B5, B4, B3, B2, and matc is referenced with the listed range. Otherwise the traversal order is B7, B6, B5, B4, B3, B1, in which case matc remains unused and does not need to be loaded into the devices' memory space.

Figure 3.9: Example CDFG. Circle nodes are memory objects; rectangles are basic blocks.
3.5 Related Issues of Implementation

This section discusses related issues in the design and implementation of ViCL, including the disadvantages of the client-server based implementation and possible improvements, unsupported OpenCL features, and the issue of optimizing partial references on memory objects.
3.5.1 Drawbacks of Using Separated Models

There are two separated models inside the ViCL framework: the client-server model between the application and the runtime manager, and the multi-threaded implementation inside the runtime manager. The client-server model involves at least two processes in the system. Inter-process communication (IPC) is slow in comparison with intra-process communication, since IPC involves a larger scope of resource management in the operating system, such as process scheduling and kernel memory allocation. To minimize the IPC overhead, the current implementation uses a UNIX domain socket to bridge the application and the runtime manager, leaving further support for distributed systems (with BSD sockets) or virtualization systems (with VM sockets) for later. It is possible to allocate a shared memory region between the two processes to obtain the lowest IPC overhead, but at the cost of having no path toward distributed-system support.
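For illustration, a minimal sketch (not ViCL's actual wire protocol) of how a front-end library can forward an API call to a runtime manager over a UNIX domain socket; the socket path and the command layout are assumptions.

#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

struct ApiCommand {            // hypothetical fixed-size command header
    uint32_t opcode;           // e.g. an enum value identifying the OpenCL call
    uint32_t payload_size;     // bytes of marshalled arguments that follow
};

int connect_runtime_manager(const char *path = "/tmp/vicl.sock") {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;                 // one connection per application process
}

bool send_command(int fd, const ApiCommand &cmd, const void *payload) {
    if (write(fd, &cmd, sizeof(cmd)) != static_cast<ssize_t>(sizeof(cmd)))
        return false;
    return write(fd, payload, cmd.payload_size) ==
           static_cast<ssize_t>(cmd.payload_size);
}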
The runtime manager is composed of at least 3+N threads to serve OpenCL calls: a server listening thread, an interpret thread, a dispatching thread, and N overlord threads. Although it is rare for all 3+N threads to run concurrently, the communication between threads is yet another resource management problem. The current implementation uses the pthread library to implement the multi-threaded system, and the overheads of context switching and the locking mechanism are negligible. The overhead analysis is presented in Chapter 4.
3.5.2 Dependencies among Kernel Issuing

As mentioned earlier in this chapter, there are two special types of kernel node in the Kernel Dependency Graph (kDG): the Split Kernel and the Retrieve Kernel. There is no Merge Kernel type because of the topological order of the kDG: the dispatching thread schedules kernels in topological order, so a succeeding kernel becomes eligible for scheduling only after all of its ancestors have finished. The dispatching details are listed in Algorithm 1. Scheduled kernels are pushed into the overlord thread space for computation on the vendor-provided runtime.
There are two types of command-queue in OpenCL: in-order and out-of-order. An in-order queue guarantees that the enqueued commands finish in their enqueued order. An out-of-order queue provides an extra opportunity to improve overall performance, since the runtime automatically determines whether an enqueued command is ready to issue, based on the event objects configured by the programmer.

The current implementation uses in-order vendor-provided command-queues but lets the overlord threads extract tasks to issue in an out-of-order fashion, since the issuing order is inherited from the kDG, which already prevents write-after-write and write-after-read dependencies among kernel issues. Note that programmers can manually control the order of kernel issuing by specifying event objects; this order is transformed into the kDG as mentioned earlier in this chapter.
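For reference, the following sketch shows how a programmer can express such an ordering constraint with standard OpenCL event objects; the kernel and queue names are placeholders.

#include <CL/cl.h>

// On an out-of-order queue, kernel_b is held back until kernel_a completes.
void enqueue_ordered(cl_command_queue queue,
                     cl_kernel kernel_a, cl_kernel kernel_b,
                     size_t global_size) {
    cl_event done_a;
    // First kernel: no wait list, emits an event on completion.
    clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL,
                           &global_size, NULL, 0, NULL, &done_a);
    // Second kernel: waits on the event of the first one.
    clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL,
                           &global_size, NULL, 1, &done_a, NULL);
    clReleaseEvent(done_a);
}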
Read-after-write dependencies are prevented by a lock mechanism on memory objects. A memory object is locked if an issued kernel has a write dependency on it. The lock is taken when the kernel is enqueued into the vendor-provided command-queue in the corresponding overlord thread space, and it is released when the kernel task exits.
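One possible way to release such a lock when the kernel task exits is the standard clSetEventCallback mechanism, sketched below; the MemObject type and its atomic flag are assumptions and do not reflect ViCL's actual lock implementation.

#include <CL/cl.h>
#include <atomic>

struct MemObject {
    std::atomic<bool> locked{false};   // hypothetical per-object write lock
    cl_mem buffer;
};

static void CL_CALLBACK on_kernel_done(cl_event, cl_int, void *user_data) {
    // Unlock once the writing kernel has completed on the device.
    static_cast<MemObject *>(user_data)->locked.store(false);
}

void issue_writing_kernel(cl_command_queue q, cl_kernel k,
                          MemObject *obj, size_t global_size) {
    obj->locked.store(true);           // lock at enqueue time
    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global_size, NULL, 0, NULL, &ev);
    clSetEventCallback(ev, CL_COMPLETE, on_kernel_done, obj);
    clReleaseEvent(ev);
}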
3.5.3 Unsupported Features in OpenCL

Besides the deferred kernel issuing in ViCL, non-blocking reads and writes on memory objects are currently unsupported. An application calls a non-blocking API to enqueue read/write commands that move data between device memory and host-side memory storage. After the commands have been enqueued, the API call returns immediately to the host program, and the host-side storage remains invalid for the application until the runtime completes the commands. Since the current front-end library is a single-threaded linking library, asynchronously updating the status of the host-side storage is hard to simulate. Although a non-blocking implementation could be built with the help of process-level signals, installing an additional signal handler may cripple applications that have predefined signal handlers.
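For reference, the sketch below contrasts the blocking and non-blocking forms of the standard clEnqueueReadBuffer call; the buffer and queue names are placeholders.

#include <CL/cl.h>

void read_results(cl_command_queue q, cl_mem buf, float *host, size_t bytes) {
    // Blocking read: returns only after 'host' holds valid data.
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);

    // Non-blocking read: returns immediately; 'host' is not valid until the
    // associated event completes (e.g. after clWaitForEvents).
    cl_event ev;
    clEnqueueReadBuffer(q, buf, CL_FALSE, 0, bytes, host, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
    clReleaseEvent(ev);
}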
The buffer-mapping feature, which maps device memory into the host memory space with synchronization guaranteed by the runtime system, is also unsupported because of the same lack of asynchronous synchronization. The solution is to design the front-end library in a multi-threaded manner; this is left for future work.
Figure 3.10: The shaded grids are referenced regions. Assume that each reference attempts to alter the values in its referenced region; the black region is the overlap. Note that the grids inside the rectangle stand for a region of data.
3.5.4 Optimization for Partial References

Although the static analyzer provides reference expressions for evaluating the referenced range, implementing an optimization for partial references is not simple. The referenced method [21] is designed to split one memory object into multiple parts, but it does not prevent overlaps between these parts, which raises an issue in the merging phase after the tasks finish on the devices. Figure 3.10 illustrates this issue.
Since the two kernels reference approximately disjoint regions with small overlaps, there is an opportunity to mark the two kernels as able to execute concurrently and thereby improve parallelism. Note, however, that the reference range does not guarantee that all of its elements are actually referenced; it is only an approximate coverage. Compiler techniques such as affine transformations on the reference indices cannot be applied, because the references are between pairs of kernels and are decided dynamically at runtime. A full-size allocation must also be made on each device that attempts to access the memory object. Synchronizing the discrete memory regions would require further analysis using mechanisms such as dirty-bit stamps for each split region.
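As an illustration only (not part of ViCL), the following sketch shows the kind of conservative overlap test a runtime would need before declaring two partial references independent; the one-dimensional range representation is an assumption.

#include <algorithm>
#include <cstddef>

struct RefRange {            // closed interval [lo, hi] in element indices
    size_t lo, hi;
    bool writes;             // whether the kernel writes this range
};

// Two partial references conflict if their ranges intersect and at least one
// of them writes; otherwise the kernels may be issued concurrently.
bool conflicts(const RefRange &a, const RefRange &b) {
    bool overlap = std::max(a.lo, b.lo) <= std::min(a.hi, b.hi);
    return overlap && (a.writes || b.writes);
}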
Current hardware and runtime software provide no method to operate on dirty bits for device memory; this feature is supported by modern operating systems only for host memory. Systems in which the devices share the same memory space with the host could implement this optimization more efficiently. Another trivial solution is software-based: a brute-force method that allocates extra storage to compare the differences between regions, but its overhead could have a tremendous impact on overall performance. The current implementation therefore refuses to parallelize the issuing in Figure 3.10: the two kernels are regarded as dependent and are issued serially to the same device.
Chapter 4

Experimental Results

This chapter discusses the evaluation of the ViCL framework, including the environment, the experimental results, and the corresponding discussion.
4.1 Environment

4.1.1 Platform Setup

All evaluations were conducted on a system with dual Intel Xeon E5620 processors (each with 4 cores and 8 threads at 2.4 GHz, for a total of 8 cores and 16 threads) and 48 gigabytes of memory. To demonstrate the scalability of ViCL, the configuration contains four devices of the NVIDIA Fermi architecture (GF110): one NVIDIA GeForce GTX 590 (dual GF110) and two GeForce GTX 580s. The system drives its display from the on-board integrated VGA, so graphics operations are independent of the GPGPUs.

The software configuration is as follows: the underlying operating system is Gentoo Linux with kernel version 3.8.13-gentoo; ViCL is compiled with LLVM 3.2 [8], and the parser of the static analyzer uses Clang 3.2 [8]; test instances are built with GCC 4.6.3; and the vendor runtime system is CUDA SDK version 4.2.9 with NVIDIA driver version 325.15.
Name | Source | Domain | Behavior
3MM/Ext-3MM | hand made | Linear Algebra | single pass of multiple kernels
CFD Solver [cfd] | Rodinia Benchmark Suite [9] | Fluid Dynamics | for-loops with decomposable flow
Heart Wall Tracking [heartwall] | Rodinia Benchmark Suite | Medical Imaging | for-loops with independent iterations
Speckle Reducing Anisotropic Diffusion [srad] | Rodinia Benchmark Suite | Image Processing | for-loops with dependent kernel steps and iterations
Streamcluster | Rodinia Benchmark Suite | Data Mining | for-loops with interchanging parts between CPU and GPU in each iteration
Parallel-AES [PAES] | Gervasi et al. [18] | Cryptography | single pass on a single kernel, run multiple times
Monte Carlo eXtreme [MCXCL] | Fang et al. [16] | Medical Imaging | for-loops with independent iterations
Speeded Up Robust Features with OpenCL [clsurf] | Northeastern University and AMD [23] | Image Processing & Feature Extraction | for-loops with decomposable flow and independent iterations
BitCoin [24] | cgminer | Cryptocurrency | infinite loop searching keys with independent iterations

Table 4.1: Summary of the test patterns.
4.1.2 Test Patterns

The test instances used to evaluate ViCL are collected from multiple sources, from benchmarks to real-world applications, together with the examples described in Chapter 3 (3MM and Ext-3MM). Table 4.1 summarizes the test instances, including their sources, application domains, and a description of their program behavior. Table 4.2 lists the input data and parameters for each test instance.
Name | Configuration
Ext-3MM/3MM | Size is set to 2048x2048 for all matrices.
cfd | Default setting in the Rodinia benchmark suite.
heartwall | Default setting in the Rodinia benchmark suite.
srad | Default setting in the Rodinia benchmark suite.
streamcluster | Default setting in the Rodinia benchmark suite.
PAES | Built-in test_performance.py under the test directory.
MCXCL | Built-in quicktest with 100 rounds.
clsurf | Video mode on an input of 1000 frames, frame size 1920x1080.
cgminer | Default parameters, mining on the 50BTC mining pool.

Table 4.2: Configuration of each test instance.
4.2 Evaluation

We compare ViCL under different numbers of devices against the default bare vendor runtime system with a single-device configuration. Since ViCL focuses on intra-task parallelism, all data are collected individually.
4.2.1 Result

The evaluation results are listed in Table 4.3. Since the comparison metric differs from one instance to another, the measurement unit is listed after the name tag in parentheses. The normalized comparison is shown in Figure 4.1.

It is observed that ViCL with multiple devices outperforms the vendor runtime in most test patterns, except for cfd and streamcluster, and the improvement scales with the configured number of devices. The individual test instances are discussed in the next section.
4.2.2 Analysis & Discussion

In this section we discuss the test instances individually, except for the Rodinia group, which is discussed together.
Name | Vendor | ViCLx1 | ViCLx2 | ViCLx4
Ext-3MM (sec) | 43.84 | 42.99 | 28.58 | 28.66
3MM (sec) | 43.79 | 42.95 | 28.53 | 28.61
cfd (sec) | 7.53 | 8.92 | 15.36 | 13.18
heartwall (sec) | 5.06 | 2.84 | 1.99 | 1.28
srad (sec) | 3.93 | 1.15 | 1.16 | 1.14
streamcluster (sec) | 14.13 | 15.57 | 15.69 | 15.60
PAES (msec) | 1653.77 | 1664.09 | 893.93 | 436.35
MCXCL (photon/msec) | 147.47 | 146.59 | 271.84 | 484.92
clsurf (sec) | 9.92 | 11.60 | 7.77 | 5.13
cgminer (MHash/sec) | 122.6 | 119.8 | 237.1 | 460.2

Table 4.3: Raw scores for the test patterns on the vendor runtime and ViCLxN, where N is the number of devices.
Figure 4.1: Normalized scores relative to the vendor runtime.
Tag | Description
compute | GPU computing time in the Overlord Threads.
Effective Pro/Epo | Overhead in the Overlord Threads, including synchronization of memory objects between devices.
callback | Transfer time of the computed result.
abstraction | The abstraction overhead mentioned above, excluding Trans_mmobj.
genGraph | The kDG generation time.
Sched | Time spent in procedure SCHED in the Dispatching Thread.
Dispatch | Time spent in the Dispatching Thread excluding procedure SCHED.
Trans_mmobj | Additional transfer overhead resulting from the Memory Pool; this also reflects the amount of data transferred between host and devices.
Init | The runtime initialization cost, which is negligible in ViCL but heavy for the bare OpenCL runtime.

Table 4.4: Description of each tuple in the breakdown charts.
We provide breakdown information to illustrate the behavior of certain test instances; Table 4.4 describes each tuple in the breakdown charts. Efficiency is discussed in the epilogue of this section.
• 3MM-family

The total kernel computing time is about 1,000 times longer than the contribution of the data transfer time, so kernel computation dominates the time measurement for the 3MM family. Ext-3MM behaves almost the same as 3MM, because the computing time of the additional ADD kernel before the matrix multiplications is approximately equal to the transfer time of a single 2048x2048 matrix block. Since the 3MM family is designed to expose a parallelism of at most 2, the family cannot benefit from the additional devices in ViCLx4. By Amdahl's law, the realistic improvement Imp_Max is 1.5 (Section 3.1; a short sanity check follows this item). Figure 4.2 shows the comparison between a hand-coded version on the bare runtime with two devices and the ViCL series. The surpassing result is due to the runtime loading cost in the vendor's case. Without counting the runtime initialization (the Init portion of the Vendor x1 and Vendor x2 columns in Figure 4.2) in the bare runtime system, the efficiencies are 99% and 99% for ViCLx1 and ViCLx2, respectively.
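As a rough sanity check of this bound (assuming, as a simplification, three equally weighted multiplication kernels of which two can overlap on two devices), Amdahl's law with a parallelizable fraction of 2/3 gives

\[
\mathit{Imp}_{Max} \;=\; \frac{1}{\left(1 - \tfrac{2}{3}\right) + \tfrac{2/3}{2}} \;=\; \frac{1}{\tfrac{1}{3} + \tfrac{1}{3}} \;=\; 1.5 .
\]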