Dec 2, 2013

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY


Sponsors: National Science Foundation, LogicBlox Inc., and NVIDIA

Kernel Weaver: Automatically Fusing Database
Primitives for Efficient GPU Computation


Haicheng Wu¹, Gregory Diamos², Srihari Cadambi³, Sudhakar Yalamanchili¹

¹ Georgia Institute of Technology
² NVIDIA Research
³ NEC Laboratories America


Data Warehousing Applications on GPUs

The Opportunity
    Significant potential data parallelism
    If data fits in GPU memory, 2x-27x speedup has been shown [1]

The Challenge
    Need to process 1-50 TB of data [2]
    15-90% of the total time* is spent moving data between CPU and GPU
    Fine-grained computation

[1] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS, 2009.
[2] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.


Relational Algebra (RA) Operators

RA operators are the building blocks of DB applications
    Set Intersection
    Set Union
    Set Difference
    Cross Product
    Join
    Select
    Project

    Key   Value
    3     True, a
    3     False, b
    4     True, a

Example: Select [Key == 3]

    Key   Value
    3     True, a
    3     False, b
    4     True, a
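The Select [Key == 3] example can be reproduced in a few lines of Python. This is a minimal sketch, assuming a relation is stored as a list of (key, value) tuples; the `select` helper and the tuple layout are illustrative choices and do not come from the slides.

```python
def select(relation, predicate):
    """Keep only the tuples for which predicate(key, value) is true."""
    return [(key, value) for key, value in relation if predicate(key, value)]


# Input relation from the slide, one (Key, Value) tuple per row.
relation = [
    (3, (True, "a")),
    (3, (False, "b")),
    (4, (True, "a")),
]

# Select [Key == 3]: only the rows whose key equals 3 survive.
selected = select(relation, lambda key, value: key == 3)
print(selected)
```

Each output tuple depends on exactly one input tuple, which is the property the later slides exploit when they classify SELECT as having only thread dependence.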

Example: Join

    A               B               JOIN (A, B)
    Key   Value     Key   Value     Key   Value
    3     a         3     c         3     a,c
    3     b         4     d         3     b,c
    4     a         5     e         4     a,d

New Key = Key(A) ∩ Key(B)
New Value = Value(A) ∪ Value(B)
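The Join example can be checked with a short Python sketch. It uses a simple hash join on the key, which is a stand-in chosen for brevity and not the GPU algorithm the deck describes; the `join` helper and the list-of-tuples layout are assumptions made for illustration.

```python
from collections import defaultdict


def join(a, b):
    """Emit one (key, (value_a, value_b)) tuple per pair of rows with equal keys."""
    b_by_key = defaultdict(list)
    for key, value in b:
        b_by_key[key].append(value)
    return [(key, (va, vb)) for key, va in a for vb in b_by_key.get(key, [])]


# Relations A and B from the slide.
A = [(3, "a"), (3, "b"), (4, "a")]
B = [(3, "c"), (4, "d"), (5, "e")]

joined = join(A, B)
print(joined)
```

The output keys are the keys present in both inputs (3 and 4), and each output value combines the values of the matching rows, as in the JOIN (A, B) table.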

Data Movement in Kernel Execution

[Figure: Input -> Execute -> Result, annotated ~250 GB/s; diagram labels M, N, T]

Kernel Fusion - A Data Movement Optimization

Increase the granularity of kernel computation
Reduce data movement throughout the hierarchy
Inspired by loop fusion
Compile-time automation
Input is an optimized query plan

Kernel Fusion

[Figure: Before fusion, Kernel A reads A1, A2, A3 from GPU memory and writes Temp back to GPU memory; Kernel B then reads Temp and writes Result. After fusion, the fused kernel A&B reads A1, A2, A3 and writes Result, while Temp stays in the GPU core.]
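The before/after picture can be made concrete with a toy traffic count. The sketch below is an illustration only: `GlobalMemory` is a hypothetical stand-in for GPU global memory that counts element accesses, and the arithmetic inside the kernels (Temp = A1 + A2 * A3, Result = Temp * 2) is invented for the example, since the slides do not say what Kernel A and Kernel B compute.

```python
class GlobalMemory:
    """Toy stand-in for GPU global memory that counts element accesses."""

    def __init__(self, **arrays):
        self.arrays = {name: list(values) for name, values in arrays.items()}
        self.accesses = 0

    def alloc(self, name, n):
        self.arrays[name] = [None] * n

    def load(self, name, i):
        self.accesses += 1
        return self.arrays[name][i]

    def store(self, name, i, value):
        self.accesses += 1
        self.arrays[name][i] = value


def kernel_a(mem, n):
    # Kernel A materializes Temp in global memory.
    mem.alloc("Temp", n)
    for i in range(n):
        value = mem.load("A1", i) + mem.load("A2", i) * mem.load("A3", i)
        mem.store("Temp", i, value)


def kernel_b(mem, n):
    # Kernel B reads Temp back from global memory.
    mem.alloc("Result", n)
    for i in range(n):
        mem.store("Result", i, mem.load("Temp", i) * 2)


def fused_kernel(mem, n):
    # Fused kernel: temp is a local variable and never touches global memory.
    mem.alloc("Result", n)
    for i in range(n):
        temp = mem.load("A1", i) + mem.load("A2", i) * mem.load("A3", i)
        mem.store("Result", i, temp * 2)


def run(fused):
    n = 4
    mem = GlobalMemory(A1=[1, 2, 3, 4], A2=[5, 6, 7, 8], A3=[2, 2, 2, 2])
    if fused:
        fused_kernel(mem, n)
    else:
        kernel_a(mem, n)
        kernel_b(mem, n)
    return mem.arrays["Result"], mem.accesses


unfused_result, unfused_traffic = run(fused=False)
fused_result, fused_traffic = run(fused=True)
print(unfused_traffic, fused_traffic)
```

In this toy model the unfused version performs 6 global accesses per element (3 loads, a Temp store, a Temp load, a Result store), while the fused version performs 4, because Temp is never written out.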

Major Benefits

Reduce data footprint
Reduction in accesses to global memory
Access to common data across kernels improves temporal locality
Reduction in PCIe transfers
Expand optimization scope of the compiler
Data re-use
Increase textual scope of optimizers

Red Fox Compilation Flow

[Figure: Datalog queries -> Language Front-End (LogicBlox Front-End) -> query plan -> Translation Layer (RA-to-PTX: nvcc + RA-Lib, with Kernel Weaver) -> PTX/binary kernel -> Back-End (Runtime)]

Kernel Weaver: CUDA source-to-source transformation to apply kernel fusion
PTX: Parallel Thread Execution
RA Primitives Library

RA Implementation - Multi-Stage Algorithms

[Figure: Example of SELECT]

All primitives have the same three stages*
Each stage normally maps to 1 CUDA kernel

* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP, 2013.
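A three-stage SELECT can be sketched in Python as follows. The stage names Partition, Compute, and Gather appear as labels elsewhere in this deck; their order here, the chunking scheme, and all function names are assumptions made for illustration, with each function standing in for one CUDA kernel.

```python
def partition(data, num_ctas):
    """Stage 1: split the input into contiguous chunks, one per CTA."""
    chunk = max(1, (len(data) + num_ctas - 1) // num_ctas)
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]


def compute(chunk, predicate):
    """Stage 2: each CTA filters its own chunk into a private buffer."""
    return [item for item in chunk if predicate(item)]


def gather(buffers):
    """Stage 3: pack the per-CTA buffers into one dense output array."""
    out = []
    for buf in buffers:
        out.extend(buf)
    return out


def staged_select(data, predicate, num_ctas=4):
    chunks = partition(data, num_ctas)
    buffers = [compute(chunk, predicate) for chunk in chunks]
    return gather(buffers)


multiples_of_three = staged_select(list(range(10)), lambda x: x % 3 == 0)
print(multiples_of_three)
```

The gather stage exists because each CTA keeps a different, data-dependent number of elements, so the per-CTA buffers have to be compacted before the next operator can read a dense array.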

Kernel Fusion

Three Steps

1. Opportunity: Find candidates meeting fusion criteria.
2. Feasibility: Choose kernels to fuse according to available resources.
3. Fusion: Fuse the chosen kernels.

Kernel Fusion Criteria (1)

Compatible kernel configurations (CTA & thread dimensions)
Implementations of RA primitives are parametric
Empirically choose configurations after fusion

[Figure: Kernel A (M1, N1, T1) and Kernel B (M2, N2, T2) become Fused Kernel A & B (M, N, T)]

Kernel Fusion Criteria (2)

Dependence Restriction
    Thread dependence

[Figure: Kernel A and Kernel B; input data have 2 attributes]

Operations of each thread are independent
Use registers to communicate

Kernel Fusion Criteria (2)

Dependence Restriction
    Thread dependence
    CTA (Thread Block) dependence

[Figure: Kernel A and Kernel B]

Kernel Fusion Criteria - CTA Dependence

Threads in the same CTA have dependence
No dependence between CTAs
Can be fused
After fusion
    Use shared memory to communicate
    Synchronization is needed

[Figure: Example of 2 back-to-back JOINs]

Kernel Fusion Criteria (2)

Dependence Restriction
    Thread dependence (can be fused)
    CTA (Thread Block) dependence (can be fused)
    Kernel dependence

[Figure: Kernel A and Kernel B]

Kernel Fusion Criteria - Candidates for Fusion

Only exhibit thread or CTA dependence
Bounded by operators with kernel dependence
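The candidate rule can be sketched as a scan over a query plan that cuts the plan at every kernel-dependent operator. This is a simplified illustration for a linear plan. The `DEPENDENCE` table is an assumption: the deck names SELECT as a thread-dependent example and JOIN as a CTA-dependent example, while the entries for PROJECT and SORT are guesses added to make the example complete.

```python
# Hypothetical dependence classes per operator (see the caveat above).
DEPENDENCE = {
    "SELECT": "thread",
    "PROJECT": "thread",
    "JOIN": "cta",
    "SORT": "kernel",
}


def fusion_candidates(plan):
    """Split a linear plan into maximal runs of thread- or CTA-dependent operators."""
    groups, current = [], []
    for op in plan:
        if DEPENDENCE[op] == "kernel":
            # A kernel-dependent operator bounds the current candidate group.
            if current:
                groups.append(current)
            current = []
        else:
            current.append(op)
    if current:
        groups.append(current)
    return groups


plan = ["SELECT", "JOIN", "SORT", "PROJECT", "SELECT"]
candidates = fusion_candidates(plan)
print(candidates)
```

The kernel-dependent operator is excluded from every group and acts only as a boundary, matching the "bounded by" wording on the slide.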

Choosing Operators to Fuse

[Figure: Dependence Graph]

1. Topological sort
2. Incrementally add operators
3. Stop when the estimated usage is larger than the budget

Kernel fusion will increase resource usage, e.g., registers
Greedy heuristic to choose
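The greedy heuristic can be sketched with Python's standard `graphlib` module (Python 3.9+). Several details are assumptions made for illustration: resource usage is estimated by summing per-operator register counts, the budget of 63 registers is a hypothetical value, and a new group is started whenever the budget would be exceeded. The per-operator counts reuse the PTX register figures reported for SELECT, JOIN, and PROJECT in this deck's resource table.

```python
from graphlib import TopologicalSorter


def choose_groups(predecessors, usage, budget):
    """Greedily grow fusion groups along a topological order of the operators."""
    order = list(TopologicalSorter(predecessors).static_order())
    groups, current, used = [], [], 0
    for op in order:
        # Close the group when adding this operator would exceed the budget.
        if current and used + usage[op] > budget:
            groups.append(current)
            current, used = [], 0
        current.append(op)
        used += usage[op]
    if current:
        groups.append(current)
    return groups


# Dependence graph as {operator: set of operators it depends on}.
predecessors = {
    "select2": {"select1"},
    "join": {"select2"},
    "project": {"join"},
}
usage = {"select1": 22, "select2": 22, "join": 47, "project": 11}

groups = choose_groups(predecessors, usage, budget=63)
print(groups)
```

Summing register counts is a crude estimate, and on a graph that is not a simple chain a topological order can place unrelated operators next to each other, so a real implementation would also check that grouped operators are actually connected.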

Kernel Weaving and Fusion

Interweaving and fusing individual stages (CUDA kernels)
Use registers or shared memory to store temporary results



Fusing Thread Dependent Only Operators

[Figure: Example of fusing 2 SELECTs]

Unary operators only
No synchronization required
Register-based communication
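Fusing two SELECTs can be illustrated in Python by comparing a version that materializes the intermediate relation with one that evaluates both predicates in a single pass. This is a sketch under simple assumptions: each loop iteration stands in for one GPU thread, a local variable stands in for a register, and the function names are invented for the example.

```python
def select_kernel(data, predicate):
    """One unfused SELECT: reads its input and materializes its output."""
    return [x for x in data if predicate(x)]


def unfused(data, p1, p2):
    temp = select_kernel(data, p1)  # intermediate relation is materialized
    return select_kernel(temp, p2)


def fused(data, p1, p2):
    out = []
    for x in data:  # one loop iteration per "thread"
        keep = p1(x)  # first SELECT's outcome stays in a local variable
        if keep and p2(x):
            out.append(x)
    return out


def is_even(x):
    return x % 2 == 0


def above_ten(x):
    return x > 10


data = list(range(20))
unfused_out = unfused(data, is_even, above_ten)
fused_out = fused(data, is_even, above_ten)
print(fused_out)
```

Because each element is handled independently, the fused loop needs no coordination between iterations, which mirrors the "no synchronization required" point on the slide.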

Fusing CTA and Thread Dependent Operators

[Figure: Example pattern; stage labels Gather, Partition, Compute]

Partition multiple inputs
Synchronization necessary
Communication via shared memory
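The pattern of partitioning multiple inputs and communicating through a CTA-local buffer can be sketched in Python for a JOIN followed by a SELECT. Everything here is an illustrative assumption: both inputs are key-sorted and split at the same key boundaries so that each simulated CTA sees matching key ranges, a Python list stands in for shared memory, and the function names are invented.

```python
from bisect import bisect_left


def partition_by_key(relation, bounds):
    """Split a key-sorted relation at the given key boundaries."""
    keys = [key for key, _ in relation]
    cuts = [0] + [bisect_left(keys, b) for b in bounds] + [len(relation)]
    return [relation[lo:hi] for lo, hi in zip(cuts, cuts[1:])]


def cta_join_then_select(part_a, part_b, predicate):
    # Phase 1: JOIN into a CTA-local buffer (stands in for shared memory).
    shared = [(ka, (va, vb)) for ka, va in part_a for kb, vb in part_b if ka == kb]
    # A real CUDA kernel would need a barrier such as __syncthreads() here,
    # so that every thread sees the complete buffer before phase 2.
    # Phase 2: SELECT consumes the buffer without going through global memory.
    return [row for row in shared if predicate(row)]


def fused_join_select(a, b, bounds, predicate):
    out = []
    parts = zip(partition_by_key(a, bounds), partition_by_key(b, bounds))
    for part_a, part_b in parts:  # one iteration per simulated CTA
        out.extend(cta_join_then_select(part_a, part_b, predicate))
    return out


A = [(3, "a"), (3, "b"), (4, "a")]
B = [(3, "c"), (4, "d"), (5, "e")]
fused_rows = fused_join_select(A, B, bounds=[4], predicate=lambda row: row[1][0] == "a")
print(fused_rows)
```

Partitioning both inputs at the same key boundaries is what keeps the simulated CTAs independent of each other: every matching pair of rows lands in the same partition.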

Experimental Environment

    CPU      2 quad-core Xeon E5520 @ 2.27GHz
    Memory   48 GB
    GPU      1 Tesla C2070 (6GB GDDR5 memory)
    OS       Ubuntu 10.04 Server
    GCC      4.4.3
    NVCC     4.0

Use micro-benchmarks derived from TPC-H
    Measure memory allocation, memory access demand, effect of optimization scope, and PCIe traffic
Full queries from TPC-H

TPC-H Benchmark Suite

A popular decision-support benchmark suite
Micro-benchmarks are common patterns from TPC-H
Baseline: directly using primitive implementations without fusion
Optimized: fusing all primitives of each pattern



Small Inputs - PCIe excluded

[Figure: Speedup, Fused vs. Not Fused, per micro-benchmark pattern]

    Pattern   Speedup
    (a)       7.89
    (b)       1.42
    (c)       1.58
    (d)       1.11
    (e)       2.45

Average 2.89x speedup
Small inputs (64MB-1GB) fitting the GPU memory

Small Inputs - Analysis

[Figure: Memory Allocation. Size (GB) for patterns (a)-(e), Not Fused vs. Fused. Data labels in extraction order: 4.43, 4.64, 4.26, 3.76, 4.70, 1.61, 3.13, 0.76, 4.70, 2.68]

[Figure: Memory Access Reduction for patterns (a)-(e): 75.29%, 29.46%, 56.12%, 13.43%, 67.69%]

[Figure: Compiler Optimization (Speedup of O3) for patterns (a)-(e), Not Fused vs. Fused. Data labels in extraction order: 1.08, 2.45, 2.31, 1.25, 1.00, 1.78, 2.80, 2.90, 1.37, 1.09]

Large Inputs - PCIe included

[Figure: Normalized time for patterns (a)-(e), Not Fused vs. Fused, split into PCI and Compute]

Average 2.22x speedup overall and 2.35x speedup in PCIe
Large inputs (1GB-1.6GB) fitting the GPU memory

Resource Usage & Occupancy

Individual primitive

    Primitive   PTX Reg #   Shared MEM (Byte)   Occupancy (%)
    PROJECT     11          0                   100
    SELECT      22          3848                88
    JOIN        47          13580               38
    +/-         10          0                   100
    Multiply    13          0                   100

After kernel fusion

    Pattern     PTX Reg #   Shared MEM (Byte)   Occupancy (%)
    (a)         22          2308                88
    (b)         55          23560               33
    (c)         62          23048               17
    (d)         30          4612                67
    (e)         27          0                   75

Kernel fusion may increase resource usage and thus decrease occupancy
These two do not negate the other benefits

Real Queries (scale factor = 1)

TPC-H Q1: 1.25x speedup
TPC-H Q21: 1.22x speedup

Extensions

Different Domains
    Require multi-stage algorithm
    Dependence classification still applies
Different Representation
    PTX, OpenCL, LLVM
Different Platform
    CPU, GPU/CPU hybrid

Conclusions

Kernel fusion can reduce data transfer and speed up the computation for data warehousing applications.

Definition of basic dependences and general criteria for kernel fusion, applicable across multiple application domains.

Quantification of the impact of kernel fusion on different levels of the CPU-GPU memory hierarchy for a range of RA operators.

Proposes and demonstrates the utility of compile-time data movement optimizations based on kernel fusion.

Thank You

Questions?