Optimization of massively parallel workload using GPGPU

Dec 2, 2013


CS530 Group 7

Giyoung Nam, Shinjo Park, Barosl Lee

Introduction to GPGPU

Old graphics cards had fixed-function shaders

Shaders are GPU programs that produce an image from scene attributes

Starting from NVIDIA's GeForce 8000 and ATI's R600, programmable shaders were introduced

GPGPU starts from here: can we use shaders for computation rather than graphics?

The operating system itself doesn't use much GPU resource

Different design and programming model from an ordinary CPU

Efficient usage of GPGPU is another challenge

Why GPGPU?

Cost Effective


Cheaper than dedicated cards
based on ASIC/FPGA

New Paradigm of Parallelism


Array of many simple cores


Did not exist before


Problems

[Diagram: dual-socket system — two Xeon X5690 CPUs, each with triple-channel DDR3 memory and an Intel 5520 IOH; each IOH connects an NVIDIA GTX580 GPU and PCIe devices (disk, NIC, etc.)]

1) Data acquisition

2) GPU offloading

3) GPU usage optimization

Big Picture

[Diagram: Efficient usage of GPU — Fast Data Acquisition (disk I/O, network I/O, hash text generation) → GPU Offloading (threshold-based offloading, dynamic offloading) → GPU Usage Optimization (raw computing power, code-level optimization, fast response time)]

Contents


Fast data acquisition


GPU offloading strategies


Threshold-based offloading (single, multiple)


Dynamic offloading


GPU usage optimization


Raw GPU computing power


Code level optimization


Fast response time


Conclusion

Fast Data Acquisition


Modeling, video encoding: most data resides on disk

Traditional disk I/O acceleration

Direct I/O, asynchronous I/O, I/O thread separation

Hash calculation: data is dynamically generated

No data resides on an external storage device

The GPU can generate plaintexts when the pattern is simple

Complex patterns: maskprocessor from HashCat

Fast Data Acquisition


Network systems: high-speed packet acquisition engine

The plain pcap library does not scale well at high speed

Read a bunch of packets, bypassing the OS networking stack

Examples of high-speed packet I/O engines

mmap()-enhanced pcap

PF_RING, PacketShader I/O Engine

Intel DPDK (Data Plane Development Kit)

GPU Offloading


Offloading based on multiple factors

For small workloads the CPU is faster due to copy latency

The GPU uses more power than the CPU

Memory bandwidth is often constrained

The GPU is not part of the CPU

Connected via the PCIe bus

PCIe talks to the CPU/memory directly or via the chipset

Memory access must cross the PCIe bus even when using DMA

Threshold Based Offloading


Offload everything, or no consideration

Papers focusing on raw GPU performance

Workload without fluctuation, constantly generated

Offloading based on a single threshold

Papers focusing on system building

Workload with fluctuation

Mostly based on the "job queue" concept: when the queue fills beyond a certain limit, jobs are transferred to the GPU
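The single-threshold job-queue idea above can be sketched in a few lines. This is a minimal illustration, not code from any of the surveyed systems; `GPU_THRESHOLD`, `process_cpu`, and `process_gpu` are assumed names standing in for a real kernel launch and a CPU fallback path.

```python
from collections import deque

# Illustrative single-threshold offloader: offload the whole queue
# as one GPU batch once it grows past the threshold, otherwise
# process jobs one at a time on the CPU.
GPU_THRESHOLD = 64

def dispatch(queue, process_cpu, process_gpu):
    """Drain the job queue according to the single-threshold policy."""
    if len(queue) >= GPU_THRESHOLD:
        batch = [queue.popleft() for _ in range(len(queue))]
        return process_gpu(batch)          # one large batched "kernel launch"
    results = []
    while queue:
        results.append(process_cpu(queue.popleft()))
    return results
```

Batching is the point: the GPU only pays off when the per-batch copy latency is amortized over many jobs, which is exactly why small queues stay on the CPU.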

Threshold Based Offloading


Offloading based on multiple thresholds

Kargus has asymmetric offloading thresholds for the CPU-to-GPU and GPU-to-CPU transitions

SSLShader has a maximal offloading limit due to GPU throughput

Thresholds are determined statically or dynamically

[Diagram: internal packet queue of each engine — queue lengths l, m, n against GPU thresholds α, β, γ governing CPU/GPU transitions]
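A sketch of why the two thresholds are asymmetric: with a single threshold, a queue length hovering near it would make the system flap between processors. The class and both threshold values below are illustrative assumptions in the spirit of Kargus, not its actual implementation.

```python
# Hypothetical asymmetric (hysteresis) offload switch: switch to the
# GPU high, switch back to the CPU only once the queue has drained low.
CPU_TO_GPU = 128   # illustrative upper threshold
GPU_TO_CPU = 32    # illustrative lower threshold

class OffloadSwitch:
    def __init__(self):
        self.use_gpu = False

    def target(self, queue_len):
        """Return which processor should handle the current queue."""
        if not self.use_gpu and queue_len >= CPU_TO_GPU:
            self.use_gpu = True
        elif self.use_gpu and queue_len <= GPU_TO_CPU:
            self.use_gpu = False
        return "GPU" if self.use_gpu else "CPU"
```

Between the two thresholds the switch keeps its previous decision, so brief fluctuations in queue length do not cause repeated costly transitions.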

Dynamic Offloading


Offloading based on time and resource usage

Cycle distribution

Calculate CPU and GPU time after a given cycle has passed

Decide the processor for the next cycle based on the calculated time

Best-time distribution

Calculate the first CPU and GPU times and store them as the best times

Update the best time every cycle

If the processing time of a cycle is less than the best time of the other processor, switch to the other processor

Resource distribution

Check both time and utilization of each processor
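The best-time distribution strategy above can be sketched as follows. All names (`DynamicOffloader`, `run_cycle`) are illustrative, and the CPU/GPU functions are stand-ins for real execution paths; this only demonstrates the switching rule, not a real scheduler.

```python
import time

# Sketch of best-time distribution: track the best observed cycle time
# per processor, and switch when the current cycle beats the other
# processor's best (or the other processor is still untried).
class DynamicOffloader:
    def __init__(self):
        self.best = {"cpu": None, "gpu": None}
        self.current = "cpu"

    def run_cycle(self, workload, cpu_fn, gpu_fn):
        other = "gpu" if self.current == "cpu" else "cpu"
        fn = cpu_fn if self.current == "cpu" else gpu_fn
        start = time.perf_counter()
        result = fn(workload)
        elapsed = time.perf_counter() - start
        # update the best time for the processor we just used
        if self.best[self.current] is None or elapsed < self.best[self.current]:
            self.best[self.current] = elapsed
        # switch if this cycle was faster than the other's best time
        if self.best[other] is None or elapsed < self.best[other]:
            self.current = other
        return result
```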

GPU Usage Optimization


GPU-aware programming model

Large vectors, vector-oriented operations, and no branching

Limited fast GPU memory and slow system memory

Can my workload benefit from GPU offloading?

Data parallelism: each data item is independent

Code parallelism: each piece of code can be executed independently

Repetition of the same workload en masse

We will cover strategies by large category

How to Optimize?


The GPU papers we surveyed cover a "total system" rather than individual components

Each GPGPU system is an organic compound of problems

Papers put different emphasis on individual problems

Our categorization

Code and data model for raw computing power

GPU code-level optimization

Fast data transmission between CPU and GPU

Raw Computing Power


Hash calculation

Utilize multiple GPU cores for simultaneous jobs

The GPU can perform more simultaneous integer operations

Most hash functions consist of multiplication and bitwise operations

Data-level parallelism: hashes of independent strings

Bitcoin mining

Alternative currency based on a peer-to-peer network

A certain number of leading zeroes on the SHA-256 hash of a nonce

If the number of zeroes matches the given value, the block is mined
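The mining test can be sketched with the standard library. As a simplifying assumption, difficulty here counts leading zero hex digits; real Bitcoin compares the hash against a 256-bit target, and the header and nonce encoding below are illustrative.

```python
import hashlib

# Simplified proof-of-work check: does SHA-256 of (header + nonce)
# start with the required number of zero hex digits?
def is_mined(block_header: bytes, nonce: int, difficulty: int) -> bool:
    digest = hashlib.sha256(block_header + nonce.to_bytes(8, "little")).hexdigest()
    return digest.startswith("0" * difficulty)

def mine(block_header: bytes, difficulty: int) -> int:
    """Brute-force nonces until one passes. Each nonce is independent,
    which is exactly the data-level parallelism a GPU exploits."""
    nonce = 0
    while not is_mined(block_header, nonce, difficulty):
        nonce += 1
    return nonce
```

On a GPU, each core would test a different nonce range in parallel; the serial loop here only shows the per-nonce work.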


Raw Computing Power


Cryptography

Combination of hash functions, PKI operations, and AES

Like hash functions, cryptographic functions are largely integer calculation

Data-level parallelism: independent plain/encrypted texts

Pattern matching

An IDS works by pattern matching against incoming packets

Multiple pattern matching and PCRE are used in IDSes

The problem is solved as DFA matching: no branching

Data-level parallelism: independent network flows
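The "DFA matching: no branching" point can be made concrete with a small table-driven matcher: every input byte becomes one table lookup, with no data-dependent branching in the inner loop. This is a generic KMP-style single-pattern DFA for illustration, not the multi-pattern automata real IDSes use.

```python
# Table-driven DFA matcher: state = number of pattern bytes matched.
def build_dfa(pattern: bytes):
    m = len(pattern)
    table = [[0] * 256 for _ in range(m + 1)]  # row m = accepting state
    table[0][pattern[0]] = 1
    restart = 0  # state the automaton falls back to on a mismatch
    for state in range(1, m):
        for byte in range(256):
            table[state][byte] = table[restart][byte]
        table[state][pattern[state]] = state + 1
        restart = table[restart][pattern[state]]
    return table, m

def matches(table, accept, data: bytes) -> bool:
    state = 0
    for byte in data:
        state = table[state][byte]   # one lookup per byte, no branch
        if state == accept:
            return True
    return False
```

Because every GPU thread runs the same lookup loop over its own flow, the warp never diverges, which is why DFA matching maps well onto GPUs.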

Code Level Optimization


Naïve GPU coding can't reach maximum performance

SSLShader's performance analysis for optimization

GPU needs specific optimization

Simple cores without branching and reordering

Operations based on large vectors

Other generic optimization techniques

Code Level Optimization


Optimization due to architectural differences

Minimize divergence caused by branching → replace branches with other arithmetic operations

In-order execution: giving more chances to the compiler is beneficial, e.g. loop unrolling

Vectors instead of multiple integers can reduce the number of operations

Scheduling consideration: NVIDIA CUDA uses the warp (a group of 32 threads) as its scheduling unit; fully utilizing each warp can improve performance
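The branch-to-arithmetic replacement above looks like this in miniature (written in plain Python for clarity; on the GPU the same idea is applied to kernel code). Both function names are illustrative.

```python
# A data-dependent branch diverges a warp: threads taking different
# sides serialize. The branchless version computes the same selection
# with pure bitwise arithmetic, so all threads run identical code.
def select_branchy(cond: int, a: int, b: int) -> int:
    return a if cond else b            # diverges on the GPU

def select_branchless(cond: int, a: int, b: int) -> int:
    mask = -int(bool(cond))            # truthy -> all-ones mask, else 0
    return (a & mask) | (b & ~mask)    # same result, no branch
```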

Code Level Optimization


Pipelined approach

Do not leave the CPU idle while the GPU is working

Separate the CPU's and GPU's work and execute them alternately

Simultaneous copy and execute

Related to the pipelined approach

Independent threads can talk to the GPU independently

Possible with the GPU's dedicated DMA engine

Example: MIDeA's implementation
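The pipelined approach can be sketched with a producer/consumer pair: while the "GPU" worker processes batch i, the CPU thread is already preparing batch i+1. `gpu_process` is a stand-in for a kernel launch plus copies; in real CUDA code this overlap is done with streams and the DMA engine, not Python threads.

```python
import queue
import threading

# Double-buffered CPU/GPU pipeline sketch: a bounded queue lets the
# CPU stay at most one batch ahead instead of idling.
def pipeline(batches, gpu_process):
    results = []
    q = queue.Queue(maxsize=2)

    def gpu_worker():
        while True:
            batch = q.get()
            if batch is None:          # sentinel: no more batches
                break
            results.append(gpu_process(batch))

    t = threading.Thread(target=gpu_worker)
    t.start()
    for batch in batches:              # CPU-side prep overlaps GPU work
        q.put(batch)
    q.put(None)
    t.join()
    return results
```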


Code Level Optimization


Reducing memory footprint

Store only needed values, use small data structures

Can reduce the memory bandwidth used between CPU and GPU because less data is transferred → overall performance increase

Example: MIDeA stores the state transition table in the GPU's memory. Storing only the possible states reduced the memory footprint.


Fast Response Time


Data transfer between GPU and CPU is the bottleneck

Kargus' performance breakdown for packet processing

A good offloading strategy is one of the solutions

Hardware improvements

Faster interconnect bus, closer interconnection

Conclusion


Efficient usage of the GPU depends on external factors

A totally different computational element

Architectural optimization; parallelism in code and data

A good offloading strategy is required for fluctuating jobs

Not every job can benefit from GPU processing

GPU-based systems also cover fast data acquisition

Rapid advancement in GPUs is reflected in the papers

Each paper tries to use the "state of the art" at its time of writing

Questions?