Application of GPGPU on High-speed Network Environment

3 Dec 2013

CS530 Group 7

Giyoung Nam, Shinjo Park, Barosl Lee

Contents

- Introduction to GPGPU and high-speed IDS
- Overview of GPGPU-based IDS systems
- Various challenges in using GPGPU
  - Packet acquisition
  - GPU offloading
  - GPU usage optimization
- Conclusion

Once Upon a Time

Time Passes

More Time Passes

http://www.bgamester.com/2012/10/euro-truck-simulator-2.html

History of GPGPU Development

- Shaders are programs on the graphics card that produce images from scene attributes
  - Vertex shader (position, texture), pixel shader (color, z-depth, alpha), geometry shader
- Shaders evolved from fixed-function units to unified shaders
  - NVIDIA GeForce 8000 series, AMD (née ATI) R600 series
  - A unified shader can run arbitrary programs written in a shader language
- GPGPU started from the use of shaders for general-purpose computing
  - NVIDIA CUDA, OpenCL, Microsoft DirectCompute, …

Intel Xeon E7 Series

- Codenamed Westmere-EX, Intel's lineup of high-end multi-socket CPUs
- Contains 10 cores connected by a ring bus

http://techreport.com/news/20725/intel-unveils-10-core-32-nm-xeon-processors

NVIDIA GeForce GTX 680

http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/2

- Each green square represents a CUDA core, which is in fact a unified shader
- The number of unified shaders is one factor in GPU performance
- The GTX 680 has 1536 unified shaders

GPGPU vs. CPU

GPGPU
- Large number of simple cores
- The programming model must exploit massive parallelism

CPU
- Small number of versatile cores

GPGPU Applications

- Signal processing
- Video transcoding and processing
- Cryptography: HashCat, Bitcoin mining, etc.
- Scientific modeling
- Network intrusion detection
  - Especially high-speed IDS systems

Workload Properties

- Modeling, calculation, video processing
  - Large input data from disk
- Hash cracking, cryptanalysis
  - Most data resides in memory, on the CPU and GPU
- Networking
  - Bandwidth constraint: large amount of incoming data
  - Real-time constraint: packets are dropped when processing takes too long

Towards High-speed NIDS

- Cyber attacks are growing significantly
- Host monitoring has limitations, so the focus shifts to middleboxes
  - Centralized monitoring operated by the network admin
  - Firewalls, intrusion detection/prevention systems (IDS/IPS), etc.

Network Intrusion Detection System

- Monitors network activity for malicious actions
  - Pattern matching
- Reports to a management station
- Stores records for analyzing future attacks

Middlebox Implementations

- Hardware-based implementation
  - Dedicated hardware for high performance: ASICs, FPGAs, etc.
  - Expensive, limited flexibility: tied to what the chip maker provides
- Software-based implementation
  - Commodity general-purpose hardware
  - Higher flexibility
  - Lower performance
  - Often not optimized for the powerful latest commodity hardware:
    - Multicore CPUs
    - Manycore general-purpose GPUs
    - 10G network cards supporting multiple hardware queues

Motivation

- High-speed software-based network IDS
  - High flexibility, scalability & cost efficiency
  - Performance optimization
- Gnort 1)
  - Integration of the software-based IDS Snort with the GPU
- MIDeA 2)
  - A new software IDS architecture utilizing parallelism
- Kargus 3)
  - Another approach to exploiting parallelism


1) G. Vasiliadis, S. Antonatos, M. Polychronakis, E. Markatos and S. Ioannidis, “Gnort: High performance network intrusion detection using graphics processors”, RAID 2008
2) G. Vasiliadis, M. Polychronakis and S. Ioannidis, “MIDeA: a multi-parallel intrusion detection architecture”, CCS 2011
3) M. Jamshed, J. Lee, S. Moon, I. Yun, D. Kim, S. Lee, Y. Yi, and K. Park, “Kargus: A Highly-scalable Software-based Intrusion Detection System”, CCS 2012

Background: Snort

- One of the most popular software IDSes
  - Free software; free and commercial versions are available
  - All three papers are based on Snort or use parts of it
- Pattern-based matching on network packets
  - Various rulesets are available for multiple types of attacks
- Pattern matching is offloaded to the GPU in all three works

Architecture Overview

(Diagram: two NUMA nodes connected by QPI. Each node has a Xeon X5690 CPU with an integrated memory controller (IMC) driving three DDR3 channels, linked via QPI to an Intel 5520 IOH that hosts an NVIDIA GTX580 GPU and an Intel X520 NIC.)

Gnort

- One of the earliest works on GPU-based IDS
  - GeForce 8000 era, when GPGPU had just become feasible
  - The earlier PixelSnort used a GeForce 6800GT; programming was hard due to fixed-function shaders, and it showed no speedup
- Following classic Snort, only detection was offloaded

MIDeA

- Advanced architecture for 10Gbps environments
- Utilizes more elements of the modern computer architecture

Kargus

- Implementation for 10+Gbps environments
- Different optimizations from MIDeA

Packet Acquisition


High-speed Packet Acquisition

- The plain pcap library does not scale well at high speed
- High-speed packet acquisition engines:
  - Do not create an sk_buff or trigger an interrupt for every packet
  - Access packet data directly in chunks of packets
  - NUMA-, multithread- and multiqueue-aware
- Each paper uses a different high-speed packet acquisition engine
  - Gnort: plain pcap + mmap()-enhanced pcap
  - MIDeA: PF_RING
  - Kargus: PacketShader I/O Engine
- Packets are reassembled into flows and sent to the GPU
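The reassembly step above can be illustrated with a toy flow reassembler. This is a sketch, not any paper's implementation: packets are plain dicts with hypothetical `src`/`sport`/`dst`/`dport`/`seq`/`payload` fields, and overlap, retransmission and TCP state handling are all ignored.

```python
from collections import defaultdict

def reassemble(packets):
    """Group packets into flows by their connection tuple and concatenate
    payloads in sequence order, so that a pattern split across packet
    boundaries becomes matchable in the assembled flow data."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])
        flows[key].append(pkt)
    # Sort each flow by sequence number and join the payloads.
    return {key: b"".join(p["payload"] for p in sorted(ps, key=lambda p: p["seq"]))
            for key, ps in flows.items()}
```

For example, a signature `evil` split as `ev` + `il` across two packets only matches after reassembly.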

Receive Side Scaling

- Goal: evenly distribute the packet processing load across multiple cores
- Hardware support is required: modern server NICs have multiple RX queues
  - Each queue has a separate interrupt and processor affinity
- RSS maps the packets of a single flow to the same RX queue
  - Queue selection is based on a Toeplitz hash function calculated inside the NIC
- MIDeA uses additional hashing after RSS to map both directions of a flow onto the same core
- Kargus uses symmetric RSS (a different hash key) to achieve the same
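The queue selection can be sketched with a reference Toeplitz hash. The bit-by-bit loop below mirrors what the NIC computes in hardware over the IPv4 addresses and ports; the repeated 0x6d5a key is the published symmetric-RSS trick the slides refer to, which makes the hash invariant when source and destination are swapped so that both directions of a connection land in the same RX queue. The `rx_queue` helper and its modulo-based queue selection are illustrative simplifications.

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Reference 32-bit Toeplitz hash as used by RSS-capable NICs:
    for every set bit of the input, XOR in the 32-bit window of the
    key starting at that bit position."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result, bit_index = 0, 0
    for byte in data:
        for i in range(7, -1, -1):
            if byte & (1 << i):
                window = (key_int >> (key_bits - 32 - bit_index)) & 0xFFFFFFFF
                result ^= window
            bit_index += 1
    return result

# Repeating 0x6d5a yields a symmetric key: swapping (src, sport) with
# (dst, dport) shifts every field by a multiple of 16 bits, which the
# 16-bit-periodic key cannot distinguish.
SYM_KEY = bytes.fromhex("6d5a" * 20)

def rx_queue(src_ip, dst_ip, sport, dport, num_queues=16):
    data = (bytes(src_ip) + bytes(dst_ip)
            + sport.to_bytes(2, "big") + dport.to_bytes(2, "big"))
    # Real NICs index an indirection table with the low hash bits.
    return toeplitz_hash(SYM_KEY, data) % num_queues
```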

Packet I/O Benchmark

- Results from Kargus: PCAP, PF_RING and PSIO

(benchmark chart omitted)

- PCAP shows the lowest performance
- PF_RING DNA uses polling, causing high CPU usage

GPU Offloading


Pattern Matching

- Multi-pattern string matching and PCRE matching are used
  - Single pattern matching: matches one pattern in a given string. Examples: Boyer-Moore, KMP
  - Multi-pattern matching: simultaneously matches occurrences of multiple pattern strings. Example: Aho-Corasick
  - Both Aho-Corasick and PCRE can be evaluated as a DFA
- Gnort and MIDeA cover Aho-Corasick acceleration
- Kargus additionally covers PCRE acceleration
- DFA matching fits the GPU's structure
  - Iterative work with no branching
  - Incoming packets are independent and processed separately


Offloading Strategies

- Memory copy between CPU and GPU is the bottleneck
  - Using its DMA engine, the GPU can copy from the CPU's memory space
  - PCIe payload copies are done in multi-byte strides
  - Page-locked memory avoids swapping out low-latency data
- Gnort: simply offloads everything to the GPU
  - Packets are grouped by port group, i.e. each rule's port number
  - Once packets are grouped, they are copied to the GPU
  - After a 100ms timeout, packets are copied to the GPU regardless of grouping

MIDeA: Simple Offloading

- Packets are grouped by flow, and higher-level protocol contents are normalized
- Shared buffer per process, no port grouping
- Each packet carries a tag identifying its processing engine type

PCIe Performance

- PCIe transfer rate for different buffer sizes

(transfer-rate chart omitted)

- Use large buffers to attain maximum performance


Offloading Threshold

- Processing small packets is faster on the CPU
  - Possible cause: the latency of copying packets to the GPU
- Simple threshold-based offloading
  - Packets larger than a specified size are offloaded to the GPU

Kargus: Dynamic Offloading

- Pattern matching itself is always faster on the GPU
- Surrounding latencies: preparation and cleanup

Dynamic offloading

- The CPU and GPU have different properties
  - The CPU has lower latency; the GPU has higher throughput
  - The GPU's power consumption is higher than the CPU's
- Workload queue: contains batches of packets
- Offloading is based on the processing rate
  - The CPU can handle lower input rates
  - Different thresholds are used for increasing and decreasing the processing rate
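The rate decision with separate up/down thresholds can be sketched as simple hysteresis. The queue-length trigger and watermark values below are illustrative assumptions, not Kargus's actual parameters (the slides denote the thresholds α, β, γ):

```python
class DynamicOffloader:
    """Hysteresis sketch: start offloading to the GPU when the workload
    queue grows past a high watermark, and fall back to the CPU only
    after it drains below a lower one, so the engine does not oscillate
    when the input rate hovers around a single threshold."""
    def __init__(self, gpu_on=2048, gpu_off=1024):  # illustrative values
        self.gpu_on, self.gpu_off = gpu_on, gpu_off
        self.use_gpu = False
    def route(self, queue_len):
        if not self.use_gpu and queue_len >= self.gpu_on:
            self.use_gpu = True       # CPU can no longer keep up
        elif self.use_gpu and queue_len <= self.gpu_off:
            self.use_gpu = False      # save power, regain low latency
        return "gpu" if self.use_gpu else "cpu"
```

Between the two watermarks the previous decision sticks, which is the point of using different thresholds for ramping up and down.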

(Diagram: internal packet queue of each CPU and GPU engine; batches l, m, n are routed according to the GPU queue length and thresholds α, β, γ.)

GPU Usage Optimization


MIDeA: Compacting State Table

- Aho-Corasick uses a transition function over the input data
- Reasons for compaction:
  - The possible state transitions are limited
  - GPU memory and the bandwidth between CPU and GPU are limited
- Only non-zero elements are stored, compressing the table
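Storing only the non-zero transitions can be sketched with a CSR-like layout (row offsets plus sorted column/value arrays). This is one plausible encoding, not necessarily MIDeA's exact memory layout:

```python
import bisect

def compact_table(full):
    """Compress a dense state-transition table by keeping only the
    non-zero entries; a miss during lookup means the transition
    falls back to state 0."""
    offsets, cols, vals = [0], [], []
    for row in full:
        for byte, state in enumerate(row):
            if state != 0:
                cols.append(byte)
                vals.append(state)
        offsets.append(len(cols))   # row boundary for each state
    return offsets, cols, vals

def lookup(table, state, byte):
    offsets, cols, vals = table
    lo, hi = offsets[state], offsets[state + 1]
    # Binary search within this state's non-zero transitions.
    i = bisect.bisect_left(cols, byte, lo, hi)
    return vals[i] if i < hi and cols[i] == byte else 0
```

Sparse rows shrink the table drastically at the cost of a short search per lookup.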

AC-Full vs. AC-Compact

- Increased GPU throughput
  - Peak performance of 16.4Gbps vs. 21.1Gbps

(throughput chart omitted)

- Memory usage reduced from 890.46MB to 24.18MB

Pipelined Approach

- Bytes are grouped to match the memory transaction size
  - 32-byte memory transactions
  - 128-bit integer vectors
- The next batch of data is prepared while the GPU is processing, made possible by a separate DMA engine
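The overlap schedule amounts to a double-buffered ("ping-pong") loop: while batch i is being processed, batch i+1 is already being staged. The sketch below only simulates the schedule sequentially; on real hardware the two calls run concurrently (e.g. an asynchronous copy on the second DMA engine overlapping the kernel).

```python
def pipelined_process(batches, copy, kernel):
    """Double-buffered schedule: stage batch i+1 while 'processing'
    batch i. `copy` stands in for the host-to-device transfer and
    `kernel` for the GPU pattern-matching pass."""
    results = []
    staged = copy(batches[0]) if batches else None
    for i in range(len(batches)):
        current = staged
        # Stage the next batch; on real hardware this overlaps the kernel.
        staged = copy(batches[i + 1]) if i + 1 < len(batches) else None
        results.append(kernel(current))
    return results
```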

Kargus: Compacting DFA Table

- Kargus also supports PCRE along with Aho-Corasick
- DFAs with fewer than 5000 states are handled on the GPU
  - Memory bandwidth is the bottleneck; larger patterns suffer state explosion
- Half-sized DFA table entries increased performance
  - 2B table entries doubled performance over 4B entries
- Saturates at 8192 pkts/batch, 39.1Gbps using Aho-Corasick


PCRE vs.
Aho
-
Corasick

Aho
-
Corasick


Saturated at 8192
pkts
/batch


Performance: 39.1Gbps

PCRE


Saturated at 1024
pkts
/batch


Performance: 8.9Gbps

Evaluation

http://imnews.imbc.com/news/2013/politic/article/3264048_11199.html

Gnort

- 1000 small random patterns with random payloads
- Multi-pattern matching showed better performance
- The CPU is faster for small packets
- The highest speed was 2.3Gbps using Aho-Corasick

Gnort: Real Trace Benchmarks

- Based on Snort using only pattern matching
- Used a university trace replayed at a 1Gbps rate
- Packet acquisition: standard pcap and pcap-mmap
- Aho-Corasick showed zero drops up to 500Mbps with standard pcap, 600Mbps with pcap-mmap

MIDeA

- Synthetic traffic
  - 7.67Gb/s for 1500B packets, 1.86Gb/s for 200B packets
  - Drops start at 7.22Gb/s (1500B packets), 1.5Gb/s (200B packets)
- Real traffic
  - 74 minutes, 46GB of trace
  - 5.7Gb/s replay rate (ideal: 7.8Gb/s)
  - No drops up to 5.2Gb/s using the GPU

Kargus: Innocent Traffic Performance

(Chart: throughput in Gbps vs. packet size in bytes, comparing MIDeA, Snort w/ PF_Ring, Kargus CPU-only and Kargus CPU/GPU.)


- 2.7-4.5x faster than Snort
- 1.9-4.3x faster than MIDeA

Malicious Traffic Performance

(Chart: throughput in Gbps vs. packet size in bytes for Kargus and Snort+PF_Ring with 25%, 50% and 100% malicious traffic.)

- 5x faster than Snort

Real Network Traffic

- Three 10Gbps LTE backbone traces from a major ISP in Korea
  - Duration of each trace: 30 mins ~ 1 hour
  - TCP/IPv4 traffic:
    - 84 GB of PCAP traces
    - 109.3 million packets
    - 845k TCP sessions
- Total analyzing rate: 25.2Gbps
- Flow reassembly causes the bottleneck
  - Additional memory allocation and packet copies
  - Multi-string pattern matching on the assembled flow data

Power Consumption

- Polling shows higher CPU usage and power consumption
- Load balancing decreased power consumption

Conclusion

- GPUs can accelerate repeated, parallelizable workloads
- To attain maximum performance, an understanding of the components surrounding the GPU is required
- The papers present not only the challenges of using GPUs, but also how GPU hardware has advanced
  - Newer GPU architectures provide flexibility and performance
- IDS is a workload that exercises nearly every part of the system

Questions?