HPCA-10

Architectural Characterization of TCP/IP Processing on the Intel® Pentium® M Processor

Srihari Makineni & Ravi Iyer

Communications Technology Lab

Intel® Corp.

{srihari.makineni & ravishankar.iyer}@intel.com


Outline

- Motivation
- Overview of TCP/IP
- Setup and Configuration
- TCP/IP Performance Characteristics
  - Throughput and CPU Utilization
  - Architectural Characterization
- TCP/IP in server workloads
- Ongoing work


Motivation

- Why TCP/IP?
  - TCP/IP is the protocol of choice for data communications
- What is the problem?
  - So far, system capabilities have allowed TCP/IP to process data at Ethernet speeds
  - But Ethernet speeds are jumping rapidly (1 to 10 Gbps)
  - Scaling to these speeds requires efficient TCP/IP processing
- Why architectural characterization?
  - To analyze performance characteristics and identify the processor architectural features that impact TCP/IP processing


TCP/IP Overview: Transmit

[Figure: transmit data path. Application data crosses the sockets interface from user to kernel space; the TCP/IP stack copies it into kernel buffers, consults the TCB, and adds TCP, IP, and Ethernet headers; the NIC driver posts descriptors (Desc 1, Desc 2) and the NIC DMAs the resulting Ethernet packets (Tx 1-Tx 3) to the network hardware.]
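From user space, the whole transmit path above is driven by a single sockets-interface call; a minimal sketch (the address and port are placeholders, not from the slides):

```python
import socket

# User-space view of the transmit path: one sendall() hands the buffer to
# the kernel, which copies it, segments it, adds TCP/IP/ETH headers, and
# lets the NIC DMA the resulting Ethernet frames onto the wire.
def send_buffer(host, port, payload: bytes):
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)  # may be split across many Ethernet frames

# send_buffer("192.0.2.10", 5001, b"x" * 65536)  # placeholder address/port
```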


TCP/IP Overview: Receive

[Figure: receive data path. The NIC DMAs incoming Ethernet packets (Rx 1-Rx 3) and their descriptors into kernel buffers; the TCP/IP stack strips the ETH, IP, and TCP headers, consulting the TCB; the payload is then copied to the application buffer and the application is signaled through the sockets interface.]
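The receive path ends at the matching sockets-interface call; a minimal user-space sketch of draining a connection:

```python
import socket

# User-space view of the receive path: by the time recv() returns, the NIC
# has DMAed the packets and the stack has stripped the headers; each recv()
# copies the next chunk of payload into the application buffer.
def drain_connection(conn: socket.socket) -> bytes:
    chunks = []
    while True:
        chunk = conn.recv(65536)  # blocks until the stack signals data
        if not chunk:             # empty read => peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)
```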


Setup and Configuration

- Test setup
  - System Under Test (SUT): Intel® Pentium® M processor @ 1600 MHz, 1 MB L2 cache (64 B lines)
  - 2 clients: four-way Itanium® 2 processors @ 1 GHz, 3 MB L3 cache (128 B lines)
- Operating system: Microsoft Windows* 2003 Enterprise Edition
- Network
  - SUT: 4 Gbps total (2 dual-port Gigabit NICs)
  - Clients: 2 Gbps per client (1 dual-port Gigabit NIC)


Setup and Configuration

- Tools
  - NTttcp: Microsoft application to measure TCP/IP performance
  - A tool to extract CPU performance counters
- Settings
  - 16 connections (4 per NIC port)
  - Overlapped I/O
  - Large Segment Offload (LSO)
  - Regular Ethernet frames (1518 bytes)
  - Checksum offload to the NIC
  - Interrupt coalescing
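The 1518-byte regular Ethernet frames above bound how much payload each packet can carry; a quick sketch of the arithmetic (standard IEEE 802.3/IPv4/TCP header sizes assumed, not stated on the slides):

```python
# Payload per regular Ethernet frame, assuming standard header sizes.
FRAME = 1518              # max Ethernet frame on the wire (no VLAN tag)
ETH_HDR, FCS = 14, 4      # Ethernet header and trailing frame checksum
IP_HDR, TCP_HDR = 20, 20  # IPv4 and TCP headers without options

mtu = FRAME - ETH_HDR - FCS   # 1500-byte IP MTU
mss = mtu - IP_HDR - TCP_HDR  # 1460-byte TCP payload per frame (MSS)
print(mtu, mss)  # 1500 1460
```

This 1460-byte figure is the buffer size the throughput and pathlength results later refer to.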



Throughput and CPU Utilization

- Lower Rx performance for > 512-byte buffer sizes
- Rx and Tx (no LSO) CPU utilization is 100%
- The benefit of LSO is significant (~250% for 64 KB buffers)
- Lower throughput for < 1 KB buffers is due to buffer locking

Takeaway: TCP/IP processing @ 1 Gbps with 1460-byte buffers requires more than one CPU


Processing Efficiency (Hz/bit)

- 64-byte buffers: Tx (LSO) 17.13, Rx 13.7
- 64 KB buffers: Tx (LSO) 0.212, Tx (no LSO) 0.53, Rx 1.12

Takeaway: Several cycles are needed to move each bit, especially for Rx
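The Hz/bit figures translate directly into the clock rate a single core would need at a given line rate; a small sketch using the 64 KB-buffer numbers from this slide:

```python
# Required CPU frequency = processing cost (Hz per bit) x line rate (Gbit/s).
# Costs below are the 64 KB-buffer figures from this slide.
hz_per_bit = {"tx_lso": 0.212, "tx_no_lso": 0.53, "rx": 1.12}

def required_ghz(cost_hz_per_bit, gbps):
    """Clock rate (GHz) one core must devote to sustain the line rate."""
    return cost_hz_per_bit * gbps

for path, cost in hz_per_bit.items():
    print(path, required_ghz(cost, 10), "GHz at 10 Gbps")
# Rx alone needs ~11.2 GHz at 10 Gbps -- far beyond the 1.6 GHz SUT,
# which is why frequency scaling alone cannot close the gap.
```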


Architectural Characterization: CPI

- Rx CPI is higher than Tx CPI for > 512-byte buffers
- Tx (LSO) CPI is higher than Tx (no LSO)!
- CPI needs to come down to achieve TCP/IP scaling


Architectural Characterization: Pathlength

- The Rx pathlength increase is significant beyond 1460-byte buffer sizes
- For a 64 KB buffer, the TCP/IP stack has to receive and process 45 packets
- The lower CPI for Tx (no LSO) over Tx (LSO) is due to its higher pathlength
- The high pathlength shows there is room for stack optimizations
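The 45-packet figure follows directly from the 1460-byte payload of regular Ethernet frames; a quick sketch:

```python
import math

MSS = 1460                # TCP payload per regular Ethernet frame
buffer_bytes = 64 * 1024  # the 64 KB application buffer

packets = math.ceil(buffer_bytes / MSS)
print(packets)  # 45 -- each packet takes a full trip through the Rx stack
```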


Architectural Characterization: Last-Level Cache Performance

- Rx has higher misses: the primary reason for its higher CPI
- Many compulsory misses: source buffer, descriptors, and possibly the destination buffer
- Tx (no LSO) has slightly higher misses per bit

Takeaway: Rx performance does not scale with cache size (many compulsory misses)


Architectural Characterization: L1 Data Cache Performance

- The Pentium® M processor has a 32 KB L1 data cache
- As expected, L1 data cache misses are higher for Rx
- For Rx, 68% to 88% of L1 misses hit in the L2

Takeaway: A larger L1 data cache has limited impact on TCP/IP


Architectural Characterization: L1 Instruction Cache Performance

- The Pentium® M processor has a 32 KB L1 instruction cache
- Tx (no LSO) MPI is lower because of its code temporal locality
- The Rx code path generates L1 instruction capacity misses

Takeaway: A larger L1 instruction cache helps Rx processing


Architectural Characterization: TLB Performance

- Size: 128 instruction and 128 data TLB entries
- iTLB misses increase faster than dTLB misses


Architectural Characterization: Branch Behavior

- 19-21% of instructions are branches
- The misprediction rate is higher for Tx than Rx at < 512-byte buffer sizes

Takeaway: > 98% branch prediction accuracy
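Combining the two figures above gives a rough upper bound on mispredictions per instruction; a back-of-the-envelope sketch (the combined number is derived here, not stated on the slides):

```python
branch_fraction = 0.21  # up to 21% of instructions are branches
accuracy = 0.98         # >98% of branches are predicted correctly

# Mispredicted branches per 1000 retired instructions (upper bound).
mispredicts_per_kinst = branch_fraction * (1 - accuracy) * 1000
print(mispredicts_per_kinst)  # ~4.2 mispredictions per 1000 instructions
```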


Architectural Characterization: CPI Contributors and Frequency Scaling

- CPI contributors: Rx is more memory intensive than Tx
- Frequency scaling: poor scaling due to memory latency overhead

Takeaway: Frequency scaling alone will not deliver a 10x gain
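Why frequency scaling falls short: if a fraction of execution time is memory stall time whose latency does not shrink with core frequency, speedup saturates. An Amdahl-style sketch; the 60% stall fraction is an illustrative assumption, not a measurement from the paper:

```python
# Amdahl-style model: only the compute fraction speeds up with frequency;
# memory stall time (fixed nanoseconds per access) does not.
def speedup(freq_ratio, stall_fraction):
    # stall_fraction: assumed share of baseline time spent waiting on memory
    compute = 1.0 - stall_fraction
    return 1.0 / (compute / freq_ratio + stall_fraction)

print(speedup(10, 0.0))            # 10.0 -- ideal case, no memory stalls
print(round(speedup(10, 0.6), 2))  # 1.56 -- 10x the clock buys only ~1.6x
```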


TCP/IP in Server Workloads

- Web server: TCP/IP data path overhead is ~28%
- Back-end (database server with iSCSI): TCP/IP data path overhead is ~35%
- Front-end (e-commerce server): TCP/IP data path overhead is ~29%

Takeaway: TCP/IP processing is significant in commercial server workloads


Conclusions

- Major observations
  - TCP/IP processing @ 1 Gbps with 1460-byte buffers requires more than one CPU
  - CPI needs to come down to achieve TCP/IP scaling
  - High pathlength shows there is room for stack optimizations
  - Rx performance does not scale with cache size (compulsory misses)
  - A larger L1 data cache has limited impact on TCP/IP
  - A larger L1 instruction cache helps Rx processing
  - > 98% branch prediction accuracy
  - Frequency scaling alone will not deliver a 10x gain
  - TCP/IP processing is significant in commercial server workloads
- Key issues
  - Memory stall time overhead
  - Pathlength (OS overhead, etc.)


Ongoing work

- Investigating solutions to the memory latency overhead
  - Copy acceleration: a low-cost synchronous/asynchronous copy engine
  - DCA: incoming data is pushed into the processor's cache instead of memory
  - Lightweight threads to hide memory access latency: switch-on-event threads with small contexts and low switching overhead
  - Smart caching: cache structures and policies for networking
- Partitioning: an optimized TCP/IP stack running on dedicated processor(s) or core(s)
- Other studies
  - Connection processing, bi-directional data
  - Application interference


Q&A