Implementation of TCP/IP in Linux (kernel 2.2)


Rishi Sinha

Goals


To help you implement your customized stack by identifying key points of the code structure

To point out some tricks and optimizations that evolved after 4.3BSD and that are part of the Linux TCP/IP code

TCP/IP source code


/usr/src/linux/net/


All relative pathnames in this document are
relative to /usr/src/linux/


http://lxr.linux.no cross-references all the Linux kernel code

You can install and run it locally; I haven't tried it

The various layers (yawn…)

[Diagram: BSD socket atop INET socket, atop TCP/UDP, atop IP; the Appletalk and IPX stacks sit alongside; the (Link) and (Physical) layers below.]

Address families supported

include/linux/socket.h

UNIX: Unix domain sockets
INET: TCP/IP
AX25: Amateur radio
IPX: Novell IPX
APPLETALK: Appletalk
X25: X.25

More; about 24 in all

Setting things up: socket side

How the INET address family registers itself with the BSD socket layer

struct socket (the BSD socket)

short type: SOCK_DGRAM, SOCK_STREAM
struct proto_ops *ops: TCP/UDP operations for this socket; bind, close, read, write, etc.
struct inode *inode: the file inode associated with this socket
struct sock *sk: the INET socket associated with this socket
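A minimal sketch of these fields, simplified from include/linux/net.h (the real struct has more members):

    struct socket {
        short             type;    /* SOCK_STREAM, SOCK_DGRAM, ...              */
        struct proto_ops *ops;     /* bind, connect, sendmsg, ... for this type */
        struct inode     *inode;   /* file inode backing this socket            */
        struct sock      *sk;      /* the INET socket underneath                */
        /* ... state, flags, wait queue, ... */
    };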

[Diagram: a BSD socket with open questions: which INET socket? which operations to use? how to create a socket? No connections between the layers yet.]

struct sock (the INET socket)

struct socket *socket: the associated BSD socket
struct sock *next, **pprev: socks are kept on linked lists
struct dst_entry *dst_cache: pointer to the route cache entry used by this socket
struct sk_buff_head receive_queue: head of the receive queue
struct sk_buff_head write_queue: head of the send queue

struct sock continued

__u32 daddr: foreign IP address
__u32 rcv_saddr: bound local IP address
__u16 dport: destination port
unsigned short num: local port
struct proto *prot: TCP/UDP-specific operations (overlaps with struct socket's ops field)
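Pulling the two slides together, a hedged sketch of struct sock, much abbreviated from include/net/sock.h:

    struct sock {
        struct socket       *socket;        /* back-pointer to the BSD socket */
        struct sock         *next, **pprev; /* list/hash linkage              */
        struct dst_entry    *dst_cache;     /* cached route for this socket   */
        struct sk_buff_head  receive_queue; /* packets waiting to be read     */
        struct sk_buff_head  write_queue;   /* packets waiting to be sent     */
        __u32                daddr;         /* foreign IP address             */
        __u32                rcv_saddr;     /* bound local IP address         */
        __u16                dport;         /* destination port               */
        unsigned short       num;           /* local port                     */
        struct proto        *prot;          /* TCP/UDP-specific operations    */
        /* ... timers, options, and much more ... */
    };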

[Diagram: an INET socket with open questions: how to reach the transport layer? which BSD socket? No connections yet.]

protocols vector

Array of struct net_proto, which has:

a name, say INET, UNIX, IPX, etc.

an initialization function, say inet_proto_init

This protocols array is static in net/protocols.c

This file uses conditional compilation to include the protocols chosen in make config
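Roughly what net/protocols.c contains (a sketch; the entry list is abbreviated):

    struct net_proto {
        const char *name;                       /* "INET", "UNIX", ...  */
        void (*init_func)(struct net_proto *);  /* e.g. inet_proto_init */
    };

    struct net_proto protocols[] = {
    #ifdef CONFIG_UNIX
        { "UNIX", unix_proto_init },
    #endif
    #ifdef CONFIG_INET
        { "INET", inet_proto_init },
    #endif
        /* ... IPX, Appletalk, AX.25, ... */
    };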


inet_proto_init

The protocols vector is traversed at system init time, and each init function is called

Each of these protocol init functions registers itself with the BSD socket layer by giving its name and socket create function

Where does the BSD socket layer store this information?

net_families

The BSD socket layer stores info for each registering protocol in this array

This is an array of struct net_proto_family, which is:

int family

int (*create)(struct socket *sock, int protocol)
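In code, roughly (a sketch of the registration made from inet_proto_init()):

    struct net_proto_family {
        int family;                                        /* AF_INET, AF_UNIX, ... */
        int (*create)(struct socket *sock, int protocol);  /* e.g. inet_create      */
    };

    /* inside inet_proto_init(): */
    static struct net_proto_family inet_family_ops = { AF_INET, inet_create };
    sock_register(&inet_family_ops);   /* stored in net_families[AF_INET] */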

BSD socket layer now has:

INET: inet_create()
IPX: ipx_create()
UNIX: unix_create()

So in the socket() call

The BSD socket layer looks for the specified address family, say INET

The BSD socket layer calls the create function for that family, say inet_create()

inet_create() does switch (BSD_socket->type):

case SOCK_DGRAM: fill the BSD socket's proto_ops with UDP operations

case SOCK_STREAM: fill the BSD socket's proto_ops with TCP operations
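In outline (simplified from inet_create() in net/ipv4/af_inet.c; error handling and the other socket types omitted):

    static int inet_create(struct socket *sock, int protocol)
    {
        struct proto *prot;

        switch (sock->type) {
        case SOCK_STREAM:                     /* TCP */
            sock->ops = &inet_stream_ops;
            prot = &tcp_prot;
            break;
        case SOCK_DGRAM:                      /* UDP */
            sock->ops = &inet_dgram_ops;
            prot = &udp_prot;
            break;
        default:
            return -ESOCKTNOSUPPORT;
        }
        /* ... allocate the struct sock, link sock->sk to it, set sk->prot ... */
        return 0;
    }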

Socket layer is satisfied

[Diagram: a BSD socket (AF_INET, SOCK_STREAM) linked to TCP's proto_ops and to an INET socket carrying the write queue, the receive queue, and lots of other TCP data.]

Reaching sockets through file descriptors

Per-process file table -> inode -> BSD socket, etc.

Not described here

Setting things up: device side

How network interfaces come up and attach themselves to the stack

[Diagram: a network interface card with open questions: what is my name (since I don't have a /dev file)? Give packets to whom? No connections yet.]

struct device

No device file for network devices

Why? A design choice, probably because network devices "push" data

Each interface is represented by a struct device

All struct devices are chained, and the chain head is called dev_base

struct device continued

char *name: say eth0
unsigned long base_addr: I/O address
unsigned int irq: IRQ number
struct device *next
int (*init)(struct device *dev)
int (*hard_start_xmit)(struct sk_buff *skb, struct device *dev): transmission function
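A trimmed-down sketch of these fields (from include/linux/netdevice.h; many members elided):

    struct device {
        char          *name;        /* "eth0", "lo", ...                  */
        unsigned long  base_addr;   /* device I/O address                 */
        unsigned int   irq;         /* device IRQ number                  */
        struct device *next;        /* next device on the dev_base chain  */
        int (*init)(struct device *dev);            /* probe/initialize   */
        int (*hard_start_xmit)(struct sk_buff *skb,
                               struct device *dev); /* transmit a packet  */
        /* ... flags, MTU, hardware address, stats, ... */
    };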

dev_base

drivers/net/Space.c cleverly threads struct devices for all possible interfaces into a list starting at dev_base (a static data structure declaration; no code execution yet)

The list includes a limited number of devices of each type, i.e. eth0 to eth7 and no more are possible

ethif_probe()

For each of these 8 struct devices, the names are eth0 to eth7 and the init function is ethif_probe()

During system init time the list of struct devices is traversed, and the init function is called for each

So ethif_probe() is called for eth0; it calls probe_list()

probe_list()

probe_list() goes through a list of all Ethernet devices the system has drivers for

The probe function for each driver is called, and

on success, the proper function pointers from the driver code are assigned to this struct device (ethX)

on failure, no more eth devices exist, so this struct device is removed from the list and we return
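The shape of that loop, simplified from drivers/net/Space.c (autoprobe bookkeeping trimmed):

    struct devprobe {
        int (*probe)(struct device *dev);
    };

    /* Try each compiled-in Ethernet driver until one claims this ethN slot. */
    static int probe_list(struct device *dev, struct devprobe *plist)
    {
        struct devprobe *p;

        for (p = plist; p->probe != NULL; p++)
            if (p->probe(dev) == 0)   /* driver found hardware and filled in */
                return 0;             /* dev->hard_start_xmit and friends    */
        return -ENODEV;               /* nobody claimed it                   */
    }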

After all devices in Space.c have been traversed

[Diagram: dev_base chains lo0, then eth0 (a 3Com card, with functions from the 3Com driver), then eth1 (an HP card, with functions from the HP driver). Remaining question: give packets to whom?]

Modularized driver


Much simpler, because the driver’s
probe is executed at module load time


If it finds a device, it appends a struct
device to the end of the dev_base list

backlog queue

Very distinct from the socket listen backlog queue!

A systemwide queue that interfaces immediately drop packets onto

Device driver writers simply call netif_rx(), which does the actual queueing (see the sketch below)
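What that looks like from a driver's receive path, as a hedged sketch (my_card_rx and copy_from_card are hypothetical names; real drivers differ in detail):

    /* Called from the driver's ISR once the card has a packet ready. */
    static void my_card_rx(struct device *dev, int pkt_len)
    {
        struct sk_buff *skb;

        skb = dev_alloc_skb(pkt_len + 2);       /* fresh buffer for the packet */
        if (skb == NULL)
            return;                             /* drop under memory pressure  */
        skb_reserve(skb, 2);                    /* 16-byte align the IP header */
        skb->dev = dev;
        copy_from_card(skb_put(skb, pkt_len), pkt_len);  /* hypothetical       */
        skb->protocol = eth_type_trans(skb, dev);   /* record link-layer type  */
        netif_rx(skb);      /* queue on the backlog; marks net_bh() to run     */
    }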

Link layer is satisfied

[Diagram: as before, dev_base chains lo0, eth0 (3Com card, 3Com driver functions), and eth1 (HP card, HP driver functions); all of them now feed the backlog queue.]

Setting things up: between link and network layers

How packets reach the correct protocol stack

[Diagram: the backlog queue with open questions: who takes packets off the backlog queue, and who gets them: IP? ARP? IPX? BOOTP? No connections yet.]

net_bh()

The bottom-half handler for network interrupts

Executes when the network interrupt is not masked

So the fast handler (the actual ISR) is driver code that calls netif_rx() to queue the packet onto the backlog queue, and marks net_bh() for execution

net_bh() takes packets off the backlog and passes them to the protocol specified in the Ethernet header

ptype_base

ptype_base is the head of a list of possible packet types the link layer may receive (IP, ARP, IPX, BOOTP, etc.) that the system can handle

How is it built?

For every protocol in the protocols vector, when its init function is called (inet_proto_init), it calls functions like ip_init(), tcp_init() and arp_init()

dev_add_pack completes the picture

Those subprotocols interested in registering a packet type (IP, ARP) get their init functions (ip_init(), arp_init()) to call dev_add_pack(), specifying a handler function

This adds the packet type to ptype_base

So net_bh() hands off packets to the right protocol stack
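Roughly what IP's registration looks like (a sketch after net/ipv4/ip_output.c; struct packet_type is in include/linux/netdevice.h):

    struct packet_type {
        unsigned short type;     /* htons(ETH_P_IP), htons(ETH_P_ARP), ...  */
        struct device *dev;      /* NULL = match any interface              */
        int (*func)(struct sk_buff *, struct device *,
                    struct packet_type *);       /* handler, e.g. ip_rcv    */
        void *data;
        struct packet_type *next;
    };

    static struct packet_type ip_packet_type = {
        __constant_htons(ETH_P_IP), NULL, ip_rcv, NULL, NULL
    };

    void ip_init(void)
    {
        dev_add_pack(&ip_packet_type);   /* net_bh() can now find ip_rcv() */
        /* ... */
    }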

Setting things up: between network and transport layers

How packets reach the correct transport protocol

inet_protos

An array of transport layer protocols in INET

Built at the time of inet_proto_init()

By calling inet_add_protocol() for every transport protocol

Registers handlers for transport protocols (see the sketch below)
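A hedged sketch of that registration (after include/net/protocol.h and net/ipv4/af_inet.c; fields abbreviated):

    struct inet_protocol {
        int  (*handler)(struct sk_buff *skb, unsigned short len);
        void (*err_handler)(struct sk_buff *skb, unsigned char *dp, int len);
        struct inet_protocol *next;
        unsigned char protocol;      /* IPPROTO_TCP, IPPROTO_UDP, ... */
        /* ... */
    };

    static struct inet_protocol tcp_protocol = {
        tcp_v4_rcv,      /* handler that ip_local_deliver() will invoke */
        tcp_v4_err,      /* ICMP error callback                         */
        NULL,
        IPPROTO_TCP,
    };

    /* in inet_proto_init(), for each transport protocol: */
    inet_add_protocol(&tcp_protocol);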

Packet movement through the stack

Transmission and reception,
queues, interrupts

struct sk_buff

Each packet that arrives on the wire is encased in a buffer called an sk_buff

An sk_buff is just the data with a lot of additional information about the packet

There is a one-to-one relationship between packets and sk_buffs, i.e. one packet, one buffer

sk_buffs can be allocated in multiples of 16 bytes

struct sk_buff continued

INET sock queues are queues of sk_buffs

Data coming from socket calls is copied into sk_buffs

Data arriving from the network is copied into sk_buffs

[Figure: the sk_buff structure with its fields]
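In lieu of the picture, a hedged sketch of the central fields (from include/linux/skbuff.h; much omitted):

    struct sk_buff {
        struct sk_buff      *next, *prev; /* sk_buffs live on doubly linked queues */
        struct sk_buff_head *list;        /* the queue this buffer is on           */
        struct sock         *sk;          /* owning socket, if any                 */
        struct device       *dev;         /* interface it arrived on / leaves by   */
        unsigned int         len;         /* bytes of live data                    */
        unsigned char       *head;        /* start of the allocated buffer         */
        unsigned char       *data;        /* start of the live data                */
        unsigned char       *tail;        /* end of the live data                  */
        unsigned char       *end;         /* end of the allocated buffer           */
        /* ... protocol header pointers, checksum state, timestamps, ... */
    };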

Queues

backlog queue

INET sock queues

TCP has a number of queues for out-of-order, connection backlog, error packets (?)

Packet reception

Packet received by hardware

Receive interrupt generated

Driver handler copies data from hardware into a fresh sk_buff

Calls netif_rx() to queue on the backlog

Schedules net_bh() with mark_bh(NET_BH)

net_bh() executes the next time the scheduler is run, or a system call returns, or a slow interrupt handler returns

Packet reception continued

net_bh() tries to send any pending packets, then dequeues packets from the backlog and passes them to the correct handler, say ip_rcv()

ip_rcv() may call ip_local_deliver() or ip_forward()

ip_local_deliver() results in a call to tcp_v4_rcv() through the inet_protos list

tcp_v4_rcv() queues the data on the correct socket's queue

Packet reception continued


When the socket’s owner reads,
tcp_recvmsg() is invoked through BSD
socket’s proto_ops


If instead the socket’s owner had
blocked on a read, that process will be
woken using wake_up (wait queue)

Packet transmission


Quite different for TCP and UDP in terms of
copying of user data to kernel space


TCP does its own checksumming, while IP
does checksumming for UDP. Why? Next
section.


net_bh() again takes care of flushing out
packets that have piled up at the device’s
queue

Tricks and optimizations

TCP/IP enhancements, most due
to Van Jacobson, arrived after
4.3BSD

Checksum and copy


Linux goes over every byte of data only
once (if the packet does not get
fragmented)


Uses checksum_and_copy()


TCP data from socket gets filled into
MSS
-
sized segments by TCP, so
checksum
-
copying happens here
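The one-pass idea, as a hedged user-space sketch (the kernel's real versions, e.g. csum_partial_copy(), are hand-tuned per architecture; this only shows the loop structure):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy len bytes from src to dst while accumulating the 16-bit
       one's-complement sum (RFC 1071) in the same pass. */
    static uint32_t checksum_and_copy(uint8_t *dst, const uint8_t *src,
                                      size_t len, uint32_t sum)
    {
        while (len > 1) {
            sum += (uint16_t)((src[0] << 8) | src[1]);  /* big-endian word */
            dst[0] = src[0];
            dst[1] = src[1];
            src += 2; dst += 2; len -= 2;
        }
        if (len) {                         /* odd trailing byte, zero-padded */
            sum += (uint16_t)(src[0] << 8);
            dst[0] = src[0];
        }
        while (sum >> 16)                  /* fold carries back into 16 bits */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return sum;                        /* caller complements and stores  */
    }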

Checksum and copy continued

[Diagram: data is checksum-copied from the user buffer (ubuff) into sk_buffs on the INET socket's (struct sock's) write_queue: first topping off a partially used sk_buff, then filling a newly allocated one.]
Checksum and copy continued

UDP, on the other hand, does not stuff anything into MSS-sized buffers, so there is no need to copy data from user space at the UDP layer

UDP passes the data and a callback function to IP

IP copies this data into an sk_buff using the callback function, which is a checksum_and_copy function

Large ping replies from a Linux host arrive in reverse order of fragments! Why?

Why UDP fragmentation happens in reverse order

[Diagram: a UDP datagram split into fragments; the last fragment leaves first, with the partial checksum for its data calculated and remembered; the middle fragment leaves second, its checksum added to the partial sum; the fragment carrying the UDP header leaves last, so that the final checksum can be written into the UDP header.]

Fixed size buffer, sk_buff

mbufs were potentially very clumsy

"There is exactly one, contiguous, packet per pbuf (none of that mbuf chain stupidity)." Van Jacobson

Allocation of fixed-size buffers at the transport layer implies knowledge of network and link layer header sizes

Linux is not shy of such indiscretions

Incremental checksum updates

At every hop, the TTL changes (is decremented)

But the IP checksum covers the header, and therefore the TTL too

So it needs to be recalculated at every hop

Linux does this in one step

RFCs 1071, 1141 and 1624 discuss both checksum-and-copy and this incremental checksum update
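A runnable sketch of the arithmetic (the kernel's one-step update lives in a small inline, ip_decrease_ttl()): the TTL occupies the high byte of one 16-bit header word, so decrementing it lowers that word by 0x0100, and the stored checksum must rise by the same amount with an end-around carry. The header bytes below are made up for the demonstration:

    #include <stdint.h>
    #include <stdio.h>
    #include <arpa/inet.h>

    /* Full RFC 1071 recompute over the header, for comparison. */
    static uint16_t ip_checksum(const uint16_t *hdr, int words)
    {
        uint32_t sum = 0;
        while (words--)
            sum += *hdr++;
        while (sum >> 16)
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }

    /* Incremental update for a TTL decrement: adjust only the checksum,
       never re-reading the other 18 bytes of the header. */
    static void decrease_ttl(uint8_t hdr[20])
    {
        uint16_t *check = (uint16_t *)&hdr[10];    /* checksum field    */
        uint32_t  sum   = *check + htons(0x0100);  /* undo old TTL word */
        *check = (uint16_t)(sum + (sum >> 16));    /* fold the carry in */
        hdr[8]--;                                  /* the TTL byte      */
    }

    int main(void)
    {
        uint8_t hdr[20] = { 0x45, 0, 0, 84, 0x1c, 0x46, 0x40, 0,
                            64, 6, 0, 0, 10, 0, 0, 1, 10, 0, 0, 2 };
        uint16_t *check = (uint16_t *)&hdr[10];

        *check = ip_checksum((const uint16_t *)hdr, 10); /* seed valid sum */
        decrease_ttl(hdr);                /* hop: TTL 64 -> 63, one step   */

        uint16_t fast = *check;           /* verify against a recompute    */
        *check = 0;
        printf("incremental %04x, recomputed %04x\n",
               fast, ip_checksum((const uint16_t *)hdr, 10));
        return 0;
    }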

Cached hardware headers

Route cache entries cache hardware (link-layer) headers for quick construction of outgoing packets.