Achieving TCP/IP performance in embedded systems

hollowtabernacle, Networking and Communications

Oct 26, 2013


A buffer contains a header portion used by the protocol stack. This header provides information
regarding the contents of the buffer. The data portion contains data that has either been received by the
Network Interface Card (NIC) and thus will be processed by the stack, or data that is destined for
transmission by the NIC.
Figure 2  Network buffer
The maximum network buffer size is determined by the maximum size of the data that can be
transported by the networking technology used. Today, Ethernet is the ubiquitous networking
technology used for Local Area Networks (LANs).

Originally, Ethernet standards defined the maximum frame size as 1518 bytes. Removing the Ethernet,
IP and TCP encapsulation data (a 14-byte Ethernet header, a 4-byte frame check sequence, and 20-byte
IP and TCP headers without options), this leaves a maximum of 1460 bytes for the data carried in a TCP
segment. A segment is the data structure used to encapsulate TCP data. Carrying an Ethernet frame in
one of the TCP/IP stack network buffers requires network buffers of approximately 1600 bytes each. The
difference between the Ethernet maximum frame size and the network buffer size is the space required
for the network buffer metadata.
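The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, assuming untagged Ethernet frames and IP and TCP headers without options; the constant and function names are hypothetical, and the 1600-byte buffer size is simply the approximate figure quoted above.

```python
# Standard sizes in bytes (no VLAN tag, no IP or TCP options).
ETH_MAX_FRAME = 1518   # classic Ethernet maximum frame size, including FCS
ETH_HEADER = 14        # destination MAC + source MAC + EtherType
ETH_FCS = 4            # frame check sequence (CRC-32)
IP_HEADER = 20         # IPv4 header without options
TCP_HEADER = 20        # TCP header without options

# Hypothetical buffer size, taken from the "approximately 1600 bytes" figure.
NET_BUF_SIZE = 1600


def max_tcp_payload(frame_size=ETH_MAX_FRAME):
    """Largest TCP payload that fits in one Ethernet frame of frame_size bytes."""
    return frame_size - ETH_HEADER - ETH_FCS - IP_HEADER - TCP_HEADER


def buffer_overhead(buf_size=NET_BUF_SIZE, frame_size=ETH_MAX_FRAME):
    """Bytes left in a buffer for stack metadata once a full frame is stored."""
    return buf_size - frame_size


print(max_tcp_payload())   # 1460
print(buffer_overhead())   # 82
```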
It is possible to use network buffers smaller than the maximum allowed. For example, an application
that periodically transfers small amounts of sensor data, rather than streaming multimedia data, does
not need full-size buffers.
TCP segment size is negotiated between the two devices that are establishing a logical connection. It is
known as the Maximum Segment Size (MSS). An embedded system could take advantage of this
protocol capability. On an embedded target with 32K of RAM, once all the middleware RAM usage is
accounted for, there is not much left for network buffers!

Network operations
Many networking operations affect system performance. For example, network buffers are not released
as soon as their task is completed. Within the TCP acknowledgment process, a TCP segment is kept
until its reception is acknowledged by the receiving device. If it is not acknowledged within a certain
timeframe, the segment is retransmitted and kept again.

If a system has a limited number of network buffers, network congestion (packets being dropped) will
affect the usage of these buffers and the total system performance. When all the network buffers are
assigned to packets (being transmitted, retransmitted or acknowledging received packets), the TCP/IP
stack will slow down while it waits for available resources before resuming a specific function.
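The effect of holding buffers until acknowledgment can be sketched with a toy buffer pool. This is only an illustration under simplifying assumptions (one buffer per segment, no timers); the class and method names are hypothetical and do not correspond to any particular stack's API.

```python
class BufferPool:
    """Fixed pool of network buffers; names and behaviour are illustrative."""

    def __init__(self, count):
        self.free = count     # buffers currently available
        self.unacked = {}     # sequence number -> segment kept for retransmission

    def send(self, seq, data):
        """Take a buffer for a segment; return False if the pool is exhausted."""
        if self.free == 0:
            return False              # caller must wait, retry or give up
        self.free -= 1
        self.unacked[seq] = data      # kept until the peer acknowledges it
        return True

    def ack(self, seq):
        """Release the buffer of an acknowledged segment."""
        if seq in self.unacked:
            del self.unacked[seq]
            self.free += 1


pool = BufferPool(count=2)
results = [pool.send(1, b"a"), pool.send(2, b"b"), pool.send(3, b"c")]
pool.ack(1)                           # the peer acknowledges segment 1
results.append(pool.send(3, b"c"))    # a buffer is available again
print(results)                        # [True, True, False, True]
```

With only two buffers, the third send fails until an acknowledgment frees a buffer, which is the stall described above.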

The advantage of defining smaller network buffers is that more buffers exist, which allows TCP (and UDP)
to have more protocol exchanges between the two devices. This is ideal for applications where the
information exchanged can be in smaller packets, such as a data logging device sending periodic sensor
data.
A disadvantage is that each packet carries less data. For streaming applications, this is less than
desirable. HTTP, FTP and other such protocols will not perform well with this configuration model.

Ultimately, if there is insufficient RAM to define a few network buffers, the TCP/IP stack will crawl.


TCP Performance
TCP has a flow control mechanism called Windowing that is used for Transmit and Receive. A field in
the TCP header is used for the Windowing mechanism so that:

This Window field indicates the quantity of information (in bytes) that the recipient is
able to accept. This enables TCP to control the flow of data.
Data receiving capacity is related to memory and to the hardware's processing capacity
(network buffers).
The maximum size of the window is 65,535 bytes (a 16-bit field).
A value of 0 (zero) halts the transmission.
The source host sends a series of bytes to the destination host.

Figure 3  TCP Windowing
Within Figure 3:
Bytes 1 through 512 have been transmitted (and pushed to the application using the TCP PSH
flag) and have been acknowledged by the destination host
The window is 2,048 bytes long
Bytes 513 through 1,536 have been transmitted but have not been acknowledged
Bytes 1,537 through 2,560 can be transmitted immediately
Once an acknowledgement is received for bytes 513 through 1,536, the window will move
1,024 bytes to the right, and bytes 2,561 through 3,584 may then be sent.
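The byte ranges described for Figure 3 can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the function name and its parameters are hypothetical, and bytes are numbered from 1 as in the figure.

```python
def window_ranges(last_acked, last_sent, window):
    """Return (in_flight, sendable) as inclusive byte ranges, or None if empty.

    last_acked: highest byte number acknowledged by the receiver
    last_sent:  highest byte number transmitted so far
    window:     window size advertised by the receiver, in bytes
    """
    window_end = last_acked + window
    in_flight = (last_acked + 1, last_sent) if last_sent > last_acked else None
    sendable = (last_sent + 1, window_end) if window_end > last_sent else None
    return in_flight, sendable


# State shown in Figure 3.
print(window_ranges(last_acked=512, last_sent=1536, window=2048))
# ((513, 1536), (1537, 2560))

# Bytes 1537..2560 have also been sent, then bytes 513..1536 are acknowledged.
print(window_ranges(last_acked=1536, last_sent=2560, window=2048))
# ((1537, 2560), (2561, 3584))
```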

On an embedded device, the window size should be configured in terms of the network buffers
available. For example:
With an embedded device that has 8 network buffers with an MSS of 1460, let's reserve 4 buffers for
transmission and 4 buffers for reception. Transmit and receive window sizes will be 4 times 1460 (4 *
1460 = 5840 bytes). On every packet received, TCP decreases the Receive Window Size by 1460 and
advertises the newly calculated Receive Window Size to the transmitting device. Once the stack has
processed the packet, the Receive Window Size will be increased by 1460, the network buffer will be
released and the Receive Window Size will be advertised with the next packet transmitted.
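The receive-side bookkeeping in this example can be sketched as follows. This is an illustrative model only, assuming every packet occupies exactly one buffer and carries one full MSS; the class and method names are hypothetical.

```python
MSS = 1460        # bytes, from the example above
RX_BUFFERS = 4    # buffers reserved for reception


class ReceiveWindow:
    """Tracks the advertised receive window in whole-MSS steps (illustrative)."""

    def __init__(self, buffers=RX_BUFFERS, mss=MSS):
        self.mss = mss
        self.size = buffers * mss     # 4 * 1460 = 5840 bytes

    def on_packet_received(self):
        """A buffer is now occupied: shrink the window by one MSS."""
        if self.size == 0:
            raise RuntimeError("no buffer available; sender should have stopped")
        self.size -= self.mss
        return self.size              # value advertised to the sender

    def on_packet_processed(self):
        """The stack freed a buffer: grow the window by one MSS."""
        self.size += self.mss
        return self.size


win = ReceiveWindow()
advertised = [win.on_packet_received() for _ in range(4)]
print(advertised)                     # [4380, 2920, 1460, 0]
print(win.on_packet_processed())      # 1460
```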

Typically, the network can transport packets faster than the embedded target can process them. If the
receiving device has received 4 packets without being able to process them, the Receive Window Size
will be decreased to zero. A zero Receive Window Size advertised to the transmitting device tells that
device to stop transmitting until the receiving device is able to process and free at least one network
buffer. On the transmit side, the stack will stop if network buffers are not available. Depending on how
the stack is designed/configured, the transmitting function will retry, time out or exit
(blocking/non-blocking).
UDP does not have such a mechanism. If there are insufficient network buffers to receive the
transmitted data, packets are dropped. The application needs to handle these situations.

TCP connection bandwidth product
The number of TCP segments being received/transmitted by a host has an approximate upper bound
equal to the TCP window sizes (in packets) multiplied by the number of TCP connections:

Tot # TCP Pkts ~= Tot # TCP Conns * TCP Conn Win Sizes

This is known as the TCP connection bandwidth product.

The number of internal NIC packet buffers/channels limits the target host's overall packet bandwidth.
Because most targets are slower consumers, data sent to the target by a faster producer will consume
most or all NIC packet buffers/channels, and some packets will be dropped. However, even when the
throughput is exceptionally low, TCP connections should still be able to transfer data via
retransmission.

Windowing with multiple sockets
The Windowing example given earlier assumes that the embedded device has one socket (one logical
connection) with a foreign host. Imagine a system where multiple parallel connections are required. The
discussion above can be applied to each socket. With proper application code, each connection's
throughput is a fraction of the total connection bandwidth. This means that the Window size configured
in the TCP/IP stack needs to take into consideration the maximum number of sockets running at any
point in time.
Using the same example with 5 sockets and providing a Receive Window size of 5840 bytes to every
socket, 20 network buffers (4 buffers per Window * 5 sockets) will have to be configured. Assuming that
the largest network buffers possible (about 1600 bytes) are used, this means about 32K RAM of network
buffers (20 * 1600) is required; otherwise, the system will slow down due to excessive retransmission.
How does one find the Tx and Rx window sizes for a system? A reverse calculation is probably what
happens most of the time.
When 20 network buffers are reserved for reception and the system needs a maximum of 5 sockets
at any point in time, then:
Rx Window Size = (Number of buffers * MSS) / Number of sockets

If the result is less than one MSS, more RAM for additional buffers is required.
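Both directions of this sizing exercise can be sketched in a few lines: the forward calculation (buffers and RAM needed for a given window and socket count) and the reverse calculation (window size obtainable from a given number of buffers). The function names are hypothetical, and the 1460-byte MSS and 1600-byte buffer size are the example figures used in this article.

```python
MSS = 1460            # bytes, from the example above
NET_BUF_SIZE = 1600   # bytes, approximate size of a full-size network buffer


def rx_buffers_needed(window_bytes, sockets, mss=MSS):
    """Forward calculation: buffers required to back one window per socket."""
    buffers_per_window = -(-window_bytes // mss)   # ceiling division
    return buffers_per_window * sockets


def rx_window_size(num_buffers, sockets, mss=MSS):
    """Reverse calculation: Rx Window Size = (buffers * MSS) / sockets."""
    window = (num_buffers * mss) // sockets
    if window < mss:
        raise ValueError("window below one MSS: more RAM for buffers is required")
    return window


buffers = rx_buffers_needed(window_bytes=5840, sockets=5)
print(buffers, buffers * NET_BUF_SIZE)             # 20 32000
print(rx_window_size(num_buffers=20, sockets=5))   # 5840
```

Raising an error when the result falls below one MSS is a design choice of this sketch; a real configuration tool might instead report how many additional buffers are needed.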

Delayed Acknowledgement
Another important factor needs to be taken into consideration with TCP: the network congestion state.
TCP keeps each network buffer transmitted until it is acknowledged by the receiving host. When packets
are dropped or never delivered because of a number of network problems, TCP retransmits the packets.
This means that unacknowledged buffers are set aside and used for this purpose.

TCP does not necessarily acknowledge every packet received, a situation called Delayed
Acknowledgement. Without delayed acknowledgement, half of the buffers used for transmission are
used for acknowledging every received packet. With delayed acknowledgement, this number is reduced to

Knowing the number of buffers that can be used for transmission, based on the quantity of RAM that
can be used for network buffers and the maximum number of sockets in use at any point in time, the
Transmit Window Size can be calculated:
Without Delayed Acknowledgement:
Tx Window Size = (Number of buffers * MSS) / (Number of sockets * 2)

With Delayed Acknowledgement:
Rx Window Size = (Number of buffers * MSS) / Number of sockets
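As a worked example of the formula given for the case without delayed acknowledgement, the sketch below plugs in the figures used earlier (20 buffers, 5 sockets, an MSS of 1460). The function name is hypothetical, and only that first formula is covered here.

```python
def tx_window_no_delayed_ack(num_buffers, sockets, mss=1460):
    """Tx Window Size = (buffers * MSS) / (sockets * 2).

    The factor of 2 reflects half of the transmit buffers being used to
    acknowledge received packets when delayed acknowledgement is not used.
    """
    return (num_buffers * mss) // (sockets * 2)


window = tx_window_no_delayed_ack(num_buffers=20, sockets=5)
print(window)            # 2920
print(window // 1460)    # 2 full-size segments per socket
```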

Note that a similar analysis can be done with a UDP application. Flow control and congestion control,
instead of being implemented in the Transport Layer protocol, are moved to the Application Layer, as in
TFTP (Trivial File Transfer Protocol), for example. Acknowledgement and retransmission are part of any
data communications protocol. If they are not performed by the communication protocols, the
application must take care of them.
It is the developer's decision to use UDP or TCP. If TCP is not required, it can be removed from the
stack (reducing the application code space). However, the application will then need to take care of the
network problems responsible for the non-delivery of packets.
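Application-level acknowledgement and retransmission can be sketched as a simple stop-and-wait loop, in the spirit of TFTP. This is a simulation under stated assumptions, not a TFTP implementation: the `channel` callable is a hypothetical stand-in for a UDP send followed by a timed wait for an ACK, and all names are illustrative.

```python
def stop_and_wait_send(blocks, channel, max_retries=3):
    """Send blocks one at a time, retransmitting until each is acknowledged.

    channel(block_no, data) returns True if an ACK came back in time.
    Returns the total number of transmissions performed.
    """
    transmissions = 0
    for block_no, data in enumerate(blocks, start=1):
        for _attempt in range(1 + max_retries):
            transmissions += 1
            if channel(block_no, data):
                break                  # acknowledged: move to the next block
        else:
            raise TimeoutError(f"block {block_no} was never acknowledged")
    return transmissions


def make_lossy_channel(drop_first_attempt_of):
    """Simulated channel that loses the first transmission of the given blocks."""
    seen = set()

    def channel(block_no, data):
        if block_no in drop_first_attempt_of and block_no not in seen:
            seen.add(block_no)
            return False               # packet (or its ACK) was lost
        return True

    return channel


channel = make_lossy_channel(drop_first_attempt_of={2})
print(stop_and_wait_send([b"a", b"b", b"c"], channel))   # 4
```

Three blocks take four transmissions because block 2 is lost once and sent again, which is the work TCP would otherwise do on the application's behalf.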

DMA and CPU speed
As stated previously, most embedded targets are slow consumers. Packets generated by a faster
producer and received by the target will consume most or all NIC network buffers and some packets will
be dropped. Hardware features such as DMA and CPU speed can improve this situation. The latter is
straightforward: the faster the target can receive and process the packets, the faster the network
buffers can be freed.
DMA support for the NIC is another means to improve packet processing. When packets are transferred
quickly to and from the stack, network performance improves. DMA also relieves the CPU of the transfer
task, allowing the CPU to perform more of the protocol processing.
When implementing a TCP/IP stack, the design intentions need to be clear. If the goal is to use the
Local Area Network without any consideration for performance, a TCP/IP stack or a subset of it can be
implemented with very little RAM (approximately 32K).
However, if the application requires the capabilities of the TCP protocol at a few megabits per second, a
more complete TCP/IP stack is required. In this case, embedded system requirements are in the
range of 96K of RAM; resources need to be allocated to the protocol stack so that it can perform its