Thesis

Modeling TCP/IP Software Implementation Performance And Its Application For Software Routers

By
Oscar Iván Lepe Aldama

M.Sc. Electronics and Telecommunications, CICESE Research Center, México, 1995
B.S. Computer Engineering, Universidad Nacional Autónoma de México, 1992

Submitted to the Department of Computer Architecture in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY IN COMPUTER ENGINEERING

At the

UNIVERSITAT POLITÈCNICA DE CATALUNYA

Presented and defended on December 3, 2002

Advisor:
Prof. Dr. Jorge García-Vidal

Jury:
Prof. Dr. Ramón Puigjaner-Trepat, President
Profa. Dra. Olga Casals-Torres, Examiner
Prof. Dr. José Duato-Marín, Examiner
Prof. Dr. Joan García-Haro, Examiner
Prof. Dr. Sebastian Sallent-Ribes, Examiner

© 2002 Oscar I. Lepe A.
All rights reserved


Dr. ____________________, President

Dr. ____________________, Secretary

Dr. ____________________, Member

Dr. ____________________, Member

Dr. ____________________, Member

Date of public defense: ____________________

Grade: ____________________


Contents

List of figures
List of tables
Preface

Chapter 1 Introduction
1.1 Motivation
1.2 Scope
1.3 Dissertation's thesis
1.4 Synopsis
1.5 Outline
1.6 Related work

Chapter 2 Internet protocols' BSD software implementation
2.1 Introduction
2.2 Interprocess communication in the BSD operating system
2.2.1 BSD's interprocess communication model
2.2.2 Typical use of sockets
2.3 BSD's networking architecture
2.3.1 Memory management plane
2.3.2 User plane
2.4 The software interrupt mechanism and networking processing
2.4.1 Message reception
2.4.2 Message transmission
2.4.3 Interrupt priority levels
2.5 BSD implementation of the Internet protocols suite
2.6 Run-time environment: the host's hardware
2.6.1 The central processing unit and the memory hierarchy
2.6.2 The busses organization
2.6.3 The input/output bus' arbitration scheme
2.6.4 PCI hidden bus arbitration's influence on latency
2.6.5 Network interface card's system interface
2.6.6 Main memory allocation for direct memory access network interface cards
2.7 Other systems' networking architectures
2.7.1 Active Messages [Eicken et al. 1992]
2.7.2 Integrated Layer Processing [Abbott and Peterson 1993]
2.7.3 Application Device Channels [Druschel 1996]
2.7.4 Microkernel operating systems' extensions for improved networking [Coulson et al. 1994; Coulson and Blair 1995]
2.7.5 Communications oriented operating systems [Mosberger and Peterson 1996]
2.7.6 Network processors
2.8 Summary

Chapter 3 Characterizing and modeling a personal computer-based software router
3.1 Introduction
3.2 System modeling
3.3 Personal computer-based software routers
3.3.1 Routers' rudiments
3.3.2 The case for software routers
3.3.3 Personal computer-based software routers
3.3.4 Personal computer-based IPsec security gateways
3.4 A queuing network model of a personal computer-based software IP router
3.4.1 The forwarding engine, the network interface cards and the packet flows
3.4.2 The service stations' scheduling policies and the mapping between networking stages and model elements
3.4.3 Modeling a security gateway
3.5 System characterization
3.5.1 Tools and techniques for profiling in-kernel software
3.5.2 Software profiling
3.5.3 Probe implementation
3.5.4 Extracting information from the kernel
3.5.5 Experimental setup
3.5.6 Traffic patterns
3.5.7 Experimental design
3.5.8 Data presentation
3.5.9 Data analysis
3.5.10 "Noise" process characterization
3.6 Model validation
3.6.1 Service time correlations
3.6.2 Qualitative validation
3.6.3 Quantitative validation
3.7 Model parameterization
3.7.1 Central processing unit speed
3.7.2 Memory technology
3.7.3 Packet's size
3.7.4 Routing table's size
3.7.5 Input/output bus's speed
3.8 Model's applications
3.8.1 Capacity planning
3.8.2 Uniform experimental test-bed
3.9 Summary

Chapter 4 Input/output bus usage control in personal computer-based software routers
4.1 Introduction
4.2 The problem
4.3 Our solution
4.3.1 BUG's specifications and network interface card's requirements
4.3.2 Low overhead and intrusion
4.3.3 Algorithm
4.3.4 Algorithm's details
4.3.5 Algorithm's a priori estimated costs
4.3.6 An example scenario
4.4 BUG performance study
4.4.1 Experimental setup
4.4.2 Response to unbalanced constant packet rate traffic
4.4.3 Study on the influence of the activation period
4.4.4 Response to on-off traffic
4.4.5 Response to self-similar traffic
4.5 A performance study of a software router incorporating the BUG
4.5.1 Results
4.6 An implementation
4.7 Summary

Chapter 5 Conclusions and future work
5.1 Conclusions
5.2 Future work

Appendix A
Appendix B
Bibliography

List of figures
Figure 2.1—OMT object model for BSD IPC
Figure 2.2—BSD's two-plane networking architecture. The user plane is depicted with its layered structure, which is described in following sections. Bold circles in the figure represent defined interfaces between planes and layers: A) Socket-to-Protocol, B) Protocol-to-Protocol, C) Protocol-to-Network Interface, and D) User Layer-to-Memory Management. Observe that this architecture implies that layers share the responsibility of taking care of the storage associated with transmitted data.
Figure 2.3—BSD networking user plane's software organization
Figure 2.4—Example of priority levels and kernel processing
Figure 2.5—BSD implementation of the Internet protocol suite. Only chief tasks, message queues and associations are shown. Please note that some control flow arrows are sourced at the bounds of the squares delimiting the implementation layers. This denotes that a task is executed after an external event, such as an interrupt or a CPU scheduler event.
Figure 2.6—Chief components of general-purpose computing hardware
Figure 2.7—Example PCI arbitration algorithm
Figure 2.8—Main memory allocation for device drivers of network interface cards supporting direct memory access
Figure 2.9—Integrated layer processing [Abbott and Peterson 1993]
Figure 2.10—Application Device Channels [Druschel 1996]
Figure 2.11—SUMO extensions to a microkernel operating system [Coulson et al. 1994; Coulson and Blair 1995]
Figure 2.12—Making paths explicit in the Scout operating system [Mosberger and Peterson 1996]
Figure 3.1—A spectrum of performance modeling techniques
Figure 3.2—McKeown's [2001] router architecture
Figure 3.3—A queuing network model of a personal computer-based software router that has two network interface cards and that is traversed by a single packet flow. The number and meaning of the shown queues result from the characterization process presented in the next section.
Figure 3.4—A queuing network model of a software router that shows a one-to-one mapping between C language functions implementing the BSD networking code and the depicted model's queues. To simplify the figure, this model includes neither the software router's input/output bus nor the noise process. Moreover, it simplistically models the network interface cards.
Figure 3.5—A queuing network model of a software router configured as a security gateway. The number and meaning of the shown queues result from the characterization process presented in the next section.
Figure 3.6—Probe implementation for FreeBSD
Figure 3.7—Experimental setup
Figure 3.8—Characterization charts for a security gateway's protocols layer
Figure 3.9—Characterization charts for a software router's network interfaces layer
Figure 3.10—Comparison of the CCPFs computed from both measured data from the software router under test and predicted data from a corresponding queuing network model, which used a one-to-one mapping between C language networking functions and the model's queues
Figure 3.11—Example chart from the service time correlation analysis. It shows the plot of ip_input cycle counts versus ip_output cycle counts. A correlation is clearly shown.
Figure 3.12—Model's validation charts. The two leftmost columns' charts depict per-packet latency traces. The right column's charts depict latency traces' CCPFs.
Figure 3.13—Relationship between measured execution times and central processing unit operation speed. Observe that some measures have proportional behavior while others have linear behavior. The main text explains the reasons for these behaviors and why the circled measures do not all agree with the regression lines.
Figure 3.14—Outliers related to the CPU's instruction cache. The left chart was drawn from data taken from the ipintrq probe. The right chart corresponds to the ESP (DES) probe at a security gateway. Referenced outliers are highlighted.
Figure 3.15—Relationship between measured execution times and message size
Figure 3.16—Capacity planning charts
Figure 3.17—Queuing network model for a BSD-based software router with two network interface cards attached to it and three packet flows traversing it
Figure 3.18—Basic system's performance analysis: a) system's overall throughput; b) per-flow throughput share for system one; c) per-flow throughput share for system two
Figure 3.19—Queuing network model for a Mogul and Ramakrishnan [1997] based software router with two network interface cards attached to it and three packet flows traversing it
Figure 3.20—Mogul and Ramakrishnan [1997] based software router's performance analysis: a) system's overall throughput; b) per-flow throughput share for system one; c) per-flow throughput share for system two
Figure 3.21—Queuing network model for a software router including the receiver livelock avoidance mechanism and a QoS aware CPU scheduler, similar to the one proposed by Qie et al. [2001]. The software router has two network interface cards and three packet flows traverse it.
Figure 3.22—Qie et al. [2001] based software router's performance analysis: a) per-flow throughput share for system one; b) per-flow throughput share for system two
Figure 4.1—The BUG's system architecture. The BUG is a piece of software embedded in the operating system's kernel that shares information with the network interface cards' device drivers and manipulates the vacancy of each DMA receive channel.
Figure 4.2—The BUG's periodic and bistate operation for reducing overhead and intrusion
Figure 4.3—The BUG's packet-by-packet GPS server emulation with batch arrivals
Figure 4.4—The BUG is work conserving
Figure 4.5—The BUG's unfairness counterbalancing mechanism
Figure 4.6—The BUG's bus utilization grant packetization policy. In the considered scenario, three packet flows with different packet sizes traverse the router and the BUG has granted each an equal number of bus utilization bytes. Packet sizes are small, medium and large respectively for the orange, green and blue packet flows. After packetization, some idle time gets induced.
Figure 4.7—An example of the behavior of the BUG mechanism. Vectors A, D, N, G and g are defined as: A = (A1, A2, A3), etc. It is assumed that the system serves three packet flows with the same shares and with the same packet lengths, and that in a period T up to six packets can be transferred through the bus. The BUG does not consider all variables at all times. At every activation instant, the variables that the BUG ignores are printed in gray.
Figure 4.8—Queuing network models for: a) PCI bus, b) WFQ bus, and c) BUG protected PCI bus; all with three network interface cards attached and three packet flows traversing them
Figure 4.9—BUG performance study: response comparison to unbalanced constant packet rate traffic between a WFQ bus, a PCI bus and a BUG protected PCI bus (first, middle and last columns, respectively). In row (a) flow3 is the misbehaving flow, while flow2 and flow1 misbehave in rows (b) and (c), respectively.
Figure 4.10—BUG performance study: on the influence of the activation period
Figure 4.11—BUG performance study: response comparison to on-off traffic between an ideal WFQ bus, a PCI bus, and a BUG protected PCI bus
Figure 4.12—QoS aware system's performance analysis: a) system's overall throughput; b) per-flow throughput share for system one; c) per-flow throughput share for system two

List of tables
Table 3-I
Table 3-II
Table 3-III
Table 4-I
Table 4-II
Table 4-III
Table 4-IV


Preface
I dedicate this work to my wife, Tania Patricia, and to our children, Pedro Darío and Sebastián.

At the same time, I want to thank Tania and Pedro for their support and patience during the time we were away from home, living in less than the best of conditions. I only hope that this hardship was offset by the richness of the experiences we lived through, which I feel will make us better people, having impressed upon us the importance of family unity and the value of our customs, our people and our land.

I thank my thesis advisor, Jorge García-Vidal, for his invaluable guidance and dedication. I will hardly forget the countless long discussion meetings that gave life to this work, nor the many sleepless nights we spent polishing the details of the papers we sent out for review. But I will just as hardly forget the afternoons over coffee, or the afternoons in the park with our children, where we talked about everything but work.

I thank the people of México, whose effort makes possible the scholarship programs that let Mexicans pursue graduate studies abroad, such as the program managed by the Consejo Nacional de Ciencia y Tecnología, or the one managed by the Centro de Investigación Científica y de Educación Superior de Ensenada, which took me in.

And in general I thank all the people who directly or indirectly helped me bring this work to completion, for example the professors of DAC/UPC and the anonymous reviewers at the conferences. In particular I want to mention, in alphabetical order, Alejandro Limón Padilla, David Hilario Covarrubias Rosales, Delia White, Eva Angelina Robles Sánchez, Francisco Javier Mendieta Jiménez, Jorge Enrique Preciado Velasco, José María Barceló Ordinas, Llorenç Cerdà Alabern, Olga María Casals Torres, Oscar Alberto Lepe Peralta, and Victoriano Pagoaga. To all of them, many thanks.



Chapter 1


Introduction
1.1 Motivation
Evidently, computer networks and the computer applications that run over them are fundamental to today's human society. Some of these telematic systems are fundamental to the proper operation of the New York Stock Exchange, for instance, while others are fundamental to providing telephony service to small villages in hard-to-reach places, like the numerous villages in the mountains of Chiapas, México. As is well known, the application of telematic technology for the betterment of human society is limited only by human imagination.
Today’s human’s necessities for information processing and transportation present
complex problems. For solving these problems, scientists and engineers are required to
produced ideas and products with an ever-increasing number of components and inter-
components relationships. In order to cope with this complexity, developers invented
the concept of layered architectures, which allow a structured “divide to conquer”
approach for designing complex systems by providing a step-by-step enhancement of
system services.
Unfortunately, layered structures can result in telematic systems with relatively low performance, and often they do, if implemented carelessly [Clark 1982; Tennenhouse 1989; Abbott and Peterson 1993; Mogul and Ramakrishnan 1999]. In order to understand this, let us consider the following:
• Telematic systems are mostly implemented in software. Each software layer is designed as an independent entity that concurrently and asynchronously communicates with its neighbors through a message-passing interface. This allows for better interoperability, manageability and extensibility.
• In order to allow software layers to work concurrently and asynchronously, the host computer system has to provide a secure and reliable multiprogramming environment through an operating system. Ideally, the operating system should perform its role without consuming a significant share of processing resources. Unfortunately, as reported elsewhere [Druschel 1996], current operating systems threaten to become bottlenecks when processing input/output data streams. Moreover, they are the source of statistical delays, incurred as each data unit is marshaled through the layered software, that hamper the correct deployment of important services.
Others have recognized this problem and have conducted studies analyzing some aspects of some operating systems' computer network software, networking software for short. For us it is striking that although these studies are numerous (a search through the ACM Digital Library on the term "(c.2 and d.4 and c.5)<IN>ccs" returns 54 references, and a search through IEEE Xplore on the term "'protocol processing'<IN>de" returns 81 references), only one of them sought to build a general model of the networking software; see this chapter's section on related work. Indeed, although different systems' networking software has more similarities than differences, as we will discuss later, most of these studies have focused only on identifying and solving particular problems of particular systems. In saying this we do not deny the importance of that work; however, we believe that modeling is an important part of doing research.
The Internet protocol suite (TCP/IP) [Stevens 1994] is nowadays the preferred technology for networking. Of all possible implementations, the one done at the University of California at Berkeley for the Berkeley Software Distribution (BSD) operating system has been used as the starting point for most available systems. The BSD operating system [McKusick et al. 1996], which is a flavor of the UNIX system [Ritchie and Thompson 1978], was selected as the foundation for implementing the first TCP/IP suite back in the 1980s.
1.2 Scope
Naturally, modeling a system requires high degrees of observability and controllability. For us this means free access to both the networking software's source code and the host computer's technical specifications. (When we speak of free, we are referring to freedom, not price.) Today's personal computer (PC) technology provides this. Indeed, not only is there plenty of freely available, detailed technical documentation on Intel's IA32 PC hardware, but there are also several PC operating systems with an open source policy. Of those we decided to work with FreeBSD, a 4.4BSD-Lite derived operating system optimized to run on Intel's IA32 PCs.

When searching for a networking application for which a performance model could be of importance, we found software routers. A software router can be defined as a general-purpose computer that executes a computer program capable of forwarding IP datagrams among network interface cards attached to its input/output bus (I/O bus). Evidently, software routers have performance limitations because they use a single central processing unit (CPU) and a single shared bus to process all packets. However, due to the ease with which they can be programmed to support new functionality (securing communications, shaping traffic, supporting mobility, translating network addresses, supporting application proxies, and performing n-level routing), software routers are important at the edge of the Internet.
1.3 Dissertation’s thesis
From all the above we came up with the following dissertation thesis:
Is it possible to build a queuing network model of the Internet
protocols’ BSD implementation that can be used for predicting
with reasonable accuracy not only the mean values of the
operational parameters studied but also their cumulative
probability function? And, can this model be applied for
studying the performance of PC-based software routers
supporting communication quality assurance mechanisms, or
Quality-of-Service (QoS) mechanisms?
1.4 Synopsis
This work makes three main contributions. In no particular order:
• A detailed performance study of the software implementation of the
TCP/IP protocols suite, when executed as part of the kernel of a BSD
operating system over generic PC hardware
• A validated queuing network model of the studied system, solved by
computer simulation
• An I/O bus utilization guard mechanism for improving the performance
of software routers supporting QoS mechanisms and built upon PC
hardware and software
This document presents our experiences building a performance model of a PC-based software router. The resulting model is an open multiclass priority network of queues that we solved by simulation. While the model is not particularly novel from the system modeling point of view, in our opinion it is an interesting result to show that such a model can estimate, with high accuracy, not just average performance numbers but the complete probability distribution function of packet latency, allowing performance analysis at several levels of detail. The validity and accuracy of the multiclass model has been established by contrasting its packet latency predictions in both time and probability spaces. Moreover, we introduced into the validation analysis the predictions of a router's single queue model. We did this to quantitatively assess the advantages of the more complex multiclass model over the simpler and widely used, but (as shown here) less accurate, single queue model, under the considered scenario in which the router's CPU, and not the communications links, is the system bottleneck. The single queue model was also solved by simulation.
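To make concrete what solving such a model by simulation involves, the sketch below simulates the simpler of the two models just mentioned, the single queue model: packets arrive following a Poisson process, one server (standing in for the router's CPU) serves them in FIFO order with a fixed service time, and the program reports empirical latency quantiles rather than just a mean. This is our illustration, not the dissertation's simulator; the arrival rate and service time are hypothetical placeholders, not measured parameters.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Minimal simulation of a single-queue (M/D/1) router model. */
#define NPKTS  100000
#define LAMBDA 80000.0  /* mean arrival rate, packets/s (assumed) */
#define SVC    10e-6    /* per-packet CPU service time, s (assumed) */

static double exp_interarrival(double rate) {
        return -log(1.0 - drand48()) / rate;   /* exponential sample */
}

static int cmp_double(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
}

int main(void) {
        static double lat[NPKTS];
        double t = 0.0, busy_until = 0.0;

        srand48(1);
        for (int i = 0; i < NPKTS; i++) {
                t += exp_interarrival(LAMBDA);          /* arrival time */
                double start = t > busy_until ? t : busy_until;
                busy_until = start + SVC;               /* FIFO service */
                lat[i] = busy_until - t;                /* wait + service */
        }
        qsort(lat, NPKTS, sizeof lat[0], cmp_double);
        for (int p = 50; p <= 99; p += 7)               /* latency quantiles */
                printf("p%02d = %.1f us\n", p, lat[(NPKTS * p) / 100] * 1e6);
        return 0;
}

Reporting quantiles rather than a single mean is precisely what lets a queuing model of this kind be compared against a measured cumulative probability function, as done in the validation chapters.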
This document also addresses the problem of resource sharing in PC-based software routers supporting QoS mechanisms. Others have put forward solutions that focus on suitably distributing the workload of a software router's CPU; see this chapter's section on related work. However, the evident and increasing gap in operation speed between a PC-based software router's CPU and its I/O bus means, to us, that attention must be paid to the effect that this bus's limitations have on the system's overall performance. Consequently, we propose a mechanism that jointly controls both I/O bus and CPU operation for improved PC-based software router performance. This mechanism involves changes to the operating system kernel code and assumes the existence of certain network interface card functions, although it does not require changes to the PC hardware. A performance study is shown that provides insight into the problem and helps to evaluate both the effectiveness of our approach and several software router design trade-offs.
1.5 Outline
The rest of this chapter is devoted to discussing related work. Chapter 2's objective is to understand the influence that operating system design and implementation techniques have on the performance of the Internet protocols' BSD software implementation. Chapter 3 presents our experiences building, validating and parameterizing a performance model of a PC-based software router. Moreover, it presents some results from applying the model to capacity planning. Chapter 4 addresses the problem of resource sharing in PC-based software routers supporting communication quality assurance mechanisms. Furthermore, it presents our mechanism for jointly controlling the router's CPU and I/O bus, indispensable for a software router to support such mechanisms.
1.6 Related work
Cabrera et al. [1988] (in reality, they presented their study's results in July 1985) is the earliest work we found in the publicly accessible literature on the experimental evaluation of TCP/IP implementations. Theirs was an ambitious study, whose objective was to assess the impact that different processors, network hardware interfaces, and Ethernets have on communication across machines, under various host and communication-media load conditions. Their measurements highlighted the ultimate bounds on communication performance perceived by application programs. Moreover, they presented a detailed timing analysis of the dynamic behavior of the networking software. They studied the TCP/IP implementation within 4.2BSD, run on then state-of-the-art minicomputers attached to legacy Ethernets. Consequently, their study's results are no longer valid. Worse yet, they used the gprof(1) and kgmon(8) tools for profiling. These tools are implemented purely in software and, consequently, produce results with limited accuracy when compared to results produced by hardware performance-monitoring counters, as we use. However complete their study was, they did not set out to build a system model, as we do.
Sanghi et al. [1990] is the earliest work we found in the publicly accessible literature on the experimental evaluation of TCP/IP implementations that uses software profiling. Their study is very narrow when compared to ours, in the sense that its objective was only to determine the suitability of round-trip time estimators for TCP implementations.
Papadopoulos and Parulkar [1993] presented results of a more general study on TCP/IP implementation performance, obtained using software profiling. They studied the TCP/IP implementation within a 4.3BSD-derived operating system, SunOS. Like our work, their study's objective was to characterize the performance of the networking software. Unlike ours, however, their study was not aimed at producing a system model.
Kay and Pasquale [1996] (in reality, they presented their study's results in September 1994) conducted another TCP/IP implementation performance study. Differently from previous work, their study was carried out at a finer granularity level; that is, they went inside the networking functions and analyzed how these functions spend their time: touching data, protocol-specific processing, memory buffer manipulation, error checking, data structure manipulation, and operating system overhead. Moreover, their study's objective was to guide a search for bottlenecks in achieving high throughput and low latency. Once again, they did not pursue building a system model.
Ramamurthy [1988] modeled the UNIX system using a queuing network model. However, his model characterizes the system's multitasking properties and therefore cannot be applied to study the networking software, which is governed by the software interrupt mechanism. Moreover, Ramamurthy's model was only solved for predicting mean values, which is not enough when conducting performance analyses of today's networking software. What is required is a model capable of producing the complete probability function of operational parameters, so that analysis at several levels of detail may be performed.
Björkman and Gunningberg [1998] modeled an Internet protocols implementation using a queuing network model; however, their model characterizes a user-space implementation (based on a parallelized version of the University of Arizona's x-kernel [Hutchinson and Peterson 1991]) and therefore disregards the software interrupt mechanism. Moreover, their model was aimed at predicting only system throughput (measured in packets per second) when the protocols are hosted by shared-memory multiprocessor systems. Besides, their study was aimed only at high-performance distributed computing, where it is assumed that connections are always open with a steady stream of messages; that is, no retransmissions or other unusual events occur. All this prevents the use of their model for our intended applications.
Packet service disciplines and their associated performance issues have been widely studied in the context of bandwidth scheduling in packet-switched networks [Zhang 1995]. Arguably, several such disciplines now exist that are both fair, in terms of assuring access to link bandwidth in the presence of competing packet flows, and easy to implement efficiently, in both hardware and software. Recently, interest has arisen in carrying these packet service disciplines over to CPU scheduling for programmable and software routers. Qie et al. [2001] and Chen and Morris [2001], for a software router, and Pappu and Wolf [2001], for a programmable router, have put forward solutions that focus on suitably distributing the workload of a router's processor. However, neither the former nor the latter consider the problem of I/O bus bandwidth scheduling. And this problem is important in the context of software routers, as we demonstrate.
Chiueh and Pradhan [2000] recognized the suitability and inherent limitations of using PC technology for implementing software routers. In order to overcome the performance limitations of PCs' I/O buses, and to construct high-speed routers, they proposed using clusters of PCs. In this architecture, several PCs, each having at most one network interface card, are tightly coupled by means of a Myrinet system area network to form a software router with as many subnetwork attachments as computing nodes. After solving all the inter-node communication, memory coherency, and routing table distribution problems (arguably not an easy task), a "clustered router" may be able to overcome the limitations of current PCs' I/O buses and not only provide high performance (in terms of packets per second) but also support QoS mechanisms. Our work, however, is aimed at supporting QoS mechanisms in clearly simpler and less expensive PC-based software routers.
Scottis, Krunz and Liu [1999] recognized the performance limitations of the omnipresent Peripheral Component Interconnect (PCI) PC I/O bus for supporting QoS mechanisms, and proposed an enhancement. In contrast to our software enhancement, theirs introduced a new bus arbitration protocol that has to be implemented in hardware. Moreover, their bus arbitration protocol aims to improve bus support only for periodic and predictable real-time streams, and this is clearly not suitable for Internet routers.
There is general agreement in the PC industry that the demands of current user applications are quickly overwhelming the shared parallel architecture and limited bandwidth of the various types of PCI buses. (Erlanger, L., editor, "High-performance busses and interconnects," http://www.pcmag.com/article2/0,4149,293663,00.asp, current as of 8 July 2002.) With this increasing demand and PCI's lack of QoS, PCI has been due for a replacement in different system and application scenarios for more than a few years now. 3GIO, InfiniBand, RapidIO and HyperTransport are new technologies designed to improve I/O and device-to-device performance in a variety of system and application categories; however, not all are direct replacements for PCI. In this sense, 3GIO and InfiniBand may be considered direct candidates to succeed PCI. All these technologies have in common that they define packet-based, point-to-point serial connections with layered communications, which readily support QoS mechanisms. However, due to the large installed base of PCI equipment, it is expected that the PCI bus will remain in use for some years. Consequently, we think our work is important.



Chapter 2

Internet protocols' BSD software implementation
2.1 Introduction
This chapter’s objective is to understand the influence that operating system de-
sign and implementation techniques have over the performance of the Internet proto-
cols’ BSD software implementation. Later chapters discuss how to apply this knowl-
edge for building a performance model of a personal computer-based software router.
The chapter is organized as follows. The first three sections set the conceptual
framework for the document but may be skipped by a reader familiar with BSD’s net-
working subsystem. Section 2.2 presents a brief overview on BSD’s interprocess com-
munication facility. Section 2.3 presents BSD’s networking architecture. And BSD’s
software interrupt mechanism is presented in section 2.4. Following sections present the
chief features, components and structures of the Internet protocol’s BSD software im-
plementation. These sections present several ideas and diagrams that are referenced in
this document’s later sections and should not be skipped. Section 0 presents the soft-
ware implementation while section 2.6 the run-time environment. Finally, section 2.7
presents brief descriptions of other system’s networking architectures and section 2.8
summarizes.

2.2 Interprocess communication in the BSD operating system
The BSD operating system [McKusick et al. 1996] is a flavor of UNIX [Ritchie and Thompson 1978], and historically UNIX systems were weak in the area of interprocess communication [Wright and Stevens 1995]. Before the 4.2 release of the BSD operating system, the only standard interprocess communication facility was the pipe. The requirements of the Internet [Stevens 1994] research community resulted in a significant effort to address the lack of a comprehensive set of interprocess communication facilities in UNIX. (At the time 4.2BSD was being designed there was no global Internet but an experimental computer network sponsored by the United States of America's Defense Advanced Research Projects Agency.)
2.2.1 BSD’s interprocess communication model
4.2BSD’s interprocess communication facility was designed to provide a suffi-
ciently general interface upon which distributed-computing applications—sharing of
physically distributed computing resources, distributed parallel computing, computer
supported telecommunications—could be constructed independently of the underlying
communication protocols. This facility has outlasted and is present in the current 4.4 re-
lease. For now on, when referring to the BSD operating system, we mean the 4.2 release
or any follow-on release like the current 4.4. While designing the interprocess-
communication facilities that would support these goals, the developers identified the
following requirements and developed unifying concepts for each:
• The system must support communication networks that use different sets of protocols, different naming conventions, different hardware, and so on. The notion of a communication domain was defined for these reasons.
• A unified abstraction for an endpoint of communication is needed that can be manipulated with a file descriptor. The socket is the abstract object from which messages are sent and received.
• The semantic aspects of communication must be made available to applications in a controlled and uniform way. So, all sockets are typed according to their communication semantics.
• Processes must be able to locate endpoints of communication so that they can rendezvous without being related. Hence, sockets can be named.
Figure 2.1 depicts the OMT Object Model [Rumbaugh et al. 1991] for these requirements.
2.2.2 Typical use of sockets

First, a socket must be created with the socket system call, which returns a file descriptor that is then used in later socket operations:

s = socket(domain, type, protocol);
int s, domain, type, protocol;

After a socket has been created, the next step depends on the type of socket being used. The most commonly used type of socket requires a connection before it can be used. The creation of a connection between two sockets usually requires that each socket have an address (name) bound to it. The address to be bound to a socket must be formulated in a socket address structure.

error = bind(s, addr, addrlen);
int error, s, addrlen;
struct sockaddr *addr;

A connection is initiated with a connect system call:

error = connect(s, peeraddr, peeraddrlen);
int error, s, peeraddrlen;
struct sockaddr *peeraddr;

When a socket is to be used to wait for connection requests to arrive, the system call pair listen/accept is used instead:

error = listen(s, backlog);
int error, s, backlog;

Connections are then received, one at a time, with:

snew = accept(s, peeraddr, peeraddrlen);
int snew, s, peeraddrlen;
struct sockaddr *peeraddr;
Figure 2.1—OMT object model for BSD IPC
26 BSD’
S NETWORKING ARCHITECTURE
—2.3
O
SCAR
I
VÁN
L
EPE
A
LDAMA
P
H
.D. D
ISSERTATION

A variety of calls are available for transmitting and receiving data. The usual read and write system calls, as well as the newer send and recv system calls, can be used with sockets that are in a connected state. The sendto and recvfrom system calls are most useful for connectionless sockets, where the peer's address is specified with each transmitted message. Finally, the sendmsg and recvmsg system calls support the full interface to the interprocess communication facilities.
The shutdown system call terminates data transmission or reception at a socket. Sockets are discarded with the normal close system call.
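
To tie these calls together, the following is a minimal connection-oriented client. It is our illustration, not code from the BSD sources; the peer address (the echo service at 127.0.0.1, port 7) is an arbitrary example.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
        /* create a connection-oriented socket in the Internet domain */
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        /* formulate the peer's address in a socket address structure */
        struct sockaddr_in peer;
        memset(&peer, 0, sizeof peer);
        peer.sin_family = AF_INET;
        peer.sin_port = htons(7);                  /* echo service, e.g. */
        peer.sin_addr.s_addr = inet_addr("127.0.0.1");

        /* initiate the connection */
        if (connect(s, (struct sockaddr *)&peer, sizeof peer) < 0) {
                perror("connect"); return 1;
        }

        /* the usual write and read calls now work on the socket */
        char buf[64];
        write(s, "hello\n", 6);
        ssize_t n = read(s, buf, sizeof buf);
        if (n > 0) fwrite(buf, 1, (size_t)n, stdout);

        close(s);                                  /* discard the socket */
        return 0;
}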
2.3 BSD’s networking architecture
BSD’s networking architecture has two planes, as shown in Figure 2.2: the user
plane and the memory management plane. The user plane defines a framework within
which many communication domains may coexist and network services can be imple-
mented. The memory management plane defines memory management policies and
procedures that comply with the user plane’s memory requirements. More on this a lit-
tle further.
2.3.1 Memory management plane
It is well known [McKusick et al. 1996; Wright and Stevens 1995] that the requirements placed by interprocess communication and network protocols on a memory management scheme tend to be substantially different from those of other parts of the operating system. Basically, network message processing requires attaching and/or detaching protocol headers and/or trailers to messages. Moreover, sometimes these headers' and trailers' sizes vary with the communication session's state; at other times the number of these protocol elements is, a priori, unknown. Consequently, a special-purpose memory management facility was created by the BSD development team for the use of the interprocess communication and networking systems.

Figure 2.2—BSD's two-plane networking architecture. The user plane is depicted with its layered structure, which is described in following sections. Bold circles in the figure represent defined interfaces between planes and layers: A) Socket-to-Protocol, B) Protocol-to-Protocol, C) Protocol-to-Network Interface, and D) User Layer-to-Memory Management. Observe that this architecture implies that layers share the responsibility of taking care of the storage associated with transmitted data.
The memory management facilities revolve around a data structure called an mbuf. Mbufs, or memory buffers, are 128 bytes long, with 100 or 108 bytes of this space reserved for data storage. There are three sets of header fields that might be present in an mbuf and which are used for identification and management purposes. Multiple mbufs can be linked to form mbuf chains that hold an arbitrary quantity of data. For very large messages, the system can associate larger sections of data with an mbuf by referencing an external mbuf cluster from a private virtual memory area. Data is stored either in the internal data area of the mbuf or in the external cluster, but never in both.
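
To give the flavor of this layout, here is a simplified rendition of the mbuf declaration; the three header sets mentioned above correspond to m_hdr, pkthdr and m_ext. It is illustrative only: the real structure, declared in the BSD kernel's sys/mbuf.h, carries more fields (for example a cluster free routine) and a large set of accompanying macros. On a 32-bit machine the sizes below work out to the 100 and 108 data bytes mentioned in the text.

/* Simplified sketch of a 4.4BSD-style mbuf; see sys/mbuf.h for the
 * real declaration. */
#define MSIZE 128                                 /* total mbuf size */

struct m_hdr {                                    /* always present */
        struct mbuf *mh_next;                     /* next mbuf in chain */
        struct mbuf *mh_nextpkt;                  /* next chain in queue */
        int          mh_len;                      /* data in this mbuf */
        char        *mh_data;                     /* location of data */
        short        mh_type;                     /* type of data */
        short        mh_flags;                    /* M_PKTHDR, M_EXT, ... */
};

#define MLEN  (MSIZE - sizeof(struct m_hdr))      /* 108: plain mbuf */

struct pkthdr {                                   /* first mbuf of a packet */
        int           len;                        /* total packet length */
        struct ifnet *rcvif;                      /* receiving interface */
};

#define MHLEN (MLEN - sizeof(struct pkthdr))      /* 100: with packet header */

struct m_ext {                                    /* external cluster in use */
        char    *ext_buf;                         /* start of the cluster */
        unsigned ext_size;                        /* size of the cluster */
};

struct mbuf {
        struct m_hdr m_hdr;
        union {
                struct {
                        struct pkthdr MH_pkthdr;
                        union {
                                struct m_ext MH_ext;     /* external storage */
                                char MH_databuf[MHLEN];  /* or internal data */
                        } MH_dat;
                } MH;
                char M_databuf[MLEN];             /* internal data, plain case */
        } M_dat;
};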
2.3.2 User plane
The user plane, as said before, provides a framework within which many communication domains may coexist and network services can be implemented. Networking facilities are accessed through the socket abstraction. These facilities include:
• A structured interface to the socket layer.
• A consistent interface to hardware devices.
• Network-independent support for message routing.
The BSD development team devised a pipelined implementation for the user plane with three vertically delimited stages or layers. As Figure 2.2 and Figure 2.3 show, these layers are the sockets layer, the protocols layer, and the network-interfaces layer. Jointly, the protocols layer and the network-interfaces layer are named the networking support. Basically, the sockets layer is a protocol-independent interface used by applications to access the networking support. The protocols layer contains the implementation of the communication domains supported by the system, where each communication domain may have its own internal structure. Last but not least, the network-interfaces layer is mainly concerned with driving the transmission media involved.

Figure 2.3—BSD networking user plane's software organization
Entities at different layers communicate through well-defined interfaces and their execution is decoupled by means of message queues, as shown in Figure 2.3. Concurrent access to these message queues is controlled by the software interrupt mechanism, as explained in the next section.
2.4 The software interrupt mechanism and networking processing
Networking processing within the BSD operating system is pipelined and interrupt driven. In order to show how this works, let us describe the sequence of chief events that occur during message reception and transmission. If you feel lost during the first read, please keep an eye on Figure 2.3 during the second pass. It helps. During the following description, when we say "the system" we mean a computer system executing a BSD-derived operating system.
2.4.1 Message reception
When a network interface card captures a message from a communications link, it posts a hardware interrupt to the system's central processing unit. Upon catching this interrupt (preempting any running application program and entering supervisor mode and the operating system kernel's address space), the system executes some network-interfaces layer task and marshals the message from the network interface card's local memory to a protocols layer's mbuf queue in main memory. During this marshaling the system does any data-link protocol duties and determines to which communication domain the message is destined. Just after leaving the message in the selected protocol's mbuf queue, and before terminating the hardware-interrupt execution context, the system posts a software interrupt addressed to the corresponding protocols layer task. Considering that the arrived message is destined to one of the system's application programs and that the addressed application has an opened socket, the system, upon catching the outstanding software interrupt, executes the corresponding protocols layer task and marshals the message from the protocol's mbuf queue to the addressed socket's mbuf queue. All protocols processing within the corresponding communication domain takes place in this software interrupt's context. Just after leaving the message in the addressed socket's mbuf queue, and before terminating the software-interrupt execution context, the system flags for execution any application program that might be sleeping on the addressed socket, waiting for a message to arrive. When the system is finished with all the interrupt execution contexts and its scheduler schedules for execution the application program that just received the message, the system executes the corresponding sockets layer task and marshals the message from the socket's mbuf queue to the corresponding application's buffer in user address space. Afterwards, the system exits supervisor mode and the address space of the operating system's kernel and resumes the execution of the communicating application program.
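To make the hand-off concrete, the following is a minimal sketch, in the style of 4.4BSD's ether_input(), of the enqueue that ends the hardware-interrupt context. The function name is illustrative; the spl idiom that guards the queue is discussed in section 2.4.3.

    /*
     * Minimal sketch, in the style of 4.4BSD's ether_input(): leave the
     * message in IP's mbuf queue and post the protocols layer software
     * interrupt. Runs in hardware-interrupt context.
     */
    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/if.h>
    #include <net/netisr.h>

    extern struct ifqueue ipintrq;    /* IP's protocol input queue */

    static void
    sketch_enqueue_for_ip(struct mbuf *m)
    {
        int s = splimp();             /* guard the shared queue */
        if (IF_QFULL(&ipintrq)) {
            IF_DROP(&ipintrq);        /* queue full: count and drop */
            m_freem(m);
        } else {
            IF_ENQUEUE(&ipintrq, m);  /* leave the message for IP */
            schednetisr(NETISR_IP);   /* post the splnet soft interrupt */
        }
        splx(s);
    }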
Here let us spot a performance detail of the previous description. The message marshaling between mbuf queues does not always imply a data copy operation. Copy operations are involved when marshaling messages between a network interface card's local memory and a protocol's mbuf queue, and between a socket's mbuf queue and an application program's buffer. But there is no data copy operation between a protocol's and a socket's mbuf queues. Here, only mbuf references (also known as pointers) are copied.
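A sketch of the contrast, assuming 4.4BSD-style socket-buffer and uio interfaces (the function names are illustrative):

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <sys/socketvar.h>
    #include <sys/uio.h>

    /* Protocols layer to socket queue: only an mbuf reference moves. */
    static void
    sketch_deliver_to_socket(struct socket *so, struct mbuf *m)
    {
        sbappend(&so->so_rcv, m);   /* link the chain; no payload copy */
        sorwakeup(so);              /* wake processes sleeping on the socket */
    }

    /* Socket queue to user buffer: here the payload really is copied. */
    static int
    sketch_copy_to_user(struct mbuf *m, int len, struct uio *uio)
    {
        return uiomove(mtod(m, caddr_t), len, uio);
    }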
2.4.2 Message transmission
Message transmission network processing may be initiated by several events; for instance, by an application program issuing the sendmsg (or similar) system call. But it can also be initiated when forwarding a message, when a protocol timer expires, or when the system has deferred messages.
When an application program issues the sendmsg system call, giving a data buffer as one of the arguments (other arguments are, for instance, the communication domain identification and the destination socket's address), the system enters supervisor mode and the operating system's kernel address space and executes some sockets layer task. This task builds an mbuf according to the selected communication domain and socket type and, considering that the given data buffer fits inside one mbuf, copies the contents of the given data buffer into the built mbuf's payload. If the communication channel protocol's state allows the system to immediately transmit a message, the system executes the appropriate protocols layer task and marshals the message through the arbitrary protocol structure of the corresponding communication domain. Among other protocol-dependent tasks, the system here selects a communication link for transmitting the message out. Considering that the network interface card attached to the selected communications link is idle, the system executes the appropriate network-interfaces layer task and marshals the message from the corresponding mbuf in main memory to the network interface card's local memory. At this point the system hands over responsibility for the message's delivery to the network interface card.
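As an illustration, here is a condensed sketch of the no-deferral UDP path, assuming 4.4BSD-style interfaces (error handling elided; the function name is illustrative). The whole chain runs in the sending process's execution context.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <sys/socket.h>
    #include <sys/socketvar.h>
    #include <netinet/in.h>
    #include <netinet/in_pcb.h>
    #include <netinet/udp_var.h>

    static int
    sketch_udp_send(struct socket *so, struct mbuf *payload)
    {
        /* sosend() has already copied the user's buffer into payload */
        return udp_output(sotoinpcb(so), payload,
            (struct mbuf *)0, (struct mbuf *)0);
        /* udp_output() prepends the UDP header and calls ip_output(),
         * which routes the datagram and calls ether_output(); if the
         * selected card is idle, ether_output() starts the transfer
         * through (*ifp->if_start)(ifp) before returning. */
    }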
Observe that under the considered situation the system processes the message transmission in a single execution context (that of the communicating application) and no intermediary buffering is required. On the contrary, if for instance the system finds an addressed network interface card busy, the system places the mbuf in the corresponding network interface's mbuf queue and defers the execution of the network-interfaces layer task. For cases like this, network interface cards are built to produce a hardware interrupt not just when receiving a message but also at the end of every busy period. Moreover, network interface cards' hardware-interrupt handlers are built to always check for deferred messages at the corresponding network interface's output mbuf queue. When deferred messages are found, the system does whatever is required to transmit them out. Observe that in this case the message transmission is done in the execution context of the network interface card's hardware interrupt.
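A minimal sketch of such a driver start routine follows; nic_transmitter_idle() and hw_copy_to_nic() are hypothetical stand-ins for the device-specific status test and mbuf-to-card copy. The routine would be called both when a message is queued and from the hardware-interrupt handler at the end of every busy period.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/if.h>

    static int nic_transmitter_idle(struct ifnet *);            /* hypothetical */
    static void hw_copy_to_nic(struct ifnet *, struct mbuf *);  /* hypothetical */

    static void
    sketch_xstart(struct ifnet *ifp)
    {
        struct mbuf *m;

        while (nic_transmitter_idle(ifp)) {
            IF_DEQUEUE(&ifp->if_snd, m);  /* pick up a deferred message */
            if (m == NULL)
                break;                    /* output queue drained */
            hw_copy_to_nic(ifp, m);       /* marshal mbuf to card memory */
            m_freem(m);
        }
    }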
Another scenario happens if a communication channel protocol's state prevents the system from immediately transmitting a message; for instance, when a TCP connection's transmission window is closed [Stevens, 1994]. In this case, the system places the message's mbuf in the corresponding socket's mbuf queue and defers the execution of
the protocols layer task. Of course, a deferring protocol must have some built-in means
for later transmitting or discarding any deferred message. For instance, TCP may open a
connection’s transmission window after receiving one or more segments from a peer.
Upon opening the transmission window, TCP will start transmitting as many deferred
messages as possible as soon as possible—that is, just after finishing message reception.
Observe that in this case the message transmission is done in the execution context of
the protocols layer reception software interrupt. Also observe that when transmitting de-
ferred messages at the protocols layer, the system may defer again at the network-
interfaces layer as explained above.
Communications protocols generally define timed operations that require the interchange of messages with peers, and thus require transmitting messages; for instance, TCP's delayed acknowledgment mechanism [Stevens 1994]. In these cases, protocols may rely on the BSD callout mechanism [McKusick et al, 1996] and request the system to execute some task at predefined times. The BSD callout mechanism uses the system's real-time clock for scheduling the execution of any enlisted task. It arranges itself to issue software interrupts every time an enlisted task is required to execute. If the called out task initiates the transmission of a networking message, this message transmission is done in the execution context of the callout mechanism's software interrupt. Once again, transmission hold-off may happen at the network-interfaces layer as explained above.
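For instance, a protocol might arm a timed operation through the callout interface as in the following sketch, assuming the 4.4BSD-style timeout() call; the handler name and the 200 ms period are illustrative.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>          /* for hz, clock ticks per second */

    static void
    sketch_delack_handler(void *arg)
    {
        /* Runs in the callout software-interrupt context; a message
         * transmission started here may still be deferred at the
         * network-interfaces layer if the selected card is busy. */
    }

    static void
    sketch_arm_delack(void *conn)
    {
        timeout(sketch_delack_handler, conn, hz / 5);   /* ~200 ms */
    }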
Finally, let us consider the message-forwarding scenario. In this scenario some
communications protocol—implemented at the protocols layer—is capable of forward-
ing messages; for instance, the Internet Protocol (IP) [Stevens 1994]. During message
reception, a protocol like IP may find out that the received message is not addressed to
the local system but to another system that it knows how to reach by means of a routing table. In this case, the protocol launches a message transmission task for the message being forwarded. Observe that this message transmission processing is done in the execution context of the protocols layer reception software interrupt.
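Condensed to its essentials, the decision looks like the following sketch, taken as if from inside 4.4BSD's IP input processing (where ip_forward() is visible); addressed_to_us() and deliver_locally() are hypothetical stand-ins for the actual address matching and local delivery code.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <netinet/in.h>
    #include <netinet/ip.h>

    static int addressed_to_us(struct ip *);        /* hypothetical */
    static void deliver_locally(struct mbuf *);     /* hypothetical */
    static void ip_forward(struct mbuf *, int);     /* as in ip_input.c */

    static void
    sketch_ip_input_decision(struct mbuf *m)
    {
        struct ip *ip = mtod(m, struct ip *);

        if (addressed_to_us(ip))
            deliver_locally(m);       /* continue up to TCP, UDP or ICMP */
        else
            ip_forward(m, 0);         /* transmission happens within this
                                         same software-interrupt context */
    }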
2.4.3 Interrupt priority levels
The BSD operating system assigns a priority level to each hardware and software
interrupt handler. The ordering of the different priority levels means that some interrupt
handler preempts the execution of any lower-priority one. One concern with these dif-
ferent priority levels is how to handle data structures shared between interrupt handlers
executed at different priority levels. The BSD interprocess communication facility code
is sprinkled with calls to the functions
splimp
and
splnet
. These two calls are always
paired with a call to
splx
to return the processor to the previous level. The result of this
synchronization mechanism is a sequence of events like the one depicted in Figure 2.4.
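Before walking through that sequence, here is a minimal sketch of the pairing idiom itself, assuming the 4.4BSD queue macros: code running at splnet must raise to splimp before touching a queue it shares with network-card interrupt handlers.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/if.h>

    static void
    sketch_enqueue_shared(struct ifqueue *q, struct mbuf *m)
    {
        int s = splimp();   /* block interrupt handlers that also touch q */
        IF_ENQUEUE(q, m);
        splx(s);            /* return to the previous level, e.g., splnet */
    }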
1) While a sockets layer task is executing at spl0, an Ethernet card receives a message and posts a hardware interrupt, causing a network-interfaces layer task (the Ethernet device driver) to execute at splimp. This interrupt preempts the sockets layer code.
2) While the Ethernet device driver is running, it places the received message into the appropriate protocols layer's input mbuf queue (for instance IP's) and schedules a software interrupt to occur at splnet. The software interrupt won't take effect immediately since the kernel is currently running at a higher priority level.
3) When the Ethernet device driver completes, the protocols layer executes at splnet.
4) A terminal device interrupt occurs, say the completion of a SLIP packet. It is handled immediately, preempting the protocols layer, since terminal processing's priority, at spltty, is higher than the protocols layer's.
5) The SLIP device driver places the received packet onto IP's input mbuf queue and schedules another software interrupt for the protocols layer.
6) When the SLIP device driver completes, the preempted protocols layer task continues at splnet and finishes processing the message received from the Ethernet device driver. Then, it processes the message received from the SLIP device driver. Only when IP's input mbuf queue is empty will the protocols layer task return control to whatever it preempted (the sockets layer task in this example).
7) The sockets layer task continues from where it was preempted.
Figure 2.4—Example of priority levels and kernel processing. [The figure plots kernel activity over the priority levels spl0, splnet, spltty and splimp: the socket code at spl0 is preempted by the Ethernet device driver at splimp (step 1); protocol processing (IP input + TCP input) runs at splnet (steps 3 and 6) and is itself preempted by the SLIP device driver at spltty (steps 4 and 5); the socket code resumes at step 7.]
2.5 BSD implementation of the Internet protocols
suite
Figure 2.5 shows control and data flow diagrams of the chief tasks that implement the Internet protocols suite within BSD. Furthermore, it shows their control and data associations with chief tasks at both the sockets layer and the network-interfaces layer. Within the 4.4BSD-lite source code distribution, the files implementing the Internet protocols suite are located in the sys/netinet subdirectory. On the other hand, the files implementing the sockets layer are located in the sys/kern subdirectory. The files implementing the network-interfaces layer are scattered among a few subdirectories. The tasks implementing general data-link protocol duties, such as Ethernet, the address resolution protocol or the point-to-point protocol, are located in the sys/net subdirectory. On the other hand, the tasks implementing hardware drivers are located in hardware-dependent subdirectories, such as the sys/i386/isa or the sys/pci subdirectories.
As can be seen in Figure 2.5, the protocols implementation, in general, provides output and input tasks per protocol. In addition, the IP protocol has a special ip_forward task. It can also be seen that the IP protocol does not have an input task. Instead, the implementation comes with an ipintr task. The fact that IP input processing is started by a software interrupt may be the cause of this apparent exception to the general rule. (The FreeBSD operating system drops the ipintr task in favor of an ip_input task.) Observe that the figure depicts all the control and data flows corresponding to the message reception and message transmission scenarios described in the previous section.
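Condensed from its 4.4BSD structure, ipintr is essentially a dequeue loop running in the splnet software-interrupt context; the actual IP input processing is elided from this sketch.

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/if.h>

    extern struct ifqueue ipintrq;    /* IP's protocol input queue */

    void
    sketch_ipintr(void)
    {
        struct mbuf *m;
        int s;

        for (;;) {
            s = splimp();             /* the queue is shared with drivers */
            IF_DEQUEUE(&ipintrq, m);
            splx(s);
            if (m == NULL)
                return;               /* queue empty: soft interrupt done */
            /* ...verify the IP header, then deliver the message locally
             * or hand it to ip_forward()... */
        }
    }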
In order to complete the description let me note some facts about the network-interfaces layer. The tasks shown at the bottom half of the layer depict hardware-dependent tasks. The names depicted, Xintr, Xread and Xstart, are not actual task names but name templates. For building actual task names the capital "X" is substituted by the name of a hardware device. For example, the FreeBSD source code distribution has xlintr, xlread and xlstart for the xl device driver, which is the device driver used for 3COM's 3C900 and 3C905 families of PCI/Fast-Ethernet network interface cards.
Figure 2.5—BSD implementation of the Internet protocol suite. Only chief tasks, message queues and associations are shown. Please note that some control flow arrows are sourced at the bounds of the squares delimiting the implementation layers. This is for denoting that a task is executed after an external event, such as an interrupt or a CPU scheduler event. [The figure shows user processes entering the kernel through system calls into the sockets layer (send/sosend and recv/soreceive with the socket transmit and receive buffers); the protocols layer tasks ipintr, tcp_input, udp_input, icmp_input, ip_forward, ip_output, tcp_output and udp_output around the ipintrq queue; and the network-interfaces layer tasks Xintr, Xread, ether_input, ether_output and Xstart around the if_snd queue and the received/transmitted packets, with software interrupts at splnet (caused by the interface layer) and at splimp (caused by the hardware-interrupt handler).]
2.6 Run-time environment: the host’s hardware
The BSD operating system was devised to run on computing hardware with an organization much like the one shown in Figure 2.6. This computing hardware organization is widely used for building personal computers, low-end servers and workstations, or high-end embedded systems. The shown organization is an instance of the classical stored-program computer architecture with the following features [Hennessy and Patterson 1995]:
• A single central processing unit
• A four-level, hierarchic memory (not shown)
• A two-tier, hierarchic bus
• Interrupt driven input/output processing (not shown)
• Programmable or direct memory access network interface cards
2.6.1 The central processing unit and the memory hierarchy
Nowadays, personal computers and similar computing systems are provisioned with high-performance microprocessors. These microprocessors generally leverage the following technologies: very low operation cycle periods, pipelines, multiple instruction issue, out-of-order and speculative execution, data prefetching, and trace caches.
In order to sustain a high operation throughput, this kind of microprocessor requires very fast access to instructions and data. Unfortunately, current memory technology lags behind microprocessor technology in its performance/price ratio. That is, low-latency memory components have to be small in order to remain economically feasible.
Consequently, personal computers and similar computing systems (but also other computing systems using high-performance microprocessors) are suited with hierarchically organized memory.
Figure 2.6—Chief components in a general purpose computing hardware. [The figure shows a CPU and main memory joined by a bridge across the system bus, with network interface cards and other devices attached to the I/O bus.]
Ever faster and thus smaller memory components are placed lower in the hierarchy and thus closer to the microprocessor. Several caching techniques are used for mapping large address spaces onto the smaller and faster memory components, which in consequence are named cache memories [Hennessy and Patterson 1995]. These caching techniques mainly consist of replacement policies for swapping out of the cache memory those computer program address space sections (named address space pages) that are not expected to be used in the near future, in favor of active ones. The caching techniques also determine what to do with the swapped-out address space sections: they may or may not be stored in the memory component at the next higher level, considering that the computer program's complete address space is always resident in the memory component at the top of the hierarchy.
Another important aspect of the microprocessor-memory relationship is wire latency, that is, the time required for a data signal to travel from the output ports of a memory component to the input ports of a microprocessor, or vice versa. Nowadays, the lowest wire latencies are obtained when placing a microprocessor and a cache memory on the same chip. The next step up in latency happens when these components are placed within a single package. A further step up occurs when the cache memory is part of the main memory component and thus sits at the opposite side of the system bus with respect to the microprocessor.
Let us cite some related performance numbers for an example microprocessor. Intel's Pentium 4 microprocessor is available at speeds ranging from 1.6 to 2.4 GHz. It has a pipelined, multiple-issue, speculative, out-of-order engine. It has a 20 KB, on-chip, separate data/instruction level-one cache, whose wire latency is estimated at two clock cycles. And it has a 512 or 256 KB, on-chip, unified level-two cache, whose wire latency is estimated at 10 clock cycles.
2.6.2 The busses organization
For reasons not relevant to this discussion, the use of a hierarchical organization of busses is attractive. Nowadays, personal computers and similar computing systems come with two-tier busses. One bus, the so-named system bus, links the central processing unit and the main memory through a very fast point-to-point path. The second bus, named the input/output bus, links all input/output components or periphery devices, like network interface cards and video or disk controllers, through a relatively slower multi-drop path.
To put these busses in quantitative perspective, let us note some performance numbers for two widely deployed busses: the system bus of Intel's Pentium 4 microprocessor and the almost omnipresent Peripheral Component Interconnect (PCI) input/output bus [Shanley and Anderson 2000]. The specification for the Pentium 4's system bus states an operating speed of 400 MHz and a theoretical maximum throughput of 3.2 gigabytes per second. (Here, 1 gigabyte equals 10^9 bytes.) On the other hand, the PCI bus specification states a selection of path widths between 32 and 64 bits and a selection of operating speeds between 33 and 66 MHz. Consequently, the theoretical maximum throughput for the PCI bus stays between 132 and 528 megabytes per second for the 33-MHz/32-bit PCI and the 66-MHz/64-bit PCI, respectively. (Here, 1 megabyte equals 10^6 bytes.)
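As a sanity check on these figures, the theoretical maximum throughput is simply the operating speed multiplied by the path width. A worked computation, using the numbers above (and assuming the Pentium 4 system bus's 64-bit data path):

    33 x 10^6 transfers/s x 4 bytes/transfer = 132 x 10^6 bytes/s
    66 x 10^6 transfers/s x 8 bytes/transfer = 528 x 10^6 bytes/s
    400 x 10^6 transfers/s x 8 bytes/transfer = 3.2 x 10^9 bytes/s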
2.6.3 The input/output bus’ arbitration scheme
One more important aspect to mention with respect to the input/output bus is its
arbitration scheme. Because the input/output bus is a multi-drop bus, its path is shared
by all components attached to it and thus some access protocol is required.
The omnipresent PCI bus [Shanley and Anderson 2000] uses a set of signals for implementing a use-by-request master-slave arbitration scheme. These signals are emitted through a set of wires separate from the address/data wires. There is a request/grant pair of wires for each bus attachment and a set of shared wires for signaling an initiator-ready event (FRAME and IRDY), a target-ready event (TRDY and DEVSEL), and for issuing commands (three wires).
A periphery device attached to the PCI bus (device for short) that wants to transfer some data requests PCI bus mastership by emitting a request signal to the PCI bus arbiter (bus arbiter for short). The bus arbiter grants bus mastership by emitting a grant signal to a requesting device. A granted device becomes the bus master and drives the data transfer by addressing a slave device and issuing read or write commands to it. A device may request bus mastership, and the bus arbiter may grant it, at any time, even when another device is currently performing a bus transaction, in what is called "hidden bus arbitration." This seems a natural way to improve performance. However, devices may experience reduced performance or malfunction if bus masters are preempted too quickly. The next subsection discusses this and other issues regarding performance and latency.
The PCI bus specification does not define how the bus arbiter should behave when receiving simultaneous requests. The PCI 2.1 bus specification only states that the arbiter is required to implement a fairness algorithm to avoid deadlocks. Generally, some kind of bi-level round-robin policy is implemented. Under this policy, devices are separated into two groups: a fast access and a slow access group. The bus arbiter rotates grants through the fast access group, allowing one grant to the slow access group at the end of each cycle. Grants for slow access devices are also rotated. Figure 2.7 depicts this policy.
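The following small C model (an illustration, not hardware) reproduces the example grant sequence of Figure 2.7, under the assumption that fast-group device C happens not to be requesting.

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *fast = "ABC", *slow = "abc";
        const char *requesting = "ABabc";   /* C never requests here */
        int si = 0;
        int cycle;
        const char *f;

        for (cycle = 0; cycle < 3; cycle++) {
            for (f = fast; *f != '\0'; f++)
                if (strchr(requesting, *f) != NULL)
                    printf("%c ", *f);      /* rotate the fast group */
            printf("%c ", slow[si]);        /* one slow-group grant */
            si = (si + 1) % 3;              /* slow grants also rotate */
        }
        printf("\n");                       /* prints: A B a A B b A B c */
        return 0;
    }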
Figure 2.7—Example PCI arbitration algorithm. [The figure shows a fast access group (A, B, C) and a slow access group (a, b, c), with the example grant sequence A B a A B b A B c.]
2.6.4 PCI hidden bus arbitration’s influence on latency
PCI bus masters should always use burst transfers to move blocks of data between themselves and target devices. If a bus master is in the midst of a burst transaction and the bus arbiter removes its grant signal, this indicates that the bus arbiter has detected a request from another bus master and is granting bus mastership for the next transaction to the other device. In other words, the current bus master has been preempted. Due to PCI's hidden bus arbitration this could happen at any moment, even one bus cycle before the current bus master has initiated its transaction. Evidently this hampers PCI's support for burst transactions and leads to bad performance.
To avoid this, the PCI 2.1 bus specification mandates the use of a master latency timer per PCI device. The value contained in this latency timer defines the minimum amount of time that a bus master is permitted to retain bus mastership. Therefore, a current bus master retains bus mastership until either it completes its burst transaction or its latency timer expires.
Note that independently of the latency timer a PCI device must be capable of
managing bus transaction preemption; that is, it must be capable of “remembering” the
state of a transaction so it may continue where it left off.
The latency timer is implemented as a configuration register in a PCI device's configuration space. It is either initialized by the system's configuration software at startup, or contains a hardwired value. It may equal zero, in which case a device can only enforce single data-phase transactions. Configuration software computes latency timer values for PCI devices not having them hardwired from its knowledge of the bus speed and each PCI device's target value, stored in another PCI configuration register.
2.6.5 Network interface card’s system interface
There are two different techniques for interfacing a computer system with periphery devices like network interface cards. When using the programmable input/output technique, a periphery device interchanges data between its local memory and the system's main memory by means of a program executed by the central processing unit. This program issues either input/output or memory instructions that read or write data from or to particular main memory locations. These locations were previously allocated and initialized by the system's configuration software at startup. The periphery device's and motherboard's organizations determine the use of either input/output or memory instructions. When using this technique, periphery devices interrupt the central processing unit when they want to initiate a data interchange.
With the direct memory access (DMA) technique, the data interchange is carried out without the central processing unit's intervention. Instead, a DMA periphery device uses a pair of specialized electronic engines for performing the data interchange with the system's main memory. One engine is part of the periphery device itself while the other is part of the bridge chipset; see Figure 2.6. Evidently, the input/output bus must support DMA transactions. In a DMA transaction, one engine assumes the bus master role and issues read or write commands; the other engine acts as a servant and follows commands. Generally, the DMA engine at the bridge chipset may assume both roles.
When incorporating a master DMA engine, a periphery device interrupts the central processing unit after finishing a data interchange. It is important to note that DMA engines do not allocate nor initialize the main memory locations from or to which data is read or written. Instead, the corresponding device driver is responsible for that and somehow communicates the locations' addresses to the master DMA engine. The next subsection further explains this.
A periphery device's system interface may incorporate both previously described techniques. For instance, it may rely on programmable input/output for setup and performance-statistics gathering and on DMA for input/output data interchange.
2.6.6 Main memory allocation for direct memory access network interface cards
Generally, a DMA-capable network interface card supports the usage of a linked list of message buffers (mbufs) for data interchange with main memory; see Figure 2.8. During startup, the corresponding device driver builds two of these mbuf lists: one for handling packets exiting the system, named the transmit channel, and the other for handling packets entering it, named the receive channel. Mbufs in these lists are wrapped with descriptors that hold additional list management information, including the mbuf's main memory start address and size. Network interface cards maintain a local copy of the current mbuf's list descriptor. They use the list descriptor's data to marshal DMA transfers. A network interface card may get list descriptors either autonomously, by DMA, or by device driver command, by programmable input/output code. The method used depends on context, as explained in the next paragraphs. Generally, a list descriptor includes a "next element" field. If a network interface card supports it, it uses this field to fetch the next current mbuf's list descriptor. This field is set to zero to instruct the network interface card to stop doing DMA through a channel.
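A sketch of one such descriptor follows, with illustrative field names (real layouts are device-specific):

    #include <sys/types.h>

    struct sketch_dma_desc {
        u_int32_t next;      /* physical address of the next descriptor;
                                zero tells the card to stop DMA on this
                                channel */
        u_int32_t buf_addr;  /* main-memory start address of the mbuf data */
        u_int32_t buf_len;   /* size of the mbuf data */
        u_int32_t status;    /* e.g., the "message processed" bit the card
                                sets after copying a message */
    };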
Transmit channel. At system startup, transmit channels are empty and, consequently, device drivers clear (by writing a zero) network interface cards' local copies of the current mbuf descriptor. When there is a network message to transmit, a device driver queues the corresponding mbuf to a transmit channel and does whatever is necessary so the appropriate network interface card gets its copy of the new current mbuf's list descriptor. This means that a device driver either copies the list descriptor by programmable input/output code, or instructs the network interface card to DMA copy it. Advanced DMA network interface cards are programmed to periodically poll transmit channels looking for new elements, simplifying the device driver's task. Generally, list descriptors have a "message processed" field, which is required for transmit channel operation: after a network interface card DMA copies a message from the transmit channel, it sets the "message processed" field and, after transmitting the message through the data link, notifies the appropriate device driver of a message transmission. (Message transmission notification may be batched for improved performance.) When