IXP1200 Programming by E.J.Johnson and A.R.Kunze May 27, 2002 Clement Leung

I read this book over the Memorial Day weekend to prepare for further
discussions on IXP programming projects. I have not had time to appreciate all the details
of this new system, but I would like to write down what I have learned while it is still
fresh in my mind. I did not go through all the programming examples in detail. The book
comes with a software development environment on a CD-ROM. A worthy student would do
all the examples hands-on. Chen Fei has also read this book and I expect a lot of
feedback from him, as well as from others at CAPSL.

I have decided to talk about the features peculiar to the IXP1200 as they come to mind,
instead of spending a portion of the available time organizing the material. The book
itself is well organized.


Memory Hierarchy

Each microengine has 128 registers. A thread can access its own 32 registers, and
also the entire set of 128 registers. These 128 registers are organized in 2 banks of
64 registers each. The functional unit takes one operand from each bank in each
cycle. Each microengine also has SRAM read registers, SRAM write registers,
SDRAM read registers, and SDRAM write registers. The SRAM read/write
registers also serve read/write requests to the scratchpad SRAM on chip. All
memory transfers must take place through these read/write registers.

There is a bus interface unit on chip, the FBI unit. This unit contains a receiver
FIFO (RFIFO) and a transmitter FIFO (TFIFO). These FIFOs are for sending and
receiving 64-byte packets called mpackets on the IX bus, which interfaces with
Ethernet MACs and other IXPs.

There are some synchronization instructions which operate on the scratchpad
SRAM and external SRAM, but not the external SDRAM.

The StrongARM core can access all the memories, but can do synchronization
operations only on the external SRAM, not on the scratchpad SRAM.

The FBI unit can transfer data directly between the IX bus and the SDRAM.

There is also a PCI bus interface. This book does not give examples of PCI
interface programming.



Each microengine has four threads. A non-preemptive arbiter handles thread
synchronization. A thread can give up control of the thread execution hardware.
As an example, a thread can execute a memory operation with one of the
following thread control options:


no_signal, thread execution continues


sig_done, continue operation, the memory controller should send signal when
the memory operation completes


swap, swap thread out; wait until operation is complete before rescheduling


voluntary_swap, swap thread out, do not wait for completion to swap back in


sync_wait, spin wait for completion of operation without swapping out

Each thread can find out its own thread number and use it to determine its own
course of action as different from other threads.

Since thread execution is non-preemptive, resource sharing among threads in the
same microengine is automatically “atomic”.

There are additional synchronization mechanisms based on operations on
scratchpad SRAM, external SRAM and explicit inter-thread signaling. The
scheduling hardware recognizes the inter-thread signals explicitly, even though it
does not record the signal origin. This can add complication to program
debugging. The synchronization mechanisms are:


set, test-clear operations on scratchpad memory locations and
external SRAM memory locations.


up to 8 SRAM CAM locks. Thread execution can lock up to 8 SRAM addresses.
Locking a locked address will hold up the executing thread.


thread signals. Any thread can wait for its own inter-thread signal. Any
thread can send an inter-thread signal to any other thread.
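
As a rough illustration of the set and test-clear style of operation listed above, the
following C sketch models a single memory word on a host machine. The function names are
hypothetical and are not the IXP1200 intrinsics, and the test-and-set variant is included
as an assumption for illustration. On the real hardware the memory unit would perform the
read-modify-write atomically; this single-threaded model only imitates that behavior.

```c
#include <stdint.h>

/* Hypothetical host-side model; these are NOT the real IXP1200 intrinsics. */

/* Set the bits in mask and return the word's previous value. */
uint32_t model_test_and_set(uint32_t *word, uint32_t mask)
{
    uint32_t old = *word;
    *word = old | mask;
    return old;
}

/* Clear the bits in mask and return the word's previous value. */
uint32_t model_test_and_clear(uint32_t *word, uint32_t mask)
{
    uint32_t old = *word;
    *word = old & ~mask;
    return old;
}

/* Try to take a one-bit lock: succeeds only if the bit was clear before. */
int model_try_lock(uint32_t *word, uint32_t lock_bit)
{
    return (model_test_and_set(word, lock_bit) & lock_bit) == 0;
}
```

A caller that fails model_try_lock would normally swap out and retry later, and the
holder would release the lock with model_test_and_clear.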

The FBI unit contains FIFOs and state registers to support packet receipt and
transmission with MACs (Media Access Controllers, which interface with
physical signals and convert them into bit streams, and then into 64-byte
“mpackets” in the IXP). The state registers contain status bits for these MACs.

Since queue management is so common in packet processing, the microengine
instruction set supports 8 LIFO queues in external SRAM with atomic push and
pop operations.
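
To make the push and pop behavior concrete, here is a minimal host-side sketch of LIFO
queues kept as linked lists inside a memory array. The layout (one head word per queue,
each entry's first word holding the link to the next entry, address 0 meaning "empty")
and all names are assumptions made for this sketch, not the documented hardware format.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SRAM_WORDS 64u  /* arbitrary size for the sketch */
#define MODEL_NUM_LIFOS  8u   /* the text mentions 8 LIFO queues */

/* Hypothetical model: word 0 is reserved so that 0 can mean "empty". */
typedef struct {
    uint32_t mem[MODEL_SRAM_WORDS];  /* stands in for external SRAM */
    uint32_t head[MODEL_NUM_LIFOS];  /* one list head per queue */
} model_sram;

void model_sram_init(model_sram *s)
{
    for (uint32_t i = 0; i < MODEL_SRAM_WORDS; i++) s->mem[i] = 0;
    for (uint32_t q = 0; q < MODEL_NUM_LIFOS; q++) s->head[q] = 0;
}

/* Push the entry at word address addr onto queue q. */
void model_push(model_sram *s, uint32_t q, uint32_t addr)
{
    assert(q < MODEL_NUM_LIFOS);
    assert(addr != 0 && addr < MODEL_SRAM_WORDS);
    s->mem[addr] = s->head[q];  /* link the new entry to the old head */
    s->head[q] = addr;
}

/* Pop the most recently pushed address from queue q, or 0 if it is empty. */
uint32_t model_pop(model_sram *s, uint32_t q)
{
    assert(q < MODEL_NUM_LIFOS);
    uint32_t addr = s->head[q];
    if (addr != 0) s->head[q] = s->mem[addr];
    return addr;
}
```

A typical use of such a queue is a free list of packet buffers: threads pop a buffer
address when a packet arrives and push it back once the packet has been transmitted.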

The StrongARM core maps the FBI FIFOs into its address space. It also
implements and some synchronization operations thr

Each external SRAM and external SDRAM unit supports 3 command queues that
the software can specify for each memory operation:


ordered queue, commands are executed in order


priority queue, commands in this queue are serviced before those in other queues


optimized queue, commands may be reordered for optimization. SRAM
operations may be separated into read and write sequences to avoid bus
turnaround time. SDRAM operations may be grouped to optimize memory bank
access patterns.


Compiler Cons

The book refers to compiler optimization in a couple of places. I have not noticed
any mention of compiler optimization in thread synchronization, microengine
synchronization, or data structure optimization. Note that all the points I am making
in this section come from this book, and not from the C compiler reference manual
provided by Intel.

The compiler accepts assignment of variables to the different memory areas.
There is a specification of sharing the variable to force it to be put into memory
and not optimized into a register, for example. But the programmer has to specify
this explicitly.

There is a library of intrinsics to access all the hardware mechanisms for thread
control, atomic memory operations, etc. Again the programmer uses th

All compiler generated memory references also swap out the executing thread.
Other thread-control options have to be hand-programmed using intrinsics.

On Page 277, the compiler is credited for generating no-ops to deal with what
appears to be an idiosyncratic hardware issue having to do with the property that a
thread terminating intrinsic will terminate the thread when it is fetched into the
microengine, even before it is executed.


Software Development

There is a C compiler. But many of the synchronization operations are available
as intrinsic functions through the compiler. So there is no high-level parallel
programming paradigm supported for multi-engine multi-thread software
development. In general, the IXP1200 is designed for efficiency.
Many mechanisms are made available, and design choices are made for efficient
hardware implementation rather than for supporting well-defined higher level
programming paradigms. Error handling when software operates the hardware
mechanisms outside their expected use is another area where simplicity of hardware
implementation takes priority over “nice” semantic properties.

There is a software model with components running in the StrongARM, and
components running in the microengines. These components are referred to as
Active Computing Elements (ACE) and are meant to be composed together in the
StrongARM and in the microengines to build larger programs. There is an
Interface Definition Language (IDL) to facilitate this composition. The book
devotes a chapter to explaining this software model and how it is supported.

The examples used in the book give a good introduction to the considerations in
programming. These examples give a flavor of the background and experience
expected of software developers in building performance-oriented products based
on the IXP1200 and its successors.


Throughout the book the examples are for packet processing. It gives examples
of how to build queues and arrays that are well supported by the hardware,
e.g. by a hardware hashing unit to build look up tables. I have listed
the different hardware mechanisms in the different hardware units. The book is
very focused on making use of these hardware mechanisms, almost directly in C
through the intrinsic library functions, each of which is a segment of sugared
assembly code. In building data structures to access on-chip and off-chip
hardware status information, it is necessary to use the set of read and write
registers dedicated to each memory subsystem correctly.


There is the consideration of explicit thread control. In some examples a thread
continues to execute some instructions after issuing a long latency request before
giving up control and starting to wait for request completion.


In some examples, code is moved around to allow early initiation of long latency


In receiving and sending packets through MAC devices on the IX bus, the
control states kept among the packet processing software, the on-chip hardware
units and the external hardware devices may be “out of phase”. This must be
taken into consideration while optimizing performance. The software reads and
writes hardware status registers, which are updated by hardware. Correct and
efficient synchronization schemes must be devised to make use of multi-microengines
and multi-


The hardware RFIFO and TFIFO in the FBI unit process their entries in order.
So a system of valid bits is used to allow sharing these FIFOs among several
external MACs. Schemes must be devised to support MACs with different line
rates so that there is no slowdown due to coupling through the FIFO operations.


The book uses performance graphs for different size packets, and simulation
outputs showing thread activities in a microengine to illustrate the effectiveness of
different techniques. These tools will be useful in any IXP software development.


The book gives an example of building a hash table, using the hardware
hashing unit, and analyzing the frequency and nature of memory locking among
co-operating threads.


The book develops in detail how to use the IXP1200 mechanisms to work with
optimized hardware platforms. One such platform works with 8 100 Mbit Ethernet
ports and 2 Gigabit Ethernet ports. Another such platform works with “Fast”
MAC devices, which can buffer multiple mpackets.


Information on the new IXP2xxx

This book has some good information on the new architecture and provides a
good context to understand some of the new features. So while it is not very
relevant to programming the IXP1200, I have decided to include these features in
this memo as well. It says very little about the compiler development. So either
the compiler features were not ready at the time of writing this book, or there are
few anticipated changes in how the IXP2xxx C compiler supports the user as
compared to the IXP1200 compiler.


More and faster. 8 threads per microengine, clock speed at 700 MHz and up, XScale instead
of StrongARM

A microengine has 8 threads instead of 4. Microengines are grouped into clusters.
IXP2400 has two clusters of 4 microengines. IXP2800 has two clusters of 8
microengines. Each cluster has an independent command bus and SRAM bus. All
clusters share a DRAM bus.

Each microengine has 8 threads and 256 general purpose registers, and 512
transfer registers.


Memory hierarchy

A microengine has:

general purpose registers and transfer registers as in the IXP1200.

a new set of next-neighbor registers which can be used as general purpose
registers and, in a new mode, for sending data to a neighbor microengine. The
microengines are ordered as neighbors in this new mode. There are status bits
and get and put operations to support using these registers as a queue.

Each microengine has a local memory of 640 long words, accessed in 3


Each microengine has a 16-entry CAM. An operation on it returns, for a hit, the
stored state in the matching entry, and the position of the entry that matches. For a
miss, the operation returns the position of the entry that has been referenced least
recently. The book suggests that a local cache can be built using this CAM and the
memory local to the microengine. The book makes the point that if the memory
data is used only by a microengine, the cache implementation can use write-back.
If the memory data is shared among microengines, a write-through cache should
be implemented. It does not discuss cache-coherency issues any further.
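
The hit and miss behavior described above can be sketched on a host machine as follows.
This is only a model under stated assumptions: the type and function names are
hypothetical, the least-recently-used order is tracked with a simple counter, and a hit
is assumed to count as a reference. It is not the microengine CAM interface itself.

```c
#include <stdint.h>
#include <string.h>

#define CAM_ENTRIES 16

/* Hypothetical software model of a 16-entry CAM with LRU tracking. */
typedef struct {
    uint32_t tag[CAM_ENTRIES];
    uint8_t  state[CAM_ENTRIES];
    uint8_t  valid[CAM_ENTRIES];
    uint32_t last_used[CAM_ENTRIES];  /* logical time of last reference */
    uint32_t clock;
} model_cam;

typedef struct {
    int     hit;    /* 1 on a hit, 0 on a miss */
    int     entry;  /* matching entry on a hit, LRU entry on a miss */
    uint8_t state;  /* stored state on a hit, 0 on a miss */
} cam_result;

void cam_init(model_cam *c)
{
    memset(c, 0, sizeof *c);
}

/* Look up tag; a hit refreshes the entry, a miss reports the LRU entry. */
cam_result cam_lookup(model_cam *c, uint32_t tag)
{
    cam_result r = {0, 0, 0};
    int lru = 0;
    for (int i = 0; i < CAM_ENTRIES; i++) {
        if (c->valid[i] && c->tag[i] == tag) {
            c->last_used[i] = ++c->clock;
            r.hit = 1;
            r.entry = i;
            r.state = c->state[i];
            return r;
        }
        if (c->last_used[i] < c->last_used[lru]) lru = i;
    }
    r.entry = lru;  /* candidate for replacement */
    return r;
}

/* Store tag and state in the given entry and mark it most recently used. */
void cam_write(model_cam *c, int entry, uint32_t tag, uint8_t state)
{
    c->tag[entry] = tag;
    c->state[entry] = state;
    c->valid[entry] = 1;
    c->last_used[entry] = ++c->clock;
}
```

In a local cache built this way, the entry number returned by the lookup would index a
block of local memory: on a miss the thread would evict that block (writing it back if
needed), load the new data, and call cam_write with the new tag.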

The on-chip scratchpad memory is now 16 Kbytes instead of 4 Kbytes.

The IXP2400 has 2 SRAM controllers and 1 DRAM controller. The IXP2800 has
4 SRAM controllers and 3 DRAM controllers.

Each SRAM controller internally maintains an array of 64 “queue descriptors”. A
queue descriptor has information about the location of the head and tail of the
queue. The actual queue data is stored in SRAM. A queue can be a linked list or a



The SRAM now supports more atomic operations:

increment, decrement, add, test-increment, test-decrement, test-

The scratchpad memory supports, in

subtract, and test

For thread control and synchronization, each microengine now manages 15
distinct inter-thread signals. Most accesses to external hardware can cause signals
using any one signal number. Thread scheduling can wait for multiple signals,
specifying if the thread should wait for all of the selected signals to be asserted, or
for just any one of them.
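
The wait-for-all versus wait-for-any choice can be pictured as a bit-mask test. The
sketch below is a host-side model only. The names, the numbering of signals from 1 to
15, and the use of one mask bit per signal are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SIGNALS 15  /* assumed numbering: signals 1..15, one mask bit each */

typedef enum { WAIT_ANY, WAIT_ALL } wait_mode;

/* Mask bit for signal number n. */
uint16_t sig_bit(unsigned n)
{
    assert(n >= 1 && n <= NUM_SIGNALS);
    return (uint16_t)(1u << n);
}

/* Return 1 if a thread waiting on the wanted signals may be scheduled. */
int thread_is_ready(uint16_t asserted, uint16_t wanted, wait_mode mode)
{
    uint16_t seen = (uint16_t)(asserted & wanted);
    if (mode == WAIT_ALL)
        return seen == wanted;  /* every selected signal has arrived */
    return seen != 0;           /* at least one selected signal has arrived */
}
```

For example, a thread that issued an SRAM read on signal 1 and a DRAM write on signal 4
could wait with WAIT_ALL on both bits before touching either result.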

Both inter-thread signaling and the microengine CAM can be used to implement
critical sections. There are only 15 inter-thread signals, and a CAM is accessible
only within a microengine.


Communications Interface Support

The FBI unit is now generalized to a Media Switch Fabric (MSF) unit. For the
IXP2400, this unit supports Utopia, Packet over Synchronous Optical Network
(POS) Physical Interface (POS/PHY), and Common Switch Interface (CSIX).
These are common interfaces for up to 2.5 Gbit/sec operations. The IXP2800
supports System Packet Interface 4 (SPI-4) and CSIX. SPI-4 is for 10 Gbit/sec.

The MSF unit also provides more support for receiving and sending packets
from and to external peripherals. There are now buffers of 8 Kbytes each,
reconfigurable to packet sizes of 64 bytes, 128 bytes or 256 bytes.
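
As a small worked example of the sizes quoted above, the following C sketch computes how
many elements fit in one 8 Kbyte buffer for each configurable element size. The function
name is hypothetical, and the assumption that a buffer is divided evenly into elements of
a single size is mine, not a statement from the book.

```c
#define MSF_BUFFER_BYTES (8u * 1024u)  /* buffer size quoted in the text */

/* Number of elements per buffer, or 0 for a size the text does not list. */
unsigned elements_per_buffer(unsigned element_bytes)
{
    if (element_bytes != 64u && element_bytes != 128u && element_bytes != 256u)
        return 0;
    return MSF_BUFFER_BYTES / element_bytes;
}
```

Under that assumption a buffer holds 128 elements of 64 bytes, 64 elements of 128 bytes,
or 32 elements of 256 bytes.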

There are now two lists maintained in the MSF: a list of threads that are ready to
process new packets, and a list of packets. This allows direct asynchronous
hardware scheduling between threads and external MACs.


Software Structure

The book advocates, based on analyzing many programs developed for the IXP1200,
using software pipelines as a preferred structure. It distinguishes between:

a contextual pipeline stage: where all the threads in a microengine execute the
same code for a pipeline stage, and the pipeline is formed by concatenating these
stages through queues. State information can be stored in registers or local memory
for fast access.

a functional pipeline stage: where all the threads can be executing different stages
of a pipeline at the same time.

There are obviously detailed tradeoffs in building these pipelines regarding code
size, access to shared state information, queue storage allocation, and so on. But
this structure is being advocated.