Benchmarking the Am186™CC Microcontroller’s SmartDMA™ and HDLC controllers


By Patrick Maupin

April 6, 1999



The LANCE™-compatible SmartDMA™ controllers on the Am186™CC Communications microcontroller can perform data movement tasks much more efficiently than the CPU itself. The combination of these DMA controllers with the 186 core provides a low-cost, high-performance solution to many communications needs.


This paper describes, at a high level, an HDLC performance benchmark written to verify SmartDMA performance in a system that approximates the real world, and the results of running the benchmark. The source code (for the HDLC performance benchmarking program called CCTEST.HEX) is easily extensible and adaptable, suitable for additional benchmark testing and possibly some simple real-world applications.


The SmartDMA controllers can be used in conjunction with either the HDLC controllers or the USB controllers on the microcontroller. The HDLC controllers were chosen for the benchmark for simplicity (they are easy to put into loopback mode) and for maximum system loading (most tests were performed with all four controllers simultaneously clocked at 10 Mbit/s).


Certain elements of the benchmark software mimic typical real-world applications very closely. For example, all intermodule calls are FAR, most of the system is written in C, and the system performs several calls to move data around through layered software.


Other elements of the system are not typical of the real world, but most of these elements make the benchmark slightly slower than a comparable real-world application. For example, the system is polled rather than interrupt driven, which introduces latency into data movement because the system has to constantly check to see what needs to be done. Another example is printfs. The benchmark prints its status through the E86MON™ software monitor, which has a 16-byte UART transmit buffer; a delay occurs if any more data is transmitted at one time. Finally, the sizes of the DMA descriptor ring buffers were chosen to avoid running out of receive buffers (the receive descriptor ring is twice the size of the transmit descriptor ring) because buffers are not moved at interrupt level. This keeps spurious errors from interfering with the throughput.
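For illustration, the descriptor-ring sizing described above can be pictured as follows. This is a sketch only: the structure layout and names are generic LANCE-style placeholders rather than the actual SmartDMA descriptor format, and the ring depths shown are hypothetical. The point is simply that the receive ring is kept twice as deep as the transmit ring so that a polled design does not run dry of receive buffers.

/*
 * Generic LANCE-style descriptor ring sketch (field layout is illustrative,
 * not the actual SmartDMA register format).
 */
#define TX_RING_ENTRIES  8                        /* hypothetical depth        */
#define RX_RING_ENTRIES  (2 * TX_RING_ENTRIES)    /* twice the transmit depth  */

typedef struct {
    void          *buffer;     /* data buffer attached to this descriptor      */
    unsigned short length;     /* buffer size (transmit) or received length    */
    unsigned short status;     /* ownership and completion flags               */
} ring_descriptor;

static ring_descriptor tx_ring[TX_RING_ENTRIES];
static ring_descriptor rx_ring[RX_RING_ENTRIES];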


In any system, it is very difficult to directly measure CPU utilization. In an interrupt-driven system, you can measure the amount of time spent in a null task, but it is difficult to take DMA and interrupt activity into account. In a polled system like the CCDriver benchmark, it is even more difficult to measure CPU utilization because there is no null task, per se. To compensate for this, the benchmark does not attempt to directly measure CPU utilization. Instead, it allows you to artificially restrict CPU utilization to any desired percentage between 20% and 100%. This restriction is performed by using general-purpose DMA controllers to steal cycles from the CPU. Because the 186 has no cache, this is a very effective method of halting the 186 CPU core for short periods.
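A rough sketch of the arithmetic behind this restriction is shown below. The four-bus-cycles-per-transfer figure is borrowed from the per-buffer overhead formula later in this paper and is an assumption here; the register programming that actually sets up the general-purpose DMA channels is omitted.

/*
 * Sketch: general-purpose DMA transfer rate needed to hold the CPU to a
 * target utilization on a 48-MHz, cacheless 186 core.
 */
#define SYS_CLOCK_HZ        48000000UL   /* Am186CC system clock                */
#define CYCLES_PER_TRANSFER 4UL          /* assumed bus cycles per DMA transfer */

/* Transfers per second required to steal (100 - cpu_percent)% of the cycles. */
unsigned long gp_dma_steal_rate(unsigned int cpu_percent)
{
    unsigned long stolen_cycles = SYS_CLOCK_HZ / 100UL * (100UL - cpu_percent);
    return stolen_cycles / CYCLES_PER_TRANSFER;
}

/* Example: gp_dma_steal_rate(40) == 7,200,000 transfers/s for a 40% CPU limit. */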


The benchmark program is very flexible (and the provided source code offers even more flexibility), enabling you to try different buffer sizes, clock speeds, etc. Because the system performance scales very linearly, this discussion is limited to the following setup:




- Am186™CC system clock at 48 MHz
- Four HDLC channels running simultaneously, send and receive, all clocked at 10 Mbit/s
- Four SmartDMA buffers per HDLC message
- SmartDMA buffer sizes of 50, 100, 150, 200, 300, and 400 bytes
- CPU utilization of 40, 50, 60, 70, 80, 90, and 100%


Raw throughput using all four HDLC channels


With this setup, the benchmark program allocates buffers from a pool, stores header information in them, and passes them to the send function, which transmits them out the HDLC channel. On the receive side, the program gets buffers out of the receive descriptor ring, compares the headers with the sent buffers to check for errors, and frees the buffers. A poll task frees spent transmit buffers and allocates fresh receive buffers for the descriptor rings.
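The shape of that polled loop is sketched below. This is not the actual CCTEST source; every name here (buffer, alloc_buffer, hdlc_send, and so on) is a hypothetical stand-in for the driver and application calls the paper describes.

/* Sketch of the benchmark's polled data flow (hypothetical names throughout). */
typedef struct buffer buffer;           /* opaque buffer handle                */

buffer *alloc_buffer(void);             /* pool allocator                      */
void    format_header(buffer *b);       /* application: stamp header info      */
void    hdlc_send(buffer *b);           /* driver: queue on the transmit ring  */
buffer *hdlc_receive(void);             /* driver: next completed rx buffer    */
void    compare_header(buffer *b);      /* application: check for errors       */
void    free_buffer(buffer *b);         /* return buffer to the pool           */
void    poll_descriptor_rings(void);    /* retire tx buffers, refill rx ring   */

void benchmark_loop(void)
{
    buffer *tx, *rx;

    for (;;) {
        /* Transmit side: allocate, stamp a header, hand off to the driver. */
        tx = alloc_buffer();
        if (tx != NULL) {
            format_header(tx);
            hdlc_send(tx);
        }

        /* Receive side: pull completed buffers, verify, and recycle them. */
        while ((rx = hdlc_receive()) != NULL) {
            compare_header(rx);
            free_buffer(rx);
        }

        /* Poll task: free spent transmit buffers, replenish the receive ring. */
        poll_descriptor_rings();
    }
}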


The raw throughput results in Mbit/s per channel for four channels are:



            50 byte   100 byte   150 byte   200 byte   300 byte   400 byte
            buffers   buffers    buffers    buffers    buffers    buffers
100% CPU     1.100     2.052      2.850      3.539      4.663      5.466
 90% CPU     0.992     1.838      2.556      3.166      4.174      4.899
 80% CPU     0.878     1.624      2.262      2.802      3.680      4.336
 70% CPU     0.764     1.412      1.968      2.440      3.197      3.761
 60% CPU     0.653     1.202      1.672      2.076      2.724      3.196
 50% CPU     0.541     0.996      1.380      1.710      2.253      2.646
 40% CPU     0.430     0.791      1.092      1.352      1.775      2.094

Average throughput per channel (Mbit/s), four HDLC channels in loopback mode


The numbers shown here were, as discussed previously, all generated with all four HDLCs clocking at 10 MHz. However, during the course of running tests for this paper, it was noted that, due to the architecture of the SmartDMA controller, the throughput numbers are completely independent of the HDLC clock rate unless the channels are saturated. For example, the numbers would have been very similar for 5-MHz clocking, except the 5.466 Mbit/s for 400-byte buffers at 100% CPU, and would have even been substantially similar at 2.5 MHz for the left two columns of the table.



Of course, this test does not take into account actually manipulating the sent or received data (the overhead of which will be highly application-dependent), although it should give a pretty accurate picture of buffer overhead through the system. Later in this paper, we will show some results with a simple application touching every byte of the received data. For now we will do some math to test these raw performance numbers.


With the benchmark as run, we expect the number of CPU cycles it takes to process a buffer to be reasonably constant. There are some non-constant elements in the calculation, such as timer and printf overhead, but if the number of CPU cycles is not relatively constant, there is probably a benchmark flaw.


Calculating the number of CPU cycles it takes to process a buffer is also a useful exercise because it shows how much time the benchmark's buffer allocation and free functions, buffer movement functions, etc., take.


The easy theoretical number to come up with is CPU cycles plus per-buffer SmartDMA overhead. It is calculated as follows (the quantities that vary in the chart above are the Mbit/s per channel, the allowed CPU percentage, and the bytes per buffer):

DMA data cycles/second = Mbit/s per channel * 4 channels / (8 bits/byte)
                         * (4 CPU cycles/DMA transfer)
                         * (2 DMA transfers/byte: one for send, one for receive)

CPU cycles/second = (48M CPU cycles/second * allowed CPU percentage) - DMA data cycles/second

Buffers/second = (Mbit/s per channel * 4 channels) / (bytes per buffer) / (8 bits/byte)

CPU cycles/buffer = (CPU cycles/second) / (buffers/second)
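As a sanity check on these formulas, the short, self-contained sketch below (not part of CCTEST) reproduces individual cells of the overhead table from the throughput table; the constants mirror the assumptions above.

#include <stdio.h>

/* Per-buffer CPU overhead from the formulas above (48-MHz clock, 4 channels,
   4 CPU cycles per DMA transfer, 2 DMA transfers per byte). */
static double cpu_cycles_per_buffer(double mbit_per_channel,
                                    double allowed_fraction,
                                    double bytes_per_buffer)
{
    double bytes_per_sec = mbit_per_channel * 1e6 * 4.0 / 8.0; /* all 4 channels */
    double dma_cycles    = bytes_per_sec * 2.0 * 4.0;
    double cpu_cycles    = 48e6 * allowed_fraction - dma_cycles;
    double buffers       = bytes_per_sec / bytes_per_buffer;
    return cpu_cycles / buffers;
}

int main(void)
{
    /* 400-byte buffers, 100% CPU, 5.466 Mbit/s per channel -> about 3825 */
    printf("%.0f cycles/buffer\n", cpu_cycles_per_buffer(5.466, 1.00, 400.0));
    /* 400-byte buffers, 40% CPU, 2.094 Mbit/s per channel -> about 4135 */
    printf("%.0f cycles/buffer\n", cpu_cycles_per_buffer(2.094, 0.40, 400.0));
    return 0;
}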

Using this calculation with the raw data above, the CPU overhead per buffer is shown in the following
table:



            50 byte   100 byte   150 byte   200 byte   300 byte   400 byte
            buffers   buffers    buffers    buffers    buffers    buffers
100% CPU     3964      3878       3853       3825       3776       3825
 90% CPU     3955      3901       3870       3858       3810       3855
 80% CPU     3974      3929       3893       3882       3861       3885
 70% CPU     3998      3959       3922       3908       3906       3947
 60% CPU     4010      3992       3967       3949       3944       4009
 50% CPU     4036      4019       4017       4014       3991       4056
 40% CPU     4065      4055       4075       4080       4090       4135

CPU cycle overhead per looped buffer, four HDLC channels in loopback mode


Examination of the numbers in this table shows less than a 10% variation over an 8:1 buffer size ratio and a 2.5:1 CPU speed restriction ratio. The overhead per buffer is almost constant over a varying CPU restriction at the lowest buffer size (when all buffers sent are probably transmitted almost immediately over the 10-Mbit/s link), and the overhead tends to decrease slightly for larger buffers at 100% CPU and increase slightly for larger buffers at 40% CPU. Both of these changes are probably artifacts of the polled design of the system.


If we take the largest number we have (4135 cycles/buffer), what is the benchmark doing with that time? About 1 percent of it is SmartDMA ring descriptor per-buffer overhead (send and receive together are about 32-36 clocks/buffer), and most of the rest of the breakdown can be determined by running the benchmark with statistical code profiling enabled. This was done with 400-byte buffers at a 40% CPU restriction, with the following rounded results:


Transmit buffer formatting (application code):        14%, or 579 cycles
Receive buffer comparison (application code):         35%, or 1447 cycles
HDLC transmit (driver code):                          13%, or 536 cycles
HDLC receive (driver code):                           12%, or 496 cycles
Buffer free (driver code, 2 buffers):                  7%, or 289 cycles
Buffer allocate (driver code, 2 buffers):              4%, or 165 cycles
SmartDMA ring overhead (driver code, fresh buffer
  to receive ring, spent buffers from transmit ring): 11%, or 455 cycles
E86MON (printf) overhead:                              4%, or 165 cycles


We can infer from this table that the number of CPU cycles required to allocate a buffer and send it using
SmartDMA (ignoring whatever processing the application has to do to fill the buffer) is approximately:


536 + (289+165+455)/2 = 991 CPU cycles


Likewise, the inferred number of CPU cycles required to receive a buffer from the SmartDMA and then free it when we are done is approximately:



496 + (289+165+455)/2 = 951 CPU cycles


As the tables above show, we can do a small amount of application processing on each buffer and still manage to send and receive 11,000 50-byte buffers per second. In fact, if the CPU were doing nothing else (including not receiving the buffers), it could allocate and send over 48,000 short (1-byte-long) buffers every second (48,000,000 cycles/second divided by roughly 991 cycles/buffer).

Simulated application test


The final test we perform is an application test. We use a CRC calculation to simulate a real-world application which must manipulate each data byte. We put a single HDLC channel in loopback mode, calculate the CRC on the received data and FCS bytes, and print an error if the received FCS is wrong (fortunately, we don't print any errors unless we force them externally!). With 200-byte buffers, we achieve a 5.55-Mbit/s throughput, with the following breakdown of the remaining CPU cycles:


Transmit buffer formatting (application code):         5%, or 579 cycles
Receive buffer comparison (application code):         10%, or 1447 cycles
Receive buffer CRC calculation:                       68%, or 1447 cycles
HDLC transmit (driver code):                           5%, or 536 cycles
HDLC receive (driver code):                            3%, or 496 cycles
Buffer free (driver code, 2 buffers):                  2%, or 289 cycles
Buffer allocate (driver code, 2 buffers):              1%, or 165 cycles
SmartDMA ring overhead (driver code, fresh buffer
  to receive ring, spent buffers from transmit ring):  4%, or 455 cycles


The leftover CPU after SmartDMA data transfers is 43.5M cycles/second. The CRC code takes 68% of this, or 29.6M cycles/second, for 5.55 Mbit/s. This works out to about 43 CPU cycles/byte for the CRC. We further verify this by running two channels instead of one. The throughput is now 2.72 Mbit/s per channel, or 5.44 Mbit/s total. If we drop the buffer size, we get 2.045 Mbit/s per channel for 100-byte buffers, and 1.36 Mbit/s per channel for 50-byte buffers.
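To give a concrete picture of the kind of per-byte work that the 43-cycles/byte figure represents, the sketch below shows a bit-at-a-time CRC-16/CCITT check of the sort commonly used for an HDLC FCS. It is illustrative only: it is not the routine used in CCTEST, and the polynomial, reflection, and final-XOR conventions are assumptions that should be checked against the actual HDLC hardware before reuse.

/* Bit-at-a-time CRC-16/CCITT sketch (reflected polynomial 0x8408). */
unsigned short fcs_crc16(const unsigned char *data, unsigned int len)
{
    unsigned short crc = 0xFFFF;        /* HDLC FCS register is preset to ones */
    unsigned int   i, bit;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++)   /* process least-significant bit first */
            crc = (crc & 1) ? (unsigned short)((crc >> 1) ^ 0x8408)
                            : (unsigned short)(crc >> 1);
    }
    return (unsigned short)(crc ^ 0xFFFF);  /* transmitted FCS is complemented */
}

When the received FCS bytes are run through the same CRC along with the data, as the paper's test does, a frame with a good FCS leaves a fixed residue in the CRC register, which is one common way to validate received frames.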


An application which needs to perform a similar amount of processing per buffer and per byte could easily communicate with two saturated T-1 or E-1 lines using the SmartDMA controllers if the average buffer size is over 100 bytes (with two channels and 100-byte buffers, the measured 2.045 Mbit/s per channel is essentially the 2.048-Mbit/s E-1 line rate).



Conclusions


For applications requiring movement of large amounts of data with little per-byte processing, SmartDMA channels give excellent performance, and can even saturate the entire bus bandwidth of the 186 by processing an aggregate throughput of 44 Mbit/s (four channels, send and receive at 5.5 Mbit/s on each channel). When the required per-byte processing becomes large and/or the number of bytes per buffer becomes small, the benchmark CodeKit should be adapted to the application to assess the suitability of the Am186™CC microcontroller for each customer's requirements.


For USB applications, it may be worthwhile to run the DMA in buffer-per-IRP mode instead of buffer-per-packet mode. This greatly increases the complexity of error handling, but the increase in average buffer size could be well worth the extra code complexity.


Likewise, for HDLC applications that use proprietary protocols, it is well worthwhile to try to increase the average message size, bearing in mind that the efficacy of the CRC FCS is reduced for messages over 1 KB or so in length.






Trademarks

© 1999 Advanced Micro Devices, Inc. All rights reserved.

AMD, the AMD logo, combinations thereof, Am186, E86MON, and SmartDMA are trademarks of Advanced Micro
Devices, Inc.

All other product names are for identification purposes only and may be trademarks of their respective companies.