Flow Maximization for NoC Routing Algorithms



Ying-Cherng Lan¹, Michael C. Chen¹, Alan P. Su², Yu-Hen Hu³, Sao-Jie Chen¹,⁴

¹ Graduate Institute of Electronics Engineering, National Taiwan University,
Taipei, Taiwan, ROC, {f94943068, r95943171}@ntu.edu.tw, csj@cc.ee.ntu.edu.tw
² SpringSoft, Inc., alan_su@springsoft.com
³ Department of Electrical and Computer Engineering, University of Wisconsin, Madison,
Madison, WI 53706, USA, hu@engr.wisc.edu
⁴ Graduate Institute of Electrical Engineering, National Taiwan University, Taipei, Taiwan, ROC


Abstract

In this paper, we apply a flow maximization
methodology and develop a minimal adaptive routing
scheme towards lower average packet latencies in a
Network-On-Chip. The minimal adaptive algorithm allows
packets to be forwarded in multiple directions. Typical
routing algorithms select the next hop path for each packet
and then resolve conflicts between packets that have been
selected to have the same next hop path. We propose a
novel decision flow which combines routing direction and
port arbitration while maximizing flow. On a 2-D mesh
NoC, our implementation lowers average packet latencies
across all packet injection rates.

1. Introduction

In response to meeting higher routing complexity and
tighter schedule demands in a multi-processor System-On-
Chip (MPSoC), Network-On-Chip (NoC) has been
proposed as a flexible and scalable solution. As silicon
technologies advance to 50nm and beyond, global wires
will take 6 to 10 cycles to propagate [1]. On-chip
interconnect delays will then far outweigh gate delays.
Furthermore, attempts to guarantee quality of service for
system performance will be a manually intensive task. A
dedicated set of recalculations will be necessary whenever
a new peripheral is added. To adapt, designs may focus on
probabilistic or stochastic performance metrics to meet
design objectives [2]. As indicated in several papers, a
paradigm change from computation-based to
communication-based will better serve current issues
[2,3,4]. The research of NoC addresses these issues and
makes for a promising solution.
A typical NoC can be partitioned into wiring, data link,
network, transport, and application layers [2]. In the
network layer, an efficient routing scheme optimizes usage
of the underlying communication architecture. This
generalized problem has been defined by [5] as the routing
problem. Given a communication architecture and an
application graph, which can be seen as a traffic pattern,
the task is to find a decision function at each router that
selects the output port best achieving a user-defined
objective function. This problem has three main
parts: a traffic pattern, an NoC communication architecture,
and an algorithm which best satisfies a set of user-defined
objectives.

First, traffic patterns known ahead of time can be
assigned by scheduling solutions. On the other hand,
dynamic or stochastic traffic patterns rely on routing
algorithms with varying degrees of adaptivity to route
packets. Multiple traffic patterns such as uniform,
transpose, and hotspot better demonstrate an algorithm’s
flexibility. Our focus is on dynamically created patterns.
Second, NoC communication architectures can have
different topologies. The most common one is a regular 2-
D mesh, frequently used to display the advantages of using
an adaptive routing algorithm. Other work, such as [6],
deals with the routing in irregular regions of a mesh array.
These types of routing algorithms allow working nodes to
maintain objectives in the presence of faulty nodes. Our
focus is conceptually independent of topology.
We develop a prioritized NoC routing concept based on
the Ford-Fulkerson algorithm following an analysis of stall
reduction. We show how routing algorithms can be
compartmentalized into sub-modules and fitted into our
concept. Finally, we compare our routing algorithm with
other existing algorithms and show the improvements in
lowering average packet latencies at low packet injection
rates while maintaining or decreasing average packet
latencies at high packet injection rates.
This paper is organized as follows: In Section 2, we
review relevant state-of-the-art techniques in the
networking layer of NoCs. In Section 3, we refer to the
Ford-Fulkerson algorithm as a means to solve maximum
matching problems. Next we apply NoC priorities to the
maximum matching problem in order to solve the routing
problem while preserving congestion avoidance and
congestion relief. We also break down the impact of stalls
in an NoC and relate this to how our Flow Maxima (FMAX)
can reduce average packet latencies in an NoC. Then, we
present our architecture and discuss a feasible hardware
implementation of our FMAX algorithm. Finally, we
present the results and discuss the type of improvements
that this implementation provides in terms of latency.

IEEE Computer Society Annual Symposium on VLSI
978-0-7695-3170-0/08 $25.00 © 2008 IEEE
DOI 10.1109/ISVLSI.2008.52
Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on November 10, 2009 at 20:06 from IEEE Xplore. Restrictions apply.

2. Related work

Minimal routing algorithms based primarily on run-time
conditions can be classified into two sets. One set of
algorithms deals with centralized routing. An example is the
AntNet algorithm as described in [7].

The overhead of ants
is that each node requires extra ant buffers, routing tables,
and arbitration mechanisms.

The other set of algorithms, distributed algorithms which
rely on local information only, have been proposed as being
efficient and still maintaining low overhead and high
scalability. Routing algorithms in this category include
deterministic and adaptive algorithms. Under traffic
patterns such as transpose and hotspot, XY deterministic
routing fails to avoid hotspots and results in high average
latencies. Adaptive routing allows the router to react to
hotspots that different traffic patterns create. While
minimal routing algorithms prevent livelock from occurring,
adaptive routing introduces the possibility of deadlock [8].
Deadlock can be prevented in adaptive algorithms by
applying odd-even turn model restrictions to the routing
decision [9].
As presented in [10], the DyAD router dynamically
switches from deterministic to adaptive routing when
congestion is detected.

Deterministic routing achieves low
packet latency under low packet injection rates.
Neighboring nodes send an indication to use adaptive routing
when their buffers are filled above a preset threshold.
Under those conditions, the router dictates that packets are
routed in the direction with more slots in the downstream
input buffers. This minimal adaptive algorithm used in the
presence of hotspots and increasing congestion rates pushes
back the saturation point of the traffic in the network.
In addition to making a routing decision based on the
buffer information of downstream packets, the other part of
a router’s decision making is the arbitration of packets.
When multiple input packets are designated to be
forwarded to the same next hop destination, arbitration
schemes such as round-robin or first-come first-serve are
proposed to resolve the output port contention. Arbitration
schemes such as these may be designed to relieve upstream
buffers with higher congestion.
These previous works have either dealt with some form
of routing or arbitration. We categorize the former as
various methods of congestion avoidance; in other words,
they evaluate downstream network conditions to avoid
sending packets towards those congested areas. We
categorize the latter as various methods of congestion
relief; in other words, they evaluate upstream network
conditions to determine which area has the most congestion
and should therefore send first. Seen as a whole, a router
ultimately decides
which flits on input buffers match with which output ports
for sending flits on a cycle by cycle basis.
Several approaches in the field of networking have led to
solutions of similar problems in routing. These routing
problems are sometimes modeled using maximum
matching. This can be solved by maximum flow
algorithms, such as the Ford-Fulkerson algorithm [11].
One such approach for use in configuration of MEMS
switches modifies the Ford-Fulkerson algorithm for
hardware implementation [12].
This algorithm has also been used on mesh-based
processor arrays [13] for optimizing centralized routing of
the network by modeling the array as a set of vertices and
edges. However, our contribution applies towards
distributed routing performed on each switch, thereby
maintaining scalability and flexibility as the NoC
dimension increases.
Distributed routing in the field of networking has also
modeled its routing problem as a maximum matching
problem [14]. The hardware implementation in this case
requires many cycles to complete and cannot be applied to
typical NoC routers which must provide routing decisions
each cycle.

3. Latency analysis

One of the typical objectives of a routing algorithm
given a communication architecture and a traffic model is
to lower average packet latencies. We define the latency of
a packet as the number of cycles for a core to transmit the
packet through the NoC and be fully received by the
destination core. This is broken down as shown in Fig. 1.



Latency $\lambda$ can be separated into the following
components:

$$\lambda = h + p_f + S_s + S_m + S_b + S_l$$

Hops $h$ is the number of hops a flit must take to reach its
destination. Due to wormhole routing, each packet is sent
one flit per cycle. Therefore, the number of flits in a packet,
$p_f$, directly corresponds to the number of cycles used to
complete the transmission of an entire packet. Ideal traffic
latency is based solely on these two variables, creating a
lower bound for packet latency:

$$\text{Lower\_bound} = h + p_f$$

The remaining components of latency are different
forms of stalls. A buffer stall, $S_b$, represents a stall due to
downstream buffer being full. A link stall, $S_l$, represents a
stall which has lost arbitration for the output port to another
flit; in essence, the output link was unavailable. A flit in
the middle of a buffer may either stay at the same location
in the FIFO, which we name a stationary stall ($S_s$), or move
ahead to the next slot in the FIFO, considered to be a
moving stall ($S_m$). Regardless of $S_s$ or $S_m$, the flit has the
same number of hops remaining to the destination.
$S_s$ and $S_m$ are stalls created by the buffer fill level of the
downstream buffer. Both are functions of and directly
related to the occurrence of $S_b$ and $S_l$ in the NoC. Thus a
routing algorithm should focus on reducing $S_b$ and $S_l$.
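The decomposition above can be checked with simple arithmetic. The following is a minimal sketch, assuming per-packet stall counts are already known; the class name `PacketLatency` and the sample numbers are hypothetical and only illustrate how the lower bound and the total latency relate.

```python
from dataclasses import dataclass


@dataclass
class PacketLatency:
    """Cycle counts for one packet, named after the symbols in the latency equation."""
    h: int        # hops to the destination
    p_f: int      # flits in the packet (one flit is sent per cycle)
    s_s: int = 0  # stationary stalls
    s_m: int = 0  # moving stalls
    s_b: int = 0  # buffer stalls
    s_l: int = 0  # link stalls

    def lower_bound(self) -> int:
        # Ideal latency: no stalls of any kind.
        return self.h + self.p_f

    def total(self) -> int:
        # Full latency: ideal latency plus every stall component.
        return self.lower_bound() + self.s_s + self.s_m + self.s_b + self.s_l


# Hypothetical packet: 6 hops, 8 flits, and a handful of stalls.
pkt = PacketLatency(h=6, p_f=8, s_s=2, s_m=1, s_b=3, s_l=1)
print(pkt.lower_bound())  # 14
print(pkt.total())        # 21
```

The gap between the two numbers (7 cycles here) is exactly the stall time that a routing algorithm can try to remove.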
Figure 1. Components of latency

In Fig. 2, a series of two-flit buffers are connected with
all packets requesting the downstream buffer to the right.
At time $T=0$, the stall $P_0 S_l$ occurs because $P_0$ lost
arbitration to a different packet stream denoted by $Q_0$.
Note that $Q_0$ comes from the same router as $P_0$ but from a
different buffer, which is not shown in the diagram. At time
$T=1$, this in turn causes $P_1 S_m$ to occur because $P_1$ is
backed up by $P_0$. Furthermore, $P_2 S_b$ also occurs because
the middle input buffer is now full. This sets off a series of
chain stalls at later times, each consisting of an $S_b$ and an
$S_m$ component. This diagram illustrates the importance of
$S_l$, as it can create a chain effect of other types of stalls.
In the next section we will cover how $S_l$ can be reduced by
utilizing the maximum number of links and thereby allowing
the least possible loss due to arbitration.

4. Flow Maxima scheme
4.1 NoC architecture

In order to develop and implement our algorithm, we
use an $n \times n$ array of nodes arranged in a homogeneous
2-D mesh network. Each node is directly connected to four
neighboring nodes such that each hop requires one cycle.
Packets are broken into head, body, and tail flits and sent
via wormhole routing to save area with smaller buffer sizes.
The routing algorithms we use for reference and
comparison are minimal routing algorithms, thereby
ensuring freedom from livelock. We also implement the
odd-even turn model restrictions [9] to guarantee no
deadlock occurs.

4.2 Problem formulation




Consider the routing problem that a minimal adaptive
routing algorithm faces as illustrated in Fig. 3(a). There are
three packets $P_1$, $P_2$, and $P_3$ waiting for arbitration.
$P_1.WN$ indicates the packet may travel in the West or North
directions. Similarly, $P_2.WN$ indicates West or North and
$P_3.WS$ indicates West or South. In conventional routing
algorithms, maximum matching is not guaranteed between
input packets and output ports. Therefore, the packet $P_2$
may stall when packets $P_1$ and $P_3$ are matched to the two
possible output ports of $P_2$, as shown in Fig. 3(b). As
demonstrated earlier, the arbitration stall $S_l$ of $P_2$ can
cause a chain effect of stalls which could have been avoided.
As shown in Fig. 3(c), with a routing scheme that guarantees
maximum matching, the three packets $P_1$, $P_2$, and $P_3$ can
all be routed in one cycle.
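The three-packet example can be reproduced by enumerating every choice of one direction per packet. This is only an illustrative brute-force sketch, not the hardware method; the dictionary `requests` encodes the directions named in the text, and the helper `packets_routed` is a hypothetical name.

```python
from itertools import product

# Candidate output directions per packet, as described for Fig. 3(a).
requests = {"P1": "WN", "P2": "WN", "P3": "WS"}
packets = list(requests)


def packets_routed(choice):
    """Each output port forwards one flit per cycle, so the number of packets
    routed equals the number of distinct ports that were chosen."""
    return len(set(choice))


# Outcome in the style of Fig. 3(b): P1 takes North and P3 takes West,
# so whichever port P2 asks for is already occupied.
conventional = ("N", "W", "W")
print(packets_routed(conventional))  # 2

# Exhaustive search over all direction choices finds an assignment
# in which no packet stalls, as in Fig. 3(c).
best = max(product(*(requests[p] for p in packets)), key=packets_routed)
print(dict(zip(packets, best)), packets_routed(best))
```

Exhaustive search is affordable here only because each packet has at most two candidate directions; it is meant to show what a maximum matching buys, not how a router would compute it.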
To ensure maximum matching, each switch needs to
solve the following connection assignment problem per
cycle. Consider $2N$ nodes separated into $N$ left nodes and
$N$ right nodes which represent $N$ input buffers and $N$
output ports. The left nodes are connected to the right nodes
using a 0-1 capacity matrix $C$ of size $N \times N$. The rows
of the capacity matrix correspond to the requests from input
buffers and the columns correspond to the output ports at
which the requests are directed. A value of 1 at the $i$th row
and $j$th column represents a possible connection from input
buffer $i$ to output port $j$. We want to find a maximum 0-1
connection matrix $R$ that stays below the capacity and below
the target number of connections per node, such that no
other connection matrix $R'$ satisfying the same bounds
contains more connections than $R$. $RI_i$ represents the
target number of connections for the input buffer at the
$i$th position, and $RO_j$ similarly represents the target
number of connections for the output port at the $j$th
position.

Fig. 4 shows an example of the connection assignment
problem with $N=4$, whose capacity matrix is illustrated by
the routing problem in Fig. 3(a). The four input buffers and
output ports of the switch are ordered in North, East, South,
and West directions, which represent the general mesh
topology of an NoC. For instance, the first row of matrix $C$
represents $P_3.WS$ at the North input buffer. There are three
packets $P_1$, $P_2$, and $P_3$ waiting for arbitration in the
North, East, and South input buffers as specified in $RI$.
Similarly, $RO$ represents the North, South, and West output
ports receiving requests from input buffers. The 0-1 matrix
$R$ is a maximum connection for packet delivery, which is
not necessarily unique. Put mathematically, we want to
solve the problem $R \le C$ such that

$$R_{ij} \le C_{ij} \quad \forall i, j \tag{1}$$

$$\sum_{j} R_{ij} \le RI_i \quad \forall i \tag{2}$$

$$\sum_{i} R_{ij} \le RO_j \quad \forall j \tag{3}$$

$$\nexists\, R' \text{ with } \sum_{i,j} R'_{ij} > \sum_{i,j} R_{ij} \text{ satisfying (1), (2), and (3).} \tag{4}$$
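To make the connection assignment problem concrete, the sketch below checks a candidate connection matrix against three bounds: no entry of R exceeds the matching entry of C, each row sum stays within RI, and each column sum stays within RO. The matrices are an assumed reading of the example in the text (ports ordered North, East, South, West; P3.WS in the North input buffer; the two WN packets in the East and South input buffers), since the figure itself is not reproduced here.

```python
# Rows are input buffers, columns are output ports, both ordered N, E, S, W.
C = [[0, 0, 1, 1],   # North input: P3 may go South or West
     [1, 0, 0, 1],   # East input: a WN packet may go North or West
     [1, 0, 0, 1],   # South input: a WN packet may go North or West
     [0, 0, 0, 0]]   # West input: empty
RI = [1, 1, 1, 0]    # at most one connection per occupied input buffer
RO = [1, 0, 1, 1]    # North, South, and West outputs receive requests


def is_feasible(R, C, RI, RO):
    """True if R respects the capacity matrix and both per-node targets."""
    n = len(C)
    within_capacity = all(R[i][j] <= C[i][j] for i in range(n) for j in range(n))
    rows_ok = all(sum(R[i]) <= RI[i] for i in range(n))
    cols_ok = all(sum(R[i][j] for i in range(n)) <= RO[j] for j in range(n))
    return within_capacity and rows_ok and cols_ok


def connections(R):
    """Number of input-to-output connections selected by R."""
    return sum(map(sum, R))


R = [[0, 0, 1, 0],   # North input -> South output
     [1, 0, 0, 0],   # East input  -> North output
     [0, 0, 0, 1],   # South input -> West output
     [0, 0, 0, 0]]
print(is_feasible(R, C, RI, RO), connections(R))  # True 3

# Two inputs sharing the West output violates the column bound.
R_conflict = [[0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]]
print(is_feasible(R_conflict, C, RI, RO))  # False
```

Because the entries of RI sum to 3, no feasible matrix can hold more than three connections, so this R is a maximum solution for the assumed instance.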
Figure 2. Effect of a single link stall
Figure 3. Example of a routing decision
Figure 4. Connection assignment problem
Figure 5. Maximum bipartite matching

Given a maximum matching problem in the format of
capacity matrix $C$, the problem can be represented by a
bipartite graph $G=(V,E)$ as illustrated in Fig. 5(a). It was
shown that the Ford-Fulkerson algorithm can be used to
find the solution in polynomial time [15]. The technique is
to construct a flow network $G'=(V',E')$ in which flows
correspond to matches, as shown in Fig. 5(b). Let the
source and sink be new vertices not in $V$, and let
$V' = V \cup \{\text{source}, \text{sink}\}$. If the vertex partition of
$G$ is $V = I \cup O$, with $I$ the input buffers and $O$ the
output ports, the directed edges of $G'$ are the edges of $E$
directed from input buffer to output port, along with $|V|$
new edges from the source to each input buffer and from
each output port to the sink. The Ford-Fulkerson algorithm
finds the maximum flow from the source node to the sink
node by repeatedly finding augmenting paths. Breadth First
Search (BFS) or Depth First Search (DFS) can be used until
there are no more augmenting paths. The resulting flow is
as illustrated in Fig. 5(c).
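The flow-network construction can be sketched in software as follows. This is a minimal illustration with unit capacities and depth-first augmenting paths, not the single-cycle hardware scheme; the function name `max_bipartite_matching` and the sample capacity matrix (an assumed encoding of the Fig. 3(a) requests with ports ordered North, East, South, West) are illustrative choices.

```python
from collections import defaultdict


def max_bipartite_matching(C):
    """Maximum matching between inputs (rows) and outputs (columns) of a 0-1
    capacity matrix, found by augmenting unit flows from source to sink."""
    n_in, n_out = len(C), len(C[0])
    source, sink = "source", "sink"
    residual = defaultdict(int)      # residual capacity of each directed edge
    neighbours = defaultdict(list)   # adjacency lists, including reverse edges

    def add_edge(u, v):
        residual[(u, v)] = 1
        neighbours[u].append(v)
        neighbours[v].append(u)      # lets a later path cancel this edge

    for i in range(n_in):
        add_edge(source, ("in", i))
    for j in range(n_out):
        add_edge(("out", j), sink)
    for i in range(n_in):
        for j in range(n_out):
            if C[i][j]:
                add_edge(("in", i), ("out", j))

    def augment(u, visited):
        """Depth-first search for one augmenting path from u to the sink."""
        if u == sink:
            return True
        visited.add(u)
        for v in neighbours[u]:
            if v not in visited and residual[(u, v)] > 0 and augment(v, visited):
                residual[(u, v)] -= 1    # push one unit forward
                residual[(v, u)] += 1    # and allow it to be undone later
                return True
        return False

    while augment(source, set()):
        pass

    # An input-to-output edge that carries flow has no residual capacity left.
    return [(i, j) for i in range(n_in) for j in range(n_out)
            if C[i][j] and residual[(("in", i), ("out", j))] == 0]


C = [[0, 0, 1, 1],
     [1, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 0, 0, 0]]
matching = max_bipartite_matching(C)
print(matching)
print(len(matching))  # 3
```

The loop runs one search per unit of flow, which illustrates why a direct implementation needs several steps per decision.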

4.3 Decision Flow for FMAX

In Fig. 6, we present our Flow Maximization Algorithm
(FMAX), which integrates existing congestion avoidance
and congestion relief algorithms in our FMAX Routing
module. The details of the FMAX Routing module will be
covered in the following section. Each flit requesting
arbitration initially has up to two directions to travel in,
based upon its destination node. Since this algorithm is
adaptive, all input packets cancel possible paths which
violate or lead to violations of the odd-even turn model
in order to avoid deadlock. If any output directions lead to
downstream buffers without any empty slots, then paths
leading in those directions are cancelled as well.
Applying wormhole routing to the remaining flits, the body
flits from previously routed packets are sent first.
The first packets to be routed are packets which have
already reached their destination. These are considered
first because once a packet has reached its destination, there
is only one possible output port the packet can request. At
this point, if there is contention for the core output port,
Congestion Relief Prioritization (CRP) is used to determine
the winning packet. In our implementation, the CRP uses a
First Come First Serve (FCFS) policy such that the earliest
arriving packets get the highest priority while the latest to
arrive get the lowest. Routing packets for the core output
port does not affect the maximum bipartite matching
problem. This concept is further explained in the
implementation section of the paper.
The remaining packets, with the exception of packets
originating from this core, have their priorities set based on
the CRP and Congestion Avoidance Prioritization (CAP).
CAP can apply to any type of congestion avoidance scheme.
In FMAX, we use the adaptive portion of DyAD as our
congestion avoidance scheme. By this scheme, CAP gives
downstream directions with the most empty input buffer
slots the highest priority and those with the most filled
input buffer slots the lowest priority.
After prioritization, FMAX Routing performs maximal
matching between input buffers and output ports in order
of highest to lowest priority. Finally, core input buffer
packets are routed out last, since the NoC will be further
congested upon introduction of new packets.
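The path cancellation and the two prioritizations can be sketched as small helper functions. This is an illustrative sketch under stated assumptions: `turn_allowed` is a hypothetical predicate standing in for the odd-even turn model check (which is not implemented here), `free_slots` is a hypothetical map of empty downstream buffer slots per direction, and `arrival_cycle` is a hypothetical map of FCFS timestamps.

```python
def cancel_paths(candidates, turn_allowed, free_slots):
    """Keep only directions that pass the turn check and whose downstream
    buffer still has at least one empty slot."""
    return [d for d in candidates if turn_allowed(d) and free_slots[d] > 0]


def cap_order(directions, free_slots):
    """Congestion avoidance: most empty downstream slots first."""
    return sorted(directions, key=lambda d: free_slots[d], reverse=True)


def crp_order(packets, arrival_cycle):
    """Congestion relief with FCFS: earliest arrival first."""
    return sorted(packets, key=lambda p: arrival_cycle[p])


free_slots = {"N": 3, "E": 0, "S": 1, "W": 2}
arrival_cycle = {"P1": 12, "P2": 7, "P3": 9}

# East is dropped because its downstream buffer is full; the turn check is
# stubbed out to accept everything in this example.
print(cancel_paths(["N", "E"], lambda d: True, free_slots))  # ['N']
print(cap_order(["W", "N", "S"], free_slots))                # ['N', 'W', 'S']
print(crp_order(["P1", "P2", "P3"], arrival_cycle))          # ['P2', 'P3', 'P1']
```

Keeping the two orderings as separate functions mirrors the idea that any congestion avoidance or congestion relief scheme could be substituted for the ones used here.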

4.4 NoC Architecture



Congestion avoidance and relief schemes in a typical
NoC router work independently to decide the routing path
and arbitration priority. In our FMAX Router Architecture,
we optimize routing decisions for both CAP and CRP as
shown in Fig. 7. Downstream FIFO fill level information
in DownStream_Cong_Info is used to determine CAP while
FCFS timestamps in UpStream_Cong_Info determine CRP.
This allows FMAX to utilize the combined routing and
arbitration schemes.
Prior to FMAX Routing, the Reserved Ports unit sends
any body flits, provided that there is space at the
downstream buffer, because of the wormhole design. Then
the Pathing module makes requests for packets based on
odd-even turn model restrictions and buffer full flags
determined from DownStream_Buffer_Cond. Next, the PE
Arbiter routes packets which have already reached their
destination. Finally, FMAX Routing grants requests using
priority information from CRP and CAP.

Figure 6. Flow maximization decision procedure
Figure 7. FMAX router architecture

4.5 Implementation of FMAX Routing

To enable our implementation to run in a single cycle of
execution time, our FMAX Routing does not directly solve
the maximum bipartite matching problem using the Ford-
Fulkerson algorithm. Instead, we reduce the bipartite
matching problem in a way that does not affect the
maximum number of matches. For a set of maximum
matches $M = \{(I_1,O_1), (I_2,O_2), \ldots, (I_k,O_k)\}$, $|M|=k$
is the maximum flow calculated by the Ford-Fulkerson
algorithm. In the case that both $I_i$ and $O_j$ have one
possible match $(I_i,O_j)$, namely to each other, then this
match must be in the solution set $M$. However, if there
exists a match $(I_i,O_j) \notin M$ where $I_i$ has only one
possible connection and that is to a vertex $O_j$ which is
among the vertices in $M$, we can always replace the match
$(I_j,O_j) \in M$ that occupies $O_j$ by $(I_i,O_j)$, and the
modified matching $M'$ remains maximum, that is,
$|M'|=|M|$. A similar reduction can be done in the case
where $O_j$ has only one possible connection and that is to a
vertex $I_i$ which is among the vertices in $M$. Due to length
restrictions, the full proof of this reduction is not shown
here.
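Although the proof is omitted, the reduction can be sanity-checked numerically. The sketch below is an empirical check on random 4 × 4 capacity matrices, not a proof: whenever an input has exactly one candidate output, it compares the maximum matching size of the full problem with one plus the maximum matching size after removing that input and output. The output-side case is checked by transposing the matrix. All function names are hypothetical.

```python
import random


def max_matching_size(C):
    """Size of a maximum bipartite matching, found with augmenting paths."""
    n_out = len(C[0]) if C else 0
    owner = [None] * n_out           # owner[j] = input currently matched to output j

    def try_assign(i, seen):
        for j in range(n_out):
            if C[i][j] and j not in seen:
                seen.add(j)
                if owner[j] is None or try_assign(owner[j], seen):
                    owner[j] = i
                    return True
        return False

    return sum(try_assign(i, set()) for i in range(len(C)))


def remove_pair(C, i, j):
    """Delete input row i and output column j from the capacity matrix."""
    return [[v for col, v in enumerate(row) if col != j]
            for r, row in enumerate(C) if r != i]


def reduction_is_safe(C):
    """Committing a single-choice input to its only output must not lower
    the size of the best achievable matching."""
    full = max_matching_size(C)
    for i, row in enumerate(C):
        if sum(row) == 1:
            j = row.index(1)
            if 1 + max_matching_size(remove_pair(C, i, j)) != full:
                return False
    return True


rng = random.Random(0)
samples = [[[rng.randint(0, 1) for _ in range(4)] for _ in range(4)]
           for _ in range(2000)]
ok = all(reduction_is_safe(C) and reduction_is_safe([list(col) for col in zip(*C)])
         for C in samples)
print(ok)  # True
```

The check agrees with the exchange argument in the text: a vertex with a single candidate partner can always be matched to that partner without reducing the maximum.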
Directly applying the two reduction mechanisms
presented above, we can reduce the routing problem
implemented in hardware. First, we identify the set of
input buffer flits which request only one output port and
route these flits out using CRP to resolve any contention in
this set. Equivalently, these flits have only one direction in
which to travel, due to either the odd-even turn model
restriction (a destination that lies directly North, East,
South, or West) or a full downstream buffer in the other
candidate direction. In these cases the packet has fewer
directions to choose from and thus is routed out
opportunistically. Second, the set of output ports which
have only one requesting flit will have these flits routed to
them starting from the highest CAP priority, since in doing
so no link stalls will occur. These two steps not only
reduce link stalls but also maintain the possibility of
maximum matching.
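The two reduction steps can be sketched in software as shown below. This is one possible sequential reading of the steps, written for illustration only; the function `apply_reductions`, the input format (a map from input buffer to its set of candidate output ports), and the sample values are all assumptions, with FCFS timestamps standing in for CRP and free downstream slots standing in for CAP.

```python
def apply_reductions(requests, arrival_cycle, free_slots):
    """Return (grants, leftover): grants maps input -> output for flits routed
    by the two reductions; leftover holds requests still to be prioritized."""
    grants = {}
    pending = {i: set(outs) for i, outs in requests.items()}

    def grant(i, out):
        grants[i] = out
        del pending[i]
        for outs in pending.values():
            outs.discard(out)        # the port is no longer available

    # Step 1: inputs with a single candidate port, earliest arrival first.
    single_inputs = sorted((i for i, outs in pending.items() if len(outs) == 1),
                           key=lambda i: arrival_cycle[i])
    for i in single_inputs:
        if pending[i]:               # empty if an earlier flit took the port
            grant(i, next(iter(pending[i])))

    # Step 2: outputs with a single requesting input, most free slots first.
    requesters = {}
    for i, outs in pending.items():
        for out in outs:
            requesters.setdefault(out, []).append(i)
    single_outputs = sorted((o for o, ins in requesters.items() if len(ins) == 1),
                            key=lambda o: free_slots[o], reverse=True)
    for out in single_outputs:
        i = requesters[out][0]
        if i in pending:             # skip inputs already granted another port
            grant(i, out)

    return grants, {i: outs for i, outs in pending.items() if outs}


requests = {"North": {"S", "W"}, "East": {"W"}, "South": {"W", "N"}}
arrival_cycle = {"North": 5, "East": 3, "South": 4}
free_slots = {"N": 2, "E": 4, "S": 3, "W": 1}

grants, leftover = apply_reductions(requests, arrival_cycle, free_slots)
print(grants)    # East -> W, North -> S, South -> N
print(leftover)  # {}
```

In this example the reductions alone route all three flits, so nothing is left for a later prioritized step.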
The last step of our FMAX routing algorithm will
combine the prioritization calculated by the CAP and CRP
schemes. Note that CAP and CRP are capable of
prioritizing most routing and arbitration algorithms. In this
case, CAP utilizes information from the adaptive
mechanism as found in DyAD, and CRP utilizes
information from the FCFS arbitration scheme. Our routing
algorithm takes the combined level of these prioritizations
and routes them from highest to lowest priority until all
remaining routes depend on a previously taken input buffer
or output port.
While this no longer guarantees maximum matching, it
offers a good approximation. An input packet must be able
to travel in two directions and each of these directions must
also be contested by other input packets for it to have a
possibility of not being reduced via the previous two
techniques. The input packet will frequently not be able to
fulfill these conditions and hence be reduced by our
techniques. After application of these reductions, the
remaining edges are often sparse enough that the last
step of the FMAX routing algorithm obtains a maximal
number of matches.
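The priority-ordered last step, and the reason it is only an approximation, can be illustrated with a small greedy routine. The text does not specify how the CAP and CRP levels are combined, so the numeric priorities below are purely hypothetical; the example is deliberately built so that every input has two candidate ports and every port is contested, which is the situation the reductions cannot simplify.

```python
from itertools import permutations


def greedy_by_priority(edges):
    """edges: list of (priority, input, output). Grant requests from highest to
    lowest priority, skipping any that need an input or output already used."""
    used_in, used_out, grants = set(), set(), {}
    for _, i, o in sorted(edges, key=lambda e: e[0], reverse=True):
        if i not in used_in and o not in used_out:
            grants[i] = o
            used_in.add(i)
            used_out.add(o)
    return grants


def maximum_size(edges):
    """Brute-force size of a maximum matching, for comparison only."""
    inputs = sorted({i for _, i, _ in edges})
    outputs = sorted({o for _, _, o in edges})
    allowed = {(i, o) for _, i, o in edges}
    padded = outputs + [None] * len(inputs)   # None means the input stalls
    return max(sum((i, o) in allowed for i, o in zip(inputs, perm))
               for perm in permutations(padded, len(inputs)))


# Hypothetical combined priorities for four contested inputs.
edges = [(8, "A", "E"), (7, "C", "S"), (6, "B", "E"), (5, "B", "S"),
         (4, "D", "W"), (3, "A", "N"), (2, "C", "W"), (1, "D", "N")]

grants = greedy_by_priority(edges)
print(grants, len(grants))   # three flits routed; B finds both of its ports taken
print(maximum_size(edges))   # 4
```

Here the greedy pass routes three flits while four are possible, which matches the statement that maximum matching is no longer guaranteed once the priorities drive the final step.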

5. Experimental results
5.1 Experimental Setup

The physical layer of our network comprises 8 × 8
nodes, each with 5 input buffers and 5 output ports. Each
buffer is 4 flits long while each packet has a constant
length of 8 flits. The tests we perform send packets at a
fixed injection rate for 25000 cycles. Tests were run at
varying injection rates and with varying traffic patterns.
Three types of traffic patterns were run: uniform,
transpose, and hotspot. In uniform traffic, a node receives a
packet from any other node with equal probability based on
the injection rate. In transpose traffic, a node at $(i,j)$
always sends packets to a single node at $(j,i)$. In hotspot
traffic, 10% of the packets change their destination to a
hotspot located at (0,3) and (0,4) and 10% of the packets
change their destination to a hotspot located at (7,3) and
(7,4), with the remaining 80% of the traffic being uniform.
A cycle-accurate simulator for the NoC simulation
environment was programmed in C++. Wormhole
switching was implemented with each packet comprising
header, body, and tail flits. The XY, DyAD, and FMAX
algorithms were each run in this environment.
Average packet latencies were calculated by averaging,
over all packets, the cycle on which the tail flit was
consumed by the destination core minus the cycle on which
the header flit was generated by the source core. Packet
injection rates were calculated as the total number of
packets injected divided by the finishing simulation time
and the number of nodes.
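The traffic patterns and the two reported metrics can be sketched as follows. This is an illustrative sketch, not the C++ simulator used for the experiments; in particular, how a redirected packet chooses between the two nodes of a hotspot is not stated in the text, so the uniform choice below is an assumption, as are all function names.

```python
import random

N = 8  # mesh dimension from the experimental setup

HOTSPOT_A = [(0, 3), (0, 4)]
HOTSPOT_B = [(7, 3), (7, 4)]


def transpose_destination(src):
    """A node at (i, j) always sends to the node at (j, i)."""
    i, j = src
    return (j, i)


def uniform_destination(src, rng):
    """Any node other than the source, with equal probability."""
    while True:
        dst = (rng.randrange(N), rng.randrange(N))
        if dst != src:
            return dst


def hotspot_destination(src, rng):
    """10% of packets go to each hotspot; the remaining 80% are uniform.
    Picking one of the two hotspot nodes at random is an assumption."""
    r = rng.random()
    if r < 0.10:
        return rng.choice(HOTSPOT_A)
    if r < 0.20:
        return rng.choice(HOTSPOT_B)
    return uniform_destination(src, rng)


def average_latency(records):
    """records: (header_generated_cycle, tail_consumed_cycle) per packet."""
    records = list(records)
    return sum(done - born for born, done in records) / len(records)


def injection_rate(packets_injected, finish_cycle, nodes=N * N):
    """Packets per node per cycle."""
    return packets_injected / (finish_cycle * nodes)


print(transpose_destination((2, 5)))                     # (5, 2)
print(average_latency([(0, 21), (5, 27), (10, 30)]))     # 21.0
print(round(injection_rate(112000, 25000), 3))           # 0.07
```

With 64 nodes and 25000 cycles, an injection rate of 0.07 packets per node per cycle corresponds to 112000 injected packets in total.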

5.2 Analysis

The first set of experiments was run under uniform
traffic with packet injection rates varying from 0.06 to 0.08
packets per node per cycle. Under uniform traffic, XY
performs significantly better than adaptive algorithms, as
reported in earlier works [8,9]. However, FMAX performs
significantly better than DyAD, not only in terms of
increasing the maximum sustainable packet injection rate
but also in lowering average latencies across all packet
injection rates, as shown in Fig. 8. Fig. 8(a) also shows
that the FMAX algorithm increases the capacity of the NoC
over DyAD by approximately 0.004 packets per node per
cycle. This increase is a significant step towards the
performance of the deterministic XY algorithm, which is
approximately 0.009 better than DyAD. At lower packet
injection rates, as shown in Fig. 8(b), FMAX also reduces
average latencies by 2 to 10 cycles. This occurs because
even at low injection rates, algorithms which consider
routing and arbitration separately are unable to reduce link
stalls, but FMAX can via increased matching.

Figure 8. Average packet latency under uniform traffic, panels (a) and (b)
The next set of experiments evaluated the three
algorithms under transpose traffic with packet injection
rates ranging from 0.026 to 0.04, as shown in Fig. 9. Under
transpose traffic conditions, FMAX produces only slightly
better results than DyAD. Each transpose traffic source has
only one possible destination. The nature of this traffic
causes the two algorithms to have similar performance.
Neither uniform nor transpose traffic simulates realistic
application-based traffic. Multiple hotspots are likely
to occur under more realistic traffic conditions. To
simulate these types of situations, we use hotspot traffic
with hotspots located on the East and West edges of the
NoC. This type of traffic pattern tests the ability of an
adaptive routing algorithm to avoid congested areas, and
thus, as shown in Fig. 10(a), XY routing performs the worst.
However, since FMAX integrates the congestion avoidance
mechanism that is presented in DyAD on top of increased
matching, it still increases the routing capacity of an NoC
over DyAD. At lower injection rates, FMAX is consistently
able to give lower average latencies than either of the other
two algorithms, as shown in Fig. 10(b).

6. Conclusion and future work

We applied maximum matching towards our distributed
routing algorithm for NoC. As shown in our proposed
decision flow, the implemented routing algorithm, FMAX,
can increase matching between input buffers and output
ports while considering congestion avoidance and relief.
FMAX maintains or increases the performance of an
adaptive routing algorithm in the presence of hotspots.
At the same time, increased matching allows FMAX to
lower average packet latencies even at lower packet
injection rates.
More single cycle techniques can be integrated while
retaining the possibility of maximum matching. This
remains as our future work.

7. Acknowledgements

This work was partially supported by the National
Science Council, under Grants 95-2221-E002-365, and 96-
2221-E002-247, and Springsoft Inc.

8. References

[1] R. Ho, K. W. Mai, and M. A. Horowitz, “The Future of Wires”, Proceedings of the IEEE, 89(4):490-504, Apr. 2001.
[2] L. Benini and G. De Micheli, “Networks on Chips: a New SoC paradigm”, IEEE Computer, 35(1):70-78, Jan. 2002.
[3] G. De Micheli and L. Benini, Networks on Chips, Morgan Kaufmann, 2006.
[4] J. Nurmi, H. Tenhunen, J. Isoaho, and A. Jantsch, Interconnect-Centric Design for Advanced SoC and NoC, Kluwer Academic Publishers, 2005.
[5] U. Ogras, J. Hu, and R. Marculescu, “Key Research Problems in NoC Design: A Holistic Perspective”, in Proceedings of the CODES+ISSS, pp. 69-74, Sept. 2005.
[6] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, “Routing Table Minimization for Irregular Mesh NoCs”, in Proceedings of the DATE, pp. 1-6, Apr. 2007.
[7] M. Daneshtalab et al., “Ant colony based routing architecture for minimizing hot spots in NOCs”, in Proceedings of the ASAP, pp. 33-38, Sept. 2006.
[8] C. J. Glass and L. M. Ni, “The Turn Model for Adaptive Routing”, Journal of ACM, 41(5):874-902, Sept. 1994.
[9] G. M. Chiu, “The odd-even turn model for adaptive routing”, IEEE Transactions on Parallel and Distributed Systems, 11(7):729-738, July 2000.
[10] J. Hu and R. Marculescu, “DyAD–Smart Routing for Networks-on-Chip”, in Proceedings of the DAC, pp. 260-263, 2004.
[11] L. R. Ford and D. R. Fulkerson, Flows in Networks, Princeton University Press, 1962.
[12] I. Keslassy, S.-T. Chuang, and N. McKeown, “A Load-Balanced Switch with an Arbitrary Number of Linecards”, in Proceedings of IEEE INFOCOM, Vol. 3, pp. 2007-2016, March 2004.
[13] L. Li, “A parallel algorithm for finding a maximum flow in 0-1 networks”, in Proceedings of the ACM Annual Computer Science Conference, pp. 231-234, 1987.
[14] A. Mekkittikul and N. McKeown, “A practical scheduling algorithm to achieve 100% throughput in input-queued switches”, in Proceedings of IEEE INFOCOM, Vol. 2, pp. 792-799, Apr. 1998.
[15] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms, McGraw-Hill Book Company, 1990.




Figure 9. Average packet latency under transpose traffic
Figure 10. Average packet latency under hotspot traffic, panels (a) and (b)