Bandwidth-aware routing algorithms for networks-on-chip platforms

Published in IET Computers & Digital Techniques
Received on 6th July 2008
Revised on 2nd April 2009
doi:10.1049/iet-cdt.2008.0082
In Special Issue on Networks on Chip
ISSN 1751-8601
Bandwidth-aware routing algorithms for
networks-on-chip platforms
M. Palesi (1), S. Kumar (2), V. Catania (1)
(1) Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, University of Catania, Italy
(2) Department of Electronics and Computer Engineering, School of Engineering, Jönköping University, Jönköping, Sweden
E-mail: mpalesi@diit.unict.it
Abstract: General purpose routing algorithms for a network-on-chip (NoC) platform may not be able to provide
sufficient performance for some communication-intensive applications. This may be because of the low adaptivity
offered by a general purpose routing algorithm, which results in some links becoming highly congested. In this study
the authors demonstrate that it is possible to design highly efficient application-specific routing algorithms
which distribute traffic more uniformly by using information regarding the application's communication behaviour
(communication topology and communication bandwidth). The authors use off-line analysis to estimate the
expected load on the various links in the network. The result of this analysis is used along with the available
routing adaptivity in each router to distribute less traffic to links and paths which are expected to be
congested. The methodology for application-specific routing algorithms is extended to incorporate these
features to design highly adaptive deadlock-free routing algorithms which also distribute traffic more
uniformly and reduce network congestion. The authors discuss architectural implications and analyse the area and
power overheads of the proposed approach on the design of a table-based NoC router.
1 Introduction
Network on chip (NoC) is likely to be used in high-performance multi-core embedded systems in the near future. Many factors affect the performance achieved by an application on an NoC platform. For applications that require intensive communication among cores, the main factor affecting the overall performance of an NoC is its routing algorithm [1].
Traditionally, routing algorithms have been designed without any reference to the characteristics of the traffic which will stimulate the network. The main reason is that, in a general purpose domain, the communication traffic cannot be accurately characterised; routing algorithms are therefore designed to provide deadlock freedom under any type of traffic and to give good average performance. As a consequence, the design of the routing algorithm conservatively assumes that all the network nodes may need to communicate with each other. However, in the application-specific domain, which characterises the area of embedded systems, we assume that an accurate characterisation of the communication traffic is possible [2, 3]. The embedded system designer has good knowledge of the application which will be mapped on the system. This knowledge opens new directions in system optimisation such as, for instance, the customisation of the routing algorithm for a given application.
Based on this, APSRA, a methodology to design application-specific routing algorithms for NoC systems, was presented in [4]. However, the basic APSRA does not take into account communication attributes such as the communication bandwidth requirements of the different communicating task pairs mapped on different network nodes. Thus, the selection of the routing paths to be removed to restrict the routing function and to guarantee deadlock freedom is carried out in a blind fashion. It is equivalent to assuming that all the communications have the same bandwidth requirements. Such unawareness may lead to a bad distribution of the traffic load over the network. This is particularly true when the range of the bandwidth requirements of different communications is large. Unfortunately, this is a very frequent case in real applications. In [5], for example, the range of communication
IET Comput.Digit.Tech.,2009,Vol.3,Iss.5,pp.413–429 413
doi:10.1049/iet-cdt.2008.0082
&
The Institution of Engineering and Technology 2009
www.ietdl.org
Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on March 27,2010 at 04:43:25 EDT from IEEE Xplore. Restrictions apply.
bandwidth requirements for a Video Object Plane decoder in an MPEG-4 decoder system spans from 16 to 500 MB/s.
The performance of a routing algorithm designed using the APSRA methodology will also be greatly affected by the selection function in the router. This function should dynamically choose one among multiple admissible output ports for a new packet. We propose a new strategy for load estimation and design of the selection function. We propose that the application's communication behaviour, along with the routing function (topology, admissible paths, communication bandwidth between pairs etc.), should be analysed off-line and that selection probabilities should be assigned to each admissible output port for packets coming from a certain input port.
The above two considerations motivate this work and our proposal for improving the APSRA methodology. As the traffic characteristics of one communicating node pair are generally different from those of another pair, they should be distinguished. For this reason, we believe that emphasising the role of communication bandwidth requirements during the design of the routing algorithm adds a new degree of freedom in system performance optimisation.
2 Related work
An adaptive routing algorithm can be seen as the cascade of two main blocks which implement an adaptive routing function and a selection function (a.k.a. selection policy or selection strategy), respectively. First, a routing function computes the set of admissible output channels towards which the packet can be forwarded to reach the destination. Then, a selection function is used to select one output channel from the set of admissible output channels depending on dynamic network conditions and/or locally stored information. Both blocks have an important impact on overall network cost and performance and constitute the topic of this paper.
Regarding routing functions, many proposals for wormhole-switched networks have been presented in the literature [6–10]. Glass and Ni [7] propose a turn model for designing wormhole routing algorithms for mesh and hypercube topology networks that are deadlock and livelock free. This model was later utilised by Chiu [10] to develop the Odd–Even adaptive routing algorithm for meshes without virtual channels. In comparison with the turn model, the degree of routing adaptiveness provided by the Odd–Even model is more even for different source–destination pairs. Murali et al. [11] present a methodology to design application-specific NoCs using floorplan information. The routing function is designed by using the turn prohibition algorithm presented by Starobinski et al. [12]. Starobinski's approach assumes that all the nodes of the network communicate with each other, but this assumption is far from reality, especially in the scenario of a heterogeneous system-on-chip implementing a specific application. Another application-specific design methodology for NoC systems is presented by Srinivasan et al. [13], where virtual channels are used to deal with deadlocks. An application-specific routing algorithm named APSRA has been proposed by Palesi et al. [4]. APSRA exploits communication information to maximise the adaptivity while ensuring deadlock-free routing for an application. The COmmunication Synthesis Infrastructure (COSI) framework [14] is used to define specific interconnect design flows for a variety of applications from chips to systems. Routing is modelled in a way that is very similar to the definition of routing tables in APSRA [4]. Moreover, as in APSRA, the definition of deadlock is based on the channel dependency graph. Our current work extends the APSRA methodology to achieve the multiple objectives of maximising adaptivity and distributing traffic more uniformly over the network.
As regards selection functions, Schwiebert and Bell [15] presented a detailed simulation study of various selection functions for several fully adaptive wormhole routing algorithms for 2D meshes. The results show that the choice of selection function has a significant effect on the average message latency and saturation behaviour. Similar conclusions have been drawn by Feng and Shin [16]. An analysis of several selection functions to evaluate their influence on network performance has been carried out by Martinez et al. [17]. Improvements in network throughput (up to 10%) and in latency when the network is close to saturation (up to 40%) have been observed. Hu and Marculescu [18] propose a routing scheme called DyAD which combines the advantages of both deterministic and adaptive routing schemes. The router works in deterministic mode when the network is not congested, and switches to an adaptive mode when the network becomes congested. In [19] Ye et al. present a contention-look-ahead on-chip routing scheme that is similar to [20]. It is a non-minimal routing scheme in the sense that, based on the value of two delay penalty indices, the router chooses whether to send the packet towards a profitable route (minimal route) or a misroute (non-minimal route). The proposed approach has not been proved to be deadlock free. Unlike the other approaches, which focus on output selection, the authors of [21] investigate the impact of input selection and present a contention-aware input selection technique that improves the routing efficiency. The concept of neighbours-on-path has been defined by Ascia et al. [22] to design a new selection policy which takes decisions based on information derived from the status of the nodes belonging to the admissible paths from the current node.
There is an abundance of work on path selection with bandwidth and latency awareness [23, 24]. Extensive research on these topics has been carried out in the context of telecommunication and data networks. To the best of our knowledge, bandwidth-aware routing is a topic that has been left largely untouched in the context of on-chip interconnection networks.
Except for APSRA, none of the aforementioned works exploits application information to optimise the routing algorithm.
Although APSRA uses communication topology and communication concurrency information, there are other important features that could be exploited to improve the effectiveness of a routing algorithm. Communication bandwidth is one of them. General routing algorithms assume that all the communications are characterised by the same bandwidth requirements. This behaviour is rarely observed in real applications. For instance, looking at the task graph of the multimedia application [25] shown in Fig. 7, communication bandwidth requirements range from 10 to 500 MB/s. To the best of our knowledge, there are no contributions aimed at improving the performance of routing algorithms by exploiting communication bandwidth information. This paper contributes in this direction by presenting a methodology to design application-specific and bandwidth-aware routing functions along with a novel selection policy.
This paper is an extension of [26]. Extensions include power, area and timing analysis of the router implementing the proposed routing scheme; delay, throughput and energy analysis of the NoC; and both an informal and a formal description of the methodology.
3 Terminology and problem formulation
Simply stated, for a given application and a given network topology, the goal is to generate a routing algorithm which is strongly adaptive and spreads the traffic over the network in such a way that the communication traffic of any link will not exceed its capacity (maximum sustainable bandwidth). To formulate the problem more formally, we borrow the following terms from [4].
The communication graph, $CG = G(T, C)$, is a directed graph, where $T$ is the set of tasks and $C$ is the set of communications. Each communication $c_{i,j} = (t_i, t_j) \in C$ connects task $t_i \in T$ to task $t_j \in T$. For a communication $c \in C$, the function $B(c)$ returns the bandwidth requirement, that is, the minimum bandwidth that should be allocated by the network in order to meet the performance constraints for communication $c$.
The topology graph, $TG = G(N, L)$, is a directed graph which models the network topology. $N$ is the set of network nodes, and $L$ is the set of network channels. Channel $l_{i,j} = (n_i, n_j)$ connects node $n_i \in N$ to node $n_j \in N$. Given a channel $l \in L$, the function $Cap(l)$ returns its capacity.
The mapping function, $M: T \to N$, maps tasks to network nodes [e.g. if $M(t_i) = n_j$ then task $t_i$ is mapped on node $n_j$ of the network].
3.1 Link load estimation
As we are dealing with adaptive routing, the required bandwidth for communication $c$ is split over the multiple paths that the routing function allows for $c$. As an example, consider Fig. 1, which shows a 4 × 2 mesh-based network topology. Let us suppose that communication $c = (t_s, t_d)$ requires a bandwidth of 100 MB/s and that the routing function allows all the minimal paths from node $n_s = M(t_s)$ to node $n_d = M(t_d)$ (four paths in total). The load is distributed over the paths as shown in Fig. 1, which reports, for each network channel, the effective bandwidth (or effective load) (EB) and the total number of paths containing that channel. Formally, the effective bandwidth of a channel $l \in L$ because of a communication $c \in C$ can be computed as
$$EB(c, l) = B(c) \times \frac{|PT(c, l)|}{|P(c)|}$$
where $P(c)$ denotes the set of minimal paths admitted by the routing function for communication $c$, and $PT(c, l) = \{P \in P(c) : l \in P\}$ is the pass through link set, that is the set of paths of $c$ which contain the link $l$. Finally, we indicate with $AB(l)$ the aggregate bandwidth of $l$, which is computed as
$$AB(l) = \sum_{c \in C} EB(c, l)$$
Using these definitions, the bandwidth-aware routing algorithm problem should meet the following constraint. Given a communication graph $CG$, a topology graph $TG$ and a mapping function $M$, find a routing function $R$ which is deadlock free and such that
$$\forall l \in L:\; AB(l) \le Cap(l) \qquad (1)$$
that is, the communication load of any channel $l$ must not exceed its capacity $Cap(l)$.
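The definitions of EB, AB and constraint (1) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: nodes are (x, y) tuples in a 2D mesh, every minimal path is admitted, all links share one capacity, and the function names (`minimal_paths`, `effective_bandwidth`, `aggregate_bandwidth`, `overloaded_links`) are hypothetical helpers, not part of the methodology in [4].

```python
from collections import defaultdict

def minimal_paths(src, dst):
    """Enumerate every minimal path from src to dst in a 2D mesh.

    Nodes are (x, y) tuples; a path is a list of directed links (node, node).
    """
    if src == dst:
        return [[]]
    (sx, sy), (dx, dy) = src, dst
    paths = []
    step_x = (dx > sx) - (dx < sx)  # -1, 0 or +1 towards the destination
    step_y = (dy > sy) - (dy < sy)
    if step_x:
        nxt = (sx + step_x, sy)
        paths += [[(src, nxt)] + rest for rest in minimal_paths(nxt, dst)]
    if step_y:
        nxt = (sx, sy + step_y)
        paths += [[(src, nxt)] + rest for rest in minimal_paths(nxt, dst)]
    return paths

def effective_bandwidth(bandwidth, paths):
    """EB(c, l) = B(c) * |PT(c, l)| / |P(c)| for every link l used by c."""
    eb = defaultdict(float)
    for path in paths:
        for link in path:
            eb[link] += bandwidth / len(paths)
    return dict(eb)

def aggregate_bandwidth(comms):
    """AB(l): sum of EB(c, l) over all communications, given as (B(c), P(c))."""
    ab = defaultdict(float)
    for bandwidth, paths in comms:
        for link, load in effective_bandwidth(bandwidth, paths).items():
            ab[link] += load
    return dict(ab)

def overloaded_links(ab, capacity):
    """Links violating constraint (1), assuming one capacity for all links."""
    return sorted(link for link, load in ab.items() if load > capacity)

# One 100 MB/s communication across a 4 x 2 mesh (corner to corner).
paths = minimal_paths((0, 0), (3, 1))
ab = aggregate_bandwidth([(100.0, paths)])
```

With these inputs the first eastward link out of the source is shared by three of the four paths and therefore carries 75 MB/s, while the southward link carries 25 MB/s.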
4 The proposed methodology
In this section we provide a high-level overview of the proposed methodology and discuss its assumptions and limitations.
Figure 1 Effective bandwidth for a communication from node $n_s$ to node $n_d$ at 100 MB/s assuming a fully adaptive minimal routing
4.1 Overview
An overview of the proposed methodology is shown in Fig. 2. The application is modelled by means of a communication graph. The communication graph, together with the topology graph and a mapping function which defines where each task is mapped on the NoC, represents the input of the proposed methodology. This information is used to build the application-specific channel dependency graph (ASCDG) [4]. If it contains cycles, they are iteratively broken by removing application-specific dependencies selected by means of a procedure that will be discussed in Section 5.1. The heuristic behind this procedure is to assign more adaptivity to communications characterised by higher communication bandwidth requirements. As soon as all the cycles have been removed, the routing function is deadlock free. Then, a link load analysis is performed to identify links on which the aggregate bandwidth exceeds the link capacity. In this case a load balancing procedure, which will be described in Section 5.2, is used to selectively remove routing paths and to reduce the aggregate bandwidth on overloaded links. At the same time, it tries to allocate alternative routing paths in such a way that the load is distributed almost equally among links. As a result a new deadlock-free routing function is obtained. Finally, a set of selection probabilities, which will be used by the selection policy described in Section 5.3, is computed.
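The cycle check on the ASCDG can be sketched in Python as follows. This is a minimal sketch that assumes a dependency exists from link $l_i$ to link $l_j$ whenever some admissible path of some communication uses $l_i$ immediately followed by $l_j$; the helper names `build_ascdg` and `has_cycle` are hypothetical and the precise ASCDG construction is the one given in [4].

```python
def build_ascdg(all_paths):
    """Build a dependency graph: edge l_i -> l_j for consecutive links in a path."""
    graph = {}
    for path in all_paths:
        for link in path:
            graph.setdefault(link, set())
        for a, b in zip(path, path[1:]):
            graph[a].add(b)
    return graph

def has_cycle(graph):
    """Depth-first search with three colours; a grey successor closes a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}

    def visit(v):
        colour[v] = GREY
        for w in graph[v]:
            if colour[w] == GREY or (colour[w] == WHITE and visit(w)):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in graph)

# Four two-hop paths turning around a 2 x 2 block of nodes.
a = ((0, 0), (1, 0))
b = ((1, 0), (1, 1))
c = ((1, 1), (0, 1))
d = ((0, 1), (0, 0))
paths = [[a, b], [b, c], [c, d], [d, a]]

cyclic = has_cycle(build_ascdg(paths))           # all four turns are used
acyclic = not has_cycle(build_ascdg(paths[:3]))  # one turn prohibited
```

In this toy case, dropping the single path that uses the dependency from `d` to `a` is enough to make the graph acyclic.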
4.2 Assumptions and scope
In this work two important issues are not covered. The first is related to the way in which the communication characteristics induced by the application are modelled, and the second concerns the out-of-order delivery problem which characterises any adaptive routing algorithm.
Figure 2 Block diagram of the proposed design flow
In the overview of the proposed methodology we assumed that the input application is already mapped and scheduled on the NoC platform before the design of the routing algorithm starts. We also assumed that the communication volume between the various tasks (and hence between the various cores after the mapping step) has already been determined using application profiling. It should be pointed out that, although a bandwidth-annotated communication graph (also known as the core graph, the communication task graph or the application characterisation graph) is generally used as the entry point in many design methodologies [2, 3, 27], the application profiling task, which determines the communication volume between the various tasks (even before the application and its communication are mapped onto the platform), is still an open issue. In this context, the design space exploration tool from hArtes could be useful [28]. Another example is the task graph extraction (TGE) tool from Princeton [29].
The way in which communications are characterised in this work constitutes a simplification of the problem. In fact, a communication is characterised in terms of its maximum bandwidth requirement only, without considering other important communication attributes such as burstiness. This simplified model of communication behaviour results in a pessimistic analysis, as we assume that a communication will demand the same bandwidth (the maximum bandwidth) for its entire lifetime. Further, the assumption that all communications are potentially concurrent results in an exaggerated communication traffic density which may never occur if communication dependencies are taken into consideration. This may reduce the actual benefit of our schemes when applied to real applications in which the degree of communication concurrency is lower.
The second open issue in this work is related to the routing algorithm. Although the routing algorithm we propose is multi-path, we do not take into consideration the mechanisms required for reordering packets at the destination. To cope with the out-of-order packet delivery problem which characterises any adaptive routing algorithm, one possibility is to use the re-ordering mechanism at network re-convergent nodes proposed by Murali et al. [30]. In this case the routing function must be restricted in such a way as to remove all the intersecting paths for each source/destination pair. However, this would strongly impact the effectiveness of the proposed routing algorithm, since one of its main benefits (high adaptivity) would be reduced. In this work, however, we distinguish application performance from network performance, although the former depends on the latter. Our focus is to improve network performance (network latency and throughput) and not application performance. That is, the proposed routing method, like other adaptive routing algorithms, is more useful to applications which can tolerate out-of-order delivery of packets.
5 Bandwidth-aware routing algorithm
In this section we present our proposal for designing highly adaptive, deadlock-free and bandwidth-aware routing algorithms. The section is organised in three subsections. The first subsection presents the strategy used to select and remove dependencies in the ASCDG so as to minimise the amount of bandwidth that must be redistributed among the remaining routing paths. The second subsection deals with the problem of checking and recovering when the aggregate bandwidth on some network links exceeds the link capacity. Finally, the last subsection describes a new selection function aimed at exploiting the peculiarities of the proposed routing function.
5.1 Bandwidth-aware routing function
A cycle in the ASCDG is a succession of application-specific direct dependencies $D = \{d_1, d_2, \ldots, d_n\}$, where each $d \in D$ is a pair $(l_i, l_j)$ with $l_i, l_j \in L$. Here the problem is the selection of the best dependency to be removed to break the cycle $D$.
Removing a dependency means removing all the paths which use that dependency. As soon as a path is removed, the fraction of bandwidth it transports must be redistributed between the remaining paths. For instance, suppose that the direct dependency $d$ between channel $l_i$ and channel $l_j$ in Fig. 1 must be removed to break a cycle in the ASCDG. Removing $d$ means prohibiting path 3. As soon as path 3 is removed, the 25 MB/s it transported is redistributed between path 1 and path 2 as shown in Fig. 3a. The idea we propose in this paper is to choose and remove the dependency $d$ which minimises the additional bandwidth that must be allocated to the remaining paths that do not use the dependency $d$.
Figure 3 Bandwidth allocation
a After removing the channel dependency from $l_i$ to $l_j$ in Fig. 1
b After removing path 2 from a
Formally, let us indicate with $PT_2(c, d)$ the pass through dependency set, that is the set of paths of $c$ which use the dependency $d = (l_1, l_2)$:
$$PT_2(c, d) = PT(c, l_1) \cap PT(c, l_2)$$
Let $d$ be an application-specific direct dependency. To remove $d$, all the paths of any communication $c$ which use $d$ must be removed. For communication $c$ the aggregate bandwidth to be redistributed is $[B(c)/|P(c)|] \times |PT_2(c, d)|$. This bandwidth is redistributed between the $|P(c)| - |PT_2(c, d)|$ remaining paths which do not use the dependency $d$. Based on this, the dependency to be removed is the $d \in D$ such that the cost function
$$cost(d) = \sum_{c \in C} \frac{B(c)\,|PT_2(c, d)|}{|P(c)|\,\bigl(|P(c)| - |PT_2(c, d)|\bigr)} \qquad (2)$$
is minimised. This ensures that the dependency chosen for removal is the one for which redistributing the load of the paths that use it results in the minimum increase in load on the alternative paths.
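As a worked illustration of cost function (2), the following Python sketch evaluates the cost of two candidate dependencies for the single 100 MB/s communication of the 4 × 2 mesh example. The helper names (`links`, `dependency_cost`) and the convention of returning infinity when a removal would leave a communication without any path are assumptions made for this sketch only.

```python
def links(nodes):
    """Turn a node sequence into the list of directed links it traverses."""
    return list(zip(nodes, nodes[1:]))

def dependency_cost(dep, comms):
    """Evaluate Eq. (2) for dep = (l1, l2); comms is a list of (B(c), P(c))."""
    l1, l2 = dep
    total = 0.0
    for bandwidth, paths in comms:
        n_paths = len(paths)
        # |PT_2(c, d)|: paths of c that contain both links of the dependency
        n_hit = sum(1 for p in paths if l1 in p and l2 in p)
        if n_hit == 0:
            continue
        if n_hit == n_paths:
            return float("inf")  # assumption: removal would make c unroutable
        total += bandwidth * n_hit / (n_paths * (n_paths - n_hit))
    return total

# The four minimal paths from (0, 0) to (3, 1) in a 4 x 2 mesh.
paths = [
    links([(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)]),
    links([(0, 0), (1, 0), (2, 0), (2, 1), (3, 1)]),
    links([(0, 0), (1, 0), (1, 1), (2, 1), (3, 1)]),
    links([(0, 0), (0, 1), (1, 1), (2, 1), (3, 1)]),
]
comms = [(100.0, paths)]

turn = (((1, 0), (2, 0)), ((2, 0), (2, 1)))      # used by one path
straight = (((0, 0), (1, 0)), ((1, 0), (2, 0)))  # used by two paths

cost_turn = dependency_cost(turn, comms)          # 100 * 1 / (4 * 3)
cost_straight = dependency_cost(straight, comms)  # 100 * 2 / (4 * 2)
```

Here the turn used by a single path is the cheaper dependency to remove, since each of the three surviving paths only has to absorb about 8.3 MB/s, against 25 MB/s per surviving path for the other candidate.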
The cycle-breaking algorithm is shown in Fig. 4. First, all the cycles of the ASCDG are detected by the function GetAllCycles and stored in the list cycles. Then, the so-called enumeration tree is built. The meaning of the enumeration tree is as follows. The order in which the cycles in the ASCDG are treated determines both the overall adaptivity of the generated routing algorithm and the routability of all the communications. More precisely, with regard to the second point, certain cycle removal sequences might make some communications unroutable. In our implementation we used a back-tracking mechanism in which removal sequences are generated by performing a depth-first search of the solution space. Fig. 5 shows the enumeration tree generated by four cycles $c_1, c_2, c_3, c_4$. If, for instance, the removal sequence $c_1 \to c_2$ causes reachability problems then the sub-tree under $c_1 \to c_2$ is not considered for further analysis. The back-tracking mechanism returns to $c_1$. If the removal sequence $c_1 \to c_3 \to c_2$ results in a reachability problem then the back-tracking mechanism returns to $c_3$. If the removal sequence $c_1 \to c_3 \to c_4 \to c_2$ is feasible (i.e. it does not result in reachability problems) the procedure terminates. The steps to break all the cycles of the ASCDG start from line 6 in Fig. 4. First, a backup of ASCDG, C and P is performed. Then, a cycle sequence cseq is extracted from the enumeration tree. The steps from lines 10 to 22 remove all the cycles in the same sequence as defined by cseq. For each such cycle, only the channel dependencies that, if removed, do not cause reachability problems are considered. This check is performed by ensuring that there is no communication whose routing paths all use that channel dependency (line 13). Thus, the channel dependency, $d_0$, which minimises the cost function (2) is selected and removed from the ASCDG (line 27). Then, all the routing paths which use $d_0$ are removed from the set of admissible paths (line 28). In case of reachability problems (line 24), the ASCDG, C and P are restored and the sub-tree of the enumeration tree whose root is the cycle whose removal has caused reachability problems is pruned (line 25). In this case a new iteration is performed with a new sequence of cycles (line 6).
Figure 4 Break cycles algorithm
Figure 5 Enumeration of cycle sequences for four cycles
The overall time complexity of the algorithm is $O(2^n)$, where $n$ depends on the size of the rectangle containing the source and destination nodes (in the case of a mesh-based topology). The complexity of the proposed approach is due not to the heuristic itself but to the computation of the ASCDG. The construction of the ASCDG involves the annotation of each minimal path between any source/destination pair defined in the communication graph. The basic assumption is that we start from a minimal fully adaptive routing algorithm. Thus, as the NoC size increases, the approach could become infeasible if some nodes located far from each other need to communicate. It should be pointed out, however, that this is the worst-case condition. In fact, any topological mapping algorithm tries to map the most frequent and most critical communications in such a way as to minimise the physical distance between the source and destination nodes. For long-distance communications, one can consider a subset of all the minimal paths. A detailed analysis of the complexity of building the ASCDG has been presented in our previous work [4].
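To give a feel for how quickly the number of minimal paths to annotate grows with the source/destination distance, the short Python sketch below counts them for a 2D mesh. It relies on the standard combinatorial fact that the number of minimal paths between two mesh nodes separated by dx columns and dy rows is the binomial coefficient C(dx + dy, dx); the function name `minimal_path_count` is a hypothetical helper, and the sketch does not reproduce the complexity analysis of [4].

```python
from math import comb

def minimal_path_count(dx, dy):
    """Number of minimal paths between mesh nodes dx columns and dy rows apart."""
    return comb(dx + dy, dx)

# Distances taken from the 4 x 2 example plus two larger, purely illustrative cases.
counts = {(dx, dy): minimal_path_count(dx, dy)
          for dx, dy in [(3, 1), (4, 4), (7, 7)]}
```

The corner-to-corner pair of the 4 × 2 example has 4 minimal paths, whereas a pair that is 7 hops apart in each dimension already has 3432.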
5.2 Bandwidth reallocation
Using the procedure discussed in the previous subsection, we obtain a routing function which is deadlock free (as the ASCDG is acyclic) and which generates a set of routing paths by providing more adaptivity to communications characterised by higher communication bandwidth. However, it is possible that the aggregate bandwidth on some network links exceeds the capacity of these links [i.e. condition (1) is not satisfied for some $l \in L$]. In this case some routing paths passing through such a link must be removed to reduce the aggregate bandwidth on that link down to the link's capacity or, more generally, down to a user-defined value. For instance, looking again at Fig. 3a, if either the capacity of the network links is 50 MB/s or we want the link load not to exceed 50 MB/s, path 2 should be removed as shown in Fig. 3b.
The proposed bandwidth reallocation algorithm is shown in Fig. 6. The input parameters are the set of network links, the set of communications, the set of admissible paths derived from the procedure described in the previous subsection and a threshold which defines the maximum bandwidth which must not be exceeded on any network link. The output is the updated set of routing paths.
The procedure starts by sorting the network links in descending order of aggregate bandwidth. For each link $l$ and for each communication $c$ which has at least one path using $l$, and more than one path, two lists named paths2rem and paths2enr are generated as follows. paths2rem contains all the paths for $c$ that should be removed as they use network links whose load exceeds the threshold. paths2enr contains those paths that can be used by other communications (i.e. can be enriched) as they use links whose load is below the threshold. Then, the list paths2rem is scanned and the routing paths belonging to it are removed from P. Of course, removing a path causes the redistribution of the bandwidth allocated on it to the other paths belonging to paths2enr (see, for example, Fig. 3). Thus, the path elimination stops when there is at least one path in paths2enr that contains a link whose load exceeds the threshold. The above steps are repeated until the load on each link does not exceed the threshold. The procedure aborts if the path elimination step cannot be carried out because of reachability issues, which arise when the path to be removed is the only one left for a certain communication. Although the presented algorithm assumes that all the network links have the same capacity, it is simple to generalise by replacing the scalar input parameter threshold with a function $T: L \to \mathbb{R}$ which returns the bandwidth threshold associated with any channel $l \in L$. In this case, the condition $AB(l) > threshold$ in lines 5, 15 and 24 is replaced with $AB(l) > T(l)$.
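The general idea of removing paths until no link exceeds the threshold can be sketched in Python as follows. This is a deliberately simplified greedy stand-in and not the algorithm of Fig. 6: it keeps no paths2rem or paths2enr lists, it always targets the most loaded link, and it drops the candidate path whose links carry the largest summed load. The names `links`, `link_loads` and `reallocate` are hypothetical.

```python
from collections import defaultdict

def links(nodes):
    """Turn a node sequence into the list of directed links it traverses."""
    return list(zip(nodes, nodes[1:]))

def link_loads(comms):
    """AB(l) for comms = {name: (B(c), [path, ...])}, splitting B(c) evenly."""
    ab = defaultdict(float)
    for bandwidth, paths in comms.values():
        for path in paths:
            for link in path:
                ab[link] += bandwidth / len(paths)
    return ab

def reallocate(comms, threshold):
    """Greedily drop paths until no link load exceeds threshold."""
    comms = {name: (bw, list(paths)) for name, (bw, paths) in comms.items()}
    while True:
        ab = link_loads(comms)
        hot = max(ab, key=ab.get, default=None)
        if hot is None or ab[hot] <= threshold:
            return comms
        # Paths crossing the hottest link whose communication has an alternative.
        candidates = [(name, path)
                      for name, (_, paths) in comms.items() if len(paths) > 1
                      for path in paths if hot in path]
        if not candidates:
            raise ValueError("threshold cannot be met without losing reachability")
        name, victim = max(candidates,
                           key=lambda item: sum(ab[l] for l in item[1]))
        comms[name][1].remove(victim)

# The four minimal paths of the 4 x 2 example, 100 MB/s in total.
paths = [
    links([(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)]),
    links([(0, 0), (1, 0), (2, 0), (2, 1), (3, 1)]),
    links([(0, 0), (1, 0), (1, 1), (2, 1), (3, 1)]),
    links([(0, 0), (0, 1), (1, 1), (2, 1), (3, 1)]),
]
result = reallocate({"c": (100.0, paths)}, threshold=60.0)
remaining = result["c"][1]
```

For this input the sketch keeps the two link-disjoint paths, each carrying 50 MB/s. Each iteration removes one path, so the loop terminates either with a feasible set of paths or with an error when the only paths left on the hottest link are unique for their communication.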
Figure 6 Bandwidth reallocation algorithm
5.3 Load balancing selection function
To be effective, a good routing function must be coupled with an intelligent selection function. In fact, selection schemes strongly affect the overall performance of any adaptive routing algorithm [15–17]. Generally, selection policies take decisions based on on-line measurement or estimation of traffic density. However, such estimation is a costly and difficult task.
One way to implement the selection function is to distribute packets randomly over the admissible output ports. However, this selection policy can lead to a large load imbalance on network links and, in practice, degrade network performance. On-line information about traffic density and congestion on the paths leading to the packet destination can be useful in selecting the appropriate admissible port. Most current approaches use local information regarding the usage of the buffer associated with an output port in the router (or in the neighbouring router in that direction) as a measure of the communication traffic in that direction [18]. Some approaches use more elaborate look-ahead strategies for this purpose [22]. These selection strategies give better latency performance, especially when the communication volume is high.
The idea behind the proposed selection policy can be summarised by means of an example. Consider again Fig. 1 and suppose that all four minimal paths from node $n_s$ to node $n_d$ are allowed by the routing function. When $n_s$ receives a header flit destined to $n_d$, the routing function returns the set {east, south} as the set of admissible output channels. Now suppose that the router in node $n_s$ knows the number of admissible paths that reach node $n_d$ starting from channel east and from channel south, respectively. In our example, there are three paths from east and one path from south. The selection policy should therefore use the east output channel with higher probability than the south output channel (e.g. use the east port with probability 0.75 and the south port with probability 0.25). Formally, let $\xi$ be a uniformly distributed random variable in the interval [0,1], and $\{l_1, l_2, \ldots, l_n\}$ the set of admissible output channels
Figure 7 Communication graph of the MMS
returned by the routing function; then the selection function is defined as
$$S(l_1, l_2, \ldots, l_n) = l_i, \quad i : \xi \in \left[ \sum_{j=1}^{i-1} \Pr\{l_j\},\ \sum_{k=1}^{i} \Pr\{l_k\} \right] \tag{3}$$
where $\Pr\{l\}$ indicates the probability of selecting output channel $l$, which is proportional to the number of admissible paths that start from $l$ and can be used to reach the destination. These probabilities are computed off-line and stored in the router, as discussed in Section 7.1.
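As an illustration of (3), the following Python sketch derives selection probabilities from path counts and then picks a channel by locating a uniform random number among the cumulative probabilities. It is only a sketch under stated assumptions: the mesh is taken to be unrestricted, so that the number of minimal paths for remaining offsets (dx, dy) is the binomial coefficient C(dx+dy, dx), and the offsets used below are hypothetical values chosen so that the counts match the example (three paths via east, one via south). All function names are illustrative and do not come from the paper.

```python
import random
from math import comb


def minimal_paths(dx: int, dy: int) -> int:
    """Minimal paths in an unrestricted mesh with dx and dy hops still to go (assumption)."""
    return comb(dx + dy, dx)


def selection_probabilities(path_counts: dict[str, int]) -> dict[str, float]:
    """Pr{l} proportional to the number of admissible paths starting from channel l."""
    total = sum(path_counts.values())
    return {channel: n / total for channel, n in path_counts.items()}


def select(probs: dict[str, float], xi: float) -> str:
    """Return the channel l_i whose cumulative-probability interval contains xi."""
    upper = 0.0
    chosen = None
    for channel, p in probs.items():
        chosen = channel
        upper += p
        if xi <= upper:
            break
    return chosen  # the last channel also absorbs round-off when xi is close to 1


# Hypothetical offsets: the destination is 3 hops east and 1 hop south of the source.
counts = {
    "east": minimal_paths(2, 1),   # after one hop east: 2 east + 1 south remain
    "south": minimal_paths(3, 0),  # after one hop south: 3 east remain
}
probs = selection_probabilities(counts)
rng = random.Random(0)
print(counts, probs, select(probs, rng.random()))
```

In a router these probabilities would not be computed at run time; as the text states, they are computed off-line and only the resulting values are stored.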
6 Evaluation and results
We evaluate the proposed approach on both synthetic and real traffic scenarios. As synthetic traffic scenarios, we consider uniform, transpose, bit-reversal, shuffle, butterfly and hot-spot [31]. For these, the bandwidth of each communicating pair has been randomly generated between 10 and 100 MB/s. As a more realistic communication scenario, we consider a generic multimedia system (MMS) which includes an H.263 video encoder, an H.263 video decoder, an mp3 audio encoder and an mp3 audio decoder [25]. The communication graph of the MMS is depicted in Fig. 7. It has been partitioned into 40 distinct tasks which have been mapped on a 5 × 5 mesh-based NoC using the mapping technique proposed in [32].
In the following, APSRA denotes the approach proposed in [4], APSRA-BW the variant of APSRA that uses the heuristic presented in Section 5.1, and APSRA-BWL the version of APSRA-BW augmented with the bandwidth reallocation procedure discussed in Section 5.2. This section is organised in two subsections. In the first, we perform a bandwidth analysis to show how the proposed approach (i) distributes the communication bandwidth uniformly over the network links and (ii) prevents the bandwidth allocated on network links from exceeding the link capacity. In the second, we perform a dynamic analysis using a flit-accurate NoC simulator to show the performance improvements in terms of both delay and throughput.
6.1 Bandwidth analysis
Let us start by showing the effectiveness of the proposed approach in distributing the traffic uniformly over the network. As a metric, we use the standard deviation of the aggregate bandwidth in the network links. Using this metric, we compare APSRA, APSRA-BW and APSRA-BWL on an 8 × 8 mesh-based NoC under different traffic scenarios. For APSRA-BWL, we fix the threshold to 90% of the maximum aggregate bandwidth obtained when fully adaptive minimal routing is used.
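The metric can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: it assumes that each flow's bandwidth is split over its admissible paths with known shares, that the population standard deviation is meant, and that links are named by hypothetical (from, to) node pairs. The threshold in the example is taken as 90% of the largest load in the toy data, which only stands in for the fully adaptive reference used in the text.

```python
from collections import defaultdict
from statistics import pstdev


def aggregate_link_bandwidth(flows):
    """Sum the bandwidth that each flow places on each link.

    flows: list of (bandwidth_mbps, weighted_paths), where weighted_paths is a
    list of (links, share) and the shares of one flow sum to 1 (assumption).
    """
    load = defaultdict(float)
    for bandwidth, weighted_paths in flows:
        for links, share in weighted_paths:
            for link in links:
                load[link] += bandwidth * share
    return dict(load)


def bandwidth_std(load, all_links):
    """Population standard deviation of the aggregate bandwidth over all links."""
    return pstdev([load.get(link, 0.0) for link in all_links])


def links_over_threshold(load, threshold):
    """Links whose aggregate bandwidth exceeds the threshold."""
    return sorted(link for link, bw in load.items() if bw > threshold)


# Hypothetical toy data: one flow on a single path, one flow split over two paths.
flows = [
    (100.0, [([("a", "b"), ("b", "c")], 1.0)]),
    (50.0, [([("a", "b")], 0.5), ([("a", "d")], 0.5)]),
]
all_links = [("a", "b"), ("b", "c"), ("a", "d")]
load = aggregate_link_bandwidth(flows)
print(load)
print(bandwidth_std(load, all_links))
print(links_over_threshold(load, 0.9 * max(load.values())))
```

A lower standard deviation indicates a more even spread of bandwidth, and the list of links over the threshold corresponds to the bandwidth violations counted later in this subsection.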
For each traffic scenario, Table 1 reports the percentage reduction of the standard deviation of the aggregated bandwidth in network links when APSRA-BW and APSRA-BWL are used. As can be seen, the proposed heuristic for breaking cycles of the ASCDG distributes the bandwidth better across the network. In some situations there is no reduction in standard deviation for APSRA-BW. This is the case for transpose and butterfly traffic, for which the ASCDG is acyclic and the cutting edge heuristic is never invoked. On average, the standard deviation of the aggregated bandwidth in network links decreases by 10%. An additional improvement of 2% is obtained when the bandwidth redistribution procedure is used. On the other hand, as discussed in Section 5.2, the elimination of some routing paths by the bandwidth redistribution procedure negatively affects the adaptiveness [10] of the routing function, as shown in Fig. 8. It is interesting to observe that, for some traffic scenarios such as bit-reversal and shuffle, the adaptivity of APSRA-BW is higher than that of APSRA. Although the main objective of APSRA is the maximisation of adaptivity, the heuristic it uses to break cycles stops as soon as the first solution is found. In any case, as can be observed, the average adaptivity remains much higher than that of odd–even [10].
Fig. 9 shows the aggregate bandwidth of every link of a 9 × 9 mesh-based NoC under uniform traffic for the routing algorithms generated by APSRA and by APSRA-BWL. The threshold has been fixed to 550 MB/s. As can be observed, when APSRA is used, the aggregate bandwidth in several links exceeds the threshold. If this threshold represents the network link capacity, such bandwidth excesses translate into local network congestion that, because of the back pressure mechanism together with the wormhole switching technique, propagates to the entire network and causes a strong degradation of overall network performance.
Table 1 Percentage reduction of standard deviation of the aggregated bandwidth in network links

Traffic        APSRA-BW (%)   APSRA-BWL (%)
uniform        25             27
bit-reversal   19             23
butterfly      0              2
shuffle        18             19
transpose1     0              2
transpose2     0              2
hot-spot_C     10             12
hot-spot_TR    5              10
MMS            5              5
Average        10             12
Fig. 10 shows the absolute number of network links which exceed a given threshold when APSRA, APSRA-BW and APSRA-BWL are used. As can be observed, both APSRA-BW and APSRA-BWL reduce the number of bandwidth violations as compared to APSRA. On average, the number of links exceeding the threshold when APSRA-BWL is used is about half of that obtained when APSRA is used. In particular, APSRA-BWL can meet bandwidth constraints which are almost 30% and 20% more stringent than those met by APSRA and APSRA-BW, respectively.
Figure 8 Adaptivity exhibited by odd–even and by routing algorithms generated by APSRA,APSRA-BW and APSRA-BWL
under different traffic scenarios
Figure 9 Aggregate bandwidth per link for a 9 × 9 mesh-based NoC under uniform traffic
Routing algorithm used is generated by APSRA (top) and APSRA-BWL (bottom)
6.2 Performance analysis
We now evaluate the different routing algorithms in terms of average delay. Delay is defined as the time (in clock cycles) that elapses from the injection of a header flit into the network at the source node to the reception of the tail flit at the destination node. Noxim [33] is used as the NoC simulation platform. A Poisson packet injection distribution is used for the synthetic traffic scenarios, whereas a self-similar packet injection distribution is used for the MMS scenario (self-similar traffic has been observed in the bursty traffic between on-chip modules in typical multimedia applications [34]).
Fig. 11 shows the average delay variation under uniform traffic for different ranges of communication bandwidth. That is, the bandwidth of each communicating pair has been randomly generated between the lower and upper bounds reported on the x-axis. In this experiment, both the random selection policy (RND – oblivious routing) and the load balancing selection policy (LB – adaptive routing) have been used to distinguish the effect of the selection policy. The graph also reports results for deterministic XY routing and adaptive odd–even routing. Once again, APSRA-BW and APSRA-BWL outperform APSRA. For a given average delay, APSRA-BW and APSRA-BWL are able to sustain higher-bandwidth communication traffic than APSRA. The performance improvement over XY and odd–even is even more evident.
Fig. 12 shows average delay, throughput and energy for different packet injection rate (pir) factors under the MMS traffic scenario. That is, starting from the communication graph of the application, we compute the pir of any communication $c$ as
$$\mathrm{pir}(c) = \frac{\text{communication bandwidth of } c}{\text{packet size} \times \text{flit size} \times \text{clock frequency}}$$
Thus, a point in the graph at a given pir factor $p$ is computed by simulating the network using a pir value of $p \times \mathrm{pir}(c)$ for each communication $c$. As can be observed, both the oblivious routings (odd–even and APSRA with random selection function) and the adaptive routings (APSRA-BW and APSRA-BWL with LB selection function) outperform XY deterministic routing. For instance, looking at Figs. 12a and b, moving from XY to odd–even, the pir factor which saturates the network (a network is said to start saturating when an increase in applied load does not result in a linear increase in throughput [35]) increases by 33%. An additional improvement of 25% is obtained when application-specific routing is used. Finally, the use of an effective selection function like that proposed in this paper adds a further 10% and 40% of improvement when APSRA-BW and APSRA-BWL are considered, respectively.
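The pir computation described above can be written out as a short Python sketch. The units are assumptions made here for the sake of a consistent example (bandwidth in bits per second, packet size in flits, flit size in bits, clock frequency in Hz), and the numeric values are hypothetical apart from the 64-bit flit size, which appears in Section 7.2. The function names are illustrative.

```python
def pir(bandwidth_bits_per_s: float, packet_size_flits: int,
        flit_size_bits: int, clock_hz: float) -> float:
    """Packets injected per clock cycle for one communication.

    Implements bandwidth / (packet size * flit size * clock frequency),
    with units chosen (assumption) so that the result is in packets per cycle.
    """
    return bandwidth_bits_per_s / (packet_size_flits * flit_size_bits * clock_hz)


def scaled_pir(pir_factor: float, base_pir: float) -> float:
    """pir actually simulated at a given pir factor p, i.e. p * pir(c)."""
    return pir_factor * base_pir


# Hypothetical example: 64 MB/s = 512e6 bit/s, 8-flit packets, 64-bit flits, 1 GHz clock.
base = pir(512e6, packet_size_flits=8, flit_size_bits=64, clock_hz=1e9)
print(base)                   # about 0.001 packets per cycle
print(scaled_pir(2.0, base))  # the same communication at pir factor 2
```

Sweeping the pir factor scales every communication of the application by the same amount, so the relative bandwidth requirements of the communication graph are preserved while the overall load grows.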
Fig. 12c shows the average energy per cycle per flit for different pir factors. We used the high-level energy estimation feature provided by the Noxim simulator to compute the energy numbers [22]. Note that the values beyond the saturation pir factor do not carry useful information, as there the network is congested and the flits in the network spend much of their travel time waiting in the routers' buffers. Thus, considering the range of pir factors where none of the algorithms is saturated, we observe that application-specific routing algorithms are more than 6% and 5% more energy efficient than XY and odd–even, respectively. If we restrict the analysis to APSRA, APSRA-BW and APSRA-BWL, we observe that the proposed approach reduces energy consumption by 6%.
Taking APSRA as the baseline implementation, Fig. 13 summarises the improvements in terms of percentage increase in saturation pir factor and reduction of both average delay and energy consumption for different traffic scenarios. For all traffic scenarios except MMS, the bandwidth of each communicating pair has been randomly generated between 10 and 100 MB/s. As can be observed, on average APSRA-BWL improves the saturation point by 38%, reduces average delay by 43% and energy consumption by 4%.
Figure 10 Absolute number of network links which exceed the threshold when APSRA, APSRA-BW and APSRA-BWL are used
Figure 11 Average delay variation under uniform traffic for different ranges of communication bandwidth
Finally, Fig. 14 shows the link utilisation under uniform traffic for APSRA and APSRA-BWL. The link utilisation value is discretised into three levels: low (white), medium (grey) and high (black). As can be observed, when APSRA-BWL is used, link utilisation is more evenly distributed than with APSRA. For instance, when APSRA is used, there are several highly utilised links (black) and many lightly utilised links (white). When APSRA-BWL is used, the traffic flows responsible for the high utilisation of some links are redistributed in favour of lightly utilised links. This is confirmed by the higher number of medium utilised links when APSRA-BWL is used.
7 Implications for router architecture
In this section we present a router architecture design to
support the proposed routing algorithm (routing function
and selection function).
7.1 Router architecture
Fig. 15 shows an architecture of the proposed router for the case of a mesh network topology and minimal routing. The top part of the picture shows the high-level view of the router, whereas the bottom part shows the block diagrams of the modules which implement the routing function and the selection function associated with the west input port.
The routing function is implemented by means of a routing table. The routing table is addressed by the destination id. An entry of the routing table contains two main fields: AOC and Pr. AOC encodes the set of admissible output channels that can be used to reach the current destination. If we consider the west input port, AOC is a four-bit field whose bits indicate which of the output ports among north (N), east (E), south (S) and local (L) can be used to reach the current destination. Pr encodes the probability used by the selection function as discussed in Section 5.3. The number of bits used to encode Pr determines the precision of the selection function. For instance, using three bits, eight probability levels are possible (from 0.125 to 1).
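To make the table layout concrete, the following Python sketch packs and unpacks one routing table entry for the west input port. It is a sketch under assumptions: the text fixes only the field widths (four bits for AOC, three for Pr) and the range of probability levels, so the bit order of the ports, the placement of Pr in the low bits, and the mapping of a code k to the probability (k+1)/8 are illustrative choices, as are all names.

```python
PR_BITS = 3
PORTS = ("N", "E", "S", "L")  # outputs reachable from the west input port


def quantise_pr(p: float, bits: int = PR_BITS) -> int:
    """Map a probability to a code k in 0..2^bits-1 that stands for (k+1)/2^bits."""
    levels = 1 << bits
    code = round(p * levels) - 1
    return max(0, min(levels - 1, code))  # clamp to the representable levels


def decode_pr(code: int, bits: int = PR_BITS) -> float:
    """Probability represented by a Pr code (0.125 to 1 for three bits)."""
    return (code + 1) / (1 << bits)


def pack_entry(aoc: set, p: float) -> int:
    """Pack the admissible-output mask (high bits) and the Pr code (low bits)."""
    mask = 0
    for i, port in enumerate(PORTS):
        if port in aoc:
            mask |= 1 << i
    return (mask << PR_BITS) | quantise_pr(p)


def unpack_entry(entry: int):
    """Recover the admissible output set and the quantised probability."""
    code = entry & ((1 << PR_BITS) - 1)
    mask = entry >> PR_BITS
    aoc = {port for i, port in enumerate(PORTS) if mask & (1 << i)}
    return aoc, decode_pr(code)


entry = pack_entry({"N", "E"}, 0.75)
print(entry, unpack_entry(entry))
```

With this encoding an entry occupies seven bits per destination and per input port, and a probability such as 0.75 is stored exactly, while values between levels are rounded to the nearest representable one.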
Figure 13 Summary of the results taking APSRA as baseline
a Percent increase in saturation pir
b Percent reduction in average delay
c Percent reduction in energy consumption
Figure 12 Simulation results for MMS traffic
a Delay variation
b Throughput variation
c Energy variation
A possible implementation of the selection function reported in (3) is shown in the bottom right corner of Fig. 15. The connector labelled with 1 is used in several parts of the circuit. It is set when the routing function returns more than one admissible output channel. If it is zero, only one admissible output channel can be used. In this case, the selection logic is bypassed and clock gating is used to prevent unnecessary activity of the unused blocks. The DirEncS block converts the one-hot encoding used at the input to the encoding of the selected output channel.
If more than one output channel can be used (at most two, because we are considering minimal routing), a selection must be made. The input pr is shifted left (multiplied) and compared with the current value stored in the linear feedback shift register (LFSR). If it is less, the first output channel is selected; otherwise the second one is selected. This selection is, of course, conditioned by the whrt word, which encodes the reservation status of the output channels imposed by the wormhole switching technique. For example, suppose that the north and east output channels are admissible and that north would be selected after the comparator. If the north output channel is reserved but east is not, east will be selected. This computation is performed by the DirEncM block, which returns the encoding of the selected output channel.
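The behaviour of this selection logic can be modelled in software as follows. The sketch makes several assumptions that the text does not fix: an 8-bit Fibonacci LFSR with taps 8, 6, 5, 4 is used as the pseudo-random source, the 3-bit Pr code k stands for (k+1)/8 and is scaled to the LFSR width by a left shift, and the first channel is chosen when the LFSR value is below the scaled Pr. Function and parameter names are hypothetical.

```python
def lfsr8_step(state: int) -> int:
    """Advance an 8-bit Fibonacci LFSR (taps 8, 6, 5, 4) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 4)) & 1
    return (state >> 1) | (bit << 7)


def select_output(admissible, pr_code, lfsr_value, reserved,
                  lfsr_bits=8, pr_bits=3):
    """Pick an output channel among at most two admissible ones.

    admissible: list of one or two channel names.
    pr_code:    3-bit code for the probability of the first channel (assumption).
    reserved:   set of channels currently reserved by wormhole switching.
    """
    if len(admissible) == 1:
        return admissible[0]  # single choice: selection logic bypassed
    first, second = admissible
    threshold = (pr_code + 1) << (lfsr_bits - pr_bits)  # scale Pr to LFSR width
    preferred, other = (first, second) if lfsr_value < threshold else (second, first)
    if preferred in reserved and other not in reserved:
        return other  # fall back to the free channel
    return preferred


state = 1
for _ in range(3):
    state = lfsr8_step(state)
    print(state, select_output(["N", "E"], 5, state, reserved=set()))
```

When both admissible channels are reserved, this model simply returns the preferred one; how the router stalls the header flit in that case is outside the scope of the sketch.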
Figure 14 Links utilisation under uniform traffic for APSRA and APSRA-BWL
Figure 15 Block diagram of the router for a mesh network topology
Top view (top), routing function and selection function associated to the west input port (bottom)
7.2 Area, timing and power analysis
A router implementing the deterministic XY routing algorithm, a router implementing adaptive odd–even routing and two table-based routers, one implementing a random selection policy (TB-RND) and the other implementing the load balancing selection policy (TB-LB), have been designed in VHDL, synthesised using Synopsys Design Compiler and mapped on a 90 nm technology library from TSMC. We considered 8 × 8 mesh topology networks and four-flit FIFO input buffers with a flit size of 64 bits. The analysis is carried out at the granularity of the following main blocks.
• Arbiter: a general arbiter which manages the situation where several packets simultaneously want to use the same output, so that arbitration between them has to be performed. In this implementation, a round-robin policy is used.
• XBar: a general 5 × 5 crossbar block which simultaneously routes non-conflicting packets.
• Input FIFOs: the FIFO buffers at the input of each router. For a mesh topology there are five FIFO buffers in total. We considered four-entry FIFO buffers with an entry size of 64 bits (flit size).
• WHRT: this block implements the Wormhole Reservation Table, which stores the output port selected by the routing algorithm for a given input port.
• Routing function: the block that gives the set of admissible outputs for the current node and a given destination. As we are considering mesh topologies and minimal routing, the maximum number of admissible outputs is two.
• Selection function: this block probabilistically selects one of the outputs from the set of admissible outputs returned by the routing function.
• Control: control logic for sequencing the various activities in the router.
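As an illustration of the round-robin policy mentioned for the arbiter, the sketch below grants one of five requesting inputs per call, giving the lowest priority to the input served most recently. The text states only that round-robin arbitration is used; the class, its interface and the initial priority order are assumptions made for this example.

```python
class RoundRobinArbiter:
    """Behavioural model of a round-robin arbiter for one output port."""

    def __init__(self, n_inputs: int = 5):
        self.n = n_inputs
        self.last = n_inputs - 1  # so that input 0 has the highest priority first

    def grant(self, requests):
        """Return the index of the granted input, or None if nobody requests."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate  # the winner becomes lowest priority next time
                return candidate
        return None


arbiter = RoundRobinArbiter(5)
requests = [True, False, True, False, False]  # inputs 0 and 2 compete
print([arbiter.grant(requests) for _ in range(3)])
```

Rotating the priority in this way prevents any single input port from monopolising an output when several packets contend for it.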
The effectiveness of the proposed selection policy depends on the number of bits used to encode the selection probabilities stored in the routing table (field Pr in the routing table shown in Fig. 15). We used three bits (i.e. eight probability levels from 0.125 to 1), as no appreciable performance improvement has been observed by using more than three bits. For instance, Fig. 16 shows the average delay variation under MMS traffic when different discretisation levels are used to encode the selection probabilities.
7.2.1 Area analysis: Fig. 17a shows the area breakdown for the considered routers. As expected, although a good percentage of the area is due to the FIFO buffers, control logic and arbiter, the impact of the routing table is quite evident.
The use of the LB selection function causes an area overhead of 56% on the routing function block (i.e. the routing table) and of 73% on the selection function block. The overhead in the routing table is due to the additional field Pr, which stores the selection probabilities used by the selection function. However, as the input FIFO buffers dominate the area, this overhead translates globally to only approximately 8% of the overall router area.
7.2.2 Power analysis: Average power dissipation values of the main blocks composing the four routers are shown in Fig. 17b. Once again, the main contribution to power dissipation is due to the FIFO buffers. The second highest contribution is due to the crossbar. The power dissipated by the routing tables is 8% and 3% more than that dissipated by the routing blocks implementing XY and odd–even, respectively. With regard to the selection function, the power dissipated by the LB selection function is about 80% more than that dissipated by a random selection function. In terms of global router power dissipation, the routing table contributes about 12% whereas the LB selection function contributes 6%.
It should be pointed out that both the routing table and the selection function block are active only when a header flit is processed. The above analysis is therefore very conservative, as it assumes that all the blocks in the router have the same utilisation factor (worst-case analysis). In practical cases, the power contributions of the routing table and the LB selection function are likely to be lower than those reported above.
7.2.3 Timing analysis: Fig. 17c shows the delay of the different blocks composing the four routers. We considered a five-stage pipeline implementation of the router with the following stages: FIFO, routing, selection, arbitration and crossbar. In this case, the clock frequency is determined by the FIFO stage, except for the router implementing odd–even routing, whose slowest stage is routing. Neither the access to the routing table nor the computation of the LB selection function affects the router clock frequency.
Figure 16 Average delay variation under MMS traffic when different discretisation levels are used to encode selection probabilities
7.3 Summary of the architectural implications
Fig. 17d compares the different routers in terms of area, delay and power. Values are normalised with respect to a router implementing XY routing. In terms of area, RT-LB (the table-based router with the LB selection function, called TB-LB in Section 7.2) is 8% bigger than a classic table-based router implementing a random selection function (RT-RND, called TB-RND in Section 7.2). This difference is mainly due to the increased width of the routing table, which needs to store the selection probabilities required by the LB selection function. In terms of timing, the larger routing table and the LB selection function do not impact the clock frequency of the router, as the slowest pipeline stage continues to be the FIFO stage. The average power dissipation of the RT-LB router is 14%, 7% and 3% higher than that of the XY, odd–even and RT-RND routers, respectively. However, as shown in the experimental section, the performance improvement obtained using the proposed routing and selection functions results in an overall saving in energy consumption. This is because, although an RT-LB router is more power hungry than the other routers, a network built with RT-LB routers requires fewer cycles to drain a given volume of traffic, with a consequent reduction in energy consumption. The average energy consumed in a period of time is the product of the average power dissipation and the duration of the period.
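A short worked bound, using only the figures quoted above, makes the power versus energy argument concrete. The symbols $E$, $\bar{P}$ and $T$ (energy, average power and time needed to drain a given volume of traffic) are introduced here for illustration, and equal clock frequencies are assumed for the two routers, in line with the timing analysis. Taking the 3% power overhead of the table-based router with LB selection over the one with random selection:

$$E = \bar{P}\,T, \qquad \bar{P}_{\mathrm{LB}} = 1.03\,\bar{P}_{\mathrm{RND}} \;\Longrightarrow\; E_{\mathrm{LB}} < E_{\mathrm{RND}} \iff T_{\mathrm{LB}} < \frac{T_{\mathrm{RND}}}{1.03} \approx 0.971\,T_{\mathrm{RND}}$$

Under this simple model, the network with LB selection saves energy with respect to random selection whenever it drains the same traffic in roughly 3% fewer cycles or better. The same reasoning with the 14% and 7% figures gives break-even points of about 12% and 6.5% fewer cycles against XY and odd–even, respectively.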
8 Conclusions
An application-specific routing algorithm has the potential to provide substantially higher communication performance than general purpose routing algorithms. In this paper we have presented an extension to the APSRA methodology to design highly adaptive, bandwidth-aware, application-specific, deadlock-free routing algorithms for NoC platforms. The basic idea behind the approach is to exploit communication bandwidth information to customise the routing algorithm for a given application. The approach is divided into two phases. In the first phase, information about the communication bandwidth required between pairs of cores is used both in the heuristic that removes cycles from the ASCDG to ensure deadlock freedom and in deciding the selection probabilities of the various available paths for a communication. This helps the resulting routing algorithm to achieve high adaptivity while spreading the traffic uniformly over the network links. In the second phase, the routing function is further restricted in an iterative manner to reduce the load on overloaded network links. The approach has been evaluated on both synthetic and real traffic scenarios. The results obtained show that the routing algorithm generated by the proposed approach (i) is highly adaptive, (ii) reduces the variation of load in the network links and (iii) ensures that the link bandwidth limit is not violated. A table-based design is the natural option for implementing routers for this approach.
Figure 17 Comparison between routers implementing XY routing, odd–even routing and table-based router with random selection policy and load balancing selection policy
a Breakdown of area
b Breakdown of power dissipation
c Breakdown of delay
d Area, timing and power values normalised with respect to a router implementing XY routing
In this paper we assumed that the application has already been mapped and scheduled on the NoC platform before the design of the routing algorithm starts. An interesting research direction is to use the mapping as an additional degree of freedom and to optimise the mapping and the routing function simultaneously for a specific application. Although we have used a 2D homogeneous mesh topology network for description and evaluation, the proposed ideas are topology agnostic. It will be interesting to evaluate these ideas for irregular mesh and other topologies. Effective handling of out-of-order packet delivery remains an interesting open problem for all adaptive routing algorithms.
9 Acknowledgment
This paper is an extended version of the paper presented at the International Symposium on Networks-on-Chip, Newcastle upon Tyne, UK, April 7–10, 2008.
10 References
[1] OGRAS U.Y., HU J., MARCULESCU R.: ‘Key research problems in NoC design: a holistic perspective’. Int. Conf. Hardware–Software Codesign and System Synthesis, September 2005, pp. 69–74
[2] HU J., MARCULESCU R.: ‘Energy- and performance-aware mapping for regular NoC architectures’, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2005, 24, (4), pp. 551–562
[3] BERTOZZI D., JALABERT A., MURALI S., ET AL.: ‘NoC synthesis flow for customized domain specific multiprocessor systems-on-chip’, IEEE Trans. Parallel Distrib. Syst., 2005, 16, (2), pp. 113–129
[4] PALESI M., HOLSMARK R., KUMAR S., CATANIA V.: ‘Application specific routing algorithms for networks on chip’, IEEE Trans. Parallel Distrib. Syst., 2009, 20, (3), pp. 316–330
[5] VAN DER TOL E.B., JASPERS E.G.: ‘Mapping of MPEG-4 decoding on a flexible architecture platform’, Media Processors, 2002, 4674, pp. 362–375
[6] LINDER D., HARDEN J.: ‘An adaptive and fault-tolerant wormhole routing strategy for k-ary n-cubes’, IEEE Trans. Comput., 1991, 40, (1), pp. 2–12
[7] GLASS C.J., NI L.M.: ‘The turn model for adaptive routing’, J. Assoc. Comput. Mach., 1994, 41, (5), pp. 874–902
[8] CHIEN A.A., KIM J.H.: ‘Planar-adaptive routing: low-cost adaptive networks for multiprocessors’, J. ACM, 1995, 42, (1), pp. 91–123
[9] UPADHYAY J., VARAVITHYA V., MOHAPATRA P.: ‘A traffic-balanced adaptive wormhole routing scheme for two-dimensional meshes’, IEEE Trans. Comput., 1997, 46, (2), pp. 190–197
[10] CHIU G.-M.: ‘The odd–even turn model for adaptive routing’, IEEE Trans. Parallel Distrib. Syst., 2000, 11, (7), pp. 729–738
[11] MURALI S., MELONI P., ANGIOLINI F., ET AL.: ‘Designing application-specific networks on chips with floorplan information’. IEEE/ACM Int. Conf. Computer-aided Design, 2006, pp. 355–362
[12] STAROBINSKI D., KARPOVSKY M., ZAKREVSKI L.: ‘Application of network calculus to general topologies using turn-prohibition’, IEEE/ACM Trans. Netw., 2003, 11, (3), pp. 411–421
[13] SRINIVASAN K., CHATHA K.S., KONJEVOD G.: ‘Application specific network-on-chip design with guaranteed quality approximation algorithms’. Asia and South Pacific Design Automation Conf., 2007, pp. 184–190
[14] PINTO A., CARLONI L., SANGIOVANNI-VINCENTELLI A.: ‘COSI: a framework for the design of interconnection networks’, IEEE Des. Test Comput., 2008, 25, (5), pp. 402–415
[15] SCHWIEBERT L., BELL R.: ‘Performance tuning of adaptive wormhole routing through selection function choice’, J. Parallel Distrib. Comput., 2002, 62, (7), pp. 1121–1141
[16] CHANG FENG W., SHIN K.G.: ‘Impact of selection functions on routing algorithm performance in multicomputer networks’. 11th Int. Conf. Supercomputing, 1997, pp. 132–139
[17] MARTÍNEZ J.C., SILLA F., LÓPEZ P., DUATO J.: ‘On the influence of the selection function on the performance of networks of workstations’. Int. Symp. High Performance Computing, (LNCS, 1940), Springer-Verlag, 2000, pp. 292–299
[18] HU J., MARCULESCU R.: ‘DyAD – smart routing for networks-on-chip’. ACM/IEEE Design Automation Conf., San Diego, CA, USA, 7–11 June 2004, pp. 260–263
[19] YE T.T., BENINI L., MICHELI G.D.: ‘Packetization and routing analysis of on-chip multiprocessor networks’, J. Syst. Archit., 2004, 50, (2–3), pp. 81–104
[20] NILSSON E., MILLBERG M., OBERG J., JANTSCH A.: ‘Load distribution with the proximity congestion awareness in a network on chip’. Design, Automation and Test in Europe, Washington, DC, USA, 2003, pp. 1126–1127
[21] WU D., AL-HASHIMI B.M., SCHMITZ M.T.: ‘Improving routing efficiency for network-on-chip through contention-aware input selection’. Asia and South Pacific Design Automation Conf., 2006, pp. 36–41
[22] ASCIA G., CATANIA V., PALESI M., PATTI D.: ‘Implementation and analysis of a new selection strategy for adaptive routing in networks-on-chip’, IEEE Trans. Comput., 2008, 57, (6), pp. 809–820
[23] MATTA I., BESTAVROS A.: ‘A load profiling approach to routing guaranteed bandwidth flows’, IEEE INFOCOM, 1998, 3, (29), pp. 1014–1021
[24] KAR K., KODIALAM M., LAKSHMAN T.V.: ‘Minimum interference routing of bandwidth guaranteed tunnels with MPLS traffic engineering applications’, IEEE J. Sel. Areas Commun., 2000, 18, (12), pp. 2566–2579
[25] HU J., MARCULESCU R.: ‘Energy-aware mapping for tile-based NoC architectures under performance constraints’. Asia & South Pacific Design Automation Conf., January 2003, pp. 233–239
[26] PALESI M., LONGO G., SIGNORINO S., KUMAR S., HOLSMARK R., CATANIA V.: ‘Design of bandwidth aware and congestion avoiding efficient routing algorithms for networks-on-chip platforms’. IEEE Int. Symp. Networks-on-Chip, Newcastle University, UK, 7–11 April 2008, pp. 97–106
[27] MURALI S., MICHELI G.D.: ‘Bandwidth-constrained mapping of cores onto NoC architectures’. Design, Automation, and Test in Europe, 16–20 February 2004, pp. 896–901
[28] BERTELS K., KUZMANOV G., PANAINTE E.M., ET AL.: ‘Profiling, compilation, and HDL generation within the hArtes project’. FPGAs and Reconfigurable Systems: Adaptive Heterogeneous Systems-on-Chip and European Dimensions, DATE 07, 2007, pp. 53–62
[29] VALLERIO K.S., JHA N.K.: ‘Task graph extraction for embedded system synthesis’. 16th Int. Conf. VLSI Design, January 2003, pp. 480–486
[30] MURALI S., ATIENZA D., BENINI L., MICHELI G.D.: ‘A multi-path routing strategy with guaranteed in-order packet delivery and fault-tolerance for networks on chip’. ACM/IEEE Design Automation Conf., San Francisco, CA, USA, 24–28 July 2006, pp. 845–848
[31] DUATO J., YALAMANCHILI S., NI L.: ‘Interconnection networks: an engineering approach’ (Morgan Kaufmann, 2002)
[32] ASCIA G., CATANIA V., PALESI M.: ‘Multi-objective mapping for mesh-based NoC architectures’. Second IEEE/ACM/IFIP Int. Conf. Hardware/Software Codesign and System Synthesis, Stockholm, Sweden, 8–10 September 2004, pp. 182–187
[33] FAZZINO F., PALESI M., PATTI D.: ‘Noxim: network-on-chip simulator’, http://noxim.sourceforge.net
[34] VARATKAR G., MARCULESCU R.: ‘Traffic analysis for on-chip networks design of multimedia applications’. ACM/IEEE Design Automation Conf., June 2002, pp. 510–517
[35] PANDE P.P., GRECU C., JONES M., IVANOV A., SALEH R.: ‘Performance evaluation and design trade-offs for network-on-chip interconnect architectures’, IEEE Trans. Comput., 2005, 54, (8), pp. 1025–1040