A Reconfigurable Fault-tolerant Deflection Routing Algorithm

Oct 6, 2011

A Reconfigurable Fault-tolerant Deflection Routing Algorithm Based on Reinforcement Learning for Network-on-Chip

Chaochao Feng*,**, Zhonghai Lu**, Axel Jantsch**, Jinwen Li* and Minxuan Zhang*
*National University of Defense Technology, China
**Royal Institute of Technology, Sweden
Outline

Motivation

Related work

NoC architecture and fault model

FTDR algorithm

Experimental results

Conclusion and future work
Motivation

Trends of CMOS technology

Nanometer domain

Physical effects: crosstalk, electromagnetic interference, alpha and neutron particle strikes, power supply disturbance …

Challenge: reliability of VLSI?

NoC

A new on-chip communication paradigm

Inherent structure redundancy

Fault-tolerant routing

Deflection routing

Non-minimal adaptive routing

Achieve fault-tolerance at the cost of small hardware overhead

Reinforcement learning

Reconfigure routing table
Related work (1/2)

Handle regular fault region

Y. J. Suh et al., "Software-based rerouting for fault-tolerant pipelined communication", IEEE Transactions on Parallel and Distributed Systems, 2000

Fault-tolerant oblivious routing algorithm

Tolerate both convex and concave fault regions

Z. Zhang et al., "A reconfigurable routing algorithm for a fault-tolerant 2D-mesh Network-on-Chip", DAC'08

Deterministic, fault-tolerant, distributed, reconfigurable routing algorithm to handle one faulty switch or region topology

C. Feng et al., "FoN: Fault-on-Neighbor aware routing algorithm for Networks-on-Chip", SOCC'10

Deflection routing

Convex and concave fault regions with at most one concave point in sequence
Related work (2/2)

Handle irregular fault region

Need more fault information

V. Puente et al., "Immunet: dependable routing for interconnection networks with arbitrary topology", IEEE Transactions on Computers, 2009

Dynamic network reconfiguration in response to failures

Find a spanning tree of the network

A. Mejia et al., "Region-based routing: a mechanism to support efficient routing algorithms in NoCs", IEEE Transactions on VLSI Systems, 2009

Group switches into regions based on the fault pattern to make routing decisions

Additional region computation process
NoC architecture

Nostrum NoC architecture

2D mesh

Deflection routing (Bufferless)

Packet priority (Hop count)

Stress value (Load balance)

Packet format

V: valid bit

DA and SA: destination and source addresses (Relative addressing)

HC: hop count

Payload: 80 bits
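
As a rough illustration of the fields listed above and of hop-count priority in a bufferless switch, the sketch below models a packet and a one-cycle output port assignment. The field types, the (dx, dy) encoding of relative addresses, and the helpers `preferred_ports` and `assign_ports` are assumptions made for this example, not the Nostrum implementation. Stress-based load balancing and ejection at the destination are left out.

```python
from dataclasses import dataclass

PORTS = ("N", "E", "S", "W")

@dataclass
class Packet:
    valid: bool        # V: valid bit
    da: tuple          # DA: destination as a relative (dx, dy) offset (assumed encoding)
    sa: tuple          # SA: source as a relative (dx, dy) offset (assumed encoding)
    hc: int            # HC: hop count, used here as the packet priority
    payload: int = 0   # payload, 80 bits in the slides

def preferred_ports(pkt):
    """Ports that shrink the remaining offset (assumes +x is E and +y is N)."""
    dx, dy = pkt.da
    prefs = []
    if dx > 0:
        prefs.append("E")
    if dx < 0:
        prefs.append("W")
    if dy > 0:
        prefs.append("N")
    if dy < 0:
        prefs.append("S")
    return prefs

def assign_ports(packets):
    """Give every packet an output port in the same cycle (at most 4 packets).
    Packets with a higher hop count choose first; a packet whose preferred
    ports are all taken is deflected to the first free port."""
    free = list(PORTS)
    out = {}
    for i, pkt in sorted(enumerate(packets), key=lambda item: -item[1].hc):
        choice = next((p for p in preferred_ports(pkt) if p in free), free[0])
        free.remove(choice)
        out[i] = choice
    return out
```

With two packets that both want to go east, the one with the larger hop count keeps the east port and the other is deflected.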
Fault model

Fault denotation

Permanent faults

Fault region: any shape which does not disconnect the network

Bidirectional link failure

A 4-bit fault vector for each switch, 1 bit per direction

2-hop fault information transmission

Transmit its own link status to four neighbors

Collect link status from three neighbors and transmit it to the fourth neighbor

Each switch can get the link status for at most 16 links
[Figure: switch (i, j) with its per-direction signals fault_to, fault_from, FoN_to and FoN_from for N, E, S and W]
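
The bound of 16 links can be checked by counting: a switch has 4 links of its own, and each of its 4 neighbors contributes 3 further links. The sketch below enumerates those links on a mesh; the helper names and the coordinate convention are assumptions for illustration, and only the topology is modelled, not the fault bits themselves.

```python
def neighbors(node, n):
    """Mesh neighbors of node = (x, y) in an n x n mesh."""
    x, y = node
    candidates = [(x, y + 1), (x + 1, y), (x, y - 1), (x - 1, y)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < n and 0 <= cy < n]

def visible_links(node, n):
    """Links whose status a switch can learn with 2-hop fault information:
    its own links plus every link attached to one of its neighbors."""
    links = set()
    for nb in neighbors(node, n):
        links.add(frozenset((node, nb)))        # own link (1-hop information)
        for nb2 in neighbors(nb, n):
            links.add(frozenset((nb, nb2)))     # neighbor's link (2-hop information)
    return links

print(len(visible_links((3, 3), 8)))  # interior switch of an 8x8 mesh
```

Switches on the mesh boundary see fewer links, which is why the slide says "at most".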
FTDR algorithm

Basic RL-based routing

FTDR algorithm

Hierarchical FTDR algorithm (FTDR-H)

Deadlock and livelock avoidance

Hardware implementation
Reinforcement learning based routing*

Adaptive routing algorithm

Make routing decision using only local information

Topology-agnostic characteristic

Suitable for fault-tolerance

Table based routing algorithm

Reconfigure routing table with reinforcement learning method

*J. A. Boyan et al., "Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach", Advances in Neural Information Processing Systems, 1994.
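
For readers unfamiliar with the cited approach, the sketch below shows the general Q-routing update in its usual form: after node x sends a packet for destination d to neighbor y, y reports its own best estimate, and x moves its entry towards the new target. The function name, argument names and the default learning rate are assumptions for this example.

```python
def q_routing_update(q_x, d, y, y_estimate, queue_delay, link_delay, eta=0.5):
    """One Q-routing update at node x.
    q_x[(d, y)] is x's estimated delivery time to d via neighbor y;
    y_estimate is the minimum over z of y's estimates Q_y(d, z)."""
    target = queue_delay + link_delay + y_estimate
    q_x[(d, y)] += eta * (target - q_x[(d, y)])
    return q_x[(d, y)]

table = {("d", "y"): 10.0}
print(q_routing_update(table, "d", "y", y_estimate=3, queue_delay=0, link_delay=1))
```

Setting `eta = 1`, `queue_delay = 0` and `link_delay = 1` turns the estimate into a pure hop count, which suits a bufferless network where packets never wait in queues.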
FTDR algorithm (1)

To achieve fault-tolerance, use the following formula to reconfigure the routing table:

$$Q_x^t(d, y) = 1 + \min_z Q_y^{t-1}(d, z)$$

$Q_x^t(d, y)$: minimum number of hops from $x$ to $d$ through $x$'s neighbor $y$ at time $t$

$\min_z Q_y^{t-1}(d, z)$: minimum number of hops from $y$ to $d$ through $y$'s neighbor $z$ at time $t-1$

x sends a packet to d through neighbor y; y returns its minimum number of hops to d back to x

x updates the corresponding entry with 1 hop plus the minimum number of hops to d from y

In the case of several directions with an equal number of hops to the destination, choose the one with the smallest stress value
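
The update rule and the tie-break on stress value can be sketched as two small functions. The table layout (destination mapped to a per-direction hop count) and the `stress` dictionary are assumed representations for this example, not the hardware format.

```python
def update_entry(table_x, d, y, table_y):
    """Q_x(d, y) = 1 + min_z Q_y(d, z): one hop to y plus y's best estimate."""
    table_x[d][y] = 1 + min(table_y[d].values())
    return table_x[d][y]

def choose_direction(table_x, d, stress):
    """Pick the direction with the fewest hops to d; break ties by lowest stress."""
    return min(table_x[d], key=lambda port: (table_x[d][port], stress[port]))

# x currently believes d is 5 hops away via N and 3 hops away via E.
table_x = {"d": {"N": 5, "E": 3}}
# The neighbor reached through N reports a best estimate of 2 hops.
table_y = {"d": {"N": 2, "E": 3}}
update_entry(table_x, "d", "N", table_y)
print(choose_direction(table_x, "d", stress={"N": 4, "E": 1}))
```

After the update both directions cost 3 hops, so the direction with the lower stress value is chosen.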
FTDR algorithm (2)

Routing table update example

3x3 mesh, one faulty link

Reconfigure routing table entry to S6

[Figure: 3x3 mesh of switches S1 to S9 showing, step by step, the per-direction (N, E, S, W) routing table entries to S6, with labels P and Q]
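
A small simulation can stand in for the lost figure. It repeats the hop-count update on a 3x3 mesh until no entry changes. Which link is faulty in the original example cannot be recovered from the extracted text, so the S3-S6 link is an assumption here, as are the row-major switch numbering and the fault-free initial table contents.

```python
N = 3
DIRS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}
FAULTY = {frozenset({(2, 0), (2, 1)})}   # assumed faulty link between S3 and S6

def node_of(label):
    """Map 'S1'..'S9' (row-major, S1 at the top-left) to (x, y)."""
    i = int(label[1:]) - 1
    return (i % N, i // N)

def links(node):
    """Healthy links of a switch as {direction: neighbor}."""
    x, y = node
    out = {}
    for name, (dx, dy) in DIRS.items():
        nb = (x + dx, y + dy)
        inside = 0 <= nb[0] < N and 0 <= nb[1] < N
        if inside and frozenset({node, nb}) not in FAULTY:
            out[name] = nb
    return out

def learn(dest):
    """Apply Q_x(d, y) = 1 + min_z Q_y(d, z) at every switch until stable."""
    nodes = [(x, y) for y in range(N) for x in range(N)]

    def best(table, node):
        return 0 if node == dest else min(table[node].values())

    # Start from fault-free hop counts (assumed initial table contents).
    q = {x: {d: 1 + abs(nb[0] - dest[0]) + abs(nb[1] - dest[1])
             for d, nb in links(x).items()} for x in nodes}
    while True:
        new = {x: {d: 1 + best(q, nb) for d, nb in links(x).items()} for x in nodes}
        if new == q:
            return q
        q = new

tables = learn(node_of("S6"))
print(tables[node_of("S2")])   # entries to S6 at switch S2
```

In this setup S2 initially believes S6 is 2 hops away through S3; after learning, that entry rises to 4 hops and the route through S5 becomes the only 2-hop choice.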
FTDR-H algorithm

Hierarchical routing

Reduce the routing table size

An n x n mesh can be divided into several sub-regions of equal size

Each switch contains a local and a region routing table

Destination is in the same region

Routed based on local routing table

Destination is not in the same region

Routed based on region routing table

Local and region routing tables are also updated by

$$Q_x^t(d, y) = 1 + \min_z Q_y^{t-1}(d, z)$$
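
The two-level lookup can be sketched as follows. The table keys (destination switch for the local table, region coordinates for the region table) and the default region size are assumptions for illustration; the slides do not specify the indexing.

```python
def region_of(node, region_size):
    """Region coordinates of the switch at (x, y) for square sub-regions."""
    x, y = node
    return (x // region_size, y // region_size)

def lookup(current, dest, local_table, region_table, region_size=4):
    """Return the per-direction hop estimates used to route towards dest.
    Same region: local table, indexed by destination switch.
    Different region: region table, indexed by destination region."""
    if region_of(current, region_size) == region_of(dest, region_size):
        return local_table[dest]
    return region_table[region_of(dest, region_size)]

local_table = {(2, 3): {"N": 5, "E": 3, "S": 3, "W": 5}}
region_table = {(1, 0): {"N": 3, "E": 1, "S": 3, "W": 3}}
print(lookup((1, 1), (2, 3), local_table, region_table))  # same region
print(lookup((1, 1), (5, 1), local_table, region_table))  # different region
```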



Deadlock and livelock avoidance

Deflection routing: inherently deadlock free

Livelock avoidance

Packet priority: does not saturate

The routing table entry will converge to the minimum number of hops to the destination in the presence of fault regions which do not disconnect the network.
Experimental results (1/6)

Platform

8x8 2D mesh, VHDL Nostrum NoC simulator

Compare four algorithms

FoN, Cost-based, FTDR and FTDR-H

Traffic

Synthetic workload

Uniform random, transpose, bit complement, bit reverse, shuffle and tornado

Application workload

A multi-core NoC platform

Each node has a LEON3 processor plus a local memory

Matrix multiplication, 1D radix-2 parallel FFT and wavefront computation
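
The synthetic patterns named above are commonly defined as permutations on the node address. The sketch below uses one widespread set of textbook definitions for an 8x8 mesh with row-major node numbering. The slides do not state which variants the simulator uses, so every function here is an assumption, in particular the tornado pattern, which is applied to the x coordinate only.

```python
import random

K = 8                     # mesh radix (8x8)
BITS = 6                  # address bits for K * K = 64 nodes
MASK = K * K - 1

def uniform_random(s, rng=random):
    """Any node other than the source, chosen uniformly."""
    d = rng.randrange(K * K - 1)
    return d if d < s else d + 1

def bit_complement(s):
    return ~s & MASK

def bit_reverse(s):
    return int(format(s, f"0{BITS}b")[::-1], 2)

def shuffle(s):
    """Rotate the address left by one bit."""
    return ((s << 1) | (s >> (BITS - 1))) & MASK

def transpose(s):
    """Swap the x and y coordinates."""
    x, y = s % K, s // K
    return x * K + y

def tornado(s):
    """Shift ceil(K/2) - 1 positions along x, wrapping around (x only, assumed)."""
    x, y = s % K, s // K
    return y * K + (x + (K + 1) // 2 - 1) % K

print([f(1) for f in (bit_complement, bit_reverse, shuffle, transpose, tornado)])
```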
Experimental results (2/6)

2-hop vs. 1-hop fault information

Random traffic with a fault pattern of 10% faulty links

Learning period: 350 cycles (depending on fault patterns)

2-hop fault information

Some unnecessary misroutings can be avoided

Lower average hop count than with 1-hop fault information
Experimental results (3/6)

Results with synthetic workloads: throughput

Assume link fault rate: 0~30%

No faults

Cost-based: best

In the case of link faults

FTDR and FTDR-H: similar

FTDR: 14% and 23% higher than FoN and cost-based, respectively
Experimental results (4/6)

Results with synthetic workloads: average hop count

Injection rate

0.1 packets/cycle/node

FTDR

Lower hop count than FoN and cost-based

Routing table converges

For uniform random, bit reverse and shuffle

FTDR-H: 18%, 10% and 15% less than FTDR

For the rest of the traffic patterns

FTDR and FTDR-H: similar

[Figure: average hop count vs. fault rate (0, 10%, 20%, 30%) for FoN, Cost, FTDR and FTDR-H; panels: (a) Uniform random, (b) Transpose, (c) Bit complement, (d) Bit reverse, (e) Shuffle, (f) Tornado]
Experimental results (5/6)

Results with application workloads

FTDR and FTDR-H algorithms perform better than FoN, especially under high fault rates

For matrix multiplication and wavefront computation, FTDR and FTDR-H perform similarly

For FFT, the average hop count of FTDR-H is 10% less than that of FTDR in the presence of link faults

[Figure: average hop count vs. fault rate (0, 10%, 20%, 30%) for FoN, FTDR and FTDR-H; panels: (a) Matrix multiplication, (b) FFT, (c) Wavefront]
Experimental results (6/6)

Area cost

For FoN and cost-based: does not increase with the network size

Main overhead of FTDR switch: routing table

For the FTDR-H algorithm, the 8x8, 12x12 and 16x16 meshes are divided into 4, 9 and 16 4x4 sub-meshes respectively.

In theory, the routing table cost of FTDR-H can be up to n/2 times smaller than that of FTDR for an n x n mesh.

[Figure: area (um2) vs. network size (4x4, 8x8, 12x12, 16x16) for Non-FT, FoN, Cost, FTDR and FTDR-H]
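
A back-of-the-envelope count of routing table rows illustrates the saving. The model below assumes one row per destination switch for FTDR, and one row per switch in a region plus one row per region for FTDR-H. This row-count model is an assumption for illustration; the slides do not spell out how the n/2 figure is derived.

```python
def ftdr_rows(n):
    """One routing table row per destination switch in an n x n mesh."""
    return n * n

def ftdr_h_rows(n, region):
    """Local rows (switches in one region) plus region rows (number of regions)."""
    return region * region + (n // region) ** 2

for n in (8, 12, 16):
    flat, hier = ftdr_rows(n), ftdr_h_rows(n, 4)
    print(f"{n}x{n}: FTDR {flat} rows, FTDR-H {hier} rows, ratio {flat / hier:.2f}")
```

Under this model the 16x16 mesh with 4x4 sub-meshes needs 32 rows instead of 256, a ratio of 8, which equals n/2 for n = 16. The ratio is smaller for the 8x8 and 12x12 meshes, where the region side differs from the square root of n.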
Conclusion and future work

FTDR

Routing table updates via 2-hop learning

Table based, topology-agnostic, insensitive to the shape of fault regions

Small and medium network sizes

FTDR-H

Reduce routing table size

Future work

Fault detection mechanism

Reinforcement learning under transient faults
Thank you!
Questions?