Low Power Architecture for High Speed Packet Classification


Authors: Alan Kennedy, Xiaojun Wang, Zhen Liu, Bin Liu
Publisher: ANCS 2008
Presenter: Chun-Yi Li
Date: 2009/05/06

Outline

- Introduction
- Adaptive Clocking Architecture
- Hardware Accelerator
- Hierarchical Intelligent Cuttings (HiCuts)
- Multidimensional Cutting (HyperCuts)
- Algorithm Changes
- Low Power Architecture
- Performance

Introduction

- Optical Carrier levels describe a range of digital signals that can be carried on a SONET fiber-optic network.
- The number in the Optical Carrier level is directly proportional to the data rate of the bitstream carried by the digital signal.

- Implementing packet classification algorithms in software is not feasible when trying to achieve high speed packet classification.
- High throughput algorithms such as RFC are unable to reach OC-768 or even OC-192 line rates when run on devices such as general purpose processors, even for relatively small rulesets.

- A large percentage of idle time means a large amount of unnecessary dynamic power is being used, due to the unnecessary switching of logic and memory elements.

[Figure: percentage of time the classifier spends idle when classifying packets from the CENIC trace at different frequencies.]


Adaptive Clocking Architecture

- The adaptive clocking unit is designed to run a packet classification hardware accelerator at up to N different frequencies.
- For our packet classifier, it was found that 32 MHz is fast enough to deal with the worst case bursts of packets for OC-768 line speeds. This means that Fmax = 32 MHz.

State        S0         S1        S2       S3      S4    S5    S6    S7    S8     S9
Speed (MHz)  f0=0.0625  f1=0.125  f2=0.25  f3=0.5  f4=1  f5=2  f6=4  f7=8  f8=16  f9=32

f_i = Fmax / 2^(N-i-1),  i = 0, 1, ..., N-1
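As a quick check of the table, here is a minimal C sketch (purely illustrative) that evaluates f_i = Fmax / 2^(N-i-1) for the ten states:

```c
#include <stdio.h>

/* Print the adaptive clocking frequencies f_i = Fmax / 2^(N-i-1). */
int main(void) {
    const double fmax_mhz = 32.0;   /* Fmax = 32 MHz for OC-768 line speeds */
    const int n = 10;               /* states S0..S9                        */
    for (int i = 0; i < n; i++)
        printf("S%d: %.4f MHz\n", i, fmax_mhz / (double)(1u << (n - i - 1)));
    return 0;
}
```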

- The threshold at which the state changes is variable: the buffer's M packet slots are distributed among the N states, with each state having a width W_i, 0 ≤ W_i ≤ M.

[Figure: buffer of size M divided into N regions with widths W_0, W_1, ..., W_(N-1).]

M = W_0 + W_1 + ... + W_(N-1)

- The threshold for determining when a state is exited and the next higher state entered is saved in a register in the adaptive clocking unit and can be changed at any time.

[Figure: buffer regions W_0, W_1, ..., W_(N-1) with thresholds T_0, T_1, ... marking the region boundaries.]

T_i = W_0 + W_1 + ... + W_i,  i = 0, 1, ..., N-2
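The state choice can be modelled in a few lines of C. This is only a sketch of the idea; the widths W_i below are illustrative values, not ones taken from the paper:

```c
#include <stdio.h>

#define N 10   /* number of states, S0..S9 */

/* Illustrative widths W_i (design parameters); the buffer size is M = sum of all W_i. */
static const int W[N] = {4, 4, 4, 4, 8, 8, 8, 8, 8, 8};

/* T_i = W_0 + ... + W_i, i = 0..N-2: boundary between state i and state i+1. */
static void compute_thresholds(int T[N - 1]) {
    int acc = 0;
    for (int i = 0; i < N - 1; i++) { acc += W[i]; T[i] = acc; }
}

/* A packet count above T_i pushes the clock from state i to state i+1. */
static int state_for_occupancy(int packets_in_buffer, const int T[N - 1]) {
    int s = 0;
    while (s < N - 1 && packets_in_buffer > T[s]) s++;
    return s;
}

int main(void) {
    int T[N - 1];
    compute_thresholds(T);
    for (int occ = 0; occ <= 64; occ += 16)
        printf("%2d packets buffered -> state S%d\n", occ, state_for_occupancy(occ, T));
    return 0;
}
```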

- The output clock frequency to the packet classification hardware accelerator will start at the frequency of the lowest-used state, f0.
- If the threshold for this state, T0, is exceeded, then the next higher-used state S1 will be entered and the clock frequency will change to f1.

[Figure: state transition chain S0 → S1 → ... → S9.]

- Only states S4, S7, S8 and S9 are used.
- In this case the output clock frequency to the packet classifier will start at f4 = 1 MHz, the frequency of the lowest used state.

[Figure: state chain with only S4, S7, S8 and S9 in use.]


Hardware Accelerator

Hierarchical Intelligent Cuttings (HiCuts)

- The algorithm constructs the decision tree by recursively cutting the hyperspace, one dimension at a time, into sub-regions.
- The algorithm keeps cutting the hyperspace until no sub-region contains more rules than a predetermined number called binth.
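To make the lookup concrete, here is a minimal C sketch of a HiCuts-style classification, assuming a simple in-memory node layout (the structs and field names are illustrative, not the accelerator's actual memory format): internal nodes cut one dimension into equal-sized sub-regions, and leaves hold at most binth rules that are searched linearly.

```c
#include <stdint.h>
#include <stddef.h>

#define DIMS 5

typedef struct { uint32_t min[DIMS], max[DIMS]; int id; } rule_t;

typedef struct node {
    int is_leaf;
    /* leaf: at most binth rules, searched linearly */
    const rule_t *rules;
    size_t nrules;
    /* internal: cut one dimension into ncuts equal sub-regions */
    int dim;                 /* dimension being cut           */
    uint32_t lo, hi;         /* range this node covers in dim */
    uint32_t ncuts;          /* number of cuts                */
    struct node **child;     /* ncuts children                */
} node_t;

/* Walk the decision tree, then linearly search the leaf. */
static int classify(const node_t *n, const uint32_t pkt[DIMS]) {
    while (!n->is_leaf) {
        uint64_t span  = (uint64_t)n->hi - n->lo + 1;
        uint64_t width = span / n->ncuts;                   /* equal-sized cuts */
        uint32_t idx   = (uint32_t)((pkt[n->dim] - n->lo) / width);
        if (idx >= n->ncuts) idx = n->ncuts - 1;
        n = n->child[idx];
    }
    for (size_t i = 0; i < n->nrules; i++) {                /* linear search of leaf */
        const rule_t *r = &n->rules[i];
        int match = 1;
        for (int d = 0; d < DIMS && match; d++)
            match = pkt[d] >= r->min[d] && pkt[d] <= r->max[d];
        if (match) return r->id;    /* first match wins (assumes priority order) */
    }
    return -1;                      /* no rule matched */
}
```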

[Figures: worked HiCuts example for a 12-rule ruleset R0-R11 with binth = 4. The root node cuts Field 2 into 4 sub-regions; sub-regions still holding more than binth rules are cut further on Field 4, Field 3 and Field 5 until every leaf holds at most 4 rules, giving the final decision tree.]

Hierarchical Intelligent Cuttings (HiCuts)

- binth: limits the amount of linear searching at leaves.
- np: the number of cuts.
- spfac: a multiplier which limits the amount of storage increase caused by executing cuts at a node:

  spfac * (number of rules at i) ≥ (rules at each child of i) + np
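A small C sketch of that storage check, under the assumption that "rules at each child" means the rule counts summed over the np children a candidate cut would create (helper names are illustrative):

```c
#include <stddef.h>

/* Returns nonzero if making np cuts at node i is still within the spfac
 * storage budget: spfac * rules_at_i >= sum(rules per child) + np. */
static int within_space_budget(size_t rules_at_i,
                               const size_t *rules_per_child, size_t np,
                               size_t spfac) {
    size_t child_total = 0;
    for (size_t c = 0; c < np; c++)
        child_total += rules_per_child[c];
    return spfac * rules_at_i >= child_total + np;
}
```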

Multidimensional Cutting (HyperCuts)

- The main difference from HiCuts is that HyperCuts recursively cuts the hyperspace into sub-regions by performing cuts on multiple dimensions at a time.

[Figure: HyperCuts decision tree for the example ruleset below with binth = 4. The root node cuts Field 1 and Field 5 into 2 each, giving four children that hold {R0, R4, R6}, {R7, R8, R9}, {R1, R3} and {R0, R2, R5}.]

Rule  Field 1   Field 2   Field 3   Field 4   Field 5
R0    128-240   15-15     40-40     180-180   120-140
R1    90-100    0-80      0-200     190-200   130-132
R2    130-255   60-140    0-60      180-180   133-135
R3    90-92     200-200   40-40     180-180   136-138
R4    130-255   60-140    40-40     190-200   60-63
R5    140-150   60-140    0-255     0-255     140-255
R6    160-165   80-80     0-255     0-255     0-80
R7    48-50     0-80      40-40     0-255     0-10
R8    26-36     50-50     40-40     180-180   30-40
R9    40-40     40-70     40-40     0-255     0-60

- For HyperCuts, spfac is a multiplier which limits the amount of storage increase caused by executing cuts at a node:

  max child nodes at i ≤ spfac * sqrt(number of rules at i)

Region Compaction

- A node in the decision tree originally covers the region {[X_min, X_max], [Y_min, Y_max]}.
- However, all the rules associated with the node are covered by only the sub-region {[X'_min, X'_max], [Y'_min, Y'_max]}.
- Using region compaction, the area associated with the node shrinks to the minimum space that can cover all the rules associated with the node.
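A minimal C sketch of region compaction, assuming rules and regions are stored as per-dimension ranges (types and names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

#define DIMS 5

typedef struct { uint32_t min[DIMS], max[DIMS]; } range5_t;   /* a rule's ranges */
typedef struct { uint32_t lo[DIMS], hi[DIMS]; } region_t;     /* a node's region */

/* Shrink a node's region to the smallest box covering all of its n rules. */
static void region_compact(region_t *node, const range5_t *rules, size_t n) {
    for (int d = 0; d < DIMS; d++) {
        uint32_t lo = UINT32_MAX, hi = 0;
        for (size_t i = 0; i < n; i++) {
            if (rules[i].min[d] < lo) lo = rules[i].min[d];
            if (rules[i].max[d] > hi) hi = rules[i].max[d];
        }
        /* Never grow the region beyond what the node already covered. */
        node->lo[d] = lo > node->lo[d] ? lo : node->lo[d];
        node->hi[d] = hi < node->hi[d] ? hi : node->hi[d];
    }
}
```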

Pushing Common Rule Subsets Upwards

- An example in which all the child nodes of A share the same subset of rules {R0, R1}. As a result, only A stores the subset instead of it being replicated in all the children.

[Figure: node A's children originally hold {R0, R1, R2}, {R0, R1, R3}, {R0, R1} and {R0, R1, R4}; after pushing {R0, R1} up into A, the children hold only {R2}, {R3}, {} and {R4}.]
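A minimal C sketch of the supporting set operations, assuming each node's rules are kept as sorted ID arrays (purely illustrative): the common subset is found by intersecting the children's lists pairwise, stored once at the parent, and then subtracted from every child.

```c
#include <stddef.h>

/* Intersect sorted rule-ID lists a and b into out; returns the new length. */
static size_t intersect(const int *a, size_t na, const int *b, size_t nb, int *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { out[k++] = a[i]; i++; j++; }
    }
    return k;
}

/* Remove every ID in common[] from the sorted list child[]; returns the new length. */
static size_t subtract(int *child, size_t n, const int *common, size_t ncommon) {
    size_t k = 0, j = 0;
    for (size_t i = 0; i < n; i++) {
        while (j < ncommon && common[j] < child[i]) j++;
        if (j < ncommon && common[j] == child[i]) continue;   /* drop shared rule */
        child[k++] = child[i];
    }
    return k;
}
```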

Algorithm Changes

- Remove the region compaction and push common rule subsets upwards heuristics from the HyperCuts algorithm.

- For HiCuts, the number of cuts at an internal node starts at 32 and doubles each time the following condition is met:

  (spfac * (number of rules at i) ≥ (rules at each child of i) + np) & (np < 129)
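One plausible reading of that rule in C; count_child_rules is a hypothetical stand-in for cutting the node into np children and summing the rules replicated into them:

```c
#include <stddef.h>

/* Start at np = 32 and double while the spfac budget still holds for the
 * current np and np < 129 (so doubling stops once np reaches 129 or more). */
static size_t choose_np_hicuts(size_t rules_at_i, size_t spfac,
                               size_t (*count_child_rules)(size_t np)) {
    size_t np = 32;
    while (np < 129 &&
           spfac * rules_at_i >= count_child_rules(np) + np)
        np *= 2;
    return np;
}
```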

- For HyperCuts, all combinations of cuts between the chosen dimensions are considered if they obey the following condition, where spfac can be 1, 2, 3 or 4:

  (np ≤ 2^(4+spfac)) & (np ≥ 32)
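A short C sketch of that filter, assuming np for a combination is the product of the cut counts chosen in each selected dimension (an assumption; names are illustrative):

```c
#include <stddef.h>

/* np for a candidate combination: the total number of child regions,
 * i.e. the product of the cut counts chosen for each selected dimension. */
static size_t total_cuts(const size_t *cuts_per_dim, size_t ndims) {
    size_t np = 1;
    for (size_t d = 0; d < ndims; d++) np *= cuts_per_dim[d];
    return np;
}

/* Keep a combination only if 32 <= np <= 2^(4 + spfac), spfac in 1..4. */
static int combination_allowed(const size_t *cuts_per_dim, size_t ndims, unsigned spfac) {
    size_t np = total_cuts(cuts_per_dim, ndims);
    return np >= 32 && np <= ((size_t)1 << (4 + spfac));
}
```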

Memory Structure

- The hardware accelerator uses 7704-bit wide memory words.
- In order to calculate which cut the packet should traverse to, each internal node stores 8-bit mask and shift values for each dimension.
- The masks indicate how many cuts are to be made to each dimension, while the shift values indicate each dimension's weight.
- The cut to be chosen is calculated by ANDing the mask values with the corresponding 8 most significant bits from each of the packet's 5 dimensions. The resulting value for each dimension is shifted by that dimension's shift value, and the results are added together, giving the cut to be selected.
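A minimal C sketch of that calculation; treating the shift weights as left shifts that place each dimension's cut bits side by side is an assumption about the exact bit layout:

```c
#include <stdint.h>

#define DIMS 5

/* Per-node lookup parameters: an 8-bit mask and shift value for each dimension. */
typedef struct { uint8_t mask[DIMS]; uint8_t shift[DIMS]; } node_index_t;

/* AND each dimension's 8 MSBs with its mask, shift by its weight,
 * and add the results to obtain the child (cut) index. */
static uint32_t cut_index(const node_index_t *n, const uint8_t msb[DIMS]) {
    uint32_t idx = 0;
    for (int d = 0; d < DIMS; d++)
        idx += (uint32_t)(msb[d] & n->mask[d]) << n->shift[d];
    return idx;
}
```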

- Each saved rule uses 160 bits of memory.
- The Destination and Source Ports use 32 bits each, with 16 bits used for the min and max range values.
- The Source and Destination IP addresses use 35 bits each, with 32 bits used to store the address and 3 bits for the mask.
- The storage requirement for the mask has been reduced from 6 to 3 bits by encoding the mask and storing 3 bits of the encoded mask value in the 3 least significant bits of the IP address when the mask is 0-27.
- The protocol number uses 9 bits, with 8 bits used to store the number and 1 bit for the mask.
- Each 7704-bit memory word can hold up to 48 rules, and it is possible to perform a parallel search of these rules in one clock cycle.
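To illustrate the stored fields, here is a hedged C sketch of a single rule and its match check; the field widths follow the slide, but the struct layout and the prefix-mask handling are simplifications, not the accelerator's 7704-bit word format:

```c
#include <stdint.h>

/* One stored rule, following the field widths on the slide
 * (the actual packing inside the 7704-bit word is not shown here). */
typedef struct {
    uint16_t sport_min, sport_max;   /* source port range, 16 + 16 bits   */
    uint16_t dport_min, dport_max;   /* destination port range            */
    uint32_t sip, dip;               /* source / destination IP addresses */
    uint8_t  sip_plen, dip_plen;     /* prefix lengths (stored encoded)   */
    uint8_t  proto;                  /* protocol number                   */
    uint8_t  proto_wild;             /* 1-bit mask: match any protocol    */
} rule_t;

static int prefix_match(uint32_t addr, uint32_t prefix, uint8_t plen) {
    uint32_t mask = plen ? ~(uint32_t)0 << (32 - plen) : 0;
    return (addr & mask) == (prefix & mask);
}

/* Check one packet header against one rule (the hardware checks up to
 * 48 rules in parallel per word; here sequentially for illustration). */
static int rule_matches(const rule_t *r, uint32_t sip, uint32_t dip,
                        uint16_t sport, uint16_t dport, uint8_t proto) {
    return sport >= r->sport_min && sport <= r->sport_max &&
           dport >= r->dport_min && dport <= r->dport_max &&
           prefix_match(sip, r->sip, r->sip_plen) &&
           prefix_match(dip, r->dip, r->dip_plen) &&
           (r->proto_wild || proto == r->proto);
}
```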


Low Power Architecture


Performance

[Figures: power figures for the ASIC implementation and for the Cyclone 3 implementation.]

[Figures: the ASIC and Cyclone 3 implementations classifying network traces using rulesets containing 20,000 rules.]

Conclusion

- Simulation results show that ASIC and FPGA implementations of our low power architecture can reduce power consumption by between 17% and 88% by adjusting the frequency.