DYNAMIC GRANULARITY KERNIGHAN-LEE PARTITIONING ALGORITHM
Erik Csókás, Pavel Čičák
Faculty of Informatics and Information Technologies
Ilkovičova 3, 842 16 Bratislava
csokas.erik@gmail.com, cicak@fiit.stuba.sk
Abstract. The computer industry is facing a problem in the design of complex systems. One result of increasing system complexity is that the gap between the idea of a system and its realization is rapidly expanding. One of the new approaches to handling this complexity is system design (co-design). This approach relies heavily on the step of partitioning the system as a whole into smaller co-operating pieces. In this work we address the partitioning problem of system design with a dynamic granularity Kernighan-Lee system partitioning algorithm.
Keywords: Co-design, partitioning, algorithm
1. INTRODUCTION
Several software, hardware, and co-design approaches currently exist, but each is specialized to its own field and does not allow designers to create truly mixed systems. The existing co-design approaches take a system specification and split it into a software and a hardware description. The target architecture in a typical hardware/software co-design situation could be:
• a processor (the generated software code is in C or assembly) and an FPGA (VHDL), or
• C and an ASIC (instruction set selection problem) (1).
The existing approaches force designers to think in small cooperating systems instead of allowing them to create the model of a complete system. Most design partitioning approaches consider only hardware/software partitioning. The target of the software part is usually a microprocessor, while the target of the hardware partition can be an ASIC, an FPGA, or a microprogrammable logic controller. For most types of embedded systems both a complete hardware and a complete software solution exist (2). These solutions have complementary advantages.
Finding an optimal partitioning can be a hard task: it consists of finding a trade-off between conflicting requirements. The central task of co-design tools and methods is the step of splitting the design into two or more parts. This step can be manual or automatic. It decides which components will be implemented in hardware and which in software. This decision can be complex, because numerous factors must be considered, including power consumption, performance, memory consumption, and communication costs.
2. PARTITIONING
Some steps in the co-design flow, such as partitioning, require manual decisions. Several approaches have been proposed to automate the task of design partitioning, i.e. deciding which parts should be implemented in software and which in hardware. The complexity of the systems sometimes makes it hard to achieve good manual partitioning results. Automatic partitioning seeks better results. In (4) Neema states: “In order to perform rigorous analysis and synthesis it is essential to prune the design space retaining only the most viable alternatives. In the past heuristics have been used to prune large design spaces. However, due to the complex behavior and interactions in multimodal systems it is difficult to come up with effective heuristics. A better approach is to use constraints to explore and prune the design spaces; constraint satisfaction can eliminate the designs that do not meet the constraints. The pruned design space contains only the designs that are correct with respect to the applied constraints. These designs can then be simulated, synthesized and tested.”
Homogeneous modeling means that a single modeling language is used throughout the whole design flow. In such situations the co-design flow starts with the specification in the selected modeling language.
3. KERNIGHAN-LIN ALGORITHM
One of the most promising directions in design partitioning is the application of the Kernighan-Lin (KL) algorithm (4) (5). The algorithm was developed for the circuit partitioning problem. It aims to partition a graph into two graphs of equal size with as few cut edges as possible. Applying the Kernighan-Lin algorithm to design partitioning was suggested by Vahid (5). In his work he suggests minimizing an execution time metric instead of the cut metric. Mann (4) has presented an altered algorithm, which tries to overcome the shortcomings of Vahid's implementation. The KL algorithm works in passes. In every pass each free node moves once. In every step a pair of free nodes is selected and swapped.
procedure onePass()
{
    calculate gains
    free all nodes
    while there are free nodes
    {
        let v be a free node with the maximal gain
        swap v with a low-gain node of the other part
        if (P_cur is better than P_best) then P_best = P_cur
        lock swapped nodes
        update gains
    }
}
procedure KL()
{
    create initial partition, set P_best and P_cur to it
    repeat
    {
        onePass()
        P_cur = P_best
    } until there is no improvement in P_best
}
After a swap, the swapped pair of nodes becomes locked. The algorithm always selects the node with the highest gain, i.e. it is a greedy algorithm. A pass ends when no more free nodes are available. Free nodes are swapped even at the price of worsening the solution; this is how the KL algorithm escapes from local optima. At the end of a pass only the best solution found is kept, all nodes are made available again, and a new pass is started. The algorithm stops when a pass does not find a better solution than the previous one.
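The pass structure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the graph is an undirected weighted adjacency dictionary, the cost is the plain cut weight, the best pair is found by exhaustive search instead of an efficient gain structure, and the names make_graph, cut_size, one_pass and kernighan_lin are hypothetical.

```python
from itertools import product


def make_graph(edges):
    """Build a symmetric adjacency dictionary from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def cut_size(graph, side):
    """Total weight of edges whose endpoints lie in different parts."""
    crossing = sum(w for u in graph for v, w in graph[u].items()
                   if side[u] != side[v])
    return crossing / 2  # every undirected edge was counted twice


def one_pass(graph, side):
    """One pass: each node is swapped at most once, the best partition is kept."""
    cur = dict(side)
    best, best_cut = dict(cur), cut_size(graph, cur)
    free = set(graph)

    def d(n):
        # external minus internal connection weight of node n
        return sum(w if cur[m] != cur[n] else -w for m, w in graph[n].items())

    while True:
        left = [n for n in free if cur[n] == 0]
        right = [n for n in free if cur[n] == 1]
        if not left or not right:
            break
        # greedy choice: the free pair whose swap reduces the cut the most
        a, b = max(product(left, right),
                   key=lambda p: d(p[0]) + d(p[1]) - 2 * graph[p[0]].get(p[1], 0))
        cur[a], cur[b] = cur[b], cur[a]  # swap even if the cut gets worse
        free -= {a, b}                   # lock the swapped pair
        c = cut_size(graph, cur)
        if c < best_cut:
            best, best_cut = dict(cur), c
    return best, best_cut


def kernighan_lin(graph, side):
    """Repeat passes until a pass no longer improves the best partition."""
    best, best_cut = dict(side), cut_size(graph, side)
    while True:
        candidate, candidate_cut = one_pass(graph, best)
        if candidate_cut >= best_cut:
            return best, best_cut
        best, best_cut = candidate, candidate_cut


# two triangles joined by one edge, started from a deliberately poor bisection
graph = make_graph([("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
                    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
                    ("c", "d", 1)])
start = {"a": 0, "b": 0, "d": 0, "c": 1, "e": 1, "f": 1}
part, cut = kernighan_lin(graph, start)
```

In this example the first swap exchanges c and d, which already separates the two triangles; the remaining swaps of the pass only worsen the cut and are discarded when the best partition is restored.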
4. CLUSTERING ALGORITHM
In the specification phase the designer specifies a design with structure and behavior. In system design usually large granularity is used for specification. In the previously presented system design methods the behavior specification
Both coarse and fine granularity levels have their benefits and weaknesses. The designer's choice is to use the most appropriate level of granularity. In this approach we present a dynamic granularity clustering method. Before the clustering method starts, the performance estimation (profiling) task needs to be completed.
The performance estimation (profiling) task evaluates the behavioral models and enhances the system design with the resulting information. The clustering process takes the flattened, profiled behavioral model as its input. Performance estimation metrics are inserted at method headers, cycles, and if/else constructs. Our clustering method goes below the method-level granularity to the code block level.
The Kernighan-Lin algorithm is an iterative improvement algorithm. It starts from a partition and works towards a better one by repeatedly swapping pairs of nodes. The main benefit of this algorithm is that it can easily escape local optima.
function cluster(in: flattened profiled design, out: clustered design)
for (i = 1; i < (number of code blocks / 4); i++)
// on average at least 4 blocks per cluster
{
    centers = findCenters(design, i);
    while there are unassigned blocks
    {
        for each (block in centers)
        {
            sort(neighbor blocks of block);
            // add the upper half of the blocks with high
            // communication rate to the i-th cluster
            Vi.Add(blocks, #blocks/2);
        }
    }
    // calculate the communication rate between clusters
    foreach (cluster)
    {
        foreach (block in cluster)
        {
            if the call is out of cluster
                clusterComm_i +=
            else
                inClusterComm_i +=
        }
    }
    if ((i > 3) && &&
        return clusters;
}
return clusters;
The basic idea behind this clustering algorithm is very simple. The algorithm stops when there are too many clusters (i.e. on average there should be at least 4 blocks in each cluster). In each iteration the algorithm first tries to find the centers of the clusters. We use a simple list-based algorithm, in which we choose the blocks with the highest communication rate. A slight modification of this algorithm could search for small groups (neighbors) of blocks with high communication rate. In the next step the actual clustering takes place. The idea is that every cluster tries to pull in the neighboring blocks that have the highest communication rate with the given cluster. In each iteration step we take the upper half of the neighboring blocks, as long as there are unassigned blocks. Once all blocks are assigned, we calculate the communication rate between the clusters and the inner communication rate of the clusters. We use a simple “rule of thumb” or elbow criterion to stop the algorithm.
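The center selection and the "pull in the upper half of the neighbors" step can be sketched in Python for a fixed number of clusters k. This is an illustration under assumptions and not the authors' implementation: the communication rate of a block is taken to be the plain sum of its link weights (the weighted formula of the paper is not reproduced), the outer loop over the number of clusters and the elbow criterion are omitted, and link, find_centers, grow_clusters and communication are hypothetical names.

```python
def link(edges):
    """Build a symmetric communication map from (a, b, rate) triples."""
    comm = {}
    for a, b, w in edges:
        comm.setdefault(a, {})[b] = w
        comm.setdefault(b, {})[a] = w
    return comm


def find_centers(comm, k):
    """Pick the k blocks with the highest total communication rate."""
    rate = {b: sum(nbrs.values()) for b, nbrs in comm.items()}
    return sorted(comm, key=lambda b: (-rate[b], b))[:k]


def grow_clusters(comm, k):
    """Grow k clusters; each sweep pulls in the upper half of the neighbors."""
    cluster_of = {c: i for i, c in enumerate(find_centers(comm, k))}
    while len(cluster_of) < len(comm):
        progressed = False
        for i in range(k):
            # rate between every unassigned neighbor and cluster i
            pull = {}
            for b, ci in cluster_of.items():
                if ci == i:
                    for n, w in comm[b].items():
                        if n not in cluster_of:
                            pull[n] = pull.get(n, 0) + w
            ranked = sorted(pull, key=lambda n: (-pull[n], n))
            for n in ranked[:(len(ranked) + 1) // 2]:  # upper half, at least one
                cluster_of[n] = i
                progressed = True
        if not progressed:
            break  # remaining blocks are not connected to any cluster
    return cluster_of


def communication(comm, cluster_of):
    """Return (inner, outer) communication; assumes every block is assigned."""
    inner = outer = 0
    for b, nbrs in comm.items():
        for n, w in nbrs.items():
            if cluster_of[b] == cluster_of[n]:
                inner += w
            else:
                outer += w
    return inner / 2, outer / 2  # each link was visited from both ends


# two tightly coupled groups of code blocks joined by one weak link (d, e)
comm = link([("a", "b", 10), ("a", "c", 8), ("a", "d", 6),
             ("e", "f", 9), ("e", "g", 7), ("e", "h", 5),
             ("d", "e", 1)])
clusters = grow_clusters(comm, 2)
inner, outer = communication(comm, clusters)
```

Comparing the inner and outer rates for increasing k is one way to apply an elbow-style stopping rule; the exact rule used in the paper is not recoverable from the text.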
To validate the clustering algorithm, two systems have been used: a very simple system consisting of 10 nodes, and a more complex system with 80 nodes.
Figure 1. Clustering graph (10 nodes)
According to the clustering graph the highest communication gain is achieved with 4 nodes.
For finding the centers of the clusters we use a simple list-based algorithm, where the code blocks are sorted according to their communication rate. The communication rate is calculated with formulae whose positive weights can be set by the designer according to the needs.
function findCenters(in: flattened profiled design, number of clusters, out: array blocks)
{
    // sort the flattened profiled design by communication rate
    sort(design);
    return arraysplit(design, number of clusters);
}
The use of pattern matching has previously been proposed by Čičák in (1) for ASIP design (the instruction set selection problem), and by P.A. Mudry (2) to improve the results of a hw/sw partitioning method.
The motivation for using a pattern matching algorithm in system design is to find reusable parts. When the partitioning process selects a part of the system design to be implemented in hardware or software, the not yet partitioned system design can be searched for similar patterns. If a part is moved into hardware, then the resulting system can be cheaper or faster. In the case of software the code size can be smaller (2).
5. MODIFIED KERNIGHAN-LEE PARTITIONING ALGORITHM
The Kernighan-Lin algorithm was originally developed for the circuit partitioning problem, but after various modifications it has made its way into co-design. Improvements of the KL algorithm include those of Fiduccia and Mattheyses (3) and Mann (4). The FM variant of the KL partitioning algorithm introduced an altered optimization constraint: it dropped the bisection constraint of the original KL algorithm, and the swapping of nodes was changed to moving nodes. The parts produced by the FM algorithm are therefore not necessarily of equal size. Another change to the algorithm was the introduction of the gain bucket array. The main benefit of using the gain bucket array to choose the best available node to move is its efficient implementation.
In order to implement an efficient system partitioning strategy, we have to make various changes to the KL algorithm. The KL algorithm enforces balance in the partitioned graphs, i.e. the partitioned graphs are of equal size from the starting partitioning to the end. In our case this approach is not acceptable. The original KL algorithm, as well as the Vahid, FM and Mann algorithms, use simple cost metrics. In the case of system partitioning the overall fitness function and the cost function have to be proposed with care. To keep the main benefit of the KL algorithm, its efficiency, the system partitioning method does not include scheduling. The algorithm itself follows the idea presented in (4). The proposed changes are marked in bold type.
function KL()
{
    create initial partition and set P_best and P_cur to it
    repeat
    {
        onePass(clusters)
        P_cur = P_best
    } until there is no improvement in P_best or maxiterations
    repeat
    {
        onePass(nodes)
        P_cur = P_best
    } until there is no improvement in P_best or maxiterations
}
The partitioning algorithm has been altered so that it makes use of the previously created clusters. In the first phase whole clusters are moved between architectures; in the second phase, in order to achieve a finer-granularity partitioning, the algorithm moves individual nodes. Using this approach the algorithm can achieve better performance, since neighboring nodes are moved in groups (clusters) in the first phase, and node moves are used in the second phase only for refinement at a finer granularity.
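The two-phase idea can be sketched in Python as follows. The sketch is a simplification under stated assumptions rather than the authors' algorithm: each step applies the single move that yields the lowest cost (the paired cheaper/faster moves and neighbor unlocking are left out), tasks are assumed to run sequentially, the cost is hardware cost plus a large penalty for a missed deadline, and all names and numbers (one_pass, two_phase, SW_TIME, HW_COST and so on) are hypothetical.

```python
def one_pass(units, assign, targets, cost):
    """Move every unit (a tuple of nodes) once; keep the best assignment seen."""
    cur = dict(assign)
    best, best_cost = dict(cur), cost(cur)
    free = list(units)
    while free:
        candidates = []
        for idx, unit in enumerate(free):
            for t in targets:
                if any(cur[n] != t for n in unit):
                    trial = {**cur, **{n: t for n in unit}}
                    candidates.append((cost(trial), idx, t))
        if not candidates:
            break
        _, idx, target = min(candidates)
        for n in free.pop(idx):  # popping the unit locks it for this pass
            cur[n] = target
        c = cost(cur)
        if c < best_cost:
            best, best_cost = dict(cur), c
    return best, best_cost


def two_phase(clusters, assign, targets, cost, max_iterations=10):
    """Phase 1 moves whole clusters, phase 2 refines with single nodes."""
    nodes = [(n,) for cluster in clusters for n in cluster]
    for units in (clusters, nodes):
        for _ in range(max_iterations):
            candidate, candidate_cost = one_pass(units, assign, targets, cost)
            if candidate_cost >= cost(assign):
                break
            assign = candidate
    return assign


# hypothetical profile data for four code blocks
SW_TIME = {"n1": 3, "n2": 10, "n3": 2, "n4": 8}
HW_TIME = {n: 1 for n in SW_TIME}
HW_COST = {"n1": 4, "n2": 7, "n3": 5, "n4": 5}
DEADLINE = 15


def cost(assign):
    """Hardware cost plus a heavy penalty for exceeding the deadline."""
    time = sum(HW_TIME[n] if t == "hw" else SW_TIME[n] for n, t in assign.items())
    area = sum(HW_COST[n] for n, t in assign.items() if t == "hw")
    return area + 1000 * max(0, time - DEADLINE)


clusters = [("n1", "n2"), ("n3", "n4")]
start = {n: "hw" for n in SW_TIME}  # all-hardware initial partition
result = two_phase(clusters, start, ["sw", "hw"], cost)
```

With these numbers the cluster phase alone lowers the cost from 21 to 10, and the node phase refines it to 9. The result is a local optimum only, which is what a greedy pass-based heuristic can be expected to deliver.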
The main function of the KL algorithm has been changed only in the way the initial partition is created and in the behavior of a pass of the algorithm. A characteristic of the KL algorithm is that it performs good partitioning when a valid initial partition is used as a starting point. There are several possibilities to create a valid partition:
• If the design specification contains only task deadline constraints, then an all-hardware or RTOS partition is always a valid partition.
• If the design specification does not contain task deadline constraints, then an all-software partition is also valid.
function CreateInitialPartition()
{
    int inc = 1;
    minSystemDeadline = minDeadline(nodesOfSystem)
    maxSystemDeadline = maxDeadline(nodesOfSystem)
    for all clusters in graph
    {
        deadlineOfCluster = minDeadline(nodesOfCluster)
        with probability
            cluster goes to the target partition specified with deadlineOfCluster
    }
    if (partition is valid)
        return partition
    inc++;
    repeat the for cycle until all nodes are in the fastest target architecture
}
One pass of the KL algorithm had to be changed to provide system partitioning capabilities.
function onePass(granularity level)
{
    free all clusters
    calculate gains
    while there are free nodes / clusters {
        let v be a free node / cluster with the maximal gain
        move node / cluster v to a cheaper partition
        let n be a free node / cluster with the minimal gain
        move node / cluster n to a faster partition
        if (P_cur is better than P_best) then P_best = P_cur
        lock moved nodes / clusters
        unlock neighbor nodes / clusters
        update gains
    }
}
However, such homogeneous partitions do not ensure good partitioning results. To create the initial partition we use a simple algorithm. CreateInitialPartition first calculates the minimal and the maximal deadline of all nodes. Then, for every cluster within the system specification, the algorithm tries to move the cluster into an appropriate target partition. The target partitions (architectures) are in this case sorted according to their speed (or their capability to ensure that a certain task is finished within the deadline time interval). If the resulting system partition is valid (i.e. the design constraints are met), the partition is returned. If the resulting partition is not valid, then the whole process is repeated, but the probability of assigning tasks to faster partitions (architectures) is higher.
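A possible reading of this procedure is sketched below in Python. The probability rule of the paper is not available in the text, so the sketch substitutes an assumed one: in round inc a cluster is promoted to the fastest target with a probability that grows linearly with inc, and otherwise goes to the target suggested by its deadline. The function name, the parameters is_valid, rng and max_rounds, and the example data are hypothetical.

```python
import random


def create_initial_partition(clusters, targets, is_valid, rng=None, max_rounds=20):
    """clusters: {name: deadline}; targets: list ordered from slowest to fastest."""
    rng = rng or random.Random(0)
    lo, hi = min(clusters.values()), max(clusters.values())
    fastest = len(targets) - 1
    for inc in range(1, max_rounds + 1):
        # assumed rule: the chance of promotion to the fastest target grows with inc
        p_fast = min(1.0, (inc - 1) / max(1, max_rounds - 1))
        partition = {}
        for name, deadline in clusters.items():
            # urgency is 1 for the tightest deadline and 0 for the loosest one
            urgency = 1.0 if hi == lo else (hi - deadline) / (hi - lo)
            preferred = round(urgency * fastest)
            if rng.random() < p_fast:
                partition[name] = targets[fastest]
            else:
                partition[name] = targets[preferred]
        if is_valid(partition):
            return partition
    # last resort: everything on the fastest target architecture
    return {name: targets[fastest] for name in clusters}


# hypothetical clusters with deadlines and three targets sorted by speed
deadlines = {"c1": 5, "c2": 50, "c3": 100}
targets = ["sw", "fpga", "asic"]
initial = create_initial_partition(
    deadlines, targets, is_valid=lambda p: p["c1"] == "asic")
```

In the first round no promotion takes place, so the example returns the deadline-driven assignment directly; a stricter validity check would push more clusters towards the fastest target in later rounds.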
One pass of the KL algorithm works as follows. For each node or cluster the gain is calculated for each possible target. The gain of a node for a given architecture is the cost of that node within the architecture. In the original KL algorithm the gain of a node was defined as the number by which the cut connections decrease when the node is moved to the given partition. The gain function greatly affects the efficiency of the KL algorithm. For example, in Mann (4) the gain of a node is a function of the delta of the software gain of the node, the delta of the hardware gain of the node, the running time of the system with respect to the partition, the cost of the current partition, and a deadline constraint. The gain of node v takes a separate value if the given node in the given partition violates the deadline constraint; otherwise it is the hardware cost of the given node. The problem with this approach is that it buries possible solutions where the deadline constraint is exceeded by a small amount, but for that price the hardware cost is greatly reduced. A more dynamic solution for the gain function would include a penalty term p( ), where p is a penalty function that penalizes the node by the amount by which the given node exceeds the deadline constraint. When using a permissive gain function, non-valid partitions should be allowed only temporarily, i.e. before a pass exits the algorithm should take the best valid partition. After the gains are calculated, a list of nodes (clusters) is updated. The effectiveness of the KL algorithm lies in the possibility of moving a node from one partition into another without the need to update the gains of the system.
The algorithm selects the nodes or clusters with high gains (which would probably increase the overall fitness of the system) to be moved into a cheaper partition. The nodes or clusters with the lowest gains are selected to be moved into faster partitions. After the move the node is locked, i.e. the node cannot be moved back in this pass. Mann (4) argued that unlocking the ex-neighbors of the node (cluster) would be beneficial to the overall performance of the partitioning algorithm. However, there should be a ticket-based constraint on how many times a node can be unlocked by its neighbors, to prevent endless loops.
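A permissive gain function of the kind discussed in this section can be sketched in Python. The exact formula of the paper is not recoverable from the text, so the sketch assumes one plausible form: the hardware cost of a node reduced by a linear penalty on the amount by which its deadline is exceeded. The names penalty, gain and weight, and the example nodes, are hypothetical.

```python
def penalty(overshoot, weight=2.0):
    """Assumed linear penalty; zero while the deadline is met."""
    return weight * max(0.0, overshoot)


def gain(hw_cost, finish_time, deadline, weight=2.0):
    """Hardware cost of a node, reduced by the penalty for a missed deadline."""
    return hw_cost - penalty(finish_time - deadline, weight)


# hypothetical nodes: (hardware cost, finish time, deadline)
nodes = {"filter": (10, 8, 10), "codec": (10, 11, 10), "io": (4, 20, 10)}
gains = {name: gain(*values) for name, values in nodes.items()}

to_cheaper = max(gains, key=gains.get)  # highest gain: candidate for a cheaper partition
to_faster = min(gains, key=gains.get)   # lowest gain: candidate for a faster partition
```

A node that misses its deadline by a small amount keeps most of its gain, so partitions that slightly violate the deadline are not ruled out immediately, while a large overshoot drives the gain down and makes the node a candidate for a faster partition.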
6. TEST RESULTS
To evaluate the modified Kernighan-Lee hardware-software partitioning we have used the results of the clustering algorithm. After the clustering algorithm is finished, 80 nodes from the system are distributed between 22 clusters. Two versions have been studied:
• Modified Kernighan-Lee partitioning
• Modified Kernighan-Lee partitioning with clustering
The resulting fitness of the partitioning process can be seen in Figures 1, 2 and 3. We can state that including clustering in the partitioning process has increased the overall fitness, but also made the partitioning process longer. However, as the number of nodes in the system increases, the running time of the modified Kernighan-Lee partitioning with clustering becomes 27% lower.
Figure 1. Modified Kernighan-Lee partitioning algorithm fitness / iteration graph
Figure 2. Running time of the modified KL algorithm
Figure 3. Design quality / running time of the algorithm tradeoff graph
This work was supported by Slovak Science Grant Agency
within:
• project No. VG 1/3104/06 „Systems of grid
computing and its components”
7. REFERENCES
1. Čičák, P. Príspevok k návrhu riadiacich jednotiek číslicových systémov. Dizertačná práca. s.l.: FEI STU, 1998.
2. P.A. Mudry, G. Zufferey, G. Tempesti. A hybrid genetic algorithm for constrained hardware-software partitioning. Lausanne, Switzerland: Ecole Polytechnique Fédérale de Lausanne.
3. C.M. Fiduccia, R.M. Mattheyses. A linear-time heuristic for improving network partitions. s.l.: Proceedings of the 19th Design Automation Conference, 1982.
4. Mann, Z. Á. Partitioning algorithms for hardware / software co-design. Budapest, Hungary: Budapest University of Technology and Economics.
5. F. Vahid, T. D. Le. Extending the Kernighan/Lin heuristic for hardware and software functional partitioning. s.l.: Design Automation for Embedded Systems 2:237-261, 1997.
THE AUTHORS
Pavel ČIČÁK received his M.Sc. and Ph.D. degrees in Computer Science and Networks at the Faculty of Informatics and Information Technologies of the Slovak University of Technology in Bratislava in 1979 and 1999, respectively. He is an Associate Professor at the Institute of Computer Systems and Networks of the Slovak University of Technology, teaching the courses Engineering Methods, Machine and System Level Programming, Computer Organization, and Computer Application Design. His scientific interests include Digital Control Systems Design, The New Methods of Computer Communications, Real-time Systems, means of hardware (and software) specification, as well as other topics in Computer Engineering. He has been with the Slovak University of Technology since 1978. During these years he has co-operated on several scientific and research projects, including Autonomous In-Circuit Emulators, Fault-Tolerant Computer Systems of New Generation, High Performance Parallel and Distributed Fault-Tolerant Computer Structures, Parallel Real-Time Systems, and Methods and Resources of the Computer Mobile Networks Safety Development. He has co-authored several textbooks for students (15) and scientific papers (28), and has contributed to several international conferences (13).
Since 2000 he has been the Main Contact and Director of the Regional Cisco Networking Academy RCNA FIIT STU Bratislava; he is CCNA and CCAI of Cisco Systems and FIET, CEng. of the IET.
Erik Csókás received his M.Sc. degree in Computer Science and Networks at the Faculty of Informatics and Information Technologies of the Slovak University of Technology in Bratislava in 2004. He is an external Ph.D. student at the Institute of Computer Systems and Networks of the Slovak University of Technology. His research areas include Distributed Visualization Systems, Co-Design and Petri Nets. He has authored scientific papers and contributed to an international conference.