A Fresh Look at Efficient Perl Sorting

Uri Guttman and Larry Rosler

Uri Guttman is an independent Perl and Internet consultant; <uri@sysarch.com>

Larry Rosler is with Hewlett-Packard Laboratories, Palo Alto, CA; <lr@hpl.hp.com>

Abstract

Sorting can be a major bottleneck in Perl programs. Performance can vary
by orders of magnitude, depending on how the sort is written. In this
paper, we examine Perl's sort function in depth and describe how to use it
with simple and complex data. Next we analyze and compare several
well-known Perl sorting optimizations (including the Orcish Maneuver and
the Schwartzian Transform). We then show how to improve their performance
significantly, by packing multiple sortkeys into a single string. Finally,
we present a fresh approach, using the sort function with packed sortkeys
and without a sortsub. This performs much better than any of the other
methods, and is easy to implement directly or by using a new module we
created, Sort::Records.

What is sorting and why do we use it?

Sorting is the rearrangement of a list into an order defined by a
monotonically increasing or decreasing sequence of sortkeys, where each
sortkey is a single-valued function of the corresponding element of the
list. (We will use the term sortkeys to avoid confusion with the keys of a
hash.)

Sorting reorders a list into a sequence suitable for further processing or
searching. Often the sorted output is intended for people to read; sorting
makes it much easier to understand the data and to find a datum.

Sorting is used in many types of programs and on all kinds of data. It is
such a common, resource-consuming operation that sorting algorithms and the
creation of optimal implementations constitute an important branch of
computer science.

This paper is about creating optimal sorts using Perl. We start with a
brief overview of sorting, including basic algorithm theory and notation,
some well-known sorting algorithms and their efficiencies, sortkey
processing, and sorting outside Perl. Next we describe Perl’s sort
function [1] and basic ways to use it. Then we cover handling complex
sortkeys, which raises the question of how to optimize their processing.
Finally we introduce a method that moves all the sortkey processing out of
the sort function, which produces the most efficient Perl sort. We present
a new module that implements this sorting technique, which has powerful
support for sortkey extraction (the processing of the input data to produce
the sortkeys).

Algorithm and sorting theory

A complete discussion of algorithm and sorting theory is beyond the scope
of this paper. This section will cover just enough theory and terminology
to explain the methods that we use to compare sort techniques.

The complexity of an algorithm is a measure of the resources needed to
execute the algorithm; typically there is a critical operation that needs
to be executed many times. Part of algorithm theory is figuring out which
operation is the limiting factor, and then formulating a function that
describes the number of times the operation is executed. This complexity
function is commonly written with the big-O notation, O(f(N)), where ‘O’ is
read as ‘order of’ and ‘f(N)’ is some function of N, the size of the input
data set.

O(f(N)) comparisons have some unusual properties. The actual size of N is
usually irrelevant to the correct execution of an algorithm, but its
influence on the behavior of f(N) is critical. If an algorithm’s order is
O(N*logN + N), when N is large enough the effect of the N on the function’s
value is negligible compared to the N*logN expression. So that algorithm’s
order is just O(N*logN). Sometimes the calculated order function for an
algorithm is a polynomial of N, but you see only the term with the highest
power, and no coefficient is shown. Similarly, if two algorithms have the
same order but one does more work for each operation, they are still
equivalent in order space, even though there may be a substantial
difference in real-world speeds. That last point is crucial in the
techniques we will show to optimize Perl sorts, which all have the same
big-O function, O(N*logN).

Here are some well-known algorithms and their order functions (adapted
from [2]):

Notation     Name          Example
O(1)         constant      array or hash index
O(logN)      logarithmic   binary search
O(N)         linear        string comparison
O(N*logN)    n log n       advanced sort
O(N**2)      quadratic     simple sort
O(N**3)      cubic         matrix multiplication
O(2**N)      exponential   set partitioning

Sorting’s critical operation is determining in which order to put pairs of
elements of the data. The comparison can be as simple as finding whether
two numbers are equal or which is greater than the other (or doing similar
operations on strings), or it can be complex.
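
As a rough illustration of the comparison being the critical operation, the
following sketch counts how often a sortsub is called for a few input
sizes and prints N*logN and N**2 alongside for scale. The helper name
count_comparisons and the random test data are assumptions made for this
example only; the exact counts depend on the perl version and on the data.

```perl
use strict;
use warnings;

# Count how many times the comparison (the critical operation) runs
# while sorting the given list numerically.
sub count_comparisons {
    my @data  = @_;
    my $count = 0;
    my @sorted = sort { $count++; $a <=> $b } @data;
    return $count;
}

for my $n (100, 1_000, 10_000) {
    my @data  = map { int rand 1_000_000 } 1 .. $n;
    my $count = count_comparisons(@data);
    printf "N=%6d  comparisons=%8d  N*log2(N)=%10.0f  N**2=%12d\n",
        $n, $count, $n * log($n) / log(2), $n * $n;
}
```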

Simple sorting algorithms (bubble or insertion sorts) compare each element
to each of the others repeatedly, so their complexity is O(N**2). Even with
the triangle optimization ($x is equal to $x, and $x compared to $y is the
negative of $y compared to $x), which reduces the function to
O((N * (N-1))/2), the complexity is still O(N**2), as explained above.

But these algorithms have their uses. When N is small, they can be faster
than the other methods, because the O(1) and O(N) overhead of the advanced
sorts may outweigh the O(N**2) behavior of the simple sorts. “Fancy
algorithms are slow when N is small, and N is usually small. Fancy
algorithms have big constants.” [3] The really important cases, which are
worth care in the coding, occur when N is large.

Advanced sorting methods repeatedly partition the records to be sorted into
smaller sets, to reduce the number of comparisons needed. Their complexity
is O(N*logN), which can be much less than O(N**2) for sufficiently large
values of N. These algorithms include ‘tree sort’, ‘shell sort’, and
‘quicksort’. [4]

Some specialized sort algorithms (such as ‘radix sort’) work by comparing
pieces of numeric sortkeys, and can achieve linear complexity (O(N)) [5].
These methods are not general-purpose, so we will not address them further.

One property of sort algorithms is whether they are stable. A stable sort
preserves the relative order in the sorted data of two elements that
compare equal. Some sorting problems require stability. The simple sorting
algorithms are generally stable; the advanced ones are not. We will show
how to make stable sorts in Perl.

When the original data elements can’t conveniently be moved around by the
sort algorithm’s shuffling, you sort their index numbers. You then use the
sorted indexes to create a list of sorted elements. Some sort operators in
other languages (APL comes to mind) simply return sorted indexes, and it is
up to the programmer to use them correctly. We will show how to create an
efficient Perl index sort and where it is useful to do so.
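
The idea of sorting index numbers can be sketched in a few lines. This is a
minimal illustration with made-up data, not the efficient index sort
developed later in the paper: the sortsub compares the elements that the
indexes refer to, and an array slice builds the sorted list.

```perl
use strict;
use warnings;

my @records = ('pear', 'apple', 'fig');   # illustrative data

# Sort the index numbers, comparing the elements they refer to.
my @sorted_idx = sort { $records[$a] cmp $records[$b] } 0 .. $#records;

# Use the sorted indexes to build the sorted list with an array slice.
my @sorted = @records[@sorted_idx];

print "@sorted_idx\n";   # 1 2 0
print "@sorted\n";       # apple fig pear
```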

Sortkeys

If you are sorting a set of scalar-valued elements where the comparison
looks at the entire element, the sortkey is simply the entire element. More
generally, the sortkey is based on some properties that are functions of
all or part of the element. Such subkeys may be extracted from internal
properties of parts of the element (fields) or derived from external
properties of the element such as the modification date of a file named by
the element, which is expensive to retrieve from the file system.

To avoid repeated computation of the sortkeys, the sort process has to
retain the association between records and their extracted or derived
sortkeys. Sorting theory and algorithms usually ignore the cost of this
association, as it is typically a constant factor of the comparison
operation. But as we will see later, in the real world, removing that
overhead or reducing it from O(N*logN) to O(N) is valuable, especially as N
grows.

Complex sortkeys can add tremendously to the overhead of each comparison.
This happens when the records have to be sorted by primary, secondary, and
lower-order subkeys. This is also known as doing a subsort on the lower
subkeys. Extracting and comparing complex sortkeys can be costly.

No previously known general-purpose implementation of a sort algorithm can
efficiently support extracting and comparing different types of sortkeys.
Therefore, most implementations provide a simple interface to call a
sortsub, a custom comparison subroutine that is passed two operands. These
operands can be the records themselves, or references to or indexes of the
records. The comparison returns a negative, zero, or positive value,
depending on the order of the sortkeys of the two records. The programmer
is responsible for any preprocessing of the records to generate the
sortkeys and any postprocessing to retrieve the sorted data. The generic
sort function only manages the comparisons and shuffles its operands into
sorted order.

As Perl’s sort function has order O(N*logN), efficiency must come from
extracting and comparing the sortkeys using the least computation. Much of
this paper will be about methods to make sortkey extraction and comparison
as efficient as possible.

External sorting

Every popular commercial operating system offers a sort utility.
Unix/POSIX flavors typically have a sort command that is fast and flexible
regarding sortkey extraction from text files. This command may be easier to
code and more efficient than using the Perl sort function, even considering
the overhead of piping data into and out of a second process.

Several vendors sell highly optimized commercial sort packages that have
received decades of attention and can handle massive amounts of data. But
they are expensive and usually not suited for use from a Perl program.

All these are capable of dealing efficiently with large amounts of data,
using external media such as disk or tape files for intermediate storage
when needed. In contrast, the Perl sort function requires that the entire
list of operands be in (real or, much more expensively, virtual) memory at
the same time. Perl is therefore not the appropriate tool for huge sorts
(where huge is defined by your system’s memory limits), and we shall not
consider them further.

Perl sorting

The Perl sort function uses an implementation of the quicksort algorithm
that is similar to (but more robust than) the qsort function in the
ANSI/ISO Standard C Library [6]. In the simplest use, the Perl sort
function requires no sortsub:

@out = sort @in;

This default sorts the data in ascending lexicographic order, using as the
comparison operation the fast C memcmp function (which simply compares
sequences of unsigned bytes). If a locale is specified, it substitutes the
more complicated and somewhat slower C strcoll function.

If you want any ordering other than this, you must provide a custom
comparison sortsub. The sortsub can be specified either as a code block,
the name of a subroutine, or a typeglob that refers to a subroutine. In
Perl 5.6, a scalar variable that contains a coderef can also be used to
specify the sortsub.

To optimize the calling of the sortsub, Perl bypasses the usual passing of
arguments via @_, using instead a more efficient special-purpose method.
Within the sortsub, the special package global variables $a and $b are
aliases for the two operands being compared. The sortsub must return a
number less than 0, equal to 0, or greater than 0, depending on the result
of comparing the sortkeys of $a and $b. The special variables $a and $b
should never be used to change the values of any input data, as this may
break the sort algorithm.
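
As a small sketch of three of these ways to specify a sortsub (the typeglob
form is omitted), the example below sorts the same illustrative list with a
code block, a named subroutine, and a scalar holding a coderef. The names
numerically and $by_num are made up for this example.

```perl
use strict;
use warnings;

my @in = (10, 9, 100, 1);   # illustrative data

# 1. Inline code block.
my @by_block = sort { $a <=> $b } @in;

# 2. Name of a subroutine; it reads the package globals $a and $b.
sub numerically { $a <=> $b }
my @by_name = sort numerically @in;

# 3. Scalar variable holding a code reference (Perl 5.6 and later).
my $by_num = sub { $a <=> $b };
my @by_ref = sort $by_num @in;

print "@by_block\n";   # 1 9 10 100
```

All three forms produce the same ordering; they differ only in how the
comparison code is named and reused.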

Even the simplest custom sort in Perl will be less efficient than using the
default comparison. The default sort runs entirely in C code in the perl
binary, but any sortsub must execute Perl code. A well-known optimization
is to minimize the amount of Perl code executing and to try to stay inside
the perl binary as much as possible. Later we will study several
optimization techniques that will reduce the amount of Perl code executed.

The primary goal of this paper is to do all sorts using the default
comparison. Here is how an ascending lexicographic sort would be done using
an explicit sortsub:

@out = sort { $a cmp $b } @in;

For a simple measurement, compare Default and Explicit in Benchmark A1 of
Appendix A. The default method is about twice as fast as the explicit
method.

Trivial sorts

We call trivial sorts those that use as the sortkey all or a fixed
substring of the record, and do only a minimal amount of processing of the
record. To do trivial Perl sorts other than ascending lexicographic, you
just need to create an appropriate sortsub. Here are some common simple but
useful sortsubs.

The simplest such example is the ascending numeric
sort, which uses the picturesquely monikered
‘spaceship’ operator:

@out = sort { $a <=> $b } @in;

A numeric sort is required because the lexicographic
order of, say, (1, 2, 10) does not correspond to the
numeric order.
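
The difference is easy to see on that same three-element list. This is a
minimal sketch; the variable names are chosen only for this example.

```perl
use strict;
use warnings;

my @in = (10, 2, 1);

my @lexicographic = sort @in;                 # compares as strings
my @numeric       = sort { $a <=> $b } @in;   # compares as numbers

print "@lexicographic\n";   # 1 10 2
print "@numeric\n";         # 1 2 10
```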

If you want the sort to be in descending order there are three techniques
you can use. The worst is to negate the result of the comparison in the
sortsub. Better is to reverse the order of the comparison by swapping $a
and $b. This has the same speed as the corresponding forward sort.

# descending numeric
@out = sort { $b <=> $a } @in;

# descending lexicographic
@out = sort { $b cmp $a } @in;

The best method is to apply the reverse function to the result of a default
ascending lexicographic sort.

@out = reverse sort @in;

Note that this is faster than using the explicit descending lexicographic
sort, for the reason discussed above: the default sort is faster than using
a sortsub. The reverse function is efficient because it just moves pointers
around.
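
A quick check that the two descending lexicographic forms give the same
result, using made-up data. This sketch shows equivalence only; the
relative speed should be measured on your own perl, since it may differ
between versions.

```perl
use strict;
use warnings;

my @in = qw(pear apple fig banana);   # illustrative data

my @explicit = sort { $b cmp $a } @in;   # sortsub with operands swapped
my @reversed = reverse sort @in;         # default sort, then reverse

print "@reversed\n";   # pear fig banana apple
```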

Another simple example is sorting on only a substring of the element, using
the substr function. This may be faster than the equivalent advanced sorts
that we will discuss later, because a call to substr can be faster than a
hash lookup (Orcish Maneuver) or an array dereference (Schwartzian
Transform).

@out = sort { substr($a, 4, 6) cmp
              substr($b, 4, 6) } @in;

Another common problem is sorting with case insensitivity. This is easily
solved using the lc or uc function. Either one will give the same results.

@out = sort { lc $a cmp lc $b } @in;

Benchmark A1 analyzes these examples as a function of the input size. The
O(N*logN) behavior is apparent, as well as the cost of using in the sortsub
even a simple built-in function like substr or lc.

Fielded and record sorts

The above trivial sorts sort the input list using as the sortkey the entire
string or one substring (for a lexicographic sort) or the first number in
each datum (for a numeric sort). More typically, the sortkey is based on
some properties that are a function of several parts of each datum.
Individual subkeys may be combined into a single sortkey or may be compared
in pairs.

A string may be divided into fields, some of which may serve as subkeys.
For example, the Unix/POSIX sort command provides built-in support for
collation based on one or more fields of the input; the Perl sort function
does not, and the programmer must provide it. One CPAN module focuses on
fielded sorts [7].

If your data are records that are complex strings or references to arrays
or hashes, you have to compare selected parts of the records. This is
called record sorting. (Fielded sorts are a subset of record sorts.)

In the code examples that follow, KEY() is meant to be substituted with
some Perl code that does sortkey extraction. It is best that it not be an
actual subroutine call, because subroutine calls within sortsubs can be
expensive. Calls to built-in Perl functions (such as the calls to lc in the
example above) are like Perl operators, thus less expensive.

When sorting string records, $a and $b are set to those strings, so to
extract the sortkeys you generally do various string operations on the
records. Functions commonly used for this include split, substr, unpack,
and m//.

Here is one example, sorting a list of password-file lines by user name
using split. The fields are separated by colons, and the user name is the
first field.

@out = sort {
    (split ':', $a, 2)[0] cmp
    (split ':', $b, 2)[0]
} @pw_lines;
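
To try this sort on its own, the sketch below applies it to three made-up
password-file lines. The sample entries are hypothetical and serve only to
show the ordering by the first field.

```perl
use strict;
use warnings;

# Hypothetical password-file lines; only the first field matters here.
my @pw_lines = (
    "uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin",
    "root:x:0:0:root:/root:/bin/sh",
    "daemon:x:2:2:daemon:/sbin:/sbin/nologin",
);

# A limit of 2 stops split after the first colon, so only the user
# name is separated from the rest of the line.
my @out = sort {
    (split ':', $a, 2)[0] cmp
    (split ':', $b, 2)[0]
} @pw_lines;

print "$_\n" for @out;   # daemon, root, uucp lines in that order
```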

Multi-subkey sorts

Often you need to sort records by a primary subkey, then for all the
records with the same primary subkey value, you need to sort by a secondary
subkey. One horribly inefficient way to do this is to sort first by the
primary subkey, then sort all the records with each primary subkey by the
secondary subkey. A better method is to do a multi-key sort. This entails
extracting a subkey for each field and comparing paired subkeys in priority
order. So two records with the same primary subkey are immediately compared
based on the secondary subkey. Sorting on more than two subkeys is done by
extending the logic.

Perl has a convenient feature that makes multi-key sorts easy to write. The
|| and or (short-circuit logical or) operators return the value of the
first logically true operand they see. So if you use them to concatenate a
set of key comparisons, the first comparison is the primary subkey. If a
pair of primary subkeys compare equal, the sortsub’s return value will be
the result of the secondary subkey comparison.

An example will illustrate this ‘ladder’ of comparisons better than more
text. Here is a three-subkey sort:

@out = sort {
    # primary subkeys comparison
    KEY1($a) cmp KEY1($b)
        ||
    # or if they are equal
    # return secondary comparison
    # descending numeric comparison
    KEY2($b) <=> KEY2($a)
        ||
    # or if they are equal
    # return tertiary comparison
    # lexicographic comparison
    KEY3($a) cmp KEY3($b)
} @in;
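
A concrete instance of such a ladder, under assumed data: the records are
hypothetical "department:salary:name" strings, and split stands in for the
KEY1, KEY2, and KEY3 extractions. It is a sketch of the naive form, so both
records are split again on every comparison.

```perl
use strict;
use warnings;

# Hypothetical records of the form "department:salary:name".
my @in = (
    "sales:300:carol",
    "dev:500:bob",
    "sales:400:alice",
    "dev:500:adam",
);

my @out = sort {
    my @fa = split /:/, $a;
    my @fb = split /:/, $b;
    $fa[0] cmp $fb[0]       # primary: department, ascending
        ||
    $fb[1] <=> $fa[1]       # secondary: salary, descending numeric
        ||
    $fa[2] cmp $fb[2]       # tertiary: name, ascending
} @in;

print "$_\n" for @out;
# dev:500:adam
# dev:500:bob
# sales:400:alice
# sales:300:carol
```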

Naive multi-subkey record sorts

In the two previous examples, we showed a sort with expensive sortkey
extraction (via split), and a multi-subkey sort. Let’s combine them. For
concreteness, we shall deal with a problem that has received much attention
in comp.lang.perl.misc: sorting a list of IP addresses in ‘dotted-quad’
form. Each element of the list is a string of the form
"nnn.nnn.nnn.nnn\tabc.xyz.com\n", where nnn represents a decimal integer
between 0 and 255, with or without leading zero-padding.

In the most naive approach, we sort successively on each of these four
numeric fields as individual subkeys.

@out = sort {
    my @a = $a =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
    my @b = $b =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
    $a[0] <=> $b[0] ||
    $a[1] <=> $b[1] ||
    $a[2] <=> $b[2] ||
    $a[3] <=> $b[3]
} @in;

This is slow even for small lists, because of the many Perl operations
executed in the sortsub for each of the O(N*logN) comparisons.

Computing a single packed-string sortkey

To improve performance, we will derive from these four subkeys a single
packed-string sortkey for each IP address, that we can then use to sort the
array monotonically increasing.

The following expression produces the shortest key, a string of four bytes,
with the least Perl calculation:

pack 'C4' => $string =~
    /(\d+)\.(\d+)\.(\d+)\.(\d+)/

This uses the fancy comma operator, =>, which you can read as ‘applied to’.
We then sort these sortkeys lexicographically. (The four integers could be
combined into a 32-bit integer by multiplication or shifting, and then
compared numerically. However, we are looking ultimately for lexicographic
sorting.)
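
The sketch below checks the two properties claimed for the packed key on a
few made-up addresses: it is four bytes long, and comparing keys as strings
yields numeric address order. The helper ip_key is an assumption of this
example and is written as a subroutine only for readability; inside a real
sortsub the expression would be written inline, as discussed above.

```perl
use strict;
use warnings;

# Pack the four decimal fields of a dotted-quad address into four bytes.
sub ip_key {
    my ($string) = @_;
    return pack 'C4' => $string =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/;
}

my @ips = ('10.0.0.2', '9.255.0.1', '192.168.1.10', '10.0.0.10');

# Lexicographic comparison of the packed keys.
my @sorted = sort { ip_key($a) cmp ip_key($b) } @ips;

print length(ip_key('10.0.0.2')), "\n";   # 4
print "@sorted\n";   # 9.255.0.1 10.0.0.2 10.0.0.10 192.168.1.10
```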

The following, then, is the next approach toward achieving an efficient
sort:

@out = sort {
    pack('C4' => $a =~
        /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
    cmp
    pack('C4' => $b =~
        /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
} @in;

Benchmark A2 shows that comparing the subkeys in pairs is less efficient
than packing them and comparing the packed strings. This observation
applies to all sorting methods. In further benchmarks of advanced sorts for
this problem, we will use packed sortkeys.

Nevertheless, naive sorting is still woefully inefficient, because both
sortkeys are recomputed every time one input operand is compared against
another. What we need is a way to compute each sortkey once only and to
remember the result.

Advanced sorts

As all sorts in Perl use the built-in sort function and therefore the same
quicksort algorithm, all Perl sorts are of order O(N*logN). We can’t
improve on that, so we have to address other issues to gain efficiency. As
the complexity is fixed, tackling the constant factors can be fruitful and,
in the real world, can produce significant improvements in efficiency. When
a sortsub needs to generate a complex sortkey, that is normally done
O(N*logN) times, but there are only N records, hence N sortkeys. What if we
were to extract the sortkey only once per record, and keep track of which
sortkey belonged to which record?

Caching the sortkeys

The obvious way to associate sortkeys with the records from which they were
created is to use a hash. The hash can be created in a preprocessing pass
over the data. If the approximate size of the data set is known,
preallocating the hash improves performance.

keys my %cache = @in;

$cache{$_} = KEY($_) for @in;

The following sets up the cache more efficiently, using
a hash slice:

keys my %cache = @in;

@cache{@in} = map KEY($_) => @in;

Then the sortsub simply sorts by the values of the cached sortkeys.

@out = sort {
    $cache{$a} cmp $cache{$b}
} @in;

In essence, we have replaced lengthy computations in the sortsub by speedy
(O(1)) hash lookups.
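
A runnable version of the cached sort, as a sketch with made-up records.
Here the packed four-byte IP key from the earlier section plays the role of
KEY(), and the hash is preallocated with the documented lvalue form of
keys.

```perl
use strict;
use warnings;

# Hypothetical records: an IP address followed by a label.
my @in = ('10.0.0.2 b', '9.255.0.1 a', '192.168.1.10 c', '10.0.0.10 d');

my %cache;
keys(%cache) = scalar @in;   # preallocate hash buckets

# Preprocessing pass: extract each sortkey exactly once.
$cache{$_} = pack('C4', /(\d+)\.(\d+)\.(\d+)\.(\d+)/) for @in;

# The sortsub now does only two hash lookups per comparison.
my @out = sort { $cache{$a} cmp $cache{$b} } @in;

print "$_\n" for @out;
# 9.255.0.1 a
# 10.0.0.2 b
# 10.0.0.10 d
# 192.168.1.10 c
```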

If you want to do a complex multi-key comparison, you either have to use a
separate cache for each subkey or combine subkeys in a similar way to the
packed-sort optimizations we will describe later. Here is an example of the
former:

keys my %cache1 = @in;
keys my %cache2 = @in;

($cache1{$_}, $cache2{$_}) =
    map { KEY1($_), KEY2($_) } $_
        for @in;

@out = sort {
    $cache1{$a} cmp $cache1{$b} ||
    $cache2{$b} <=> $cache2{$a}
} @in;

Alternatively, a multi-level cache can be used, which sacrifices speed to
save some space:

keys my %cache = @in;

@cache{@in} =
    map [ KEY0($_), KEY1($_) ]
        => @in;

@out = sort {
    $cache{$a}[0] cmp $cache{$b}[0]
        ||
    $cache{$b}[1] <=> $cache{$a}[1]
} @in;

An important point about cached sorts is that no postprocessing is needed
to retrieve the sorted records. The method sorts the actual records, but
uses the cache to reduce the sortkey extraction to O(N).

The Orcish Maneuver (OM)

The Orcish Maneuver (invented by Joseph N. Hall [8]) eliminates the
preprocessing pass over the data, which might save keeping a copy of the
data if they are being read directly from a file. It does the sortkey
extraction only once per record, as it checks the hash to see if it was
done before. The test and storage of the sortkey is done with the ||=
operator (short-circuit logical-or assignment), which will evaluate and
assign the expression on the right to the lvalue on the left, if the lvalue
is false. The name ‘orcish’ is a pun on ‘or-cache’. The full statement in
the sortsub looks like this:

keys my %or_cache = @in;

@out = sort {
    ($or_cache{$a} ||= KEY($a))
        cmp
    ($or_cache{$b} ||= KEY($b))
} @in;

The first part of the sortsub sees if the sortkey for $a is cached. If not,
it extracts the value of the sortkey from the operand and caches it. The
sortkey for $a is then compared to the sortkey for $b (which is found in
the same way).

Here is an example of a two-subkey comparison using two caches:

keys my %or_cache1 = @in;
keys my %or_cache2 = @in;

@out = sort {
    ($or_cache1{$a} ||= KEY1($a))
        cmp     # primary string key
    ($or_cache1{$b} ||= KEY1($b))
        ||      # or
    ($or_cache2{$b} ||= KEY2($b))
        <=>     # secondary numeric key
    ($or_cache2{$a} ||= KEY2($a))
} @in;

The OM has some minor efficiency flaws. An extra test is necessary after
each sortkey is retrieved from the or-cache. Furthermore, if an extracted
sortkey has a false value, it will be recomputed every time. This usually
works out all right, because the extracted sortkeys are seldom false.
However, except when the need to avoid reading the data twice is critical,
the explicit cached sort is always slightly faster than the OM. (See
Benchmark A3.)
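
To see that the OM extracts each sortkey only once, the sketch below counts
the extractions. The subroutine key_of and the counter $extractions are
assumptions of this example (a real sortsub would inline the extraction);
the data are chosen so that every key is a true value and every record is
distinct.

```perl
use strict;
use warnings;

my @in = qw(pear Apple fig Banana);   # illustrative data
my $extractions = 0;

# Hypothetical sortkey extraction: a case-folded copy of the record.
# It counts its own calls so the caching effect can be observed.
sub key_of { $extractions++; return lc $_[0] }

my %or_cache;
my @out = sort {
    ($or_cache{$a} ||= key_of($a))
        cmp
    ($or_cache{$b} ||= key_of($b))
} @in;

print "@out\n";           # Apple Banana fig pear
print "$extractions\n";   # 4, one extraction per record
```

If any extracted key were a false value such as "0" or the empty string,
the count would be higher, because ||= would recompute that key on every
comparison involving it.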

The Schwartzian Transform (ST)

A more efficient approach to caching sortkeys, without using named
temporary variables, was popularized by Randal L. Schwartz, and dubbed the
Schwartzian Transform [9, 10]. (It should really be called the Schwartz
Transform, after the model of the Fourier and Laplace Transforms, but it is
too late to fix the name now.)

The significant invention in the ST is the use of anonymous arrays to store
the records and their sortkeys. The sortkeys are extracted once, during a
preprocessing pass over all the data in the list to be sorted (just as we
did before in computing the cache of sortkeys).

@out =
    map $_->[0] =>
    sort { $a->[1] cmp $b->[1] }
    map [ $_, KEY($_) ] =>
    @in;

The ST doesn’t sort the actual input data. It sorts the
references to anonymous arrays that contain the original
records and the sortkeys. So we have to postprocess to
retrieve the sorted records from the anonymous arrays.

Using the ST for a multi-subkey sort is straightforward. Just store each
successive extracted subkey in the next entry in the anonymous array. In
the sortsub, do an or between comparisons of successive subkeys, as with
the OM and the naive sorts.

@out =
    map $_->[0] =>
    sort { $a->[1] cmp $b->[1] ||
           $b->[2] <=> $a->[2] }
    map [ $_, KEY1($_), KEY2($_) ]
        => @in;
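
A concrete two-subkey ST, sketched with hypothetical "department:salary"
records: the department is the primary string subkey, and the salary is the
secondary subkey in descending numeric order. The block form of map is used
here so that split can extract both subkeys in one pass.

```perl
use strict;
use warnings;

# Hypothetical records of the form "department:salary".
my @in = ("sales:300", "dev:500", "sales:400", "dev:700");

my @out =
    map  { $_->[0] }                           # retrieve the records
    sort { $a->[1] cmp $b->[1]                 # department, ascending
               ||
           $b->[2] <=> $a->[2] }               # salary, descending
    map  { my ($dept, $salary) = split /:/;    # extract subkeys once
           [ $_, $dept, $salary ] }
    @in;

print "@out\n";   # dev:700 dev:500 sales:400 sales:300
```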

For an illuminating deconstruction and reconstruction
of the ST, see [11].

The packed-default sort

Each of the advanced sorting techniques described above saves the operands
to be sorted together with their sortkeys. (In the cached sorts, the
operands are the keys of a hash and the sortkeys are the values of the
hash; in the Schwartzian Transform, the operands are the first elements of
anonymous arrays, the sortkeys are the other elements of the arrays.) We
now extend that idea to saving the operands to be sorted together with
packed-string sortkeys, using concatenation.

This little-known optimization improves on the ST by eliminating the
sortsub itself, relying on the default lexicographic sort, which as we
showed earlier is efficient. This is the method used in the new
Sort::Records module.

To accomplish this goal, we modify the ST by replacing its anonymous arrays
by packed strings.

First we pack the subkeys into a single string. Multiple subkeys are simply
concatenated, suitably delimited if necessary. Then we append the operand
to be sorted.

Several methods can be used, singly or in combination, to build the packed
strings, including concatenation, pack, or sprintf. Techniques for
computing subkeys of various types are presented in Appendix B.

Then we sort lexicographically on those strings, and
finally we retrieve the operands from the end of the
strings.

Several methods can be used to retrieve the operands, including substr
(shown here), which is likely to be the fastest, split, unpack or a regex.

@out =
    map substr($_, 4) =>
    sort
    map pack('C4' =>
        /(\d+)\.(\d+)\.(\d+)\.(\d+)/)
        . $_ => @in;
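
The same pipeline can be run end to end on a few made-up records. This is a
sketch under the assumption that the records are plain byte strings
beginning with a dotted-quad address; the block form of map is used purely
for readability.

```perl
use strict;
use warnings;

# Hypothetical records: IP address, a tab, then a host name.
my @in = (
    "10.0.0.2\tb.example.com",
    "9.255.0.1\ta.example.com",
    "192.168.1.10\tc.example.com",
    "10.0.0.10\td.example.com",
);

my @out =
    map  { substr($_, 4) }        # strip the 4-byte key, keep the record
    sort                          # default comparison, no sortsub
    map  { pack('C4', /(\d+)\.(\d+)\.(\d+)\.(\d+)/) . $_ }
    @in;

print "$_\n" for @out;   # 9.255.0.1, 10.0.0.2, 10.0.0.10, 192.168.1.10
```

Because the whole record follows the key, records with equal keys are
ordered by their text, which is harmless here.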

Benchmarks of the packed-default sort

Benchmark A4 compares the two most advanced general-purpose sorting
techniques, ST and packed-default. These multi-stage sorts are measured
both as individual stages with saved intermediate data and as single
statements.

The packed-default sort is about twice as fast as the ST, which is the
fastest familiar Perl sorting algorithm.

Earlier, we showed trivial sorts using the substr or lc function. Even for
those cases, the packed-default sort performs better when more than a few
data items are being sorted. See Benchmark A5, which shows quasi-O(N)
behavior for the packed-default sort over a wide range of input sizes,
because the sorting is much faster than the sortkey extraction.
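
One way the case-insensitive trivial sort might be recast as a
packed-default sort is sketched below. This is an assumption-laden
illustration, not the benchmarked code: it presumes that no record contains
a NUL byte, so that "\0" can delimit the variable-length lc sortkey from
the appended record.

```perl
use strict;
use warnings;

my @in = qw(pear Apple fig Banana apple);   # illustrative data

my @out =
    map  { substr($_, 1 + index($_, "\0")) }   # text after the first NUL
    sort                                       # default comparison
    map  { lc($_) . "\0" . $_ }                # sortkey, NUL, record
    @in;

print "@out\n";   # Apple apple Banana fig pear
```

The NUL delimiter sorts below every other byte, so a key that is a prefix
of a longer key still sorts first.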

Sorting a list of arrays or hashes

Consider the common problem of sorting a two-dimensional data structure, a
list of references to arrays or to hashes, where the sortkeys are functions
of the values of the submembers.

If we were to use the packed-default method, the references would be
converted to strings and appended to the sortkeys. After sorting, the
operands could be retrieved as strings, but would no longer be usable as
references. Instead, we must use the indexes of the list members as the
operands to be sorted.

The following benchmark compares a packed-sortkey ST sort with an indexed
sort that uses the packed-default approach. The list being sorted comprises
references to arrays, each of which has two elements: an IP address (that
serves as the primary sortkey), and a domain name (that serves as the
secondary sortkey). These are the same data as used in the above
benchmarks, split into two array elements.

# Schwartzian Transform with a packed sortkey
@out =
    map $_->[0] =>
    sort { $a->[1] cmp $b->[1] }
    map [ $_, pack('C4 A*' =>
        $_->[0] =~
            /(\d+)\.(\d+)\.(\d+)\.(\d+)/,
        $_->[1]) ] => @in;

# Indexed sort using the packed-default approach
my $i = 0;
keys my %h = @in;

@h{ map pack('C4 A* x N' => $_->[0]
        =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/,
    $_->[1], $i++) => @in } = @in;

@out = @h{ sort keys %h };

The indexed sort is faster than the ST once again. (See
Benchmark A6.)

Indexed sorts and stable sorts

In the indexed sort, the auto-incrementing index $i ensures that no array
records will have identical packed sortkeys. It also ensures that the sort
will be stable.

Any Perl sort can be stabilized by using such an index as the final
tie-breaking subkey. For an indexed sort, the index is what is being
sorted. This offers another possible performance advantage for the indexed
sort. The actual records to be sorted (which may be long strings) need not
be appended to the sortkeys, which would create a second copy of each
record. Using the indexed sort, the records may be recovered after the sort
from the original data, using the sorted indexes.
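
Recovering the records through the sorted indexes might look like the
following sketch. The three-element records and the 'note' field are
hypothetical, added only to make the stability visible: the two records
with identical sortkeys keep their input order because the index is the
final subkey.

```perl
use strict;
use warnings;

# Hypothetical records: [ IP address, domain name, note ].
my @in = (
    [ '10.0.0.2',  'b.example.com', 'first'  ],
    [ '9.255.0.1', 'a.example.com', 'only'   ],
    [ '10.0.0.2',  'b.example.com', 'second' ],
);

my $i = 0;
my @sorted_idx =
    map  { unpack('N', substr($_, -4)) }    # last 4 bytes hold the index
    sort                                    # default comparison
    map  { pack('C4 A* x N',
               $_->[0] =~ /(\d+)\.(\d+)\.(\d+)\.(\d+)/,
               $_->[1], $i++) }
    @in;

# Recover the records from the original data with an array slice.
my @out = @in[@sorted_idx];

print join(' ', map { $_->[2] } @out), "\n";   # only first second
```

Only the packed sortkeys and indexes are copied and sorted; the records
themselves stay in @in until the final slice.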

The Sort::Records module

One of the authors (Uri) has created a new module, Sort::Records [12], that
combines the packed-default sort technique with automatic subkey extraction
using a simple attribute/value syntax. The module builds a subroutine that
converts the specified subkeys into a single packed string that can be
sorted using the default comparison. This key-extraction coderef is saved,
and the object can be used to sort more data or it can be cloned. See
Appendix C for the Pod of this module.

Conclusions

Packing of subkeys into strings that can be compared
lexicographically improves the performance of all
sorting techniques, relative to the usual method of
comparing the individual subkeys in pairs.

Appending the operands to the sortkeys allows the sort to be done using the
default ascending lexicographic comparison (without a sortsub). This yields
a markedly faster sort than the Orcish Maneuver or the Schwartzian
Transform. The sorting process may approach O(N) behavior over a wide range
of input sizes, because the O(N*logN) time for the sort itself may be small
compared to the time required to extract the sortkeys.

The packed-default sort may be written explicitly, or the new Sort::Records
module may be used.

Acknowledgments

This idea was brought to our attention by Michal Rutka [13]. John Porter
participated in initiating this project. Tom Christiansen provided valuable
comments on the penultimate draft.

References


1. The sort function man page,
   http://www.perl.com/CPAN/doc/manual/html/pod/perlfunc/sort.html

2. Kernighan, B. W. & Pike, R. (1999). The Practice of Programming, p. 41.
   Addison-Wesley.

3. Pike, R. (1989). Notes on Programming in C,
   http://wwwwbs.cs.tu-berlin.de/~jutta/c/pikestyle.html

4. Knuth, D. E. (1998). The Art of Computer Programming: Sorting and
   Searching (Vol. 3, 2nd ed.), chap. 5. Addison-Wesley.

5. Sedgewick, R. (1983). Algorithms, chap. 10. Addison-Wesley.

6. ANSI/ISO 9899-1992, sect. 4.10.5.2. American National Standards
   Institute.

7. Hall, J. N., Sort::Fields: Sort lines containing delimited fields,
   http://www.perl.com/CPAN/modules/by-module/Sort/JNH/

8. Hall, J. N. (1998). Effective Perl Programming, p. 48. Addison-Wesley.

9. How do I sort an array by (anything)?,
   http://www.perl.com/CPAN/doc/manual/html/pod/perlfaq4.html#How_do_I_sort_an_array_by_anyth

10. Christiansen, T. & Torkington, N. (1998). The Perl Cookbook, Recipe
    4.15: “Sorting a List by Computable Field”. O’Reilly.

11. Christiansen, T., Far More Than Everything You’ve Ever Wanted to Know
    About Sorting, http://www.perl.com/CPAN/doc/FMTEYEWTK/sort.html

12. The Sort::Records module and benchmarks are available at
    http://www.sysarch.com/perl/sort/

13. Rutka, M., in comp.lang.perl.misc.
    http://x4.dejanews.com/[ST_rn=ps]/getdoc.xp?AN=397853353

14. The paper, benchmarks and slides are available at
    http://www.hpl.hp.com/personal/Larry_Rosler/sort/

Appendix A: Benchmarks

A caveat: Useful benchmarking depends on judicious
isolation of relevant variables, both in the algorithms
being benchmarked and in the data sets used. Different
implementations may give different relative results even
with the same algorithms and data. Thus all such results
should be verified under your own conditions. In short,
your mileage may vary.

In the following benchmarks [14], all data represent the
CPU time (in microseconds) per line in the input data,
which averages 35 characters per line. All named arrays
and hashes are preallocated, which reduces the variance
in the measurements caused by storage allocation.
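
To verify such figures under your own conditions, a measurement along the following lines may serve as a starting point. This is only a rough sketch, not the harness used for the tables below: the data set, the seed, and the repetition count are arbitrary assumptions, and the built-in times function has coarse resolution on many systems.

```perl
use strict;
use warnings;

# Hypothetical data set: 1000 pseudo-random lines of 35 characters each.
srand 42;
my @in = map { join '', map { chr(97 + int rand 26) } 1 .. 35 } 1 .. 1000;

my @out;
$#out = $#in;                       # preallocate the output array

my $reps = 20;                      # repetition count, chosen arbitrarily
my ($user0, $sys0) = times;         # CPU seconds consumed so far
@out = sort @in for 1 .. $reps;
my ($user1, $sys1) = times;

# CPU time per input line, in microseconds, averaged over all repetitions.
my $usec_per_line =
    1e6 * (($user1 - $user0) + ($sys1 - $sys0)) / ($reps * @in);
printf "default sort: %.2f microseconds per line\n", $usec_per_line;
```

Replacing the sort statement with another variant, and subtracting the time of a plain copy as a control, gives comparable per-line numbers.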

A1. Trivial sorts

Control      @out = @in;
Default      @out = sort @in;
Reverse      @out = reverse sort @in;
Explicit     @out = sort { $a cmp $b } @in;
Substring    @out = sort { substr($a, 4, 6) cmp substr($b, 4, 6) } @in;
Insensitive  @out = sort { lc $a cmp lc $b } @in;

Number of lines:      100   1000    10K   100K
Control                 5      6      7      8
Default                 9     13     19     25
Reverse                 9     14     19     26
Explicit               17     25     37     50
Substring              33     49     68     85
Insensitive            43     64     92    120


A2. Naive sorts (IP addresses)

Number of lines:      100   1000    10K   100K
Separate subkeys      697   1251   1732   2758
Packed sortkeys       583   1002   1363   1814

A3. Cached sorts (packed sortkeys)

Number of lines:      100   1000    10K   100K
Caching                66     75     85     74
Sorting                49     87    122    164
Total cached sort     116    163    215    240
Orcish Maneuver       125    168    221    256

A4. Advanced packed-sortkey sorts

Number of lines:      100   1000    10K   100K
ST
  Caching              80     84     84     75
  Sorting              27     47     76     97
  Retrieval            13     18     20     17
  One statement       116    150    177    191
Packed Default
  Packing              61     63     65     67
  Sorting               9     12     18     25
  Retrieval            12     12     13     12
  One statement        73     79     86     93

A5. Another look at trivial sorts

Packed substring

    @out = map substr($_, 6) =>
        sort map
        substr($_, 4, 6) . $_ =>
        @in;

Packed insensitive

    @out = map substr($_,
        1 + rindex $_, "\0") =>
        sort map "\L$_\E\0$_" =>
        @in;

Number of lines:       10    100   1000    10K   100K
Substring              17     33     49     68     85
Packed substr.         21     23     27     35     42
Insensitive            20     43     64     92    120
Packed insens.         26     29     32     38     47

A6. Two-dimensional packed-sortkey sorts

Number of lines:      100   1000    10K   100K
ST                    243    314    359    435
Index                 200    285    323    259

Appendix B: Explicit packed-default sorts

B1. Creating sortable string sortkeys

This is the preprocessing pass (the first map executed).

    @sorted = map ... => sort
        map SORTKEY($_) . $_ => @data;

To create the sortkey and the appended operand, any
combination of concatenation, interpolation, pack, or
sprintf may be used, the latter two with simple or
compound formats.

STRINGS

Fixed-length strings (ascending):

    simple interpolation
    pack with format 'A*' or 'An'
    sprintf with format '%s' or '%n.ns'

Using pack, shorter strings can be padded with trailing
null bytes ("\0") by a format like 'a5'.

Fixed-length strings (descending):

Bit-complement the string argument first.

    $string ^ "\xFF" x length $string
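
A minimal sketch of the descending case, assuming invented three-character codes as the data. The complemented copy of each string serves as the sortkey, and the original string is appended as the operand.

```perl
use strict;
use warnings;

# Hypothetical fixed-length (3-character) codes, to be sorted descending.
my @codes = qw(abc xyz mno abd);

my @sorted =
    map  { substr($_, 3) }                          # drop the 3-byte key
    sort                                            # default ascending sort
    map  { ($_ ^ "\xFF" x length $_) . $_ } @codes; # complemented key . operand

print "@sorted\n";
```

Because every byte of the key is inverted, the default ascending comparison on the keys yields the operands in descending order.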

Varying-length strings

Null bytes are used to terminate string subkeys of
varying length, as that ensures lexicographic ordering.
If a string subkey may contain a null byte, then it must
be of fixed length. If any of the operands to be sorted
may contain null bytes, then every subkey must have
fixed length.

Varying-length strings (ascending):

Terminate the string with a null byte, to separate it from
succeeding subkeys or the operand.

    interpolation: "$string\0"
    pack with format 'A* x' or 'An x'
    sprintf with format "%s\0" or "%n.ns\0"
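
The effect of the null terminator can be seen in a small sketch. The "name:value" record format and the sample data are assumptions made for illustration, and the records are assumed to contain no null bytes.

```perl
use strict;
use warnings;

# Hypothetical records: a varying-length name, a colon, then a value.
my @records = ("smith:3", "smithson:1", "smit:2");

my @sorted =
    map  { substr($_, 1 + rindex $_, "\0") }          # text after the last null
    sort
    map  { my ($name) = split /:/; "$name\0$_" } @records;

print "$_\n" for @sorted;
```

Since "\0" compares lower than any other byte, "smit\0" sorts before "smith\0", which in turn sorts before "smithson\0", so a shorter name always precedes the longer names it prefixes.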

Varying-length strings (descending):

Make a prepass over the data to find the length of the
longest string.

    my $len = 0;
    $len < length and $len = length
        for map SUBKEY($_) => @data;

Subtler and somewhat faster:

    my $len = "";
    $len ^= $_
        for map SUBKEY($_) => @data;
    $len = length $len;

Then bit-complement each string to that length (which
automatically null-pads the shorter strings first).

    $string ^ "\xFF" x $len
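
Putting the prepass and the complement together, a minimal sketch with invented words as the data might look like this.

```perl
use strict;
use warnings;

# Hypothetical varying-length strings, to be sorted in descending order.
my @words = qw(fig apple pear figs banana);

# XOR-accumulating the strings yields a string as long as the longest one.
my $len = "";
$len ^= $_ for @words;
$len = length $len;

my @sorted =
    map  { substr($_, $len) }                  # keys all have length $len
    sort
    map  { ($_ ^ "\xFF" x $len) . $_ } @words; # complement, padded to $len

print "@sorted\n";
```

The padding bytes of a shorter string become "\xFF" after the complement, so "figs" sorts ahead of "fig", as a descending order requires.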

Case-insensitive strings:

    interpolation: "\L$string\E"
    pack or sprintf: lc $string

INTEGERS

Descending:

Negate the value, then use the appropriate one of the
formulas below.

Unsigned 32-bit longs:

    pack with format 'N'. (Preferred: only 4 bytes; big-endian.)
    sprintf with format '%.10u' or '%.8x'. (Readable, but longer.)

Signed two’s-complement 32-bit longs:

Bias to unsigned by xoring the sign bit, then treat as
unsigned.

    $number ^ (1 << 31)
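
A minimal sketch of the signed 32-bit case with invented numbers. The mask with 0xFFFFFFFF is an addition made here, not part of the formula above: it is an assumption intended to keep the biased value within 32 bits on a Perl built with 64-bit integers.

```perl
use strict;
use warnings;

# Hypothetical signed values that all fit in 32 bits.
my @numbers = (7, -3, 0, -2_000_000_000, 42);

my @sorted =
    map  { substr($_, 4) }                     # drop the 4-byte key
    sort
    map  {
        # Flip the sign bit so negatives sort below non-negatives, keep
        # the low 32 bits (assumption, see above), and pack big-endian.
        pack('N', ($_ ^ (1 << 31)) & 0xFFFFFFFF) . $_
    } @numbers;

print "@sorted\n";
```

The operand appended here is the decimal text of each number, so the retrieval step returns the numbers as strings.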

Unsigned 16-bit shorts:

    pack with format 'n'. (Preferred: only 2 bytes; big-endian.)
    sprintf with format '%.5hu' or '%.4hx'. (Readable, but longer.)

Signed two’s-complement 16-bit shorts:

Bias to unsigned by xoring the sign bit, then treat as
unsigned.

    $number ^ (1 << 15)

Unsigned chars:

    pack with format 'C'
    sprintf with format '%c'

Signed chars:

    pack with format 'c'

FLOATING-POINT NUMBERS

Descending:

Negate the value, then use the formula below.

Ascending:

This code assumes that floating-point numbers are
represented in binary using IEEE format. Create a
subroutine that packs a double in network order
(big-endian). Floats can be handled analogously.

    BEGIN {
        my $big_endian =
            pack('N', 1) eq
            pack('L', 1);

        sub double_sort ($) {
            ($big_endian ?
                pack 'd', $_[0] :
                reverse pack 'd', $_[0]) ^
            ($_[0] < 0 ? "\xFF" x 8 :
                "\x80" . "\x00" x 7)
        }
    }
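
The same technique can be exercised end to end as follows. This is a sketch with invented numbers and a helper named double_key (a hypothetical name); like the code above, it assumes IEEE doubles whose byte order matches the machine's integer byte order.

```perl
use strict;
use warnings;

# True if this machine stores 32-bit integers in big-endian order.
my $big_endian = pack('N', 1) eq pack('L', 1);

# Hypothetical helper: return an 8-byte string whose lexicographic order
# matches the numeric order of the double passed in.
sub double_key {
    my ($x) = @_;
    my $bytes = pack 'd', $x;
    $bytes = reverse $bytes unless $big_endian;   # scalar reverse: byte swap
    # Negative values: invert every bit. Others: flip only the sign bit.
    return $bytes ^ ($x < 0 ? "\xFF" x 8 : "\x80" . "\x00" x 7);
}

my @numbers = (3.25, -1.5, 0, -100, 2);

my @sorted =
    map  { substr($_, 8) }              # drop the 8-byte key
    sort
    map  { double_key($_) . $_ } @numbers;

print "@sorted\n";
```

Inverting all bits of a negative value reverses the order among negatives, where a larger magnitude means a smaller number, while flipping the sign bit places every non-negative value above them.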

B2. Extracting the operands from the sorted strings

This is the postprocessing pass (the second map
executed).

    @sorted =
        map RETRIEVE($_) =>
        sort map ... => @data;

If all the subkeys have known length, use the total
length:

Preferred for efficiency:

    @sorted =
        map substr($_, $length) =>
        ...

TMTOWTDI:

    @sorted =
        map unpack("x$length A*",
            $_) => ...

If any of the subkeys has varying length, make sure that
the last character in the complete packed sortkey is a
null byte, then search for it from the right:

Preferred for efficiency:

    @sorted = map substr($_,
        1 + rindex $_, "\0") => ...

TMTOWTDI:

    @sorted =
        map +(split /\0/)[-1] => ...

    @sorted = map /([^\0]+)$/ => ...
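
A complete round trip with one varying-length subkey and one fixed-length binary subkey can be sketched as below. The "name age" record format and the data are invented, and the records are assumed to contain no null bytes.

```perl
use strict;
use warnings;

# Hypothetical records: a name and an age, separated by whitespace.
my @records = ("li 256", "lee 41", "li 30", "lee 7");

my @sorted =
    map  { substr($_, 1 + rindex $_, "\0") }    # operand follows the last null
    sort
    map  {
        my ($name, $age) = split ' ';
        # Null-terminated name, 2-byte big-endian age, then a final null
        # so that the sortkey always ends in a null byte.
        "$name\0" . pack('n', $age) . "\0" . $_;
    } @records;

print "$_\n" for @sorted;
```

The packed age of 256 itself contains a null byte, which is why the search runs from the right: the last null in the string is always the one that ends the sortkey.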

Appendix C: The Sort::Records module

This is the Pod from the Sort::Records module. It is not
complete yet. Some features are not fully supported,
and the module needs more testing.

NAME

Sort::Records.pm - Efficient Multi-Key Sort of Records
or Strings

SYNOPSIS

    use Sort::Records;

    # Sort /etc/passwd by user name.
    $sort = Sort::Records->
        new([width => 10,
             split => [':', 0]]);
    @pw = $sort->sort(`cat /etc/passwd`);

    # Sort /etc/passwd by user ID.
    $sort = Sort::Records->
        new([type => 'int',
             split => [':', 2]]);
    @pw = $sort->sort(`cat /etc/passwd`);

DESCRIPTION

The Sort::Records module is designed to sort Perl
records that have multiple subkeys, with very high
efficiency. It can handle lists of strings or hash or array
references. Each subkey can be designated to be any of
the Perl data types string, integer (char, short or long),
float or double. Subkeys can be sorted in ascending or
descending order, and string subkeys can be
case-insensitive. Special-case subkeys such as IP addresses
are supported.

The module performs the extraction and conversion of
the subkeys according to the description parameters
passed to the new method that created the sort object.
That sort object can be passed records to sort, and the
sorted records can be retrieved. The sort object can be
cleared and reused or cloned, which saves the overhead
of creating a new object.

The sortkey processing technique used is different from
other well-known sort tricks such as the Schwartzian
Transform and the Orcish Maneuver. In this technique,
subkeys are extracted and packed into a single string;
then the record or its index is appended. This string is
then sorted using the default lexicographic comparison
of the sort function. This gains efficiency for two main
reasons: the sortkey extraction and processing is faster
than the other methods; and the sort callback is
eliminated.

The sortkey extraction is done by a generated
subroutine whose logic is determined by the subkey
descriptions. The subroutine is built only once in the
new method, and can be used multiple times and cloned.
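
The idea of building the key subroutine once from a description can be sketched with a closure. This is purely illustrative: make_key_sub is a hypothetical helper, not part of Sort::Records, it handles a single split-based subkey only, and the module's actual code generation may work quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the module's API: build a key-extraction closure
# for one subkey taken from a delimited field of a string record.
sub make_key_sub {
    my (%desc) = @_;
    my ($delim, $index) = @{ $desc{split} };
    my $pattern = quotemeta $delim;
    my $is_int  = ($desc{type} || 'string') eq 'int';
    my $width   = $desc{width} || 10;

    return sub {
        my ($record) = @_;
        my $field = (split /$pattern/, $record)[$index];
        my $key = $is_int ? pack('N', $field)           # 4-byte big-endian
                          : pack("A$width", $field);    # space-padded string
        return $key . $record;                          # sortkey . operand
    };
}

# Invented passwd-style lines.
my @passwd = (
    "root:x:0:0:root:/root:/bin/sh",
    "uri:x:1001:100:Uri:/home/uri:/bin/sh",
    "daemon:x:2:2:daemon:/:/bin/false",
);

# Build the extraction sub once, then reuse it for every record.
my $by_uid = make_key_sub(type => 'int', split => [':', 2]);
my @sorted = map { substr($_, 4) } sort map { $by_uid->($_) } @passwd;

print "$_\n" for @sorted;
```

All decisions about the subkey type and position are made when the closure is built, so the per-record work is limited to the extraction and packing.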

NOTE: As of June 23, 1999 when I am writing this Pod,
the interface to this module is not stable. The code is
written but needs more testing and benchmarking. Some
features documented here are not yet implemented.
Assume that this module is very alpha.

new

This method takes parameters that describe the records
and how to sort them, and returns a sort object.

Sort attributes and defaults

The first arguments must be name/value pairs that are
sort attributes or default values for sortkey attributes.
The following are the supported sort attributes and their
allowable values:

record_type => 'string' | 'array' | 'hash'

This defines the record type, which is needed to
generate code for this sort. This attribute defaults to
'string'.

stable => 1

This optional attribute makes the sort stable (the relative
ordering of records that sort together remains
unchanged). It is useful for string records only, as hash
and array sorts using this module are always stable.

width => <integer>

This sets the default width for all string subkeys. It can
be overridden in each subkey description.

case => 1

This sets the default case sensitivity attribute to true for
all subkeys. It can be overridden in each subkey
description.

descend => 1

This sets the default sort-descending attribute to true for
all subkeys.

Subkey attributes

After the sort attributes come the subkey descriptions.
Each is a reference to an array that contains name/value
pairs for any optional subkey attributes and is followed
by name/value pairs for the subkey extraction
operations. There must be at least one subkey
description, and each description must have at least one
subkey extraction operation.

These are the subkey-description attribute names and
their allowed values:

type => 'string' | 'byte' | 'short' |
        'integer' | 'float' | 'double' |
        'IP'

This sets the type of the subkey. The subkey extraction
operations for this subkey must produce a value that can
be converted by pack into this type. The default is
'string'. The 'IP' type is special and requires the last
extraction operation (see below) to be 'IP' or 'IP_regex'.

width => <integer>

This sets the width for a string subkey. If the last string
extraction operation is a substr, its width will be used
and this is not needed (not supported yet).

case => 0 | 1

If true, this subkey is case-insensitive, so upper- and
lower-case letters will sort together.

descend => 0 | 1

If true, this subkey will sort in descending order.

Subkey extraction operations

These are the name/value pairs that describe the subkey
extraction operations. Most just map to a standard Perl
operation, so they should be easy to understand. If the
record type is 'array' or 'hash', the first extraction
operation for each subkey must be of that same type.
Also all dereferencing operations must precede all other
operations. If the record type is a string, then no
dereferencing operations are allowed. Subkey extraction
operations are performed in the order seen in the
anonymous array. See the examples for more on this.
The value for an extraction operator is an anonymous
list of values. If there is only one value, it may be used
as is.

array => <index>

The value is the index into the array to get the subkey
data. The previous extraction (or the record itself) must
be an array reference.

hash => <key>

The value is the key into the hash to get the subkey data.
The previous extraction (or the record itself) must be a
hash reference.

substr => [<offset>, <length>]

The values are the second and third arguments to a
substr call. The previous extraction (or the record itself)
must be a string.

split => <regex>, <index> [, <save_name>]

The values are the regular expression for a split, and the
index of which split element you want. The optional
third value is a name to be used later in a 'saved'
extraction operation (see below). The previous
extraction (or the record itself) must be a string.

regex => <regex>, <index> [, <save_name>]

The values are the regular expression with one or more
grouping parentheses, and the index of which grouped
element you want. The optional third value is a name to
be used later in a 'saved' extraction operation (see
below). The previous extraction (or the record itself)
must be a string.

IP_regex => 0

This takes no arguments, but a dummy one is needed to
keep the pairing. It uses a regex to extract the four parts
of an IP address. It doesn’t do any complex checks; it
just looks for a dotted-quad IP string. This extraction
must be the last and the subkey type must be 'IP' as it
expects the results from this extraction.

saved => <save_name>, <index>

The split and regex extraction operations can optionally
save all their values in a named array. The saved
operation is used to get a key value from that array. This
saves the code needed to run the split or regex again.
The first value is the name used as the last value of the
split or regex operation. The second value is the index
of the element from that operation.

A split or regex operation with the same save name
must be performed before a saved operation with that
name.

records

This method is used to pass records to the sort object. It
processes its arguments using the sort object key
subroutine. It can be called multiple times, and it pushes
all the processed records into the object.

sort

This method does the actual sort and stores the sorted
keys in the object. It returns the value of a call to the
results method, so you get the sorted records. If sort is
passed any arguments, they are taken as all the input
records, so a call to clear and records is made with the
data.

results

This method processes the sorted keys and returns the
sorted records. If called in a list context, it returns the
list of sorted records. In a scalar context, it returns
a reference to the list of sorted records.

clear

This method clears all data in the sort object. It does not
modify the key subroutine, so the object can be used for
more sorting.

AUTHOR

Uri Guttman <uri@sysarch.com>, using ideas from
Larry Rosler <lr@hpl.hp.com>