Some Random Theorems
I've been requiring that just about everything be a permutation, not just a random mapping. Permutations always map n things to n things. Random mappings map them to m things, with m <= n.
The Birthday Paradox
Theorem BIRTHDAY PARADOX: Applying a random mapping f : S -> S to \sqrt{2S} values is expected to produce about 1 collision.
Proof: Let c^k be a list of k choices, where each choice comes from S, and let C^k be the set of all such lists.
By definition, if there is a collision in c^k in C^k, there will be an i,j in 0..k-1, i < j, such that c^k_i = c^k_j.
Given i,j in 0..k-1, i < j, every value in (SxS) is equally likely to be the value of (c^k_i, c^k_j). This follows from C^k being SxSxSx...xS.
There are S*S possible pairs in SxS, S of which have the same first and second term, so 1/S of all possible pairs have the same first and second term.
Each c in C^{\sqrt{2S}} has \sqrt{2S}(\sqrt{2S}-1)/2 = S - \sqrt{S/2} pairs (c_i, c_j) where i < j.
Since every possible pair is equally likely, the expected number of pairs (c_i, c_j) in c with i < j and c_i = c_j is (S - \sqrt{S/2})/S, which is about 1.
Therefore a set of sqrt(2S) independent choices from S will have about 1 collision.
QED birthday
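As a rough empirical check of the theorem, the sketch below draws about \sqrt{2S} values uniformly from a set of size S and averages the number of colliding pairs over many trials. The function name, the seed, and the trial count are illustrative assumptions, not part of the original argument.

```python
import math
import random

def average_colliding_pairs(size, trials=2000, seed=1):
    """Average number of pairs (i < j) with equal values among
    about sqrt(2 * size) uniform random draws from range(size)."""
    rng = random.Random(seed)
    k = round(math.sqrt(2 * size))
    total = 0
    for _ in range(trials):
        counts = {}
        for _ in range(k):
            v = rng.randrange(size)
            counts[v] = counts.get(v, 0) + 1
        # A value drawn c times contributes c*(c-1)/2 colliding pairs.
        total += sum(c * (c - 1) // 2 for c in counts.values())
    return total / trials

if __name__ == "__main__":
    # The theorem predicts an average close to 1.
    print(average_colliding_pairs(10000))
```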
Distribution of items in buckets
[Chuck Blake pointed out that hash functions produce a binomial distribution in each hash bucket, which in the limit is a Poisson distribution. That is, if there are n buckets and m items hashed into them, and a=m/n, then the probability of a given bucket having k items in it will be about (a^k e^{-a})/k!, so the total number of buckets with k items will be about (n a^k e^{-a})/k!.]
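The Poisson estimate can be compared against a simulated uniform hash. This is a minimal sketch: `random.randrange` stands in for the hash function, and the function names and seed are assumptions made for illustration.

```python
import math
import random

def poisson_bucket_counts(n, m, max_k):
    """Predicted number of buckets holding k items, for k = 0..max_k,
    using n * a^k * e^(-a) / k! with a = m / n."""
    a = m / n
    return [n * a**k * math.exp(-a) / math.factorial(k)
            for k in range(max_k + 1)]

def simulated_bucket_counts(n, m, max_k, seed=1):
    """Throw m items into n buckets uniformly at random and count
    how many buckets end up with k items, for k = 0..max_k."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(m):
        loads[rng.randrange(n)] += 1
    counts = [0] * (max_k + 1)
    for load in loads:
        if load <= max_k:
            counts[load] += 1
    return counts

if __name__ == "__main__":
    predicted = poisson_bucket_counts(10000, 10000, 5)
    observed = simulated_bucket_counts(10000, 10000, 5)
    for k, (p, o) in enumerate(zip(predicted, observed)):
        print(k, round(p, 1), o)
```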
We want the variable-length hash to be an equidistributed function, that is, it should map an equal number of keys to each output. The only equidistributed mixing functions are the permutations of the internal state S. If the mixing function m is not a permutation, the final mixing steps will guarantee that some hash values are not possible, so the results will not be equidistributed. Similarly, if the combining function g: B \times S -> S is not equidistributed for the set of all input blocks B given a fixed internal state, the hash will not be equidistributed even for one-bucket keys. What if g is not a permutation of S when the input block is fixed?
Let a hash function be given where, for every block in B, the combining step is an arbitrary mapping from S to S. Then, for a fixed key, the hash is effectively a sequence of arbitrary mappings applied to the initial internal state. (The function proposed in \cite{MERKLE} seems to be an example of such a hash.) If we start with all possible internal states, how many are left after y blocks have been hashed? The answer is about 2S/y, to be shown by theorem loss.
To show this, let's start with a simpler question: if we apply a random mapping to x out of n states, how many distinct results will there be?
Theorem HALF: Define q = n/x. Applying a random mapping to x out of n possible states is expected to produce about n/(q+(1/2)+o(1/q)) distinct results if q < \sqrt{n}.
Proof: First, find the straightforward formula for how many distinct results there will be.
  1/n                     chance of a given input hitting a given result
  (n-1)/n                 chance of a given input not hitting a given result
  ((n-1)/n)^x             chance of all x inputs not hitting a given result
  n((n-1)/n)^x            number of results not hit at all by x inputs
  n(1 - ((n-1)/n)^x)      number of distinct results hit by some input
Next, look at the series expansion of 1 - ((n-1)/n)^x:
  1 - ((n-1)/n)^x
  = 1 - ((n-1)^x)/(n^x)
  = 1 - (n^x)/(n^x) + x(n^{x-1})/(n^x) - x(x-1)(n^{x-2})/(2n^x) + ...
  = 1 - 1 + x/n - (x^2 - x)/(2n^2) + o((x^3)/(n^3))
  = x/n - (x^2)/(2n^2) + x/(2n^2) + o((x^3)/(n^3))
  = x/n - (x^2)/(2n^2) + o((x^3)/(n^3))    (assuming n > (n^2)/(x^2))
  = 1/q - 1/(2q^2) + o(1/(q^3))
Note that 1/(q+(1/2)+o(1/q)) is also 1/q - 1/(2q^2) + o(1/(q^3)), so
  1 - ((n-1)/n)^x = 1/((n/x)+(1/2)+o(x/n))
QED half
For example, n/10 states will map to about n/10.5 states.
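The n/10 example can be checked numerically. The sketch below gives x distinct states independent uniform images in a set of n states and compares the number of distinct images with n/(q+1/2); the function names and seed are assumptions chosen for illustration.

```python
import random

def distinct_images(n, x, seed=1):
    """Number of distinct results when a random mapping on n states
    is applied to x distinct states (each image is uniform on range(n))."""
    rng = random.Random(seed)
    return len({rng.randrange(n) for _ in range(x)})

def predicted_images(n, x):
    """Estimate n / (q + 1/2) with q = n / x, dropping the o(1/q) term."""
    q = n / x
    return n / (q + 0.5)

if __name__ == "__main__":
    n = 100000
    x = n // 10
    print(distinct_images(n, x), predicted_images(n, x))
```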
Corollary BUCKETS: Hashing b/q keys into b buckets is expected to hit about b/(q+(1/2)) buckets, if q < \sqrt{b}.
OK. Now we can look at the original question, applying a sequence of y random mappings to all n possible states.
Theorem LOSS: After applying y random mappings g_i : S -> S to all elements of S, the number of elements remaining will be about (2S)/y when y < \sqrt{n}.
The exact formula for x_i, the number of states after i mappings, is:
  x_0 = n
  x_{i+1} = n(1 - ((n-1)/n)^{x_i})
Hypothesis: after i mappings we are left with (2n)/(i+o(\log i)) states. The next mapping will leave us this many states:
  n(1 - ((n-1)/n)^{x_i})
  = n/{(n/x_i) + 1/2 + o(x_i/n)}    (by half)
  = n/{n/(2n/(i+o(\log i))) + 1/2 + o({2n/(i+o(\log i))}/n)}    (by hypothesis)
  = n/{((i+o(\log i))/2) + 1/2 + o(2/(i+o(\log i)))}
  = 2n/{i + o(\log i) + 1 + o(4/i)}
  = 2n/{(i+1) + o(\log (i+1))}    (since \sum_{i=1}^{n} {4/i} is o(\log n))
The base case is satisfied because o(\log i) can be any constant, in reality 2 for i=0. By induction, the number of states left after y mappings is (2n)/(y+o(\log y)). This sequence converges to the sequence 2n/y.
QED loss
Experiments with n=65536 agree with both of these results.
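The recurrence for x_i can be iterated directly and compared with 2n/y. This is only a sketch that evaluates the expected-value recurrence in floating point; it is not a reproduction of the experiments mentioned above, and the function name is an assumption.

```python
def states_after(n, y):
    """Iterate x_0 = n, x_{i+1} = n * (1 - ((n-1)/n)**x_i) for y steps
    and return x_y, the expected number of states left."""
    x = float(n)
    for _ in range(y):
        x = n * (1.0 - ((n - 1) / n) ** x)
    return x

if __name__ == "__main__":
    n = 65536
    for y in (10, 50, 100, 200):
        # Print y, the iterated value, and the estimate 2n/y.
        print(y, states_after(n, y), 2 * n / y)
```

The iterated value sits a little below 2n/y, which is consistent with the o(\log y) term in the denominator.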
Using theorem loss we can see that if the combining function does not act as a permutation on S, changing the ith out of q blocks is expected to allow only \frac{2}{q-i} of all possible final states to be produced. If S = 2^8 (as in \cite{CACMhash}, which does not have this problem), such a loss would be disastrous. But when S = 2^{128} (as in \cite{MERKLE}) this loss may be negligible.
The birthday paradox: Applying a random mapping f : S -> S to \sqrt{2S} values is expected to produce about 1 collision.
A common question in random number generators can be answered as a corollary of the theorems above. If a random number generator is based on a random mapping f : S -> S, what is its expected cycle length? The answer is about \sqrt{S}.
Theorem SQRT: The median number of distinct values in the sequence f^i(x), where f : S -> S is a random mapping and x is an element in S, is between \sqrt{S} and \sqrt{2S}.
By theorem birthday, applying f to \sqrt{S} arbitrary values in S should yield about \frac{S}{\sqrt{S}+1/2+o(1/\sqrt{S})} distinct values. That is, there should be about 1/2 a collision. Similarly, mapping a subset of \sqrt{2S} values is expected to cause about 1 collision. (Theorem half will produce the same results.)
If no subset had more than one collision, then an expectation of 1/2 a collision would imply every subset had a 1/2 chance of at least one collision. Some subsets have more than one collision, though, so the chance of having at least one collision is less than 1/2 when the subset has \sqrt{S} values.
On the other hand, a set of \sqrt{2S} values is expected to produce 1 collision. These values could be chosen one at a time. If a collision does occur, there should be no more than 1 collision expected in the remaining values. If we solve for x in the equation \sum_{i=0}^{\sqrt{2S}} {i x^{i+1}} = 1, then x is a lower bound on the fraction of subsets with at least one collision. Since the solution is x = 1/2, the probability of having at least one collision is more than 1/2 when f maps a set of \sqrt{2S} values.
Until one collision occurs, the sequence f^i(x) is a list of arbitrary values chosen from S. According to the bounds above, over half the choices of (f, x) will go \sqrt{S} values without a collision, and less than half will go \sqrt{2S} values without a collision, so the median must be somewhere in between.
QED sqrt
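The bounds in theorem sqrt can be probed by simulation. The sketch below follows x, f(x), f(f(x)), ... for a lazily generated random mapping and records how many distinct values appear before the first repeat. The set size, trial count, seed, and function names are assumptions picked to keep the run short.

```python
import math
import random
import statistics

def distinct_before_repeat(size, rng):
    """Number of distinct values in x, f(x), f(f(x)), ... for a random
    mapping f on range(size). Every value reached before the first
    repeat is new, so its image can be drawn fresh at each step."""
    seen = set()
    x = rng.randrange(size)
    while x not in seen:
        seen.add(x)
        x = rng.randrange(size)  # f(x) for a state not visited before
    return len(seen)

def median_distinct(size, trials=2001, seed=1):
    rng = random.Random(seed)
    return statistics.median(
        distinct_before_repeat(size, rng) for _ in range(trials))

if __name__ == "__main__":
    size = 10000
    print(math.sqrt(size), median_distinct(size), math.sqrt(2 * size))
```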
Hashing
Lecture 21
Steven S. Skiena
Hashing
One way to convert form names to integers is to use the letters to form a base ``alphabet-size'' number system:
To convert ``STEVE'' to a number, observe that e is the 5th letter of the alphabet, s is the 19th letter, t is the 20th letter, and v is the 22nd letter.
Thus ``Steve''
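The conversion can be sketched in a few lines, assuming a 26-letter alphabet with a = 1 through z = 26 and treating the leftmost letter as the most significant digit. The function name `name_to_int` is an illustrative choice, and the input is assumed to contain letters only.

```python
def name_to_int(name, base=26):
    """Interpret a name as a number in base `base`, with a=1 ... z=26.
    Assumes the name contains only the letters a-z (any case)."""
    value = 0
    for ch in name.lower():
        digit = ord(ch) - ord("a") + 1
        value = value * base + digit
    return value

if __name__ == "__main__":
    # 19*26^4 + 20*26^3 + 5*26^2 + 22*26 + 5
    print(name_to_int("STEVE"))
```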
Thus one way we could represent a table of names would be to set aside an array big enough to contain one element for each possible string of letters, then store data in the elements corresponding to real people. Computing this function tells us immediately where the person's phone number is!
What's the Problem?
Because we must leave room for every possible string, this method will use an incredible amount of memory. We need a data structure to represent a sparse table, one where almost all entries will be empty.
We can reduce the number of boxes we need if we are willing to put more than one thing in the same box!
Example: suppose we use the base alphabet number system, then take the remainder
Now the table is much smaller, but we need a way to deal with the fact that more than one (but hopefully very few) keys can get mapped to the same array element.
The Basics of Hashing
The basic idea of hashing is to apply a function to the search key so we can determine where the item is without looking at the other items.
To make the table of reasonable size, we must allow for collisions, two distinct keys mapped to the same location.
We use a special hash function to map keys (hopefully uniformly) to integers in a certain range.
We set up an array as big as this range, and use the value of the function as the index to store the appropriate key. Special care must be taken to handle collisions when they occur.
There are several clever techniques we will see to develop good hash functions and deal with the problems of duplicates.
Hash Functions
The verb ``hash'' means ``to mix up'', and so we seek a function
to mix up keys as well as possible.
The best possible hash function would hash m keys into n ``buckets'' with no more than \lceil m/n \rceil keys per bucket. Such a function is called a perfect hash function.
How can we build a hash function?
Let us consider hashing character strings to integers. The ORD function returns the character code associated with a given character. By using the ``base character size'' number system, we can map each string to an integer.
The First Three SSN digits Hash
The first three digits of the Social Security Number
The last three digits of the Social Security Number
What is the big picture?
1. A hash function which maps an arbitrary key to an integer turns searching into array access, hence O(1).
2. To use a finite sized array means two different keys will be mapped to the same place. Thus we must have some way to handle collisions.
3. A good hash function must spread the keys uniformly, or else we have a linear search.
Ideas for Hash Functions
Truncation - When grades are posted, the last four digits of your SSN are used, because they distribute students more uniformly than the first four digits.
Folding - We should get a better spread by factoring in the entire key. Maybe subtract the last four digits from the first five digits of the SSN, and take the absolute value?
Modular Arithmetic - When constructing pseudorandom numbers, a good trick for uniform distribution was to take a big number mod the size of our range. Because of our roulette wheel analogy, the numbers tend to get spread well if the tablesize is selected carefully.
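The three ideas can be sketched as small functions on a nine-digit SSN given as a string of digits. The function names, the sample SSN, and the table size 1009 are made-up values for illustration only.

```python
def truncation_hash(ssn):
    """Truncation: keep only the last four digits."""
    return int(ssn[-4:])

def folding_hash(ssn):
    """Folding: |first five digits - last four digits|."""
    return abs(int(ssn[:5]) - int(ssn[-4:]))

def modular_hash(ssn, table_size):
    """Modular arithmetic: the whole number mod the table size."""
    return int(ssn) % table_size

if __name__ == "__main__":
    ssn = "123456789"  # hypothetical SSN, digits only
    print(truncation_hash(ssn), folding_hash(ssn), modular_hash(ssn, 1009))
```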
Prime Numbers are Good Things
Suppose we wanted to hash check totals by the dollar value in pennies mod 1000. What happens?
Prices tend to be clumped by similar last digits, so we get clustering.
If we instead use a prime modulus like 1009 (note that 1007 = 19 x 53 is not prime), these clusters will get broken.
In general, it is a good idea to use a prime modulus for hash table size, since it is less likely the data will be multiples of large primes as opposed to small primes - all multiples of 4 get mapped to even numbers in an even sized hash table!
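The clustering effect is easy to see on made-up data. The sketch below hashes fifty hypothetical check totals that all end in 99 cents, first mod 1000 and then mod the prime 1009, and counts how many distinct buckets are used. The prices and the choice of 1009 are assumptions for this example.

```python
def buckets_used(values, modulus):
    """Number of distinct buckets hit when hashing values mod modulus."""
    return len({v % modulus for v in values})

# Hypothetical totals in pennies: $1.99, $2.99, ..., $50.99
prices = [dollars * 100 + 99 for dollars in range(1, 51)]

if __name__ == "__main__":
    print(buckets_used(prices, 1000))  # clustered: few buckets
    print(buckets_used(prices, 1009))  # prime modulus: clusters broken
```

With modulus 1000 only the ten buckets ending in 99 are ever used, while with 1009 all fifty totals land in different buckets.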
The Birthday Paradox
No matter how good our hash function is, we had better be prepared for collisions, because of the birthday paradox.
Assuming 365 days a year, what is the probability that two people share a birthday? Once the first person has fixed their birthday, the second person has 364 possible days to be born to avoid a collision, or a 364/365 chance.
With three people, the probability that no two share is (364/365)(363/365). In general, the probability of there being no collisions after n insertions into an m-element table is \prod_{i=0}^{n-1} (m-i)/m.
When m = 366, this probability sinks below 1/2 when n = 23 and to almost 0 as n grows further.
The moral is that collisions are common, even with good hash functions.
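The no-collision probability is a short product and can be evaluated directly. This is a minimal sketch; the function name is an illustrative assumption.

```python
def prob_no_collision(n, m):
    """Probability that n uniform random insertions into an m-element
    table all land in different slots: product of (m - i) / m."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

if __name__ == "__main__":
    for n in (2, 3, 22, 23, 50, 100):
        print(n, prob_no_collision(n, 365))
```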
What about Collisions?
No matter how good our hash functions are, we must deal with collisions. What do we do when the spot in the table we need is occupied?
Put it somewhere else! - In open addressing, we have a rule to decide where to put it if the space is already occupied.
Keep a list at each bin! - At each spot in the hash table, keep a linked list of keys sharing this hash value, and do a sequential search to find the one we need. This method is called chaining.
Collision Resolution by Chaining
The easiest approach is to let each element in the hash table be a pointer to a list of keys.
Insertion, deletion, and query reduce to the problem in linked lists. If the n keys are distributed uniformly in a table of size m, each list holds about n/m keys, so each operation takes O(n/m) time.
Chaining is easy, but devotes a considerable amount of memory to pointers, which could be used to make the table larger. Still, it is my preferred method.
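A minimal sketch of chaining follows. It uses Python lists in place of linked lists and the built-in `hash` as the hash function; both are simplifying assumptions, and the class and method names are illustrative.

```python
class ChainedHashTable:
    """Hash table where each bucket holds a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self._bucket(key):   # sequential search of the chain
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False

if __name__ == "__main__":
    table = ChainedHashTable(num_buckets=4)
    table.insert("steve", "555-0100")   # made-up phone number
    print(table.search("steve"))
```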
Open Addressing
We can dispense with all these pointers by using an implicit reference derived from a simple function:
If the space we want to use is filled, we can examine the remaining locations:
1. Sequentially
2. Quadratically
3. Linearly
The reason for using a more complicated scheme is to avoid long runs from similarly hashed keys.
Deletion in an open addressing scheme is ugly, since removing one element can break a chain of insertions, making some elements inaccessible.
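The deletion problem can be illustrated with sequential (linear) probing. The sketch below marks removed slots with a "deleted" sentinel so that later keys in the same run stay reachable; this tombstone technique is one common workaround and is an assumption here, not something the text prescribes. Class and method names are illustrative.

```python
_EMPTY = object()     # slot never used
_DELETED = object()   # slot used once, then deleted (tombstone)

class LinearProbingTable:
    def __init__(self, capacity=8):
        self.slots = [_EMPTY] * capacity

    def _probe(self, key):
        """Yield slot indices starting at the home slot, wrapping around."""
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:
                if first_free is None:
                    first_free = i
                break                      # key cannot be further along
            if slot is _DELETED:
                if first_free is None:
                    first_free = i         # reusable, but keep looking
            elif slot == key:
                return                     # already present
        if first_free is None:
            raise OverflowError("table is full")
        self.slots[first_free] = key

    def contains(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:
                return False
            if slot is not _DELETED and slot == key:
                return True
        return False

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is _EMPTY:
                return False
            if slot is not _DELETED and slot == key:
                self.slots[i] = _DELETED   # leave a tombstone, not a hole
                return True
        return False
```

If `delete` reset the slot to empty instead, a search for a key stored further along the same run would stop early and wrongly report it missing.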
Performance on Set Operations
With either chaining or open addressing:
Search - O(1) expected, O(n) worst case.
Insert - O(1) expected, O(n) worst case.
Delete - O(1) expected, O(n) worst case.
Pragmatically, a hash table is often the best data structure to maintain a dictionary. However, the worst-case running time is unpredictable.
The best worst-case bounds on a dictionary come from balanced binary trees, such as red-black trees.
Data Structures and Algorithms
8.3 Hash Tables
8.3.1 Direct Address Tables
If we have a collection of n elements whose keys are unique integers in (1,m), where m >= n, then we can store the items in a direct address table, T[m], where T_i is either empty or contains one of the elements of our collection.
Searching a direct address table is clearly an O(1) operation: for a key, k, we access T_k, if it contains an element, return it, if it doesn't then return a NULL.
There are two constraints here:
1. the keys must be unique, and
2. the range of the key must be severely bounded.
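A direct address table is little more than an array indexed by key. The sketch below assumes integer keys in 1..m and uses `None` for an empty slot in place of NULL; the class and method names are illustrative.

```python
class DirectAddressTable:
    """Table T with one slot per possible key in 1..m."""

    def __init__(self, m):
        self.m = m
        self.slots = [None] * (m + 1)   # index 0 is unused

    def _check(self, key):
        if not 1 <= key <= self.m:
            raise KeyError(f"key {key} outside 1..{self.m}")

    def insert(self, key, element):
        self._check(key)
        self.slots[key] = element

    def search(self, key):
        """Return the element stored under key, or None if the slot is empty."""
        self._check(key)
        return self.slots[key]

if __name__ == "__main__":
    table = DirectAddressTable(m=100)
    table.insert(42, "element with key 42")
    print(table.search(42), table.search(7))
```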
If the keys are not unique, then we can simply construct a set of m lists and store the heads of these lists in the direct address table. The time to find an element matching an input key will still be O(1).
However, if each element of the collection has some other distinguishing feature (other than its key), and if the maximum number of duplicates is n_dup^max, then searching for a specific element is O(n_dup^max). If duplicates are the exception rather than the rule, then n_dup^max is much smaller than n and a direct address table will provide good performance. But if n_dup^max approaches n, then the time to find a specific element is O(n) and a tree structure will be more efficient.
The range of the key determines the size of the direct address table and may be too large to be practical. For instance it's not likely that you'll be able to use a direct address table to store elements which have arbitrary 32-bit integers as their keys for a few years yet!
Direct addressing is easily generalised to the case where there is a function, h(k) => (1,m), which maps each value of the key, k, to the range (1,m). In this case, we place the element in T[h(k)] rather than T[k] and we can search in O(1) time as before.
8.3.2 Mapping functions
The direct address approach requires that the function, h(k), is a one-to-one mapping from each k to integers in (1,m). Such a function is known as a perfect hashing function: it maps each key to a distinct integer within some manageable range and enables us to trivially build an O(1) search time table.
Unfortunately, finding a perfect hashing function is not always possible. Let's say that we can find a hash function, h(k), which maps most of the keys onto unique integers, but maps a small number of keys on to the same integer. If the number of collisions (cases where multiple keys map onto the same integer), is sufficiently small, then hash tables work quite well and give O(1) search times.
Handling the collisions
In the small number of cases where multiple keys map to the same integer, elements with different keys may be stored in the same "slot" of the hash table. It is clear that when the hash function is used to locate a potential match, it will be necessary to compare the key of that element with the search key. But there may be more than one element which should be stored in a single slot of the table. Various techniques are used to manage this problem:
1. chaining,
2. overflow areas,
3. re-hashing,
4. using neighbouring slots (linear probing),
5. quadratic probing,
6. random probing, ...
Chaining
One simple scheme is to chain all collisions in lists attached to the appropriate slot. This allows an unlimited number of collisions to be handled and doesn't require a priori knowledge of how many elements are contained in the collection. The tradeoff is the same as with linked lists versus array implementations of collections: linked list overhead in space and, to a lesser extent, in time.
Re-hashing
Re-hashing schemes use a second hashing operation when there is a collision. If there is a further collision, we re-hash until an empty "slot" in the table is found.
The re-hashing function can either be a new function or a re-application of the original one. As long as the functions are applied to a key in the same order, then a sought key can always be located.
Linear probing
One of the simplest re-hashing functions is +1 (or -1), i.e. on a collision, look in the neighbouring slot in the table. It calculates the new address extremely quickly and may be extremely efficient on a modern RISC processor due to efficient cache utilisation (cf. the discussion of linked list efficiency).
The animation gives you a practical demonstration of the effect of linear probing: it also implements a quadratic re-hash function so that you can compare the difference.
h(j)=h(k), so the next hash function, h1, is used. A second collision occurs, so h2 is used.
Clustering
Linear probing is subject to a clustering phenomenon. Re-hashes from one location occupy a block of slots in the table which "grows" towards slots to which other keys hash. This exacerbates the collision problem and the number of re-hashes can become large.
Quadratic Probing
Better behaviour is usually obtained with quadratic probing, where the secondary hash function depends on the re-hash index:
address = h(key) + c i^2
on the i-th re-hash. (A more complex function of i may also be used.) Since keys which are mapped to the same value by the primary hash function follow the same sequence of addresses, quadratic probing shows secondary clustering. However, secondary clustering is not nearly as severe as the clustering shown by linear probes.
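The difference between the two probe sequences can be seen by listing the addresses they visit. In this sketch `home` stands for h(key), addresses wrap around with a modulo (an assumption, since the formula above does not say how overflow past the table end is handled), and the function names are illustrative.

```python
def linear_probes(home, table_size, count):
    """Addresses home, home+1, home+2, ... (mod table_size)."""
    return [(home + i) % table_size for i in range(count)]

def quadratic_probes(home, table_size, count, c=1):
    """Addresses home + c*i^2 (mod table_size) for i = 0, 1, 2, ..."""
    return [(home + c * i * i) % table_size for i in range(count)]

if __name__ == "__main__":
    print(linear_probes(3, 11, 4))     # [3, 4, 5, 6]
    print(quadratic_probes(3, 11, 4))  # [3, 4, 7, 1]
```

Two keys with the same home slot still follow identical sequences in both schemes, which is the secondary clustering described above.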
Re-hashing schemes use the originally allocated table space and thus avoid linked list overhead, but require advance knowledge of the number of items to be stored.
However, the collision elements are stored in slots to which other key values map directly, thus the potential for multiple collisions increases as the table becomes full.
Overflow area
Another scheme will divide the pre-allocated table into two sections: the primary area to which keys are mapped and an area for collisions, normally termed the overflow area.
When a collision occurs, a slot in the overflow area is used for the new element and a link from the primary slot established as in a chained system. This is essentially the same as chaining, except that the overflow area is pre-allocated and thus possibly faster to access. As with re-hashing, the maximum number of elements must be known in advance, but in this case, two parameters must be estimated: the optimum size of the primary and overflow areas.
Of course, it is possible to design systems with multiple overflow tables, or with a mechanism for handling overflow out of the overflow area, which provide flexibility without losing the advantages of the overflow scheme.
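An overflow-area table can be sketched with two pre-allocated arrays and index links. The layout below (entries stored as `[key, next_index]`, overflow slots handed out in order, no duplicate check, no deletion) is an assumption made to keep the example short, and the names are illustrative.

```python
class OverflowTable:
    def __init__(self, primary_size, overflow_size):
        # Each used slot holds [key, index of next entry in overflow or None].
        self.primary = [None] * primary_size
        self.overflow = [None] * overflow_size
        self.next_free = 0               # next unused overflow slot

    def insert(self, key):
        home = hash(key) % len(self.primary)
        if self.primary[home] is None:
            self.primary[home] = [key, None]
            return
        if self.next_free >= len(self.overflow):
            raise OverflowError("overflow area is full")
        # Walk to the end of the chain that starts in the primary slot.
        entry = self.primary[home]
        while entry[1] is not None:
            entry = self.overflow[entry[1]]
        self.overflow[self.next_free] = [key, None]
        entry[1] = self.next_free        # link the new entry into the chain
        self.next_free += 1

    def contains(self, key):
        entry = self.primary[hash(key) % len(self.primary)]
        while entry is not None:
            if entry[0] == key:
                return True
            entry = self.overflow[entry[1]] if entry[1] is not None else None
        return False
```

The two constructor arguments correspond to the two parameters that must be estimated in advance.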
Summary: Hash Table Organization
Organization    Advantages                                  Disadvantages
Chaining        Unlimited number of elements                Overhead of multiple linked lists
                Unlimited number of collisions
Re-hashing      Fast re-hashing                             Maximum number of elements must be known
                Fast access through use of main             Multiple collisions may become probable
                table space
Overflow area   Fast access                                 Two parameters which govern performance
                Collisions don't use primary table space    need to be estimated