Some Random Theorems


I've been requiring that just about everything be a permutation, not just a random mapping. Permutations always map n things to n things. Random mappings map n things onto m distinct things, with m <= n.

The Birthday Paradox

Theorem BIRTHDAY PARADOX: Applying a random mapping f : S -> S to \sqrt{2|S|} values is expected to produce about 1 collision.

Proof: Let c_k be a list of k choices, where each choice comes from S, and let C_k be the set of all such lists.

By definition, if there is a collision in c_k in C_k, there will be i, j in 0..k-1, i < j, such that c_{k,i} = c_{k,j}.

Given i, j in 0..k-1, i < j, every value in S x S is equally likely to be the value of (c_{k,i}, c_{k,j}). This follows from C_k being S x S x ... x S.

There are |S|*|S| possible pairs in S x S, |S| of which have the same first and second term, so 1/|S| of all possible pairs have the same first and second term.

Each c in C_{\sqrt{2|S|}} has \sqrt{2|S|}(\sqrt{2|S|}-1)/2 = |S| - \sqrt{|S|/2} pairs (c_i, c_j) where i < j.

Since every possible pair is equally likely, the expected number of pairs (c_i, c_j) with i < j and c_i = c_j is (|S| - \sqrt{|S|/2})/|S|, which is about 1.

Therefore a set of sqrt(2|S|) independent choices from S will have about 1 collision.

QED birthday
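This is easy to check empirically. A minimal simulation sketch (modeling S as {0..n-1} with n = 65536, and using Python's random module, are my choices, not from the text): draw \sqrt{2n} values uniformly and count colliding pairs.

    import random

    def expected_collisions(n, trials=1000):
        """Draw sqrt(2n) values uniformly from n possibilities and
        count pairs (i, j), i < j, with equal values."""
        k = int((2 * n) ** 0.5)
        total = 0
        for _ in range(trials):
            choices = [random.randrange(n) for _ in range(k)]
            seen = {}
            for c in choices:
                seen[c] = seen.get(c, 0) + 1
            # a value drawn t times contributes t*(t-1)/2 colliding pairs
            total += sum(t * (t - 1) // 2 for t in seen.values())
        return total / trials

    print(expected_collisions(65536))   # prints roughly 1.0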

Distribution of items in buckets

[Chuck Blake pointed out that hash functions produce a binomial distribution in each hash bucket, which in the limit is a Poisson distribution. That is, if there are n buckets and m items hashed into them, and a = m/n, then the probability of a given bucket having k items in it will be about (a^k e^{-a})/k!, so the total number of buckets with k items will be about (n a^k e^{-a})/k!.]
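A quick numeric check of this note (a sketch; the values n = 10000 and m = 20000 are arbitrary choices of mine): compare the observed number of buckets holding k items against the Poisson prediction (n a^k e^{-a})/k!.

    import math, random
    from collections import Counter

    n, m = 10000, 20000                # buckets, items; a = m/n = 2
    a = m / n
    counts = Counter(random.randrange(n) for _ in range(m))  # bucket loads
    load = Counter(counts.values())    # how many buckets hold k items
    load[0] = n - len(counts)          # buckets that no item hit
    for k in range(6):
        predicted = n * a**k * math.exp(-a) / math.factorial(k)
        print(k, load.get(k, 0), round(predicted, 1))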

We want the variable-length hash to be an equidistributed function, that is, it should map an equal number of keys to each output. The only equidistributed mixing functions are the permutations of the internal state S. If the mixing function m is not a permutation, the final mixing steps will guarantee that some hash values are not possible, so the results will not be equidistributed. Similarly, if the combining function g : B \times S -> S is not equidistributed for the set of all input blocks B given a fixed internal state, the hash will not be equidistributed even for one-block keys. What if g is not a permutation of S when the input block is fixed?

Let a hash function be given where, for every block in B, the combining step is an arbitrary mapping from S to S. Then, for a fixed key, the hash is effectively a sequence of arbitrary mappings applied to the initial internal state. (The function proposed in \cite{MERKLE} seems to be an example of such a hash.) If we start with all possible internal states, how many are left after y blocks have been hashed? The answer is about 2|S|/y, to be shown by theorem LOSS.

To show this, let's start with a simpler question: if we apply a random mapping to x out of n states, how many distinct results will there be?

Theorem HALF: Define q = n/x. Applying a random mapping to x out of n possible states is expected to produce about n/(q + 1/2 + o(1/q)) distinct results if q < \sqrt{n}.

Proof: First, find the straightforward formula for how many distinct results there will be.

1/n                    chance of a given input hitting a given result

(n-1)/n                chance of a given input not hitting a given result

((n-1)/n)^x            chance of all x inputs not hitting a given result

n((n-1)/n)^x           number of results not hit at all by x inputs

n(1 - ((n-1)/n)^x)     number of distinct results hit by some input

Next, look at the series expansion of 1 - ((n-1)/n)^x:

1 - ((n-1)/n)^x
  = 1 - (n-1)^x/n^x
  = 1 - n^x/n^x + x n^{x-1}/n^x - x(x-1) n^{x-2}/(2n^x) + ...
  = 1 - 1 + x/n - (x^2 - x)/(2n^2) + o(x^3/n^3)
  = x/n - x^2/(2n^2) + x/(2n^2) + o(x^3/n^3)
  = x/n - x^2/(2n^2) + o(x^3/n^3)        (assuming n > n^2/x^2)
  = 1/q - 1/(2q^2) + o(1/q^3)

Note that 1/(q + 1/2 + o(1/q)) is also 1/q - 1/(2q^2) + o(1/q^3), so

1 - ((n-1)/n)^x = 1/((n/x) + 1/2 + o(x/n))

QED half

For example, n/10 states will map to about n/10.5 states.
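This is easy to test numerically (a sketch; n = 65536 and q = 10 are my choices, echoing the example above). Since a random mapping assigns each of the x distinct inputs an independent uniform image, counting distinct images of x uniform draws suffices:

    import random

    def distinct_after_mapping(n, x):
        """Apply a random mapping {0..n-1} -> {0..n-1} to x distinct
        inputs and count the distinct results."""
        return len({random.randrange(n) for _ in range(x)})

    n, q = 65536, 10
    x = n // q
    trials = 100
    avg = sum(distinct_after_mapping(n, x) for _ in range(trials)) / trials
    print(avg, n / (q + 0.5))   # both are about 6242, i.e. n/10.5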

Corollary BUCKETS: Hashing b/q keys into b buckets is expected to hit about b/(q + 1/2) buckets, if q < \sqrt{b}.

OK. Now we can look at the original question, applying a sequence of y random mappings to all n possible states.

Theorem LOSS: After applying y random mappings g_i : S -> S to all elements of S, the number of elements remaining will be about 2|S|/y when y < \sqrt{n}.

The exact formula for x_i, the number of states after i mappings, is:

x_0 = n
x_{i+1} = n(1 - ((n-1)/n)^{x_i})

Hypothesis: after i mappings we are left with 2n/(i + o(\log i)) states. The next mapping will leave us this many states:

n(1 - ((n-1)/n)^{x_i})
  = n/((n/x_i) + 1/2 + o(x_i/n))                                  (by HALF)
  = n/(n/(2n/(i + o(\log i))) + 1/2 + o((2n/(i + o(\log i)))/n))  (by hypothesis)
  = n/((i + o(\log i))/2 + 1/2 + o(2/(i + o(\log i))))
  = 2n/(i + o(\log i) + 1 + o(4/i))
  = 2n/((i+1) + o(\log(i+1)))            (since \sum_{i=1}^{n} 4/i is o(\log n))

The base case is satisfied because o(\log i) can be any constant, in reality 2 for i=0. By induction, the number of states left after y mappings is 2n/(y + o(\log y)). This sequence converges to the sequence 2n/y.


QED loss

Experiments with n = 65536 agree with both of these results.
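That experiment can be reproduced along these lines (a sketch; n = 65536 as in the text, while the choice of checkpoints y is mine):

    import random

    n = 65536
    states = set(range(n))
    for y in range(1, 33):
        f = [random.randrange(n) for _ in range(n)]   # a fresh random mapping
        states = {f[s] for s in states}               # apply it to survivors
        if y in (8, 16, 32):
            print(y, len(states), 2 * n / y)          # survivors vs. 2n/y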

Using theorem LOSS we can see that if the combining function does not act as a permutation on S, changing the i-th out of q blocks is expected to allow only \frac{2}{q-i} of all possible final states to be produced. If |S| = 2^8 (as in \cite{CACMhash}, which does not have this problem), such a loss would be disastrous. But when |S| = 2^128 (as in \cite{MERKLE}) this loss may be negligible.

Recall the birthday paradox: applying a random mapping f : S -> S to \sqrt{2|S|} values is expected to produce about 1 collision.

A common question about random number generators can be answered as a corollary of the theorems above. If a random number generator is based on a random mapping f : S -> S, what is its expected cycle length? The answer is about \sqrt{|S|}.

Theorem SQRT: The median number of distinct values in the sequence f^i(x), where f : S -> S is a random mapping and x is an element of S, is between \sqrt{|S|} and \sqrt{2|S|}.

By theorem BIRTHDAY, applying f to \sqrt{|S|} arbitrary values in S should yield about \frac{|S|}{\sqrt{|S|} + 1/2 + o(1/\sqrt{|S|})} distinct values. That is, there should be about 1/2 a collision. Similarly, mapping a subset of \sqrt{2|S|} values is expected to cause about 1 collision. (Theorem HALF will produce the same results.)

If no subset had more than one collision, then an expectation of 1/2 a collision would imply every subset had a 1/2 chance of at least one collision. Some subsets have more than one collision, though, so the chance of having at least one collision is less than 1/2 when the subset has \sqrt{|S|} values.

On the other hand, a set of \sqrt{2|S|} values is expected to produce 1 collision. These values could be chosen one at a time. If a collision does occur, there should be no more than 1 collision expected in the remaining values. If we solve for x in the equation \sum_{i=0}^{\sqrt{2|S|}} i x^{i+1} = 1, then x is a lower bound on the fraction of subsets with at least one collision. Since the solution is x = 1/2, the probability of having at least one collision is more than 1/2 when f maps a set of \sqrt{2|S|} values.

Until one collision occurs, the sequence f^i(x) is a list of arbitrary values chosen from S. According to the bounds above, over half the choices of (f, x) will go \sqrt{|S|} values without a collision, and less than half will go \sqrt{2|S|} values without a collision, so the median must be somewhere in between.


QED sqrt
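Theorem SQRT can also be checked empirically by measuring, over many random (f, x) pairs, how many distinct values f^i(x) passes through before the first repeat (a sketch; n = 4096 and the trial count are arbitrary choices of mine):

    import random, statistics

    def rho_length(n):
        """Iterate a random mapping from a random start until the
        first repeated value; return the number of distinct values."""
        f = [random.randrange(n) for _ in range(n)]
        x, seen = random.randrange(n), set()
        while x not in seen:
            seen.add(x)
            x = f[x]
        return len(seen)

    n = 4096
    lengths = [rho_length(n) for _ in range(401)]
    print(n ** 0.5, statistics.median(lengths), (2 * n) ** 0.5)
    # the median should land between sqrt(n) = 64 and sqrt(2n) = 90.5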


Hashing

Lecture 21

Steven S. Skiena

Hashing

One way to convert from names to integers is to use the letters to form a base ``alphabet-size'' number system:

To convert ``STEVE'' to a number, observe that e is the 5th letter of the alphabet, s is the 19th letter, t is the 20th letter, and v is the 22nd letter.

Thus ``Steve'' = 19*26^4 + 20*26^3 + 5*26^2 + 22*26^1 + 5*26^0 = 9,038,021.
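That arithmetic can be verified directly (a minimal sketch; the helper name base26 is mine, with letter values a = 1 .. z = 26 as in the text):

    def base26(name):
        """Interpret a name as a base-26 number with a=1, ..., z=26."""
        value = 0
        for ch in name.lower():
            value = value * 26 + (ord(ch) - ord('a') + 1)
        return value

    print(base26("steve"))   # 9038021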

Thus one way we could represent a table of names would be to set aside an array big enough to contain one element for each possible string of letters, then store data in the elements corresponding to real people. Computing this function immediately tells us where the person's phone number is!

What's the Problem?

Because we must leave room for every possible string, this method will use an incredible amount of memory. We need a data structure to represent a sparse table, one where almost all entries will be empty.

We can reduce the number of boxes we need if we are willing to put more than one thing in the same box!

Example: suppose we use the base alphabet number system, then take the remainder modulo the table size.

Now the table is much smaller, but we need a way to deal with the fact that more than one (but hopefully very few) keys can get mapped to the same array element.

The Basics of Hashing

The basic idea of hashing is to apply a function to the search key so we can determine where the item is without looking at the other items. To make the table of reasonable size, we must allow for collisions, two distinct keys mapped to the same location.

We use a special hash function to map keys (hopefully uniformly) to integers in a certain range.

We set up an array as big as this range, and use the value of the function as the index to store the appropriate key. Special care must be taken to handle collisions when they occur.

There are several clever techniques we will see to develop good hash functions and deal with the problems of duplicates.

Hash Functions

The verb ``hash'' means ``to mix up'', and so we seek a function to mix up keys as well as possible.

The best possible hash function would hash m keys into n ``buckets'' with no more than ceil(m/n) keys per bucket. Such a function is called a perfect hash function.

How can we build a hash function?

Let us consider hashing character strings to integers. The ORD function returns the character code associated with a given character. By using the ``base character size'' number system, we can map each string to an integer.
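A minimal sketch of such a hash function (the name hash_string, the character size 256, and the table size are my choices; Python's ord plays the role of ORD):

    def hash_string(key, table_size):
        """Map a string to a bucket index using the base-character-size
        number system, reduced mod the table size."""
        h = 0
        for ch in key:
            h = (h * 256 + ord(ch)) % table_size   # 256 = "character size"
        return h

    print(hash_string("STEVE", 1007))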

The First Three SSN Digits Hash

[Figures: distribution of keys hashed on the first three digits of the Social Security Number versus the last three digits.]

What is the big picture?

1. A hash function which maps an arbitrary key to an integer turns searching into array access, hence O(1).

2. To use a finite sized array means two different keys will be mapped to the same place. Thus we must have some way to handle collisions.

3. A good hash function must spread the keys uniformly, or else we have a linear search.

Ideas for Hash Functions

Truncation - When grades are posted, the last four digits of your SSN are used, because they distribute students more uniformly than the first four digits.

Folding - We should get a better spread by factoring in the entire key. Maybe subtract the last four digits from the first five digits of the SSN, and take the absolute value?

Modular Arithmetic - When constructing pseudorandom numbers, a good trick for uniform distribution was to take a big number mod the size of our range. Because of our roulette wheel analogy, the numbers tend to get spread well if the table size is selected carefully. Sketches of all three ideas appear below.
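Hedged sketches of all three ideas, using a 9-digit SSN as the key (the function names and example values are mine, not from the lecture):

    def truncation(ssn):
        """Keep only the last four digits."""
        return ssn % 10000

    def folding(ssn):
        """Subtract the last four digits from the first five and
        take the absolute value."""
        return abs(ssn // 10000 - ssn % 10000)

    def modular(ssn, table_size=1007):
        """Reduce the whole key mod a carefully chosen table size."""
        return ssn % table_size

    ssn = 123456789
    print(truncation(ssn), folding(ssn), modular(ssn))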

Prime Numbers are Good Things

Suppose we wanted to hash check totals by the dollar value in pennies mod 1000. What happens? Prices tend to be clumped by similar last digits, so we get clustering.

If we instead use a prime modulus like 1007, these clusters will get broken up.

In general, it is a good idea to use a prime modulus for the hash table size, since it is less likely the data will be multiples of large primes as opposed to small primes - all multiples of 4 get mapped to even numbers in an even-sized hash table!

The Birthday Paradox

No matter how good our hash function is, we had better be prepared for collisions, because of the birthday paradox.

Assuming 365 days a year, what is the probability that no two people share a birthday? Once the first person has fixed their birthday, the second person has 364 possible days to be born to avoid a collision, or a 364/365 chance.

With three people, the probability that no two share is (364/365)(363/365). In general, the probability of there being no collisions after n insertions into an m-element table is

\prod_{i=1}^{n-1} (m-i)/m

When m = 366, this probability sinks below 1/2 when N = 23, and to almost 0 by N = 50.

The moral is that collisions are common, even with good hash functions.
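The crossover at N = 23 is easy to verify with the product above (a minimal sketch; m = 366 as in the text):

    def p_no_collision(n, m=366):
        """Probability that n insertions into an m-element table
        produce no collision."""
        p = 1.0
        for i in range(n):
            p *= (m - i) / m
        return p

    print(p_no_collision(22))   # about 0.53
    print(p_no_collision(23))   # about 0.49, first value below 1/2
    print(p_no_collision(50))   # about 0.03, nearly zero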

What about Collisions?

No matter how good our hash functions are, we must deal with collisions. What do we do when the spot in the table we need is occupied?

Put it somewhere else! - In open addressing, we have a rule to decide where to put it if the space is already occupied.

Keep a list at each bin! - At each spot in the hash table, keep a linked list of keys sharing this hash value, and do a sequential search to find the one we need. This method is called chaining.

Collision Resolution by Chaining

The easiest approach is to let each element in the hash table be a pointer to a list of keys.

Insertion, deletion, and query reduce to the corresponding problem on linked lists. If the n keys are distributed uniformly in a table of size m, each operation takes expected O(n/m) time.

Chaining is easy, but devotes a considerable amount of memory to pointers, which could instead be used to make the table larger. Still, it is my preferred method.
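A minimal chained hash table sketch (the class name, the fixed table size 1007, and the use of Python's built-in hash are my choices, not the lecture's):

    class ChainedHashTable:
        def __init__(self, size=1007):
            self.table = [[] for _ in range(size)]   # one list per slot

        def _slot(self, key):
            return hash(key) % len(self.table)

        def insert(self, key, value):
            bucket = self.table[self._slot(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:                  # overwrite an existing key
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))

        def search(self, key):
            for k, v in self.table[self._slot(key)]:  # sequential search
                if k == key:
                    return v
            return None

    t = ChainedHashTable()
    t.insert("steve", "555-1234")
    print(t.search("steve"))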

Open Addressing

We can dispense with all these pointers by using an implicit reference derived from a simple function.

If the space we want to use is filled, we can examine the remaining locations:

1. Sequentially

2. Quadratically

3. Linearly

The reason for using a more complicated scheme is to avoid long runs from similarly hashed keys.

Deletion in an open addressing scheme is ugly, since removing one element can break a chain of insertions, making some elements inaccessible.
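A sketch of open addressing with sequential probing (the names and table size are mine; deletion is deliberately omitted since, as noted above, naive removal breaks probe chains, and a tombstone marker is the usual fix):

    class LinearProbingTable:
        def __init__(self, size=1007):
            self.keys = [None] * size
            self.values = [None] * size

        def insert(self, key, value):
            i = hash(key) % len(self.keys)
            while self.keys[i] is not None and self.keys[i] != key:
                i = (i + 1) % len(self.keys)   # examine the next slot
            self.keys[i], self.values[i] = key, value

        def search(self, key):
            i = hash(key) % len(self.keys)
            while self.keys[i] is not None:
                if self.keys[i] == key:
                    return self.values[i]
                i = (i + 1) % len(self.keys)
            return None                         # hit an empty slot: absent

    t = LinearProbingTable()
    t.insert("steve", "555-1234")
    print(t.search("steve"))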

Performance on Set Operations

With either chaining or open addressing:

Search - O(1) expected, O(n) worst case.

Insert - O(1) expected, O(n) worst case.

Delete - O(1) expected, O(n) worst case.

Pragmatically, a hash table is often the best data structure to maintain a dictionary. However, the worst-case running time is unpredictable.

The best worst-case bounds on a dictionary come from balanced binary trees, such as red-black trees.


Data Structures and Algorithms


8.3 Hash Tables


8.3.1 Direct Address Tables

If we have a collection of n elements whose keys are unique integers in (1, m), where m >= n, then we can store the items in a direct address table, T[m], where T_i is either empty or contains one of the elements of our collection.

Searching a direct address table is clearly an O(1) operation: for a key, k, we access T_k;

if it contains an element, return it;

if it doesn't, then return a NULL.
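A direct address table is just an array indexed by key (a minimal sketch; the class name is mine, and it assumes integer keys in (1, m)):

    class DirectAddressTable:
        def __init__(self, m):
            self.T = [None] * (m + 1)   # slots 1..m, all initially empty

        def insert(self, key, element):
            self.T[key] = element        # O(1)

        def search(self, key):
            return self.T[key]           # the element, or None if empty

    t = DirectAddressTable(100)
    t.insert(42, "some element")
    print(t.search(42), t.search(7))     # "some element" None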

There are two constraints here:

1. the keys must be unique, and

2. the range of the key must be severely bounded.


If the keys are not unique, then we can simply construct a set of m lists and store the heads of these lists in the direct address table. The time to find an element matching an input key will still be O(1).

However, if each element of the collection has some other distinguishing feature (other than its key), and if the maximum number of duplicates is n_{dupmax}, then searching for a specific element is O(n_{dupmax}). If duplicates are the exception rather than the rule, then n_{dupmax} is much smaller than n and a direct address table will provide good performance. But if n_{dupmax} approaches n, then the time to find a specific element is O(n) and a tree structure will be more efficient.


The range of the key determines the size of the direct address table and may be too large to be practical. For instance, it's not likely that you'll be able to use a direct address table to store elements which have arbitrary 32-bit integers as their keys for a few years yet!

Direct addressing is easily generalised to the case where there is a function, h(k) => (1, m), which maps each value of the key, k, to the range (1, m). In this case, we place the element in T[h(k)] rather than T[k], and we can search in O(1) time as before.

8.3.2 Mapping functions

The direct address approach requires that the function, h(k), is a one-to-one mapping from each k to integers in (1, m). Such a function is known as a perfect hashing function: it maps each key to a distinct integer within some manageable range and enables us to trivially build an O(1) search time table.

Unfortunately, finding a perfect hashing function is not always possible. Let's say that we can find a hash function, h(k), which maps most of the keys onto unique integers, but maps a small number of keys onto the same integer. If the number of collisions (cases where multiple keys map onto the same integer) is sufficiently small, then hash tables work quite well and give O(1) search times.

Handling the collisions

In the small number of cases where multiple keys map to the same integer, elements with different keys may be stored in the same "slot" of the hash table. It is clear that when the hash function is used to locate a potential match, it will be necessary to compare the key of that element with the search key. But there may be more than one element which should be stored in a single slot of the table. Various techniques are used to manage this problem:

1. chaining,

2. overflow areas,

3. re-hashing,

4. using neighbouring slots (linear probing),

5. quadratic probing,

6. random probing, ...

Chaining

One simple scheme is to chain all collisions in lists attached to the appropriate slot. This allows an unlimited number of collisions to be handled and doesn't require a priori knowledge of how many elements are contained in the collection. The tradeoff is the same as with linked lists versus array implementations of collections: linked list overhead in space and, to a lesser extent, in time.

Re-hashing

Re-hashing schemes use a second hashing operation when there is a collision. If there is a further collision, we re-hash until an empty "slot" in the table is found.

The re-hashing function can either be a new function or a re-application of the original one. As long as the functions are applied to a key in the same order, then a sought key can always be located.

Linear probing

One of the simplest re-hashing functions is +1 (or -1), i.e. on a collision, look in the neighbouring slot in the table. It calculates the new address extremely quickly and may be extremely efficient on a modern RISC processor due to efficient cache utilisation (cf. the discussion of linked list efficiency).

The animation gives you a practical demonstration of the effect of linear probing: it also implements a quadratic re-hash function so that you can compare the difference.


[Figure: h(j) = h(k), so the next hash function, h1, is used. A second collision occurs, so h2 is used.]

Clustering

Linear probing is subject to a clustering phenomenon. Re-hashes from one location occupy a block of slots in the table which "grows" towards slots to which other keys hash. This exacerbates the collision problem and the number of re-hashes can become large.

Quadratic Probing

Better behaviour is usually obtained with quadratic probing, where the secondary hash function depends on the re-hash index:

address = h(key) + c i^2

on the i-th re-hash. (A more complex function of i may also be used.) Since keys which are mapped to the same value by the primary hash function follow the same sequence of addresses, quadratic probing shows secondary clustering. However, secondary clustering is not nearly as severe as the clustering shown by linear probes.
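The two probe sequences differ only in the increment (a sketch; c = 1 and the table size are my choices):

    def linear_probe(h, i, m):
        """Address on the i-th re-hash under linear probing."""
        return (h + i) % m

    def quadratic_probe(h, i, m, c=1):
        """Address on the i-th re-hash: h(key) + c*i^2."""
        return (h + c * i * i) % m

    m, h = 1007, 123
    print([linear_probe(h, i, m) for i in range(5)])     # 123 124 125 126 127
    print([quadratic_probe(h, i, m) for i in range(5)])  # 123 124 127 132 139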

Re-hashing schemes use the originally allocated table space and thus avoid linked list overhead, but require advance knowledge of the number of items to be stored.

However, the collision elements are stored in slots to which other key values map directly, thus the potential for multiple collisions increases as the table becomes full.

Overflow area

Another scheme will divide the pre-allocated table into two sections: the primary area to which keys are mapped and an area for collisions, normally termed the overflow area.

When a collision occurs, a slot in the overflow area is used for the new element and a link from the primary slot established as in a chained system. This is essentially the same as chaining, except that the overflow area is pre-allocated and thus possibly faster to access. As with re-hashing, the maximum number of elements must be known in advance, but in this case, two parameters must be estimated: the optimum size of the primary and overflow areas.

Of course, it is possible to design systems with multiple overflow tables, or with a mechanism for handling overflow out of the overflow area, which provide flexibility without losing the advantages of the overflow scheme.

Summary: Hash Table Organization

Organization    Advantages                            Disadvantages
------------    ----------                            -------------
Chaining        Unlimited number of elements;         Overhead of multiple linked lists
                unlimited number of collisions

Re-hashing      Fast re-hashing; fast access          Maximum number of elements must be
                through use of main table space       known; multiple collisions may
                                                      become probable

Overflow area   Fast access; collisions don't         Two parameters which govern
                use primary table space               performance need to be estimated