Leveraging Big Data: Lecture 3
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo
http://www.cohenwang.com/edith/bigdataclass2013
Overview: More on Min-hash Sketches
Subset/Selection size queries from random samples
Min-hash sketches as samples
Other uses of the sampling “view” of sketches:
Sketch-based similarity estimation
Inverse-probability distinct count estimators
Min-hash sketches on a small range (fewer bits)
How samples are useful
We often want to know more than just the number of distinct elements:
How many distinct search queries (or distinct query/location pairs)…
involve the recent election?
are related to flu?
reflect financial uncertainty?
How many distinct IP flows going through our network…
use a particular protocol?
are originated from a particular location?
Such subset queries are specified by a predicate. They can be answered approximately from the sample.
Min-hash Sketches as Random Samples
A min-hash sketch as a random sample:
A distinct element 𝑥 is sampled if it “contributed” to the sketch: 𝑠(𝑁 ∖ {𝑥}) ≠ 𝑠(𝑁)
To facilitate subset queries, we need to retain meta-data/IDs of sampled elements.
Min-hash samples can be efficiently computed
over data streams
over distributed data (using mergeability)
k-mins sketch as a sample (𝑘 = 3)

Element 𝑥:  4     32    14    12    7     6
ℎ1(𝑥):      0.92  0.45  0.74  0.35  0.21  0.14
ℎ2(𝑥):      0.20  0.19  0.07  0.51  0.70  0.55
ℎ3(𝑥):      0.18  0.10  0.93  0.71  0.50  0.89

Sketch (𝑥1, 𝑥2, 𝑥3) = (0.14, 0.07, 0.10)
k-mins sample: (6, 14, 32)
Sampling scheme: k times with replacement
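A minimal Python sketch of this sampling scheme (illustrative, not from the lecture; the salted SHA-256 hash into [0,1) is a stand-in for 𝑘 independent random hash functions):

```python
import hashlib

def hash01(x, salt):
    # Illustrative hash into [0,1); the salt simulates independent functions.
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def kmins_sample(elements, k):
    # k-mins sample: for each of the k hash functions, keep the element
    # attaining the minimum hash value (k draws "with replacement").
    return [min(elements, key=lambda x, i=i: hash01(x, i)) for i in range(k)]
```

The same element may appear in several coordinates, which is exactly the “with replacement” behavior.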
k-partition sketch as a sample (𝑘 = 3)

Element 𝑥:       4     32    14    12    7     6
part-hash(𝑥):    3     2     1     3     1     2
value-hash ℎ(𝑥): 0.20  0.19  0.07  0.51  0.70  0.55

Sketch (𝑦1, 𝑦2, 𝑦3) = (0.07, 0.19, 0.20)
k-partition sample: (14, 32, 4)
Sampling scheme: throw elements into 𝑘 buckets
Choose one uniformly from each nonempty bucket
Bottom-k sketch as a sample (𝑘 = 3)

Element 𝑥:  4     32    14    12    7     6
ℎ(𝑥):       0.20  0.19  0.07  0.51  0.70  0.55

Sketch {𝑦1, 𝑦2, 𝑦3} = {0.07, 0.19, 0.20}
Bottom-k sample: {14, 32, 4}
Sampling scheme: choose 𝑘 without replacement
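A minimal Python sketch of the bottom-k scheme (illustrative; the shared SHA-256-based hash stands in for the random hash function). Because all sets use the same hash, two bottom-k sketches can be merged into the bottom-k sketch of the union:

```python
import hashlib

def h(x):
    # One shared ("coordinated") hash into [0,1) used for all sets.
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def bottom_k(elements, k):
    # The k distinct elements with the smallest hash values:
    # a sample of size k without replacement.
    return sorted(set(elements), key=h)[:k]
```

Merging is just taking the bottom-k of the two sketches' union, since the k smallest hashes of A ∪ B all appear in the sketch of A or of B.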
Selection/Subset queries from min-hash samples
Let 𝑷 be the subset of elements satisfying our selection predicate. We want to estimate
the number |𝑷 ∩ 𝑵| of distinct elements satisfying the predicate, or
their fraction |𝑷 ∩ 𝑵| / |𝑵| ≡ 𝜶
𝑘′ ≤ 𝑘 distinct elements are sampled.
The sample is exchangeable (fixing the sample size, all subsets are equally likely).
When 𝐧 ≫ 𝒌 all three schemes are similar.
Our estimator for a k-mins sample (𝑥1, …, 𝑥𝑘) (𝑘 times with replacement) is:
α̂ = (1/𝑘) Σ𝑖 𝑰{𝑥𝑖 ∈ 𝑷}
Expectation: 𝜇 = 𝛼
Variance: 𝜎² = 𝛼(1 − 𝛼)/k
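In code, the estimator is just the mean of the presence indicators over the sample (an illustrative helper; the predicate is passed as a Python function):

```python
def subset_fraction(sample, predicate):
    # Unbiased estimate of alpha = |P ∩ N| / |N| from a min-hash sample:
    # the fraction of sampled elements satisfying the predicate.
    return sum(1 for x in sample if predicate(x)) / len(sample)
```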
Subset queries: k-mins samples
One uniform sample 𝑥 ∈ 𝑁 has probability 𝛼 to be from 𝑃.
Its “presence” I{𝑥 ∈ 𝑃} is 1 with probability 𝛼 and 0 with probability 1 − 𝛼.
The expectation and variance of I{𝑥 ∈ 𝑃} are
𝝁 = 𝜶 ⋅ 1 + (1 − 𝜶) ⋅ 0 = 𝜶
𝝈² = 𝜶 ⋅ 1² + (1 − 𝜶) ⋅ 0² − 𝝁² = 𝜶(1 − 𝜶)
Subset queries: bottom-k and k-partition samples
Sampling is without replacement:
exactly 𝑘′ = 𝑘 times with bottom-k
1 ≤ 𝑘′ ≤ 𝑘 times with k-partition (𝑘′ is the number of nonempty “buckets” when tossing 𝑛 balls into 𝑘 buckets)
We use the estimator:
α̂ = |𝑷 ∩ 𝑺| / |𝑺| = |𝑷 ∩ 𝑺| / 𝑘′
We show:
The expectation is: 𝐄[α̂] = |𝑷 ∩ 𝑵| / |𝑵| ≡ 𝜶
The variance (conditioned on 𝑘′) is:
𝝈² = (𝜶(1 − 𝜶)/𝑘′) (1 − (𝑘′ − 1)/(𝒏 − 1))
Expectation of α̂ (k-partition and bottom-k)
We condition on the number of sampled (distinct) elements 𝑘′ ≥ 1:
Consider the “positions” i = 1, …, 𝑘′ in the sample and their “contributions” 𝑇𝑖 to α̂. We have α̂ = Σ𝑖 𝑇𝑖.
If a position 𝑖 gets an element 𝑥 ∈ 𝑷 ∩ 𝑵 (probability 𝜶), then T𝑖 = 1/𝑘′. Otherwise, T𝑖 = 0.
Therefore,
E[𝑇𝑖] = 𝜶 ⋅ (1/𝑘′) = 𝜶/𝑘′
Var[𝑇𝑖] = 𝜶/𝑘′² − (𝜶/𝑘′)² = 𝜶(1 − 𝜶)/𝑘′²
From linearity of expectation,
𝐄[α̂] = Σ𝑖 𝐄[𝑇𝑖] = 𝑘′ ⋅ 𝜶/𝑘′ = 𝜶
k-partition: Since this is the expectation for every possible 𝑘′, it is also the expectation overall.
Variance of α̂ (k-partition and bottom-k)
Conditioned on 𝑘′ ≥ 1:
Var[α̂] = Σ_{𝑖,𝑗 ∈ {1, …, 𝑘′}} Cov[𝑇𝑖, 𝑇𝑗]
For 𝑖 ≠ 𝑗,
Cov[𝑇𝑖, 𝑇𝑗] = E[𝑇𝑖 𝑇𝑗] − (𝜶/𝑘′)² = 𝛼 ⋅ (𝛼𝑛 − 1)/(𝑛 − 1) ⋅ 1/𝑘′² − 𝛼²/𝑘′² = −𝛼(1 − 𝛼) / ((n − 1) 𝑘′²)
Cov[𝑇𝑖, 𝑇𝑖] = Var[𝑇𝑖] = 𝜶(1 − 𝜶)/𝑘′²
Var[α̂] = 𝑘′ ⋅ 𝛼(1 − 𝛼)/𝑘′² − 𝑘′(𝑘′ − 1) ⋅ 𝛼(1 − 𝛼)/((n − 1) 𝑘′²) = (𝛼(1 − 𝛼)/𝑘′) (1 − (𝑘′ − 1)/(𝑛 − 1))
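The variance formula can be checked numerically. The simulation below (an illustration, not part of the lecture) draws bottom-k samples, for which 𝑘′ = 𝑘, and compares the empirical variance of α̂ with α(1 − α)/𝑘′ ⋅ (1 − (𝑘′ − 1)/(𝑛 − 1)):

```python
import random

random.seed(7)
n, k, trials = 100, 10, 20000
population = list(range(n))
in_P = lambda x: x < 30          # predicate; the true fraction alpha is 0.3
alpha = 0.3

estimates = []
for _ in range(trials):
    # Fresh random hash per trial; bottom-k sample has k' = k.
    hashes = {x: random.random() for x in population}
    sample = sorted(population, key=hashes.get)[:k]
    estimates.append(sum(in_P(x) for x in sample) / k)

mean = sum(estimates) / trials
var = sum((e - mean) ** 2 for e in estimates) / trials
theory = alpha * (1 - alpha) / k * (1 - (k - 1) / (n - 1))
```

The factor (1 − (𝑘′ − 1)/(𝑛 − 1)) is the usual without-replacement correction; it vanishes as 𝑛 grows.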
Subset estimation: Summary
For any predicate, we obtain an unbiased estimator α̂ of the fraction 𝜶 = |𝑷 ∩ 𝑵| / |𝑵| ∈ [0, 1] with standard deviation 𝝈 ≤ 1/(2√𝑘′)
More accurate when 𝜶 is close to 0 or to 1
With bottom-k, more accurate when 𝑛 = O(𝑘)
Next: Sketch-based similarity estimation
Applications of similarity
Modeling using features
Scalability using sketches
Terms and shingling technique for text documents
Jaccard and cosine similarity
Sketch-based similarity estimators
Search example
User issues a query (over images, movies, text documents, Webpages)
Search engine finds many matching documents:
Doc1, Doc1′, Doc1′′, Doc2, Doc2′, Doc2′′, Doc3, Doc3′, Doc3′′
Elimination of near duplicates
A lot of redundant information – many documents are very similar. Want to eliminate near-duplicates
Doc1, Doc1′, Doc1′′, Doc2, Doc2′, Doc2′′, Doc3, Doc3′, Doc3′′
Elimination of near duplicates
A lot of redundant information – many documents are very similar. Want to eliminate near-duplicates
Doc1, Doc2′, Doc3
Elimination of near duplicates
Return to the human user a concise, informative, result: Doc1, Doc2′, Doc3
Why is similarity interesting?
Identifying similar documents in a collection of documents (text/images)
Search (query is also treated as a “document”)
Find text documents on a similar topic
Face recognition
Labeling documents (collection of images, only some are labeled; extend the label from similarity)
…
Identifying near-duplicates (very similar documents)
Why do we want to find near-duplicates?
Plagiarism
Copyright violations
Clean up search results
Why do we find many near-duplicates?
Mirror pages
Variations on the same source
Exact match is easy: use a hash/signature
Document Similarity
Modeling: Identify a set of features for our similarity application. Similar documents should have similar features: similarity is captured by the similarity of the feature sets/vectors (use a similarity measure).
Analyse each document to extract the set of relevant features.
Sketch-based similarity: Making it scalable
Sketch the set of features of each document such that similar sets imply similar sketches
Estimate the similarity of two feature sets from the similarity of the two sketches
Doc1 → (0,0,1,0,1,1,0,…) → Sketch1
Doc2 → (1,0,1,1,1,1,0,…) → Sketch2
Similarity of text documents
What is a good set of features?
Approach: Features = words (terms)
View each document as a bag of words
Similar documents have similar bags
This works well (with TF/IDF weighting…) to detect documents on a similar topic.
It is not geared toward detecting near-duplicates.
Shingling technique for text documents (Web pages) [Broder 97]
For a parameter 𝑘: each feature corresponds to a 𝑘-gram (shingle): an ordered set of 𝑘 “tokens” (words)
Very similar documents have similar sets of features (even if sentences are shifted, replicated)
All 3-shingles in the title “Shingling technique for text documents Web pages”:
Shingling technique for
technique for text
for text documents
text documents Web
documents Web pages
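The shingle extraction can be written in a few lines of Python (an illustrative sketch of the technique, tokenizing on whitespace):

```python
def shingles(text, k=3):
    # The set of word k-grams ("shingles") used as the document's features.
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}
```

On the title above this yields exactly the five 3-shingles listed.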
Similarity measures
We measure the similarity of two documents by the similarity of their feature sets/vectors
Comment: we will focus on sets/binary vectors today. In general, we sometimes want to associate “weights” with the presence of features in a document
Two popular measures are
The Jaccard coefficient
Cosine similarity
Jaccard Similarity
A common similarity measure of two sets: the ratio of the size of the intersection to the size of the union of the feature sets 𝑁1 (of document 1) and 𝑁2 (of document 2):
𝐽(𝑁1, 𝑁2) = |𝑁1 ∩ 𝑁2| / |𝑁1 ∪ 𝑁2|
Example: 𝐽 = 3/8 = 0.375
Comment: Weighted Jaccard, a similarity of weighted (nonnegative) vectors: sum of min over sum of max
𝐽(𝒗, 𝒖) = Σ𝑖 min{𝑣𝑖, 𝑢𝑖} / Σ𝑖 max{𝑣𝑖, 𝑢𝑖}
𝒗 = (0.34, 0.21, 0.00, 0.03, 0.05, 0.00, 1.00, 0.00)
𝒖 = (0.00, 0.23, 0.00, 0.00, 0.03, 0.00, 1.00, 0.13)
min = (0.00, 0.21, 0.00, 0.00, 0.03, 0.00, 1.00, 0.00)
max = (0.34, 0.23, 0.00, 0.03, 0.05, 0.00, 1.00, 0.13)
𝐽(𝒗, 𝒖) = 1.24 / 1.78
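Both variants are one-liners in Python (illustrative helpers reproducing the two formulas above):

```python
def jaccard(n1, n2):
    # Ratio of intersection size to union size of two sets.
    return len(n1 & n2) / len(n1 | n2)

def weighted_jaccard(v, u):
    # Sum of coordinatewise minima over sum of maxima (nonnegative vectors).
    return sum(map(min, v, u)) / sum(map(max, v, u))
```

On the slide's vectors, weighted_jaccard reproduces 1.24/1.78.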
Cosine Similarity
Similarity measure between two vectors: the cosine of the angle 𝜃 between the two vectors.
C(𝒗, 𝒖) = 𝒗 ⋅ 𝒖 / (‖𝒗‖₂ ‖𝒖‖₂)
Euclidean norm: ‖𝐕‖₂ = √(Σ𝑖 𝑉𝑖²)
Cosine Similarity (binary)
View each set 𝑁′ ⊂ 𝑈 as a vector 𝑣(𝑁′) with an entry for each element in the domain:
𝑖 ∈ 𝑁′ ⇔ 𝑣(𝑁′)𝑖 = 1
𝑖 ∉ 𝑁′ ⇔ 𝑣(𝑁′)𝑖 = 0
Cosine similarity between 𝑁1 and 𝑁2:
C(𝑁1, 𝑁2) = 𝑣(𝑁1) ⋅ 𝑣(𝑁2) / (‖𝑣(𝑁1)‖₂ ‖𝑣(𝑁2)‖₂) = |𝑁1 ∩ 𝑁2| / √(|𝑁1| |𝑁2|)
Example: 𝐶 = 3/√(5 ⋅ 6) ≈ 0.55
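For binary vectors the formula reduces to set operations, so it can be sketched directly on sets (illustrative helper):

```python
import math

def cosine_binary(n1, n2):
    # For 0/1 vectors the dot product is |n1 ∩ n2| and each Euclidean
    # norm is the square root of the set size.
    return len(n1 & n2) / math.sqrt(len(n1) * len(n2))
```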
Estimating Similarity of sets using their Min-Hash sketches
We sketch all sets using the same hash functions. There is a special relation between the sketches: we say the sketches are “coordinated”.
Coordination is what allows the sketches to be mergeable. If we had used different hash functions for each set, the sketches would not have been mergeable.
Coordination also implies that similar sets have similar sketches (LSH property). This allows us to obtain good estimates of the similarity of two sets from the similarity of sketches of the sets.
Jaccard Similarity from Min-Hash sketches
For each set 𝑁𝑖 we have a Min-Hash sketch 𝑠(𝑁𝑖) (use the same hash function/s ℎ for all sets)
𝐽(𝑁1, 𝑁2) = |𝑁1 ∩ 𝑁2| / |𝑁1 ∪ 𝑁2|
Merge 𝑠(𝑁1) and 𝑠(𝑁2) to obtain 𝑠(𝑁1 ∪ 𝑁2)
For each 𝑥 ∈ s(N1 ∪ N2) we know everything about its membership in 𝑁1 or 𝑁2:
𝒙 ∈ 𝑠(𝑵1 ∪ 𝑵2) is in 𝑵𝒊 if and only if 𝒙 ∈ 𝑠(𝑵𝒊)
In particular, we know if 𝑥 ∈ 𝑁1 ∩ 𝑁2
𝐽 is the fraction of union members that are intersection members: apply the subset estimator to 𝑠(𝑁1 ∪ 𝑁2)
k-mins sketches: Jaccard estimation (𝑘 = 4)
𝑠(𝑁1) = (0.22, 0.11, 0.14, 0.22)
𝑠(𝑁2) = (0.18, 0.24, 0.14, 0.35)
𝑠(𝑁1 ∪ 𝑁2) = (0.18, 0.11, 0.14, 0.22)
Membership of the union-sketch coordinates: 0.18 ∈ 𝑁2 ∖ 𝑁1; 0.11 ∈ 𝑁1 ∖ 𝑁2; 0.14 ∈ 𝑁1 ∩ 𝑁2; 0.22 ∈ 𝑁1 ∖ 𝑁2
⇒ Can estimate 𝛼 = |𝑁1 ∖ 𝑁2|/|𝑁1 ∪ 𝑁2|, |𝑁2 ∖ 𝑁1|/|𝑁1 ∪ 𝑁2|, |𝑁1 ∩ 𝑁2|/|𝑁1 ∪ 𝑁2| unbiasedly with 𝜎² = 𝛼(1 − 𝛼)/𝑘
α̂(𝑁1 ∖ 𝑁2) = 2/4 = 1/2;  α̂(𝑁2 ∖ 𝑁1) = 1/4;  α̂(𝑁1 ∩ 𝑁2) = 1/4
k-partition sketches: Jaccard estimation (𝑘 = 4)
𝑠(𝑁1) = (1.00, 1.00, 0.14, 0.21)   𝑘′ = 2
𝑠(𝑁2) = (0.18, 1.00, 0.14, 0.35)   𝑘′ = 3
𝑠(𝑁1 ∪ 𝑁2) = (0.18, 1.00, 0.14, 0.21)   𝑘′ = 3
Membership of the nonempty union-sketch buckets: 0.18 ∈ 𝑁2 ∖ 𝑁1; 0.14 ∈ 𝑁1 ∩ 𝑁2; 0.21 ∈ 𝑁1 ∖ 𝑁2
⇒ Can estimate 𝛼 = |𝑁1 ∖ 𝑁2|/|𝑁1 ∪ 𝑁2|, |𝑁2 ∖ 𝑁1|/|𝑁1 ∪ 𝑁2|, |𝑁1 ∩ 𝑁2|/|𝑁1 ∪ 𝑁2| unbiasedly with 𝜎² = 𝛼(1 − 𝛼)/𝑘′ (conditioned on 𝑘′)
α̂(𝑁1 ∖ 𝑁2) = 1/3;  α̂(𝑁2 ∖ 𝑁1) = 1/3;  α̂(𝑁1 ∩ 𝑁2) = 1/3
Bottom-k sketches: Jaccard estimation (𝑘 = 4)
𝑠(𝑁1) = {0.09, 0.14, 0.18, 0.21}
𝑠(𝑁2) = {0.14, 0.17, 0.19, 0.35}
𝑠(𝑁1 ∪ 𝑁2) = {0.09, 0.14, 0.17, 0.18}   (smallest 𝑘 = 4 in the union of the sketches)
Membership: 0.09 ∈ 𝑁1 ∖ 𝑁2; 0.14 ∈ 𝑁1 ∩ 𝑁2; 0.17 ∈ 𝑁2 ∖ 𝑁1; 0.18 ∈ 𝑁1 ∖ 𝑁2
⇒ Can estimate 𝛼 = |𝑁1 ∖ 𝑁2|/|𝑁1 ∪ 𝑁2|, |𝑁2 ∖ 𝑁1|/|𝑁1 ∪ 𝑁2|, |𝑁1 ∩ 𝑁2|/|𝑁1 ∪ 𝑁2| unbiasedly with 𝜎² = (𝛼(1 − 𝛼)/𝑘)(1 − (𝑘 − 1)/(𝑛 − 1))
α̂(𝑁1 ∖ 𝑁2) = 2/4;  α̂(𝑁2 ∖ 𝑁1) = 1/4;  α̂(𝑁1 ∩ 𝑁2) = 1/4
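A Python sketch of this estimator (illustrative; the sketches are represented simply as lists of hash values, since distinct elements get distinct hashes under a shared hash function):

```python
def merge_sketches(s1, s2, k):
    # Union sketch: the k smallest distinct hash values in either sketch.
    return sorted(set(s1) | set(s2))[:k]

def jaccard_from_sketches(s1, s2, k):
    # Fraction of union-sketch members appearing in both sketches,
    # i.e., members known to be in the intersection.
    union = merge_sketches(s1, s2, k)
    both = set(s1) & set(s2)
    return sum(1 for y in union if y in both) / len(union)
```

On the slide's sketches this gives the union sketch {0.09, 0.14, 0.17, 0.18} and the estimate Ĵ = 1/4.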
Bottom-k sketches: better estimate (𝑘 = 4)
𝑠(𝑁1) = {0.09, 0.14, 0.18, 0.21}
𝑠(𝑁2) = {0.14, 0.17, 0.19, 0.35}
We can look beyond the union sketch: We have complete membership information on all elements with ℎ(𝑥) ≤ min{max 𝑠(𝑁1), max 𝑠(𝑁2)}. We have 2k > 𝑘′ ≥ 𝑘 such elements!
Here the threshold is min{0.21, 0.35} = 0.21, so we also use 0.19 and 0.21: 𝑘′ = 6 > 4
Bottom-k sketches: better estimate (𝑘 = 4)
𝑠(𝑁1) = {0.09, 0.14, 0.18, 0.21}
𝑠(𝑁2) = {0.14, 0.17, 0.19, 0.35}
The 𝑘′ = 6 > 4 elements with ℎ(𝑥) ≤ 0.21 are 0.09, 0.14, 0.17, 0.18, 0.19, 0.21
Membership: 0.09, 0.18, 0.21 ∈ 𝑁1 ∖ 𝑁2; 0.17, 0.19 ∈ 𝑁2 ∖ 𝑁1; 0.14 ∈ 𝑁1 ∩ 𝑁2
⇒ Can estimate 𝛼 = |𝑁1 ∖ 𝑁2|/|𝑁1 ∪ 𝑁2|, |𝑁2 ∖ 𝑁1|/|𝑁1 ∪ 𝑁2|, |𝑁1 ∩ 𝑁2|/|𝑁1 ∪ 𝑁2| unbiasedly with 𝜎² = (𝛼(1 − 𝛼)/𝑘′)(1 − (𝑘′ − 1)/(𝑛 − 1)) (conditioned on 𝑘′)
α̂(𝑁1 ∖ 𝑁2) = 3/6 = 1/2;  α̂(𝑁2 ∖ 𝑁1) = 2/6 = 1/3;  α̂(𝑁1 ∩ 𝑁2) = 1/6
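The beyond-the-union idea also fits in a few lines (illustrative sketch, same list-of-hashes representation as before):

```python
def jaccard_beyond_union(s1, s2):
    # Use every hash value at most min(max(s1), max(s2)): membership
    # in both sets is known for all k' >= k such elements.
    t = min(max(s1), max(s2))
    pooled = [y for y in set(s1) | set(s2) if y <= t]
    both = set(s1) & set(s2)
    return sum(1 for y in pooled if y in both) / len(pooled)
```

With the slide's sketches the pooled set has 𝑘′ = 6 elements and the intersection estimate drops to 1/6.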
Cosine Similarity from Min-Hash sketches: Crude estimator
We have estimates with good relative error (and concentration) for |𝑁1 ∪ 𝑁2|, |N1|, |N2|, and for 𝐽(𝑁1, 𝑁2) = |𝑁1 ∩ 𝑁2| / |𝑁1 ∪ 𝑁2|
Plug in to C(𝑁1, 𝑁2) = |𝑁1 ∩ 𝑁2| / √(|𝑁1| |𝑁2|):
C(𝑁1, 𝑁2) = 𝐽(𝑁1, 𝑁2) ⋅ |𝑁1 ∪ 𝑁2| / √(|𝑁1| |𝑁2|)
Next: Back to distinct counting
Inverse-probability distinct count estimators
Separately estimate the “presence” of each element
Historic Inverse-probability distinct count estimators
General approach for deriving estimators: for all distributions, all Min-Hash sketch types
About 1/2 the variance of purely sketch-based estimators
Inverse probability estimators [Horvitz Thompson 1952]
Model: There is a hidden value 𝑓 ≥ 0. It is observed/sampled with probability 𝑝 > 0. We want to estimate 𝑓. If 𝑓 is sampled we know both 𝑓, 𝑝 and can compute 𝑓/𝑝.
Inverse Probability Estimator: If 𝑓 is sampled, f̂ = 𝑓/𝑝. Else, f̂ = 0.
f̂ is unbiased: 𝐸[f̂] = (1 − 𝑝) ⋅ 0 + 𝑝 ⋅ (𝑓/𝑝) = 𝑓
Var[f̂] = E[f̂²] − 𝑓² = 𝑝 (𝑓/𝑝)² − 𝑓² = 𝑓² (1/𝑝 − 1)
Comment: the variance is the minimum possible for an unbiased nonnegative estimator if the domain includes 𝑓 = 0
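The estimator and its unbiasedness can be demonstrated in a few lines (illustrative simulation, not from the lecture):

```python
import random

def ht_estimate(f, p, sampled):
    # Inverse-probability (Horvitz-Thompson) estimate of the hidden value f.
    return f / p if sampled else 0.0

# Unbiasedness check: the estimate averages to f over many trials.
random.seed(1)
f, p, trials = 5.0, 0.25, 100000
mean = sum(ht_estimate(f, p, random.random() < p) for _ in range(trials)) / trials
```

The price of unbiasedness is variance f²(1/p − 1), which blows up as p → 0.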
Inverse-Probability estimate for a sum
We want to estimate the sum 𝐹 = Σ𝑖 𝑓(𝑖). We have a sample 𝑆 of elements; 𝑓(𝑖) > 0 ⟹ 𝑝(𝑖) > 0, and we know 𝑓(𝑖), 𝑝(𝑖) when 𝑖 ∈ 𝑆.
We use: f̂(𝑖) = 𝑓(𝑖)/𝑝(𝑖) when 𝑖 ∈ 𝑆; f̂(𝑖) = 0 otherwise.
Sum estimator: F̂ = Σ𝑖 f̂(𝑖) = Σ_{𝑖 ∈ 𝑆} 𝑓(𝑖)/𝑝(𝑖)
Unbiased f̂(𝑖) implies unbiased F̂. This is important, so bias does not add up.
For distinct count, 𝑓(𝑖) = I{𝑖 ∈ 𝑁} (indicator function).
Inverse-Probability estimate for a sum
𝑝(𝑖) can be conditioned on a part in some partition of outcomes. But elements with f(𝑖) > 0 must have 𝑝(𝑖) > 0 in all parts (otherwise we get bias).
We want to estimate the sum 𝐹 = Σ𝑖 𝑓(𝑖). We have a sample 𝑆 of elements; 𝑓(𝑖) > 0 ⟹ 𝑝(𝑖) > 0, and we know 𝑓(𝑖), 𝑝(𝑖) when 𝑖 ∈ 𝑆.
We use: f̂(𝑖) = 𝑓(𝑖)/𝑝(𝑖) when 𝑖 ∈ 𝑆; f̂(𝑖) = 0 otherwise.
Sum estimator: F̂ = Σ𝑖 f̂(𝑖) = Σ_{𝑖 ∈ 𝑆} 𝑓(𝑖)/𝑝(𝑖)
Bottom-k sketches: Inverse probability estimator
We work with the uniform distribution ℎ ∼ 𝑈[0, 1]
For each distinct element, we consider the probability that it is one of the 𝑘 − 1 lowest-hash elements.
For sketch 𝑦1 < ⋯ < 𝑦𝑘, we say element 𝑖 is “sampled” ⟺ 𝑦𝑗 = ℎ(𝑖) for some 𝑗 ≤ 𝑘 − 1
Caveat: The probability is 𝑝 = (𝑘 − 1)/𝑛 for all elements, but we do not know 𝑛. ⇒ Need to use conditioning.
Bottom-k sketches: Inverse probability estimator
We use an inverse probability estimate: If 𝑖 is not sampled (not one of the 𝑘 − 1 smallest-hash elements), the estimate is 0. Otherwise, it is 1/𝒑(𝒊).
But we do not know 𝒑! What can we do?
We only need to be able to compute 𝒑(𝒊) for “sampled” elements.
We compute 𝒑(𝒊) conditioned on fixing ℎ on 𝑵 ∖ {𝒊} but taking 𝐡(𝒊) ∼ 𝑈[0, 1]
Bottom-k sketches: Inverse probability estimator
What is the probability that 𝑖 is sampled if we fix ℎ on 𝑁 ∖ {𝑖} but take ℎ(𝑖) ∼ 𝑈[0, 1]?
𝑖 is sampled ⟺ ℎ(𝑖) < (𝑘 − 1)th smallest of {ℎ(𝑗) : 𝑗 ∈ 𝑁 ∖ {𝑖}}
For sampled 𝑖, the (𝑘 − 1)th smallest of {ℎ(𝑗) : 𝑗 ∈ 𝑁 ∖ {𝑖}} = 𝑦𝑘 ⟹ 𝑝(𝑖) = 𝑦𝑘
⟹ The inverse probability estimate is 1/𝑝(𝑖) = 1/𝑦𝑘
Summing over the 𝑘 − 1 “sampled” elements: n̂ = (𝑘 − 1)/𝑦𝑘
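The resulting estimator is a one-liner; the simulation below (illustrative) checks that it is unbiased for a known 𝑛:

```python
import random

def distinct_estimate(sketch):
    # sketch: the k smallest hash values, in increasing order.
    # Inverse-probability estimate: (k - 1) / y_k.
    k = len(sketch)
    return (k - 1) / sketch[-1]

# Sanity check: n uniform hashes, bottom-k sketch, estimate averages to n.
random.seed(3)
n, k, trials = 1000, 50, 2000
mean = sum(
    distinct_estimate(sorted(random.random() for _ in range(n))[:k])
    for _ in range(trials)
) / trials
```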
Explaining conditioning in the Inverse Probability Estimate for bottom-k
Probability space on {ℎ(𝑗) : 𝑗 ∈ 𝑁 ∖ {𝑖}}, partitioned according to 𝜏 = (𝑘 − 1)th smallest of {ℎ(𝑗) : 𝑗 ∈ 𝑁 ∖ {𝑖}}
The conditional probability that 𝑖 is sampled in the part is Pr[ℎ(𝑖) < 𝜏] = 𝜏
If 𝑖 is “sampled” in the outcome, we know 𝜏 (it is equal to 𝑦𝑘); the estimate is 1/𝜏. (If 𝑖 is not sampled then 𝜏 = 𝑦𝑘−1 > 0; this is needed for unbiasedness, but the estimate for 𝑖 is 0.)
Explaining conditioning in the Inverse Probability Estimate for bottom-k
𝑁 = {𝒂, 𝒃, 𝒄, 𝒅, 𝒆}, 𝑘 = 3
The probability that 𝒂 has one of the 𝑘 − 1 = 2 smallest values in ℎ(𝑎), …, ℎ(𝑒) is Pr[𝒂 ∈ 𝑆] = 2/𝟓, but we cannot compute it since we do not know 𝒏 (= 𝟓).
The conditional probability Pr[𝒂 ∈ 𝑆 ∣ ℎ(𝑏), …, ℎ(𝑒)] can be computed. It is the (𝑘 − 1 = 2)nd smallest value in ℎ(𝑏), …, ℎ(𝑒).
Explaining conditioning in Inverse
Probability Estimate for bottom

k
(.1,.3,.5,.6)
(.2,.3,.5,.71)
(.15,.3,.32,.4)
(.1,.2,.7,.8)
(.11,.2,.28,.3)
(.03,.2,.4,.66)
(.12,.4,.45,.84)
(.1,.4,.5,.8)
ℎ
,
…
,
ℎ
(
)
=
3
?
Pr
∈
𝑆
ℎ
,
…
,
ℎ
(
)
]
𝜏
=
0
.
3
𝜏
=
0
.
2
𝜏
=
0
.
4
Bottom-k sketches: Inverse probability estimators
n̂ = (k − 1)/𝑦𝑘
We obtain an unbiased estimator.
No need to track element IDs (the sample view is only used for analysis).
How good is this estimator? We can (but do not) show: the CV is 𝜎/𝜇 ≤ 1/√(𝑘 − 2), at least as good as the k-mins estimator.
Better distinct count estimators?
Recap: Our estimators (k-mins, bottom-k) have CV 𝜎/𝜇 ≤ 1/√(𝑘 − 2)
CRLB (k-mins) says CV 𝜎/𝜇 ≥ 1/√𝑘
Can we improve? Also, what about k-partition?
The CRLB applies when we are limited to using only the information in the sketch.
Idea: Use information we discard along the way
“Historic” Inverse Probability Estimators
We maintain an approximate count together with the sketch: (𝑦1, …, 𝑦𝑘, 𝑐)
Initially (𝑦1, …, 𝑦𝑘) ← (1, …, 1), 𝑐 ← 0
When the sketch is updated, we compute the probability 𝑝 that a new distinct element would cause an update to the current sketch. We increase the counter 𝑐 ← 𝑐 + 1/𝑝
We can (but do not) show: CV 𝜎/𝜇 ≤ 1/√(2𝑘 − 2) < (1/√2) ⋅ 1/√(𝑘 − 2)
Easy to apply with all min-hash sketches
The estimate is unbiased
Maintaining a k-mins “historic” sketch
k-mins sketch: Use 𝑘 “independent” hash functions ℎ1, ℎ2, …, ℎ𝑘. Track the respective minima 𝑦1, 𝑦2, …, 𝑦𝑘.
Processing a new element 𝑥:
𝑝 ← 1 − ∏𝑖 (1 − 𝑦𝑖)
For 𝑖 = 1, …, 𝑘: 𝑦𝑖 ← min{𝑦𝑖, ℎ𝑖(𝑥)}
If change in (𝑦1, …, 𝑦𝑘): 𝑐 ← 𝑐 + 1/𝑝
Update probability: the probability that at least for one 𝑖 = 1, …, 𝑘 we get ℎ𝑖(𝑥) < 𝑦𝑖: 𝑝 = 1 − ∏𝑖 (1 − 𝑦𝑖)
Maintaining a k-partition “historic” sketch
Processing a new element 𝑥:
𝑖 ← first log₂𝑘 bits of ℎ′(𝑥)
ℎ ← remaining bits of ℎ′(𝑥)
If ℎ < 𝑦𝑖: 𝑝 ← (1/𝑘) Σ𝑗 𝑦𝑗 ; 𝑐 ← 𝑐 + 1/𝑝 ; 𝑦𝑖 ← ℎ
Update probability: the probability that ℎ < 𝑦𝑖 for a part 𝑖 selected uniformly at random: 𝑝 = (1/𝑘) Σ𝑖 𝑦𝑖
Maintaining a bottom-k “historic” sketch
Bottom-k sketch: Use a single hash function ℎ. Track the 𝑘 smallest values 𝑦1 < 𝑦2 < ⋯ < 𝑦𝑘.
Processing a new element 𝑥:
If ℎ(𝑥) < 𝑦𝑘:
𝑐 ← 𝑐 + 1/𝑦𝑘
(𝑦1, 𝑦2, …, 𝑦𝑘) ← sort{𝑦1, 𝑦2, …, 𝑦𝑘−1, ℎ(𝑥)}
Probability of update: 𝑝 = 𝑦𝑘
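A compact Python sketch of the bottom-k “historic” counter (an illustration of the update rule above, assuming distinct elements receive distinct hash values):

```python
import random

def historic_bottom_k_count(hashes, k):
    # Maintain the k smallest hash values plus an inverse-probability
    # count c: an update had probability y_k, so it contributes 1/y_k.
    ys = [1.0] * k
    c = 0.0
    for h in hashes:
        if h < ys[-1] and h not in ys:   # a new, sketch-changing hash
            c += 1.0 / ys[-1]            # y_k before the update
            ys = sorted(ys[:-1] + [h])
    return c

# Sanity check: the count averages to the number of distinct elements.
random.seed(5)
n, k, trials = 500, 32, 300
mean = sum(
    historic_bottom_k_count([random.random() for _ in range(n)], k)
    for _ in range(trials)
) / trials
```

For the first 𝑘 distinct elements the update probability is 1, so the counter is exact until the sketch fills.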
Summary: Historic distinct estimators
Recap: Maintain sketch and count 𝑐. CV is 1/√(2𝑘 − 2) < (1/√2) ⋅ 1/√(𝑘 − 2)
Easy to apply. Trivial to query. Unbiased.
More (we do not show here):
The CV is almost tight for this type of estimator (estimate the presence of each distinct element entering the sketch) ⟹ can’t do better than 1/√(2𝑘)
Mergeability: Stated for streams. “Sketch” parts are mergeable, but merging “counts” requires work (which uses the sketch parts). Approach: carefully estimate the overlap (say, using similarity estimators).
Next: Working with a small range
So far Min-Hash sketches were stated/analyzed for distributions (random hash functions) with a continuous range
We explain how to work with a discrete range, how small the representation can be, and how estimators are affected.
Back-of-the-envelope calculations
Working with a small (discrete) range
When implementing min-hash sketches:
We work with a discrete distribution for the hash range
We want to use as few bits as possible to represent the sketch
Natural discrete distribution: ℎ(𝑥) = 2^−𝑖 with probability 2^−𝑖
Same as using u ∼ 𝑈[0, 1] and retaining only the negated exponent ⌊log₂(1/u)⌋.
The expectation of the minimum is about 1/n ≈ 2^−log₂ n, so the expected maximum exponent size is ≈ log₂ log₂ 𝑛 bits.
Elements sorted by hash: 0.1xxxxx, 0.01xx, 0.001xx, 0.0001xx (negated exponents 1, 2, 3, 4)
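The rounding step is a one-liner (illustrative; `negated_exponent` is a hypothetical helper name):

```python
import math

def negated_exponent(u):
    # Round u ~ U[0,1) down to the form 2^-i: keep i = floor(log2(1/u)).
    return math.floor(math.log2(1.0 / u))
```

Only the exponent is stored, so a register needs about log₂ log₂ 𝑛 bits instead of a full-precision value.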
Working with a small (discrete) range
Can also retain a few (𝑏) bits beyond the exponent. The size of each register is then ≈ 𝑏 + log₂ log₂ 𝑛 bits.
This can be reduced further to log₂ log₂ 𝑛 + 𝑘𝑏 in total by noting that the exponent parts are very similar, so we can store only the minimum exponent once and keep “offsets”.
How does this rounding affect the estimators (properties and accuracy)? We do “back-of-the-envelope” calculations.
Working with a small (discrete) range
k-mins and k-partition: we can separately look at each “coordinate”. The expected number of elements with the same “minimum” exponent is fixed. (The probability of exponent 𝑖 is 2^−𝑖, so the expectation is 𝑛2^−𝑖.) So we can work with a fixed 𝑏.
“Parameter estimation” estimators; similarity estimators: We need to keep enough bits to ensure distinctness of the min-hash values in the same sketch (for similarity, in two sketches) with good probability.
To apply “continuous” estimators, we can take a random completion and apply the estimators.
Working with a small (discrete) range
Bottom-k: we need to separate the 𝑘 smallest values. We expect about 𝑘/2 of them to have the maximum represented exponent, so we need ≈ log log 𝑛 + log 𝑘 bits per register. We work with 𝑏 = log₂ 𝑘.
“Parameter estimation” estimators; similarity estimators: We need to keep enough bits to ensure distinctness of the min-hash values in the same sketch (for similarity, in two sketches) with good probability.
To apply “continuous” estimators, we can take a random completion and apply the estimators.
Working with a small (discrete) range
Inverse probability (also historic) estimators:
The estimators apply directly to a discrete range: simply work with the probability that a hash from the discrete domain is strictly below the current “threshold”.
Unbiasedness still holds (on streams) even with likely hash collisions (with k-mins and k-partition).
The variance increases by × 1/(1 − 2^−𝑏) ⇒ we get most of the value of the continuous domain with small 𝑏.
For mergeability (to support the needed similarity-like estimates to merge counts), or with bottom-k, we need to work with a larger 𝑏 to ensure that hash collisions are not likely (in the same sketch or in two sketches).
Distinct counting/Min-Hash sketches bibliography 1
First use of k-mins Min-Hash sketches for distinct counting; first streaming algorithm for approximate distinct counting:
P. Flajolet and N. Martin, “Probabilistic Counting Algorithms for Data Base Applications”, JCSS (31), 1985.
Use of Min-Hash sketches for similarity, union size, mergeability, size estimation (k-mins; proposes bottom-k):
E. Cohen, “Size estimation framework with applications to transitive closure and reachability”, JCSS (55), 1997.
Use of shingling with k-mins sketches for Jaccard similarity of text documents:
A. Broder, “On the Resemblance and Containment of Documents”, Sequences 1997.
A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic Clustering of the Web”, SRC technical note, 1997.
Better similarity estimators (beyond the union sketch) from bottom-k samples:
E. Cohen and H. Kaplan, “Leveraging discarded samples for tighter estimation of multiple-set aggregates”, SIGMETRICS 2009.
Asymptotic lower bound on distinct counter size (taking into account hash representation):
N. Alon, Y. Matias, M. Szegedy, “The space complexity of approximating the frequency moments”, STOC 1996.
Introducing k-partition sketches for distinct counting:
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan, “Counting distinct elements in a data stream”, RANDOM 2002.
Distinct counting/Min-Hash sketches bibliography 2
Practical distinct counters based on k-partition sketches:
P. Flajolet, E. Fusy, O. Gandouet, F. Meunier, “HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm”.
S. Heule, M. Nunkeser, A. Hall, “HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm”, EDBT 2013.
Theoretical algorithm with asymptotic bounds that match the AMS lower bound:
D. M. Kane, J. Nelson, D. P. Woodruff, “An optimal algorithm for the distinct elements problem”, PODS 2010.
Inverse probability “historic” estimators; application of Cramér–Rao on min-hash sketches:
E. Cohen, “All-Distances Sketches, Revisited: Scalable Estimation of the Distance Distribution and Centralities in Massive Graphs”, arXiv 2013.
The concepts of min-hash sketches and sketch coordination are related to concepts from the survey sampling literature: order samples (bottom-k), coordination of samples using the PRN method (Permanent Random Numbers).
More on bottom-k sketches; ML estimator for bottom-k:
E. Cohen, H. Kaplan, “Summarizing data using bottom-k sketches”, PODC 2007; “Tighter estimation using bottom-k sketches”, VLDB 2008.
Inverse probability estimator with priority (a type of bottom-k) sketches:
N. Alon, N. Duffield, M. Thorup, C. Lund, “Estimating arbitrary subset sums with few probes”, PODS 2005.