Leveraging Big Data: Lecture 3


Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

http://www.cohenwang.com/edith/bigdataclass2013

Overview: More on Min-Hash Sketches

- Subset/Selection size queries from random samples
- Min-hash sketches as samples
- Other uses of the sampling “view” of sketches:
  - Sketch-based similarity estimation
  - Inverse-probability distinct count estimators
  - Min-hash sketches on a small range (fewer bits)

How samples are useful

We often want to know more than just the number of distinct elements:

- How many distinct search queries (or distinct query/location pairs)...
  - involve the recent election?
  - are related to flu?
  - reflect financial uncertainty?
- How many distinct IP flows going through our network...
  - use a particular protocol?
  - originate from a particular location?

Such subset queries are specified by a predicate. They can be answered approximately from the sample.

Min-hash Sketches as Random Samples

A min-hash sketch can be viewed as a random sample: a distinct element is “sampled” if it contributed to the sketch $s$.

- To facilitate subset queries, we need to retain meta-data/IDs of the sampled elements.
- Min-hash samples can be efficiently computed:
  - over data streams
  - over distributed data (using mergeability)

k-mins sketch as a sample

Example with $k = 3$ and distinct elements $N = \{32, 12, 14, 7, 6, 4\}$: three independent hash functions $h_1, h_2, h_3$ assign each element a uniform value in $[0,1]$, and the sketch keeps the minimum of each:

$(y_1, y_2, y_3) = (0.14, 0.07, 0.10)$

k-mins sample: the elements attaining the three minima, here $(6, 14, 32)$.

Sampling scheme: $k$ times with replacement.

k-partition sketch as a sample

Example with $k = 3$ and $N = \{32, 12, 14, 7, 6, 4\}$: a part-hash assigns each element to one of $k$ buckets and a value-hash assigns it a uniform value in $[0,1]$; the sketch keeps the minimum value-hash of each nonempty bucket:

$(y_1, y_2, y_3) = (0.07, 0.19, 0.20)$

k-partition sample: the element attaining the minimum in each nonempty bucket, here $(14, 32, 4)$.

Sampling scheme: throw elements into $k$ buckets; choose one uniformly from each nonempty bucket.

Bottom-k sketch as a sample

Example with $k = 3$ and $N = \{32, 12, 14, 7, 6, 4\}$: a single hash function assigns each element a uniform value in $[0,1]$; the sketch keeps the $k$ smallest values:

$\{y_1, y_2, y_3\} = \{0.07, 0.19, 0.20\}$

Bottom-k sample: the elements with the $k$ smallest hash values, here $\{14, 32, 4\}$.

Sampling scheme: choose $k$ elements without replacement.
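The three sampling schemes above can be sketched in a few lines of code. The following is an illustrative Python sketch (not from the lecture); `hash01` is an assumed helper that stands in for a random hash $h(x) \sim U[0,1)$.

```python
# Illustrative only: computing k-mins, k-partition, and bottom-k samples of a set
# of distinct element IDs. hash01 is an assumed stand-in for a random hash in [0,1).
import hashlib

def hash01(x, salt):
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def k_mins_sample(elements, k):
    # k independent hash functions; keep the element attaining each minimum (with replacement).
    return [min(elements, key=lambda x: hash01(x, i)) for i in range(k)]

def k_partition_sample(elements, k):
    # One hash picks a bucket, another the value; keep the min-value element per nonempty bucket.
    buckets = {}
    for x in elements:
        b = int(hash01(x, "part") * k)
        v = hash01(x, "val")
        if b not in buckets or v < buckets[b][0]:
            buckets[b] = (v, x)
    return [x for (_, x) in buckets.values()]

def bottom_k_sample(elements, k):
    # The k elements with the smallest value hash (without replacement).
    return sorted(elements, key=lambda x: hash01(x, "val"))[:k]

elements = {32, 12, 14, 7, 6, 4}
print(k_mins_sample(elements, 3), k_partition_sample(elements, 3), bottom_k_sample(elements, 3))
```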

Selection/Subset queries from min-hash samples

Let $P$ be the set of elements satisfying our selection predicate. We want to estimate:

- the number $|P \cap N|$ of distinct elements satisfying the predicate, or
- their fraction $\alpha = \frac{|P \cap N|}{|N|}$.

$k' \leq k$ distinct elements are sampled.

- The sample is exchangeable (fixing the sample size, all subsets are equally likely).
- When $n \gg k$, all three schemes are similar.




Subset queries: k-mins samples

Our estimator for a k-mins sample $(x_1, \ldots, x_k)$ (sampled $k$ times with replacement) is

$\hat{\alpha} = \frac{1}{k}\sum_{i=1}^{k} I_{x_i \in P}$

Expectation: $\mu = \alpha$. Variance: $\sigma^2 = \frac{\alpha(1-\alpha)}{k}$.

Why: one uniform sample $x \in N$ has probability $\alpha$ of being from $P$. Its “presence” $I_{x \in P}$ is 1 with probability $\alpha$ and 0 with probability $1-\alpha$. The expectation and variance of $I_{x \in P}$ are

$\mu = \alpha \cdot 1 + (1-\alpha) \cdot 0 = \alpha$
$\sigma^2 = \alpha \cdot 1^2 + (1-\alpha) \cdot 0^2 - \mu^2 = \alpha(1-\alpha)$
)

Subset queries: bottom-k and k-partition samples

Sampling is without replacement:

- exactly $k' = k$ times with bottom-k;
- $k' \leq k$ times with k-partition ($k'$ is the number of nonempty “buckets” when tossing $n$ balls into $k$ buckets).

We use the estimator $\hat{\alpha} = \frac{|P \cap S|}{|S|} = \frac{|P \cap S|}{k'}$. We show:

- the expectation is $\mathbf{E}[\hat{\alpha}] = \frac{|P \cap N|}{|N|} = \alpha$;
- the variance (conditioned on $k'$) is $\sigma^2 = \frac{\alpha(1-\alpha)}{k'}\left(1 - \frac{k'-1}{n-1}\right)$.

Expectation of $\hat{\alpha}$ (k-partition and bottom-k)

We condition on the number $k'$ of sampled (distinct) elements:

- Consider the “positions” $i = 1, \ldots, k'$ in the sample and their “contributions” $T_i$ to $\hat{\alpha}$. We have $\hat{\alpha} = \sum_{i=1}^{k'} T_i$.
- If position $i$ gets an element of $P \cap N$ (probability $\alpha$), then $T_i = \frac{1}{k'}$. Otherwise, $T_i = 0$.
- Therefore $\mathbf{E}[T_i] = \frac{\alpha}{k'}$ and $\mathrm{Var}[T_i] = \alpha\frac{1}{k'^2} - \frac{\alpha^2}{k'^2} = \frac{\alpha(1-\alpha)}{k'^2}$.

From linearity of expectation, $\mathbf{E}[\hat{\alpha}] = \sum_{i=1}^{k'} \mathbf{E}[T_i] = k'\cdot\frac{\alpha}{k'} = \alpha$.

k-partition: since this is the expectation for every possible $k'$, it is also the expectation overall.

Variance of $\hat{\alpha}$ (k-partition and bottom-k)

Conditioned on $k'$:

$\mathrm{Var}[\hat{\alpha}] = \sum_{i,j \in \{1,\ldots,k'\}} \mathrm{Cov}[T_i, T_j]$

For $i \neq j$:

$\mathrm{Cov}[T_i, T_j] = \mathbf{E}[T_i T_j] - \frac{\alpha^2}{k'^2} = \frac{\alpha\,\frac{\alpha n - 1}{n-1}}{k'^2} - \frac{\alpha^2}{k'^2} = -\frac{\alpha(1-\alpha)}{(n-1)k'^2}$

For $i = j$: $\mathrm{Cov}[T_i, T_i] = \mathrm{Var}[T_i] = \frac{\alpha(1-\alpha)}{k'^2}$.

$\mathrm{Var}[\hat{\alpha}] = k'\frac{\alpha(1-\alpha)}{k'^2} - k'(k'-1)\frac{\alpha(1-\alpha)}{(n-1)k'^2} = \frac{\alpha(1-\alpha)}{k'}\left(1 - \frac{k'-1}{n-1}\right)$

Subset estimation: Summary

For any predicate, we obtain an unbiased estimator $\hat{\alpha}$ of the fraction $\alpha = \frac{|P \cap N|}{|N|} \in [0,1]$ with standard deviation $\sigma \leq \frac{\sqrt{\alpha(1-\alpha)}}{\sqrt{k'}} \leq \frac{1}{2\sqrt{k'}}$.

- More accurate when $\alpha$ is close to 0 or to 1.
- With bottom-k, more accurate when $k = \Omega(n)$.


Next: Sketch-based similarity estimation

- Applications of similarity
- Modeling using features
- Scalability using sketches
- Terms and the shingling technique for text documents
- Jaccard and cosine similarity
- Sketch-based similarity estimators

Search example

A user issues a query (over images, movies, text documents, Web pages). The search engine finds many matching documents: Doc1, Doc1', Doc1'', Doc2, Doc2', Doc2'', Doc3, Doc3', Doc3''.

Elimination of near duplicates

There is a lot of redundant information: many of the matching documents are very similar. We want to eliminate near-duplicates, keep one representative of each group (Doc1, Doc2', Doc3), and return to the human user a concise, informative result.

Identifying similar documents in a collection of documents (text/images)

Why is similarity interesting?

- Search (the query is also treated as a “document”)
- Finding text documents on a similar topic
- Face recognition
- Labeling documents (in a collection of images where only some are labeled, extend labels by similarity)
- ...

Identifying near-duplicates (very similar documents)

Why do we want to find near-duplicates?

- Plagiarism
- Copyright violations
- Cleaning up search results

Why do we find many near-duplicates?

- Mirror pages
- Variations on the same source

Exact match is easy: use a hash/signature.

Document Similarity

Modeling:

- Identify a set of features for our similarity application.
- Similar documents should have similar features: similarity is captured by the similarity of the feature sets/vectors (using a similarity measure).
- Analyse each document to extract the set of relevant features.

Sketch-based similarity: Making it scalable

- Sketch the set of features of each document (Doc1 -> Sketch1, Doc2 -> Sketch2), e.g. binary feature vectors (0,0,1,0,1,1,0,...) and (1,0,1,1,1,1,0,...), such that similar feature sets imply similar sketches.
- Estimate the similarity of two feature sets from the similarity of the two sketches.

Similarity of text documents

What is a good set of features?

Approach:

- Features = words (terms)
- View each document as a bag of words
- Similar documents have similar bags

This works well (with TF/IDF weighting...) for detecting documents on a similar topic. It is not geared toward detecting near-duplicates.

Shingling technique for text documents (Web pages) [Broder 97]

For a parameter $k$:

- Each feature corresponds to a $k$-gram (shingle): an ordered set of $k$ “tokens” (words).
- Very similar documents have similar sets of features (even if sentences are shifted or replicated).

All 3-shingles in the title: “Shingling technique for”, “technique for text”, “for text documents”, “text documents Web”, “documents Web pages”.
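A minimal illustration of shingle extraction (the helper name is ours), reproducing the 3-shingle example above:

```python
# Illustrative extraction of k-shingles (k consecutive tokens) from text.
def shingles(text, k=3):
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

print(shingles("Shingling technique for text documents Web pages", k=3))
# {'Shingling technique for', 'technique for text', 'for text documents',
#  'text documents Web', 'documents Web pages'}
```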

Similarity measures

We measure the similarity of two documents by the similarity of their feature sets/vectors.

Comment: we will focus on sets/binary vectors today. In general, we sometimes want to associate “weights” with the presence of features in a document.

Two popular measures are:

- the Jaccard coefficient
- cosine similarity

Jaccard Similarity

A common similarity measure of two sets: the ratio of the size of the intersection to the size of the union. For the feature sets $F_1$ of document 1 and $F_2$ of document 2,

$J(F_1, F_2) = \frac{|F_1 \cap F_2|}{|F_1 \cup F_2|}$

Example: 3 shared features out of 8 distinct features gives $J = \frac{3}{8} = 0.375$.
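A direct computation of the coefficient (illustrative; the example sets are ours, chosen to reproduce the 3/8 from the slide):

```python
def jaccard(f1, f2):
    # |F1 ∩ F2| / |F1 ∪ F2|
    f1, f2 = set(f1), set(f2)
    return len(f1 & f2) / len(f1 | f2)

f1 = {"a", "b", "c", "d", "e"}
f2 = {"c", "d", "e", "f", "g", "h"}
print(jaccard(f1, f2))  # 3 shared features out of 8 distinct -> 0.375
```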

Comment: Weighted Jaccard

Similarity of weighted (nonnegative) vectors: the sum of coordinate-wise minima over the sum of coordinate-wise maxima,

$J(v, u) = \frac{\sum_i \min\{v_i, u_i\}}{\sum_i \max\{v_i, u_i\}}$

Example:

$u = (0.00, 0.23, 0.00, 0.00, 0.03, 0.00, 1.00, 0.13)$
$v = (0.34, 0.21, 0.00, 0.03, 0.05, 0.00, 1.00, 0.00)$
$\min = (0.00, 0.21, 0.00, 0.00, 0.03, 0.00, 1.00, 0.00)$
$\max = (0.34, 0.23, 0.00, 0.03, 0.05, 0.00, 1.00, 0.13)$

$J(v, u) = \frac{1.24}{1.78} \approx 0.70$
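The same computation in code, using the vectors from the example:

```python
# Weighted Jaccard: sum of coordinate-wise minima over sum of coordinate-wise maxima.
def weighted_jaccard(v, u):
    return sum(min(a, b) for a, b in zip(v, u)) / sum(max(a, b) for a, b in zip(v, u))

u = [0.00, 0.23, 0.00, 0.00, 0.03, 0.00, 1.00, 0.13]
v = [0.34, 0.21, 0.00, 0.03, 0.05, 0.00, 1.00, 0.00]
print(weighted_jaccard(v, u))  # 1.24 / 1.78 ≈ 0.70
```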


Cosine Similarity

A similarity measure between two vectors: the cosine of the angle $\theta$ between them,

$C(v, u) = \frac{v \cdot u}{\|v\|_2\,\|u\|_2}$, where the Euclidean norm is $\|v\|_2 = \sqrt{\sum_i v_i^2}$.

Cosine Similarity (binary)

View each set $N_i$ as a binary vector $v(N_i)$ with an entry for each element $j$ of the domain: $v_j(N_i) = 1$ if $j \in N_i$ and $v_j(N_i) = 0$ otherwise.

Cosine similarity between $N_1$ and $N_2$:

$C(N_1, N_2) = \frac{v(N_1) \cdot v(N_2)}{\|v(N_1)\|_2\,\|v(N_2)\|_2} = \frac{|N_1 \cap N_2|}{\sqrt{|N_1|}\sqrt{|N_2|}}$

Example: $C = \frac{3}{\sqrt{5 \cdot 6}} \approx 0.55$.
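A short illustration of the binary case (the example sets are ours, matching the 3/√30 ≈ 0.55 from the slide):

```python
import math

# Cosine similarity of two sets viewed as binary indicator vectors:
# C(N1, N2) = |N1 ∩ N2| / sqrt(|N1| * |N2|).
def cosine_binary(n1, n2):
    n1, n2 = set(n1), set(n2)
    return len(n1 & n2) / math.sqrt(len(n1) * len(n2))

n1 = {"a", "b", "c", "d", "e"}
n2 = {"c", "d", "e", "f", "g", "h"}
print(cosine_binary(n1, n2))  # 3 / sqrt(5 * 6) ≈ 0.55
```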

Estimating similarity of sets using their Min-Hash sketches

- We sketch all sets using the same hash functions. There is a special relation between the sketches: we say the sketches are “coordinated”.
- Coordination is what allows the sketches to be mergeable. If we had used different hash functions for each set, the sketches would not have been mergeable.
- Coordination also implies that similar sets have similar sketches (the LSH property). This allows us to obtain good estimates of the similarity of two sets from the similarity of the sketches of the sets.

Jaccard Similarity from Min-Hash sketches

$J(N_1, N_2) = \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$

For each set $N_i$ we have a Min-Hash sketch $s(N_i)$ (using the same hash function(s) for all sets).

- Merge $s(N_1)$ and $s(N_2)$ to obtain $s(N_1 \cup N_2)$.
- For each $x \in s(N_1 \cup N_2)$ we know everything about its membership in $N_1$ or $N_2$: $x \in s(N_1 \cup N_2)$ is in $N_i$ if and only if $x \in s(N_i)$. In particular, we know whether $x \in N_1 \cap N_2$.
- $J$ is the fraction of union members that are intersection members: apply the subset estimator to $s(N_1 \cup N_2)$.


k-mins sketches: Jaccard estimation

Example with $k = 4$:

$s(N_1) = (0.22, 0.11, 0.14, 0.22)$
$s(N_2) = (0.18, 0.24, 0.14, 0.35)$
$s(N_1 \cup N_2) = (0.18, 0.11, 0.14, 0.22)$

Each coordinate of the union sketch is classified as $N_1 \setminus N_2$, $N_2 \setminus N_1$, or $N_1 \cap N_2$, so we can estimate

$\alpha \in \left\{\frac{|N_1 \cap N_2|}{|N_1 \cup N_2|},\ \frac{|N_2 \setminus N_1|}{|N_1 \cup N_2|},\ \frac{|N_1 \setminus N_2|}{|N_1 \cup N_2|}\right\}$

unbiasedly with $\sigma^2 = \frac{\alpha(1-\alpha)}{k}$. Here

$\hat{\alpha}_{N_1 \cap N_2} = \frac{1}{4}$, $\hat{\alpha}_{N_2 \setminus N_1} = \frac{1}{4}$, $\hat{\alpha}_{N_1 \setminus N_2} = \frac{2}{4} = \frac{1}{2}$.

k-partition sketches: Jaccard estimation

Example with $k = 4$ (the value 1.00 marks an empty bucket):

$s(N_1) = (1.00, 1.00, 0.14, 0.21)$
$s(N_2) = (0.18, 1.00, 0.14, 0.35)$
$s(N_1 \cup N_2) = (0.18, 1.00, 0.14, 0.21)$

With $k'(N_1) = 2$, $k'(N_2) = 3$, and $k'(N_1 \cup N_2) = 3$ nonempty buckets, we can estimate

$\alpha \in \left\{\frac{|N_1 \cap N_2|}{|N_1 \cup N_2|},\ \frac{|N_2 \setminus N_1|}{|N_1 \cup N_2|},\ \frac{|N_1 \setminus N_2|}{|N_1 \cup N_2|}\right\}$

unbiasedly with $\sigma^2 = \frac{\alpha(1-\alpha)}{k'}$ (conditioned on $k'$). Here

$\hat{\alpha}_{N_1 \cap N_2} = \frac{1}{3}$, $\hat{\alpha}_{N_2 \setminus N_1} = \frac{1}{3}$, $\hat{\alpha}_{N_1 \setminus N_2} = \frac{1}{3}$.

Bottom-k sketches: Jaccard estimation

Example with $k = 4$:

$s(N_1) = \{0.09, 0.14, 0.18, 0.21\}$
$s(N_2) = \{0.14, 0.17, 0.19, 0.35\}$
$s(N_1 \cup N_2) = \{0.09, 0.14, 0.17, 0.18\}$ (the smallest $k = 4$ values in the union of the sketches)

We can estimate

$\alpha \in \left\{\frac{|N_1 \cap N_2|}{|N_1 \cup N_2|},\ \frac{|N_2 \setminus N_1|}{|N_1 \cup N_2|},\ \frac{|N_1 \setminus N_2|}{|N_1 \cup N_2|}\right\}$

unbiasedly with $\sigma^2 = \frac{\alpha(1-\alpha)}{k}\left(1 - \frac{k-1}{n-1}\right)$. Here

$\hat{\alpha}_{N_1 \cap N_2} = \frac{1}{4}$, $\hat{\alpha}_{N_2 \setminus N_1} = \frac{1}{4}$, $\hat{\alpha}_{N_1 \setminus N_2} = \frac{2}{4} = \frac{1}{2}$.
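A hedged sketch of the bottom-k case in code, assuming the sketches are stored simply as sets of hash values (coordinated, i.e. produced by the same hash function):

```python
# Illustrative: estimating the Jaccard coefficient from two coordinated bottom-k sketches.
def jaccard_from_bottom_k(s1, s2, k):
    s1, s2 = set(s1), set(s2)
    union_sketch = sorted(s1 | s2)[:k]                     # bottom-k sketch of N1 ∪ N2
    in_both = sum(1 for y in union_sketch if y in s1 and y in s2)
    return in_both / k                                     # fraction of union members in the intersection

# Values from the slide (k = 4): estimate is 1/4.
print(jaccard_from_bottom_k([0.09, 0.14, 0.18, 0.21], [0.14, 0.17, 0.19, 0.35], 4))
```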

Bottom-k sketches: better estimate

Same example ($k = 4$):

$s(N_1) = \{0.09, 0.14, 0.18, 0.21\}$
$s(N_2) = \{0.14, 0.17, 0.19, 0.35\}$
$s(N_1 \cup N_2) = \{0.09, 0.14, 0.17, 0.18\}$

We can look beyond the union sketch: we have complete membership information on all elements with $h(x) \leq \min\{\max s(N_1), \max s(N_2)\}$. Here this adds 0.19 and 0.21, so we have $k'' = 6 > k = 4$ elements (in general up to $2k$).

Using these $k''$ values, we can estimate

$\alpha \in \left\{\frac{|N_1 \cap N_2|}{|N_1 \cup N_2|},\ \frac{|N_2 \setminus N_1|}{|N_1 \cup N_2|},\ \frac{|N_1 \setminus N_2|}{|N_1 \cup N_2|}\right\}$

unbiasedly with $\sigma^2 = \frac{\alpha(1-\alpha)}{k''}\left(1 - \frac{k''-1}{n-1}\right)$ (conditioned on $k''$). Here

$\hat{\alpha}_{N_1 \cap N_2} = \frac{1}{6}$, $\hat{\alpha}_{N_2 \setminus N_1} = \frac{2}{6} = \frac{1}{3}$, $\hat{\alpha}_{N_1 \setminus N_2} = \frac{3}{6} = \frac{1}{2}$.
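An illustrative version of this “beyond the union sketch” estimator, under the same assumed representation as before:

```python
# Use every hash value not exceeding min(max s(N1), max s(N2)):
# membership in N1 and N2 is known for all of them.
def jaccard_from_bottom_k_extended(s1, s2):
    s1, s2 = set(s1), set(s2)
    tau = min(max(s1), max(s2))
    known = [y for y in s1 | s2 if y <= tau]   # k'' values with full membership information
    in_both = sum(1 for y in known if y in s1 and y in s2)
    return in_both / len(known)

# Slide values: k'' = 6 values, estimate 1/6 instead of 1/4 from the union sketch alone.
print(jaccard_from_bottom_k_extended([0.09, 0.14, 0.18, 0.21], [0.14, 0.17, 0.19, 0.35]))
```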

Cosine Similarity from Min-Hash sketches: Crude estimator

$J(N_1, N_2) = \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$ and $C(N_1, N_2) = \frac{|N_1 \cap N_2|}{\sqrt{|N_1|}\sqrt{|N_2|}}$.

We have estimates with good relative error (and concentration) for $|N_1 \cup N_2|$, $|N_1|$, and $|N_2|$. Plugging them in:

$\hat{C}(N_1, N_2) = \frac{\hat{J}(N_1, N_2)\,\widehat{|N_1 \cup N_2|}}{\sqrt{\widehat{|N_1|}}\,\sqrt{\widehat{|N_2|}}}$
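The plug-in step is a one-liner; the inputs below (Jaccard and size estimates) are hypothetical values, not from the slides:

```python
import math

# Crude cosine estimate: Jaccard estimate times the estimated union size,
# divided by the square roots of the estimated set sizes.
def cosine_plug_in(j_hat, union_hat, n1_hat, n2_hat):
    return j_hat * union_hat / math.sqrt(n1_hat * n2_hat)

print(cosine_plug_in(0.25, 8.0, 5.0, 6.0))  # ≈ 0.37 (hypothetical numbers)
```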

Next: Back to distinct counting

- Inverse-probability distinct count estimators
- Separately estimate the “presence” of each element
- “Historic” inverse-probability distinct count estimators
- A general approach for deriving estimators, for all distributions and all Min-Hash sketch types
- Achieves about 1/2 the variance of purely sketch-based estimators

Inverse probability estimators [Horvitz Thompson 1952]

Model: there is a hidden value $f \geq 0$. It is observed/sampled with probability $p > 0$. We want to estimate $f$. If $f$ is sampled, we know both $f$ and $p$ and can compute $\frac{f}{p}$.

Inverse Probability Estimator: if $f$ is sampled, $\hat{f} = \frac{f}{p}$; else, $\hat{f} = 0$.

$\hat{f}$ is unbiased:

$E[\hat{f}] = (1-p)\cdot 0 + p\cdot\frac{f}{p} = f$

$\mathrm{Var}[\hat{f}] = E[\hat{f}^2] - f^2 = p\cdot\frac{f^2}{p^2} - f^2 = f^2\left(\frac{1}{p} - 1\right)$

Comment: this variance is the minimum possible for an unbiased nonnegative estimator if the domain includes instances with $f = 0$.
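A minimal simulation of this estimator (illustrative only), confirming the unbiasedness stated above:

```python
import random

# Horvitz-Thompson: a hidden value f is observed with probability p;
# the estimate is f/p when observed and 0 otherwise.
def ht_estimate(f, p, rng):
    return f / p if rng.random() < p else 0.0

rng = random.Random(0)
f, p = 3.0, 0.25
estimates = [ht_estimate(f, p, rng) for _ in range(100_000)]
print(sum(estimates) / len(estimates))  # close to f = 3.0 (unbiased); Var = f^2 (1/p - 1)
```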



Inverse-Probability estimate for a sum

We want to estimate the sum $f = \sum_i f(i)$. We have a sample $S$ of elements; $f(i) > 0 \Rightarrow p(i) > 0$, and we know $f(i)$ and $p(i)$ when $i \in S$.

We use $\hat{f}(i) = \frac{f(i)}{p(i)}$ when $i \in S$, and $\hat{f}(i) = 0$ otherwise.

Sum estimator: $\hat{f} = \sum_i \hat{f}(i) = \sum_{i \in S}\frac{f(i)}{p(i)}$

- An unbiased $\hat{f}(i)$ implies an unbiased $\hat{f}$. This is important, so bias does not add up.
- For distinct counting, $f(i) = I_{i \in N}$ (an indicator function).
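A short sketch of the sum estimator (the dictionaries `f` and `p` are an assumed representation of the known values for sampled elements):

```python
# Sum estimator: over the sample S, add f(i)/p(i) for each sampled element i.
def ht_sum(sample, f, p):
    return sum(f[i] / p[i] for i in sample)

# Hypothetical example: two sampled elements with f = 1 (distinct-count indicators).
print(ht_sum(["a", "b"], f={"a": 1, "b": 1}, p={"a": 0.2, "b": 0.5}))  # 5 + 2 = 7
```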


Inverse-Probability estimate for a sum (conditioning)

The inclusion probability $p(i)$ can be conditioned on a part of some partition of the outcomes. But elements with $f(i) > 0$ must have $p(i) > 0$ in all parts (otherwise we get bias). The sum estimator is unchanged: $\hat{f} = \sum_{i \in S}\frac{f(i)}{p(i)}$.

Bottom-k sketches: Inverse probability estimator

- We work with the uniform distribution $h(x) \sim U[0,1]$.
- For each distinct element, we consider the probability that it is one of the $k-1$ lowest-hash elements.
- For a sketch $y_1 < y_2 < \cdots < y_k$, we say element $x$ is “sampled” if $h(x) = y_i$ for some $i \leq k-1$.
- Caveat: this probability is $\approx \frac{k-1}{n}$ for all elements, but we do not know $n$. We need to use conditioning.

We use an inverse probability estimate: if $x$ is not sampled (not one of the $k-1$ smallest-hash elements) the estimate is 0. Otherwise, it is $\frac{1}{p(x)}$.

But we do not know $p(x)$! What can we do?

- We only need to be able to compute $p(x)$ for “sampled” elements.
- We compute $p(x)$ conditioned on fixing $h$ on $N \setminus \{x\}$ but taking $h(x) \sim U[0,1]$.




Bottom-k sketches: Inverse probability estimator

What is the probability $p(x)$ that $x$ is sampled if we fix $h$ on $N \setminus \{x\}$ but take $h(x) \sim U[0,1]$?

- $x$ is sampled $\iff$ $h(x)$ is smaller than the $(k-1)$th smallest value of $h$ on $N \setminus \{x\}$.
- For a sampled $x$, the $(k-1)$th smallest value of $h$ on $N \setminus \{x\}$ is $y_k$, so $p(x) = y_k$.
- The inverse probability estimate is $\frac{1}{p(x)} = \frac{1}{y_k}$.

Summing over the $k-1$ “sampled” elements: $\hat{n} = \frac{k-1}{y_k}$
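In code, the resulting count estimate from a bottom-k sketch is a one-liner (illustrative; the sketch is assumed to be the sorted list of the k smallest hash values):

```python
# Each of the k-1 smallest hashes contributes 1/y_k, giving n_hat = (k - 1) / y_k.
def distinct_count_bottom_k(sketch):
    y = sorted(sketch)
    return (len(y) - 1) / y[-1]

print(distinct_count_bottom_k([0.09, 0.14, 0.18, 0.21]))  # (4 - 1) / 0.21 ≈ 14.3
```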



Explaining conditioning in the Inverse Probability Estimate for bottom-k

- Probability space: $h$ is fixed on $N \setminus \{x\}$ and $h(x) \sim U[0,1]$.
- It is partitioned according to $\tau =$ the $(k-1)$th smallest value of $h$ on $N \setminus \{x\}$.
- The conditional probability that $x$ is sampled within a part is $\Pr[h(x) < \tau] = \tau$.
- If $x$ is “sampled” in the outcome, we know $\tau$ (it is equal to $y_k$) and the estimate is $\frac{1}{\tau}$. (If $x$ is not sampled then $\tau = y_{k-1} > 0$; this is needed for unbiasedness, but the estimate for $x$ is 0.)

Example: $N = \{a, b, c, d, e\}$, $k = 3$.

The probability that $a$ has one of the $k - 1 = 2$ smallest values in $h(a), h(b), \ldots, h(e)$ is $\Pr[a \in S] = \frac{2}{5}$, but we cannot compute it since we do not know $n$ $(= 5)$.

The conditional probability $\Pr[a \in S \mid h(b), \ldots, h(e)]$ can be computed: it is the 2nd smallest value among $h(b), \ldots, h(e)$.

Example outcomes of $(h(b), \ldots, h(e))$, partitioned by $\tau$ = the 2nd smallest value:

- $\tau = 0.3$: (.1, .3, .5, .6), (.2, .3, .5, .71), (.15, .3, .32, .4)
- $\tau = 0.2$: (.1, .2, .7, .8), (.11, .2, .28, .3), (.03, .2, .4, .66)
- $\tau = 0.4$: (.12, .4, .45, .84), (.1, .4, .5, .8)

In each part, $\Pr[a \in S \mid h(b), \ldots, h(e)] = \tau$.

Bottom-k sketches: Inverse probability estimators

$\hat{n} = \frac{k-1}{y_k}$

- We obtain an unbiased estimator.
- There is no need to track element IDs (the sample view is only used for the analysis).

How good is this estimator? We can (but do not) show that the CV is $\frac{\sigma}{\mu} \leq \frac{1}{\sqrt{k-2}}$, at least as good as the k-mins estimator.

Better distinct count estimators ?

Recap:

- Our estimators (k-mins, bottom-k) have CV $\frac{\sigma}{\mu} \leq \frac{1}{\sqrt{k-2}}$.
- The CRLB (k-mins) says CV $\frac{\sigma}{\mu} \geq \frac{1}{\sqrt{k}}$.

Can we improve? Also, what about k-partition?

- The CRLB applies when we are limited to using only the information in the sketch.
- Idea: use information we discard along the way.


“Historic” Inverse Probability Estimators

We maintain an approximate count $c$ together with the sketch; initially $c \leftarrow 0$.

- When the sketch is updated, we compute the probability $p$ that a new distinct element would cause an update to the current sketch.
- We increase the counter: $c \leftarrow c + \frac{1}{p}$.

We can (but do not) show that the CV is $\frac{\sigma}{\mu} \leq \frac{1}{\sqrt{2k-2}} < \frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{k-2}}$.

- Easy to apply with all min-hash sketches.
- The estimate is unbiased.

Maintaining a k-mins “historic” sketch

k-mins sketch: use $k$ “independent” hash functions $h_1, h_2, \ldots, h_k$ and track the respective minima $y_1, y_2, \ldots, y_k$.

Processing a new element $x$:

- $p \leftarrow 1 - \prod_{i=1}^{k}(1 - y_i)$
- For $i = 1, \ldots, k$: $y_i \leftarrow \min\{y_i, h_i(x)\}$
- If there was a change in some $y_i$: $c \leftarrow c + \frac{1}{p}$

Update probability: $p$ is the probability that for at least one $i = 1, \ldots, k$ we get $h_i(x) < y_i$, i.e. $p = 1 - \prod_{i=1}^{k}(1 - y_i)$.
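A hedged sketch of this stream procedure in code (hash functions and class names are ours; the update rule follows the steps above):

```python
import hashlib

def hash01(x, i):
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class KMinsHistoricCounter:
    """Maintains k minima y_1..y_k and an approximate distinct count c."""
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k      # current minima of h_1 ... h_k
        self.c = 0.0            # approximate distinct count

    def process(self, x):
        # p: probability that a new distinct element updates the *current* sketch
        prod = 1.0
        for yi in self.y:
            prod *= (1.0 - yi)
        p = 1.0 - prod
        updated = False
        for i in range(self.k):
            v = hash01(x, i)
            if v < self.y[i]:
                self.y[i] = v
                updated = True
        if updated:
            self.c += 1.0 / p

counter = KMinsHistoricCounter(k=32)
for x in range(1000):      # duplicates never update the sketch, so they do not change c
    counter.process(x)
print(counter.c)           # unbiased estimate of 1000
```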



Maintaining a k-partition “historic” sketch

Processing a new element $x$:

- $i \leftarrow$ the first $\log_2 k$ bits of $h(x)$ (the part/bucket)
- $v \leftarrow$ the remaining bits of $h(x)$ (the value)
- If $v < y_i$: $c \leftarrow c + \frac{1}{p}$ (with $p$ computed from the current sketch), then $y_i \leftarrow v$

Update probability: $p$ is the probability that $v < y_i$ for a part $i$ selected uniformly at random (i.e. $p = \frac{1}{k}\sum_{i=1}^{k} y_i$, treating an empty bucket as $y_i = 1$).

Maintaining a bottom-k “historic” sketch

Bottom-k sketch: use a single hash function $h$ and track the $k$ smallest values $y_1 < y_2 < \cdots < y_k$.

Processing a new element $x$:

- If $h(x) < y_k$: $c \leftarrow c + \frac{1}{y_k}$ and $(y_1, \ldots, y_k) \leftarrow \mathrm{sort}\{y_1, \ldots, y_{k-1}, h(x)\}$

The probability of an update is $y_k$.
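The bottom-k version of the historic counter is even simpler; the following is an illustrative sketch under the same assumptions as the k-mins example above:

```python
import hashlib

def hash01(x):
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class BottomKHistoricCounter:
    """Keeps the k smallest hash values and an approximate distinct count c."""
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k      # k smallest hash values seen so far (ascending)
        self.c = 0.0

    def process(self, x):
        v = hash01(x)
        if v < self.y[-1] and v not in self.y:
            self.c += 1.0 / self.y[-1]   # update probability for a new distinct element is y_k
            self.y[-1] = v
            self.y.sort()

counter = BottomKHistoricCounter(k=64)
for x in range(10_000):
    counter.process(x)
print(counter.c)   # unbiased estimate of 10000
```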



Summary: Historic distinct estimators

Recap:

- Maintain the sketch and a count. The CV is $\leq \frac{1}{\sqrt{2k-2}} < \frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{k-2}}$.
- Easy to apply. Trivial to query. Unbiased.

More (not shown here):

- The CV is almost tight for this type of estimator (estimating the presence of each distinct element entering the sketch). We can't do better than $\frac{1}{\sqrt{2k}}$.
- Mergeability: stated for streams. The “sketch” parts are mergeable, but merging the “counts” requires work (which uses the sketch parts). Approach: carefully estimate the overlap (say, using similarity estimators).

Next: Working with a small range

- So far, Min-Hash sketches were stated/analyzed for distributions (random hash functions) with a continuous range.
- We explain how to work with a discrete range, how small the representation can be, and how the estimators are affected.
- Back-of-the-envelope calculations.

Working with a small (discrete) range

When implementing min-hash sketches:

- we work with a discrete distribution for the hash range;
- we want to use as few bits as possible to represent the sketch.

Natural discrete distribution: $h(x) = 2^{-i}$ with probability $2^{-i}$.

- This is the same as using $u \sim U[0,1]$ and retaining only the negated exponent $\lceil \log_2 \frac{1}{u} \rceil$.
- The expectation of the minimum over $n$ elements is about $\frac{1}{n}$, so the negated exponent of the minimum is about $\log_2 n$.
- The expected size of the maximum exponent is about $\log_2 n$, so about $\log_2 \log_2 n$ bits suffice to represent it.

Example (elements sorted by hash): binary values 0.1xxxxx, 0.01xx, 0.001xx, 0.0001xx have negated exponents 1, 2, 3, 4.
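A small illustration of this discretization (names are ours):

```python
import math
import random

# Store only the negated exponent of u ~ U[0,1], i.e. the value i with probability 2^{-i}.
# The maximum over n elements concentrates around log2(n), so it fits in ~log2(log2(n)) bits.
def negated_exponent(u):
    return math.ceil(math.log2(1.0 / u))

rng = random.Random(1)
exponents = [negated_exponent(rng.random()) for _ in range(1000)]
print(max(exponents))   # typically around log2(1000) ≈ 10
```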

Working with a small (discrete) range

- We can also retain a few ($b$) bits beyond the exponent.
- The sketch size is then about $k\,(b + \log_2\log_2 n)$ bits.
- It can be reduced further to about $\log_2\log_2 n + O(kb)$ bits by noting that the exponent parts are very similar, so we can store only the minimum exponent once and keep “offsets”.

How does this rounding affect the estimators (properties and accuracy)? We do “back-of-the-envelope” calculations.

Working with a small (discrete) range

k-mins and k-partition: we can look at each “coordinate” separately. The expected number of elements with the same “minimum” exponent is fixed (the probability of exponent $i$ is $2^{-i}$, so the expectation is $n 2^{-i}$). So we can work with a fixed $b$.

“Parameter estimation” estimators and similarity estimators: we need to keep enough bits to ensure distinctness of the min-hash values in the same sketch (for similarity, in two sketches) with good probability. To apply the “continuous” estimators, we can take a random completion of the retained bits and apply the estimators.

Working with a small (discrete) range

Bottom-k: we need to separate the $k$ smallest values. We expect about $k/2$ of them to have the maximum represented exponent, so we need about $\log\log n + O(\log k)$ bits per register; we work with $b = O(\log k)$.

“Parameter estimation” estimators and similarity estimators: as above, we need enough bits to ensure distinctness of the min-hash values with good probability, and can take a random completion to apply the “continuous” estimators.

Working with a small (discrete) range

Inverse probability (also historic) estimators:

- The estimators apply directly to a discrete range: simply work with the probability that a hash from the discrete domain is strictly below the current “threshold”.
- Unbiasedness still holds (on streams) even with likely hash collisions (with k-mins and k-partition).
- The variance increases by a factor of about $\frac{1}{1 - 2^{-b}}$, so we get most of the value of the continuous domain with a small $b$.
- For mergeability (to support the similarity-like estimates needed to merge counts) or with bottom-k, we need to work with a larger $b = O(\log n)$ to ensure that hash collisions are not likely (in the same sketch or across two sketches).


Distinct counting/Min-Hash sketches bibliography 1

First use of k-mins Min-Hash sketches for distinct counting; first streaming algorithm for approximate distinct counting:
- P. Flajolet and N. Martin, "Probabilistic Counting Algorithms for Data Base Applications", JCSS (31), 1985.

Use of Min-Hash sketches for similarity, union size, mergeability, size estimation (k-mins; proposes bottom-k):
- E. Cohen, "Size-estimation framework with applications to transitive closure and reachability", JCSS (55), 1997.

Use of shingling with k-mins sketches for Jaccard similarity of text documents:
- A. Broder, "On the Resemblance and Containment of Documents", Sequences 1997.
- A. Broder, S. Glassman, M. Manasse, and G. Zweig, "Syntactic Clustering of the Web", SRC technical note, 1997.

Better similarity estimators (beyond the union sketch) from bottom-k samples:
- E. Cohen and H. Kaplan, "Leveraging discarded samples for tighter estimation of multiple-set aggregates", SIGMETRICS 2009.

Asymptotic lower bound on distinct counter size (taking into account the hash representation):
- N. Alon, Y. Matias, and M. Szegedy, "The space complexity of approximating the frequency moments", STOC 1996.

Introducing k-partition sketches for distinct counting:
- Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan, "Counting distinct elements in a data stream", RANDOM 2002.

Distinct counting/Min-Hash sketches bibliography 2

Practical distinct counters based on k-partition sketches:
- P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm".
- S. Heule, M. Nunkeser, and A. Hall, "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm", EDBT 2013.

Theoretical algorithm with asymptotic bounds that match the AMS lower bound:
- D. M. Kane, J. Nelson, and D. P. Woodruff, "An optimal algorithm for the distinct elements problem", PODS 2010.

Inverse probability "historic" estimators; application of Cramér-Rao to min-hash sketches:
- E. Cohen, "All-Distances Sketches, Revisited: Scalable Estimation of the Distance Distribution and Centralities in Massive Graphs", arXiv 2013.

The concepts of min-hash sketches and sketch coordination are related to concepts from the survey sampling literature: order samples (bottom-k), and coordination of samples using the PRN method (Permanent Random Numbers).

More on bottom-k sketches, ML estimator for bottom-k:
- E. Cohen and H. Kaplan, "Summarizing data using bottom-k sketches", PODS 2007; "Tighter Estimation using bottom-k sketches", VLDB 2008.

Inverse probability estimator with priority (a type of bottom-k) sketches:
- N. Alon, N. Duffield, M. Thorup, and C. Lund, "Estimating arbitrary subset sums with a few probes", PODS 2005.