Data Compression by
Quantization

Edward J. Wegman


Center for Computational Statistics


George Mason University



Outline


Acknowledgements


Complexity


Sampling Versus Binning


Some Quantization Theory


Recommendations for Quantization

Acknowledgements


This is joint work with Nkem-Amin (Martin) Khumbah


This work was funded by the Army Research Office

Complexity


Descriptor      Data Set Size in Bytes   Storage Mode
Tiny            10^2                     Piece of Paper
Small           10^4                     A Few Pieces of Paper
Medium          10^6                     A Floppy Disk
Large           10^8                     Hard Disk
Huge            10^10                    Multiple Hard Disks, e.g. RAID Storage
Massive         10^12                    Robotic Magnetic Tape Storage Silos
Super Massive   10^15                    Distributed Archives

The Huber/Wegman Taxonomy of Data Set Sizes



Complexity

O(r)          Plot a scatterplot
O(n)          Calculate means, variances, kernel density estimates
O(n log(n))   Calculate fast Fourier transforms
O(nc)         Calculate singular value decomposition of an n x c matrix; solve a multiple linear regression
O(n^2)        Solve most clustering algorithms
O(a^n)        Detect multivariate outliers

Algorithmic Complexity



Complexity

Table 7: Computational Feasibility on a Teraflop Grand Challenge Computer
(1000 gigaflop performance assumed)

          n^(1/2)          n                n log(n)           n^(3/2)         n^2
tiny      10^-11 seconds   10^-10 seconds   2x10^-10 seconds   10^-9 seconds   10^-8 seconds
small     10^-10 seconds   10^-8 seconds    4x10^-8 seconds    10^-6 seconds   10^-4 seconds
medium    10^-9 seconds    10^-6 seconds    6x10^-6 seconds    .001 seconds    1 second
large     10^-8 seconds    10^-4 seconds    8x10^-4 seconds    1 second        2.8 hours
huge      10^-7 seconds    .01 seconds      .1 seconds         16.7 minutes    3.2 years
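
The entries are simply operation counts divided by the assumed machine speed of 10^12 floating-point operations per second. A small Python sketch (the layout and names are ours; a base-10 logarithm reproduces the n log(n) column) recomputes them:

    import math

    FLOPS = 1e12  # teraflop machine, i.e. the 1000 gigaflops assumed above

    sizes = {"tiny": 1e2, "small": 1e4, "medium": 1e6, "large": 1e8, "huge": 1e10}
    costs = {
        "n^(1/2)":  lambda n: n ** 0.5,
        "n":        lambda n: n,
        "n log(n)": lambda n: n * math.log10(n),  # base-10 log matches the table
        "n^(3/2)":  lambda n: n ** 1.5,
        "n^2":      lambda n: n ** 2,
    }

    for label, n in sizes.items():
        row = {name: f"{f(n) / FLOPS:.3g} s" for name, f in costs.items()}
        print(label, row)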
Motivation


Massive data sets can make many algorithms computationally infeasible, e.g. O(n^2) and higher


Must reduce effective number of cases


Reduce computational complexity


Reduce data transfer requirements


Enhance visualization capabilities


Data Sampling


Database Sampling


Exhaustive search may not be practically feasible because of the size of the databases

The KDD systems must be able to assist in the selection of appropriate parts of the databases to be examined

For sampling to work, the data must satisfy certain conditions (not ordered, no systematic biases)

Sampling can be a very expensive operation, especially when the sample is taken from data stored in a DBMS. Sampling 5% of the database can be more expensive than a sequential full scan of the data.


Data Compression


Squishing, Squashing, Thinning, Binning


Squishing = # cases reduced


Sampling = Thinning


Quantization = Binning


Squashing = # dimensions (variables) reduced


Depending on the goal, either sampling or quantization may be preferable

Data Quantization

Thinning vs Binning



People’s first thought about massive data is usually statistical subsampling

Quantization is engineering’s success story

Binning is the statistician’s quantization

Data Quantization


Images are quantized in 8 to 24 bits, i.e. 256 to 16 million levels.

Signals (audio on CDs) are quantized in 16 bits, i.e. 65,536 levels.

Ask a statistician how many bins to use and the likely response is a few hundred; ask a CS data miner and the likely response is 3.

For a terabyte data set, 10^6 bins


Data Quantization


Binning, but at microresolution


Conventions


d = dimension


k = # of bins


n = sample size


Typically k << n

Data Quantization


Choose E[W | Q = y_j] = mean of observations in the j-th bin = y_j

In other words, E[W | Q] = Q

The quantizer is self-consistent
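
A minimal one-dimensional sketch of such a bin-mean quantizer (the equal-frequency bins and the NumPy code are our choices, not the slides'); it also previews two of the moment properties listed on the next slide:

    import numpy as np

    # Bin-mean quantizer: each observation is replaced by the mean of its bin,
    # so E[W | Q] = Q holds by construction (the quantizer is self-consistent).
    rng = np.random.default_rng(0)
    w = rng.normal(size=100_000)

    k = 1_000
    edges = np.quantile(w, np.linspace(0, 1, k + 1))    # equal-frequency bin edges
    j = np.clip(np.digitize(w, edges[1:-1]), 0, k - 1)  # bin index of each w_i
    y = np.array([w[j == b].mean() for b in range(k)])  # y_j = bin means (clear, not fast)
    q = y[j]                                            # Q for each observation

    print(w.mean(), q.mean())   # E[W] = E[Q]
    print(w.var(),  q.var())    # var(Q) <= var(W)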


Data Quantization


E[W] = E[Q]

If θ̂ is a linear unbiased estimator, then so is E[θ̂ | Q]

If h is a convex function, then E[h(Q)] ≤ E[h(W)] (see the short derivation after this slide).

In particular, E[Q^2] ≤ E[W^2] and var(Q) ≤ var(W).

E[Q(Q - W)] = 0

cov(W - Q) = cov(W) - cov(Q)

E[W - P]^2 ≥ E[W - Q]^2, where P is any other quantizer.
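
The convex-function inequality is conditional Jensen applied to the self-consistency relation E[W | Q] = Q; in our notation:

    h(Q) = h\bigl(E[W \mid Q]\bigr) \le E\bigl[h(W) \mid Q\bigr] \quad (\text{conditional Jensen, } h \text{ convex})

    \text{Taking expectations: } \quad E[h(Q)] \le E[h(W)].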

Data Quantization

Distortion due to Quantization


Distortion is the error due to quantization.

In simple terms, E[W - Q]^2.

Distortion is minimized when the quantization regions, S_j, are most like a (hyper-)sphere.
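
Written out for quantization regions S_j with representors y_j and density f for W (our notation, matching the slides' symbols):

    D \;=\; E\,\|W - Q\|^2 \;=\; \sum_{j} \int_{S_j} \|w - y_j\|^2 \, f(w)\, dw

Making each region S_j as round as possible for its volume keeps the typical value of \|w - y_j\|^2 small, which is why the (hyper-)sphere is the ideal shape.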

Geometry-based Quantization

Need space-filling tessellations

Need congruent tiles

Need as spherical as possible

Geometry-based Quantization

In one dimension

The only polytope is a straight line segment (also bounded by a one-dimensional sphere).

In two dimensions

The only polytopes are equilateral triangles, squares and hexagons

Geometry-based Quantization

In 3 dimensions

Tetrahedron (3-simplex), cube, hexagonal prism, rhombic dodecahedron, truncated octahedron.

In 4 dimensions

4-simplex, hypercube, 24-cell

(Figure: Truncated octahedron tessellation)

Geometry-based Quantization

Tetrahedron*              .1040042…
Cube*                     .0833333…
Octahedron                .0825482…
Hexagonal Prism*          .0812227…
Rhombic Dodecahedron*     .0787451…
Truncated Octahedron*     .0785433…
Dodecahedron              .0781285…
Icosahedron               .0778185…
Sphere                    .0769670

Dimensionless Second Moment for 3-D Polytopes (* marks the space-filling polytopes)
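
These values are the dimensionless (normalized) second moment of each solid. For a region P in d dimensions with centroid c and volume V it is, in the standard formulation used in quantization theory (our notation):

    G(P) = \frac{1}{d}\,\frac{\int_P \|x - c\|^2 \, dx}{V^{1 + 2/d}}

For the unit cube in d = 3 the integral is 3 · (1/12) = 1/4 and V = 1, giving G = 1/12 ≈ .0833333, matching the table; lower values mean a rounder tile and hence less distortion per unit volume.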

Geometry-based Quantization

(Figures: Tetrahedron, Cube, Octahedron, Icosahedron, Dodecahedron, Truncated Octahedron)

Geometry-based Quantization

(Figure: Rhombic Dodecahedron)

http://www.jcrystal.com/steffenweber/POLYHEDRA/p_07.html

Geometry-based Quantization

(Figures: Hexagonal Prism; 24-Cell with Cuboctahedron Envelope)

Geometry-based Quantization

Using 10^6 bins is computationally and visually feasible.

Fast binning, for data in the range [a, b] and for k bins,

j = fixed[k*(x_i - a)/(b - a)]

gives the index of the bin for x_i in one dimension (sketched in code after this slide).

Computational complexity is 4n+1 = O(n).

Memory requirements drop to 3k - location of bin + # items in bin + representor of bin, i.e. storage complexity is 3k.
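
A minimal sketch of this fast binning with the 3k storage layout (the NumPy code and names are ours; x = b is clamped into the top bin, and empty bins fall back to their centers):

    import numpy as np

    def fast_bin(x, a, b, k):
        """Fast 1-D binning for data in [a, b]: O(n) time, 3k storage
        (location of bin, # items in bin, representor of bin)."""
        j = np.clip((k * (x - a) / (b - a)).astype(int), 0, k - 1)  # j = fixed[k*(x_i - a)/(b - a)]
        location = a + (b - a) * (np.arange(k) + 0.5) / k           # bin centers
        count = np.bincount(j, minlength=k)
        total = np.bincount(j, weights=x, minlength=k)
        representor = np.where(count > 0, total / np.maximum(count, 1), location)
        return location, count, representor

    x = np.random.default_rng(1).gamma(2.0, size=1_000_000)
    location, count, representor = fast_bin(x, a=0.0, b=float(x.max()), k=1_000)

The same code scales directly to the 10^6 bins discussed above, since only the three length-k arrays are retained.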



Geometry-based Quantization

In two dimensions

Each hexagon is indexed by 3 parameters.

Computational complexity is 3 times the 1-D complexity, i.e. 12n+3 = O(n).

Complexity for squares is 2 times the 1-D complexity.

Ratio is 3/2.

Storage complexity is still 3k.

Geometry-based Quantization

In 3 dimensions

For the truncated octahedron, there are 3 pairs of square sides and 4 pairs of hexagonal sides.

Computational complexity is 28n+7 = O(n).

Computational complexity for a cube is 12n+3.

Ratio is 7/3.

Storage complexity is still 3k.

Quantization Strategies


Optimally, for purposes of minimizing distortion, use the roundest polytope in d dimensions.

Complexity is always O(n).

Storage complexity is 3k.

# tiles grows exponentially with dimension, the so-called curse of dimensionality.

Higher-dimensional geometry is poorly known.

Computational complexity grows faster than for the hypercube.

Quantization Strategies


For purposes of simplicity, always use the hypercube or d-dimensional simplices

Computational complexity is always O(n).

Methods for data-adaptive tiling are available.

Storage complexity is 3k.

# tiles grows exponentially with dimension.

Both polytopes depart from spherical shape rapidly as d increases.

The hypercube approach is known as the datacube in the computer science literature and is closely related to multivariate histograms in the statistical literature.


Quantization Strategies


Conclusions on Geometric Quantization

The geometric approach is good to 4 or 5 dimensions.

Adaptive tilings may improve the rate at which # tiles grows, but probably destroy the spherical structure.

Good for large n, but weaker for large d.

Quantization Strategies


Alternate Strategy

Form bins via clustering

Known in the electrical engineering literature as vector quantization.

Distance-based clustering is O(n^2), which implies poor performance for large n.

Not terribly dependent on dimension, d.

Clusters may be very out of round, not even convex.

Conclusion

The cluster approach may work for large d, but fails for large n.

Not particularly applicable to “massive” data mining.

Quantization Strategies


Third strategy

Density-based clustering

Density estimation with kernel estimators is O(n).

Uses the modes m_j to form clusters

Put x_i in cluster j if it is closest to mode m_j.

This procedure is distance based, but with complexity O(kn), not O(n^2).

Normal mixture densities may be an alternative approach.

Roundness may be a problem.

But quantization based on density-based clustering offers promise for both large d and large n.
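
A minimal sketch of this third strategy in one dimension: modes are taken as local maxima of a kernel density estimate evaluated on a grid, and each point is assigned to its nearest mode. The grid approach, the SciPy KDE, and all names are our choices, not the authors' implementation:

    import numpy as np
    from scipy.stats import gaussian_kde

    def mode_based_quantize(x, grid_size=512):
        """Assign each 1-D observation to the nearest local mode of a kernel
        density estimate: O(kn) distance computations for k modes, not O(n^2)."""
        kde = gaussian_kde(x)
        grid = np.linspace(x.min(), x.max(), grid_size)
        dens = kde(grid)
        # local maxima of the estimated density are the candidate modes m_j
        is_mode = (dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])
        modes = grid[1:-1][is_mode]
        # label each x_i with the index of its nearest mode
        labels = np.argmin(np.abs(x[:, None] - modes[None, :]), axis=1)
        return modes, labels

    rng = np.random.default_rng(2)
    x = np.concatenate([rng.normal(-3, 1, 5_000), rng.normal(3, 1, 5_000)])
    modes, labels = mode_based_quantize(x)
    print(modes)   # roughly one mode near -3 and one near 3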

Data Quantization


Binning does not lose fine structure in the tails as sampling might.

Roundoff analysis applies.

At this scale of binning, discretization is not likely to be much less accurate than the accuracy of the recorded data.

Discretization - a finite number of bins yields discrete variables, more compatible with categorical data.

Data Quantization


Analysis on a finite subset of the integers has theoretical advantages

Analysis is less delicate

Different forms of convergence are equivalent

Analysis is often more natural since the data are already quantized or categorical

Graphical analysis of numerical data is not much changed, since 10^6 pixels is at the limit of the human visual system (HVS)