Compressed Compact Suffix Arrays

Veli Mäkinen

University of Helsinki

Gonzalo Navarro

University of Chile

Introduction


We consider exact string matching on static
text.


The task is to construct an index for the text
such that the occurrences of a given pattern
can be found efficiently.


A well-known optimal solution exists: build a
suffix tree over the text.


Introduction...


The suffix-tree-based solution has a
weakness:





In some applications the space usage is the
real bottleneck, not the search efficiency.


It takes too much space!

Introduction...


During the last 10 years, many practical /
theoretical solutions with reduced space
complexities have been proposed.


The work can roughly be divided into three
categories:

(1) Reducing constant factors

(2) Concrete optimization

(3) Abstract optimization

Reducing constant factors


Suffix arrays (Manber & Myers 1990)


Suffix cactuses (Kärkkäinen 1995)


Sparse suffix trees (Kärkkäinen & Ukkonen 1996)


Space-efficient suffix trees (Kurtz 1998)


Enhanced suffix arrays (Abouelhoda & Ohlebusch & Kurtz 2002)

Concrete optimization



“Minimizing automata”


DAWGs (Blumer & Blumer & Haussler & McConnell & Ehrenfeucht 1983)


Compact DAWGs (Crochemore & Vérin 1997)


Compact suffix arrays (Mäkinen 2000)

Abstract optimization


Objective: Use as little space as possible to
support the functionality of a given abstract
definition of a data structure.


Space is measured in bits and is usually stated
in proportion to the entropy of the text.

Abstract optimization: Example


A full text index for a given text T supports
the following operations:

- Exists(P): is P a substring of T?

- Count(P): how many times does P occur in T?

- Report(P): list the occurrences of P in T.
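The three operations can be pinned down with a naive reference sketch that scans the text directly. It only fixes the expected answers, not how an index would compute them. The function names and the 1-based positions (chosen to match the examples in these slides) are assumptions of the sketch.

```python
def report(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)            # convert to 1-based positions
        start = text.find(pattern, start + 1)  # advance by one to keep overlaps
    return positions


def count(text, pattern):
    """Number of occurrences of pattern in text (overlaps included)."""
    return len(report(text, pattern))


def exists(text, pattern):
    """True if pattern is a substring of text."""
    return pattern in text


T = "mississippi$"
print(exists(T, "ssi"), count(T, "ssi"), report(T, "ssi"))  # True 2 [3, 6]
```

Each call here costs time proportional to the text length; an index aims to answer the same queries in time that depends mainly on the pattern.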


Abstract optimization...


Seminal work by Jacobson 1989: rank-select queries on bit-vectors.


Rank-select-type structures for suffix trees (Munro & Raman & Rao & Clark 1996-)


Lempel-Ziv index (Kärkkäinen & Ukkonen 1996)


Abstract optimization...


Compressed suffix arrays (Grossi & Vitter 2000, Sadakane 2000, 2002)


FM-index (Ferragina & Manzini 2000)


LZ-self-index (Navarro 2002)


Space-optimal full-text indexes (Grossi & Gupta & Vitter 2003, 2004)


This paper


We use both concrete and abstract optimization.


We compress the compact suffix array into a
succinct full-text index, supporting:

- Exists(P), Count(P) in O(|P| log |T|) time.

- Report(P) in O((|P| + occ) log |T|) time, where
  occ is the number of occurrences.

This paper...


The space requirement of our index is
O(n(1 + H_k log n)) bits, where H_k = H_k(T) is the
order-k empirical entropy of T.


H_k: “the average number of bits needed to
encode a symbol after seeing the k previous
ones, using a fixed codebook”.
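As a rough illustration of this definition, the sketch below computes an order-k empirical entropy by grouping each symbol under the k symbols that precede it and averaging the zero-order entropies of the groups. The function names are hypothetical, and the boundary convention (the first k symbols have no full context and are skipped, while the sum is still divided by the full length) is an assumption of this sketch; other conventions exist.

```python
from collections import Counter, defaultdict
from math import log2


def h0(counts):
    """Zero-order empirical entropy, in bits per symbol, of a symbol multiset."""
    total = sum(counts.values())
    return sum(c / total * log2(total / c) for c in counts.values())


def hk(text, k):
    """Order-k empirical entropy of a non-empty text, in bits per symbol."""
    followers = defaultdict(Counter)
    for i in range(k, len(text)):
        followers[text[i - k:i]][text[i]] += 1   # symbol seen after this context
    weighted = sum(sum(c.values()) * h0(c) for c in followers.values())
    return weighted / len(text)


print(hk("aabb", 0))        # 1.0: two equally frequent symbols
print(hk("abababab", 1))    # 0.0: each symbol is determined by the previous one
```

A text whose symbols are well predicted by their context has a small H_k, which is the regime in which an entropy-proportional bound pays off.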

This paper...


In practice, the size of our index is 1.67 times
the text size, including the text.


Search times are comparable to those of compressed
suffix arrays that occupy O(H_0 n) bits.


Our index takes O(log n) times more space
than the FM-index and the other space-optimal
indexes.


This paper...


Simpler than the previous approaches and
more efficient in practice.


No limitations on the alphabet size σ:

- The FM-index assumes a constant alphabet.

- Some compressed suffix arrays assume
  σ = polylog(n).

Big picture


Compact suffix array (CSA): some areas of a
suffix array are replaced by links to similar
areas.


Compressed CSA (CCSA): We use the
conceptual structure of the optimal CSA as such.


We represent the links with respect to the
original suffix array.

Big picture...


A bit-vector represents the boundaries of
areas replaced by links.


Each area is represented by an integer
denoting the start of the linked area.


Some additional structures are attached to
encode the text inside CCSA, etc.
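A minimal sketch of these two components, under two assumptions that are not stated on the slides: a link at suffix-array index i points to the index of the suffix that starts one text symbol later (wrapping from the last text position to the first), and an area is a maximal run of consecutive link targets. These assumptions were chosen because they reproduce the ccsa column and the bit-vector of the mississippi$ example in this talk. The function name `link_areas` is hypothetical.

```python
def link_areas(text):
    """Return (starts, bits): one link target per area, and the area boundaries."""
    n = len(text)
    # 1-based suffix array by direct sorting (quadratic, fine for a demo)
    sa = sorted(range(1, n + 1), key=lambda p: text[p - 1:])
    index_of = {p: i for i, p in enumerate(sa, start=1)}   # text position -> sa index
    # link[i-1]: sa index of the suffix starting one symbol after suffix sa[i]
    link = [index_of[p % n + 1] for p in sa]
    starts, bits = [], []
    for i, target in enumerate(link):
        new_area = i == 0 or target != link[i - 1] + 1     # run of targets broken?
        bits.append(int(new_area))
        if new_area:
            starts.append(target)
    return starts, bits


starts, bits = link_areas("mississippi$")
print(starts)  # [6, 1, 8, 11, 5, 2, 7, 3, 9]
print(bits)    # [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
```

Only one integer is stored per area, so the table shrinks whenever neighbouring suffix-array entries link to neighbouring targets.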

Example: suffix array


 i   sa  suffix
 1:  12  $
 2:  11  i$
 3:   8  ippi$
 4:   5  issippi$
 5:   2  ississippi$
 6:   1  mississippi$
 7:  10  pi$
 8:   9  ppi$
 9:   7  sippi$
10:   4  sissippi$
11:   6  ssippi$
12:   3  ssissippi$

pos:  1  2  3  4  5  6  7  8  9 10 11 12
T =   m  i  s  s  i  s  s  i  p  p  i  $
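The table can be reproduced with a few lines of code. This is a naive construction by direct sorting, shown only to make the definition concrete; it is not the construction used for real indexes. Positions are 1-based as on the slide, and the terminator $ sorts before the letters because of its character code.

```python
def suffix_array(text):
    """1-based suffix array: start positions of the suffixes in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


T = "mississippi$"
sa = suffix_array(T)
for i, p in enumerate(sa, start=1):
    print(f"{i:2d}: {p:2d}  {T[p - 1:]}")
```

Sorting whole suffixes takes quadratic time in the worst case, which is acceptable for a twelve-symbol example only.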

Example: CSA


 i   sa   csa
 1:  12   1: (5,0,1)
 2:  11   2: (1,0,1)
 3:   8   3: (7,0,1)
 4:   5   4: (9,0,2)
 5:   2
 6:   1   5: (4,1,1)
 7:  10   6: (2,0,1)
 8:   9   7: (6,0,1)
 9:   7   8: (3,0,2)
10:   4
11:   6   9: (8,0,2)
12:   3

Example: CCSA


 i   sa   csa          ccsa     boundary
 1:  12   1: (5,0,1)   1:  6    1
 2:  11   2: (1,0,1)   2:  1    1
 3:   8   3: (7,0,1)   3:  8    1
 4:   5   4: (9,0,2)   4: 11    1
 5:   2                         0
 6:   1   5: (4,1,1)   5:  5    1
 7:  10   6: (2,0,1)   6:  2    1
 8:   9   7: (6,0,1)   7:  7    1
 9:   7   8: (3,0,2)   8:  3    1
10:   4                         0
11:   6   9: (8,0,2)   9:  9    1
12:   3                         0
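The step from the csa triples to the ccsa integers can be sketched as follows. The reading of a triple as (target entry, offset inside the target entry, length of the area) is an assumption of this sketch; it is consistent with every value in the example but is not spelled out on the slide. The function names are hypothetical.

```python
CSA = [(5, 0, 1), (1, 0, 1), (7, 0, 1), (9, 0, 2), (4, 1, 1),
       (2, 0, 1), (6, 0, 1), (3, 0, 2), (8, 0, 2)]


def area_starts(csa):
    """Suffix-array position (1-based) at which each csa entry begins."""
    starts, pos = [], 1
    for _target, _offset, length in csa:
        starts.append(pos)
        pos += length
    return starts


def to_ccsa(csa):
    """Rewrite each link as a position in the original suffix array."""
    starts = area_starts(csa)
    return [starts[target - 1] + offset for target, offset, _length in csa]


def boundary_bits(csa):
    """Bit-vector with a 1 at the first position of every area."""
    return [bit for _t, _o, length in csa for bit in [1] + [0] * (length - 1)]


print(to_ccsa(CSA))        # [6, 1, 8, 11, 5, 2, 7, 3, 9]
print(boundary_bits(CSA))  # [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
```

With the links expressed as suffix-array positions, the entry number and the offset no longer need to be stored; the bit-vector recovers both.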

Example: CCSA...


 i   sa   ccsa     boundary   first symbol
 1:  12   1:  6    1          $
 2:  11   2:  1    1          i
 3:   8   3:  8    1          i
 4:   5   4: 11    1          i
 5:   2            0          i
 6:   1   5:  5    1          m
 7:  10   6:  2    1          p
 8:   9   7:  7    1          p
 9:   7   8:  3    1          s
10:   4            0          s
11:   6   9:  9    1          s
12:   3            0          s

pos:  1  2  3  4  5  6  7  8  9 10 11 12
T =   m  i  s  s  i  s  s  i  p  p  i  $

Example: CCSA...


 i   sa   ccsa     boundary   symbols   new symbol   first symbol
 1:  12   1:  6    1          1: $      1            $
 2:  11   2:  1    1          2: i      1            i
 3:   8   3:  8    1                    0            i
 4:   5   4: 11    1                    0            i
 5:   2            0                    0            i
 6:   1   5:  5    1          3: m      1            m
 7:  10   6:  2    1          4: p      1            p
 8:   9   7:  7    1                    0            p
 9:   7   8:  3    1          5: s      1            s
10:   4            0                    0            s
11:   6   9:  9    1                    0            s
12:   3            0                    0            s

Search on CCSA


We simulate the standard binary search of
the suffix array on CCSA.


A sub-problem in the search is to compare
the pattern P against a suffix T_{sa[i]...|T|}.


For this, we extract t_{sa[i]}, t_{sa[i]+1}, t_{sa[i]+2}, ...,
t_{sa[i]+|P|-1}, following the links of the CCSA.

Example: Search on CCSA

 i   ccsa     boundary   symbols   new symbol
 1:  1:  6    1          1: $      1
 2:  2:  1    1          2: i      1
 3:  3:  8    1                    0
 4:  4: 11    1                    0
 5:           0                    0
 6:  5:  5    1          3: m      1
 7:  6:  2    1          4: p      1
 8:  7:  7    1                    0
 9:  8:  3    1          5: s      1
10:           0                    0
11:  9:  9    1                    0
12:           0                    0

P = “isi” vs. T_{sa[4]...|T|} ?

Position 4: ccsa entry 4, symbol entry 2 (i).

T_{sa[4]...|T|} = i

Example: Search on CCSA

P = “isi” vs. T_{sa[4]...|T|} ?

Follow the link of ccsa entry 4 to position 11: ccsa entry 9, symbol entry 5 (s).

T_{sa[4]...|T|} = is

Example: Search on CCSA

P = “isi” vs. T_{sa[4]...|T|} ?

Follow the link of ccsa entry 9 to position 9: ccsa entry 8, symbol entry 5 (s).

T_{sa[4]...|T|} = iss > P
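The three steps of this example can be replayed in code. The sketch below hard-codes the four structures of the example and extracts symbols by alternating a symbol lookup with a link step. The array names and the function `extract` are hypothetical, rank and selectprev are implemented by linear scans for readability, and selectprev is taken to mean "the last 1 at or before position i", which is an assumption about the convention.

```python
CCSA = [6, 1, 8, 11, 5, 2, 7, 3, 9]                   # one link target per area
BOUNDARY = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0]       # 1 = an area starts here
SYMBOLS = ["$", "i", "m", "p", "s"]                   # distinct symbols, sorted
NEW_SYMBOL = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0]     # 1 = first symbol changes here


def rank(bits, i):
    """Number of 1s among positions 1..i (1-based)."""
    return sum(bits[:i])


def selectprev(bits, i):
    """Position of the last 1 at or before position i (1-based)."""
    while bits[i - 1] == 0:
        i -= 1
    return i


def extract(i, length):
    """First `length` symbols of the suffix at suffix-array position i."""
    out = []
    for _ in range(length):
        out.append(SYMBOLS[rank(NEW_SYMBOL, i) - 1])       # first symbol of suffix i
        area_start = selectprev(BOUNDARY, i)
        # follow the link, keeping the offset inside the area
        i = CCSA[rank(BOUNDARY, i) - 1] + (i - area_start)
    return "".join(out)


P = "isi"
prefix = extract(4, len(P))
print(prefix, prefix > P)   # iss True
```

Starting from position 6 and extracting twelve symbols walks through the whole text, which shows that these structures encode the text itself.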

Search on CCSA...


To follow a link in constant time, we need
the operations rank(i) and selectprev(i) on
bit-vectors:

- rank(i) gives the number of 1’s up to
  position i.

- selectprev(i) gives the position of the
  previous 1 before position i.
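The semantics of the two operations can be illustrated with a sketch that answers both in constant time by precomputing full answer tables. This is deliberately not a succinct structure: it stores one integer per position, far more than the bit-vector itself. The class name is hypothetical, positions are 1-based, and both operations are taken as inclusive of position i, which is an assumption about the convention.

```python
class BitVector:
    """Constant-time rank/selectprev via precomputed tables (not space-efficient)."""

    def __init__(self, bits):
        self.ranks = [0]      # ranks[i]: number of 1s in positions 1..i
        self.prev_one = [0]   # prev_one[i]: last position <= i holding a 1, 0 if none
        for pos, bit in enumerate(bits, start=1):
            self.ranks.append(self.ranks[-1] + bit)
            self.prev_one.append(pos if bit else self.prev_one[-1])

    def rank(self, i):
        return self.ranks[i]

    def selectprev(self, i):
        return self.prev_one[i]


b = BitVector([1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0])
print(b.rank(11), b.selectprev(5), b.selectprev(12))  # 9 4 11
```

Space-efficient variants typically replace the full tables with sampled counts plus small lookup tables, trading a few extra memory accesses for much less space.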

Search on CCSA...


Lemma [Jacobson 89, Munro et al. 96]: A
bit-vector of length n can be replaced with a
structure of size n + o(n) bits so that queries
rank(i) and selectprev(i) can be supported in
constant time.

Search on CCSA...


Corollary: Existence and counting queries
can be supported by CCSA in time O(|P| log |T|).


Reporting queries can be supported by a
similar technique to access sampled suffixes.
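The counting bound comes from two binary searches, each making O(log |T|) comparisons of at most |P| symbols. The sketch below shows that search on a plain suffix array rather than on the CCSA, as a simplification: on the CCSA, the slice that reads a suffix prefix would be replaced by symbol extraction through the links. The function names are hypothetical.

```python
def suffix_array(text):
    """1-based suffix array by direct sorting (demo only)."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def count(text, sa, pattern):
    """Number of occurrences of pattern, via two binary searches over sa."""
    m = len(pattern)

    def prefix(k):
        start = sa[k] - 1
        return text[start:start + m]      # first m symbols of the k-th suffix

    lo, hi = 0, len(sa)
    while lo < hi:                        # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                        # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


T = "mississippi$"
SA = suffix_array(T)
print(count(T, SA, "ssi"), count(T, SA, "isi"))  # 2 0
```

The two search results delimit the range of suffix-array positions whose suffixes start with the pattern, so reporting amounts to listing the text positions of that range.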

Size of CCSA


Overall we use O(n) + n’ log n bits of space,
where n’ is the number of entries in the main
CCSA table.


We show in the paper that n’ is also the
number of runs of symbols in the
Burrows-Wheeler transformed text.


Finally, we show that n’ ≤ 2 H_k n + σ^k.
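The connection to the Burrows-Wheeler transform can be checked on the running example: the main CCSA table for mississippi$ has 9 entries. The sketch below computes the transform naively from a sorted list of suffixes and counts its runs of equal symbols. The function names are hypothetical, and the text is assumed to end with a unique terminator that sorts before every other symbol.

```python
def bwt(text):
    """Burrows-Wheeler transform of a terminator-ended text (naive construction)."""
    order = sorted(range(len(text)), key=lambda p: text[p:])   # 0-based suffix array
    # the symbol preceding each suffix; index -1 wraps to the terminator
    return "".join(text[p - 1] for p in order)


def count_runs(s):
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


L = bwt("mississippi$")
print(L, count_runs(L))  # ipssm$pissii 9
```

A repetitive text produces long runs in the transformed text, so n’ and with it the n’ log n term of the space bound stay small.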

Comparison: default settings


Index   Size (times |T|)
FM      0.36
CSA     0.69
CCSA    1.67
LZ      1.5

Comparison: default settings...


Index   Size (times |T|)
FM      0.36
CSA     0.69
CCSA    1.67
LZ      1.5

Comparison: same sample rate


Index   Size (times |T|)
FM      0.41
CSA     0.58
CCSA    1.67

Comparison: same space


Index   Size (times |T|)
FM      1.69
CSA     1.59
CCSA    1.67
LZ      1.5

Comparison: same space...


Index   Size (times |T|)
FM      1.69
CSA     1.59
CCSA    1.67
LZ      1.5

Conclusion


CCSA is much faster than the default
implementations of other small indexes in
reporting (except the LZ-index).


However, as the basic structure of the other
indexes takes less space, it is possible to
implement them using a smaller sampling step
so that they occupy the same space as
CCSA and work as efficiently.

Future


In subsequent work we have developed an
index (a cross between CCSA and FM-index)
taking O(H_k log σ) bits of space and supporting
counting queries in time O(|P|).

- optimal space/time on a constant alphabet

- turns the exponential additive alphabet
  factor of the FM-index into a logarithmic
  multiplicative factor.