Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

Nicola Prezza

We consider the problem of encoding a string of length $n$ from an alphabet $[0,\sigma-1]$ so that access and substring-equality queries (that is, determining the equality of any two substrings) can be answered efficiently. A clear lower bound on the size of any prefix-free encoding of this kind is $n\log\sigma + \Theta(\log (n\sigma))$ bits. We describe a new encoding matching this lower bound when $\sigma\leq n^{O(1)}$ while supporting queries in optimal $O(1)$-time in the cell-probe model, and show how to extend the result to the word-RAM model using $\Theta(\log^2n)$ bits of additional space. Using our new encoding, we obtain the first optimal-space algorithms for several string-processing problems in the word-RAM model with rewritable input. In particular, we describe the first in-place algorithm computing the LCP array in $O(n\log n)$ expected time and the first in-place Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. Our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first optimal-space and sublinear-time algorithm for building the sparse suffix tree.

Knowledge Graph



Sign up or login to leave a comment