In previous blog, we learned about the suffix tree which are $\mathcal{O}(n)$ space complexity. But in practice, it still has a big cost for storing, because Big-Oh notation hidden the coefficient of $n$. We need take around $20$ bytes for per character. So, we will introduce suffix array in this blog, which can implement most of the functions that suffix tree can do. Note that, it’s also a trade-off that we use less space to store suffix array but will take more time on operations on it.
Other part of this blog is talking about an efficient constructing algorithm for suffix array, which is called Skew Algorithm.
What’s Suffix Array
Given a text string $s$ with size $\vert s \vert = n$, we index all the suffix of $s$ by the starting point. And then sort all the suffix in lexicographical, we can just store the index in this order. Here is an example,
\[\begin{equation} \begin{aligned} & 1 & \text{ATTCATG\$} \\ & 2 & \text{TTCATG\$} \\ & 3 & \text{TCATG\$} \\ & 4 & \text{CATG\$} \\ & 5 & \text{ATG\$} \\ & 6 & \text{TA\$} \\ & 7 & \text{A\$} \\ & 8 & \text{\$} \\ \end{aligned} \, \xrightarrow{\text{sort lexicographically}} \, \begin{aligned} & 8 & \text{\$} \\ & 5 & \text{ATG\$} \\ & 1 & \text{ATTCATG\$} \\ & 4 & \text{CATG\$} \\ & 7 & \text{A\$} \\ & 3 & \text{TCATG\$} \\ & 6 & \text{TA\$} \\ & 2 & \text{TTCATG\$} \\ \end{aligned} \, \xrightarrow{\text{store indices}} \, \begin{aligned} 8 \\ 5 \\1 \\ 4 \\7 \\ 3\\ 6 \\2 \end{aligned} \end{equation}\]It’s clear that the space complexity is $\mathcal{O}(n)$.
Relationship between Suffix Array and Suffix Tree
We know each leaf in suffix tree represents a suffix of string $s$, then we assign the start position of suffix to each leaf (which also equals to $n-d$, where $d$ is the depth of the suffix trie). Then, we can use DFS and at each branch node we select the outgoing edge with smallest lexicographical substring. Then, the order of reaching leaves is the order of suffix array.
Some Operations on Suffix Array
- Given a query string $q$, search $q$ on Suffix Array, we can use binary search, because the Suffix Array is sorted. Then we will return the first position that start with string $q$. Because, all the suffixes start with string $q$ are located around this position, we just need to count the neighbors.
-
$k$-mer Counting: given an integer $k$, output all pairs $(b,i)$, where $b$ is a length-$k$ substring of $s$ that occurs $i$ times.
-
Go through the suffix array and keep a count variable $\textbf{Count}$
-
If the current suffix with length $<k$, skip it.
-
If the current suffix start with same length-$k$ string as previous suffix, $\textbf{Count}++$
-
Else, Output $\textbf{Count}$ and previous length-$k$ suffix. And reset $\textbf{Count} := 1$, which means we move to a new $k$-mer.
-
Constructing Suffix Array
A straightforward way to do sorting will takes $\mathcal{O}(n^2 \lg n)$ running time (Here we use merge sort or other fast sorting algorithms), because we need do $\mathcal{O}(n \lg n)$ comparisons and each comparisons takes $\mathcal{O}(n)$.
In fact, we can use Radix Sort here, for each digital we consume $\mathcal{O}(n)$ running time, and we have $n$ digitals. So the running time is $\mathcal{O}(n^2)$. Here, we will introduce a efficient way to implement sorting for suffixes, that is Skew Algorithm using divided-and-conquer idea.
Skew Algorithm
-
Divide suffixes into $3$ groups: starting position $i \text{ mod } 3 = 0$, $1$ and $2$ and we call them group 0, 1 and 2. If $3$ cannot divide the length of string, we pad $ at the end of string. Take an example mississippi$$
-
Consider the group 1 and 2.
because, for each group, the difference between any two suffixes starting positions is the multiple of $3$. So, we can translate each triple of the suffix to