Last blog, we use Trie to represent all the words in dictionary. In this blog, we have a new task, given a huge collection of genome sequences $\mathcal{G}$, and an RNA sequence $R$, we want to know which genome $G \in \mathcal{G}$ will have a substring equal to $R$. Basically, the collection of genome $\mathcal{G}$ will not change, but we will query different $R$. So, our goal is indexing all the substring of $\mathcal{G}$.
Suffix Tries: Index One Genome
Let’s first consider only one string $T$ and see how to index all the substring of $T$. The idea is quite directly, we hope to put all the substring to the Trie. But it’s not necessary, we can just put all the suffix of $T$ in Trie. Because, any substring can be found as a prefix of some suffix. Also, we can all $ at the end of string $T$. It’s clear that the number of suffix equals to the number of leaves (also the length of string $T$). This data structure are called Suffix Trie.
Here we first ignore the constructing time of Suffix Trie and adding suffix link on Suffix Trie. We will introduce an efficient algorithm to construct it in next blog, that is Ukkonen’s algorithm. The Suffix Trie has $\mathcal{O}(\vert T \vert ^2)$ space complexity. Use Suffix Trie we can do a lot of thing fast.
- Check if string $S$ is a suffix/substring of . We just need to check it from root and extend a path from root until we find a we cannot push forward any more or we finish $S$, the running time is $\mathcal{O}(\vert S \vert)$.
- Count the number of occurrences of a string $S$ in text $T$. In fact, we can start from the leaves and calculate the number of leaves for each branch node, we store these value for each node and we just need to check $S$ from root and extend a path in Suffix Trie and we check the number of leaves for the ending node of the path. The running time is still $\mathcal{O}(\vert S \vert)$.
- Find the longest repeat in the text $T$. We use DFS until find the first branch node and stop, try all the possible path from root and return the longest path with no branch that is out longest repeat. The running time is $\mathcal{O}(\vert \text{Trie}\vert )$.
- Find the longest repeat with at least $M$ multiplicity in the text $T$. We can also store the number of leaves for each branch node and when we doing our search and stop at some node we check the number of leaves for this node, if less than $M$, we drop it. The running time is $\mathcal{O}(\vert \text{Trie}\vert )$.
- Find the longest common substring between $S$ and $T$. We can use the Suffix Trie with all the suffix links. And iterate from $0$ to the end of $S$. In each iteration, we map the character of $S$ to a node in Suffix Trie, if we cannot extend we will jump through suffix link. So, the running time is $\mathcal{O}(\vert S \vert)$, and the analysis likes KMP and Aho-Corasick.