Extra topic 1:Palindrome and Manacher algorithm

 

A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, $\textbf{ATGCAT}$ is a reverse palindrome, because its reverse string $\textbf{TACGTA}$ can complementarily match to origin string.

\[\begin{align*} & \text{ATGCAT}\\ & \mid \,\; \mid \,\; \mid \,\; \mid \,\; \mid \,\; \mid \,\\ & \text{TACGTA} \end{align*}\]

This phenomenon is quite common in our DNA or RNA sequences. We hope to find all occurrences of reverse palindrome in text $T$. In fact, there are at most $\Theta(n^2)$ palindrome substring in text with length $\vert T \vert = n$. In this blog, we will first give a naïve algorithm and use Manacher algorithm to solve this problem in $\mathcal{O}(n)$. In fact, we can do lowest common accessor on Suffix tree to implement this also, which will be introduced in Extra topic 2.

In this topic, we also consider other problem which is quite similar to above problem. Instead of consider reverse palindrome, we only consider symmetric string that is $T = T^\text{R}$ here $T^\text{R}$ is the reverse string of $T$, so in this case we have odd and even symmetric string. For example, $\textbf{ATGGTA}$ and $\textbf{ATGCGTA}$.

Naïve Algorithm

We iterate each character in text from start to end. In each position, we try to treat it as a center (odd and even center) and extend longer. For $i$th character, we need compare at most $i$ pairs of character, so the total running time would be $\mathcal{O}(n^2)$.

In fact, we can represent palindrome or symmetric string in a more compact way. For example, $\textbf{ATGCGTA}$ has three symmetric substrings with $\text{C}$ as center, which equals to the longest length of radius of symmetric substring with $\text{C}$ as center.

Hence, for each position $i$ in text $T$, we can record the odd/even symmetric substring with $i$th character as center. So, we define array $d_o[i],d_e[i]$ for the longest radius or number of odd/even symmetric substrings that take $i$th character as center.

\[\begin{aligned} \underbrace{\cdots \text{ATGCGTA} \cdots}_{d_o[i]=3} \\ \underbrace{\cdots \text{ATGGTA} \cdots}_{d_e[i]=3} \end{aligned}\]

Thus, the size of arrays are $\Theta(n)$, which inspire us to have a linear time algorithm.

Manacher Algorithm

To describe the algorithm, we only consider the odd cases here, but notice that even cases can be solved in similar way.

Main idea

During the process of whole algorithm, we always maintain the rightmost symmetric substring $[\ell,r]$, where $\ell,r$ is the left and right bound. We iterate the text string from start to end. Suppose we need to calculate $d_o[i]$ in this loop, and we have already known all the numbers of $d_o[k], k<i$. Now, we have several case

  • If $i>r$, we use naïve algorithm to extend it, until we find mismatch. And, we record the radius of longest symmetric substring as $d_o[i]$.

  • If $i \leq r$, we can try to employ the information from previous “knowledge”. Because we already know the information of symmetric position of $i$ in $[\ell,r]$, namely $j = \ell + (r-i)$th position.

    • If $i+d_o[j] < r$, we have following case
    \[\cdots \overbrace{ s_\ell \cdots \underbrace{s_{j - d_o[j]} \cdots s_j \cdots s_{j + d_o[j]}}_{\text{ longest sym-substring on } s_j} \cdots \underbrace{s_{i-d_o[j]} \cdots s_i \cdots s_{i+d_o[j]}}_{\text{symmetric substring on } s_i} \cdots s_r}^{\text{rightmost symmetric substring }[\ell,r]} \cdots\]

    Because $s_{i-d_o[j]} \cdots s_i \cdots s_{i+d_o[j]}$ is still in the range of $[\ell,r]$, we can directly assert that this symmetric substring is longest with $s_i$ as center. We don’t update rightmost symmetric substring in this case.

    • If $i+d_o[j] \geq r$, we know $[i-(r-i),r]$ is a symmetric substring with $s_i$ as center, but we don’t know if it’s longest, so we need to explore the toward two side. That is
    \[\cdots \overbrace{\underbrace{s_{\ell} \cdots s_j \cdots s_{j + (r-i)}}_{\text{ longest sym-substring on } s_j} \cdots \underset{?\leftarrow}{\cdots} \underbrace{s_{i-(r-i)} \cdots s_i \cdots s_{r}}_{\text{symmetric substring on } s_i}}^{\text{rightmost symmetric substring }[\ell,r]} \underset{\rightarrow ?}{\cdots}\]

    So, we need to do more comparisons to explore it. After that, we record $d_o[i]$ and update $[\ell,r]$ as $[i-d_o[i], i+d_o[i]]$.

For the even cases, we can easily have similar way to solve.

Implement odd and even cases simultaneously

Instead of fill arrays $d_o,d_e$ separately, we can insert # ahead of each character and the end of the string, for example

\[\text{ATGCAT} \rightarrow \text{\#A\#T\#G\#C\#A\#T\#}\]

It’s clear that each # symbol is located at even position and each origin character located at odd position (start from $0$). Hence, we can apply Manacher algorithm only for odd-length symmetric substring, and denote array $d$. Then, we know $d[2i+2]=2d_e[i],d[2i+1] = 2d_o[i]$.

Running time

At each iteration, we conduct comparisons and other part takes, which constant time. When we do a comparison, $r$ will exactly increase one, when $r= \vert S \vert $, we don’t need to do comparisons anymore. Hence, the total number of comparisons is $\mathcal{O}(n)$. So, the total running time will be $\mathcal{O}(n)$.

Reverse Palindrome

Basically, the similar procedure with above symmetric substring. We describe it as below

  • If $i>r$, we check if $(S[i],S[i+1]), (S[i-1],S[i+2]) \cdots $ these character pairs are complementary pairing, i.e $\textbf{A} - \textbf{T}, \textbf{G} - \textbf{C}$. Until we find mismatch, then we stop and record the radius of longest palindrome substring as $d[i]$.

  • If $i \leq r$, we can try to employ the information from previous “knowledge”. Because we already know the information of symmetric position of $i$ in $[\ell,r]$, namely $j = \ell + (r-i) - 1$th position.

    • If $i+d[j] < r$, we have following case

      \[\cdots \overbrace{ s_\ell \cdots \underbrace{s_{j - d[j] + 1} \cdots s_j s_{j+1} \cdots s_{j + d[j]}}_{\text{longest palindrome on } s_j} \cdots \underbrace{s_{i-d[j]+1} \cdots s_i s_{i+1} \cdots s_{i+d[j]}}_{\text{palindrome on } s_i } \cdots s_r}^{\text{rightmost palindrome }[\ell,r]} \cdots\]

      Because $s_{i-d[j]+1} \cdots s_i s_{i+1} \cdots s_{i+d[j]}$ is still in the range of $[\ell,r]$, we can directly assert that this palindrome is longest with $s_i$ as center. We don’t update rightmost palindrome in this case.

      • If $i+d[j] \geq r$, we know $[i-(r-i-1), r]$ is a palindrome with $s_i$ as center, but we don’t know if it’s longest, so we need to explore the toward two side. That is
      \[\cdots \overbrace{\underbrace{s_{\ell} \cdots s_j s_{j+1} \cdots s_{j + (r-i)}}_{\text{palindrome on } s_j} \cdots \underset{?\leftarrow}{\cdots} \underbrace{s_{i-(r-i-1)} \cdots s_i s_{i+1} \cdots s_{r}}_{\text{palindrome on } s_i}}^{\text{rightmost palindrome }[\ell,r]} \underset{\rightarrow ?}{\cdots}\]

      So, we need to do more comparisons to explore it. After that, we record $d[i]$ and update $[\ell,r]$ as $[i-d[i]+1, i+d[i]]$.

Here, we show the pseudocode

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
Pair(s,t)
    If s == "A" & t == "T"
        return True
    elif s == "T" & t == "A"
        return True
    elif s == "C" & t == "G"
        return True
    elif s == "G" & t == "C"
        return True
    else
        return False

Find-Reverse-Palindrome(S)
    // Initialization
    d = [0]*n \\ longest redius of palindrome with ith as center
    l = 0
    r = -1
    
    // Iteration for each character
    for i = 0 to n
        if i > r // extend by comparisons
            k = 0
            while 0 <= i-k & i+k+1 < n & Pair(s[i-k], s[i+k+1]):
                k += 1
            d[i] = k
            // Update [l,r]
            l = i - d[i] + 1
            r = i + d[i]
        else
            j = l + (r-j) + 1 // symmetric position
            if i + d[j] < r // directly copy
                d[i] = d[j]
            else
                k = min(d[j], r -i)
                while 0 <= i-k & i+k+1 < n & Pair(s[i-k], s[i+k+1]):
                    k += 1
                    d[i] = k
                // Update [l,r]
                l = i - d[i] + 1
                r = i + d[i]
    return d