Dynamic Programming:RNA secondary strucure problem, Sequence alignment, Multiple sequence alignment

 

This lecture continues to talk about Dynamic Programming, we are going to introduce RNA secondary stucture problem and squence alignment problem.

Dynamic Programming

RNA secondary structure

In biology, RNA is a string ${ s_1,\cdots,s_n }$ over ${ {A,C,G,U} }$. And, it’s tries to form base pairs with itself by folding back.

The set of base pairs formed by RNA moleculars by this process is called “secondary structure”.

Definition: “Secondary Structure” is a set of pairs ${ S = { (s_i, s_j) } }$ satisfying the following properties

  1. Watson-Crick rule

Allowed pairs: A-U, U-A, G-C, C-G

  1. No sharp turn:

If ${ (s_i, s_j) \in S }$ then ${ i < j - 4 }$

  1. No crossing:

If ${ (s_i,s_j), (s_k ,s_{\ell}) \in S}$, they don’t crossing.

  • Objective:

Given RNA sequence ${ s_1,\cdots,s_n }$, find a secondary structure ${ S }$ that maximizes the total number of matching pairs.

Idea

We define the subproblem as

${ OPT(i,j) = }$ maximum number of matched pairs in ${ s_i, \cdots,s_j }$

So, there are several cases

Case 1: If ${ i \geq j -4 }$, it’s a sharp turn, we can not match them ${ OPT(i,j) = 0 }$

Case 2: If ${ s_j }$ is not matched with any charcter in ${ s_i, \cdots, s_{j-1} }$, so

$$ OPT(i,j) = OPT(i,j-1) $$

Case 3: ${ s_j }$ can be matched with ${ s_k }$ where ${ i \leq k < j -4 }$. Assume ${ s_k }$ and ${ s_j }$ are matched, because we can not do crossing match, so we have to find optimal match solution from ${ i }$ to ${ k-1 }$ and ${ k+1 }$ to ${ j-1 }$

$$ OPT(i,j) = 1 + \max_{ i \leq k < j -4, \text{and } \atop s_j, s_k \text{ can match}} \{ OPT(i,k-1)+OPT(k+1,j-1)\} $$

pseudocode

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
RNA_Folding(s_1,s_2,...,s_n)
    // Initialization DP matrix
    for i = 1 ... n 
        for j = 1 ... n
            M[i,j] = 0
    for k = 5,6,...,n-1 //iterate all the distance of interval
        for i = 1...n-k //iterate all the starting position
            j = i + k
            M[i,j] = Compute_Folding(s_i,...,s_j)
    Return M[1,n]

Compute_Folding(s_i,...,s_j)
    for t = i,i+1,...,j-5
        if s_t and s_j are matched
            M[i,j] = max{M[i,j],1+M[i,t-1]+M[t+1,j-1]} //We keep M[i,j] store maxinum in each comparision, finnaly we get the maxinum for all situation. If there is no s_t to match s_j, M[i,j] still keep 0

runing time

The runing time is ${ O\left((n-k)(5+6+\cdots+n-1) \right) = O(n^3)}$

Sequence Alignment

We have two ways to define this problem

Longest Common Subsequence (LCS) and Edit Distance (ED).

Given two strings ${ x [1\cdots m] }$, ${ y [1\cdots n ] }$, their LCS is the longest subsequence that is present in both ${ x,y }$ (Note: subsequences are not required to occupy consecutive positions within the original sequences)

Edit distance between ${ x,y }$ is the minmum number of character insetion, deltion and substitution required to convert ${ x }$ to ${ y }$.

If the allowed edit operations are only intertion and deletion, the ${ ED = n + m - 2\vert LCS \vert }$

triangle inequality ${ d(x,z) \leq d(u,y) + d(y,z) }$ not satisfy for ED

Dynamic Programming for LCS

We can define the subproblem as

$$ OPT[i,j] = LCS(x[1,...,i],y[1,...,j]) $$

And, we give the state transition equation

$$ \begin{equation} OPT[i,j] = \begin{cases} OPT[i-1,j-1]+1, &\text{ if } x[i] = y[i], \\ \max\{OPT[i-1,j],OPT[i,j-1]\}, &\text{ if } x[i] \neq y[i]. \end{cases} \end{equation} $$

Correctness

Case 1: If ${ x[i] \neq y[j] }$, we can align them, we need to delete at least one. Thus

$$ OPT[i,j] = \max\{OPT[i-1,j],OPT[i,j-1]\} $$

Case 2: If ${ x[i] = y[j] }$, we will prove following claim

  • Claim: ${ OPT[i,j] = OPT[i-1,j-1]+1 }$

Proof. we suppose we don’t match them, and we will deduce controdiction.

If we delete both of them, the largest common sequence of ${ x[1…i],y[1…j] }$ is ${ OPT[i-1,j-1] }$, which less than ${ OPT[i-1,j-1]+1 }$, which contridicts to optimal solution.

If we delete either of them, without loss generality, we delete ${ x[i] }$, and align ${ y[j] }$ to ${ x[k], k< i }$

Now, optimal value is ${ OPT[i,j] = OPT[k-1,j-1] + 1 }$. But obviously, ${ OPT[k-1,j-1] \leq OPT[i-1,j-1] }$, which contradicts to optimal value.

Hence,${ OPT[i,j] = OPT[i-1,j-1]+1 }$. QED.

Pseudocode

1
2
3
4
5
6
7
8
9
10
11
12
Longest_Common_Subsequence(x[1...m],y[1...n])
    // Initialization of DP matrix
    for (i= 0 to m)
        for (j = to n)
            M[i,j] = 0
    for i = 1 to m //itertate eles in x
        for j = 1 to n // iterate eles in y
            If x[i] = y[j]
                M[i,j] = 1 + M[i-1,j-1]
            else 
                M[i,j] = max{M[i-1,j],M[i,j-1]}
    return M[m,n] 

Running time and Space

It’s easy to check the running time is ${ O(nm) }$. (We have ${ O(mn) }$ entries and each entry consume ${ O(1) }$)

It seems that the space complexity is ${ O(nm) }$, but we don’t need to store all value, we just need store last row/colunm, so space is ${ O(min{n,m}) }$. (Because we only need ${ M[i-1,j-1],M[i,j-1],M[i-1,j] }$ to get the ${ M[i,j] }$)

LCS for multiple sequences

Given ${ \ell }$ strings ${ x^1,…,x^\ell }$

$$ x^i = x^i_1 \cdots x^i_{m_i} $$

Our goal is finding the LCS of these ${ \ell}$ sequences.

We can define a subproblem similar with pairwise sequence alignment.

${OPT[i_1,i_2,\cdots,i_\ell] = }$ the longest common sequence of ${ x^1[1…x_{i_1}],x^2[1…x_{i_2}],\cdots,x^\ell[1…x_{i_\ell}] }$

Idea

If ${ x^1{i_1}= x^2{i_2}=\cdots=\cdots=x^3{i\ell} }$, we can match all the each current element from each squence.

$$ OPT[i_1,i_2,\cdots,i_\ell] = 1 + OPT[i_1-1,i_2-1,\cdots,i_\ell-1] $$

Else, they don’t match to each other, we can recursivly find the optimal solution. That is we choose the maxinum plan from following OPT value

$$ \begin{equation} OPT[i_1,i_2,\cdots,i_\ell] = \max \left\{ \begin{aligned} OPT&[i_1-1,i_2,\cdots,i_\ell], \\ OPT&[i_1,i_2-1,\cdots,i_\ell], \\ &\vdots \\ OPT&[i_1,i_2,\cdots,i_\ell-1]. \\ \end{aligned} \right\} \end{equation} $$

Running time

The DP matrix have ${ O(m_1\cdot m_2 \cdot \cdots m_\ell) }$ entries, and each entry need ${ O(\ell) }$ time. So, total running time is ${ O(\ell \cdot m_1\cdot m_2 \cdot \cdots m_\ell) }$