This lecture continues to talk about Dynamic Programming, we are going to introduce RNA secondary stucture problem and squence alignment problem.
Dynamic Programming
RNA secondary structure
In biology, RNA is a string ${ s_1,\cdots,s_n }$ over ${ {A,C,G,U} }$. And, it’s tries to form base pairs with itself by folding back.
The set of base pairs formed by RNA moleculars by this process is called “secondary structure”.
Definition: “Secondary Structure” is a set of pairs ${ S = { (s_i, s_j) } }$ satisfying the following properties
- Watson-Crick rule
Allowed pairs: A-U, U-A, G-C, C-G
- No sharp turn:
If ${ (s_i, s_j) \in S }$ then ${ i < j - 4 }$
- No crossing:
If ${ (s_i,s_j), (s_k ,s_{\ell}) \in S}$, they don’t crossing.
- Objective:
Given RNA sequence ${ s_1,\cdots,s_n }$, find a secondary structure ${ S }$ that maximizes the total number of matching pairs.
Idea
We define the subproblem as
${ OPT(i,j) = }$ maximum number of matched pairs in ${ s_i, \cdots,s_j }$
So, there are several cases
Case 1: If ${ i \geq j -4 }$, it’s a sharp turn, we can not match them ${ OPT(i,j) = 0 }$
Case 2: If ${ s_j }$ is not matched with any charcter in ${ s_i, \cdots, s_{j-1} }$, so
Case 3: ${ s_j }$ can be matched with ${ s_k }$ where ${ i \leq k < j -4 }$. Assume ${ s_k }$ and ${ s_j }$ are matched, because we can not do crossing match, so we have to find optimal match solution from ${ i }$ to ${ k-1 }$ and ${ k+1 }$ to ${ j-1 }$
pseudocode
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
RNA_Folding(s_1,s_2,...,s_n)
// Initialization DP matrix
for i = 1 ... n
for j = 1 ... n
M[i,j] = 0
for k = 5,6,...,n-1 //iterate all the distance of interval
for i = 1...n-k //iterate all the starting position
j = i + k
M[i,j] = Compute_Folding(s_i,...,s_j)
Return M[1,n]
Compute_Folding(s_i,...,s_j)
for t = i,i+1,...,j-5
if s_t and s_j are matched
M[i,j] = max{M[i,j],1+M[i,t-1]+M[t+1,j-1]} //We keep M[i,j] store maxinum in each comparision, finnaly we get the maxinum for all situation. If there is no s_t to match s_j, M[i,j] still keep 0
runing time
The runing time is ${ O\left((n-k)(5+6+\cdots+n-1) \right) = O(n^3)}$
Sequence Alignment
We have two ways to define this problem
Longest Common Subsequence (LCS) and Edit Distance (ED).
Given two strings ${ x [1\cdots m] }$, ${ y [1\cdots n ] }$, their LCS is the longest subsequence that is present in both ${ x,y }$ (Note: subsequences are not required to occupy consecutive positions within the original sequences)
Edit distance between ${ x,y }$ is the minmum number of character insetion, deltion and substitution required to convert ${ x }$ to ${ y }$.
If the allowed edit operations are only intertion and deletion, the ${ ED = n + m - 2\vert LCS \vert }$
triangle inequality ${ d(x,z) \leq d(u,y) + d(y,z) }$ not satisfy for ED
Dynamic Programming for LCS
We can define the subproblem as
And, we give the state transition equation
Correctness
Case 1: If ${ x[i] \neq y[j] }$, we can align them, we need to delete at least one. Thus
Case 2: If ${ x[i] = y[j] }$, we will prove following claim
- Claim: ${ OPT[i,j] = OPT[i-1,j-1]+1 }$
Proof. we suppose we don’t match them, and we will deduce controdiction.
If we delete both of them, the largest common sequence of ${ x[1…i],y[1…j] }$ is ${ OPT[i-1,j-1] }$, which less than ${ OPT[i-1,j-1]+1 }$, which contridicts to optimal solution.
If we delete either of them, without loss generality, we delete ${ x[i] }$, and align ${ y[j] }$ to ${ x[k], k< i }$
Now, optimal value is ${ OPT[i,j] = OPT[k-1,j-1] + 1 }$. But obviously, ${ OPT[k-1,j-1] \leq OPT[i-1,j-1] }$, which contradicts to optimal value.
Hence,${ OPT[i,j] = OPT[i-1,j-1]+1 }$. QED.
Pseudocode
1
2
3
4
5
6
7
8
9
10
11
12
Longest_Common_Subsequence(x[1...m],y[1...n])
// Initialization of DP matrix
for (i= 0 to m)
for (j = to n)
M[i,j] = 0
for i = 1 to m //itertate eles in x
for j = 1 to n // iterate eles in y
If x[i] = y[j]
M[i,j] = 1 + M[i-1,j-1]
else
M[i,j] = max{M[i-1,j],M[i,j-1]}
return M[m,n]
Running time and Space
It’s easy to check the running time is ${ O(nm) }$. (We have ${ O(mn) }$ entries and each entry consume ${ O(1) }$)
It seems that the space complexity is ${ O(nm) }$, but we don’t need to store all value, we just need store last row/colunm, so space is ${ O(min{n,m}) }$. (Because we only need ${ M[i-1,j-1],M[i,j-1],M[i-1,j] }$ to get the ${ M[i,j] }$)
LCS for multiple sequences
Given ${ \ell }$ strings ${ x^1,…,x^\ell }$
Our goal is finding the LCS of these ${ \ell}$ sequences.
We can define a subproblem similar with pairwise sequence alignment.
${OPT[i_1,i_2,\cdots,i_\ell] = }$ the longest common sequence of ${ x^1[1…x_{i_1}],x^2[1…x_{i_2}],\cdots,x^\ell[1…x_{i_\ell}] }$
Idea
If ${ x^1{i_1}= x^2{i_2}=\cdots=\cdots=x^3{i\ell} }$, we can match all the each current element from each squence.
Else, they don’t match to each other, we can recursivly find the optimal solution. That is we choose the maxinum plan from following OPT value
Running time
The DP matrix have ${ O(m_1\cdot m_2 \cdot \cdots m_\ell) }$ entries, and each entry need ${ O(\ell) }$ time. So, total running time is ${ O(\ell \cdot m_1\cdot m_2 \cdot \cdots m_\ell) }$