We must know, we will know !

万物并作,吾以观复

Lecture 5:Suffix Array and Skew Algorithm

In previous blog, we learned about the suffix tree which are $\mathcal{O}(n)$ space complexity. But in practice, it still has a big cost for storing, because Big-Oh notation hidden the coefficient of $n$. We need take around $20$ bytes for per character. So, we will introduce suffix array in this blog, which can implement most of the functions that suffix tree can do. Note that, it’s also a trade-off that we use less space to store suffix array but will take more time on operations on it.

Other part of this blog is talking about an efficient constructing algorithm for suffix array, which is called Skew Algorithm.

Read more

Lecture 3:Suffix Trie and Suffix Tree

Last blog, we use Trie to represent all the words in dictionary. In this blog, we have a new task, given a huge collection of genome sequences $\mathcal{G}$, and an RNA sequence $R$, we want to know which genome $G \in \mathcal{G}$ will have a substring equal to $R$. Basically, the collection of genome $\mathcal{G}$ will not change, but we will query different $R$. So, our goal is indexing all the substring of $\mathcal{G}$.

Read more

Lecture 2:Trie and Aho-Corasick Algorithm

Last blog, we are talking about searching a pattern string in a text string. In fact, we treat the pattern string as a finite automaton which can jump to other position when we meet a mismatch. In this blog, we will consider a general case, that is find multiple pattern string in a text string. In specific, given a dictionary $D$ of $d$ words which has $m$ characters in total, we need to find all their occurrences in the text string $T$ with length $\vert T \vert = n$. If we employ KMP for each word in dictionary one-by-one, the total running time will be $\mathcal{O}(m+dn)$. In this blog, we will still use the idea similar as KMP, you will see the Trie and Aho-Algorithm in this blog.

Read more

Extra topic 1:Palindrome and Manacher algorithm

A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, $\textbf{ATGCAT}$ is a reverse palindrome, because its reverse string $\textbf{TACGTA}$ can complementarily match to origin string.

\[\begin{align*} & \text{ATGCAT}\\ & \mid \,\; \mid \,\; \mid \,\; \mid \,\; \mid \,\; \mid \,\\ & \text{TACGTA} \end{align*}\]

This phenomenon is quite common in our DNA or RNA sequences. We hope to find all occurrences of reverse palindrome in text $T$. In fact, there are at most $\Theta(n^2)$ palindrome substring in text with length $\vert T \vert = n$. In this blog, we will first give a naïve algorithm and use Manacher algorithm to solve this problem in $\mathcal{O}(n)$. In fact, we can do lowest common accessor on Suffix tree to implement this also, which will be introduced in Extra topic 2.

In this topic, we also consider other problem which is quite similar to above problem. Instead of consider reverse palindrome, we only consider symmetric string that is $T = T^\text{R}$ here $T^\text{R}$ is the reverse string of $T$, so in this case we have odd and even symmetric string. For example, $\textbf{ATGGTA}$ and $\textbf{ATGCGTA}$.

Read more

Lecture 1:Knuth-Morris-Pratt (KMP) algorithm

At the beginning of our bioinformatic algorithms journal, we will focus on the similarity of strings. We we touch following aspects,

  • Equality
    • Find occurrences of string $S$ in a given string $T$.
  • Edit distance
    • Find edit distance between strings $S$ and $T$ (global alignment)
    • Find substring of $T$ with minimal edit distance to given string $S$ (local alignment)
  • Probability
    • Construct probability model that best describes a given set of strings.
    • Check if a given strings is likely to be of the same type.

Today, we will introduce the algorithm for first problem: find all occurrences of a string $P$ (called pattern string) in a given string $T$ as a substring.

Read more

Prerequisite:Logic and Number Theory

In this series, I will introduce the Fundamental Contents in Computer Theory and Algorithms. If you hope to learn more advanced algorithms and techniques, please move to my another series Algorithm Design and Analysis.

In this blog, I try to do some preparation for following contents. I prefer to introduce the algorithms in a more rigorous and strict way. In theoretical computer science and algorithms, we will encounter lot of statements, like the correctness and running time for some specific algorithm. The way to assert a statement is true is Proof, then the statement will become a theorem (or lemma, claim, corollary anyway). Hence, we start from Logic which is very beginning of our story (may not be romantic but must be interesting and attractive).

Another part of this blog is number theory, which is powerful and common tool in our journey. Off course, we hope we can master more math, which are always the enrich toolbox for computer scientists.

Read more