We must know, we will know !

万物并作,吾以观复

Near-Linear Time Edit Distance for Indel Channels

In this blog, we will check out a theoretical paper about edit distance. In fact edit distance is a kernel idea for sequence alignment in bioinformatics. There are many available tools developed to solve various kinds of alignment scenario. While there exists a gap between theoretical algorithms and practical tools. In practice, we also embed some heuristic part to the tools to facilitate alignment. As theoreticians, they need to figure out some techniques to get a theoretical analysis of these tools that is more close to the practical performance. Further, a good theoretical analysis will light the direction of improvement.

This paper proposed a framework based on the classic NW algorithm with running time $\mathcal{O}(n\lg n)$ and get edit distance with high probability, which is also adapted by many classic implements, like BLAST. Hence, the analysis of the algorithm mentioned in this paper is also a theoretical justification of these heuristic-based alignment tools.

Read more

The Backpack Quotient Filter: a dynamic and space-efficient data structure for querying k-mers with abundance

This paper proposes a novel data structure, Backpack Quotient Filter (BQF), to index $k$-mer. which support querying with efficient space and negligible false positive rate. Additionally, this blog will introduce Counting Quotient Filter (CQF)1 at first, to lead a better understanding of this work.

Read more

Lecture 5:Suffix Array and Skew Algorithm

In previous blog, we learned about the suffix tree which are $\mathcal{O}(n)$ space complexity. But in practice, it still has a big cost for storing, because Big-Oh notation hidden the coefficient of $n$. We need take around $20$ bytes for per character. So, we will introduce suffix array in this blog, which can implement most of the functions that suffix tree can do. Note that, it’s also a trade-off that we use less space to store suffix array but will take more time on operations on it.

Other part of this blog is talking about an efficient constructing algorithm for suffix array, which is called Skew Algorithm.

Read more

Lecture 3:Suffix Trie and Suffix Tree

Last blog, we use Trie to represent all the words in dictionary. In this blog, we have a new task, given a huge collection of genome sequences $\mathcal{G}$, and an RNA sequence $R$, we want to know which genome $G \in \mathcal{G}$ will have a substring equal to $R$. Basically, the collection of genome $\mathcal{G}$ will not change, but we will query different $R$. So, our goal is indexing all the substring of $\mathcal{G}$.

Read more

Lecture 2:Trie and Aho-Corasick Algorithm

Last blog, we are talking about searching a pattern string in a text string. In fact, we treat the pattern string as a finite automaton which can jump to other position when we meet a mismatch. In this blog, we will consider a general case, that is find multiple pattern string in a text string. In specific, given a dictionary $D$ of $d$ words which has $m$ characters in total, we need to find all their occurrences in the text string $T$ with length $\vert T \vert = n$. If we employ KMP for each word in dictionary one-by-one, the total running time will be $\mathcal{O}(m+dn)$. In this blog, we will still use the idea similar as KMP, you will see the Trie and Aho-Algorithm in this blog.

Read more

Extra topic 1:Palindrome and Manacher algorithm

A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, $\textbf{ATGCAT}$ is a reverse palindrome, because its reverse string $\textbf{TACGTA}$ can complementarily match to origin string.

\[\begin{align*} & \text{ATGCAT}\\ & \mid \,\; \mid \,\; \mid \,\; \mid \,\; \mid \,\; \mid \,\\ & \text{TACGTA} \end{align*}\]

This phenomenon is quite common in our DNA or RNA sequences. We hope to find all occurrences of reverse palindrome in text $T$. In fact, there are at most $\Theta(n^2)$ palindrome substring in text with length $\vert T \vert = n$. In this blog, we will first give a naïve algorithm and use Manacher algorithm to solve this problem in $\mathcal{O}(n)$. In fact, we can do lowest common accessor on Suffix tree to implement this also, which will be introduced in Extra topic 2.

In this topic, we also consider other problem which is quite similar to above problem. Instead of consider reverse palindrome, we only consider symmetric string that is $T = T^\text{R}$ here $T^\text{R}$ is the reverse string of $T$, so in this case we have odd and even symmetric string. For example, $\textbf{ATGGTA}$ and $\textbf{ATGCGTA}$.

Read more