Sequence aligners can guarantee accuracy in almost $O(mlog n)$ time: a rigorous average-case analysis of the seed-chain-extend heuristic
In this blog, we will check out a theoretical paper about seed and chain heursitic.
Near-Linear Time Edit Distance for Indel Channels
In this blog, we will check out a theoretical paper about edit distance. In fact edit distance is a kernel idea for sequence alignment in bioinformatics. There are many available tools developed to solve various kinds of alignment scenario. While there exists a gap between theoretical algorithms and practical tools. In practice, we also embed some heuristic part to the tools to facilitate alignment. As theoreticians, they need to figure out some techniques to get a theoretical analysis of these tools that is more close to the practical performance. Further, a good theoretical analysis will light the direction of improvement.
This paper proposed a framework based on the classic NW algorithm with running time $\mathcal{O}(n\lg n)$ and get edit distance with high probability, which is also adapted by many classic implements, like BLAST. Hence, the analysis of the algorithm mentioned in this paper is also a theoretical justification of these heuristic-based alignment tools.
The Backpack Quotient Filter: a dynamic and space-efficient data structure for querying k-mers with abundance
This paper proposes a novel data structure, Backpack Quotient Filter (BQF), to index $k$-mer. which support querying with efficient space and negligible false positive rate. Additionally, this blog will introduce Counting Quotient Filter (CQF)1 at first, to lead a better understanding of this work.
Lecture 5:Suffix Array and Skew Algorithm
In previous blog, we learned about the suffix tree which are $\mathcal{O}(n)$ space complexity. But in practice, it still has a big cost for storing, because Big-Oh notation hidden the coefficient of $n$. We need take around $20$ bytes for per character. So, we will introduce suffix array in this blog, which can implement most of the functions that suffix tree can do. Note that, it’s also a trade-off that we use less space to store suffix array but will take more time on operations on it.
Other part of this blog is talking about an efficient constructing algorithm for suffix array, which is called Skew Algorithm.
Lecture 3:Suffix Trie and Suffix Tree
Last blog, we use Trie to represent all the words in dictionary. In this blog, we have a new task, given a huge collection of genome sequences $\mathcal{G}$, and an RNA sequence $R$, we want to know which genome $G \in \mathcal{G}$ will have a substring equal to $R$. Basically, the collection of genome $\mathcal{G}$ will not change, but we will query different $R$. So, our goal is indexing all the substring of $\mathcal{G}$.
Lecture 2:Trie and Aho-Corasick Algorithm
Last blog, we are talking about searching a pattern string in a text string. In fact, we treat the pattern string as a finite automaton which can jump to other position when we meet a mismatch. In this blog, we will consider a general case, that is find multiple pattern string in a text string. In specific, given a dictionary $D$ of $d$ words which has $m$ characters in total, we need to find all their occurrences in the text string $T$ with length $\vert T \vert = n$. If we employ KMP for each word in dictionary one-by-one, the total running time will be $\mathcal{O}(m+dn)$. In this blog, we will still use the idea similar as KMP, you will see the Trie and Aho-Algorithm in this blog.
100 post articles, 13 pages.