FastGA -- fast genome alignment
From algorithms to tools – engineering optimization
This blog, I will do a literature review for paper of FastGA. The initial point that motivated me to read this paper is from the chats with other people at Genome Informatics 2025 (by the way, it is a wonderful conference) and some experiences met in my research. I realize engineering optimization is very important in the field of bioinformatics. As I mentioned in my academic page I hope to
-
Design methods with solid math/algorithms or at least mathematical intuition.
-
Build practically useful tools with heuristics benefiting the bio/bioinfo communities.
As a math/algorithm person, the first thing is easier for me and this type of works always touch me. But the second thing I may ignore and think about it in a too naive way. I was thinking a useful tool is to design the software more user-friendly, like clear and easy instructions for install and running, deploying software on website that could be easier for biologist to use. Of course, these things matters in our community and are also worth to do. But these things do not match my taste in most time (but it also provides a sense of accomplishment for me when people can use my tool by just clicking rather than “git clone repo”).
My research focuses on methodology, specifically, developing faster, less space or more accurate bioinformatics methods. In general, these three things cannot be achieved in the same time, computer scientists try to balance these things in reality, in other words, try to have a better trade-off. In modern seqbio research, one of the common topic is to sacrifice accuracy in an acceptable level and achieve super fast running time. But when I met this kind of problem, I alway use asymptotic time to understand, think and explore. When I pub a paper, the codes implemented is just a prototype to valid my idea (but this also works for theoretical paper, they have different focus). But in recent months, I gradually realized
- Algorithms in same asymptomatic time could achieve very different wall clock times. That depends on the implementation and carefully design to suite modern computer architecture.
- And even faster asymptomatic time does not means faster wall clock time.
I realized that I have to consider these engineering optimizations, not just for beating other tools when pub a paper (sure, it is important for me, otherwise a shining algorithm/theory may be overlooked because it does not achieve a good practical performance due to lack of engineering optimization), but for developing some popular and practically useful tool in this community. In this context, minimap2, Edlib and other tools is our role model.
So, here I want learn from this paper/tool FastGA, that shows another art by how to polish each details of an algorithm to implement a practical faster tool. This call back my title how to implement the algorithm to a tool is a new thing for me to pay more attention to. This paper also provides a good example to illustrate this idea, and also this is the main philosophy behind FastGA.
Creat a k-mer index H of a genome h (suffix tree, hash table, FM-index)
For each position i (k-mer starts form i) in genome g do:
Look up k_i in H, and for all postions q found, report (p,q)
The look-up in index H is not a cache-coherent process i.e. including a lot of cache miss. Let’s consider another implementation (assume we use cache-coherent sorting algorithm)
Build a list G = {(p,k_p)} for each postion p and its k-mer in genome g and sort on the k-mer.
Build a list H = {(q,k_q)} for each postion p and its k-mer in genome h and sort on the k-mer.
Merge G and H in order of k-mer and make list C of pairs (p,q) with the same k-mer.
The first and second algorithm can be done in linear time in asymptomatic (we can call radix sort for second one to achieve $O(N)$), maybe the first one can have less operations than second one, BUT memory latency caused by cache miss makes the first algorithm worse than the second. In this case, the constants behind big-$O$ notions matter!
Background and Abstract
I do not want to emphasize the importance of alignment between two genomes again. But I really like the definition of problems “genome alignment” and “genome homology” mentioned in the paper.
- Genome alignment is the problem of finding all regions that have a statistically significant alignment between them
- Genome homology is the problem of finding regions that evolved from the same region of an ancestral genome.
Of course, FastGA is a tool for first goal. And we review the pervious work here
- LastZ: Blast-like seed and extend strategy; Space-seed.
- NUCmer: maximal unique matches (MUMs) as seed hits; seed match based on suffix arrays.
- Minimap2: reduce k-mers by minimizer; affine gaps.
- Wfmash: pangenome graph builder; Min-hash estimate Jaccard similarity; for high similarity companion.
The innovations of FastGA is from engineering and algorithms, and it can compare two $>2Gbp$ genomes in 2mins using 8 threads on a Laptop.
Preliminary
- Canonical value $\Phi(\alpha)$ of a string $\alpha$ is the smaller value of $\alpha$ or its complements i.e $\Phi(\alpha) = \min (\phi(\alpha),\phi(\alpha^c))$, where $\phi$ is a hash function.
- Syncmers: Given a k-mer $\alpha$ and $m<k$, $\alpha$ is a closed $(k,m)$ syncmer iff the canonical value of the m-mer at the start or end of $\alpha$ is the smallest of all m-mers of $\alpha$ i.e. $\min (\Phi(\alpha[0,m]), \Phi(\alpha[k-m,k])) = \min_{I \in [0,k-m]} \Phi(\alpha[i,i+m])$. The average density of closed $(k,m)$ syncmer is around $2/(k-m)$.
- Adaptamers: Given a position $i$ in genome $A$, the adaptamer at position $i$ is the longest string $t$ starting from $i$, such that $t$ is a substring of genome $B$. Note, this concept is not symmetric. In this paper, FastGA filters keeps the adapamers $ \leq \tau$.
Methods
I do not obey the order of the original paper, I try to organized it in a pipeline order/logic. Basically, this a typical seed-and-chain alignment tool, finding adaptamers, chaining, and then extending.
Genome Index
- Observations: very long adaptamers in genomes is rare and their prefix is sufficient to locate a specific match on second genomes. So, FastGA uses $K=40$, in other words, FastGA find adamptamers $\leq 40bp$.
- Use following cache-coherent MSD to sort list of $K$-mers in genomes and its complement with copy numbers less than $\tau$ (user specify). For each $K$-mer, record the positions and orientations (represented in a signed integer).
- When doing sorting for the index $\mathcal{G}$ of genome, FastGA will also generate a **longest common prefix (lcp) **array, in which each position $i$ stores the length of lcp between k-mers of $\mathcal{G}[i]$ and $\mathcal{G}[i-1]$ in sort order.
- If we consider all the $K$-mers and its complement, there are roughly $26N$ bytes are needed. To reduce this, FastGA only keeps $K$-mers whose first $s$ bases are a close $(s,m)$ syncmer (using $(s,m)=(12,8)$ in default). Note
- a) If an adaptamer with length $\geq 12bp$ and its corresponding $K$-mer is selected in $\mathcal{G}$, the adaptamer and its corresponding $K$-mer will also be selected in $\mathcal{H}$;
-
b) the complement of a syncmer is also a syncmer, so the matches between forward $K$-mer in $\mathcal{G}$ and reverse $K$-mer in $\mathcal{H}$ are kept.
-
c) the adaptamer with length $< 12bp$ maybe lost, but they are few and is spurious match in most cases.
-
d) it turns that we only need $11N$ bytes now for index.
- $K=40, s = 12,$ and $m = 8$ are compile time parameters of FastGA.
- For each entry of index, FastGA use 16 byte encode ($40$-mer, contig, position). $40$-mer takes $10$ bytes by using 2-bits per base, contig number takes $2$ bytes and position takes $4$ bytes.
Most Significant Digit (MSD) Radix Sorting (Refined version)
Finding Adaptamers
Find Seed Chains and Alignment
Storing and Delivering Alignments
Enjoy Reading This Article?
Here are some more articles you might like to read next: