Automated annotation of human centromeres with HORmon

Centromere Evolution (CE) Postulate

This paper adopts the current view of centromere evolution i.e. CE Postulate to develop algorithms (as of 2022).

  1. Each extant human centromere has evolved from a “single” ancestral HOR formed by “$k$ different” monomers. Hence, each monomer occurs in a HOR only once. The parameter $k$ (number of monomers in a HOR) varies between various centromeres
  2. Each frequent nonhybrid monomer in a centromere has evolved from a single ancestral monomer. The number of ancestral monomers equals the number of frequent nonhybrid monomers in the extant centromere.
  3. Each hybrid monomer has evolved from a concatenate of two (or even more) ancestral monomers and does not participate in the ancestral HOR.
  4. In addition to units formed by canonical HORs, there exist units formed by “partial HORs” (substrings of canonical HORs). All other units consist of a single hybrid monomer and are referred to as “auxiliary HORs.” Although the canonical HOR corresponds to the most frequent unit for most human centromeres, it is not always the case.

Research Gaps and Innovation

Dvorkina et al. (2021) developed the CentromereArchitect tool that addressed the monomer and HOR inference as two separate problems. In particular, the HOR inference was addressed as a data compression problem rather than an evolutionary problem that takes into account the CE postulate. Thus, although CentromereArchitect enabled an automated inference of monomers, it remains unclear whether its HORs inference adequately reflects the centromere evolution. Our analysis revealed that to generate a biologically adequate centromere annotation, the monomer and HOR inference should be viewed as two interconnected problems in such a way that HOR (monomer) generation affects monomer (HOR) generation.

Method

Definitions

  1. monocentromere: sequence of monomer-blocks that represent a decomposition of Centromere into the input monomer-set
  2. monomer graph: Given a monocentromere, its directed monomer graph is constructed on the vertex set of all its monomers and the edge set formed by all pairs of its consecutive monomers. The “multiplicity” of an edge $(M,M’ )$ in the monomer graph is defined as the number of times the monomer $M’$ follows the monomer $M$​ in the monocentromere

Positionally similar monomers

Given a monomer $M$ in a monomer set $Monomers$ for a given $monocentromere$, we identify all triples of consecutive blocks $XMY$ and appears in this $monocentromere$, and construct the $\vert Monomers \vert \times \vert Monomers \vert$ matrix $Triples_M$ where $Triples_M(X,Y)$ is count of the number of triples XMY in the $monocentromere$. Then construct a normalized matrix $NormalizedTriples_M$ (square sum of all its entries is $1$).

Given two monomers $M$ and $M’$, we define positionally similarity $PosSim(M,M’)= \sum_{i,j} NormalizedTriples_M(i,j)\times NormalizedTriples_{M’}(i,j)$.

Two monomers are called similar if the percent identity $>minPI$ (default $94\%$). Two monomers are called positionally similar if the positionally similarity $>minPosSim$ (default $0.4$​).

Merging positionally similar monomers

HORmon do iteratively

  1. Identify pairs of monomers with highest positionally similarity
  2. Merge into new monomer and launches StringDecomposer on new monomer set
  3. Repeat 1 and 2 until no positionally similar pairs of monomers

HORmon do similar processes for triples XYM and MXY as above.

Splitting aggregated monomers

Given a monomer $M$, we refer the largest element $(X,Y)$ in the matrix $NormalizedTriples_M$ as $M$ champion. We classify $(X,Y)$ and $(X’,Y’)$ in the matrix as $M$ comparable if \(NormalizedTriples_M(X,Y)/NormalizedTriples_M(X',Y') > splitValue\) (default $1/8$). HORmon iteratively collect $M$ comparable pairs using single linkage clustering starting from $M$ champion. And then the finally set is call $M$-candidate pairs. Monomer pairs $(X,Y)$ and $(X’,Y’)$ are called independent if all four monomers $X, Y, X$ , and $Y’$ are different. A monomer $M$ is called breakable if all $M$​-candidate pairs are (pairwise) independent, and unbreakable, otherwise.

Given a breakable monomer $M$, HORmon consider all $M$-candidate pairs $(X_1,Y_1) \cdots (X_t,Y_t)$ and splits $M$ into $t$ monomer $M_1,\cdots,M_t$ where $M_i$ is the consensus of all blocks arise from $X_iMY_i$.

Simplified monomer graphs

Given a monomer graph, HORmon construct a complete bipartite graph where each part represents all vertices (monomers) of monomer graph. A monomer $M’$ in the “upper” part is connected with a monomer $M’’$ in the “bottom” part by an edge of the weight equal to the multiplicity of the edge $(M’ ,M’’)$ in the monomer graph. HORmon solves the “assignment problem” to find the “maximum weight bipartite matching” in the bipartite graph.

Then, we have a set of non-overlapping cycles and paths. An edge of a monomer graph is called a removable if it is a chord in one of these cycles/paths. Remove all the removable edges and get simplified monomer graphs.

Inference of hybrid monomers

For monomers $A$, $B$, and $C$, we define $HybridDivergence_A(B, C)$ as the divergence between $A$ and a concatenate of a prefix of $B$ and a suffix of $C$ that is most similar to $A$. A monomer $A$ from a monomer set Monomers is a hybrid candidate of monomers $B$ and $C$ if $HybridDivergence_A(B,C)$ is below the $maxResolvedDivergence$ threshold and does not exceed divergence between $A$ and any another monomer from Monomers. HORmon first generates a set $HybridCandidates$ by iterating over concatenates of all possible prefixes and suffixes for every pair of distinct monomers $B$ and $C$. Afterward, if there is a single pair of monomers $B$ and $C$ that give rise to a hybrid candidate $A$, we classify $A$ as a hybrid of $B$ and $C$. If several pairs of such monomers exist, we select a pair of monomers $B$ and $C$ that are 1) not hybrid candidates themselves and 2) form a concatenate with the minimal divergence from the monomer $A$, and classify $A$ as a hybrid of $B$ and $C$.

Decomposing a centromere into HORs

Consider the complete monomer graph ($DB(Centromere^\star,2)$, i.e. the monomers are frequent monomers from CentromereArchitect), we classify an edge as

  1. HOR edge if it connects two consecutive monomers in a HOR
  2. otherwise, non-HOR edge

A monocentromere defines a traversal of edges in the graph. We break the monocentromere at all the non-HOR edges. Each substrings represent

  1. A canonical HOR or multiple consecutive canonical HORs and followed by a partial HOR.
  2. A partial HOR that includes monomers from $i$ to $j$ denoted $p_{i,j}$.
  3. An auxiliary HOR represented by hybrid or an infrequent monomer

Results for all human chromosomes

Dataset

We extracted the alpha satellite arrays from the assembly (public release v1.0) of the effectively haploid CHM13 human cell line constructed by the T2T Consortium. We also extracted the alpha satellite array of the newly assembled centromeres of Chromosome X and Chromosome Y from the HG002 cell line sequenced by the HPRC.

Ground truth (manually annotation)

Shepelev et al. 2015

Uralsky et al. 2019

McNulty and Sullivan 2018

Alexandrov et al. 2001

Paper data link https://figshare.com/articles/dataset/HORmon/16755097/2?file=34438856

Overview

  1. Chr 3, 11, 14, 16, 17, 19, 20, 21, 22, X and Y, we can observe cycles directly from monomer graph by removing edges that have multiplicity below threshold and vertices that are hybrid monomers. (Original paper describe unclear here, this claim is from my guess)
  2. Chr 2, 4, 6, 7, 10, 12, 14, 15, we can observe cycles directly from simplified monomer graph.
  3. Chr 1, 5, 8, 9, 13, and 18 do not have Hamilton cycles and represent special cases shown below

Splitting unbreakable monomers reveals HORs in cen1, cen13, and cen18

Split junction vertex and then still do simplified monomer graph.

Cen1 and Cen13 get clear results after splitting unbreakable monomers.

Cen18 gets cycle after process but raises a concern about the validity of splitting the unbreakable monomer.

Dehybridization reveals HORs in cen5 and cen8

Replace hybrid vertex to hybrid edge.

Cen9

No way can build a Hamiltonian cycle, except changing the parameters.

Thoughts and Comments

  1. Learn the latest theories of centromere to replace CE postulate.
  2. Combine HOR and monomer inference to a problem is a good idea. But too many magic parameters here, can we do better?
  3. For me, it is like a case-lead solving algorithm. Can we do really automatically and de-novo?



    Enjoy Reading This Article?

    Here are some more articles you might like to read next:

  • Momonmer Types, Suprachromosomal Families (SFs) and HORs Identification
  • CentromereArchitect--inference and analysis of the architecture of centromeres
  • FastGA -- fast genome alignment
  • Polynimial Reduction
  • Gready Alg and Local Search:Minimum-Degree Spanning Tree