Automated annotation of human centromeres with HORmon

Centromere Evolution (CE) Postulate

Research Gaps

Method

Definitions

  1. monocentromere: sequence of monomer-blocks that represent a decomposition of Centromere into the input monomer-set
  2. monomer graph: Given a monocentromere, its directed monomer graph is constructed on the vertex set of all its monomers and the edge set formed by all pairs of its consecutive monomers. The “multiplicity” of an edge $(M,M’ )$ in the monomer graph is defined as the number of times the monomer $M’$ follows the monomer $M$ in the monocentromere

Positionally similar monomers

Given a monomer $M$ in a monomer set $Monomers$ for a given $monocentromere$, we identify all triples of consecutive blocks $XMY$ and appears in this $monocentromere$, and construct the $\vert Monomers \vert \times \vert Monomers \vert$ matrix $Triples_M$ where $Triples_M(X,Y)$ is count of the number of triples XMY in the $monocentromere$. Then construct a normalized matrix $NormalizedTriples_M$ (square sum of all its entries is $1$).

Given two monomers $M$ and $M’$, we define positionally similarity $PosSim(M,M’)= \sum_{i,j} NormalizedTriples_M(i,j)\times NormalizedTriples_{M’}(i,j)$.

Two monomers are called similar if the percent identity $>minPI$ (default $94\%$). Two monomers are called positionally similar if the positionally similarity $>minPosSim$ (default $0.4$​).

Merging positionally similar monomers

HORmon do iteratively

  1. Identify pairs of monomers with highest positionally similarity
  2. Merge into new monomer and launches StringDecomposer on new monomer set
  3. Repeat 1 and 2 until no positionally similar pairs of monomers

HORmon do similar processes for triples XYM and MXY as above.

Splitting aggregated monomers

Given a monomer $M$, we refer the largest element $(X,Y)$ in the matrix $NormalizedTriples_M$ as $M$ champion. We classify $(X,Y)$ and $(X’,Y’)$ in the matrix as $M$ comparable if \(NormalizedTriples_M(X,Y)/NormalizedTriples_M(X',Y') > splitValue\) (default $1/8$). HORmon iteratively collect $M$ comparable pairs using single linkage clustering starting from $M$ champion. And then the finally set is call $M$-candidate pairs. Monomer pairs $(X,Y)$ and $(X’,Y’)$ are called independent if all four monomers $X, Y, X$ , and $Y’$ are different. A monomer $M$ is called breakable if all $M$​-candidate pairs are (pairwise) independent, and unbreakable, otherwise.

Given a breakable monomer $M$, HORmon consider all $M$-candidate pairs $(X_1,Y_1) \cdots (X_t,Y_t)$ and splits $M$ into $t$ monomer $M_1,\cdots,M_t$ where $M_i$ is the consensus of all blocks arise from $X_iMY_i$.

Simplified monomer graphs

Given a monomer graph, HORmon construct a complete bipartite graph where each part represents all vertices (monomers) of monomer graph. A monomer $M’$ in the “upper” part is connected with a monomer $M’’$ in the “bottom” part by an edge of the weight equal to the multiplicity of the edge $(M’ ,M’’)$ in the monomer graph. HORmon solves the “assignment problem” to find the “maximum weight bipartite matching” in the bipartite graph.

Then, we have a set of non-overlapping cycles and paths. An edge of a monomer graph is called a removable if it is a chord in one of these cycles/paths. Remove all the removable edges and get simplified monomer graphs.

Inference of hybrid monomers

For monomers $A$, $B$, and $C$, we define $HybridDivergence_A(B, C)$ as the divergence between $A$ and a concatenate of a prefix of $B$ and a suffix of $C$ that is most similar to $A$. A monomer $A$ from a monomer set Monomers is a hybrid candidate of monomers $B$ and $C$ if $HybridDivergence_A(B,C)$ is below the $maxResolvedDivergence$ threshold and does not exceed divergence between $A$ and any another monomer from Monomers. HORmon first generates a set $HybridCandidates$ by iterating over concatenates of all possible prefixes and suffixes for every pair of distinct monomers $B$ and $C$. Afterward, if there is a single pair of monomers $B$ and $C$ that give rise to a hybrid candidate $A$, we classify $A$ as a hybrid of $B$ and $C$. If several pairs of such monomers exist, we select a pair of monomers $B$ and $C$ that are 1) not hybrid candidates themselves and 2) form a concatenate with the minimal divergence from the monomer $A$, and classify $A$ as a hybrid of $B$ and $C$.

Decomposing a centromere into HORs

Results for all human chromosomes

Dataset

Thoughts and Comments




    Enjoy Reading This Article?

    Here are some more articles you might like to read next:

  • CentromereArchitect--inference and analysis of the architecture of centromeres
  • Polynimial Reduction
  • Applications of Max flow
  • FastGA -- fast genome alignment
  • NP, NP-Complete Problems