Given a threshold $L$ and a set $\mathcal{R} = \{R_1, \ldots, R_m\}$ of $m$ haplotype sequences, each of length $n$, the minimum segmentation problem for founder reconstruction is to partition the sequences into disjoint segments $\mathcal{R}[i_1{+}1,i_2], \mathcal{R}[i_2{+}1, i_3], \ldots, \mathcal{R}[i_{r-1}{+}1, i_r]$, where $0 = i_1 < \cdots < i_r = n$ and $\mathcal{R}[i_{j-1}{+}1, i_j]$ is the set $\{R_1[i_{j-1}{+}1, i_j], \ldots, R_m[i_{j-1}{+}1, i_j]\}$, such that the length of each segment, $i_j - i_{j-1}$, is at least $L$ and $K = \max_j\{ |\mathcal{R}[i_{j-1}{+}1, i_j]| \}$, the largest number of distinct substrings in any segment, is minimized. The distinct substrings in the segments $\mathcal{R}[i_{j-1}{+}1, i_j]$ represent founder blocks that can be concatenated to form $K$ founder sequences representing the original $\mathcal{R}$ such that crossovers happen only at segment boundaries. We give an optimal $O(mn)$-time algorithm for the problem, improving on the earlier $O(mn^2)$-time solution. This improvement makes it feasible to apply the algorithm in a pan-genomic setting where the haplotypes are complete human chromosomes, with the goal of finding a representative set of references that can be indexed for read alignment and variant calling.
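To make the problem statement concrete, the following is a minimal sketch of a naive dynamic program over segment boundaries. It is not the $O(mn)$ algorithm announced above, and it is slower than the earlier $O(mn^2)$ solution because it recounts the distinct substrings of every candidate segment from scratch. The function name `min_segmentation` and the parameter name `reads` are hypothetical, chosen only for illustration; boundaries are reported as 0-based cut positions, so `[0, 2, 4]` corresponds to the segments $\mathcal{R}[1,2]$ and $\mathcal{R}[3,4]$.

```python
def min_segmentation(reads, L):
    """Naive DP for the minimum segmentation problem (illustrative only).

    reads: list of m equal-length strings (the haplotypes R_1..R_m).
    L:     minimum allowed segment length.
    Returns (K, boundaries), or None if no valid segmentation exists.
    """
    n = len(reads[0])
    INF = float("inf")
    # best[j] = smallest achievable K for the prefix of length j
    best = [INF] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0

    for j in range(L, n + 1):
        # The last segment covers positions i+1..j (1-based), so j - i >= L.
        for i in range(0, j - L + 1):
            if best[i] == INF:
                continue
            # Number of distinct substrings in the candidate segment.
            distinct = len({r[i:j] for r in reads})
            cost = max(best[i], distinct)
            if cost < best[j]:
                best[j] = cost
                back[j] = i

    if best[n] == INF:
        return None

    # Recover the segment boundaries by following the back pointers.
    boundaries = [n]
    while boundaries[-1] > 0:
        boundaries.append(back[boundaries[-1]])
    boundaries.reverse()
    return best[n], boundaries


if __name__ == "__main__":
    haplotypes = ["0000", "0011", "1100", "1111"]
    print(min_segmentation(haplotypes, 2))
```

For these four haplotypes and $L = 2$, cutting after position 2 leaves only the two distinct blocks `00` and `11` in each segment, so $K = 2$, whereas a single segment of length 4 would require $K = 4$.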
