Context binning, model clustering and adaptivity for data compression of genetic data

Jarek Duda

The rapid growth of genetic databases means that improvements in their data compression can yield large savings, which requires better inexpensive statistical models. This article proposes automated optimizations of Markov-like models, in particular context binning and model clustering. Context binning merges similar contexts to reduce model size, e.g. allowing inexpensive approximations of high-order models. Model clustering applies k-means clustering in the space of statistical models, making it possible to optimize a few models (as cluster centroids) from which one is chosen, e.g. separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed. This article is a work in progress, to be expanded in the future.
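
To make the idea of k-means in the space of statistical models more concrete, the following is a minimal sketch and not the article's actual implementation. It assumes the simplest possible model per read (a smoothed order-0 nucleotide distribution rather than a Markov-like context model), uses Kullback-Leibler divergence (excess bits per symbol) as the "distance", and updates each centroid as the mean of the distributions assigned to it, which minimizes the total divergence within a cluster. The names `order0_model`, `excess_bits`, and `cluster_models`, the pseudocount, and the farthest-first initialization are all illustrative assumptions.

```python
import math

ALPHABET = "ACGT"

def order0_model(read, pseudocount=1.0):
    """Smoothed order-0 nucleotide distribution of one read (illustrative model)."""
    counts = [pseudocount + read.count(s) for s in ALPHABET]
    total = sum(counts)
    return [c / total for c in counts]

def excess_bits(model, centroid):
    """KL divergence in bits: extra cost per symbol of coding `model` with `centroid`."""
    return sum(p * math.log2(p / q) for p, q in zip(model, centroid))

def cluster_models(models, k, iterations=20):
    """Lloyd-style k-means over probability distributions."""
    # Assumed initialization: farthest-first, starting from the first model.
    centroids = [list(models[0])]
    while len(centroids) < k:
        farthest = max(models, key=lambda m: min(excess_bits(m, c) for c in centroids))
        centroids.append(list(farthest))

    assignment = [0] * len(models)
    for _ in range(iterations):
        # Assignment step: each read picks the centroid that is cheapest to code with.
        assignment = [
            min(range(k), key=lambda j: excess_bits(m, centroids[j])) for m in models
        ]
        # Update step: the mean distribution minimizes the summed KL divergence.
        for j in range(k):
            members = [m for m, a in zip(models, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Toy reads: two A-rich and two G-rich.
reads = ["AAAAAAAT", "AAAAAATA", "GGGGGGGC", "GGGGGCGG"]
models = [order0_model(r) for r in reads]
centroids, assignment = cluster_models(models, k=2)
```

In a compressor built along these lines, the index in `assignment` would be stored once per read, and the read would then be entropy-coded with the corresponding centroid model; the same loop applies to higher-order models if each model is represented as a set of per-context distributions.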
