Coded Information Retrieval for Block-Structured DNA-Based Data Storage

Daniella Bar-Lev

We study the problem of coded information retrieval for block-structured data, motivated by DNA-based storage systems where a database is partitioned into multiple files that must each be recoverable as an atomic unit. We initiate and formalize the block-structured retrieval problem, wherein $k$ information symbols are partitioned into two files $F_1$ and $F_2$ of sizes $s_1$ and $s_2 = k - s_1$. The objective is to characterize the set of achievable expected retrieval time pairs $\bigl(E_1(G), E_2(G)\bigr)$ over all $[n,k]$ linear codes with generator matrix $G$. We derive a family of linear lower bounds via mutual exclusivity of recovery sets, and develop a nonlinear geometric bound via column projection. For codes with no mixed columns, this yields the hyperbolic constraint $s_1/E_1 + s_2/E_2 \le 1$, which we conjecture to hold universally whenever $\max\{s_1,s_2\} \ge 2$. We analyze explicit codes, such as the identity code, file-dedicated MDS codes, and the systematic global MDS code, and compute their exact expected retrieval times. For file-dedicated codes we prove MDS optimality within the family and verify the hyperbolic constraint. For global MDS codes, we establish dominance by the proportional local MDS allocation via a combinatorial subset-counting argument, providing a significantly simpler proof compared to recent literature and formally extending the result to the asymmetric case. Finally, we characterize the limiting achievability region as $n \to \infty$: the hyperbolic boundary is asymptotically achieved by file-dedicated MDS codes, and is conjectured to be the exact boundary of the limiting achievability region.

picture_as_pdf flag

Knowledge Graph

arrow_drop_up

Comments

Sign up or login to leave a comment