Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation

Ayush Maheshwari, Ashim Gupta, Amrith Krishna, Ganesh Ramakrishnan, G. Anil Kumar, Jitin Singla

Sanskrit is a low-resource language with a rich heritage. Digitized Sanskrit corpora that reflect contemporary usage of the language, particularly in prose, are heavily under-represented. At present, no English-Sanskrit parallel dataset of this kind is publicly available. To bridge this gap, we release Sāmayik, a dataset of more than 42,000 parallel English-Sanskrit sentences drawn from four different corpora. We also release benchmarks adapted from existing multilingual pretrained models for Sanskrit-English translation. Our training data includes the training splits of our contemporary dataset and the Sanskrit-English parallel sentences from the training split of Itihāsa, a previously released classical-era machine translation dataset containing Sanskrit.
