Use of Machine Translation to Obtain Labeled Datasets for Resource-Constrained Languages

Emrah Budur, Rıza Özçelik, Tunga Güngör, Christopher Potts

The large annotated datasets in NLP are overwhelmingly in English. This is an obstacle to progress for other languages. Unfortunately, obtaining new annotated resources for each task in each language would be prohibitively expensive. At the same time, commercial machine translation systems are now robust. Can we leverage these systems to translate English-language datasets automatically? In this paper, we offer a positive answer to this question for natural language inference (NLI) in Turkish. We translated two large English NLI datasets into Turkish and had a team of experts validate their quality. As examples of the new issues that these datasets help us address, we assess the value of Turkish-specific embeddings and the importance of morphological parsing for developing robust Turkish NLI models.
