Diacritization as a machine translation problem and as a sequence labeling problem

Schlippe, Tim; Nguyen, ThuyLinh; Vogel, Stephan


In this paper we describe and compare two techniques for the automatic diacritization of Arabic text. First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models, including word-based and character-based models, used separately and in combination, as well as a model that uses statistical machine translation (SMT) to post-edit a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem and propose a solution using conditional random fields (Lafferty et al., 2001). All techniques are compared by word error rate and diacritization error rate, both for full diacritization and when vowel endings are ignored. In our experiments, the machine translation approaches achieved lower error rates than the sequence labeling approaches.
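The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not define them, so the definitions used here are assumptions: diacritization error rate is taken as the fraction of letters whose diacritics differ from the reference, word error rate as the fraction of words containing at least one such letter, and "ignoring vowel endings" as skipping the last letter of each word. The function names `split_word` and `error_rates` are hypothetical, and the sketch assumes the reference and hypothesis contain the same undiacritized letters in the same order.

```python
# Arabic diacritic code points U+064B..U+0652 (fathatan through sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_word(word):
    """Return a list of [letter, diacritics] pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            pairs[-1][1] += ch  # attach the mark to the preceding letter
        else:
            pairs.append([ch, ""])
    return pairs


def error_rates(reference, hypothesis, ignore_endings=False):
    """Return (WER, DER) under the assumed definitions above.

    Assumes both strings have the same words and the same base letters.
    """
    word_errors = words = char_errors = chars = 0
    for ref_w, hyp_w in zip(reference.split(), hypothesis.split()):
        ref_p, hyp_p = split_word(ref_w), split_word(hyp_w)
        if ignore_endings:
            # Assumed reading of "ignoring vowel endings": drop the final letter.
            ref_p, hyp_p = ref_p[:-1], hyp_p[:-1]
        wrong = sum(r[1] != h[1] for r, h in zip(ref_p, hyp_p))
        char_errors += wrong
        chars += len(ref_p)
        word_errors += wrong > 0
        words += 1
    return word_errors / words, char_errors / chars
```

Calling `error_rates(reference, hypothesis)` gives the full-diacritization rates, and passing `ignore_endings=True` gives the variant that disregards word-final marks. The paper's own scoring may differ in details such as the handling of punctuation or undiacritized tokens.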

DOI: 10.5445/IR/1000166370
Published on: 19.02.2024
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings contribution
Publication year: 2008
Language: English
Identifier: KITopen-ID: 1000166370
Published in: Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop
Event: 8th Conference of the Association for Machine Translation in the Americas (AMTA 2008), Waikiki, HI, USA, 21.10.2008 – 25.10.2008
Publisher: Association for Machine Translation in the Americas (AMTA)
Pages: 270–278
