Diacritization as a machine translation problem and as a sequence labeling problem

Schlippe, Tim; Nguyen, ThuyLinh; Vogel, Stephan

Abstract:

In this paper we describe and compare two techniques for the automatic diacritization of Arabic text. First, we treat diacritization as a monotone machine translation problem, proposing and evaluating several translation and language models: word-based and character-based models, used separately and in combination, as well as a model that uses statistical machine translation (SMT) to post-edit the output of a rule-based diacritization system. Then we explore a more traditional view of diacritization as a sequence labeling problem and propose a solution using conditional random fields (Lafferty et al., 2001). All techniques are compared by word error rate and diacritization error rate, both for full diacritization and when ignoring vowel endings. In our experiments, the machine translation approaches achieved lower error rates than the sequence labeling approaches.

DOI: 10.5445/IR/1000166370

Published on 19.02.2024

Affiliated institution(s) at KIT	Institut für Anthropomatik und Robotik (IAR)
Publication type	Proceedings paper
Publication year	2008
Language	English
Identifier	KITopen-ID: 1000166370
Published in	Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Student Research Workshop
Event	8th Conference of the Association for Machine Translation in the Americas (AMTA 2008), Waikiki, HI, USA, 21.10.2008 – 25.10.2008
Publisher	Association for Machine Translation in the Americas (AMTA)
Pages	270–278
