
Low cost portability for statistical machine translation based on n-gram coverage

Eck, M.; Vogel, S.; Waibel, A.

Abstract:

Statistical machine translation relies heavily on the available training data. However, in some cases, it is necessary to limit the amount of training data that can be created for or actually used by the systems. To solve that problem, we introduce a weighting scheme that tries to select more informative sentences first. This selection is based on the previously unseen n-grams the sentences contain, and it allows us to sort the sentences according to their estimated importance. After sorting, we can construct smaller training corpora, and we are able to demonstrate that systems trained on much less training data show a very competitive performance compared to baseline systems using all available training data.
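The selection idea described in the abstract can be illustrated with a short greedy sketch. This is not the authors' implementation. The abstract does not give the exact weighting scheme, so the score used here (number of distinct previously unseen n-grams divided by sentence length), the whitespace tokenization, the default `max_n = 2`, and the function names `ngrams` and `sort_by_unseen_ngrams` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], max_n: int) -> Set[NGram]:
    """Return the set of all n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def sort_by_unseen_ngrams(sentences: List[str], max_n: int = 2) -> List[str]:
    """Greedily order sentences so that those contributing the most
    previously unseen n-grams (per token) come first.

    Assumption: the weight is len(unseen n-grams) / len(tokens); the
    paper's actual weighting may differ. Empty sentences are skipped.
    """
    remaining = [(s, ngrams(s.split(), max_n)) for s in sentences if s.split()]
    seen: Set[NGram] = set()
    ordered: List[str] = []
    while remaining:
        # Re-score every remaining sentence against the n-grams covered so far.
        best = max(
            remaining,
            key=lambda item: len(item[1] - seen) / len(item[0].split()),
        )
        remaining.remove(best)
        seen |= best[1]  # n-grams of the chosen sentence are now covered
        ordered.append(best[0])
    return ordered


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat sat", "a dog barked loudly", "the cat"]
    for sentence in sort_by_unseen_ngrams(corpus):
        print(sentence)
```

A smaller training corpus would then be the first k sentences of the sorted list. Rescoring every remaining sentence in each round is quadratic in the corpus size, which is acceptable for a sketch but would need a more efficient update for large corpora.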


Publisher's version
DOI: 10.5445/IR/1000009615
Published on: 24.06.2025
Affiliated institution(s) at KIT: Institut für Theoretische Informatik (ITI)
Publication type: Proceedings paper
Publication year: 2004
Language: English
Identifier: KITopen-ID: 1000009615
Published in: Proceedings of Machine Translation Summit X: Papers, Phuket, 13th-15th September 2005
Event: 10th Machine Translation Summit (MTS 2005), Phuket, Thailand, 13.09.2005 – 15.09.2005
Publisher: Association for Computational Linguistics (ACL)
Pages: 227–234