
Low cost portability for statistical machine translation based on n-gram frequency and TF-IDF

Eck, M.; Vogel, S.; Waibel, A.

Abstract:

Statistical machine translation relies heavily on the available
training data. In some cases it is necessary to limit the amount
of training data that can be created for or actually used by the
systems. We introduce weighting schemes which allow us to
sort sentences based on the frequency of unseen n-grams. A
second approach uses TF-IDF to rank the sentences. After
sorting, we can select smaller training corpora, and we show
that systems trained on much less training data achieve very
competitive performance compared to baseline systems that
use all available training data.
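
The n-gram-based selection described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so everything specific here is an assumption made for illustration: the function names `ngrams` and `select_sentences`, the `budget` and `max_n` parameters, counting each n-gram once per sentence, normalizing by sentence length, and the greedy re-scoring after each pick. It should not be read as the authors' implementation.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Return the set of all n-grams (as tuples) of length 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def unseen_weight(tokens, freq, seen, max_n):
    """Sum of corpus frequencies of not-yet-covered n-grams, per token.

    Assumption: dividing by sentence length keeps long sentences from
    winning only because they contain more n-grams.
    """
    if not tokens:
        return 0.0
    unseen = ngrams(tokens, max_n) - seen
    return sum(freq[g] for g in unseen) / len(tokens)


def select_sentences(sentences, budget, max_n=2):
    """Greedily pick up to `budget` sentences, best-weighted first."""
    tokenized = [s.split() for s in sentences]

    # Assumption: each n-gram is counted once per sentence it occurs in.
    freq = Counter()
    for tokens in tokenized:
        freq.update(ngrams(tokens, max_n))

    seen = set()                      # n-grams covered by the selection so far
    remaining = list(range(len(sentences)))
    selected = []
    while remaining and len(selected) < budget:
        # Re-score every remaining sentence against the current coverage;
        # ties go to the sentence that appears earlier in the corpus.
        best = max(
            remaining,
            key=lambda i: unseen_weight(tokenized[i], freq, seen, max_n),
        )
        selected.append(best)
        remaining.remove(best)
        seen |= ngrams(tokenized[best], max_n)
    return [sentences[i] for i in selected]


corpus = ["the cat sat", "the cat", "a dog barked loudly", "the"]
print(select_sentences(corpus, budget=2, max_n=1))
```

Re-scoring after every pick makes this sketch quadratic in the number of sentences; a practical version for a large corpus would likely need an incremental update of the weights.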


Postprint
DOI: 10.5445/IR/1000009612
Published on: 24.06.2025
Affiliated institution(s) at KIT: Institut für Theoretische Informatik (ITI)
Publication type: Proceedings paper
Publication year: 2005
Language: English
Identifier: KITopen-ID 1000009612
Published in: International Workshop on Spoken Language Translation, IWSLT 2005, 24th-25th October 2005, Pittsburgh, USA.
Event: 2nd International Workshop on Spoken Language Translation (IWSLT 2005), Pittsburgh, PA, USA, 24.10.2005 – 25.10.2005
Publisher: Association for Computational Linguistics (ACL)