
Letter N-Gram-based Input Encoding for Continuous Space Language Models

Sperr, H.; Niehues, J.; Waibel, A.

Abstract:

We present a letter-based encoding for words in continuous space language models. Instead of using the word index, we represent each word entirely by its letter n-grams, so that similar words automatically receive similar representations. We hope this generalizes better to unknown or rare words and also captures morphological information. We show the influence of this encoding on machine translation using continuous space language models based on restricted Boltzmann machines. We evaluate translation quality and training time on a German-to-English translation task covering TED and university lectures, and on the news translation task from English to German. The new approach yields a gain of up to 0.4 BLEU points.
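The idea of replacing a word index with letter n-grams can be illustrated with a minimal sketch. It is not the authors' implementation: the trigram size, the `<` and `>` boundary markers, the use of counts rather than binary features, and all function names (`letter_ngrams`, `build_ngram_index`, `encode`, `cosine`) are assumptions made here for illustration only.

```python
from collections import Counter
from math import sqrt


def letter_ngrams(word, n=3):
    """Return the letter n-grams of `word`, padded with boundary markers."""
    padded = "<" + word + ">"  # boundary markers are an assumption of this sketch
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_index(words, n=3):
    """Map every n-gram seen in `words` to a position in the input vector."""
    ngrams = sorted({g for w in words for g in letter_ngrams(w, n)})
    return {g: i for i, g in enumerate(ngrams)}


def encode(word, index, n=3):
    """Encode `word` as a vector of n-gram counts; unseen n-grams are dropped."""
    vec = [0] * len(index)
    for gram, count in Counter(letter_ngrams(word, n)).items():
        if gram in index:
            vec[index[gram]] = count
    return vec


def cosine(a, b):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocab = ["translate", "translated", "house"]
index = build_ngram_index(vocab)
vectors = {w: encode(w, index) for w in vocab}

# Morphologically related words overlap in most n-grams; unrelated words do not.
related = cosine(vectors["translate"], vectors["translated"])
unrelated = cosine(vectors["translate"], vectors["house"])

# A word outside the vocabulary still receives a non-empty representation.
unseen = encode("translates", index)
```

In this toy vocabulary, "translate" and "translated" share most of their trigrams while "house" shares none, and the unseen form "translates" is still encoded through the trigrams it has in common with known words. A word-index encoding would give it no usable input at all.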


Publisher's version
DOI: 10.5445/IR/1000037718
Published on 12.06.2025
Affiliated institution(s) at KIT: Fakultät für Informatik – Institut für Anthropomatik (IFA)
Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings contribution
Publication year: 2013
Language: English
Identifier: ISBN: 978-1-937284-67-1
KITopen-ID: 1000037718
Published in: ACL 2013 : 51st Annual Meeting of the Association for Computational Linguistics : Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, August 9, 2013, Sofia, Bulgaria
Event: 8th ACL Workshop on Statistical Machine Translation (WMT 2013), Sofia, Bulgaria, 08.08.2013 – 09.08.2013
Publisher: Association for Computational Linguistics (ACL)
Pages: 30-39