
Improving In-Domain Data Selection for Small In-Domain Sets

Mediani, Mohammed; Winebarger, Joshua; Waibel, Alexander

Abstract:

Finding sufficient in-domain text data for language modeling is a recurrent challenge. Some methods have already been proposed for selecting parts of out-of-domain text data most closely resembling the in-domain data using a small amount of the latter. Including this new “near-domain” data in training can potentially lead to better language model performance, while reducing training resources relative to incorporating all data. One popular, state-of-the-art selection process based on cross-entropy scores makes use of in-domain and out-of-domain language models. In order to compensate for the limited availability of the in-domain data required for this method, we introduce enhancements to two of its steps. Firstly, we improve the procedure for drawing the out-of-domain sample data used for selection. Secondly, we use word-associations in order to extend the underlying vocabulary of the sample language models used for scoring. These enhancements are applied to selecting text for language modeling of talks given in a technical subject area. Besides comparing perplexity, we judge the resulting language models by their performance in automatic speech recognition and machine translation tasks.
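The cross-entropy-based selection mentioned in the abstract is commonly formulated as a cross-entropy difference: each candidate sentence is scored under an in-domain model and under a model trained on a sample of the out-of-domain data, and sentences with the lowest difference are kept. The Python sketch below illustrates that general idea only. It assumes add-one-smoothed unigram models over a shared vocabulary, and every function name in it is hypothetical; it does not implement the sampling or vocabulary-extension enhancements the paper proposes.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one-smoothed unigram model over a fixed vocabulary (a simplification)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(sentence, model):
    """Per-word cross-entropy in bits; unseen words fall back to <unk>."""
    words = sentence.split()
    log_prob = sum(math.log2(model.get(w, model["<unk>"])) for w in words)
    return -log_prob / len(words)


def rank_by_cross_entropy_difference(in_domain, out_sample, pool):
    """Sort pool sentences by H_in(s) - H_out(s); lower means closer to in-domain."""
    vocab = {w for s in in_domain + out_sample for w in s.split()}
    vocab.add("<unk>")
    lm_in = train_unigram(in_domain, vocab)
    lm_out = train_unigram(out_sample, vocab)
    scored = [
        (cross_entropy(s, lm_in) - cross_entropy(s, lm_out), s)
        for s in pool
        if s.split()  # skip empty lines
    ]
    return sorted(scored)
```

Calling `rank_by_cross_entropy_difference(in_domain, out_sample, pool)` returns `(score, sentence)` pairs in ascending order of score; a selection step would then keep the sentences below a threshold tuned on held-out in-domain text. A realistic setup would use higher-order n-gram models with stronger smoothing rather than unigrams.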


Publisher's version
DOI: 10.5445/IR/1000166290
Published on 6 February 2024
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings paper
Publication year: 2014
Language: English
Identifier: KITopen-ID: 1000166290
Published in: Proceedings of the 11th International Workshop on Spoken Language Translation: Papers. Ed.: M. Federico, S. Stüker, F. Yvon
Event: 11th International Workshop on Spoken Language Translation (IWSLT 2014), Lake Tahoe, NV, USA, 4–5 December 2014
Publisher: Association for Computational Linguistics (ACL)
Pages: 249–256