Domain-independent Punctuation and Segmentation Insertion

Waibel, Alex; Cho, Eunah; Niehues, Jan

Abstract:

Punctuation and segmentation are crucial in spoken language translation, as they have a strong impact on translation performance. However, the impact of rare or unknown words on the performance of punctuation and segmentation insertion has not been thoroughly studied. In this work, we simulate various degrees of domain match in the test scenario and investigate their impact on the punctuation insertion task. We explore three schemes for generalizing rare words using part-of-speech (POS) tokens. Experiments show that generalizing rare and unknown words greatly improves punctuation insertion performance, with an improvement of up to 8.8 F-score points in the out-of-domain test scenario. We show that this improvement in punctuation quality has a positive impact on subsequent machine translation (MT) performance, improving it by 2 BLEU points.
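
The abstract does not spell out the three generalization schemes, so the following is only an illustrative sketch of the general idea: words that are rare or unseen in the training data are replaced by their POS token before the text is passed to a punctuation model. The function names, the `min_count` threshold, the `<TAG>` placeholder format, and the toy tag set are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

def build_vocab(training_sentences, min_count=2):
    """Collect words seen at least `min_count` times (threshold is an assumption)."""
    counts = Counter(
        word.lower() for sent in training_sentences for word, _pos in sent
    )
    return {word for word, count in counts.items() if count >= min_count}

def generalize(sentence, vocab):
    """Replace every word outside `vocab` with a placeholder built from its POS tag."""
    return [
        word.lower() if word.lower() in vocab else f"<{pos}>"
        for word, pos in sentence
    ]

# Toy, hand-tagged data; a real setup would obtain tags from a POS tagger.
train = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
vocab = build_vocab(train, min_count=2)

# "axolotl" never occurs in the training data, so it becomes its POS token.
test = [("the", "DET"), ("axolotl", "NOUN"), ("sleeps", "VERB")]
generalized = generalize(test, vocab)
print(generalized)
```

In this sketch the same mapping would be applied to both training and test input, so that the punctuation model sees POS tokens in place of rare words in either case.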


Publisher's version
DOI: 10.5445/IR/1000166206
Published on 17.01.2024
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings paper
Publication year: 2017
Language: English
Identifier: KITopen-ID: 1000166206
Published in: Proceedings of the 14th International Conference on Spoken Language Translation. Ed.: S. Sakti, M. Utiyama
Event: 14th International Conference on Spoken Language Translation (IWSLT 2017), Tokyo, Japan, 14.12.2017 – 15.12.2017
Publisher: Association for Computational Linguistics (ACL)
Pages: 74–81