
Tools for Collecting Speech Corpora via Mechanical-Turk

Lane, Ian; Waibel, Alex; Eck, Matthias; Rottmann, Kay


One of the most difficult tasks in rapidly porting speech applications to new languages is the initial collection of sufficient speech corpora. State-of-the-art automatic speech recognition systems are typically trained on hundreds of hours of speech data. While corpora exist for major languages, a sufficient amount of quality speech data is not available for most of the world's languages. Previous work has focused on collecting translations and transcribing audio via Mechanical Turk; in this paper we introduce two tools that enable speech data to be collected remotely. We then compare the quality of audio collected from paid part-time staff and unsupervised volunteers, and determine that basic user training is critical to obtaining usable data.

Publisher's version
DOI: 10.5445/IR/1000166345
Published on: 9 February 2024
Affiliated institution(s) at KIT: Institut für Anthropomatik und Robotik (IAR)
Publication type: Proceedings contribution
Publication year: 2010
Language: English
Identifier: KITopen-ID: 1000166345
Published in: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. Ed.: C. Callison-Burch, M. Dredze
Event: Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk (CSLDAMT 2010), Los Angeles, CA, USA, 6 June 2010
Publisher: Association for Computational Linguistics (ACL)
Pages: 184–187