High Performance Neural Networks for Online Speech Recognizer

Nguyen, Thai-Son

doi:10.5445/IR/1000128854

Abstract:

Automatic speech recognition (ASR) describes the ability of a machine to identify the words and expressions of spoken language and to convert them into a human-readable format.
Its applications are an essential part of digital life; for example, they enable dialog between humans and machines or between people who speak different native languages.
To fully provide this ability, ASR applications must respond not only with high accuracy but also quickly enough for interaction with a user.

Abstract (English):

Automatic speech recognition (ASR) refers to the ability of a machine to identify words and phrases in spoken language and convert them to a human-readable format. Its applications are an essential part of digital life, such as allowing verbal dialog between humans and machines or enabling cross-lingual communication between people speaking different native languages. To fully deliver this ability, ASR applications not only need to work with high accuracy but also have to respond quickly enough for their expected interactions with users. This combination of constraints opens up the research area of online speech recognition, as distinct from conventional speech recognition, which addresses accuracy alone.

Research on automatic speech recognition has been active for over half a century. Several pattern and template matching approaches were proposed until the mid-1980s, when the Hidden Markov Model (HMM) became a breakthrough for the speech recognition task. The HMM approach provides a generic framework to statistically decouple and model both temporal and spectral variations in speech. In its current form, an HMM-based recognizer is built on top of a complex pipeline composed of several statistical and non-statistical components such as pronunciation dictionaries, HMM topologies, phoneme cluster trees, an acoustic model and a language model. Recent advances of artificial neural networks (ANN) in both acoustic modeling and language modeling have made the hybrid HMM/ANN approach dominant in many types of ASR applications.

In recent years, the introduction of all-neural end-to-end speech recognition, which uses a neural network architecture to approximate the direct mapping from acoustic signals to the textual transcription, has received significant interest. The advantage of end-to-end approaches lies in simplifying the training of an entire speech recognition system, so that the complicated components of the HMM-based pipeline no longer need to be handled explicitly. At the same time, end-to-end ASR systems typically require a more substantial amount of training data, and it is more challenging to adapt end-to-end models to perform well on a new task.

This thesis is devoted to the development of high-performance speech recognition systems for the online and streaming scenario. The author pursued this goal with a two-stage approach.
In the first stage, various techniques, applied in both the hybrid HMM/ANN and end-to-end paradigms, were proposed to construct high-performance systems in batch mode, i.e., the complete audio data is available when processing starts.
In the second stage, efficient adaptations were explored to make the high-performance batch-mode systems capable of online and run-on inference. At the same time, novel algorithms were developed to reduce user-perceived latency, which is the most critical issue of online speech recognizers.

First Stage. The techniques proposed to reach high performance are grouped by the stage of the speech recognition pipeline they belong to: feature extraction and data augmentation.

Speech signals, a convolution of multiple frequency components over a wide dynamic range, can change dramatically before digitization because of natural factors such as different speakers, environments, or recording devices. This large variability typically causes a mismatch between training and testing conditions, which can substantially degrade recognition performance. We address this mismatch problem by introducing two high-level network-based feature extraction approaches. In the first approach, a new feature space with less speaker variance is constructed via a hierarchical combination of a bottleneck neural network and speaker adaptation techniques such as maximum likelihood linear regression (MLLR) transformation and speaker identity vector (i-vector) extraction. We showed that a deep neural network (DNN) acoustic model trained on these speaker-adapted features improves word error rate (WER) by up to 19% relative over conventional feature extraction. In the second approach, a long short-term memory (LSTM) network trained with the connectionist temporal classification (CTC) criterion on phone labels is used as a high-level feature transformation. The combination of the CTC-network derived features and the bottleneck features resulted in an efficient feature space, which made a DNN acoustic model outperform a strong CTC-based baseline by a large margin. In addition, we showed that using standard cepstral mean and variance normalization (CMVN) at low-level feature extraction can cause a mismatch between offline training and online testing, and proposed a Linear Discriminant Analysis (LDA) based linear transformation as a replacement.
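The CMVN mismatch mentioned above can be illustrated with a minimal sketch. It assumes that offline training normalizes each utterance with statistics of the complete utterance, while an online recognizer only has the frames seen so far; the function names and the toy one-dimensional features are hypothetical and not taken from the thesis.

```python
import math

def utterance_cmvn(frames):
    """Offline CMVN: normalize with mean/variance of the whole utterance."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    # Fall back to a unit scale when a dimension has (almost) no variance.
    std = [math.sqrt(v) if v > 1e-12 else 1.0 for v in var]
    return [[(f[d] - mean[d]) / std[d] for d in range(dim)] for f in frames]

def running_cmvn(frames):
    """Online CMVN (assumed variant): statistics of the frames seen so far."""
    dim = len(frames[0])
    total = [0.0] * dim
    total_sq = [0.0] * dim
    out = []
    for t, f in enumerate(frames, start=1):
        row = []
        for d in range(dim):
            total[d] += f[d]
            total_sq[d] += f[d] * f[d]
            mean = total[d] / t
            var = max(total_sq[d] / t - mean * mean, 0.0)
            std = math.sqrt(var) if var > 1e-12 else 1.0
            row.append((f[d] - mean) / std)
        out.append(row)
    return out

# Toy utterance with one feature dimension (hypothetical values).
frames = [[1.0], [2.0], [3.0], [4.0]]
offline = utterance_cmvn(frames)
online = running_cmvn(frames)
```

In this toy case the first frame is normalized very differently in the two modes, and the outputs only agree once the running statistics have seen the whole utterance, which is the kind of train/test mismatch a fixed linear transformation avoids.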

Data augmentation is used in speech recognition to produce additional training data and thereby improve the quality of the training set, which makes the models more robust and helps avoid overfitting. We identified overfitting as the most critical issue when training end-to-end sequence-to-sequence (S2S) models for the speech recognition task, and proposed two novel on-the-fly data augmentation methods as a solution. The first method, called dynamic time stretching, obtains the effect of speed perturbation by manipulating the time series of the frequency vectors directly with a real-time interpolation function. In the second method, we proposed an efficient strategy to sub-sample speech sentences on the fly and thus enlarge the training data with more variants of the original samples. We showed that these methods are very effective at avoiding overfitting, and that combining them with the SpecAugment method from the literature raised the performance of the proposed S2S model to the state of the art on the telephone conversation benchmark.
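The idea of stretching the time axis of the feature sequence can be sketched as follows. This is only an illustration: it assumes plain linear interpolation and a single stretching factor per utterance, whereas the abstract does not specify the interpolation function or how the factor varies; `time_stretch` and its parameters are hypothetical names.

```python
def time_stretch(frames, factor):
    """Resample a sequence of feature vectors along time by linear interpolation.

    factor > 1 lengthens the sequence (slower speech),
    factor < 1 shortens it (faster speech).
    """
    n = len(frames)
    if n < 2:
        return [list(f) for f in frames]
    new_len = max(2, round(n * factor))
    out = []
    for i in range(new_len):
        # Position of the new frame on the original time axis.
        pos = i * (n - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        # Blend the two neighbouring original frames.
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out

# Toy sequence of three one-dimensional frames (hypothetical values).
stretched = time_stretch([[0.0], [1.0], [2.0]], 2.0)
squeezed = time_stretch([[0.0], [1.0], [2.0], [3.0]], 0.5)
```

For on-the-fly augmentation, a factor would be drawn at random each time a training sample is loaded, so the model rarely sees exactly the same input twice.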

Second Stage. We showed that the proposed high-performance batch-mode ASR systems of both the hybrid HMM/ANN and end-to-end paradigms can meet the requirements of online real-world settings when combined with additional adaptation and inference techniques.

Neither the commonly used real-time factor nor commitment latency is sufficient to indicate the latency that users perceive. We proposed a novel and efficient method for measuring user-perceived latency in online and streaming setups. We further showed that, to better capture user experience, a run-on hybrid HMM/ANN recognizer needs to be optimized for either peak or average latency. To improve these latency metrics, we introduced a mechanism called hypothesis update, which sends hypothetical transcripts to the users early and later revises part of them. Experiments on a real-world setup in the lecture presentation domain showed that this approach reduced the word-based latency of our recognizers from 2.10 to 1.09 seconds, a reduction of roughly 48%.
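The hypothesis update mechanism can be pictured with a small sketch. It assumes an update is encoded as the number of trailing words to retract plus the words to append; this encoding and the function names are hypothetical, since the abstract does not describe the actual protocol.

```python
def hypothesis_update(previous, current):
    """Return (n_retract, new_words) that turns `previous` into `current`."""
    common = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        common += 1
    # Words after the shared prefix are retracted and replaced.
    return len(previous) - common, current[common:]

def apply_update(display, update):
    """Apply an update on the receiving side (e.g. a subtitle display)."""
    n_retract, new_words = update
    kept = display[:len(display) - n_retract]
    return kept + new_words

# An early hypothesis is shown to the user, then partly revised.
early = ["the", "cat", "sat"]
later = ["the", "cat", "sits", "down"]
update = hypothesis_update(early, later)
shown = apply_update(early, update)
```

Sending only the changed suffix lets words appear on screen before the recognizer has committed to them, while keeping each revision small.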

Sequence-to-sequence (S2S) attention-based models have become increasingly popular for end-to-end speech recognition. Several advances in the architecture and the optimization of S2S models have been proposed to achieve state-of-the-art performance on standard benchmarks. However, how to employ S2S models, with their batch-mode capacity, in online speech recognition is still an open research question. We approached this problem by analyzing the latency issues that arise from the regular soft-attention function, the bidirectional encoder, and beam-search inference. We addressed all of these latency issues together by proposing an additional training loss to control the uncertainty of the attention function on look-ahead frames, and novel inference algorithms for providing partial hypotheses. Our experiments on the standard telephone conversation task show that, with a delay of 1.5 seconds in all output elements, our streaming recognizer fully matches the performance of a batch-mode system of the same configuration. To the best of our knowledge, this is the first time that an S2S speech recognition model can be used in online conditions without sacrificing accuracy.

Affiliated institution(s) at KIT	Institut für Anthropomatik und Robotik (IAR)
Publication type	Thesis
Publication date	02.02.2021
Language	English
Identifier	KITopen-ID: 1000128854
Publisher	Karlsruher Institut für Technologie (KIT)
Type of thesis	Dissertation
Faculty	Fakultät für Informatik (INFORMATIK)
Institute	Institut für Anthropomatik und Robotik (IAR)
Examination date	20.11.2020
Keywords	Automatic Speech Recognition, Neural Network, Online ASR
Indexed in	OpenAlex
Referee/Supervisor	Waibel, A.
