Second-language learners often exhibit accents and mispronunciations that can significantly affect their communication proficiency and create language barriers between speakers. This challenge motivated our research on ``Accent Conversion and Pronunciation Improvement,'' which aims to make non-native speech sound more native and to enhance intelligibility while preserving the original content, emotion, and speaker identity. This thesis focuses specifically on accent conversion for English, mostly from non-native accents to a native American accent. Although our work concentrates on English accents, the approach is generally applicable to any pair of accented or non-accented speech. To date, we have published five research papers and two patents on this topic, each tackling a different challenge in accent conversion step by step.
\bigbreak
One of the main challenges in accent conversion is the lack of ground-truth data: no parallel corpus exists that contains pairs of recordings with the same content, spoken by the same speaker in different accents. In our first work, we provide a solution for creating such parallel data. The approach synthesizes training data using a voice conversion model that retains pronunciation patterns and prosody while altering speaker identity.
\bigbreak
In the second work, our motivation is to build a text-to-speech (TTS) system that can synthesize audio from transcripts in any voice and in many accents without requiring large amounts of accented training data. We propose SYNTACC, which adapts conventional multi-speaker TTS models to synthesize speech in multiple accents. The method factorizes model weights into a shared component and an accent-dependent component, which reduces the number of parameters that must be trained per accent. This approach enables high-quality multi-accent speech synthesis in low-resource training conditions. The evaluation on Indian, Spanish, Chinese, and Vietnamese accents demonstrates that SYNTACC effectively generates accented speech while preserving speaker identity.
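As a rough illustration only (the symbols below are expository assumptions, not notation taken from SYNTACC itself), one common way to write such a factorization for a weight matrix of accent $a$ is
\[
W^{(a)} = W_{\mathrm{shared}} + U_{a} V_{a}^{\top},
\]
where $W_{\mathrm{shared}}$ is shared across all accents and the low-rank factors $U_{a}$ and $V_{a}$ are the only parameters trained for accent $a$; the additive low-rank form is assumed here, and other factorizations are possible. With a $d \times d$ weight matrix and rank $r \ll d$, each additional accent then adds $2dr$ parameters rather than $d^{2}$.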
\bigbreak
After the first two works, we gained the ability to generate a larger amount of accented speech data. With this increased data availability, we can train a more advanced and robust model for the task. In the third work, we approach accent conversion in a way similar to speech-to-speech translation, treating each accent as a distinct language and converting between accents in the same manner as translating between languages. To bridge the acoustic variability between accents and enable language-like modeling, we introduce discrete speech units as an intermediate representation. These units are obtained by quantizing hidden representations from a neural network model, capturing high-level linguistic content while abstracting away speaker-specific characteristics. By using discrete units, we can apply a state-of-the-art speech-to-speech translation model to the accent conversion task. The approach significantly improves the fluency of non-native speech and converts the accent while preserving speaker identity. The evaluation shows that this method outperforms previous accent conversion (AC) approaches in fluency and naturalness.
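To make the quantization step concrete, consider a minimal sketch under the assumption of nearest-centroid quantization (for example with a $k$-means codebook, which is an illustrative choice and not necessarily the exact procedure used in this work). Given the hidden representation $h_{t}$ of frame $t$ and a codebook of centroids $c_{1}, \dots, c_{K}$, the discrete unit is
\[
u_{t} = \arg\min_{k \in \{1, \dots, K\}} \| h_{t} - c_{k} \|_{2},
\]
so that an utterance of $T$ frames is represented by the unit sequence $u_{1}, \dots, u_{T}$.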
\bigbreak
In the first three works, our approaches build a mapping function from one accent to another. In the fourth work, we take a different approach, disentangle-resynthesis, which can leverage non-parallel data by decomposing speech into components such as speaker identity, content, prosody, and accent, and then recombining them for synthesis. We assume that a TTS system trained solely on native speech produces accent-independent linguistic representations. This native TTS system can also generate ideal ground-truth audio for non-native speakers, with native pronunciation and with the same speaker identity, duration, and prosody as the original non-native speech, precisely aligned to it. Our AC model leverages the TTS text representations to learn accent-independent features and uses the synthetic ground truth to learn the mapping from non-native accented speech to native-like speech. This approach outperforms previous methods in terms of preserving speaker identity, prosody (emotion), and duration.
\bigbreak
Although the model in the fourth work achieves fast inference thanks to its non-autoregressive architecture, it still requires the entire input utterance for prediction, making it unsuitable for streaming applications. Because that model is among the strongest for accent conversion, in the fifth work we focus on the essential architectural and training modifications needed to develop a streaming-capable AC model while maintaining the performance of its non-streaming counterpart.
\bigbreak
In this thesis, we address the lack of ground-truth data in accent conversion by developing solutions for generating high-quality training corpora. Furthermore, we categorize accent conversion approaches into two main paradigms: (1) mapping-based models that learn a direct transformation between accents, and (2) the disentangle-resynthesis framework, which decomposes speech into multiple components such as content, accent, and speaker identity, enabling controlled modification and synthesis. Additionally, we propose a streaming accent conversion model that makes streaming applications feasible. Our contributions advance the field by improving data availability, enhancing model performance, and enabling more natural and practical accent conversion.