An Accurate Text-to-Speech System for Natural-Sounding Audio

Client

Our client, a leading messaging application developer, wanted to improve accessibility and convenience on their platform. They needed a feature that reads text messages aloud, serving users who prefer audio or who cannot read messages in certain situations. The goal was a better, more inclusive user experience within the messaging application.

Challenges

  • Build a text-to-speech (TTS) system that synthesizes input text into an audio file.
  • The synthesized audio should sound natural and expressive, narrowing the gap between computer-generated voices and human intonation.
  • The system should pronounce words accurately, including proper names and acronyms relevant to their context.

Approach

  • Used the Coqui TTS toolkit, which supports models across various languages and architectures and allows fine-tuning and multilingual training.
  • Collected a custom dataset pairing a diverse range of text inputs with corresponding speech recordings that capture linguistic nuances.
  • Prepared the data for model training by restructuring, formatting, and normalizing it.
  • Trained the text-to-spectrogram model with the Glow-TTS architecture on an NVIDIA A10 GPU.
  • Used TensorBoard to monitor training progress and evaluate model performance.
  • Tested the model rigorously, evaluating it on training loss and audio quality metrics.
  • Used a pre-trained vocoder from the Coqui TTS library to convert spectrograms to audio, keeping the TTS system reliable and scalable.
  • Synthesized speech with an inference script.
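
The data preparation step above can be illustrated with a minimal sketch. The case study does not describe the actual dataset layout or normalization rules, so everything here is an assumption: the acronym table, the function names, and the choice of the pipe-delimited "id|raw text|normalized text" layout used by the LJSpeech dataset, which Coqui TTS can typically read through its `ljspeech` formatter.

```python
import re
from pathlib import Path

# Hypothetical acronym table: spell acronyms out letter by letter so the
# model sees pronounceable tokens. A real project would need a far larger list.
ACRONYMS = {"TTS": "T T S", "GPU": "G P U"}


def normalize_text(text: str) -> str:
    """Collapse runs of whitespace and expand known acronyms."""
    text = re.sub(r"\s+", " ", text).strip()
    for short, spoken in ACRONYMS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", spoken, text)
    return text


def format_line(clip_id: str, raw_text: str) -> str:
    """Build one 'id|raw|normalized' metadata line (LJSpeech-style layout)."""
    # The pipe is the field delimiter, so remove it from the text itself.
    raw = re.sub(r"\s+", " ", raw_text.replace("|", " ")).strip()
    return f"{clip_id}|{raw}|{normalize_text(raw)}"


def write_metadata(rows, out_path) -> Path:
    """Write (clip_id, raw_text) pairs to a metadata file, one line per clip."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for clip_id, raw_text in rows:
            f.write(format_line(clip_id, raw_text) + "\n")
    return out_path
```

For example, `write_metadata([("clip_0001", "Train  on a GPU")], "metadata.csv")` would produce a single line whose last field reads "Train on a G P U". Keeping both the raw and normalized text in the file makes it easier to audit what the normalizer changed.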

Impact

  • Achieved a word error rate (WER) of 3%, indicating highly accurate pronunciation.
  • 68% of users found the synthesized speech more natural and enjoyable.
  • Supported speech synthesis in 6 languages, meeting client requirements.
  • Users with visual impairments now account for 15% of total platform engagement, up from 5% before implementation.
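
For readers unfamiliar with the WER figure above, the sketch below shows how the metric is generally computed: the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. The case study does not say how its WER was measured. A common approach for TTS, assumed here, is to transcribe the synthesized audio with a speech recognizer and compare that transcript against the input text. The function name and the lowercase, whitespace-based tokenization are illustrative choices.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (word-level Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 3% corresponds to roughly 3 word-level errors per 100 reference words.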