Robust Automatic Speech Recognition System for German Language Transcription

Client

The client is a prestigious law firm providing legal services in the German legal landscape. The firm encountered challenges in transcription and documentation during legal proceedings, where clarity in legal discussions and accurate transcription were crucial. The objective was to implement a German ASR system that transcribes proceedings accurately, ensuring precise and efficient handling of legal documents and proceedings.

Challenges

The client needed full control over the ASR setup due to privacy and confidentiality concerns. This requirement ruled out the use of out-of-the-box APIs from AWS, GCP, and Azure services.

Approach

Kaldi, an open-source Automatic Speech Recognition (ASR) toolkit that contains various recipes for training customized acoustic models, is used.
The ASR script is built using Kaldi's LibriSpeech recipe, an acoustic model, and a language model.
Labeled audio data (recordings and the words spoken in them) and a pronunciation lexicon (words and their corresponding phoneme sequences) are collected from the Tuda-De and Mozilla Common Voice datasets of German-language audio files.
The acoustic model based on TDNN (Time Delay Neural Network) is trained using Labeled Audio Data on an NVIDIA A10 GPU.
The audio data is preprocessed to remove noise, enhance quality, and normalize volume.
MFCC (Mel-frequency cepstral coefficient), CMVN (cepstral mean and variance normalization), and i-vector audio features are extracted.
The language model refines word sequence predictions, while the decoding graph is searched for candidate word sequences to produce the final transcription. Transcription accuracy is evaluated using Word Error Rate (WER).
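The volume normalization and CMVN steps listed above can be illustrated with a small sketch. This is not the project's code: in a Kaldi pipeline these steps are handled by the toolkit's own tools, and the function names `peak_normalize` and `apply_cmvn`, the target peak of 0.9, and the toy feature values are all assumptions chosen for illustration.

```python
from statistics import fmean, pstdev


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the loudest one has magnitude target_peak.

    Hypothetical helper: one simple way to normalize volume.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty input: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def apply_cmvn(features, eps=1e-8):
    """Per-utterance cepstral mean and variance normalization.

    features is a list of frames, each frame a list of coefficients.
    Every coefficient (column) is shifted to zero mean and scaled to
    unit variance across the frames of the utterance.
    """
    columns = list(zip(*features))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in features
    ]


# Toy input: three frames with two coefficients each (made-up values).
toy_features = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
normalized = apply_cmvn(toy_features)
loud = peak_normalize([0.1, -0.5, 0.25])
```

After CMVN, each coefficient has roughly zero mean and unit variance over the utterance, which reduces the effect of channel and loudness differences between recordings.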

Impact

Achieved a Word Error Rate (WER) of 3.2%, signifying high transcription accuracy.
Maintained a low WER of 5.2% for audio with background noise and disturbances.
Achieved high precision in the legal context by training the model on legal terminology and jargon.
Reduced the cost of manual transcription tasks by 70% while significantly improving staff morale.
Maintained 100% compliance with the client’s stringent data protection and privacy regulations.
Ensured consistent transcription quality over varying accents, pitch, or speaking styles.
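
For readers unfamiliar with the metric behind the figures above, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard definition, not the evaluation script used in the project; the function name `word_error_rate` and the German example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the recognizer drops one of five reference words.
reference = "das gericht vertagt die verhandlung"
hypothesis = "das gericht vertagt verhandlung"
wer = word_error_rate(reference, hypothesis)  # 1 error / 5 words
```

Under this definition, a WER of 3.2% corresponds to roughly three word errors per hundred reference words.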