Seamless Speaker Diarization System for Effective Conversation Transcription

Client

Our healthcare client, a leader in surgical services, sought to optimize operating room transcription. Embracing advanced speaker diarization, they aimed to enhance transcription accuracy during surgical procedures. The goal was to streamline post-operative analysis and improve medical documentation, ultimately elevating the quality of patient care through precise and reliable records.

Challenges

  • Build a system that performs speaker diarization on a given audio file.
  • Report each speaker’s start and end time in the output.
  • Report the duration of each speaker turn, in order of occurrence in the audio.
  • Ensure the diarization system adapts to changing acoustic conditions, handles unexpected variations, and maintains accuracy in dynamic environments.
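
The output requirements above can be illustrated with a minimal Python sketch. The `Turn` class, the `merge_turns` helper, and the sample segments are hypothetical names and values chosen for illustration; they are not taken from the delivered system. The sketch merges adjacent segments from the same speaker and lists each turn's start time, end time, and duration in order of occurrence.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One contiguous stretch of speech by a single speaker (hypothetical type)."""
    speaker: str
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio

    @property
    def duration(self) -> float:
        return self.end - self.start


def merge_turns(segments):
    """Merge touching or overlapping same-speaker segments, ordered by start time."""
    turns = []
    for speaker, start, end in sorted(segments, key=lambda s: s[1]):
        last = turns[-1] if turns else None
        if last and last.speaker == speaker and start <= last.end:
            last.end = max(last.end, end)  # extend the current turn
        else:
            turns.append(Turn(speaker, start, end))
    return turns


# Illustrative labelled segments: (speaker, start, end). Values are made up.
segments = [
    ("spk1", 0.0, 1.5),
    ("spk1", 1.5, 3.0),
    ("spk2", 3.0, 4.5),
    ("spk1", 4.5, 6.0),
]
turns = merge_turns(segments)
for t in turns:
    print(f"{t.speaker}\t{t.start:.2f}\t{t.end:.2f}\t{t.duration:.2f}")
```

Keeping the turns sorted by start time is what yields the sequential, per-occurrence listing; a per-speaker total could be derived by summing `duration` over turns with the same label.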

Approach

  • Used Kaldi, an open-source automatic speech recognition toolkit that supports training custom models as required.
  • Trained a TDNN-based x-vector model on common speech corpora using the ‘callhome’ recipe.
  • Segmented the audio recording into overlapping or non-overlapping windows to capture key speech patterns and transitions between speakers.
  • Extracted MFCC features from the segments to define the unique speech characteristics of different speakers.
  • Mapped variable-length audio signals to fixed-length x-vector embeddings that encode essential speaker-related information.
  • Scored the resemblance between each pair of segments using a similarity metric derived from their acoustic features.
  • Employed clustering algorithms to categorize segments with similar acoustic traits, identifying potential speakers.
  • Applied an iterative refinement process until speaker-homogeneous regions were obtained.
  • Improved diarization accuracy by iteratively refining the model, using insights from preceding iterations.
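
The segment, score, and cluster steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Kaldi pipeline itself: the window sizes, the toy two-dimensional embeddings, the 0.5 stopping threshold, and the function names are all hypothetical, and cosine similarity with average-linkage agglomerative clustering is swapped in as a stand-in for the project's similarity metric and clustering algorithm, which the steps above do not specify.

```python
import math


def sliding_windows(total_dur, win=1.5, shift=0.75):
    """Split [0, total_dur] seconds into overlapping (start, end) windows."""
    windows, start = [], 0.0
    while start < total_dur:
        windows.append((start, min(start + win, total_dur)))
        start += shift
    return windows


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering; stops below `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]

    def avg_sim(c1, c2):
        total = sum(cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, (a, b) = max(
            (avg_sim(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break  # remaining clusters are treated as different speakers
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)

    labels = [0] * len(embeddings)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


windows = sliding_windows(3.0)
# Toy embeddings, one per window; a real system would use x-vectors here.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = cluster(embeddings)
for (start, end), label in zip(windows, labels):
    print(f"{start:.2f}-{end:.2f}s -> speaker {label}")
```

Stopping on a similarity threshold rather than a fixed cluster count lets the number of speakers be inferred from the audio, at the cost of having to tune the threshold on held-out recordings.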

Impact

  • Achieved a low Diarization Error Rate (DER) of 4.3%, accurately identifying speakers. 
  • Streamlined post-operative analysis processes, resulting in a 40% reduction in the time required for reviewing and analyzing surgical transcripts.
  • Integrated speaker diarization seamlessly into Electronic Health Record (EHR) systems, leading to a 30% reduction in data entry errors and ensuring accurate and synchronized medical records.
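
For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The short Python sketch below shows that arithmetic; the function name and the durations are hypothetical illustration values and are not the measurements behind the figures reported above.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Made-up durations for a hypothetical 10-minute recording.
der = diarization_error_rate(
    missed=12.0,        # speech the system failed to detect
    false_alarm=9.0,    # non-speech labelled as speech
    confusion=15.0,     # speech attributed to the wrong speaker
    total_speech=600.0,
)
print(f"DER: {der:.1%}")
```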