COO Magazine Q2 2024

Voice Monitoring – to be or not to be?

Erkin Adylov
Founder and CEO

Monitoring phone calls in a financial institution can serve several important purposes that contribute to effective risk management and regulatory compliance.

Voice conversations tend to harbor 10x more misconduct and potential fraud than other communication channels like email or instant messaging.

Certain jurisdictions and financial regulations (such as MiFID II in the European Union) require financial institutions to record and monitor phone calls related to trading, investment, or advisory activities.

Monitoring and recording phone calls help create a comprehensive audit trail of communications related to financial transactions. This audit trail can be essential during internal or external audits, regulatory examinations, or legal proceedings.

Recorded phone calls can be used to resolve disputes, clarify misunderstandings, or provide evidence in case of complaints from customers or counterparties.

In the past few years, numerous firms have faced challenging inquiries about the absence of voice monitoring during inspections by the National Futures Association (NFA) and examinations conducted by the Financial Conduct Authority (FCA). Surprisingly, a few firms have argued to regulators that voice surveillance is too costly or that the technology required for effective monitoring is unavailable. A decade ago, such reasoning might have held weight, but in the era of Artificial Intelligence, securing an exemption from regulators on this basis is no longer feasible.

Implementing voice monitoring doesn’t have to compromise security, budget, or the efficiency of compliance teams. However, it is not as simple as activating an Automatic Speech Recognition (ASR) transcription engine and generating alerts based on lexicons or random sampling. The effectiveness of a voice surveillance program relies on numerous crucial details and nuances. In this blog, we delve into the essential steps necessary for implementing an efficient and successful voice monitoring compliance program.

Empty calls, short calls and call segments

Voice calls on the trading floor can be classified into three main categories: empty calls, short calls, and long calls.

Empty calls are characterized by the presence of background noises but no direct human speech. Surprisingly, these calls account for approximately 20% of all recorded calls. This high proportion often surprises our customers when they are presented with these statistics. The reason behind this is that the recording equipment used by financial services firms on the trading floor is sourced from companies that supply similar equipment to the emergency services. These recorders are designed to capture all audio, even during periods of silence when callers fail to hang up.
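One common way to screen out empty calls before transcription is a simple voice-activity check. The sketch below is a simplified illustration, not Behavox's actual pipeline: it flags a call as empty when no audio frame's energy exceeds a threshold, and both the frame length and the threshold value are arbitrary assumptions.

```python
def is_empty_call(samples, frame_len=400, energy_threshold=0.01):
    """Flag a call as 'empty' when no frame's mean energy exceeds the threshold.

    samples: list of floats in [-1.0, 1.0] (mono PCM audio).
    frame_len: samples per frame (e.g. 25 ms at an assumed 16 kHz rate).
    """
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if not frame:
            continue
        energy = sum(x * x for x in frame) / len(frame)
        if energy > energy_threshold:
            return False  # speech-like activity found
    return True  # only silence or low-level background noise

# Low-level noise is classified as empty; a louder burst of activity is not.
quiet = [0.001] * 16000
speech = [0.001] * 8000 + [0.5] * 800 + [0.001] * 7200
```

In practice, production systems use far more sophisticated voice-activity detection, but the principle is the same: cheap filtering upstream prevents the ASR engine from hallucinating transcripts out of background noise.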

Short calls make up the majority of calls on the trading floor, comprising around 66% of recorded calls. These calls are brief and concise, typically lasting less than 10 seconds. However, the compressed nature of the recordings and the suboptimal audio quality pose significant challenges for both human listeners and AI systems attempting to comprehend the content of these calls.

Long calls represent only 13-15% of the recorded calls, with a mere 0.1% exceeding a duration of one hour. To enhance the accuracy of transcription, each of these recorded calls is divided into smaller segments, typically ranging from 10 to 30 seconds. For instance, a five-minute call might be divided into 30 segments, each lasting 10 seconds.

Machine learning algorithms are employed to analyze each segment and identify the spoken language, enabling the assignment of the appropriate transcription model. This segmentation process is crucial in improving the transcription quality, as excessively short or long segments may lead to inaccuracies in the transcriptions.
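The segmentation-and-routing step described above can be sketched as follows. This is a simplified illustration with a fixed 10-second segment length; `identify_language` stands in for a real LID model and is a hypothetical stub here.

```python
def split_into_segments(duration_s, segment_len_s=10):
    """Split a call of duration_s seconds into (start, end) segments."""
    segments = []
    start = 0
    while start < duration_s:
        end = min(start + segment_len_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

def route_segments(segments, identify_language):
    """Pair each segment with the language the LID model assigns to it,
    so the matching ASR model can be applied per segment."""
    return [(seg, identify_language(seg)) for seg in segments]

# A five-minute (300 s) call yields 30 ten-second segments.
segs = split_into_segments(300)
```

Routing per segment rather than per call is what allows multilingual calls, where speakers switch languages mid-conversation, to be transcribed with the correct model for each portion.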

It is crucial for firms to have a comprehensive understanding of how their recorded calls are distributed among the three categories and how data processing is conducted for each category. For instance, transcribing empty calls can lead to more inaccurate transcripts since the system will attempt to transcribe background noise. It is therefore advisable to exclude these calls from the transcription processing altogether.

Nevertheless, the compliance team should implement an assurance process to validate that empty calls are genuinely devoid of human speech and solely consist of background noise. This validation can be achieved by randomly sampling empty calls for review.

Short calls, which last less than 10 seconds, are particularly difficult to transcribe in the correct language. The difficulty arises because the language identification model lacks sufficient audio to determine the language spoken reliably. The shorter the audio segment, the higher the likelihood of language confusion and, with it, the application of an incorrect transcription model: English audio may be mistakenly transcribed using a Swedish ASR model, for example. A humorous illustration of this kind of mismatch is ambiguous audio that listeners hear as either “Salsa cookies” or “North Korea.”

In the case of short calls, it is advisable to choose a single ASR model and transcribe all short calls in one language. The HR file, which indicates the geographical location of the employee and the likely coverage markets, can assist in determining the appropriate language selection. While transcribing all calls under 10 seconds in a single language may seem imprecise, it is a more practical and accurate approach compared to relying solely on the model’s guess regarding the spoken language. By adopting this method, we can ensure more reliable and consistent transcriptions for short calls.
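A minimal sketch of the routing logic described above: calls under 10 seconds bypass LID and use one default language inferred from the employee's HR record, while longer calls trust the LID model's output. The location-to-language mapping and field names here are illustrative assumptions, not a real schema.

```python
# Hypothetical mapping from an employee's HR location to a default ASR language.
HR_DEFAULT_LANGUAGE = {
    "London": "en",
    "Stockholm": "sv",
    "Madrid": "es",
}

def choose_asr_language(call_duration_s, hr_location, lid_guess):
    """Short calls use the HR-derived default; longer calls trust LID."""
    if call_duration_s < 10:
        return HR_DEFAULT_LANGUAGE.get(hr_location, "en")
    return lid_guess

# An 8-second call from a London-based employee is transcribed in English,
# even if the LID model guessed Swedish from the short audio.
```

The design choice is a deliberate trade-off: a single deterministic default will occasionally be wrong, but it is wrong less often than an LID model forced to guess from a few seconds of compressed audio.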

Languages & accents

Language diversity and varying accents pose significant challenges in the transcription of speech through ASR technology. The transcription process involves two key steps: Language Identification (LID) and the subsequent application of ASR in the identified language. LID employs a machine learning model to assign languages to different segments of speech. As a reminder, all calls are divided into segments that are typically between 10 and 30 seconds long. Breaking calls into segments helps transcribe calls that are multilingual (multiple languages used on the same call, where speakers switch from one language to another).

Accents play a pivotal role in the precision of language identification and the overall quality of transcription. For instance, if a call features English spoken with a strong Spanish accent, it may be mistakenly identified as Spanish and transcribed incorrectly using the Spanish ASR model.

The LID model is not perfect, and its accuracy diminishes as the number of languages to identify increases. Focusing on fewer languages significantly enhances accuracy. This can be done by analyzing the HR data or by analyzing a sample of calls identified by LID.

Compliance teams rolling out voice monitoring are advised to have validation processes that focus on the LID model. Transcription using ASR is a downstream process that is dependent on the LID model. If the LID model is not working as designed, it will significantly impact the quality of transcription to the point of transcripts being completely wrong.

Vendors should disclose the accuracy of their LID models and the sensitivity of those models to accents. If a vendor does not use an LID model, the compliance team should ask how it determines the language of each call instead. Once the compliance team understands the expected performance of the LID model, the validation process is fairly simple: for each language identified by the LID model, a sample of calls is selected and manually reviewed to confirm that the language spoken on the call matches the language the LID assigned. This check can be done once or on a recurring basis, depending on the availability of resources.
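The sampling step of that validation process can be sketched as below. The sample size of 25 per language is an arbitrary assumption; firms should size samples according to their own risk appetite and review capacity.

```python
import random

def lid_validation_sample(calls_by_language, sample_size=25, seed=0):
    """Draw a random sample of calls per LID-identified language for
    manual review.

    calls_by_language: dict mapping a language code to a list of call IDs,
    as grouped by the LID model's output.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return {
        lang: rng.sample(calls, min(sample_size, len(calls)))
        for lang, calls in calls_by_language.items()
    }
```

Sampling per identified language (rather than across all calls at once) ensures that low-volume languages, where LID errors are easiest to miss, still receive review coverage.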

Word error rate (WER)

Word Error Rate (WER) is often the primary focus in discussions about voice transcription quality. During meetings, people commonly inquire about the quality of transcription engines. However, as this blog highlights, WER represents just the tip of the iceberg. It is a downstream task that relies on the language identification model and the distribution of calls (empty, short, long).

Assuming that compliance teams have a comprehensive understanding of call distribution and the impact of the language identification model, we can now shift our focus to evaluating the transcription quality.

The quality of Automatic Speech Recognition (ASR) is typically evaluated using two metrics: Character Error Rate (CER) and Word Error Rate (WER). For compliance purposes, WER is often considered a better metric as it quantifies the number of incorrectly transcribed words. Although there is ongoing research suggesting that WER alone is limited, particularly in the context of large language models (as a single word doesn’t fully represent language understanding), it is still widely regarded as the industry standard.
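For concreteness, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of words in the reference. A minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

# One wrong word out of four gives a WER of 25%.
```

Note that a 30% WER does not mean 30% of the meaning is lost; a single substituted word counts the same whether it is a filler word or the price of a trade, which is one reason WER alone is a limited measure.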

However, it is important to note that the WER disclosed by transcription software vendors is usually not representative of the quality customers are likely to experience. Compliance teams should request benchmarking to be performed on a sample of their own data. This independent evaluation can be provided to regulators, and Behavox, for instance, offers Regulatory Benchmarking reports on the quality of its ASR as measured on customer data.

In Behavox’s experience, customers’ WER typically falls within the range of 26% to 35%. This wide range can be attributed to variations in the quality of recording equipment and the usage of specific jargon.

It is worth noting that Amazon offers a best-in-class WER of 24% on the same data. However, using Amazon’s service is not feasible for customers due to potential privacy concerns. Sharing data that often contains Personally Identifiable Information (PII) and confidential information with Amazon is not ideal. In contrast, Behavox operates on a dedicated infrastructure, processing customer data in a secure cloud environment without data leaving the perimeter or being used for research and development purposes.

Compliance teams can rely on Behavox’s benchmarking reports, which provide an accurate assessment of the ASR quality based on customer data. By obtaining insights from this independent evaluation, compliance teams can better assess the performance and reliability of the transcription solution.

AI for alert generation

In many compliance teams within the financial services sector, random sampling is commonly employed for voice monitoring. However, this approach is often time-consuming and yields minimal results in terms of identifying misconduct and potential fraud. To address this, it is recommended to configure detective controls that target specific risks, with the use of AI being particularly effective for this purpose. While lexicons can also be utilized, they tend to generate more false positives.

AI, on the other hand, offers greater robustness in identifying risks while reducing the volume of alerts. It achieves this by evaluating entire sentences rather than focusing solely on individual words, enabling it to work effectively even with transcripts that may not be 100% accurate and contain speech disfluencies (e.g., “ahm,” “oh,” or mid-speech message changes).
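To make the contrast concrete, here is a minimal lexicon-based matcher (the phrases are invented for illustration). It flags every occurrence of a listed term regardless of context, which is precisely why lexicons generate false positives that a sentence-level AI model can avoid.

```python
# Illustrative risk lexicon; real compliance lexicons run to thousands of terms.
LEXICON = {"guarantee", "off the books", "delete this"}

def lexicon_alert(transcript):
    """Raise an alert if any lexicon phrase appears, regardless of context."""
    text = transcript.lower()
    return any(phrase in text for phrase in LEXICON)

# Flags a genuinely risky sentence:
#   "Let's keep this trade off the books."
# ...but also a harmless compliance disclaimer, because words are
# matched without any understanding of the surrounding sentence:
#   "The fund cannot guarantee future performance."
```

A model that classifies whole sentences can score the disclaimer as benign while still flagging the risky phrasing, which is what drives the reduction in alert volume described above.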


The absence of voice monitoring is likely to continue to draw scrutiny from regulators. However, implementing voice monitoring doesn’t have to be a cumbersome process. By partnering with the right provider, this endeavor can become manageable and efficient. Enabling Language Identification (LID), Automatic Speech Recognition (ASR), and AI capabilities will enhance detective controls significantly. Importantly, it will also reduce the workload by eliminating the need for random sampling, which is commonly used by many firms.
