Testing Voice Biometric Security Against AI Deepfakes

Naman Aggarwal

author • 29th February 2024 (UPDATED ON January 17, 2025)

6 minute read time

Voice biometric authentication systems were neither designed nor operationalized to protect against the sophisticated deepfakes attacking call centers today. Every call center that uses a voice biometric authentication solution should test whether its authentication processes can keep out bad actors using Gen AI. 3Qi, a leading software QA solution provider, helped a Tier 1 US Bank test its call centers against the full variety of deepfakes. In this blog, we share their learnings and best practices from that exercise. – The Pindrop Team

3Qi Labs has always prided itself on its ability to break software. Founded as a software QA solutions provider, 3Qi Labs’ mandate has extended beyond executing predefined test plans to proactively seeking out the corner cases that could precipitate defects. In software testing, constraints of time and resources dictate the scope, requiring prioritization of the features most critical to user experience and the areas with the highest risk profiles. For Authentication systems, this translates to a balanced mix of Positive* and Negative* tests, with a focus on fraud detection capabilities due to the dramatic increase in biometric fraud incidents.

Among the various types of vulnerabilities, the challenge of testing and detecting Synthetic Speech Injection stands out. Although other fraud methods such as voice recordings and impersonation remain relevant, the swift spread of deepfakes powered by advancements in Generative AI has made sophisticated synthetic voice technologies easily accessible to a wide audience at minimal cost. This reality places synthetic voice generation and detection at the forefront of our testing efforts for Authentication systems like the Pindrop solution.

The current state of synthetic speech generation technology

In a recent evaluation of Authentication platforms for one of 3Qi’s banking customers, we evaluated multiple AI-driven speech generation platforms. Some of the insights are highlighted below:

Explosion in Number of Gen AI Systems: Today there are well over 120 Gen AI systems covering a combination of text-to-speech and speech-to-speech. Among the technologies assessed were Eleven Labs, Descript, Podcastle, PlayHT, and Speechify. These companies, fueled by significant venture capital investments, are positioned to accelerate advancements in this space.
Easy Accessibility to Sophisticated Attacks: Minimal effort is required to create convincing synthetic voice samples; we used 60 seconds of speech per tester. Not long ago, a 30-minute recording was needed to generate a sample of equivalent quality, and Microsoft claims its VALL-E model can clone a voice from a 3-second audio clip!
Potential for Misuse: The efficacy of these technologies was demonstrated by the successful spoofing of multiple Authentication systems using our synthetic samples. It’s no surprise that the Federal Communications Commission (FCC) just outlawed AI generated robocalls, underscoring the need to protect citizens against misuse of these technologies, especially considering the vast amount of voice data accessible via social media.
Affordability: Low cost makes these technologies accessible to a broad audience. On the platforms we tested, cloning a voice cost as little as $1! It’s no wonder that the low barrier to entry for utilizing these advanced technologies has resulted in Deepfake identity fraud doubling from 2022 to Q1 2023.

Best practices for testing systems against synthetic voice

As part of an Authentication platform evaluation, we employ a holistic testing approach covering a broad spectrum of demographic categories, various technologies, a range of input/environmental factors, and synthetic voice injection.

Below is the outline of our approach for a recent evaluation for a Tier-1 US Bank:

Proctored Sessions: Direct supervision of interactions for thorough scenario coverage and real-time result capture.
Diverse Scenarios: Employing a mix of positive and negative tests with randomized scenario execution and Net Speech* variation (more below) across testers.
Enrollment & Verification: Assessing user onboarding and verification efficiency, considering variables such as presence of background noise and speech clarity.
Security & Fraud Detection: Validating systems against synthetic and replayed voice attacks, as well as distinguishing between different live voices.
Demographic Representation: A broad participant demographic across age, gender, linguistic, and ethnic backgrounds.
Technical Infrastructure: Utilizing a variety of mobile devices and networks, with synthetic voice generation facilitated through tools like Eleven Labs.
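
The randomized scenario execution and net speech variation described above could be planned along the following lines. This is only a minimal sketch: the scenario labels, the net speech targets, and the names PlannedCall and build_plan are hypothetical and chosen for illustration, not taken from the actual evaluation.

```python
import random
from dataclasses import dataclass

# Hypothetical scenario labels and net speech targets, for illustration only.
SCENARIOS = [
    ("genuine_live", "positive"),
    ("other_live_voice", "negative"),
    ("replayed_recording", "negative"),
    ("synthetic_voice", "negative"),
]
NET_SPEECH_SECONDS = [3, 6, 10, 15]


@dataclass(frozen=True)
class PlannedCall:
    tester_id: str
    scenario: str
    kind: str  # "positive" or "negative"
    net_speech_s: int


def build_plan(tester_ids, seed=0):
    """Return, per tester, a shuffled list covering every scenario/net-speech pair."""
    rng = random.Random(seed)  # fixed seed so a proctor can reproduce the order
    plan = {}
    for tester_id in tester_ids:
        calls = [
            PlannedCall(tester_id, name, kind, seconds)
            for name, kind in SCENARIOS
            for seconds in NET_SPEECH_SECONDS
        ]
        rng.shuffle(calls)  # randomize execution order within each tester
        plan[tester_id] = calls
    return plan


plan = build_plan(["tester_01", "tester_02"], seed=42)
for call in plan["tester_01"][:3]:
    print(call)
```

Seeding the generator keeps the order random across testers while still letting a proctor regenerate exactly the same plan for a rerun.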

Central to our testing is the concept of Net Speech*, a critical variable that directly influences the accuracy and reliability of Authentication systems. The amount of net speech provided is positively correlated with the system’s ability to generate a precise voice enrollment and, consequently, its capability to authenticate a caller or detect a fraudulent one. By examining synthetic voice samples of varying lengths, we can identify the specific net speech duration at which synthetic or cloned voices begin to significantly affect false acceptance rates, a key factor in maintaining system integrity and user trust. Thus, net speech serves as a crucial variable in our evaluations, leading us to test across various intervals to determine the optimal net speech requirement per platform. This is vital for minimizing the risk of fraud while promoting a superior customer experience.
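
One way to look for the net speech duration at which synthetic voices start to affect false acceptance is to bucket the synthetic voice attempts by net speech and compute FAR per bucket. The sketch below assumes a simple list of (net speech, accepted) results; the sample data, the interval edges, and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical proctored results for synthetic voice attempts only:
# (net speech in seconds, whether the system accepted the caller).
synthetic_attempts = [
    (2.5, True), (2.8, True), (3.0, False), (4.1, True),
    (5.5, False), (5.9, False), (7.2, False), (8.0, True),
    (9.4, False), (11.0, False), (12.5, False), (14.0, False),
]

# Illustrative interval edges in seconds: [0, 3), [3, 6), [6, 10), [10, inf).
EDGES = [0.0, 3.0, 6.0, 10.0, float("inf")]


def bucket_label(net_speech_s):
    """Map a net speech duration to the label of the interval containing it."""
    for low, high in zip(EDGES, EDGES[1:]):
        if low <= net_speech_s < high:
            return f"[{low:g}, {high:g})"
    raise ValueError(f"net speech out of range: {net_speech_s}")


def far_by_interval(attempts):
    """Return {interval label: (FAR, number of attempts)} for negative tests."""
    counts = defaultdict(lambda: [0, 0])  # label -> [false accepts, total]
    for net_speech_s, accepted in attempts:
        entry = counts[bucket_label(net_speech_s)]
        entry[0] += int(accepted)
        entry[1] += 1
    return {label: (fa / total, total) for label, (fa, total) in counts.items()}


far = far_by_interval(synthetic_attempts)
for label, (rate, total) in far.items():
    print(f"{label}: FAR = {rate:.2f} over {total} attempts")
```

Reporting the attempt count next to each rate matters, because a bucket with only a handful of attempts can swing widely from one run to the next.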

The art and science of measuring performance

The cornerstone of our analysis is the evaluation of False Acceptance Rate* (FAR) and False Rejection Rate* (FRR) across diverse data segments. This entails examining these metrics in specific scenarios or variable combinations, such as FAR for Synthetic Voice Injection across different net speech intervals, or even more detailed analyses, like verification FAR for synthetic voice injection among Spanish-speaking females with net speech < 6 seconds. While achieving statistical significance in niche scenarios can be challenging, the juxtaposition of FAR and FRR is critical. Viewing the two rates side by side helps confirm that the system’s sensitivity is tuned to balance robust fraud detection (low FAR) against user convenience (low FRR), which is essential for optimizing both security and customer experience.
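
A segment-level FAR/FRR report can be sketched as below. To make the small-sample problem visible, each rate is paired with a Wilson score interval, which is a swapped-in, standard technique and not something the evaluation is stated to have used. The segment name, the trial data, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = errors / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def error_rates(trials):
    """trials: iterable of (segment, kind, accepted); kind is 'positive' or 'negative'."""
    tally = defaultdict(lambda: {"fa": 0, "neg": 0, "fr": 0, "pos": 0})
    for segment, kind, accepted in trials:
        t = tally[segment]
        if kind == "negative":
            t["neg"] += 1
            t["fa"] += int(accepted)      # unauthorized voice accepted
        else:
            t["pos"] += 1
            t["fr"] += int(not accepted)  # authorized voice rejected
    return {
        segment: {
            "FAR": t["fa"] / t["neg"] if t["neg"] else None,
            "FAR_ci": wilson_interval(t["fa"], t["neg"]),
            "FRR": t["fr"] / t["pos"] if t["pos"] else None,
            "FRR_ci": wilson_interval(t["fr"], t["pos"]),
        }
        for segment, t in tally.items()
    }


# Hypothetical niche segment: 5 negative tests (1 accepted), 5 positive tests (1 rejected).
segment = "spanish_female_synthetic_lt6s"
trials = (
    [(segment, "negative", a) for a in (True, False, False, False, False)]
    + [(segment, "positive", a) for a in (True, True, True, True, False)]
)
report = error_rates(trials)
print(report[segment])
```

With only five trials per side, the interval around a 20% rate is very wide, which is the statistical significance problem in numeric form.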

Ultimately, our testing methodologies and processes are designed to arm decision makers with the data they need to objectively evaluate different Authentication platforms. The goal is not only to challenge and evaluate the efficacy of voice biometric authentication, but also to ensure that enterprises can confidently integrate these technologies, bolstering both security and user satisfaction.

Glossary:

Positive Tests: Test cases in which an authorized user’s voice is presented and the system is expected to accept it. They test the system’s reliability and effectiveness in recognizing and verifying authorized users, enhancing user trust and system integrity.
Negative Tests (aka Spoof Tests): Test cases in which an unauthorized user’s voice is presented and the system is expected to reject it. These tests are essential for assessing the system’s security measures and its capability to safeguard against unauthorized access attempts.
Net Speech: Net Speech refers to the actual amount of speech content within a voice interaction, excluding any periods of silence or non-speech elements. Optimizing the Net Speech threshold is essential for efficient and secure user authentication. It impacts system responsiveness, user experience, and the ability to accurately authenticate users under various conditions.
False Acceptance Rate (FAR): The percentage of Negative Tests where unauthorized users are incorrectly verified/recognized as authorized users. This is a key metric for maximizing the security of the system.
False Rejection Rate (FRR): The percentage of Positive Tests where authorized users are wrongly denied access by the biometric system, mistaking them for unauthorized users. Minimizing FRR is essential for user experience, satisfaction, and overall efficiency.
