
Written by: Elie Khoury

VP Research

Amid rapid advances in voice biometrics, recent strides in generative AI have raised concerns about the reliability of voice authentication. Deepfakes, capable of mimicking anyone’s voice with remarkable realism, have emerged as a serious threat to speaker verification systems. At Pindrop, our commitment to combating voice fraud sets us apart as industry-leading experts. In this article, we explore and answer the questions raised about voice biometrics by a University of Waterloo study. Continue reading to understand how Pindrop’s Liveness Detection system effectively mitigates the risks posed by signal-modified deepfakes.

Questions raised against voice biometrics

Voice anti-spoofing detection systems, also known as countermeasures (CM), have been developed to detect and thwart deepfake attempts. Recent industry developments have raised two questions about the ability of CM systems to address emerging challenges. First, do CM systems struggle to identify synthetic content from new Text-To-Speech (TTS) systems, making zero-day attacks harder to detect? Second, can the tell-tale signs that TTS systems leave in synthetic audio be masked through signal modifications, rendering synthetic content virtually undetectable by CM systems?

We answered the first question by showcasing how Pindrop’s system effectively detects zero-day attacks created using Meta’s new Voicebox system [link]. The University of Waterloo published a study [link] on the second question, which we address below.

About the University of Waterloo’s study

Researchers at the University of Waterloo undertook a study of signal modifications applied to synthetic audio with the aim of bypassing countermeasures. According to the study, TTS systems leave behind tell-tale signs in the synthetic audio they generate, and CM systems identify whether audio is synthetic or live based on these tell-tale signs.

The Waterloo team’s thesis is that malicious actors can remove these tell-tale signs by applying certain masking modifications to the signal. They experimented with seven signal modifications to machine speech, aiming to erase the distinctions between genuine and machine-generated speech and thereby bypass countermeasures. These signal modifications (sketched in code after the list) included:

  1. Replacing leading and trailing silences with silence from genuine audio
  2. Removing inter-word redundant silences in the machine speech utterance
  3. Spectral modification to boost the center of the speech spectrum
  4. Adding an echo effect
  5. Applying pre-emphasis
  6. Noise reduction to eliminate unnatural noise in machine audio
  7. Adversarial speaker regularization
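
To make a few of these concrete, here is a minimal Python sketch of modifications in the spirit of items 1, 4, and 5, using NumPy. This is our own illustrative code under assumed parameter values, not the Waterloo team’s implementation; `x` is assumed to be a mono float waveform and `sr` its sample rate.

```python
# Illustrative sketch (not the Waterloo team's code) of silence trimming,
# echo, and pre-emphasis. Assumes x is a mono float32 waveform and sr its
# sample rate in Hz; thresholds and coefficients are invented examples.
import numpy as np

def trim_edge_silence(x: np.ndarray, threshold: float = 1e-3) -> np.ndarray:
    """Item 1, simplified: strip low-energy leading/trailing samples."""
    voiced = np.where(np.abs(x) > threshold)[0]
    return x[voiced[0]:voiced[-1] + 1] if voiced.size else x

def add_echo(x: np.ndarray, sr: int, delay_s: float = 0.05,
             decay: float = 0.3) -> np.ndarray:
    """Item 4: mix in a delayed, attenuated copy of the signal."""
    d = int(sr * delay_s)
    y = x.copy()
    y[d:] += decay * x[:-d]
    return y / np.max(np.abs(y))  # renormalize to avoid clipping

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Item 5: first-order high-pass filter, y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```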


The resulting signal-modified deepfakes are difficult to detect for CM systems that rely on identifying these tell-tale signs in the first place. The study found that certain signal modifications could deceive specific combinations of Automatic Speaker Verification (ASV) and CM systems, with attack success rates ranging from 9.55% to 99%.
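
For context on how these two components typically interact, below is a minimal, hedged sketch of an ASV+CM cascade decision. The score conventions and thresholds are our own illustrative assumptions, not those of any system evaluated in the study.

```python
# Illustrative ASV+CM cascade (not any specific system from the study):
# accept a caller only if the countermeasure judges the audio live AND the
# speaker verifier matches the enrolled voiceprint. Thresholds are made up.
def accept_call(cm_score: float, asv_score: float,
                cm_threshold: float = 0.5, asv_threshold: float = 0.7) -> bool:
    is_live = cm_score >= cm_threshold     # CM: higher = more likely live
    is_match = asv_score >= asv_threshold  # ASV: higher = closer voice match
    return is_live and is_match

# A signal-modified deepfake targets the first check: if the modifications
# push cm_score above its threshold while the cloned voice passes ASV, the
# whole cascade is bypassed.
print(accept_call(cm_score=0.62, asv_score=0.81))  # True -> attack succeeds
```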

The research conducted by the team at the University of Waterloo shed light on the potential challenges that countermeasures face in detecting these modified synthetic utterances. It underscored the need for advanced and resilient solutions like Pindrop’s Liveness Detection system as highlighted below.

Pindrop’s response and test results

At Pindrop, we recognize the potential risks associated with signal-modified deepfakes. To assess these risks, we reproduced the signal modifications used in the Waterloo study and rigorously tested our system against them. The results were clear: our system successfully detected the deepfakes, outperforming even the best ASV+CM system used by the Waterloo team.

Our Liveness Detection system demonstrated remarkable performance against adversarially modified spoofed utterances, outperforming the best systems from the Waterloo paper by a clear margin on every modification. Additionally, when combined with voice authentication, our accuracy on the full attack set (F1-F7) rose from 98.3% to 99.2%. This accuracy showcases the effectiveness and reliability of Pindrop’s solution in mitigating the risks posed by signal-modified deepfakes.

This table compares Pindrop’s Liveness Detection accuracy with the worst- and best-performing systems reported in the Waterloo paper (all figures are detection accuracy):

| Attack type | Worst reported system in the Waterloo paper | Best reported system in the Waterloo paper | Pindrop’s system |
| --- | --- | --- | --- |
| F1 | 84.4% | 95.2% | 99.2% |
| F1-F2 | 58.4% | 96.6% | 99.2% |
| F1-F3 | 56.5% | 95.4% | 99.6% |
| F1-F4 | 53.0% | 94.5% | 99.8% |
| F1-F5 | 46.0% | 92.0% | 99.8% |
| F1-F6 | 42.2% | 92.0% | 97.3% |
| F1-F7 (all attacks) | 38.2% | 88.0% | 98.3% |
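
One common way to combine liveness detection with voice authentication is score-level fusion, which forces an attack to fool both signals at once. The sketch below is a generic, hedged example of that idea; the weight and threshold are invented for illustration and are not Pindrop’s actual parameters.

```python
# Generic score-level fusion sketch (illustrative parameters, not
# Pindrop's): a weighted sum of liveness and voice-authentication scores
# must clear a single decision threshold.
def fused_decision(liveness: float, voice_auth: float,
                   w_live: float = 0.6, threshold: float = 0.75) -> bool:
    score = w_live * liveness + (1.0 - w_live) * voice_auth
    return score >= threshold

# A deepfake that clones the voice well (high voice_auth) but scores low on
# liveness is still rejected: 0.6*0.55 + 0.4*0.9 = 0.69 < 0.75.
print(fused_decision(liveness=0.55, voice_auth=0.9))  # False -> blocked
```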

What does this mean for call center teams?

In the pursuit of enhanced security measures, Pindrop’s Liveness Detection system emerges as a powerful ally for call center teams. Our system’s strength lies in its sophisticated technology, extensive training on diverse datasets, and advanced signal processing capabilities. 

How Pindrop’s Liveness Detection helps you build a resilient defense

We take pride in our system’s performance, as it achieved a remarkable 99.2% detection success rate against the attacks replicated from the study. Moreover, our sophisticated approach, trained on diverse spoofed audio with extensive data augmentation, empowers our system to perform exceptionally well in real-world scenarios, even against zero-day attacks.
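
As a general illustration of what such augmentation can look like (a generic sketch, not Pindrop’s actual training pipeline), each training clip might be perturbed with random gain, additive noise, and time shifts so the detector learns features that survive varied channel conditions:

```python
# Generic audio data-augmentation sketch (not Pindrop's pipeline): random
# gain, low-level noise, and a circular time shift applied to each clip.
import numpy as np

rng = np.random.default_rng(seed=42)

def augment(x: np.ndarray) -> np.ndarray:
    x = x * rng.uniform(0.5, 1.0)                   # random gain
    x = x + rng.normal(0.0, 0.005, size=x.shape)    # additive noise
    shift = int(rng.integers(0, max(1, len(x) // 10)))
    return np.roll(x, shift)                        # random time offset
```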

The Waterloo study is important because it demonstrates the feasibility of new attacks that erase the differences between genuine and machine speech, and it underscores the need for constant innovation to outpace malicious actors. The results above show how Pindrop’s research expertise can mitigate current and future attacks on voice authentication. In addition, Pindrop’s emphasis on a multi-factor authentication system that combines voice biometric authentication with deepfake liveness detection ensures heightened security for our customers. By leveraging acoustic cues, behavioral cues, and other metadata, our system becomes more robust and reliable in detecting voice fraud.

As we continue testing against the Waterloo data set and collaborate closely with research teams, we remain committed to staying vigilant against emerging threats. At Pindrop, we’re dedicated to delivering innovative solutions and protecting valuable resources, contributing to our position as leaders in voice authentication fraud protection.

We’d like to thank the Waterloo research team for their insight and assistance in replicating the attacks from their study for our testing purposes. Pindrop welcomes the opportunity to collaborate with research teams across academia and industry to further improve voice authentication and deepfake detection. 

3 Key Takeaways

  • The Waterloo study highlights the threat of new attacks that remove differences between genuine and machine speech and demonstrates the need for constant innovation to stay ahead of malicious actors. 
  • Pindrop detected these deepfakes with 99.2% accuracy, surpassing all other solutions tested and demonstrating our effectiveness against signal-modified deepfakes.
  • Pindrop’s multi-factor system, including voice biometrics, liveness detection, behavior analysis, and device authentication, effectively defends against deepfake fraud in call centers and beyond.
