
Written by: Nicholas Klein

Speech Research Engineer

Audio deepfakes created by advanced text-to-speech (TTS) and voice conversion (VC) systems are increasingly prevalent as more easy-to-use commercial and open-source tools become available. On the Pindrop research team, much of our focus has been on mitigating the threat of fraud, disinformation, and misinformation through reliable detection of synthetic speech. However, recently, we’ve also started investigating ways to identify the engine behind a given deepfake.

In January 2024, we released a research blog detailing how Pindrop uncovered the text-to-speech (TTS) engine behind an election interference robocall imitating President Joe Biden. Our results indicated that the call was most likely created by a popular commercial TTS tool, allowing for subsequent investigations by journalists, fact-checking organizations, and law enforcement.

We also recently published a blog about a deepfake of Elon Musk. In that case, we identified ElevenLabs as the voice cloning vendor used to create the deepfake, and we informed them so that they could investigate further.

Since then, our research on this topic has progressed, and a paper with our findings was recently accepted at the Interspeech 2024 conference. Keep reading for a brief overview of this research.

Research: Source Tracing of Audio Deepfake Systems [link to paper]

By Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury. To be presented at Interspeech 2024 in Kos, Greece on September 3, 2024.

Key contributions

  • We leverage open-source deepfake detection systems for source tracing, predicting the acoustic model and vocoder with state-of-the-art accuracy while additionally predicting the input type (text or speech) with near-perfect accuracy.
  • We devise and publish a new source tracing benchmark, composed of a large number of recent TTS systems, for more robust evaluation of source tracing methods.

Component-based source tracing

As covered in our previous blog on robustness against zero-day attacks, voice cloning systems are typically built from common generative AI building blocks. TTS and VC systems commonly consist of a conversion model responsible for producing output acoustic features and a vocoder that transforms those acoustic features into output waveforms. Because many novel deepfake systems tend to reuse existing building blocks, we adopt a generalizable approach of predicting the conversion model and vocoder. Additionally, we propose identifying the input type (text or speech).
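To make the component structure concrete, here is a minimal, purely illustrative sketch of how a typical TTS/VC pipeline decomposes into an acoustic (conversion) model and a vocoder, along with the three attributes a source tracing system predicts for a given utterance. The class names and placeholder outputs are our own assumptions for illustration, not part of the paper or of any particular toolkit.

```python
# Illustrative only: the two generative building blocks of a typical TTS/VC
# pipeline, plus the three attributes source tracing predicts.
import numpy as np

class AcousticModel:
    """Conversion model: maps the input (text for TTS, source speech for VC)
    to acoustic features such as a mel-spectrogram (e.g. Tacotron 2, FastSpeech 2)."""
    def convert(self, model_input) -> np.ndarray:
        n_frames, n_mels = 200, 80
        return np.zeros((n_frames, n_mels))        # placeholder mel-spectrogram

class Vocoder:
    """Vocoder: maps acoustic features to an output waveform
    (e.g. HiFi-GAN, WaveGlow, WORLD)."""
    def synthesize(self, mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
        return np.zeros(mel.shape[0] * hop_length)  # placeholder waveform

# For a given deepfake utterance, source tracing predicts:
#   1. the acoustic (conversion) model family,
#   2. the vocoder family,
#   3. the input type: "text" (TTS) or "speech" (VC).
```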

Source tracing methods

In our paper, we experiment with two strategies for leveraging existing state-of-the-art deepfake detection systems for the task of component classification. Here, we present the method that we refer to as two-stage source tracing.

The two-stage approach splits training into two steps. First, a front-end model is trained for the standard binary deepfake detection task.

Next, the front-end weights are frozen and lightweight classification heads are trained on the embeddings for each separate component classification task.

For the classification heads, we use the simple feed-forward architecture from the back-end model of the ResNet deepfake detection system.1

While the two-stage approach is limited to the information that the binary-trained deepfake detection system learns, it is very attractive in practice: in addition to reducing computational costs, existing binary systems can be trained on significantly more data than we have component labels for.
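The sketch below, in PyTorch, illustrates the second stage under our own simplifying assumptions: the stage-one front end is frozen, and one lightweight feed-forward head per component task is trained on its embeddings. Module names, dimensions, and the training loop are illustrative and do not reflect the exact configuration used in the paper.

```python
# Minimal sketch of stage 2 of the two-stage approach: freeze the front end
# trained for binary deepfake detection, then train small feed-forward heads
# on its embeddings, one head per component classification task.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Lightweight feed-forward head trained on frozen embeddings."""
    def __init__(self, emb_dim: int, n_classes: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)

def train_heads(front_end: nn.Module, loader, emb_dim: int, task_classes: dict):
    """Stage 2: freeze the stage-1 front end and train one head per task."""
    front_end.eval()
    for p in front_end.parameters():
        p.requires_grad_(False)                   # freeze stage-1 weights

    heads = {t: ClassificationHead(emb_dim, n) for t, n in task_classes.items()}
    params = [p for h in heads.values() for p in h.parameters()]
    opt = torch.optim.Adam(params, lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    for audio, labels in loader:                  # labels: dict of task -> class index
        with torch.no_grad():
            emb = front_end(audio)                # fixed embeddings from the front end
        loss = sum(loss_fn(heads[t](emb), labels[t]) for t in heads)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return heads

# Example task definition (class counts are placeholders):
# heads = train_heads(front_end, loader, emb_dim=256,
#                     task_classes={"acoustic_model": 8, "vocoder": 6, "input_type": 2})
```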

Results on an existing benchmark

As an anchor to previous work, we evaluate our methods on the ASVspoof 2019-based protocol designed by Zhu et al.2 Utterances in this protocol come from the ASVspoof 2019 LA dataset, which contains synthetic examples generated by a set of different TTS and VC systems.3 We use the same categories as Zhu et al. for the acoustic model and vocoder classification tasks. Additionally, we create a new "Input type" task, which helps distinguish between TTS and VC systems.
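For completeness, the small helper below (our own illustration, not code from the protocol) shows how per-task accuracy on such a labelled protocol can be computed, yielding one score each for the acoustic model, vocoder, and input type tasks.

```python
# Illustrative helper: per-task accuracy over a labelled evaluation protocol.
from collections import defaultdict

def per_task_accuracy(predictions, references):
    """predictions/references: lists of dicts such as
    {"acoustic_model": "...", "vocoder": "...", "input_type": "text"}."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ref in zip(predictions, references):
        for task, ref_label in ref.items():
            total[task] += 1
            correct[task] += int(pred.get(task) == ref_label)
    return {task: correct[task] / total[task] for task in total}
```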

Our key takeaways

  • We reach 99.9% accuracy on the newly proposed input type prediction task for the ASVspoof protocol.
  • We achieve state-of-the-art performance on both the acoustic model and vocoder classification tasks using the SSL (E2E) method, with accuracies of 99.4% and 84.6%, respectively. We attribute this improvement over previous work to the earlier method's use of a multi-task objective, which forced its models to trade off performance across tasks.
  • The more restricted but practical two-stage approach yields competitive accuracy.

Ongoing work

As of today, our source tracing solutions have expanded on our Interspeech 2024 work to enable the discrimination of a much larger number of deepfake engines, including popular commercial and open-source systems. We are planning to share more updates about this in the near future.

This work is a good step toward providing an additional layer of intelligence and understanding on top of deepfake detection. We believe it will be very useful to journalists, fact checkers, and law enforcement for pointed follow-up investigations when synthetic speech is detected.

References

1. T. Chen, A. Kumar, P. Nagarsheth, G. Sivaraman, and E. Khoury, "Generalization of audio deepfake detection," in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020.

2. T. Zhu, X. Wang, X. Qin, and M. Li, "Source tracing: Detecting voice spoofing," in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022.

3. X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Computer Speech & Language, vol. 64, p. 101114, 2020.
