Generalization of Audio Deepfake Detection
Nicholas Klein, Tianxiang Chen, Hemlata Tak, Ricardo Casal, Elie Khoury
Pindrop, Atlanta, GA, USA [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
Recent progress in generative AI technology has made audio deepfakes remarkably more realistic. While current research on anti-spoofing systems primarily focuses on assessing whether a given audio sample is fake or genuine, there has been limited attention on identifying the specific techniques used to create audio deepfakes. Algorithms commonly used in audio deepfake generation, such as text-to-speech (TTS) and voice conversion (VC), undergo distinct stages including input processing, acoustic modeling, and waveform generation. In this work, we introduce a system designed to classify various spoofing attributes, capturing the distinctive features of the individual modules throughout the entire generation pipeline. We evaluate our system on two datasets: the ASVspoof 2019 Logical Access (LA) dataset and the Multi-Language Audio Anti-Spoofing Dataset (MLAAD). Results from both experiments demonstrate the robustness of the system in identifying the different spoofing attributes of deepfake generation systems.
Index Terms: anti-spoofing, audio deepfake detection, explainability, ASVspoof
1. Introduction
In recent years, deepfake generation and detection have attracted significant attention. On January 21, 2024, an advanced text-to-speech (TTS) system was used to generate fake calls imitating the voice of US President Joe Biden, encouraging voters to skip the 2024 primary election in the state of New Hampshire [1]. This incident underscores the critical need for deepfake detection that is reliable and trusted; explainability in deepfake detection systems is therefore crucial. Within this research area, the task of deepfake audio source attribution has recently been gaining interest [2-10]. The goal of this task is to predict the source system that generated a given utterance. For example, the study in [2] aims to predict the specific attack systems used to produce utterances in ASVspoof 2019 [11]. Directly identifying the name of the system, however, misses the opportunity to categorize spoofing systems based on their attributes. Such attribute-based categorization allows for better generalization to spoofing algorithms that are unseen in training but are composed of building blocks, such as acoustic models or vocoders, that are seen.
Along these lines, the authors in [3] propose a more generalizable approach by classifying the vocoder used in the spoofing system. The authors in [4] explore classifying both the acoustic model and the vocoder, finding that the acoustic model is more challenging to predict. The work in [5] takes this further by proposing to classify several attributes of spoofing systems in ASVspoof 2019 LA: conversion model, speaker representation, and vocoder. However, their findings demonstrate accuracy challenges in discerning speaker representation. Another drawback of their evaluation protocol is that the ASVspoof 2019 dataset is relatively outdated, as there have been many advancements in voice cloning techniques in the last five years. Finally, their choice of categories for the acoustic model and vocoder is very broad (e.g., "RNN related" for the acoustic model and "neural network" for the vocoder) and may not be very useful in narrowing down the identity of the spoofing system.
2. Attribute Classification of Spoof Systems
2.1 Proposed Strategies
We present two strategies for leveraging existing state-of-the-art (SOTA) spoofing countermeasure (CM) systems for the task of component classification:
- End-to-End (E2E): This approach takes an existing CM architecture and trains the whole model for each of the multi-class component classification tasks separately.
- Two-Stage: This approach splits training into two steps: first, an existing CM is trained for the standard binary spoof detection task; next, the CM backbone is frozen, and a lightweight classification head is trained on the CM’s embeddings for each separate component classification task.
While the second approach is limited to the information that the binary-trained CM learns, it is attractive in practice due to reduced computational costs and the ability to leverage existing binary systems trained on significantly more data than is available with component labels.
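To make the two-stage strategy concrete, the following is a minimal PyTorch sketch, assuming a pretrained binary CM backbone that exposes an `embed()` method returning fixed-size spoof embeddings; the class names, embedding dimension, and training-step structure are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Lightweight classifier trained on frozen CM embeddings (one per attribute task)."""
    def __init__(self, emb_dim: int, n_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.net(emb)

def train_step(cm_backbone, head, optimizer, waveforms, labels):
    # Stage 1 is assumed done: cm_backbone was already trained for binary spoof detection.
    cm_backbone.eval()                       # keep the backbone frozen
    with torch.no_grad():
        emb = cm_backbone.embed(waveforms)   # e.g. (batch, emb_dim) spoof embeddings
    logits = head(emb)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                          # gradients flow only through the head
    optimizer.step()
    return loss.item()
```

A separate head would be trained this way for each component classification task (input type, acoustic model, vocoder), all sharing the same frozen backbone.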
2.2 Countermeasures
We used three different CMs to validate our hypothesis:
- ResNet: Consists of a front-end spoof embedding extractor and a back-end classifier. To enhance generalization, large margin cosine loss (LMCL) and random frequency masking augmentation are applied during training (a loss sketch follows this list).
- Self-Supervised Learning (SSL): Combines SSL-based front-end feature extraction with an advanced graph neural network-based back-end. The SSL feature extractor is a pre-trained wav2vec 2.0 model fine-tuned during CM training.
- Whisper: Based on an encoder-decoder Transformer architecture for automatic speech recognition. The Whisper CM architecture uses a combination of Whisper-based front-end features and a light convolution neural network back-end.
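As a concrete reference for the LMCL mentioned in the ResNet bullet, below is a minimal CosFace-style large margin cosine loss sketch in PyTorch; the scale and margin values are placeholders, and this is not claimed to match the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMCL(nn.Module):
    """Large margin cosine loss: cosine logits with a margin subtracted from the target class."""
    def __init__(self, emb_dim: int, n_classes: int, scale: float = 30.0, margin: float = 0.2):
        super().__init__()
        self.scale, self.margin = scale, margin
        self.weight = nn.Parameter(torch.randn(n_classes, emb_dim))

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights.
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))   # (batch, n_classes)
        margin = F.one_hot(labels, cos.size(1)).float() * self.margin
        logits = self.scale * (cos - margin)                         # penalize the target class
        return F.cross_entropy(logits, labels)
```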
3. Datasets and Protocols
3.1 ASVspoof 2019
The ASVspoof 2019 LA dataset includes three partitions: train, development, and evaluation. Spoofed utterances are generated using a set of TTS, VC, and hybrid TTS-VC algorithms. We adopt the protocol partition detailed in Table 1. While we use the same categories for the acoustic model and vocoder tasks, we create a new "Input Type" task to separate TTS and VC systems.
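For illustration, an "Input Type" label can be derived directly from the attack identifier of each spoofed utterance. The sketch below shows such a mapping for the ASVspoof 2019 LA training attacks; it is an illustrative reconstruction from the public attack descriptions, not a reproduction of the paper's Table 1.

```python
# Illustrative mapping of ASVspoof 2019 LA training attacks to the "Input Type" task.
INPUT_TYPE = {
    "A01": "TTS", "A02": "TTS", "A03": "TTS", "A04": "TTS",
    "A05": "VC",  "A06": "VC",
}

def input_type_label(attack_id: str) -> str:
    """Map an attack ID (e.g. 'A03') to its input-type class."""
    return INPUT_TYPE[attack_id]
```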
3.2 MLAAD
The MLAAD dataset includes 52 state-of-the-art spoofing algorithms. Labels for acoustic models and vocoders are derived from the dataset metadata. Compared to ASVspoof, this protocol includes modern attack systems with specific acoustic models and vocoders.
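As a sketch of how such labels might be derived in practice, the snippet below reads a metadata CSV and builds per-file acoustic model and vocoder labels. The file name and column names (e.g., `meta.csv`, `model_name`, `vocoder`) are assumptions for illustration; the actual MLAAD metadata schema may differ.

```python
import csv

def load_attribute_labels(meta_path: str = "meta.csv"):
    """Build {audio_path: {acoustic_model, vocoder}} from MLAAD-style metadata (hypothetical schema)."""
    labels = {}
    with open(meta_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            labels[row["path"]] = {
                "acoustic_model": row.get("model_name", "unknown"),
                "vocoder": row.get("vocoder", "unknown"),
            }
    return labels
```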
4. Experimental Results
4.1 Implementation Details
- ResNet and SSL models use 4s raw audio inputs, while Whisper processes 30s audio.
- For ResNet, LFCC features are extracted using 20ms windows with a 10ms frame shift (see the sketch after this list).
- SSL and Whisper models are fine-tuned, while ResNet is trained from scratch.
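A minimal torchaudio sketch of the LFCC front-end described above (4s crop of 16 kHz audio, 20ms window, 10ms frame shift) is shown below; the number of cepstral coefficients and the FFT size are placeholder values, not taken from the paper.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000
CROP = 4 * SAMPLE_RATE            # 4 s of raw audio
WIN = int(0.020 * SAMPLE_RATE)    # 20 ms window -> 320 samples
HOP = int(0.010 * SAMPLE_RATE)    # 10 ms shift  -> 160 samples

# n_lfcc and n_fft are placeholder values chosen for illustration.
lfcc = torchaudio.transforms.LFCC(
    sample_rate=SAMPLE_RATE,
    n_lfcc=60,
    speckwargs={"n_fft": 512, "win_length": WIN, "hop_length": HOP},
)

def extract_lfcc(waveform: torch.Tensor) -> torch.Tensor:
    """Crop or pad mono audio to 4 s, then compute LFCC features."""
    if waveform.shape[-1] < CROP:
        waveform = torch.nn.functional.pad(waveform, (0, CROP - waveform.shape[-1]))
    return lfcc(waveform[..., :CROP])   # shape (..., n_lfcc, frames)
```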
4.2 Results
Results are evaluated on the ASVspoof and MLAAD datasets. Models demonstrate high accuracy in input type, acoustic model, and vocoder classification tasks.
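Evaluation of each attribute task reduces to standard multi-class scoring. A minimal scikit-learn sketch is given below; the function name, example labels, and the choice of reporting both plain and balanced accuracy are illustrative assumptions rather than the paper's exact evaluation code.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def evaluate_task(y_true, y_pred):
    """Score one attribute task (e.g. vocoder classification) from label lists."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }

# Example usage with made-up labels:
print(evaluate_task(["hifigan", "world", "world"], ["hifigan", "world", "hifigan"]))
```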
5. Conclusions
We propose multi-class classification tasks for deepfake detection, achieving high accuracy on both ASVspoof and MLAAD protocols. Our findings highlight the challenges of distinguishing between similar acoustic models and the potential for future research in voice-independent spoof detection.
References
(A complete reference list as provided in the original document.)