
Pindrop Labs’ Submission to the ASVspoof 2021 Challenge

Tianxiang Chen, Elie Khoury, Kedar Phatak, Ganesh Sivaraman

Pindrop, Atlanta, GA, USA

Abstract

Voice spoofing has become a major threat to automatic speaker verification (ASV) systems due to the rapid development of speech synthesis and voice conversion techniques. Detecting these attacks effectively has become crucial for those systems. The ASVspoof 2021 challenge provides a unique opportunity to foster the development and evaluation of new techniques to detect logical access (LA), physical access (PA), and Deepfake (DF) attacks covering a wide range of techniques and audio conditions. Pindrop Labs participated in both the LA and DF detection tracks. Our submissions to the challenge consist of a cascade of an embedding extractor and a backend classifier. Instead of focusing on extensive feature engineering and complex score fusion methods, we focus on improving the generalization of the embedding extractor model and the backend classifier model. We use log filter banks as the acoustic features in all our systems. We study different pooling methods and loss functions. Additionally, we investigate the effectiveness of stochastic weight averaging, which further improves the robustness of the spoofing detection system. Overall, three different variants of the same system were submitted to the challenge. They all achieved very competitive performance on both the LA and DF tracks, and their combination achieved a min-tDCF of 0.2608 on the LA track and an EER of 16.05% on the DF track.

1. Introduction

Automatic speaker verification (ASV) has been widely adopted in many human-machine interfaces. The accuracy of ASV systems has improved greatly in the past decades thanks to deep learning algorithms. Meanwhile, deep learning-based text-to-speech synthesis (TTS) and voice conversion (VC) techniques are also able to generate extremely realistic speech utterances. TTS and VC techniques like WaveNet [1], Deep Voice [2], and Tacotron [3] have greatly enhanced the quality of voice-spoofed utterances. These spoofed utterances are often indistinguishable from genuine speech to human ears and are able to deceive state-of-the-art ASV systems. Thus, the detection of these voice spoofing attacks has drawn great attention in the research community and the technology industry.

To benchmark the progress of research in voice spoofing detection and foster research efforts, the ASVspoof challenge has released a series of spoofing datasets. In 2019, the ASVspoof challenge [6] released two datasets: physical access (PA) and logical access (LA). The PA dataset focuses on replay attacks, and the LA dataset covers synthesized and converted speech. The LA dataset was largely built around deep learning-based spoofing techniques, and it primarily focused on evaluating the generalization of the spoofing detection model. In total, it includes seventeen different TTS and VC techniques, but only six of them are in the training and development sets. During the ASVspoof 2019 challenge, many submissions focused on investigating different low-level spectro-temporal features [7, 8, 9, 10, 11, 12] and ensemble-based approaches.

In ASVspoof 2021 [13], the challenge has further included more data to simulate more practical and realistic scenarios of different spoofing attacks. There are three sub-challenges: physical access (PA), logical access (LA), and deepfake detection (DF). The PA dataset contains real replayed speech and a small portion of simulated replayed speech. For the LA dataset, while the training and development data remain the same as ASVspoof 2019, various codec and channel transmission effects are added to the evaluation data. This is aimed at simulating telephony scenarios and evaluating the robustness of the spoofing detection model against different channel effects. The challenge has also further extended the LA track to general speech Deepfake detection (DF). Deepfake detection deals with detecting synthesized voice in any audio recording. The speech Deepfake detection task involves different audio compression techniques such as mp3 and m4a, along with additional spoofing techniques. This Deepfake detection task aims to evaluate the spoofing detection system against different unknown conditions. Therefore, the detection systems for both LA and DF tracks need to be robust to unseen attacks and audio compression techniques.

This paper presents Pindrop Labs’ submissions to the LA and DF tracks and introduces a novel spoofing detection system. Our submissions were among the top-performing systems on the full evaluation sets of both the LA and DF tracks. In total, we trained three systems. The first system, proposed in [14], is a ResNet-based spoofing detection system trained using large margin cosine loss. The second system extends the first by replacing the mean and standard deviation pooling layer with a learnable dictionary encoding (LDE) [15] layer. The third system also uses the LDE pooling layer but is trained using Softmax activation in the output layer and the cross-entropy loss function. All systems contain two main components: an embedding extractor and a backend classifier. Figure 1 shows the framework of our spoof detection system. The final submissions to both the LA and DF tracks are fusions of the three spoofing detection systems.

2. Datasets

We use the ASVspoof 2019 official LA training and development datasets to train and evaluate our systems. Various data augmentation methods are applied to the training dataset to increase the amount of data and the robustness of the models. The ASVspoof 2019 and 2021 datasets are presented in Sections 2.1 and 2.2. The data augmentation techniques are introduced in Section 2.3.

2.1. ASVspoof 2019 Challenge Dataset

The ASVspoof 2019 [6] logical access (LA) dataset comprises seventeen different text-to-speech (TTS) and voice conversion (VC) techniques, from traditional vocoders to recent state-of-the-art neural vocoders. The spoofing techniques are divided into two groups: six known techniques and eleven unknown techniques. The training and development sets contain the six known spoofing techniques, while the evaluation set contains the eleven unknown spoofing techniques. Only the training and development sets are used for developing the spoofing detection systems in this work.

2.2. ASVspoof 2021 LA & DF Dataset

The ASVspoof 2021 LA track aims to evaluate the robustness of the spoofing detection model across different channels. Although the spoofing techniques used in this dataset are the same as in 2019, multiple codec and transmission effects are added to the audio samples. Both bonafide and spoofed samples are transmitted through either public switched telephone networks or voice-over-internet protocol networks. All audio samples are resampled to 16 kHz after passing through different networks.

Deepfake detection is an extension of the LA track, focusing on evaluating spoofing detection systems across different audio compressions. The compression algorithms include mp3, m4a, and other unknown techniques.

2.3. Data Augmentation

Three different types of data augmentation are applied to the training dataset in this work: reverberation, background noise, and audio compression effects. These augmentations simulate realistic conditions to enhance the robustness of the model.
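As a rough illustration of two of these augmentations, the following NumPy sketch mixes noise into a waveform at a chosen signal-to-noise ratio and applies reverberation by convolving with an impulse response. The function names, the toy sine "speech", the synthetic impulse response, and the 10 dB SNR are assumptions made for this example and are not taken from the paper. Compression effects are left out because they require an external codec.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at the requested signal-to-noise ratio (dB)."""
    # Tile or trim the noise so that it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, keeping the original length."""
    wet = np.convolve(speech, rir)[: len(speech)]
    # Rescale to the original peak level to avoid clipping.
    peak = np.max(np.abs(wet)) + 1e-12
    return wet * (np.max(np.abs(speech)) / peak)

rng = np.random.default_rng(0)
sr = 16000                                              # sample rate in Hz
speech = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # 1 s toy "speech"
noise = rng.standard_normal(sr // 2)                    # 0.5 s toy noise clip

# Toy impulse response (assumption): exponentially decaying noise, 0.2 s long.
t = np.arange(int(0.2 * sr)) / sr
rir = rng.standard_normal(len(t)) * np.exp(-t / 0.05)
rir[0] = 1.0                                            # direct path

noisy = add_noise(speech, noise, snr_db=10.0)
reverberant = add_reverb(speech, rir)
```

In practice the noise clips and impulse responses would come from recorded corpora rather than random generators.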

3. Methodology

3.1. Features and Preprocessing

Linear filter banks (LFBs) are used as features in this work. LFBs are a compressed version of short-time Fourier transforms (STFTs) obtained with a linearly spaced filter bank. We use 60-dimensional LFBs extracted on 30 ms windows with a 10 ms overlap. Mean and variance normalization is performed per utterance during training and testing. Online frequency masking is applied during training.
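The feature pipeline described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the 16 kHz sample rate, the 512-point FFT, the Hamming window, the triangular filter shape, the maximum mask width of 8 bands, and the literal reading of "10 ms overlap" as a 20 ms hop are all assumptions, as are the function names.

```python
import numpy as np

def linear_filterbank(n_filters, n_fft):
    """Triangular filters whose centre frequencies are linearly spaced."""
    n_bins = n_fft // 2 + 1
    # n_filters + 2 edge points from 0 Hz to Nyquist, expressed in FFT bins.
    edges = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        left, centre, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (bins - left) / (centre - left)
        falling = (right - bins) / (right - centre)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_lfb(signal, sr=16000, win_ms=30, overlap_ms=10, n_filters=60, n_fft=512):
    """Log energies of a linear filter bank applied to the power spectrum."""
    win = int(sr * win_ms / 1000)
    hop = win - int(sr * overlap_ms / 1000)   # assumption: hop = window - overlap
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power @ linear_filterbank(n_filters, n_fft).T + 1e-10)

def cmvn(feats):
    """Per-utterance mean and variance normalization of each feature dimension."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def frequency_mask(feats, max_width, rng):
    """Zero out one random contiguous block of filter-bank channels."""
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, feats.shape[1] - width + 1))
    masked = feats.copy()
    masked[:, start:start + width] = 0.0
    return masked

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)           # 1 s of toy audio
feats = log_lfb(signal)                       # (frames, 60)
normed = cmvn(feats)
masked = frequency_mask(normed, max_width=8, rng=rng)   # training only
```

Masking after normalization means the zeroed channels sit at the per-utterance mean, which is one reasonable choice; the paper does not state the order of these two steps.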

3.2. Embedding Extractors

Three different embedding extractors based on residual neural network (ResNet) architectures are used for the logical access and Deepfake detection tasks. The systems are:

  • ResNet-L-FM: A ResNet18-L-FM model, as proposed in [14], trained with large margin cosine loss.
  • ResNet-L-LDE: An extension of ResNet18-L-FM in which a learnable dictionary encoding (LDE) pooling layer replaces the mean and standard deviation pooling layer.
  • ResNet-S-LDE: Similar to ResNet-L-LDE but trained with a Softmax output layer and cross-entropy loss.
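To make the two ingredients that distinguish these systems more concrete, the sketch below gives NumPy versions of a large margin cosine loss and of an LDE-style pooling layer. Both follow common textbook formulations and are assumptions about the general technique, not the authors' code: the scale of 30, the margin of 0.35, the number of dictionary components, and all names are hypothetical, and the parameters that would be learned during training are simply drawn at random here.

```python
import numpy as np

def lmcl_loss(embeddings, class_weights, labels, scale=30.0, margin=0.35):
    """Cross-entropy over scaled cosine logits, with a margin on the true class."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = e @ w.T                                   # (batch, classes)
    onehot = np.eye(w.shape[0])[labels]
    # Subtract the margin from the target-class cosine only.
    logits = scale * (cos - margin * onehot)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(labels)), labels])

def lde_pool(frames, centres, smoothing):
    """Pool T frame-level vectors into one fixed-length utterance vector."""
    diff = frames[:, None, :] - centres[None, :, :]          # (T, C, D)
    dist = (diff ** 2).sum(axis=2)                           # (T, C)
    # Soft-assign each frame to the dictionary components.
    logits = -smoothing[None, :] * dist
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)
    # Average the weighted residuals per component, then concatenate.
    pooled = (assign[:, :, None] * diff).sum(axis=0) / frames.shape[0]
    return pooled.reshape(-1)                                # (C * D,)

rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 16))       # T=120 frames, D=16 channels
centres = rng.standard_normal((4, 16))        # C=4 dictionary components
smoothing = np.ones(4)
utterance_vec = lde_pool(frames, centres, smoothing)

embeddings = rng.standard_normal((8, 32))
class_weights = rng.standard_normal((2, 32))  # bonafide vs. spoof
labels = np.array([0, 1, 0, 1, 1, 0, 0, 1])
loss = lmcl_loss(embeddings, class_weights, labels)
```

Setting `margin=0.0` reduces the loss to a plain cross-entropy over scaled cosine similarities, which is a convenient way to see what the margin adds. The pooled vector has the same length regardless of the number of input frames.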

3.3. Backend Classifier

The backend classifier is a shallow neural network designed to classify the feature embeddings into bonafide or spoofed classes. Stochastic weight averaging (SWA) is applied to enhance generalization ability.
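The core idea of SWA, keeping an equal-weight running average of the model parameters visited late in training, can be sketched as below. The toy quadratic objective, the learning rate, the epoch at which averaging starts, and the `WeightAverager` class are assumptions made for illustration; the paper does not describe its SWA schedule here.

```python
import numpy as np

class WeightAverager:
    """Keeps an equal-weight running mean of parameter snapshots."""

    def __init__(self):
        self.count = 0
        self.average = None

    def update(self, params):
        self.count += 1
        if self.average is None:
            # First snapshot: copy the parameters as the initial average.
            self.average = {k: np.array(v, dtype=float) for k, v in params.items()}
            return
        for k, v in params.items():
            # Incremental mean: avg += (new - avg) / n
            self.average[k] += (v - self.average[k]) / self.count

# Toy problem: noisy gradient descent on f(w) = 0.5 * ||w - target||^2.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
params = {"w": np.zeros(2)}
swa = WeightAverager()
snapshots = []
swa_start = 20                      # hypothetical epoch at which averaging begins

for epoch in range(50):
    grad = (params["w"] - target) + 0.5 * rng.standard_normal(2)
    params["w"] = params["w"] - 0.3 * grad
    if epoch >= swa_start:
        swa.update(params)
        snapshots.append(params["w"].copy())

swa_weights = swa.average["w"]      # parameters to use at test time
```

For networks with batch normalization, the running statistics generally need to be recomputed with the averaged weights before evaluation. PyTorch users can find ready-made helpers in `torch.optim.swa_utils`.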

4. Experimental Results

Results on the ASVspoof 2021 evaluation sets show competitive performance. The combination of the three systems achieved a min-tDCF of 0.2608 on the LA track and an EER of 16.05% on the DF track.
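For readers unfamiliar with the equal error rate (EER), the sketch below shows one simple way to estimate it from detection scores: sweep a threshold and find the point where the false rejection rate of bonafide trials equals the false acceptance rate of spoofed trials. It assumes that higher scores mean "more likely bonafide"; the function name and the synthetic Gaussian scores are illustrative only, and this is not the official challenge scoring tool.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping the threshold over all observed scores."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection: a bonafide trial scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance: a spoofed trial scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2

# Synthetic scores for demonstration only.
rng = np.random.default_rng(0)
bonafide = rng.normal(loc=1.0, scale=1.0, size=1000)
spoof = rng.normal(loc=-1.0, scale=1.0, size=1000)
eer = compute_eer(bonafide, spoof)
print(f"EER: {100 * eer:.2f}%")
```

The min-tDCF metric used on the LA track additionally accounts for the errors of the ASV system and for application-dependent costs, so it cannot be computed from spoofing scores alone.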

5. Conclusions

Pindrop Labs’ submissions to the ASVspoof 2021 challenge achieved competitive results in both LA and DF tracks. Future work will focus on further enhancing robustness across different audio conditions.

References

(Complete references provided in the original document.)
