Deepfakes are synthetic media in which a person in an existing image or video is replaced with someone else’s likeness. They leverage powerful machine learning and artificial intelligence techniques to manipulate or generate visual and audio content with a high potential to deceive. The main methods for creating deepfakes are based on deep learning and involve training generative neural network architectures.
You might have seen some of the harmless and famous deepfakes, like Jordan Peele’s version of Barack Obama, or Britain’s Channel 4 video of the Queen of England’s holiday speech.
The unfortunate thing is that criminals are capitalizing on the dark side of deepfakes, using the same techniques to conduct misinformation campaigns, commit fraud, and obstruct justice. The existence of deepfakes creates doubt about the authenticity of legitimate video evidence.
How to Deal with Deepfakes
In 2019, Microsoft, Facebook, and Amazon launched the Deepfake Detection Challenge to boost development of tools for identifying counterfeit content. The Defense Advanced Research Projects Agency (DARPA) also funded a media forensics project to address the issue.
Fear of interference in the recent US election also inspired a flurry of US regulatory activity. In December 2020, the IOGAN Act directed the National Science Foundation (NSF) and the National Institute of Standards and Technology (NIST) to research deepfakes. The National Defense Authorization Act (NDAA) started 2021 with the creation of a Deepfake Working Group tasked with reporting on the “intelligence, defense, and military implications of deepfake videos and related technologies.”
Detecting deepfake audio has also been the focus of the ASVspoof challenge, where speech scientists from around the world compete and share their findings. The Pindrop Research team has been a regular participant in the challenge since its inception in 2015; its systems have consistently performed well and have been published in peer-reviewed conferences and workshops such as Odyssey, Interspeech, and ASVspoof, as well as in several patents.
In July 2021, Roadrunner, a documentary about the late TV chef and traveler Anthony Bourdain, opened in theaters. Some of the words viewers hear Bourdain speak in the film were faked by artificial intelligence software used to mimic the star’s voice.
Bourdain fans accused the documentary’s director, Morgan Neville, of acting unethically. In an interview, Neville told The New Yorker that he had generated three fake Bourdain clips with the permission of Bourdain’s estate, all from words the chef had written or said but that were not available as audio. He revealed only one, an email Bourdain “reads” in the film’s trailer. “If you watch the film,” Neville said, “you probably don’t know what the other lines are that were spoken by the artificial intelligence, and you’re not going to know.”
Pindrop to the Rescue
But audio experts at Pindrop do know. According to Pindrop’s analysis, the deepfake Bourdain controversy is rooted in less than 50 seconds of audio in the 118-minute film. The analysis also flagged audio midway through the film in which the chef observes that many chefs and writers have a “relentless instinct to fuck up a good thing.” The same sentences appear in an interview Bourdain gave on the occasion of his 60th birthday in 2016. “We’re always looking for ways to test our systems, especially in real-world conditions. This was a new way to validate our technology,” says Collin Davis, Pindrop’s Chief Technology Officer.
To scan for the fake Bourdain audio, Pindrop processed the documentary’s soundtrack to remove noise and make speech more prominent, then ran the segments containing speech through a machine-learning detector that looks for the signatures of synthetic voices. Elie Khoury, Pindrop’s Director of Research, says, “Some of those artifacts can be perceived by the human ear, but others require technological help.”
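The article does not describe Pindrop’s internal tooling, but the general shape of such a preprocessing step can be sketched in a few lines of Python. The following is a minimal, illustrative sketch, assuming open-source libraries (librosa for audio loading, noisereduce for denoising) and a simple energy threshold in place of a trained voice-activity detector; none of it reflects Pindrop’s actual implementation.

```python
import librosa
import noisereduce as nr

def prepare_speech_segments(path, segment_seconds=4.0, sr=16000, energy_gate=0.01):
    """Denoise a soundtrack and split it into fixed-length chunks, keeping
    only chunks with enough energy to plausibly contain speech."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    cleaned = nr.reduce_noise(y=audio, sr=sr)  # spectral-gating noise suppression
    hop = int(segment_seconds * sr)
    segments = []
    for start in range(0, len(cleaned) - hop + 1, hop):
        chunk = cleaned[start:start + hop]
        # Crude energy gate standing in for a real voice-activity detector.
        if librosa.feature.rms(y=chunk).mean() > energy_gate:
            segments.append((start / sr, chunk))  # (timestamp in seconds, audio)
    return segments
```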
Pindrop’s system gave every four-second segment of speech in Roadrunner a deepfake score from 1 to 100; the two undisclosed synthetic clips were identified after reviewing the 30 highest-scoring segments, which also included the fake clip Neville had disclosed. The results of that process show the power, but also some of the limitations, of deepfake detection. Several segments other than the three Pindrop ultimately identified also scored highly on the initial scan. Most were easily eliminated as false positives: some matched on-screen visuals, such as Bourdain’s lips moving, while others were ruled out with standard audio forensic techniques that detected conventional sound processing, heavy music, or background noise.
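To make the triage step concrete, here is a minimal sketch that builds on the segments produced above, assuming a hypothetical `score_segment` classifier that returns a 1-to-100 deepfake score for a chunk of audio; the function name and the review list it produces are illustrative, not Pindrop’s interface.

```python
def rank_segments_for_review(segments, score_segment, top_k=30):
    """Score every (timestamp, audio) segment with a synthetic-speech
    classifier and return the top_k highest-scoring ones for manual review."""
    scored = [(score_segment(chunk), start_time) for start_time, chunk in segments]
    scored.sort(reverse=True)  # highest deepfake score first
    return [(start_time, score) for score, start_time in scored[:top_k]]
```

A human reviewer would then work through the returned list, discarding segments explained by lip-synced on-screen footage, heavy music, or conventional sound processing, as described above.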
When Pindrop provides fraud detection in call centers, false positives can be checked by prompting a caller who triggered the system to provide extra security information. But not every alleged case of deepfake deception allows such easy verification or cross-checking.