News consumption is changing, especially during election cycles
Scrolling on social media for hours on end has yet another unforeseen consequence: it has altered the way the American public consumes the news and, by extension, statements from political leaders. According to the Pew Research Center, “half of US adults [get] news at least sometimes from social media.” When we consume our news on social media, we may assume that the information we’re seeing is honest and credible. Yet, as a recent parody that uses AI-generated voice cloning of VP Kamala Harris demonstrates, we can’t always believe what we’re hearing.
As AI evolves, one troubling fact is emerging: global leaders and average citizens alike can fall victim to voice cloning without their consent. Though the industry is looking towards safety measures like watermarking and consent systems, those tactics may not be enough.
How it started
At 7:11 PM ET on July 26, 2024, Elon Musk reposted a video on X from the account @MrReaganUSA. In a follow-up video, @MrReaganUSA acknowledged that “the controversy is partially fueled by my use of AI to generate Kamala’s voice.” Our research determined more precisely that the audio is a partial deepfake: AI-generated speech intended to replicate VP Harris’s vocal likeness, spliced together with audio clips from her previous remarks.
As of July 31, 2024, Musk’s post was still live and had over 133M views, 245K reposts, and 936K likes. Another parody video of VP Harris was posted to X by @MrReaganUSA on July 31, 2024.
Our analysis of the deepfake
When our research team discovered Musk’s post, they immediately ran an analysis using our award-winning Pindrop® Pulse technology to determine which parts of the audio were manipulated by AI. Pulse is a tool designed for continuous assessment, producing a segment-by-segment breakdown and analyzing for synthetic audio every 4 seconds. This is especially useful for identifying AI manipulation in specific parts of an audio file, helping to spot partial deepfakes.
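To make the segment-by-segment idea concrete, here is a minimal sketch of fixed-window scoring. It is not Pulse’s implementation: the 16 kHz sample rate, the 0.5 decision threshold, and the `score_segment` classifier are all illustrative placeholders.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed sample rate (Hz)
SEGMENT_SECONDS = 4    # a verdict is produced every 4 seconds

def score_segment(segment: np.ndarray) -> float:
    """Hypothetical stand-in for a synthetic-speech classifier.

    A real detector would extract acoustic features and return the
    probability that the segment is AI-generated; this placeholder
    just lets the sketch run end to end.
    """
    return float(np.clip(np.abs(segment).mean() * 10, 0.0, 1.0))

def segment_scores(audio: np.ndarray, sr: int = SAMPLE_RATE):
    """Slice the audio into fixed 4-second windows and score each one."""
    hop = SEGMENT_SECONDS * sr
    for start in range(0, len(audio), hop):
        yield start / sr, score_segment(audio[start:start + hop])

if __name__ == "__main__":
    # 84 seconds of random noise stands in for the clip (21 segments).
    audio = np.random.randn(84 * SAMPLE_RATE).astype(np.float32)
    for t, score in segment_scores(audio):
        verdict = "synthetic" if score >= 0.5 else "not synthetic"
        print(f"{t:6.1f}s  score={score:.2f}  -> {verdict}")
```

A windowed verdict like this is what makes partial deepfakes visible: a file-level score would average the real and synthetic stretches together, while per-segment scores localize exactly where the manipulation occurs.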
Synthetic vs. non-synthetic audio
After denoising the audio to reduce the background music, Pulse flagged fifteen of the twenty-one 4-second segments as synthetic and the remaining six as non-synthetic, which leads us to believe that this is likely a partial deepfake.
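Denoising before detection matters because a loud music bed can mask the acoustic artifacts a classifier relies on. Below is a minimal sketch of that preprocessing step using the open-source `noisereduce` package; this is an assumed stand-in for illustration, not the denoiser Pulse actually uses.

```python
import numpy as np
import noisereduce as nr  # open-source spectral-gating denoiser: pip install noisereduce

SAMPLE_RATE = 16_000  # assumed sample rate (Hz)

def denoise_for_detection(audio: np.ndarray, sr: int = SAMPLE_RATE) -> np.ndarray:
    """Suppress the roughly stationary background (e.g. a music bed) so a
    downstream detector scores the speech rather than the music."""
    return nr.reduce_noise(y=audio, sr=sr)

if __name__ == "__main__":
    noisy = np.random.randn(10 * SAMPLE_RATE).astype(np.float32)
    cleaned = denoise_for_detection(noisy)
    print(f"RMS before: {np.sqrt(np.mean(noisy**2)):.3f}, "
          f"after: {np.sqrt(np.mean(cleaned**2)):.3f}")
```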
With Pulse’s liveness detection capability, our research team found three clips of VP Harris’s previous remarks in the parody video. Each clip, however, was removed from its original context. Listen below:
This audio was taken from a real speech, but altered to repeat in a loop.
VP Harris misspoke in the original speech; that real audio was reused here.
This audio is also from a real speech.
Tracing the source and identifying inadequate AI safety measures
Our research team went one step beyond this breakdown: they identified the voice cloning AI system used to create the synthetic voice. Our source attribution system identified a popular open-source text-to-speech (TTS) system, TorToise, as the source. TorToise is available on GitHub and HuggingFace and is incorporated into frameworks like Coqui. The synthetic voice could have been produced by a commercial vendor reusing TorToise in their system, or by a user employing the open-source version directly.
This incident demonstrates the challenges of relying on watermarking to identify deepfakes and their sources, an issue Pindrop has raised previously. While several of the top commercial vendors are adopting watermarking, numerous open-source AI systems have not. Several of these systems have been developed outside the US, making enforcement difficult.
Pindrop’s technology doesn’t rely on watermarking. Instead, Pulse detects the “signature” of the generating AI system. Every voice cloning system leaves a unique trace, including the type of input (text vs. voice), the acoustic model used, and the vocoder used. Pulse analyzes and maps these unique traces against 350+ AI systems to determine the provenance of the audio. Pindrop used this same approach in previous incidents, including the Biden robocall deepfake in January 2024, which Pulse determined was created by ElevenLabs, a popular commercial TTS system.
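As a rough illustration of signature-based attribution (not Pindrop’s actual method), the sketch below matches a fingerprint extracted from audio against a small library of known generator signatures using cosine similarity. The system names, signature vectors, and `extract_fingerprint` features are all hypothetical.

```python
import numpy as np

# Hypothetical signature library. The real Pulse catalog maps traces from
# 350+ AI systems; these names and vectors are illustrative only.
SIGNATURES = {
    "TorToise":   np.array([0.81, 0.12, 0.55, 0.33]),
    "ElevenLabs": np.array([0.10, 0.92, 0.40, 0.71]),
    "Coqui-XTTS": np.array([0.45, 0.38, 0.90, 0.22]),
}

def extract_fingerprint(audio: np.ndarray) -> np.ndarray:
    """Stand-in for a feature extractor capturing generator artifacts
    (acoustic-model and vocoder traces). Here: a toy 4-dim summary."""
    return np.array([np.abs(chunk).mean() for chunk in np.array_split(audio, 4)])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def attribute_source(audio: np.ndarray) -> tuple[str, float]:
    """Return the known system whose signature best matches the audio's
    fingerprint, along with the cosine similarity of the match."""
    fp = extract_fingerprint(audio)
    name, sig = max(SIGNATURES.items(), key=lambda kv: cosine(fp, kv[1]))
    return name, cosine(fp, sig)

if __name__ == "__main__":
    audio = np.random.rand(64_000).astype(np.float32)  # 4 s at 16 kHz
    system, sim = attribute_source(audio)
    print(f"Closest known generator: {system} (cosine similarity {sim:.2f})")
```

The design advantage over watermarking is that nothing has to be embedded at generation time: the generator’s own artifacts serve as the identifier, so attribution works even for open-source systems that never adopted a watermark.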
Through additional research, we identified three platforms that offer AI-generated speech that mimics VP Harris’s voice. Those include TryParrotAI, 101soundboard, and jammable. We also found that 101soundboard seems to be using the TorToise system.
Some commercial vendors are considering adopting measures, like consent systems, to mitigate the misuse of voice cloning; however, with open-source AI systems, these measures are difficult to enforce. While implementing consent systems is a step in the right direction, there isn’t a consistent standard or third-party validation of these measures.
Why information integrity must be top-of-mind
While this audio was labeled as a “parody” in the original post, now that it’s available online, it can be reshared or reposted without that context. For other online interactions, like accepting cookies on a website or verifying your identity with your bank, governments have established laws to protect consumers. AI and deepfakes, however, are a new and rising threat, with few guardrails to prevent misuse.
That’s why maintaining the integrity and authenticity of information shared online, especially as we near the 2024 election, should be a top priority. Failing to do so damages public trust and confidence in our most important and foundational systems.
Putting up protections to help preserve truth
Good AI is sorely needed to mitigate the societal effects of bad AI. As a leader in the voice security space for over a decade, Pindrop is leading the fight against deepfakes and misinformation, with the goal of helping to restore and promote trust in the institutions that are the bedrock of our daily lives. Our Pulse solution independently analyzes audio files, empowering organizations to determine whether what they’re hearing is real. Read more here about our deepfake detection technology and how we’re fighting bad AI.
Disclaimer
This is an actively developing investigation. The information presented in this article reflects our findings and analysis as of August 1, 2024. Our team continues to monitor, investigate, and uncover new trends in deepfake and voice security incidents. Follow us on LinkedIn or X for new insights.