Discussions about the authenticity of content and claims of AI-generated media often proceed with very little scientific analysis. To help establish a minimum standard of analysis and explainability, Pindrop is sharing its analysis of a recent alleged deepfake incident at Pikesville High School in Baltimore. Our analysis, unlike the claims of other deepfake detection vendors, shows that this audio is not AI-generated but has been doctored.
A Case Study in Objective Analysis: The Pikesville High School Incident
On January 16, 2024, a recording surfaced on Instagram purportedly featuring the principal at Pikesville High School in Baltimore, Maryland. The audio contained disparaging remarks about Black students and teachers, igniting a firestorm of public outcry and serious concern.
Given the gravity of the accusations and the potential repercussions for the principal’s career and the community of Pikesville High School, it was critical to rigorously verify the authenticity of the audio. The situation was further complicated as several deepfake detection vendors and media outlets quickly declared the recording a deepfake, often relying on subjective evaluations of tonal quality or delivery style (prosody) rather than substantial evidence or detailed analysis. Such rushed conclusions, offered without detailed scientific explanation, risked serious misjudgments and could unjustly sway public opinion and affect legal proceedings. Our intent is not to substantiate or refute any law enforcement conclusions. We are focused primarily on liveness detection for the specific audio that was shared publicly.
In light of these developments, Pindrop undertook a comprehensive investigation, conducting three independent analyses to uncover the truth:
- Deepfake (Liveness) Detection of the January Audio Sample: We sought to determine if the audio displayed characteristics typically associated with synthetic speech.
- Deepfake (Liveness) Detection of a Public Speech Sample from November 2018: We established a baseline for the voice characteristics of the principal, Mr. Eiswert, by analyzing a verified live speech.
- Comparison of Both Audio Samples for Voice Similarity: The critical final analysis compared the controversial January recording with the 2018 speech to assess whether both could originate from the same person.
The results of our investigation led to a nuanced conclusion: although the January audio had been altered, it lacked the definitive features of AI-generated synthetic speech. Our system's confidence in this determination is 97%. This finding underscores the importance of conducting detailed and objective analyses before making public declarations about the nature of potentially manipulated media.
How We Came to Our Conclusion
Deepfake Analysis
Pindrop analyzed the audio using Pindrop® Pulse, our deepfake detection engine. As the screenshot of the product UI shows, Pindrop® Pulse broke down the 46-second recording into 11 segments of 4 seconds each. Eight of the 11 segments were scored as live and three as indeterminate. Pindrop® Pulse’s analysis shows an overall deepfake score of 20. The deepfake score for an audio sample can range from 0 to 100. A score of 0 means that Pindrop® Pulse is almost certain the audio is not synthesized, while a score of 100 means that Pindrop® Pulse is almost certain the audio is synthesized. At the threshold of 20, the system's confidence that this audio is not synthesized is 97% (based on a similar cohort).
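To make the segment arithmetic concrete, here is a minimal Python sketch. It is not Pindrop® Pulse code. It assumes non-overlapping 4-second windows with any trailing partial window dropped, which is one reading consistent with a 46-second recording yielding 11 segments. The function names and the order of the labels are hypothetical; only the totals (8 live, 3 indeterminate) come from the analysis above. How per-segment results combine into the overall score of 20 is not described here, so the sketch does not attempt it.

```python
from collections import Counter


def split_into_segments(duration_s, segment_s):
    """Return (start, end) times in seconds for each complete fixed-length segment."""
    # Assumption: a trailing partial window (here, the last 2 seconds) is not scored.
    count = int(duration_s // segment_s)
    return [(i * segment_s, (i + 1) * segment_s) for i in range(count)]


def summarize_labels(labels):
    """Count how many segments received each label."""
    return dict(Counter(labels))


segments = split_into_segments(46, 4)

# Hypothetical label order; only the totals are taken from the article.
labels = ["live"] * 8 + ["indeterminate"] * 3
summary = summarize_labels(labels)

print(len(segments), segments[-1], summary)
```

Under these assumptions the last scored window ends at 44 seconds, so the final 2 seconds of the recording would fall outside the segment-level results.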
For comparison, we conducted a similar deepfake analysis on a verified genuine recording of the school principal from November 2018. The resulting deepfake score is 12. At this threshold, the system's confidence that the recording is not synthesized is 99%.
Voice Similarity Analysis
Having concluded that the audio was not AI-generated, we next tested it for human impersonation. We conducted a voice similarity analysis comparing the verified genuine recording with the contentious audio sample. After applying dereverberation and noise reduction techniques, the comparison revealed nearly identical voice characteristics in both samples, with a likelihood of approximately 99% that both come from the same speaker.
Spectral Analysis
Having concluded that the audio was based on the genuine voice of the school principal, we tested it for other manipulations. We employed spectral analysis to examine audio files visually. In a spectrogram, the vertical axis represents frequency (in Hertz), the horizontal axis represents time, and the brightness indicates amplitude.
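For readers who want to see how a spectrogram is built, the following Python sketch computes a magnitude short-time Fourier transform using only the standard library. It is an illustration under stated assumptions, not the tooling used in this analysis: the `spectrogram` function, the 64-sample frame, the Hann window, and the synthetic 1000 Hz tone are all chosen for the example. Each inner list is one time step (horizontal axis), each position in it is a frequency bin (vertical axis), and the value is the amplitude that a plot would render as brightness.

```python
import cmath
import math


def spectrogram(samples, frame_size, hop):
    """Return a magnitude STFT: one list of frame_size // 2 + 1 bin magnitudes per frame."""
    # A Hann window tapers each frame to reduce leakage between frequency bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size) for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        mags = []
        for k in range(frame_size // 2 + 1):
            # Direct DFT for bin k; slow, but fine for a small illustrative frame.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


sample_rate = 8000  # Hz, assumed for the synthetic example
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(1024)]

spec = spectrogram(samples, frame_size=64, hop=32)

# The brightest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
print(len(spec), peak_bin, peak_hz)
```

With a 64-sample frame at 8000 Hz, each bin spans 125 Hz, so the 1000 Hz tone lands in bin 8. In practice an FFT-based library routine would replace the direct DFT loop.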
Upon review, we detected six short segments of complete silence, each about 100 milliseconds long, interspersed with noisy speech. Such patterns are not typical of natural speech and suggest tampering and digital splicing.
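Gaps of this kind can also be located programmatically. The Python sketch below is a simplified illustration, not the method used in this analysis: it scans a list of samples for runs where the amplitude is exactly zero (or below an assumed threshold) for at least a minimum duration. The function name `find_silent_runs`, the 50 ms minimum, and the synthetic signal are hypothetical choices made for the example.

```python
import random


def find_silent_runs(samples, sample_rate, min_ms=50.0, threshold=0.0):
    """Return (start_seconds, duration_ms) for each run with |x| <= threshold lasting >= min_ms."""
    min_len = int(sample_rate * min_ms / 1000)
    runs = []
    run_start = None
    # A non-silent sentinel at the end closes any run that reaches the last sample.
    for i, x in enumerate(list(samples) + [float("inf")]):
        if abs(x) <= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            length = i - run_start
            if length >= min_len:
                runs.append((run_start / sample_rate, 1000 * length / sample_rate))
            run_start = None
    return runs


# Synthetic signal: 0.5 s of noise, a 100 ms gap of exact zeros, then 0.5 s of noise.
rng = random.Random(0)
sample_rate = 8000


def noise(n):
    # Strictly nonzero values so that only the inserted gap counts as silence.
    return [rng.choice([-1, 1]) * rng.uniform(0.01, 0.1) for _ in range(n)]


signal = noise(4000) + [0.0] * 800 + noise(4000)
runs = find_silent_runs(signal, sample_rate)
print(runs)
```

Real recordings almost always carry some background noise, so stretches of exactly zero amplitude in the middle of noisy speech are a useful signal of editing. A lossy codec may turn exact zeros into very small values, in which case a small nonzero `threshold` would be needed.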
Embracing Feedback and the Pursuit of Continuous Improvement
At Pindrop, we value transparency and are committed to continuous improvement. We actively seek and incorporate feedback from diverse fields (academia, industry, and government) to enhance our techniques and refine our analyses. Our confidence in the liveness of the January 2024 audio sample is 97%, which leaves a 3% chance that our assessment is wrong. We are sharing this analysis to invite technical review and to understand the arguments for why our assessment could be wrong, so we can continue to make our detection technology more robust.
Conclusion
In an age where digital authenticity is constantly challenged, Pindrop remains a steadfast ally to media companies, offering robust tools and analyses to help ensure the content they distribute is verified and truthful. By sticking firmly to a data-centric approach, we can help illuminate the truth and foster a culture of accountability and precision in media reporting.