  1. Video Transcript

Sarosh:
I mean, yeah, it goes back to the app developers, right? I mean, if I wanted to swap my face with Leonardo DiCaprio, as a developer, I can use off-the-shelf tools in GitHub to go and do that, right? We’ve all seen things like the aging app as well, where I can take a picture of myself and it shows me what I look like at the age of 80. I mean, all these tools are becoming democratized and making it a lot easier for developers to create fun consumer experiences, and I think the challenge is that when these things start evolving, if I didn’t want to create an app, but I wanted to go steal your bank account, I can now go do that.

And sort of what you mentioned before, right, which is not only are these tools becoming more available, but they’re becoming better and better and it’s sort of scary. I mean, if I could fake Jennifer Lawrence’s voice and then call into her bank and steal some of the Hunger Games money, why wouldn’t I? Those are sort of the risks today and in the future.

Elie:
So currently, what’s mainstream in this technology is speech synthesis. Speech synthesis is actually converting text to speech, right? So this is now possible. It’s easy. There are a lot of tools available out there on GitHub or any platform. The tools are available online. People can go grab them, use data from YouTube, and combine the two with GPUs to generate those audio deepfakes, right?

But I think the next step in this is going to be how to hold a real conversation with real live interactions. For example, with an agent, when you’re calling a bank, right? So this is where I think the next threat would be voice conversion, right? Which is doing a live, real-time conversion between the source speaker and the target speaker. The hacker’s speech and the victim’s speech, right?

Sarosh:
Yeah. I mean, just to build off of that, middle of last year we saw a couple of instances covered by the Wall Street Journal and the BBC where a fraudster theoretically took a whole bunch of YouTube videos, investor relations calls, TED Talks and synthesized an avatar, a voice avatar of a CEO. And that CEO then called some financial officer of multiple organizations and convinced that financial officer to transfer millions of dollars. I think there were about four cases of this in the middle of last year.

Sort of what Elie is alluding to now is that these tools are sort of not synchronous, right? Which is you’re going to ask me a question, I’m going to sit there, type a response. It has to generate the audio and then speak back to you. I think the scary future is that with more data, with better algorithms, with more computational resources, you can get this into a real-time fashion where you talk to me, I talk back to you and there’s this sort of filter between me and you, where I can sound like Elie and you think you’re talking to the wrong person.