Can We Believe Our Ears? Experts Urge Caution As Audio Deep Fake Technology Advances
Audio deep fakes are advancing to sound more realistic, experts warn in the run-up to the 2020 presidential election.
We tend to trust our ears because we’re so attuned to the voices of our family members and the famous people we hear on our TVs or radios, but that’s changing as artificial intelligence allows computers to learn our voices and reproduce them with ease.
The consequences can be disturbing, sound experts say. Dallas Taylor, host of the Twenty Thousand Hertz podcast, commissioned an AI professional to build a model of his voice. He says the resulting deep fake audio clip was convincing enough to fool the people closest to him, including his wife.
“What’s terrifying to me about this is the power of a deep fake voice coupled with the power of sound design itself,” he says. Deep fake creators are learning to finesse fake audio’s uncanny valley — a term for when your brain recognizes that something in a piece of humanlike audio is off or that the speaker sounds soulless.
The perfecting of AI-generated fake audio coupled with the speed at which social media spreads information worries Taylor.
“I’m concerned that if something came out — and I don’t think it’s an ‘if,’ I think it’s more of a ‘when’ — things are shared so quickly that I’m worried that our rush to judgments on things will come back to haunt us,” he says.
On uncanny valleys
“That’s kind of that space when you kind of think of something as being, you know, humanoid. But when it speaks, your brain automatically goes, like, something’s very creepy about that. So right now, we’re still in that little uncanny valley, but right around the corner, I think it’s going to become more convincing.”
On teaching people about deep fakes to avoid being fooled — but potentially teaching people to create their own by doing so
“Yeah, that’s a bit of a moral dilemma. But I feel like it’s important to really get information out there, especially with deep fakes and the ramifications of that. Of course, when we’re going down a story like that, we do have the moral dilemma of: Do we teach someone how to hurt someone else with sound?
“So there’s times where I want to just dig into that to where we have an understanding of what to look for rather than it coming out of nowhere. I think we need to spot the markers and understand the markers as a society and question things that we hear. And that’s really what concerns me, is that if you’re already primed to know that somebody on this other side is a terrible person, it’s very easy to convince ourselves that what we hear or what we see is true.”
On how AI is learning from audio recordings
“If there’s anything that I’ve learned about computing, it is just the exponential rate of processing that we’re living through. And so right now, it takes some time and it takes some crafting, but I’m already seeing peeks of websites that you can kind of build out your own fake voice with. It basically just asks you to say phrases over and over and over again, and different phrases. And at the end of it, you start to get closer to a model of your own voice. Five years ago, we started hearing about this. It was still like, oh, that far-off thing that still has an uncanny valley. That’s still a little odd when you hear it. I think we’ve all lived through enough … artificial intelligence and machine learning to know that it’s just a matter of time before this gets just better and better and better and faster.”
On the liar’s dividend, where you can reap the rewards of being able to get away with anything because of deep fakes
“So there’s two sides of the coin here, and either one can be more terrifying than the other. So there’s the one side of the coin of someone making it, leaking it strategically at the wrong time, say right before someone’s gonna be drafted in the NFL or something the night before. It could have massive ramifications. By the time it’s even debunked, the damage has already been done massively. Scale that up to elections … and you have this thing where you can plant this little seed of doubt in either side.
“The flip side of the coin, which is just as frightening, is now when someone does say something terrible or if you don’t trust your leaders and you know that they do lie, it’s very easy for them to just say, ‘Oh, that didn’t happen.’ That’s already been used with very clear tape before. And that’s what’s really frightening is that even if something is captured, it can now be used to just sow a little seed of doubt that it was fake.”
On what deep fakes are meant to do
“The point here that’s frightening with deep fakes is that it’s designed to do something in a moment that plays off of the worst of us — like the jumping to conclusions, instantaneous decision making [and] instantaneous acceptance of what we see and hear. And especially in a presidential election year, the circumstances are just ripe for this perfect moment for something to be plopped down. Whether it’s actually sanctioned by the people or just by somebody who has it out for a certain candidate or a country that we don’t trust and that wants to sow distrust in the U.S. And again, I don’t think this is going to be a matter of ‘if’; I think it’s going to happen at some point. It’s gonna be very sudden. And I just hope that if it has some sort of world-changing effect, that the governments would be slow to react.”
On the ease of designing a deep fake from well-known people’s voices
“Think about every celebrity or politician who’s read their own audiobook — that gives you more than enough to do anything with. It’s a full transcript, full clean audio. Same for people like us who are hosting shows because it really just takes about two to three hours of solid talking with a transcript next to it. So if you know what you’re doing, that’s really all it takes.”
This article was originally published on WBUR.org.
Copyright 2020 NPR. To see more, visit https://www.npr.org.