Google started an unusual artificial intelligence experiment this month. If you instruct its Siri-style virtual assistant to “talk like a Legend,” it will speak in a simulacrum of the smooth sound of Grammy-winning crooner John Legend. The singer helped demonstrate a promising, but contentious, use case for AI.
Software that can impersonate people’s voices can make computers more fun to talk with, but in the wrong hands might be used to make so-called “deepfakes” intended to deceive. How good is voice cloning technology now? Google’s project provides a snapshot.
WIRED made some audio clips to compare the real and fake Legends, using recordings from the Google Assistant app and a company video that included clips of Legend in the recording studio. Think of it as The Voice: AIgorithmic Edition.
The software sounds like Legend. You can hear it best in vowel sounds like the “a” and “o” in San Francisco. But the clips also highlight how AI voices can’t yet match human ones.
Google’s fake Legend is good, but it still has the characteristic whine of a computer-synthesized voice. Security startup Pindrop, which develops software to defend against phone scams, analyzed samples for WIRED and provided a tour of the technology’s strengths and weaknesses.
When Pindrop researcher Elie Khoury fed a sample of the synthetic Legend into his fake-detecting software, it wasn’t fooled. The clip scored 98.9996 out of 100 as being synthetic.
Pindrop won’t reveal details of how it distinguishes real voices from fake ones. But Khoury offered a few bot-spotting tips, such as paying attention to a voice’s rhythm, and how it pronounces “f” and “s.”
Like Google Assistant’s other voices, Legend’s is made using a voice-synthesis technology called WaveNet. It was developed in late 2016 by Alphabet’s London-based AI research unit, DeepMind. Khoury says it was a leap in the evolution of synthetic speech. Google put the technology into millions of pockets in 2017, when it upgraded the voice of Google Assistant. WaveNet also powers the company’s Duplex phone bots, which make restaurant reservations.
WaveNet voices are made by training machine learning algorithms on a collection of text and recordings of voices reading that same text. Khoury says this process is better than older methods at capturing the waveforms of speech. After training, the software can voice impressively smooth audio from any text, as you can hear in these audio samples posted by DeepMind.
DeepMind says blind listening tests found the new technology reduced the perceived gap between real and fake voices by more than half, compared with prior methods like synthesizing sentences piecemeal from a library of speech sounds. That’s how Apple’s Siri speaks.
Hints of the robotic are still detectable in WaveNet voices like Google Assistant’s defaults and its new Legend impersonation. One giveaway is the odd cadence. The fake Legend lacks the easy-listening rhythm of the real one. Another tell that you’re hearing a bot is the sound of consonants, particularly fricatives such as “f,” “v,” or “s,” which are made by narrowing your airway so that the friction of moving air becomes audible. Synthetic voices have always struggled to recreate those sounds, which reach toward the top of our frequency range and can generally be trimmed off without losing the sense of what a person is saying.
That limitation becomes visible when spectrograms of the simulated Legend saying “San Francisco” and the real one saying “semolina” are placed together. The diagrams show how the energy of the sound is distributed across different frequencies. When you compare the first red area on the left of the images (each representing an “s” sound), the real Legend reaches a higher frequency.
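For readers who want to see what "energy distributed across frequencies" means in practice, here is a minimal Python sketch. It does not use Legend's audio or Pindrop's methods. The two signals are made-up mixes of pure tones, and the sample rate, tone frequencies, threshold, and function names are all assumptions chosen for illustration. It computes a plain discrete Fourier transform and reports the highest frequency that still carries a meaningful share of the energy, which is the kind of comparison the spectrograms make visually.

```python
import cmath
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this toy example)
NUM_SAMPLES = 256    # short analysis window (assumed)


def tone_mix(freqs_hz):
    """Build a toy signal by summing pure sine tones (not real speech)."""
    return [
        sum(math.sin(2 * math.pi * f * i / SAMPLE_RATE) for f in freqs_hz)
        for i in range(NUM_SAMPLES)
    ]


def energy_by_frequency(samples):
    """Return (frequency_hz, energy) pairs from a naive DFT of the samples."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to half the sample rate
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        spectrum.append((k * SAMPLE_RATE / n, abs(coeff) ** 2))
    return spectrum


def highest_significant_frequency(samples, threshold=0.1):
    """Highest frequency whose energy is at least `threshold` of the peak."""
    spectrum = energy_by_frequency(samples)
    peak = max(energy for _, energy in spectrum)
    return max(freq for freq, energy in spectrum if energy >= threshold * peak)


# Hypothetical stand-ins: one "s" that reaches higher than the other.
higher_reaching = tone_mix([4000, 7000])
lower_reaching = tone_mix([4000, 5000])

print(highest_significant_frequency(higher_reaching))
print(highest_significant_frequency(lower_reaching))
```

A real spectrogram repeats this analysis over many short, overlapping slices of a recording and colors each slice by energy, which is where the red areas in the images come from.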
The fake Legend’s consonants also contain sounds that don’t naturally occur when they are voiced by a human, such as odd clicks, Khoury says. That’s a common limitation of synthetic voices. Because they treat speech as a series of waveforms, they sometimes create sounds that a human can’t, owing to anatomic limitations like the size of our vocal cords, and how quickly we can shift our mouths from one shape or position to another.
Recent improvements in AI software faking voices and video have some researchers, legal scholars, and policymakers worried about misuse of the technology. In December, Senator Ben Sasse (R-Nebraska) introduced a bill that would make it a criminal offense to create or distribute fake audio or video with the intent of causing harm. A lively online subculture already uses machine learning to edit people into pornographic video clips.
The design of the Google Assistant makes it hard to imagine as a criminal accomplice, even if its voice becomes more realistic. You can’t tell the software what to say, and Google controls what questions it will answer.
Pindrop CEO Vijay Balasubramaniyan says the threat will come from others adopting the underlying technology, which Alphabet has disclosed in research publications. Pindrop already catches fraudsters who target companies using voice-altering software, for example to allow men to pose as women and gain access to financial accounts, he says.
How good could technology like Google’s get? Balasubramaniyan says the Legend voice isn’t the best he’s heard from the company’s WaveNet technology. Samples released by DeepMind in 2016 seem to be higher quality, perhaps because DeepMind was able to get speakers to record more audio than Legend did, or because those samples didn’t have to be generated in real time in response to a user’s query.
DeepMind said it used 25 hours of audio to create those voices. It’s not clear how many hours of recordings Google collected from Legend to make the voice released this month.
The singer told People that he went to the recording studio around 10 days in a row, saying words and phrases with different inflections. His publicists didn’t respond to queries from WIRED and Google declined to say how many hours of audio it used to make the fake Legend. By email, Johan Schalkwyk, a distinguished engineer at Google, offered that it had been “a large dataset,” and that the script had to be carefully curated to cover every possible sound and speech pattern.
Legend had to read phrases such as “Submandibular gland, either of a pair of salivary glands situated below the lower jaw.” Schalkwyk declined to share how Google tested how accurate or convincing its fake Legend is.
The clip below shows how the bar for passing as human is lower on phone calls, which, owing to historical limitations, usually strip out the upper frequencies. The resulting muffling dampens the contrast between the real and fake Legends.
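The effect can be illustrated with a rough Python sketch, again using invented tone mixes rather than any real recordings. Everything here is an assumption for illustration: the two signals share a low tone and differ only in a high one, and the 3,400 Hz cutoff stands in for the upper edge of a traditional narrowband phone channel. The code measures how much energy the difference between the two signals carries across the full band, and then only in the part a phone line would keep.

```python
import cmath
import math

SAMPLE_RATE = 16000      # samples per second (assumed)
NUM_SAMPLES = 256        # short analysis window (assumed)
PHONE_CUTOFF_HZ = 3400   # assumed upper edge of a narrowband phone channel


def tone_mix(freqs_hz):
    """Build a toy signal by summing pure sine tones (not real speech)."""
    return [
        sum(math.sin(2 * math.pi * f * i / SAMPLE_RATE) for f in freqs_hz)
        for i in range(NUM_SAMPLES)
    ]


def difference_energy(a, b, max_hz=None):
    """Energy of (a - b), optionally counting only frequencies up to max_hz."""
    diff = [x - y for x, y in zip(a, b)]
    n = len(diff)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * SAMPLE_RATE / n
        if max_hz is not None and freq > max_hz:
            break  # everything above the cutoff is discarded
        coeff = sum(
            d * cmath.exp(-2j * math.pi * k * i / n)
            for i, d in enumerate(diff)
        )
        total += abs(coeff) ** 2
    return total


# Hypothetical stand-ins: same low tone, different high-frequency content.
real_like = tone_mix([1000, 7000])
fake_like = tone_mix([1000, 5000])

full_band = difference_energy(real_like, fake_like)
phone_band = difference_energy(real_like, fake_like, max_hz=PHONE_CUTOFF_HZ)

print("difference energy, full band: ", full_band)
print("difference energy, phone band:", phone_band)
```

In this contrived case the two signals become essentially indistinguishable once the upper frequencies are gone. Real voices differ in the lower band too, so the contrast on an actual call would shrink rather than vanish.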
When I picked up my phone to ask Google Assistant if it would ever lie, it responded in the singer’s voice. “I always try to tell the truth,” it said. “I take honesty seriously.”