Who’s talking? Over 1 in 4 deepfake speech samples fool humans

LONDON — Who’s really speaking to you over the phone? The answer may not be as simple as you think. A new study reveals that people fail to detect more than one in four “deepfake” speech samples, a blind spot criminals could potentially exploit. Researchers in the United Kingdom are the first to evaluate the human ability to recognize artificially generated speech in languages other than English.

Deepfakes are forms of synthetic media designed to mimic a real person’s voice or appearance. Criminals have employed them to defraud people out of substantial amounts of money. Deepfakes are a product of generative artificial intelligence (AI), a branch of machine learning (ML) in which algorithms are trained to recognize the patterns and characteristics of a dataset, such as video or audio recordings of a real person, and then reproduce the original sounds or images.

Early deepfake speech algorithms may have needed thousands of voice samples to generate original audio. However, the latest pre-trained algorithms can recreate a person’s voice with just a three-second clip of them speaking.

Tech giant Apple recently unveiled software for iPhones and iPads that allows users to create copies of their voices using 15 minutes of recordings.

Researchers at University College London (UCL) used a text-to-speech (TTS) algorithm trained on two publicly available datasets, one in English and one in Mandarin, to create 50 deepfake speech samples in each language. These samples deliberately differed from the ones used in training, so the algorithm could not simply reproduce its original input.
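For readers curious about the mechanics, the sketch below shows how a pretrained multilingual voice-cloning model can generate speech in a target voice from a short reference clip. It is a minimal illustration using the open-source Coqui TTS library and its XTTS v2 model, not the algorithm the UCL team used; the file names and sentences are hypothetical.

```python
# Illustrative sketch (not the study's code): cloning a voice with a
# pretrained multilingual text-to-speech model via the open-source
# Coqui TTS library. File names and sentences are assumptions.
from TTS.api import TTS

# Load a pretrained multilingual voice-cloning model (downloads on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# A short reference clip of the target speaker; a few seconds can be
# enough for recent pretrained models.
reference_clip = "speaker_sample.wav"  # hypothetical input file

# Generate speech in the cloned voice, in English and in Mandarin.
tts.tts_to_file(
    text="The weather forecast predicts rain later this afternoon.",
    speaker_wav=reference_clip,
    language="en",
    file_path="deepfake_en.wav",
)
tts.tts_to_file(
    text="今天下午晚些时候可能会下雨。",
    speaker_wav=reference_clip,
    language="zh-cn",
    file_path="deepfake_zh.wav",
)
```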

A total of 529 participants listened to the artificially generated samples alongside genuine ones to see whether they could differentiate between real and fake speech. Participants managed to identify fake speech only 73 percent of the time, meaning more than one in four deepfakes went undetected, as reported in the journal PLOS ONE.

The research team noted only a slight improvement in detection after training the participants to recognize aspects of deepfake speech. English and Mandarin speakers demonstrated similar detection rates. When describing the speech features used for detection, English speakers frequently mentioned breathing, while Mandarin speakers referenced cadence, pacing between words, and fluency more often.

“Our findings confirm that humans are unable to reliably detect deepfake speech, whether or not they have received training to help them spot artificial content. It’s also worth noting that the samples that we used in this study were created with algorithms that are relatively old, which raises the question whether humans would be less able to detect deepfake speech created using the most sophisticated technology available now and in the future,” says study first author Kimberly Mai of UCL Computer Science, in a media release.

The research team is now working on developing more effective automated speech detectors.
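As one illustration of what such a detector could look like, the sketch below trains a simple classifier on spectral features extracted from labeled real and fake clips. It is a toy baseline under assumed inputs, not the UCL team's detector; the file names and labels are hypothetical.

```python
# Minimal sketch of an automated deepfake-speech detector: a classifier
# trained on spectral (MFCC) features of labeled clips. Toy baseline for
# illustration only; the labeled corpus below is an assumption.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def clip_features(path: str) -> np.ndarray:
    """Summarize a clip as the mean and std of its MFCCs."""
    audio, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical labeled corpus: 0 = genuine speech, 1 = deepfake.
paths = [f"real_{i}.wav" for i in range(50)] + [f"fake_{i}.wav" for i in range(50)]
labels = np.array([0] * 50 + [1] * 50)

X = np.stack([clip_features(p) for p in paths])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.2f}")
```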

While generative AI audio technology offers undeniable benefits, such as improved accessibility for people with speech limitations or those who may lose their voice to illness, there are growing concerns that criminals and nation states could exploit it to cause significant harm to individuals and societies.

Documented instances of deepfake speech employed by criminals include a 2019 incident in which a deepfake recording of his boss’s voice deceived the CEO of a British energy company into transferring hundreds of thousands of dollars to a fraudulent supplier.

“With generative artificial intelligence technology getting more sophisticated and many of these tools openly available, we’re on the verge of seeing numerous benefits as well as risks. It would be prudent for governments and organizations to develop strategies to deal with abuse of these tools, certainly, but we should also recognize the positive possibilities that are on the horizon,” says Professor Lewis Griffin, the senior author of the study.

South West News Service writer Stephen Beech contributed to this report.
