CHICAGO — People tend to change their voice when interacting with voice assistant apps like Alexa and Siri, according to new research. When speaking to these devices, individuals are often louder, flatter, and slower than when they converse with another person. The finding could help these technologies better comprehend human speech and shed light on how technology shapes language.
“We do observe some differences in patterns across human- and machine-directed speech. People are louder and slower when talking to technology. These adjustments resemble the changes speakers make when talking in background noise, such as in a crowded restaurant,” says Georgia Zellou from UC Davis in a media release. “People also expect that the systems will misunderstand them and that they won’t be able to understand the output.”
In their experiments, the team examined how individuals adjusted their voices when communicating with an artificial intelligence (AI) system as opposed to talking with other people. They discovered that participants produced louder and slower speech with less pitch variation when speaking to voice-AI systems like Siri or Alexa, even when the content of the interactions was identical.
People struggle to understand ‘mechanical’ voices
On the listening side, the researchers demonstrated that the degree of human-like quality in a device’s voice affects how well listeners understand it. If a listener perceives the voice as belonging to a device, their comprehension is less accurate. However, if the voice sounds less mechanical and more like a person, their understanding improves. Clear speech, similar to that of a news broadcaster, led to better comprehension overall, even if the voice was coming from a machine.
Identifying the factors that make a speaker intelligible will be valuable for voice technology. For instance, these results suggest that text-to-speech voices should adopt a “clear” style in noisy conditions. The researchers are aiming to apply their findings to people from various age groups and diverse social and linguistic backgrounds. They also plan to explore how individuals learn language from devices and how linguistic behavior adapts as technology evolves.
“There are so many open questions,” says study co-author Michelle Cohn. “For example, could voice-AI be a source of language change among some speakers? As technology advances, such as with large language models like ChatGPT, the boundary between human and machine is changing – how will our language change with it?”
How does AI voice assistant technology work?
AI voice assistant technology works through a combination of natural language processing (NLP), speech recognition, and artificial intelligence algorithms. The process generally involves the following steps:
- Audio input: The user speaks to the voice assistant, and the device captures the audio input through a microphone.
- Speech recognition: The audio input is converted into text using Automatic Speech Recognition (ASR) technology. ASR systems are trained on large datasets containing various accents, languages, and speech patterns to accurately transcribe the user’s speech.
- Natural language understanding: Once the speech is converted to text, the voice assistant uses natural language processing and understanding (NLP/NLU) techniques to interpret the meaning and intent of the user’s query. This involves tasks like tokenization, part-of-speech tagging, parsing, and semantic analysis to understand the context, relationships between words, and the user’s goal.
- Query processing: The AI voice assistant processes the user’s request and determines the most appropriate action or response based on its understanding. This may involve accessing external databases, APIs, or other resources to gather relevant information.
- Response generation: The voice assistant generates a response based on the information gathered during query processing. This response may be in the form of an action (e.g., setting a reminder or controlling a smart device) or providing information (e.g., answering a question or providing directions).
- Text-to-speech synthesis: The generated response is converted back into speech using text-to-speech (TTS) technology. TTS engines are designed to produce human-like speech, taking into account factors like intonation, pitch, and stress to make the voice sound more natural.
- Audio output: Finally, the synthesized speech is played back to the user through the device’s speaker, completing the interaction with the voice assistant.
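The steps above can be sketched as a single pipeline. The following is a minimal illustration, not a real assistant: every function name here is hypothetical, and the ASR, NLU, and TTS stages are replaced with toy stand-ins (keyword matching instead of a trained model) purely to show how the stages hand off to one another.

```python
from dataclasses import dataclass

@dataclass
class Intent:
    """Result of the NLU stage: what the user wants, plus extracted details."""
    name: str
    slots: dict

def recognize_speech(audio: bytes) -> str:
    # Stand-in for ASR: a real engine decodes the waveform with
    # acoustic and language models trained on large speech datasets.
    return "set a reminder for noon"

def understand(text: str) -> Intent:
    # Toy NLU: keyword matching in place of tokenization, parsing,
    # and semantic analysis.
    if "reminder" in text:
        return Intent("set_reminder", {"time": "noon"})
    return Intent("unknown", {})

def process(intent: Intent) -> str:
    # Query processing: dispatch on the recognized intent; a real
    # assistant might call external APIs or databases here.
    if intent.name == "set_reminder":
        return f"Reminder set for {intent.slots['time']}."
    return "Sorry, I didn't catch that."

def synthesize(text: str) -> bytes:
    # Stand-in for TTS: a real engine renders audio with natural
    # intonation, pitch, and stress.
    return text.encode("utf-8")

def handle_utterance(audio: bytes) -> bytes:
    text = recognize_speech(audio)   # 2. speech recognition
    intent = understand(text)        # 3. natural language understanding
    reply = process(intent)          # 4-5. query processing + response
    return synthesize(reply)         # 6. text-to-speech synthesis

# 1. audio input and 7. audio output would be microphone capture and
# speaker playback on a real device.
print(handle_utterance(b"<microphone audio>").decode("utf-8"))
# Reminder set for noon.
```

The point of the structure is that each stage consumes only the previous stage's output, so any one component (say, the ASR engine) can be swapped without touching the rest.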
The team presented their findings at the 184th meeting of the Acoustical Society of America in Chicago.
South West News Service writer Mark Waghorn contributed to this report.