We are surrounded by computer-generated voices these days, from navigation systems and voice assistants to automated announcements. But how human do these voices actually sound? A recent study by the Max Planck Institute for Empirical Aesthetics (MPIEA) in Frankfurt am Main, Germany, published in the journal Speech Communication, shows that our perception depends on three factors: how something is said, what is being said, and whether we understand the language.
In two consecutive experiments, the researchers investigated how people perceive the difference between real and synthetic voices. They created 16 short German sentences, such as: “The boy gave his father a hat.” The team then manipulated the sentences in three different ways by changing the word order, replacing words with similar-sounding pseudowords, and combining both changes. This resulted in four versions of each sentence. All versions were recorded by eight human speakers and eight computer-generated text-to-speech (TTS) voices.
In the first experiment, 40 German-speaking participants rated how human the voices sounded. Overall, the computer-generated voices were perceived as less human than the human voices. An analysis of the voices’ acoustic characteristics revealed objectively measurable differences in sound between human and TTS-generated voices.
