We did the opposite. By reversing many of these same techniques, we were able to get an approximation of a speaker’s vocal tract during a speech segment. This allowed us to effectively look into the anatomy of the speaker who created the audio sample.
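To give a rough sense of how such an estimate can work (this is a minimal sketch of the standard acoustic tube approach, not our exact pipeline), linear predictive coding models speech as the output of a series of lossless tube sections, and the resulting reflection coefficients can be converted into a cross-sectional area profile of that tube:

```python
import numpy as np

def reflection_coefficients(frame, order):
    """Levinson-Durbin recursion on the frame's autocorrelation.
    Returns the LPC reflection (PARCOR) coefficients."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - i], frame[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k[i - 1] = -acc / err
        a[1 : i + 1] = a[1 : i + 1] + k[i - 1] * a[:i][::-1]
        err *= 1.0 - k[i - 1] ** 2
    return k

def tube_areas(k, lips_area=1.0):
    """Kelly-Lochbaum tube model: walk along the tract, scaling each
    section's cross-sectional area by (1 - k) / (1 + k)."""
    areas = [lips_area]
    for ki in k:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return np.array(areas)

# Stand-in frame: random samples where real code would use a windowed
# slice of recorded speech.
frame = np.random.default_rng(0).standard_normal(400)
areas = tube_areas(reflection_coefficients(frame, order=10))
```

The resulting `areas` array is a crude profile of the tube that could have produced the sound, which is the kind of anatomical estimate the analysis builds on.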
From this, we hypothesized that deepfake audio samples would not be bound by the same anatomical constraints that humans are. In other words, analyzing a deepfake audio sample would yield estimated vocal tract shapes that do not exist in people.
Our test results not only confirmed our hypothesis but also revealed something interesting. When we extracted vocal tract estimates from deepfake audio, the estimates were often comically wrong. For instance, it was common for deepfake audio to yield vocal tracts with the same relative diameter and consistency as a drinking straw, whereas human vocal tracts are much wider and more variable in shape.
This realization shows that deepfake audio, even when convincing to human listeners, is far from indistinguishable from human-generated speech. By estimating the anatomy responsible for creating the observed speech, it is possible to identify whether the audio was generated by a person or a computer.
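To make that idea concrete, here is a toy heuristic of our own devising (not the published detector): flag an estimated area profile whose cross-section barely varies along its length, the straw-like signature that showed up in the deepfake estimates.

```python
import numpy as np

def looks_synthetic(areas, min_rel_spread=0.2):
    """Toy heuristic, not the authors' actual classifier: a human vocal
    tract's cross-sectional area varies substantially along its length,
    while deepfake estimates tended to collapse to a near-uniform tube.
    Flag profiles whose relative spread is implausibly small."""
    areas = np.asarray(areas, dtype=float)
    rel_spread = (areas.max() - areas.min()) / areas.mean()
    return rel_spread < min_rel_spread

# Illustrative profiles: a varied, human-like tract versus a
# near-uniform "drinking straw".
human_like = [1.0, 2.5, 0.8, 3.1, 1.6, 2.9]
straw_like = [1.0, 1.02, 0.98, 1.01, 1.0, 0.99]
```

A real detector would of course aggregate such evidence over many speech segments rather than thresholding a single profile.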
Why this matters
Today’s world is defined by the digital exchange of media and information. Everything from news to entertainment to conversations with loved ones usually happens through digital exchanges. Even in their infancy, deepfake video and audio undermine people’s trust in these exchanges, effectively limiting their usefulness.
If the digital world is to remain a critical source of information in people's lives, effective and secure techniques for determining the source of an audio clip are essential.
Logan Blue is a Ph.D. student in computer and information science and engineering at the University of Florida. Patrick Traynor is a professor of computer and information science and engineering at the University of Florida.