Imagine the following scenario. A phone rings. An office worker answers it and hears his boss, in a panic, tell him that she forgot to transfer money to the new contractor before she left and needs him to do it. She gives him the wire transfer information, and with the money transferred, the crisis has been averted.
The worker sits back in his chair, takes a deep breath, and watches as his boss walks in the door. The voice on the other end of the call was not his boss. In fact, it wasn’t even human. The voice he heard was that of an audio deepfake, a machine-generated audio sample designed to sound exactly like his boss.
Attacks like this using recorded audio have already happened, and conversational audio deepfakes may be just around the corner.
Deepfakes, both audio and video, have become possible only with the development of sophisticated machine learning technologies in recent years. Deepfakes have brought with them a new level of uncertainty around digital media. To detect deepfakes, many researchers have turned to analyzing visual artifacts, the tiny glitches and inconsistencies found in deepfake videos.
Deepfake audio is potentially even more of a threat because people often communicate verbally without video — for example, through phone calls, radio and voice recordings. These voice-only communications greatly expand the opportunities for attackers to use deepfakes.
To detect audio deepfakes, we and our fellow researchers at the University of Florida developed a technique that measures the acoustic and fluid dynamic differences between voice samples created organically by human speakers and those generated synthetically by computers.
Organic vs. synthetic voices
Humans speak by forcing air over the various structures of the vocal tract, including the vocal folds, tongue and lips. By rearranging these structures, you alter the acoustic properties of your vocal tract, allowing you to create over 200 distinct sounds, or phonemes. However, human anatomy fundamentally limits the acoustic behavior of these phonemes, resulting in a relatively small range of correct sounds for each.
In contrast, audio deepfakes are created by first allowing a computer to listen to audio recordings of the targeted victim. Depending on the exact techniques used, the computer might need to listen to as little as 10 to 20 seconds of audio. This audio is used to extract key information about the unique aspects of the victim’s voice.
The attacker selects a phrase for the deepfake to speak and then, using a modified text-to-speech algorithm, generates an audio sample that sounds like the victim saying the selected phrase. Creating a single deepfaked audio sample this way can take only seconds, potentially giving attackers enough flexibility to use the deepfake voice in a conversation.
Detection of audio deepfakes
The first step in distinguishing human-generated speech from deepfake speech is understanding how to acoustically model the vocal tract. Fortunately, scientists have techniques to estimate what someone (or some creature, such as a dinosaur) would sound like based on anatomical measurements of its vocal tract.
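To give a sense of what such a forward model looks like, the simplest textbook case treats the vocal tract as a uniform tube that is closed at the vocal folds and open at the lips. Such a tube resonates at odd multiples of c/(4L), where c is the speed of sound and L is the tube length. The sketch below is only an illustration under that assumption and is not the model used in the research; the function name, the tract lengths and the speed of sound are all assumed values.

```python
# Assumed value: speed of sound in dry air at roughly 20 degrees C.
SPEED_OF_SOUND_M_S = 343.0


def tube_resonances(length_m, count=3, speed=SPEED_OF_SOUND_M_S):
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    """
    if length_m <= 0:
        raise ValueError("tube length must be positive")
    return [(2 * n - 1) * speed / (4.0 * length_m) for n in range(1, count + 1)]


if __name__ == "__main__":
    # Hypothetical tract lengths, chosen only to show the trend.
    for label, length in [("longer tract (0.17 m)", 0.17),
                          ("shorter tract (0.14 m)", 0.14)]:
        freqs = ", ".join(f"{f:.0f} Hz" for f in tube_resonances(length))
        print(f"{label}: {freqs}")
```

A longer tube resonates at lower frequencies than a shorter one, which is the kind of link between anatomy and sound that these models capture.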
We did the opposite. By inverting many of these same techniques, we were able to obtain an approximate characterization of a speaker’s vocal tract during a segment of speech. This allowed us to effectively peer into the anatomy of the speaker that created the audio sample.
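One long-established way to go from sound back to a rough tube shape is linear predictive coding (LPC) combined with a lossless tube model: the reflection coefficients produced by the Levinson-Durbin recursion are mapped to ratios between the areas of neighboring tube sections. The sketch below shows that textbook stand-in, not the procedure developed in the study. The function names, the model order and the synthetic test signal are assumptions, and the sign convention and the choice of which end of the tube counts as the lips vary between references, so only relative areas are reported.

```python
import math
import random


def autocorrelation(frame, max_lag):
    """Biased autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (LPC coefficients, reflection coefficients, residual energy)."""
    a = [1.0] + [0.0] * order
    error = r[0]
    reflection = []
    for i in range(1, order + 1):
        if error <= 0.0:
            raise ValueError("autocorrelation must be positive definite")
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        error *= 1.0 - k * k
        reflection.append(k)
    return a, reflection, error


def relative_areas(reflection, first_area=1.0):
    """Map reflection coefficients to relative tube-section areas.

    Assumed convention: next_area = area * (1 - k) / (1 + k).
    """
    areas = [first_area]
    for k in reflection:
        if abs(k) >= 1.0:
            raise ValueError("reflection coefficients must lie inside (-1, 1)")
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical test signal: noise shaped by one made-up resonance.
    frame = [0.0, 0.0]
    for _ in range(400):
        frame.append(1.3 * frame[-1] - 0.8 * frame[-2] + random.gauss(0.0, 1.0))
    frame = frame[2:]
    n = len(frame)
    # Hamming window to soften the frame edges before analysis.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]
    _, ks, _ = levinson_durbin(autocorrelation(windowed, 4), 4)
    print([round(area, 3) for area in relative_areas(ks)])
```

The output is a list of relative section areas for one short frame of audio; repeating the analysis frame by frame would give a crude picture of how the estimated tube changes over time.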
From here, we hypothesized that deepfake audio samples would not be bound by the same anatomical constraints that humans have. In other words, analysis of deepfake audio samples would produce estimated vocal tract shapes that do not exist in humans.
The results of our testing not only confirmed our hypothesis but also revealed something interesting. When extracting vocal tract estimates from deepfake audio, we found that the estimates were often comically incorrect. For instance, it was common for deepfake audio to yield vocal tracts with the same relative diameter and consistency as a drinking straw, in contrast to human vocal tracts, which are much wider and more variable in shape.
This realization demonstrates that deepfake audio, even when convincing to human listeners, is far from indistinguishable from human-generated speech. By estimating the anatomy responsible for creating the observed speech, it is possible to identify whether the audio was generated by a person or a computer.
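As a toy illustration of that last step, the sketch below flags an estimated vocal tract profile that is both very narrow and nearly uniform, like a drinking straw. It is not the detector from the study. The function name, both thresholds and the two example profiles are hypothetical values invented for this example and are not measurements.

```python
from statistics import mean, pstdev

# Hypothetical thresholds for illustration only; not values from the study.
MIN_MEAN_AREA_CM2 = 1.0   # below this, the tube counts as "narrow"
MIN_VARIATION = 0.2       # coefficient of variation below this is "uniform"


def looks_like_drinking_straw(areas_cm2,
                              min_mean_area=MIN_MEAN_AREA_CM2,
                              min_variation=MIN_VARIATION):
    """Return True if a cross-sectional area profile is narrow and uniform."""
    if not areas_cm2:
        raise ValueError("need at least one area estimate")
    average = mean(areas_cm2)
    if average <= 0:
        raise ValueError("areas must be positive")
    variation = pstdev(areas_cm2) / average
    return average < min_mean_area and variation < min_variation


if __name__ == "__main__":
    # Made-up profiles (cm^2), ordered along the length of the tract.
    varied_profile = [2.6, 1.1, 0.9, 3.2, 5.0, 4.1, 1.8]
    straw_profile = [0.30, 0.31, 0.29, 0.30, 0.32, 0.30]
    print("varied profile flagged:", looks_like_drinking_straw(varied_profile))
    print("straw profile flagged:", looks_like_drinking_straw(straw_profile))
```

A practical detector would need thresholds learned from real recordings and would look at many frames of speech rather than a single profile.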
Why it matters
Today’s world is defined by the digital exchange of media and information. Everything from news to entertainment to conversations with loved ones typically happens through digital exchanges. Even in their infancy, deepfake video and audio undermine the confidence people have in these exchanges, effectively limiting their usefulness.
For the digital world to remain the most important information resource in people’s lives, efficient and secure methods of determining the source of an audio sample are critical.
Logan Blue is a graduate student in computer science and engineering at the University of Florida, and Patrick Traynor is a professor of computer science and engineering at the University of Florida.