Technical 04 Nov 2022

Exactly what is a voiceprint anyway?

An attempt to clarify some misconceptions
By Peter S

Introduction

The question “what is a voiceprint?” is interesting, as there is a wide range of definitions available. It’s also important to understand, because there are significant concerns about the misuse of lost or stolen voiceprints. So, we’ll examine this topic in a bit of detail, with the goal of clarifying some of the confusion surrounding voiceprints.

First, a definition: a voiceprint is a mathematically derived model of the unique physical and behavioral characteristics of an individual’s speech. Sophisticated voice biometric algorithms are first trained on a large dataset of speech samples to learn what makes a person’s voice unique; this training process results in a model. Then, when an individual’s speech samples are processed by this model, a unique voiceprint is created for that individual. This voiceprint is typically stored in a database so that it can later be used as a basis of comparison against future speech samples from that individual – with a goal of helping to identify them.
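
To make the enroll-then-compare flow concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `toy_embed` is a stand-in for a trained voice biometric model (real models are far more sophisticated), `tone` generates a synthetic signal in place of recorded speech, and `voiceprint_db`, `verify`, and the 0.9 threshold are hypothetical names and values, not part of any real product.

```python
import math

def toy_embed(samples, dims=8):
    """Stand-in for a trained model: map audio samples to a fixed-length vector.

    Uses normalized autocorrelation at lags 1..dims. This is only a toy;
    a real voice biometric model is trained on large speech datasets.
    """
    energy = sum(s * s for s in samples) or 1.0
    return [
        sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag)) / energy
        for lag in range(1, dims + 1)
    ]

def cosine_similarity(a, b):
    """Similarity between two vectors, from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def tone(freq_hz, phase=0.0, n=800, rate=8000):
    """Synthetic stand-in for a recorded speech sample."""
    return [math.sin(2 * math.pi * freq_hz * i / rate + phase) for i in range(n)]

# Enrollment: the "voiceprint" stored in the database is the vector, not the audio.
voiceprint_db = {"alice": toy_embed(tone(440))}

def verify(user_id, samples, threshold=0.9):
    """Compare a NEW speech sample against the stored voiceprint."""
    score = cosine_similarity(voiceprint_db[user_id], toy_embed(samples))
    return score >= threshold, score

same_ok, same_score = verify("alice", tone(440, phase=1.0))  # same "voice", new sample
other_ok, other_score = verify("alice", tone(1500))          # a different "voice"
print(same_ok, other_ok)
```

Note that `verify` takes a fresh speech sample as input; the stored vector is only ever used on the system side as the basis of comparison.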

Ok, so we have a definition, but what exactly is a voiceprint? It’s easy to understand and visualize what a fingerprint is, since we’ve all seen these images on TV and in movies for many years. But what we “see” as visual representations of voiceprints can be misleading. For example, consider Figures 1 and 2 below. Images like these are commonly used to depict “voiceprints”. Unfortunately, these are not voiceprints. They are sophisticated representations of recorded speech samples, but they are NOT the result of processing by voice biometric models and algorithms.

Figure 1 - Waveform representation of a speech sample (WAV file)
Figure 2 - Spectrogram representation of a speech sample (WAV file)

Voiceprints Are Multidimensional Arrays of Numbers

Although not as visually appealing, a voiceprint is really a multidimensional array of numbers. Figure 3 below depicts a 3-dimensional representation of an array of numbers. Real-world voiceprints may have 256 dimensions or more, which is impossible to represent visually. And looking at dense tables of numbers would likely be “boring” to most viewers; this is perhaps why people revert to more visually appealing images!

Figure 3 - 3D representation of multidimensional array of numbers
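
For readers who would rather see the numbers than a picture, the sketch below builds a stand-in "voiceprint" in Python. It assumes, purely for illustration, that a voiceprint is a flat list of 256 floating-point values; the values themselves are random and made up, and the name `fake_voiceprint` is hypothetical. Real voiceprints come from a trained model and may be laid out differently.

```python
import random
import struct

DIMS = 256  # dimension count mentioned above; real systems vary

rng = random.Random(0)  # fixed seed so the illustration is repeatable
# Made-up values standing in for a voiceprint; NOT derived from any real speech.
fake_voiceprint = [rng.gauss(0.0, 1.0) for _ in range(DIMS)]

# Stored as 32-bit floats, the whole array is 256 * 4 bytes.
packed = struct.pack(f"<{DIMS}f", *fake_voiceprint)

print(len(fake_voiceprint))                         # number of dimensions
print(len(packed))                                  # size in bytes
print([round(v, 3) for v in fake_voiceprint[:5]])   # the first few "boring" numbers
```

Under these assumptions the whole thing is just a short, dense list of numbers with no audio in it.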

We unfortunately don’t have a perfect solution for how to show voiceprints, but it’s perhaps more accurate to use something like Figure 4 below. The primary goal for this article is to help readers to understand what the term "voiceprint" means – not how to visualize one – so we’ll leave it to you to decide.

Figure 4 - Visualization of a human genome sequence

Voiceprints in the Wrong Hands

In many jurisdictions, voiceprints (and other forms of biometrics) are legally treated as personal data, or personally identifiable information (PII). They are among the most sensitive data there is, even more so than credit card information. So, there’s a great deal of concern about how voiceprints are created, stored, managed, and deleted, whether they are shared with third parties, and what happens if they fall into the wrong hands.

With knowledge of what a voiceprint is, what happens if a voiceprint winds up in the wrong hands – whether accidentally or through the direct actions of a fraudster?

There are several critical points to make here:

  • Voiceprints are NOT submitted to voice biometric systems. If you have someone’s password, token, or answer to their challenge question, you can supply this information to a system and potentially gain access. Voice biometric systems work by having users submit speech samples to the system for comparison to a stored voiceprint. Users do NOT submit their voiceprint to these systems.

  • Voiceprints are secure. As described above, a voiceprint is a multidimensional array of numbers. It’s an abstract representation of an individual’s unique vocal characteristics, and it cannot be reverse engineered into intelligible speech without detailed knowledge of two things: the specific voice biometric algorithms used to create it, and the proprietary data structures of the voice biometric company that created it (whose systems are never publicly documented). In addition, voiceprints are typically encrypted in transit and at rest. So, they are really quite secure.

    Further, we believe the specific voiceprints from every voice biometric vendor are likely to be very different from one another. In other words, voiceprints from one system will only work in that system – and not in any other system.

  • Speech samples are more dangerous than voiceprints! This is perhaps the most controversial thing we’re stating, but it’s true. Given the availability of “deepfake” generation software, a fraudster could wreak far more havoc with a number of good quality recordings of your voice than with a voiceprint from a specific voice biometric vendor.
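
The point about voiceprints from one system not working in another can be illustrated with a toy sketch in Python. It rests on loud assumptions: each "vendor model" here is just a fixed random projection chosen by a seed, which is a stand-in and not how any real vendor works, and the names `make_vendor_model`, `vendor_a`, `vendor_b`, and `speaker_features` are hypothetical.

```python
import math
import random

def make_vendor_model(seed, in_dims=16, out_dims=64):
    """Hypothetical vendor model: a fixed random projection chosen by `seed`."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dims)] for _ in range(out_dims)]

    def embed(features):
        # Each output number mixes all input features with vendor-specific weights.
        return [sum(w * x for w, x in zip(row, features)) for row in weights]

    return embed

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

rng = random.Random(42)
speaker_features = [rng.gauss(0.0, 1.0) for _ in range(16)]            # made-up "voice" features
later_features = [x + rng.gauss(0.0, 0.05) for x in speaker_features]  # same speaker, slight variation

vendor_a = make_vendor_model(seed=1)
vendor_b = make_vendor_model(seed=2)

# Same vendor, same speaker, two sessions: the voiceprints line up.
same_vendor = cosine_similarity(vendor_a(speaker_features), vendor_a(later_features))
# Same speaker, two different vendors: the numbers are not comparable.
cross_vendor = cosine_similarity(vendor_a(speaker_features), vendor_b(speaker_features))

print(round(same_vendor, 3), round(cross_vendor, 3))
```

In this toy setup the same speaker scores high within one vendor's model, while the two vendors' vectors for that speaker show little relationship to each other, which is the behavior we would expect from independently built systems.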

Conclusion

One goal of this article was to define what a voiceprint is and to help readers visualize one. Hopefully we've accomplished this. However, given the growing power of deepfake technologies and the ease of access to them, it was also important to bring attention to the potential misuse of speech recordings. In fact, we believe speech recordings are far more dangerous in the hands of fraudsters than voiceprints are. So, appropriate information security protocols must be applied equally to both forms of data. The good news is that this is not a new finding at IngenID: we've always protected this data for our clients.

Voice biometric companies go to great lengths to protect their voiceprints and any source speech recordings they hold. They use proprietary storage formats, they encrypt voiceprints and speech samples, they monitor their servers and networks, and so on. But what about all the other places where speech recordings exist, and what kinds of protection are typically in place there?

For example, consider social media posts of personal videos, vlogs, podcasts, etc. In this day and age of social media influencers, people rarely consider the implications of how these posts can potentially be misused. Abundant video and audio recordings exist, with growing amounts of content being published each day. And businesses are not immune either, as conference call recordings and voicemail messages could be misused as well.

Suffice it to say, the threat of deepfakes will quickly eclipse generalized concerns about voiceprints being in the wrong hands.