Voice biometrics, where characteristics of a speaker’s voice are used to authenticate him or her, is a recent development in the long history of speech research. Although researchers have worked on areas ranging from speech recognition to speech synthesis, the aspects most relevant to authentication are speaker identification and speaker verification (see a 2009 paper by Campbell et al. for details).
Speaker identification (SI) is the problem of determining which speaker, from a known set of enrolled speakers, produced a given audio sample. It becomes difficult very quickly as the number of enrolled speakers grows beyond a few hundred.
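A back-of-envelope calculation shows why identification degrades with scale: even a small per-comparison false-match rate compounds across every enrolled speaker. The 1% rate below is an assumed illustrative figure, not a measured one.

```python
# Sketch: how 1:N identification error compounds with enrollment size.
# The per-comparison false-match rate is an assumed figure for illustration.
far_per_comparison = 0.01

for n in (100, 1_000, 10_000):
    # Probability that at least one of n enrolled speakers falsely matches
    p_false_match = 1 - (1 - far_per_comparison) ** n
    print(f"n={n:>6}: P(at least one false match) = {p_false_match:.3f}")
```

Already at a few hundred enrolled speakers, the chance of at least one false match approaches certainty under this assumption, which is why identification is so much harder than one-to-one verification.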
In contrast, speaker verification (SV) has the simpler task of verifying the claimed identity of a speaker from his or her voice. Thus, voice authentication is really focused on the SV problem. Since several companies are now offering voice-based authentication solutions (see Voice Biometrics Conference in San Francisco in early May 2013), let us dig deeper into the issues surrounding voice biometrics.
Voice authentication is attractive for several reasons. The voice signal can be captured naturally, transported over long distances and presented to a remote service where authentication needs to be performed. When we complete a transaction over the telephone, we call and talk to a call-center agent anyway, making our voice available to the call center without having to do anything other than talk.
Of course, as with any authentication method, we have to ask two questions. The first is: how often will it reject a legitimate voice? For instance, will the method falsely reject me because my voice is hoarse or because I am calling from a noisy public place? This is the false rejection (FR), or false-negative, problem. Equally important: how often will it accept the wrong voice? For instance, can someone else mimic my voice or use a recording? This is the false acceptance (FA), or false-positive, problem. FR leads to user annoyance: it either rejects a legitimate customer outright or falls back to knowledge-based authentication questions, adding cost and customer dissatisfaction. FA is more serious because it leads to incorrect authentication and a security compromise.
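The tension between FR and FA comes down to where the system sets its acceptance threshold on a similarity score. A minimal sketch, using made-up scores purely for illustration:

```python
# Hypothetical similarity scores from a speaker-verification system
# (toy numbers, not from any real system).
genuine  = [0.91, 0.85, 0.62, 0.78, 0.88, 0.55]   # same-speaker trials
impostor = [0.20, 0.45, 0.68, 0.31, 0.12, 0.50]   # different-speaker trials

def rates(threshold):
    # FRR: fraction of legitimate speakers scoring below the threshold
    frr = sum(s < threshold for s in genuine) / len(genuine)
    # FAR: fraction of impostors scoring at or above the threshold
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return frr, far

for t in (0.4, 0.6, 0.8):
    frr, far = rates(t)
    print(f"threshold={t}: FRR={frr:.2f}, FAR={far:.2f}")
```

Raising the threshold drives FA down but FR up, and vice versa; vendors often quote the equal error rate (EER), the point where the two curves cross.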
How likely is it that this will occur? While we are pretty comfortable believing that no one else can produce our fingerprint (another common biometric), criminals can easily get hold of our voiceprint. For example, check out this video to see how a caller was able to talk someone into providing a voiceprint that could be used to access his account.
More concerning is recent research on voice-conversion attacks, in which the voice of one speaker is automatically converted so that a speaker-verification system mistakes it for someone else’s. Also, speaker verification research has explored the problem only at much smaller scales, with hundreds or thousands of users; in the context of call centers, we are potentially dealing with millions of callers. Do we have enough “entropy” (a measure of how hard it is to guess someone else’s voiceprint) in the human voice at this scale? This question still needs to be answered.
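To get a feel for the entropy question, here is a rough back-of-envelope comparison (all figures assumed, for illustration only): distinguishing N callers requires at least log2(N) bits of identity information, while a per-comparison false-acceptance rate of p corresponds to roughly -log2(p) bits of guessing resistance.

```python
import math

# Assumed figures for a rough scale comparison, not measurements.
n_callers = 1_000_000
bits_needed = math.log2(n_callers)
print(f"bits needed to distinguish {n_callers} callers: {bits_needed:.1f}")

far = 0.001  # assumed per-comparison false-acceptance rate
bits_provided = -math.log2(far)
print(f"effective bits at FAR={far}: {bits_provided:.1f}")
```

Under these assumptions the system supplies about 10 bits against the roughly 20 bits needed, which is exactly why the scale question matters for call-center deployments.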
Since voice-based authentication is a security function, we need to carefully examine the threat model for it and its robustness to various kinds of attacks. We will talk next week about how to address the robustness challenge.