April 28, 2016
By Steve Hoffman, Founder and CEO, SayPay Technologies
This is the second of a two-part series on voice biometrics. Part 1: Voice Authentication Principles, addressed the differences between speech recognition and voice recognition, the different voice processing methods, voice accuracy, and how voice enrollment and authentication work. In Part 2: Voice Authentication Advancements, I provide an up-to-date view of recent advancements in voice recognition technology and give best practices for evaluating the launch of a voice program.
Part 2: Voice Authentication Advancements
Launching a successful voice recognition program may first require collecting a data set of voice samples for a new geography or location. While English is a common language throughout the world, each word or sentence sounds different depending on the speaker’s learned speech attributes. For this reason, English sounds different from one country and region to the next. Even in the U.K., English sounds noticeably different when spoken by people from London, Scotland and Northern Ireland. In the U.S., accents are distinguishable among those from different parts of the Northeast, South, and West.
Advancements in voice recognition mean virtually any language is capable of being modeled. How many voice data samples are required for each new language or market is not fixed, but the more voices, the higher the quality of the model. Voice data scientists recommend a data set of 100-500 voices, based upon the voice processing method deployed, to prove the efficacy of the voice solution.
Following voice enrollment, new users sometimes experience lower success rates than their more experienced peers and may need to make several attempts before each success. Voice recognition is an imperfect science but can achieve high accuracy rates with usage. New users are sometimes not as relaxed as experienced users, who have learned the nuances and idiosyncrasies of voice processing. The learning curve is generally not steep, and most users master it after several attempts. Further, most voice engines are self-learning and refine each person’s voice print with each succeeding entry. Each voice print update adds a new sample that enriches and broadens the overall voice model, continuously improving success rates. In time, even entries submitted from environments with low-to-moderate noise may be acceptable.
Options for Text-Dependent Authentication
Text-dependent options include static passphrases and the more advanced and flexible “dynamic” text-dependent approach. With static passphrases, every speaker speaks the same passphrase, something like, “At ABC Bank, my voice is my password.” The newer dynamic text-dependent approach uses a variable series of digits that, when spoken together, form a unique value or token for each authentication while still ensuring the authenticity of the speaker.
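As a rough illustration of the dynamic approach, the sketch below generates a fresh series of digits for the user to speak at each authentication. It is a minimal sketch, not any vendor's implementation: the function name `new_digit_challenge` and the default length of eight digits are assumptions chosen for illustration.

```python
import secrets


def new_digit_challenge(length: int = 8) -> str:
    """Return a random string of decimal digits for the user to speak.

    Hypothetical helper: uses the standard-library `secrets` module so the
    value is unpredictable, and differs from one authentication to the next.
    """
    return "".join(secrets.choice("0123456789") for _ in range(length))


if __name__ == "__main__":
    # Each call yields a different value to be spoken by the user.
    print(new_digit_challenge())
```

The voice engine would then need to verify both that the spoken digits match the issued value and that the voice matches the enrolled speaker; that verification step is outside the scope of this sketch.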
Using digits opens opportunities unavailable with passphrases or other biometric methods. The first advantage is that speaking a value lets the user offer identification and authentication credentials simultaneously, like combining your username and password in a self-contained package. With customer experience becoming a central theme in all services, any unnecessary action required on the customer’s part needs close and careful examination.
The second advantage of digits over passphrases is the ability to identify a specific transaction, which extends voice biometrics beyond simple website sign-in. When a transaction receives a digital ID, the user can act on it directly without unnecessary navigation, for example paying a bill or approving a wire by simply speaking the transaction ID. The ID may be any of various numbers assigned to the transaction, including invoice, account or customer numbers. Some solutions can automatically create a transaction identifier using an algorithm that combines the user, amount and parties to the transaction into a guaranteed-unique 8-digit value. When users speak each unique “voice token,” they apply their biometric signature, which serves as a means of nonrepudiation. Even if authentication disputes may be rare with biometrics, the voice token is the digital equivalent of a notary seal that gives all parties assurance of indisputability.
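One way to picture a transaction identifier derived from the user, amount and parties is to hash those fields and reduce the result to eight digits. The sketch below is only an assumption about how such a derivation could look; the function name `transaction_token`, its parameters and the use of SHA-256 are hypothetical and not taken from any specific product. Note that a truncated hash alone cannot guarantee uniqueness, so a real system would still need to check each new value against outstanding tokens.

```python
import hashlib


def transaction_token(user_id: str, amount_cents: int,
                      payer: str, payee: str, digits: int = 8) -> str:
    """Derive a fixed-length numeric token from transaction details.

    Hypothetical sketch: hashes the fields with SHA-256 and reduces the
    digest to `digits` decimal digits. Collisions are possible, so this
    does not by itself guarantee a unique value.
    """
    material = "|".join([user_id, str(amount_cents), payer, payee])
    digest = hashlib.sha256(material.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") % (10 ** digits)
    # Zero-pad so the token always has the requested number of digits.
    return f"{value:0{digits}d}"


if __name__ == "__main__":
    print(transaction_token("user-42", 125000, "Alice", "Acme Utilities"))
```

The same inputs always produce the same token, so both the service and the user's device could compute it independently under these assumptions.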
The third advantage of a digit-based voice solution is its inherent resistance to playback attacks. Passphrase providers claim they can detect playback attempts by comparing each voice submission to all previous submissions. This claim implies that your voice is different enough with each submission to expose an exact replay, yet consistent enough at the same time to identify you. Using a unique value each time keeps this argument from ever clouding the discussion of voice biometric efficacy.
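The playback argument rests on each value being accepted only once. A minimal sketch of that idea follows, assuming a simple in-memory store; the class name `ChallengeStore`, its methods and the 60-second lifetime are illustrative assumptions, and the speaker verification step itself is deliberately left out.

```python
import time
from typing import Dict, Optional


class ChallengeStore:
    """Hypothetical single-use store for issued digit challenges."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self._ttl = ttl_seconds
        self._pending: Dict[str, float] = {}  # challenge -> expiry time

    def issue(self, challenge: str, now: Optional[float] = None) -> None:
        """Record a challenge as outstanding until it expires."""
        now = time.monotonic() if now is None else now
        self._pending[challenge] = now + self._ttl

    def redeem(self, spoken_digits: str, now: Optional[float] = None) -> bool:
        """Accept the digits once; a replayed or expired value is rejected."""
        now = time.monotonic() if now is None else now
        expiry = self._pending.pop(spoken_digits, None)
        return expiry is not None and now <= expiry


if __name__ == "__main__":
    store = ChallengeStore()
    store.issue("48151623", now=0.0)
    print(store.redeem("48151623", now=10.0))  # first use is accepted
    print(store.redeem("48151623", now=11.0))  # a recording played back later is rejected
```

Because the value is removed on first use, a recording of an earlier session carries digits that are no longer outstanding, regardless of how closely the recorded voice matches.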
The fourth advantage of digits is three-factor authentication. Passphrase solutions (e.g., “. . . my voice is my password”) by default offer only two-factor authentication: something each user has (the mobile device) and something the user is (a unique voice). Adding the third factor (something you know), the holy grail of security teams but long deemed too detrimental to the customer experience, is now possible, feasible and desirable.
The fifth advantage of voice digits is intent anonymity. If a user speaks a bank’s generic passphrase in a public location, they are announcing to everyone within earshot that they are logging into their bank. Perhaps this is not a major concern for most customers, but it should raise a red flag for product managers making a long-term commitment to passphrases, as it could be an ever-present and irreversible barrier to adoption and usage. Further, research shows that customers would prefer to use a standard approach, like a unique code for each authentication, rather than speak a different passphrase for each of their banking relationships. Banks need to ensure they do not have to go back a few years later and implement a new method that requires customers to re-enroll because a better option was not considered thoroughly enough early on.
Early pioneers of voice passphrases may have been subject to technology limitations, such as a lack of high-quality digital microphones and noise cancellation. Modeling the same passphrase for all users is a much easier exercise, as the value remains constant for each institution. However, the added value of a digits-based solution cannot be ignored. A dynamically generated eight-digit numeric value has 100,000,000 possible values; add alpha characters and the count increases to roughly 2.8 trillion. Digit-based solutions require analyzing the full value and also parsing the user’s voice into separate input values for individual analysis and comparison, which in turn achieves a much higher level of user authentication assurance.
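The figures above can be checked with a short calculation. The sketch assumes the alphanumeric case means 26 case-insensitive letters plus 10 digits (36 symbols per position), which is the assumption that reproduces the 2.8 trillion figure.

```python
# Number of distinct eight-character values for each alphabet size.
LENGTH = 8

digits_only = 10 ** LENGTH    # 0-9 in each position
alphanumeric = 36 ** LENGTH   # assumed: 0-9 plus 26 case-insensitive letters

print(f"{digits_only:,}")     # 100,000,000
print(f"{alphanumeric:,}")    # 2,821,109,907,456
```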
Steve Hoffman is CEO of SayPay Technologies, Inc., a Silicon Valley-based biometric authentication solutions provider. SayPay offers 3-factor authentication with its patent-pending “voice token” technology. Steve can be reached at firstname.lastname@example.org.