April 21, 2016
By Steve Hoffman, Founder and CEO, SayPay Technologies
Are you interested in learning more about voice biometric authentication but not sure where to start? This two-part series provides a primer and essential information for any person seeking to understand more about the science behind voice identity processing and the business opportunities it presents.
Part 1: Voice Authentication Principles
Speech recognition services like Apple’s Siri and Google’s “OK Google” have become convenient alternatives to the tedious, frustrating, and time-consuming effort of keying data into mobile phones. Speech recognition has been around for years and has reached consumers in products like Dragon (Nuance), Cortana (Microsoft), and Alexa (Amazon). So it’s natural for people to think the terms speech recognition and voice recognition are synonymous. Voice recognition, however, is the technology for authenticating an individual’s identity using his or her voice.
Speech recognition is the exercise of using software to recognize sound waves and convert them to a digital representation for applications such as performing searches or text dictation. Speech recognition can be a phenomenal time-saver compared to typing. Voice recognition (also called “speaker recognition”) is the exercise of matching a voice utterance to a specific and unique digital representation as a means of authenticating an identity.
Speech recognition engines analyze large samples of voice data containing words spoken by a wide variety of people of different ages, sexes, and racial, social, and geographic backgrounds. The system creates digital representations with a very high probability of correct interpretation. Each person’s voice is uniquely constructed from physiological and behavioral characteristics. Physiological aspects stem from the size and shape of each person’s mouth, throat, larynx, and nasal cavity, along with body weight and other factors; these produce our natural pitch, tone, and timbre. Behavioral properties are formed by language, education/influence, and geography, and result in speech cadence, inflection, accent, and dialect.
Voice Processing Methods
Voice authentication comes in two primary flavors: text-dependent and text-independent. Text-dependent compares a 6-to-10-syllable voice “sample” against a master “voice print” and calculates an accuracy score. Text-independent captures longer speech input into a voice model and identifies speech mannerisms across a broader spectrum. Text-dependent requires less data but active enrollment by each user (albeit a brief one of roughly 30 seconds). Text-independent requires significantly more data and takes longer to process, but enrolls users passively without requesting any specific utterance. Both have been deployed successfully for call-center identification, but text-dependent is the only viable option for functions like retailer website login that must be fast and convenient.
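The text-dependent flow can be pictured as comparing the feature vector of one spoken passphrase against the enrolled voice print and thresholding the resulting score. The sketch below is purely illustrative; the four-element feature vectors, the 0.85 threshold, and the use of cosine similarity are assumptions for demonstration, not any vendor’s actual algorithm:

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify(sample_features, voice_print, threshold=0.85):
    """Text-dependent check: score one utterance against the master voice print."""
    score = cosine_similarity(sample_features, voice_print)
    return score >= threshold, score

# Hypothetical feature vectors extracted from the same spoken passphrase.
enrolled = [0.42, 0.91, 0.13, 0.77]   # master voice print
attempt  = [0.40, 0.93, 0.15, 0.75]   # new voice sample
accepted, score = verify(attempt, enrolled)
```

Real engines compare far richer representations than this, but the shape of the decision is the same: one utterance, one print, one score against a threshold.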
Voice Processing Accuracy
Voice recognition is much more specific and requires significantly more processing and analysis than speech recognition. Where speech recognition applies broader liberties to converting speech to text, voice recognition must not only convert the speech to text, but also analyze and compare up to 100 unique characteristics of each voice to a master voice print. Voice analysis is capable of detecting and matching varying individual attributes that are inaudible or not recognizable to the human ear.
High-quality voice recognition requires upstream processing on server-class equipment. While some solutions are offered for local on-device authentication, false positive rates (accepting a voice entry from someone other than the original owner) dramatically increase. Local authentication is limited to testing far fewer validation conditions as compared to a large online data set capable of analyzing and scoring hundreds of validation conditions.
Companies considering voice authentication should target solutions that meet the industry norms of a False Acceptance Rate (FAR) of ~0.01% and a False Rejection Rate (FRR) of ~1% to 3%. Bear in mind, most solutions do not rely on voice as the only authentication factor. Further, an adaptive model based on transaction risk should determine what level of accuracy is required for the intended authentication.
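FAR and FRR are simple ratios over labeled verification attempts: FAR is the fraction of impostor attempts wrongly accepted, and FRR is the fraction of genuine attempts wrongly rejected. A minimal sketch, using made-up scores rather than real biometric data:

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """FAR: impostors wrongly accepted; FRR: genuine users wrongly rejected."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return false_accepts / len(impostor_scores), false_rejects / len(genuine_scores)

# Illustrative similarity scores from hypothetical verification attempts.
genuine  = [0.95, 0.91, 0.88, 0.72, 0.93]   # the true user speaking
impostor = [0.40, 0.55, 0.61, 0.87, 0.33]   # someone else speaking
far, frr = far_frr(genuine, impostor, threshold=0.85)
```

Note the trade-off this exposes: raising the threshold lowers FAR but raises FRR, which is exactly why a risk-adaptive model should set the required accuracy per transaction.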
Text-independent processing requires no active enrollment, as voices are captured passively, generally during conversations with a customer care representative. Text-dependent requires a one-to-one match of the spoken utterance to the user’s voice print. Voice recognition effectiveness depends directly on following careful and deliberate best practices during enrollment. Enrollment is generally a simple and quick process, requiring the user to speak a passphrase or series of numbers three or four times. Speaking naturally is the most essential best practice, followed by enrolling in an environment free of background or ambient noise.
Speaking naturally, in your normal voice, is the best way to reproduce each additional voice entry for comparison against the master voice print. Speaking naturally means using the same tone, volume, etc., as you would if you were speaking to an acquaintance right beside you. Many people make the mistake of speaking with increased volume or force, or even sounding robotic; try to avoid these pitfalls whenever using voice recognition.
Background noise (e.g., traffic, fans, other speakers, music/TV, machinery) distorts the purity of voice collection during enrollment or comparison, so users should take extra care to seek out environments with little or no noise. The input device also affects the quality of voice processing; newer mobile phones generally have higher-quality digital microphones and noise-cancellation processing. (If you’ve ever noticed a small pin-hole on the back of your phone, it’s a microphone that collects background noise so the phone can generate inverse sound waves for noise cancellation.)
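The inverse-wave idea behind that second microphone can be shown in a few lines: negate the captured noise samples and add them back, and the noise component cancels out of the mix. This is an idealized sketch with toy numbers; real noise cancellation works on live, imperfect noise estimates:

```python
# Idealized active noise cancellation with toy sample values.
noise  = [0.2, -0.5, 0.7, -0.1]   # background noise heard by the rear mic
speech = [0.1, 0.3, -0.2, 0.4]    # the user's voice
mixed  = [s + n for s, n in zip(speech, noise)]   # what the main mic captures

anti_noise = [-n for n in noise]  # the generated inverse sound wave
recovered  = [m + a for m, a in zip(mixed, anti_noise)]   # noise cancels out
```

In this perfect-information case, `recovered` matches the original speech samples; in practice the cancellation is only as good as the noise estimate.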
The voice engine scores each voice attempt and responds to the authentication manager with a red, yellow or green light-type status. Green light means the entry passed with a high score; yellow light means the entry passed but with marginal results; red light of course means the entry failed with an unacceptable score. Green light entries are automatically added to the voice print model. Yellow light entries are added if a secondary authentication factor is successful like a PIN or password. Red light statuses are never added. Some users with very heavy accents or speech outside collective norms may become frustrated in the early use of voice recognition. These users may need to override voice rejects with a PIN or password that provides authentication until their voice print profile has been enriched with more voice samples.
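The traffic-light flow above can be sketched as a small decision function. The score bands (0.90 for green, 0.75 for yellow) are illustrative assumptions, not SayPay’s actual thresholds; only the enrichment rules mirror the text, in that green always enrolls the sample, yellow enrolls it only when the secondary factor succeeds, and red never does:

```python
def classify(score, green_min=0.90, yellow_min=0.75):
    """Map a voice-engine score to a red/yellow/green status (thresholds assumed)."""
    if score >= green_min:
        return "green"
    if score >= yellow_min:
        return "yellow"
    return "red"

def handle_attempt(score, voice_print_samples, secondary_factor_ok=False):
    """Green passes outright; yellow passes only with a PIN/password;
    red always fails. Passing attempts enrich the voice print model."""
    status = classify(score)
    authenticated = status == "green" or (status == "yellow" and secondary_factor_ok)
    if authenticated:
        voice_print_samples.append(score)   # add the sample to the model
    return status, authenticated

samples = []
status1, ok1 = handle_attempt(0.95, samples)          # green: enrolled
status2, ok2 = handle_attempt(0.80, samples, True)    # yellow + PIN: enrolled
status3, ok3 = handle_attempt(0.50, samples)          # red: rejected
```

This is also how the heavy-accent case resolves itself: early yellow attempts backed by a PIN keep adding samples, so the model adapts to the user over time.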
Part two of this series will explore recent advancements in voice recognition technology and provide best practices merchants should be familiar with when launching a voice program. Steve Hoffman is founder and CEO of SayPay Technologies. SayPay enables companies to safely and conveniently authorize financial transactions using a mobile phone and unique biometric “Voice Signature.”