It’s been a busy few weeks for me, and I ran into a situation that I have seen before, so I wanted to share some knowledge about voice assistants. Most people don’t understand the complexities and limitations of voice assistant technology, so I thought that sharing this particular use case, and its difficulties, might be helpful for some people.
My Voice Assistant Situation
One of my customers wanted a voice assistant to understand and respond to questions about people, and to recognize the names of the people being asked about. This would require a customized Speech-to-Text (STT) model that could listen to user utterances and transcribe them into the names of people. In this scenario, my customer wanted to be able to handle a wide variety of first names and surnames. They also wanted to be able to handle a variety of English accents.
This particular customer has been listening to a variety of industry “experts” and sales professionals about how AI is going to answer all of their questions, and solve all of their problems… if only they would give it a chance. The people in charge of the business look at the things that home assistants and other commercial applications can do, and they feel like this kind of thing should be feasible, easy to do, and relatively quick to implement.
You might be dealing with something similar. If so, have your people in charge read this article. It will help them understand the difficulties in doing some of these things — and give them some more realistic expectations.
The Initial Assessment
This is EXTREMELY hard for a Speech-to-Text (STT) service to accomplish, due to the almost infinite variability in pronunciations and spellings of the names of people.
Most of this is not unique to any one sub-culture of names (people often struggle with Indian, Arabic, and Asian names); it happens with names in general. Take a name like “Dawn”. Common pronunciations of the name could be transcribed as the name Dawn (what you want), or as the concept of “dawn”, the product name “Dawn”, the male name “Don”, the direction “down”, the action “drawn”, the noise “din”, etc. (all of which you do not want). That confusion is all for a single-syllable, female, Anglo-Saxon name.
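To make the “Dawn” problem a little more concrete, here is a minimal sketch using Soundex, a classic phonetic-key algorithm. To be clear, this is an illustration I am adding, not how an STT engine works internally: STT operates on audio, while Soundex works on spelling. It just shows how little separates these words once you reduce them to roughly how they sound.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_CODES = {letter: digit for letters, digit in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex key for a name (empty string if no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate repeated codes
        code = _CODES.get(ch, "")  # vowels (and Y) have no code
        if code and code != previous:
            key += code
        previous = code
    return (key + "000")[:4]


# The candidate words from the paragraph above.
words = ["Dawn", "Don", "down", "din", "drawn"]
by_key = defaultdict(list)
for word in words:
    by_key[soundex(word)].append(word)

for key, group in by_key.items():
    print(key, group)
```

Four of the five words collapse to the same key (D500), and only “drawn” gets a different one. A real acoustic model has far more to work with than this, but it also has far more ways to go wrong.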
Now factor in the complexities of a multi-syllabic name. And what about the seemingly random jumble of letters that some names appear to be? As humans, we often abbreviate people’s names to avoid situations like this, just so we don’t constantly mangle people’s names. We do this with my nickname, “Tox”, so people don’t mispronounce my last name of “Toczala” (which is pronounced TOKS ALLA). We do it for Bob, Cindy, Mike, Candy, Joe, and others — and the base of those names is quite common.
Another factor that can make name differentiation difficult is the structure of many names, regardless of the language of origin. Some names are single words or compound terms from the language: names like Cooper, Hammersmith, Wordsworth, Ginger, Penny, and others. You also run into issues when dealing with some of the “newer” names and naming conventions followed in recent years. How do you expect your STT service to transcribe “M. Night Shyamalan”? What about “North West”? Or “Daisy Bloom”? Even worse, how about names with embedded actions in them, like “Christopher Walken” or “Paige Turner”? How do you recognize the names, the verbs, and the punctuation in this? I’m not even mentioning “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo”. Our systems need some guidelines, some basic rules and patterns to follow. When it comes to names and titles for people, the rules of the language seem to get suspended, which makes it extremely difficult for a voice assistant to get it right.
Going Further Down the Rabbit Hole
So with all of the challenges I discussed above, you can see why this use case is difficult to implement. We compound that difficulty when we ask to do this with a wide variety of accents. You have multiple varieties of American English (common Midwestern accent, Brooklyn accent, Boston accent, etc.), as well as the more refined UK English that most Indian speakers will pattern on. Then there are the various Latin-tinged English accents (some might call it Spanglish) that you hear from people who have Spanish as a first language and English as a second language. Let’s start with the phrase, “Park the car, take the ticket and pay the fee”. Now imagine understanding this as spoken by four different people:
- One with a Midwestern accent
- One with a Boston accent
- One with an Indian English accent
- One with a Southern drawl
You will get four greatly different transcriptions. The first three words might all be different between these four speakers. Even a highly customized system like Siri (which is highly specialized for general talk and a variety of accents) struggles constantly with people’s names and accents.
The last complicating factor (as if the above wasn’t tough enough) is the whole idea of spelling. Names get spelled in ways that are not close to how they are pronounced. Typical examples of this include names like “Nguyen”, “McConaughey”, “Weinstein”, “Jesus”, “Baughman”, and others. These names are spelled in ways that you could not predict from the way that they are pronounced. And first names have multiple spellings all over: is it “Andy”, “Andi”, or “Andee”? “Tony”, “Toni”, or “Tone”?
Why Are We Here Anyhow?
So why all of this excuse-making for how hard general name recognition is? Why are we trying so hard to personalize our voice assistants? What is the payoff — what is the value of calling Tox by his first name?
Considering all of these challenges, building and maintaining a highly efficient and accurate language model for something like this would be expensive (in terms of time and money). And it would still suffer from inaccuracies and failures on things like homophones (new, knew, gnu), name collisions (Smith, Smythe), and all of the other cases that I have outlined above. In the big picture, is all of the work and complexity worth it? Wouldn’t all of that effort be more productive if it were targeted at something else that had a more definite impact on the performance of your voice assistant, and on the general satisfaction of your end-users? Maybe spend that effort on setting up some automated testing?
I would suggest that you read the article “Why The Overall Voicebot Solution And User Experience Are More Important Than Speech Accuracy”, by Marco Noel. It is a great overview of how a voice assistant is more than just a Speech-to-Text service, and how you need to look at the experience holistically, from the perspective of your end-user.
Some Approaches That Help
An approach that might be worthwhile to explore is having people attempt to spell the names of the people that they are interested in, and then let the users “filter and downselect” to the person that they intend. Often human assistants will require you to spell out a name for them — we’re used to doing that.
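A rough sketch of that “spell it, then filter and downselect” idea is below. The directory entries, the function names, and the choice to match on the surname prefix are all assumptions of mine for illustration. A real assistant would also have to cope with misheard letters (B, D, and P sound alike too), which this sketch ignores.

```python
def normalize(text: str) -> str:
    """Lowercase and keep letters only, so "T O C" and "toc" compare equal."""
    return "".join(ch for ch in text.lower() if ch.isalpha())


def filter_by_spelling(directory: list[str], spelled: str) -> list[str]:
    """Return entries whose surname (last word) starts with the spelled letters."""
    prefix = normalize(spelled)
    return [
        name for name in directory
        if normalize(name.split()[-1]).startswith(prefix)
    ]


# Hypothetical directory; the first names are invented for this example.
DIRECTORY = [
    "Pat Toczala",
    "Maria Tocci",
    "Andy Nguyen",
    "Andi Newman",
    "Toni Baughman",
]

# The caller spells a few letters and the candidate list narrows.
print(filter_by_spelling(DIRECTORY, "T O C"))    # two candidates remain
print(filter_by_spelling(DIRECTORY, "T O C Z"))  # one candidate remains
```

Once the list is short enough, the bot can read the remaining candidates back and let the user pick one, which is exactly what a human assistant would do.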
You should be VERY careful about how much information (and what kind of information) you provide. Releasing personal information without any controls in place can be risky (and can run afoul of GDPR and similar regulatory constraints).
Customers with systems that return personal information to users will often require some sort of account validation or login, at which point you already KNOW the person’s name (and how to spell it). They also tend not to rely on names. Instead, they use identification numbers, or some other unique key information, to identify users. Names are not unique, and neither are addresses.
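Here is a tiny sketch of why a unique key beats a name for lookups. The records, ID format, and field names are all made up for illustration.

```python
# Hypothetical records keyed by a unique ID; two people share the same name.
RECORDS = {
    "E1001": {"name": "John Smith", "department": "Sales"},
    "E1002": {"name": "John Smith", "department": "Finance"},
    "E1003": {"name": "Jon Smythe", "department": "Support"},
}


def find_by_name(name: str) -> list[str]:
    """Return every ID whose record carries this name (there may be several)."""
    return [rid for rid, rec in RECORDS.items() if rec["name"].lower() == name.lower()]


def find_by_id(record_id: str):
    """Return the single record for an ID, or None if the ID is unknown."""
    return RECORDS.get(record_id)


print(find_by_name("John Smith"))  # ambiguous: more than one match
print(find_by_id("E1002"))         # unambiguous: exactly one record or None
```

Even with a perfect transcription of the name, the name lookup still leaves the bot with a follow-up question to ask. The ID lookup does not.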
One other approach would be to attempt to transcribe names, and immediately fall back to asking the user to spell the name when that fails. As time goes on, your STT model may improve and your bot would ask for spelling less often. But this would require a commitment to constantly work on improving and managing your STT custom models specific to names. You would need to continually add customization data in the form of a name and the phonetic pronunciation of the name. That seems like a lot of work and cost for something with limited value.
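The “try the name first, then ask for spelling” flow could be sketched roughly like this. Many STT services report a confidence score with each transcript, but the `Transcript` shape, the 0.85 threshold, and the action strings here are assumptions of mine, not any particular vendor’s API.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical STT result: the recognized text plus a 0..1 confidence score."""
    text: str
    confidence: float


def next_action(result: Transcript, known_names: set[str], threshold: float = 0.85) -> str:
    """Accept the name only if it is both confident and known; otherwise ask to spell."""
    if result.confidence >= threshold and result.text.lower() in known_names:
        return f"confirm:{result.text}"
    return "ask_to_spell"


# Hypothetical set of names the bot knows about, stored in lowercase.
KNOWN = {"dawn", "tox"}

print(next_action(Transcript("Dawn", 0.93), KNOWN))  # confident and known
print(next_action(Transcript("down", 0.91), KNOWN))  # confident, but not a name we know
print(next_action(Transcript("Dawn", 0.40), KNOWN))  # known, but low confidence
```

Logging how often the bot ends up on the spelling path gives you a simple number to watch if you do decide to invest in customizing the model.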
So the real takeaway here is this: don’t try to get people’s names from speech-to-text engines. It’s too hard and requires too much effort to support. There is nothing wrong with personalizing the experience of your end-users, but get that name information from somewhere else.