AI, UX and a dash of SMH
What captivates visionaries about the latest crop of AI tools is the promise of combining the best of technology (speed, accuracy and consistency) with the best of human ingenuity (compassion, creativity and nuance). That’s what users want and hope for from AI. It’s supposed to be like a computer, but softer. Like a human, but more knowledgeable. The very best we, collectively, have to offer.
Of course, as with any tech revolution, the path to achieve this is not trouble-free.
To err is human – and annoying
One of the most talked-about aspects of generative AI is its mistakes, often called hallucinations. People love to catch it messing up. AI models – they’re just like us! we think.
And it’s cute and a little amusing when it’s ChatGPT fabricating some accomplishments when you enter a prompt like “Tell me about my most glorious accomplishments.” You get a good laugh and move on. But what happens when the stakes are higher? What if AI messes up your prescription recommendation, or gives you the wrong insurance card or incorrect information about your financial account’s tax status? Suddenly, the user experience goes from fun to smacking your head against the table.
No AI system will ever be free from flaws. AI models are non-deterministic, meaning their responses vary from one run to the next: rather than following an exact formula the way earlier computing models did, they are built on statistical probabilities.
Let’s break that down. Up until now, almost all computing functions a user interacted with were dictated by formulas. If this, then that. If the value is negative, turn it red. If the coupon is not expired, apply it. This behaves the same way every single time, with perfect accuracy (as long as it’s been properly tested). People like that.
What people don’t like so well is that it relies on someone to map out the logic for every decision, every wrinkle, every nuance, and then code and test it. Want to handle what happens when someone forgets to enter their email? Need to write code. Want to handle coupons greater than the value of the shopping cart? More code. Want to recommend the best coupon to use based on what’s in your cart? Heaps more code. This costs time and money, and people don’t like that.
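To make that concrete, here is a minimal sketch of that kind of hand-coded, deterministic logic (in Python, with invented field names): every rule is spelled out explicitly, and every new wrinkle means another branch to write and test.

```python
# Hand-coded, deterministic coupon logic: every rule is explicit,
# and every new edge case means another branch to write and test.
from datetime import date

def apply_coupon(cart_total: float, coupon_value: float, expires: date) -> float:
    """Return the cart total after applying the coupon, if it's valid."""
    if expires < date.today():        # expired coupon: leave the total alone
        return cart_total
    if coupon_value > cart_total:     # coupon worth more than the cart: floor at zero
        return 0.0
    return cart_total - coupon_value

# Same inputs, same answer, every single time.
print(apply_coupon(40.00, 10.00, date(2030, 1, 1)))  # 30.0
```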
Enter machine learning. We let go of having an engineer code a decision tree of everything the system could possibly encounter. And instead we let the machine observe a set of past decisions, and draw its own inferences from there based on a statistical model. “Shopping carts like this historically applied the 25 percent off coupon.” This opens up a world of drastically more complex decisions taking in dozens of factors. But it also introduces some uncertainty. We can’t promise the computer will make the exact decision we’d like every time. There will be errors.
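For contrast, here is a hedged sketch of the machine-learning version of the same decision, using scikit-learn and invented features: instead of coding the rules, we let a model infer them from past carts and the coupons that were historically applied, and accept that its output is a statistical guess rather than a guarantee.

```python
# The machine-learning version: learn which coupon to recommend from past
# decisions instead of hand-coding rules. Features and training data are
# invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [cart_total, item_count, has_sale_items]; label: coupon historically applied.
past_carts = [[120, 6, 1], [35, 2, 0], [80, 4, 1], [15, 1, 0]]
past_coupons = ["25_PERCENT_OFF", "FREE_SHIPPING", "25_PERCENT_OFF", "NONE"]

model = DecisionTreeClassifier().fit(past_carts, past_coupons)

# The recommendation is a statistical inference - it can be wrong for carts
# unlike anything in the history.
print(model.predict([[100, 5, 1]]))        # e.g. ['25_PERCENT_OFF']
print(model.predict_proba([[100, 5, 1]]))  # class probabilities, not a promise
```

More data narrows that uncertainty, but it never eliminates it.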
And to the user, that’s the AI tool breaking an implicit promise of how computers behave: computers get things right. And when they don’t, people get angry and lose trust.
How good is good enough for high-stakes interactions?
A common question we are asked is how accurate AI has to be in order to launch in a high-stakes environment, such as insurance, healthcare or finance. To answer that, let’s look at what we know about human nature, and what the data says.
Imagine ringing up your groceries at the self-checkout line. Do you ever question whether the computer did the math right? Most of us have seen an individual item ring up at a price different from the one advertised, which we understand is probably human error somewhere in the inventory system. But do you ever question whether the computer itself can successfully handle 1 + 1 = 2?
Of course not, and if you did, the self-checkout line would probably collapse with people frantically doing mental math to see if it’s correct. Computers are supposed to get it right every time. And when they don’t, people get frustrated – more frustrated than they would be with another person who made a similar mistake.
Researchers have studied what happens when AI makes mistakes, and when trust is recoverable versus unrecoverable. A study at the University of Michigan by Conor Esterwood and Lionel P. Robert, Jr. sought to find out just how much error users could tolerate. They paired human participants with a robot ‘coworker’ that was supposed to help them perform a task. Some of the robots were trained to make periodic mistakes and then attempt to repair the relationship with the human participant afterward using various strategies, including denial, apologies, explanations and promises. They found that after 3 mistakes out of 10 attempts, trust was unrepairable by any strategy.
They also reached some interesting conclusions about repairing trust after the first two mistakes. Denial (claiming the mistake didn’t happen, or justifying it) failed to repair trust at any stage. Anyone who’s been in a romantic relationship can back this up - defensiveness doesn’t get us very far in building trust. The other three strategies (apologies, explanations and promises) all had some impact on restoring trust.
Interestingly, apologies, promises and explanations had a similar impact in most respects, but none was able to restore trust to pre-mistake levels. Users carried a ‘grudge’ against the robot even after the repair attempt. The exception was participants’ perception of the robot’s benevolence - the degree to which the robot wished to help them. Although no strategy was able to restore benevolence to pre-failure levels, apologies and promises performed best, while explanations, denials and no repair attempt at all performed worst. (1)
What we can learn from this is twofold:
Users expect AI to perform at a high level.
Mistakes hurt trust immensely, to a level that cannot be fully repaired after repeated failures.
While I don’t know many visionaries who would consider a 30% failure rate an acceptable level for launch, it’s worth considering that users form an opinion after a relatively small sample size, and it’s not always possible to change that opinion once set - the user will simply give up on the tool.
So how accurate do you need to be? Probably about 95-99% accurate for high-stakes interactions, and even then you need mitigation measures to cushion the user against that 1-5% failure rate.
Cushioning the impact of mistakes by avoiding them
The first step to limiting the impact of mistakes is to try to reduce them as much as possible. One of the major reasons mistakes happen is because users overestimate the AI’s capabilities. This might be because the tool is overly enthusiastic in the way it presents itself, or it could simply be the user assuming it works like other AI tools they’re familiar with – the ChatGPT effect.
If you’ve ever tried to have a complex natural language conversation with Alexa, you probably quickly ran into limitations. A typical morning interaction with Alexa in our house goes like this: “Alexa, play upbeat morning music that’s not country to get my day started on the right foot.” In a few seconds, the melancholy opening chords of Lana Del Rey’s “Summertime Sadness” are wafting through the room.
What happened? Does it not work?
Alexa works; it’s just not designed to be used like that. It’s a different model than ChatGPT and isn’t built to parse complicated prompts. Again, though, this is a relatively low-stakes interaction (2). If I were relying on Alexa to dial 911 in an emergency, I wouldn’t be nearly so forgiving of mistakes.
And that’s the problem - there aren’t enough guardrails around that interaction. If an AI agent can only accurately answer questions phrased in a certain way, phrase the prompt that way (or better yet, design the interface so the user isn’t writing the prompt at all, but choosing from pre-defined options). It’s far more successful to let users ‘bumper bowl’ by putting guards in place that stop them from ending up in the gutter than to try to impress them by pretending your tool can do the same things a famous tool can do. Especially in high-stakes interactions.
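One simple way to build those bumpers, sketched below with hypothetical options for an insurance chatbot: let the user pick from pre-defined choices and assemble the prompt for them, so the agent only ever sees phrasings it is known to handle well.

```python
# A sketch of 'bumper bowling' the prompt: the user picks from pre-defined
# options and we assemble a phrasing the agent handles well. The topics and
# template below are hypothetical.
TOPICS = {
    "1": "the deductible on my current plan",
    "2": "whether a specific procedure is covered",
    "3": "how to submit a claim",
}

def build_prompt(choice: str) -> str:
    topic = TOPICS.get(choice)
    if topic is None:
        raise ValueError("Please choose one of the listed options.")
    # Free-form text never reaches the model; only a vetted phrasing does.
    return f"In plain language, explain {topic} for a policyholder."

print(build_prompt("2"))
```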
It’s also true that most high-stakes AI features have some sort of quality gate on them. This might be a human verifying the output, or it could be programmatic – a more traditional algorithm that runs boundary checks to make sure the AI isn’t handing back an outlier of an answer. While these gates increase the cost of the solution, they’re necessary to maintain trust, particularly in industries like healthcare, legal services, insurance and finance.
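A programmatic quality gate can be as simple as a boundary check wrapped around the model’s output before it reaches the user; the sketch below uses invented thresholds and a placeholder escalation path.

```python
# A sketch of a programmatic quality gate: a traditional boundary check sits
# between the model's answer and the user. The thresholds here are invented.
def gate_dosage_suggestion(model_dosage_mg: float,
                           min_mg: float = 5.0,
                           max_mg: float = 500.0) -> dict:
    """Pass the AI's dosage suggestion through only if it is inside safe bounds."""
    if min_mg <= model_dosage_mg <= max_mg:
        return {"status": "ok", "dosage_mg": model_dosage_mg}
    # Outliers never reach the user; they go to a human reviewer instead.
    return {"status": "needs_human_review", "dosage_mg": None}

print(gate_dosage_suggestion(250))    # {'status': 'ok', 'dosage_mg': 250}
print(gate_dosage_suggestion(5000))   # {'status': 'needs_human_review', 'dosage_mg': None}
```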
Priming and caveating
Users’ responses to AI can also be influenced by providing context statements before they interact with the AI agent. For example, a 2023 study found that ‘priming’ users with positive statements, such as telling them the AI cared about them, caused them to view the AI’s responses as more helpful and supportive (3).
Having the AI agent mimic signals of human compassion helps even when the AI agent makes a mistake. The University of Michigan study also found that while repair attempts didn’t fully restore trust, they did partially restore it, particularly apologizing and promising to do better.
It can also be effective to flag lower-confidence answers with some kind of badge and proactively offer alternatives. ChatGPT does this by asking a follow-up question such as “Did that hit the mark, or should I suggest more recipes, perhaps with an emphasis on pickles or kimchi?” This is a form of ‘caveating’: gently alerting the user that the tool is aware the answer may not be correct. It mimics how a (secure and mature) human would respond in the same situation.
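In the interface, that caveating can be driven by whatever confidence signal your model exposes; the sketch below assumes a hypothetical confidence score and an arbitrary threshold.

```python
# A sketch of 'caveating' in the UI: answers below a (hypothetical) confidence
# threshold get a badge and a gentle offer of alternatives.
def present_answer(answer: str, confidence: float, threshold: float = 0.8) -> str:
    if confidence >= threshold:
        return answer
    return (answer + "\n\n[Low confidence] This might not be quite right - "
            "want me to suggest a couple of alternatives?")

print(present_answer("Try a quick kimchi fried rice.", confidence=0.62))
```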
One major pitfall to be aware of: while people appreciate human-like emotional intelligence from computers, your AI agent should never, ever pretend to be human. If you decide to name your chatbot, we suggest giving it a name not commonly used as a person’s first name so there’s no confusion. Amazon can get away with it with Alexa - most companies probably shouldn’t attempt it just yet. People have a strong aversion to being ‘duped’ by a chatbot and are surprisingly good at detecting it, if only due to the uncharacteristic lack of typos. People want honesty in their AI interactions.
An AI agent should also consider the tenor of its messages, and generally avoid the less healthy emotional responses people employ, like shaming and guilt trips. An AI agent congratulating a user on contributing to their savings 3 months in a row might be welcomed; an AI tool chiding the user for not contributing this month is likely to be unwelcome. A famous example is Duolingo’s owl mascot, Duo, who sends guilt-laden messages like “You made Duo sad. You haven’t logged in today.” Guilt trips are as unwelcome in technology as they are at your in-laws’ house. Use all of that personal knowledge for good, not guilt.
Onward & upward.
Notes
1. Esterwood, C., & Robert, L. P., Jr. (2023). Three strikes and you are out!: The impacts of multiple human–robot trust violations and repairs on robot trustworthiness. Computers in Human Behavior, 142, 107693. https://doi.org/10.1016/j.chb.2022.107693
2. Relatively low-stakes. Unsolicited breakup songs are more jarring at 5:30 AM than you might expect.
3. Pataranutaporn, P., Liu, R., Finn, E., & Maes, P. (2023). Influencing human–AI interaction by priming beliefs about AI can increase perceived trustworthiness, empathy and effectiveness. Nature Machine Intelligence, 5(12), 1076–1086. https://doi.org/10.1038/s42256-023-00720-7