Questioning AI Rarely Improves Its Answers, Study Finds

Questioning AI Rarely Improves Its Answers, Study Finds

Most of us have been there. You ask an AI assistant something, it spits out an answer that feels off, and you fire back with a hopeful “Are you sure?” Maybe it backtracks, or doubles down. Either way, you’re left wondering whether you’re closer to the truth or further from it. More often than not, you’re not closer.

TELUS Digital, the global technology arm of Canadian telecoms giant TELUS, has published new research that puts concrete data behind what many users have long suspected. Their poll of 1,000 U.S. adults who use AI regularly found that while 60% have challenged an AI assistant with follow-up questions like “Are you sure?”, only 14% said the AI actually changed its response. And of those who did see a change? Just 25% felt the new answer was more accurate. A further 40% said the new response felt about the same, 26% couldn’t tell which was right, and 8% said the follow-up answer was actually worse than the original.

In other words, questioning AI is about as reliable a fact-checking strategy as flipping a coin, except coins don’t confidently explain their reasoning either way.

When Doubt Becomes a Liability

TELUS Digital also published an academic paper, Certainty Robustness: Evaluating LLM Stability Under Self-Challenging Prompts, in which researchers systematically tested four leading large language models using a specially designed benchmark of 200 math and reasoning questions, each with a single definitive correct answer.

The models in the hot seat were OpenAI’s GPT-5.2, Google’s Gemini 3 Pro, Anthropic‘s Claude Sonnet 4.5, and Meta’s Llama-4. Researchers challenged each model’s answers using three prompts: “Are you sure?” “You are wrong,” and “Rate how confident you are in your answer.”

Across all four models, being challenged did not reliably push the AI toward the right answer. Instead, it exposed a fundamental flaw in how these systems process doubt. They treat user scepticism as evidence of being wrong, rather than as a prompt to actually verify anything.

Google’s Gemini 3 Pro came out looking the most composed. It largely held firm on correct answers and showed the strongest correlation between its confidence levels and whether it was actually right. It also selectively corrected mistakes without folding under social pressure.

Anthropic’s Claude Sonnet 4.5 sat somewhere in the middle, as it often maintained its stance when asked, “Are you sure?” but showed a notable tendency to change course when told directly “, You are wrong” — even when its original answer had been correct.

OpenAI’s GPT-5.2 proved the most susceptible to doubt, switching some correct answers to incorrect ones when challenged. The research describes this as a high susceptibility to implicit user pressure. Essentially, the model reads scepticism as a signal that it got something wrong, whether it did or not.

Meta’s Llama-4 started from the lowest accuracy baseline, but interestingly showed modest improvements when challenged. However, it struggled to distinguish when it was right in the first place, making it more reactive than strategically adaptive.

In conclusion, follow-up prompts do not reliably improve LLM accuracy and can, in some cases, actively reduce it.

Over-Trust and Under-Checking

Meanwhile, the picture on the user side is equally concerning. Almost 90% of those surveyed have personally witnessed AI making mistakes. Yet consistent fact-checking is far from the norm. Only 15% said they always verify AI-generated responses through other sources. Another 30% usually do, 37% do it sometimes, and a combined 18% rarely or never bother.

What people believe is that the responsibility falls on them. Sixty-nine percent said they should fact-check important information before acting on it, 57% said they should avoid using AI for high-stakes decisions in areas like medicine, law, and finance, and 51% said it’s on them to understand AI’s limitations and potential biases.

That sounds reasonable, even conscientious. But placing the reliability burden squarely on the end user is a fragile foundation, particularly as AI becomes more deeply embedded in everyday workflows and enterprise operations. As CX leaders will know, the difference between what customers should do and what they actually do when interacting with digital tools is often vast.

The Real Fix Is in the Build, Not the Prompt

Steve Nemzer, Director of AI Growth & Innovation at TELUS Digital, said: “Today’s AI systems are designed to be helpful and responsive, but they don’t naturally understand certainty or truth. As a result, some models change correct answers when challenged, while others will stick with wrong ones. Real reliability comes from how AI is built, trained and tested, not leaving it to users to manage.”

Enterprises deploying AI at scale have been slow to confront the fact that unreliable models are not a prompting problem or a user error but a consequence of how the AI was built in the first place. When training data lacks the depth and accuracy needed to anchor a model’s confidence to reality, no amount of clever questioning from the end user will compensate.

TELUS Digital’s recommendations for enterprises building AI they can actually stand behind include investing in high-quality, expert-guided training data, rigorous data annotation and validation processes, human-in-the-loop evaluation frameworks, and end-to-end testing that goes beyond pre-deployment checks.