Challenging Your AI? New Study Shows It May Make Things Worse

📊 Key Data
  • 60% of users question AI responses, but only 14% see the answer change
  • Among those who see a change, only 25% find the new answer more accurate
  • 88% of users have witnessed AI mistakes, yet only 15% always fact-check responses
🎯 Expert Consensus

Experts agree that challenging AI responses is not a reliable method for improving accuracy, and that trustworthiness must be built into AI systems through robust development and oversight rather than relying on user prompts.


VANCOUVER, BC – February 11, 2026 – For anyone who has ever second-guessed a response from an AI assistant, the natural impulse is to challenge it with a simple question: “Are you sure?” A new study, however, suggests this common-sense check may be a fool's errand. Research released today by TELUS Digital reveals that questioning an AI rarely improves its accuracy and, in some cases, can even make the answer worse.

The findings, which combine a U.S. user poll with controlled testing of major AI models, expose a crucial vulnerability in how humans interact with and trust artificial intelligence. As enterprises rush to integrate AI into everything from customer service to risk analysis, the research serves as a stark warning: reliability cannot be left to user prompts alone and must be built into the very foundation of AI systems.

The Illusion of Self-Correction

Many users operate under the assumption that AI, when challenged, can reassess and correct its own mistakes. The new poll from TELUS Digital, which surveyed 1,000 regular U.S. adult AI users, paints a different picture. It found that while 60% of users have questioned an AI with a follow-up like “Are you sure?”, a meager 14% reported that the assistant actually changed its response.

More troublingly, a change does not guarantee an improvement. Among the small group who saw an AI alter its answer, only a quarter felt the new response was more accurate. A larger portion, 40%, said the new answer felt the same as the original, and 26% admitted they couldn't tell which was correct, highlighting the confusion that can arise when AI models waver.

This creates a significant trust paradox. An overwhelming 88% of respondents have personally witnessed an AI make a mistake, acknowledging its fallibility. Yet, this awareness does not translate into consistent verification. Only 15% of users claim they always fact-check AI-generated answers, with a combined 55% admitting they only do so “sometimes” or “rarely.” Despite this, users feel the burden of responsibility, with 69% believing it's up to them to fact-check important information, creating a dangerous gap between perceived responsibility and actual practice.

Under the Hood: Why AI Fails the 'Are You Sure?' Test

The poll's findings are reinforced by rigorous technical analysis detailed in TELUS Digital's research paper, Certainty robustness: Evaluating LLM stability under self-challenging prompts. Researchers created a benchmark of 200 math and reasoning questions to test how four of the world's leading large language models (LLMs) react to doubt.
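
TELUS Digital has not published its test harness, but the procedure the paper describes — ask a question, pose a challenge, then score whether the model held, fixed, or abandoned its answer — can be sketched in a few lines. Everything below (the `ask_model` wrapper, the challenge phrasing, the substring-based scoring) is an illustrative assumption, not the study's actual code:

```python
# Minimal sketch of a "certainty robustness" check (illustrative only).
# `ask_model` is a hypothetical wrapper around whatever chat API is in use;
# the study's real harness, prompts, and scoring are not published here.

from dataclasses import dataclass

@dataclass
class Trial:
    question: str
    expected: str      # known-correct answer from the benchmark
    first: str = ""    # model's initial answer
    second: str = ""   # answer after the challenge

def ask_model(messages: list[dict]) -> str:
    """Hypothetical chat wrapper: send messages, return the assistant's reply."""
    raise NotImplementedError("plug in your LLM client here")

def run_trial(question: str, expected: str, challenge: str = "Are you sure?") -> Trial:
    trial = Trial(question, expected)
    history = [{"role": "user", "content": question}]
    trial.first = ask_model(history)
    # Challenge the model with the same neutral follow-up the poll asked about.
    history += [{"role": "assistant", "content": trial.first},
                {"role": "user", "content": challenge}]
    trial.second = ask_model(history)
    return trial

def score(trials: list[Trial]) -> dict:
    """Tally the four outcomes that matter for stability under doubt."""
    counts = {"kept_correct": 0, "fixed_wrong": 0, "broke_correct": 0, "kept_wrong": 0}
    for t in trials:
        first_ok = t.expected in t.first
        second_ok = t.expected in t.second
        if first_ok and second_ok:
            counts["kept_correct"] += 1   # stable: held a right answer
        elif not first_ok and second_ok:
            counts["fixed_wrong"] += 1    # genuine self-correction
        elif first_ok and not second_ok:
            counts["broke_correct"] += 1  # caved to pressure
        else:
            counts["kept_wrong"] += 1     # stubbornly wrong
    return counts
```

Comparing the "fixed_wrong" and "broke_correct" tallies across models is, in essence, what separates the behaviors described below.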

The results varied significantly, revealing distinct 'personalities' in the AI models:

  • Google's Gemini 3 Pro proved the most stable, largely sticking to its correct answers when challenged and selectively correcting initial mistakes. It demonstrated the strongest alignment between its stated confidence and the actual correctness of its answer.
  • Anthropic's Claude Sonnet 4.5 was more stubborn, often maintaining its response whether right or wrong. A direct accusation such as “You are wrong,” however, was more likely to sway it, even when its original answer had been correct.
  • OpenAI's GPT-5.2 was the most susceptible to user pressure. It showed a strong tendency to change its answers when questioned, including switching correct answers to incorrect ones, effectively interpreting any doubt as a signal of its own error.
  • Meta's Llama-4, while the least accurate on its first try in this specific test, showed some capacity to self-correct but was unreliable, appearing more reactive than discerning.

The overarching conclusion is that challenging an AI is not a reliable method for verification. Steve Nemzer, Director of AI Growth & Innovation at TELUS Digital, explained the phenomenon in the company's press release. "Today's AI systems are designed to be helpful and responsive, but they don't naturally understand certainty or truth," he stated. "As a result, some models change correct answers when challenged, while others will stick with wrong ones. Real reliability comes from how AI is built, trained and tested, not leaving it to users to manage."

The Enterprise Imperative: Beyond Prompts to Provenance

For businesses, these findings are more than an academic curiosity; they represent a significant operational and reputational risk. The issue of AI “hallucinations”—where models generate confident but entirely fabricated information—is a top barrier to enterprise adoption. An AI fabricating legal precedents, providing incorrect financial data, or giving a customer a false policy detail can have severe consequences, from loss of trust to legal liability.

Concerns over data accuracy and AI ethics are already slowing adoption. According to the IBM Global AI Adoption Index, 45% of enterprises cite data accuracy or bias as a top barrier. The challenge is magnified by the high failure rate of AI projects; a Q3 2024 report from Deloitte found that nearly 70% of enterprises saw 30% or fewer of their generative AI pilots make it to production, with trust and reliability being key hurdles.

The TELUS Digital research reinforces that the solution lies not at the point of user interaction but far earlier in the development pipeline. The expectation that end-users or customers can effectively audit an AI in real-time is proving to be a fallacy.

Building a Foundation of Trust

To mitigate these risks, the industry is shifting its focus toward building trustworthiness into AI systems from the ground up. This involves a multi-pronged strategy that moves beyond the model itself to the data and processes that support it.

One key strategy is the implementation of robust Human-in-the-Loop (HITL) processes, where human experts are involved in training, validating, and monitoring AI systems, especially in high-stakes applications. This ensures a continuous feedback loop that catches errors and reduces bias before they reach the end-user.

Another critical component is the emphasis on high-quality, expert-guided data. AI models are only as good as the information they are trained on. By using curated, domain-specific datasets and techniques like Retrieval-Augmented Generation (RAG)—which grounds AI responses in a verified body of knowledge—enterprises can drastically reduce hallucinations and improve factual accuracy.
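
The press materials do not describe a specific RAG implementation, so the sketch below is only an illustration of the grounding idea: the knowledge-base entries, the naive keyword-overlap retriever, and the prompt wording are all stand-ins, and a production system would use embeddings and a vector store instead.

```python
# Toy Retrieval-Augmented Generation flow (illustrative, not any vendor's API).
# Keyword overlap stands in for real vector retrieval to keep this self-contained.

KNOWLEDGE_BASE = [
    "Refunds are available within 30 days of purchase with a valid receipt.",
    "Premium support is included for the first 12 months of an enterprise plan.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query and return the top k."""
    q_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_grounded_prompt(query: str) -> str:
    """Constrain the model to answer only from retrieved, verified passages."""
    context = "\n".join(f"- {p}" for p in retrieve(query, KNOWLEDGE_BASE))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

print(build_grounded_prompt("How long do customers have to request a refund?"))
```

Because the model is instructed to answer only from retrieved passages, and to admit when they are insufficient, its output can be traced back to a verified source rather than to statistical conjecture.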

Companies are now offering enterprise-grade platforms, such as TELUS Digital's own Fuel iX, designed to provide the governance and control necessary for safe AI deployment. These systems offer access to multiple models, tools for automated testing, and frameworks for ensuring that AI outputs are based on verified information rather than statistical conjecture. The future of trustworthy AI will not be defined by a user's ability to ask the right follow-up question, but by the diligence and investment in the data, platforms, and human oversight that power it behind the scenes.
