AI's Reality Check: A Lesson from the Power Grid for Healthcare's Future
A new report on AI in the energy sector reveals a critical truth for medicine: general-purpose AI is not ready for high-stakes decisions without human experts.
PALO ALTO, CA – December 09, 2025 – The drumbeat for artificial intelligence in healthcare grows louder by the day. We are promised a future where large language models (LLMs) streamline diagnostics, personalize treatment plans, and unburden clinicians from administrative tasks. Tech giants and startups alike are racing to deploy these powerful general-purpose tools into the complex, high-stakes world of medicine. But a sobering new report from an entirely different critical sector—the electric power industry—serves as a crucial and timely reality check. It raises a fundamental question: are these systems truly ready for responsibilities where a single error can have catastrophic consequences?
Today, the Electric Power Research Institute (EPRI) published a first-of-its-kind study benchmarking the performance of public LLMs on domain-specific tasks. The results are a stark warning for any industry, like healthcare, that operates on a foundation of precision, reliability, and public trust. The findings demonstrate that even the most advanced AI models exhibit a dangerous gap between apparent competence and true operational reliability, reinforcing the non-negotiable need for expert human oversight.
A Warning from a Parallel Universe
While the context is power generation and grid management, the implications for healthcare are impossible to ignore. EPRI’s research wasn't based on abstract academic problems but on over 2,100 real-world questions crafted by 94 industry experts. The goal was to see how models like GPT-5, Grok 4, and Gemini 2.5 Pro would fare when faced with the technical, regulatory, and operational nuances of a critical infrastructure sector.
The conclusion was clear: accuracy is a fragile commodity. "As utilities integrate AI into power system planning and operations, this benchmarking establishes a critical foundation for evaluating domain-specific tools and models. Accuracy is paramount, as errors can lead to significant operational and reliability consequences," stated EPRI Vice President of AI Transformation and Chief AI Officer Remi Raphael. Swap "utilities" for "hospitals" and "power system" for "patient care," and the statement resonates with chilling relevance for medicine.
For healthcare leaders and innovators, the EPRI report is not a discouraging bulletin from a distant industry. It is a detailed, data-driven preview of the very challenges they will face when moving AI from the lab to the bedside. It provides a blueprint of what can go wrong when generalist technology confronts specialist reality.
The Anatomy of AI's Reliability Gap
The most telling finding from EPRI’s analysis was the dramatic performance drop when questions shifted from a structured, multiple-choice question (MCQ) format to an open-ended one. On MCQs, the leading models performed admirably, scoring between 83% and 86%, consistent with their performance on general math and science benchmarks. This mirrors the headlines about AI passing the U.S. Medical Licensing Exam, and it creates the same illusion of deep understanding.
However, when the safety rails were removed and the models were presented with the same problems in an open-ended format—one that more closely resembles a real-world consultation or diagnostic challenge—their accuracy plummeted. The average score dropped by a staggering 27 percentage points. On the most difficult expert-level questions, the top models were only 46–71% accurate. In some cases, their accuracy was below 50%.
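EPRI's grading methodology is not reproduced here, but a toy sketch helps show why the two formats measure different things: multiple choice rewards picking the right option, while open-ended grading rewards only what the answer actually contains. Everything in the snippet below (the scoring functions, the rubric, the sample answer) is hypothetical.

```python
# Illustrative sketch only; this is not EPRI's actual grading pipeline.
# It contrasts multiple-choice scoring (exact letter match) with open-ended
# scoring against an expert-written rubric of key points.

def score_mcq(model_answer: str, correct_choice: str) -> bool:
    """Multiple choice: full credit for picking the right letter, even by elimination."""
    return model_answer.strip().upper() == correct_choice.strip().upper()

def score_open_ended(model_answer: str, rubric_keypoints: list[str]) -> float:
    """Open-ended: credit only for rubric points the answer actually states.
    Real benchmarks typically use expert reviewers or a calibrated grader."""
    answer = model_answer.lower()
    hits = sum(1 for point in rubric_keypoints if point.lower() in answer)
    return hits / len(rubric_keypoints) if rubric_keypoints else 0.0

# Hypothetical example: the same underlying question, scored both ways.
print(score_mcq("B", "B"))  # True -> looks like competence
print(score_open_ended(
    "Increase spinning reserve before taking the unit offline.",
    ["spinning reserve", "notify the reliability coordinator", "update the outage schedule"],
))  # ~0.33 -> the synthesis gap shows up
```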
This is the critical lesson for healthcare. Clinicians don't work from multiple-choice questions; they work with a patient's complex, often ambiguous narrative, weaving together symptoms, history, and test results to form a diagnosis. The EPRI report strongly suggests that an AI that can ace a standardized test might still fail catastrophically when asked to synthesize a complex patient case, potentially missing critical nuances or fabricating incorrect information, a phenomenon known as "hallucination."
Furthermore, the study found that allowing the models to search the web offered only a modest 2–4% accuracy boost, while simultaneously introducing the risk of incorporating irrelevant or misleading information. This punctures the notion that simply connecting an LLM to the internet or a medical database will close its knowledge gaps. Curation and context remain king.
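As a rough illustration of what "curation" can mean in practice (not EPRI's or any vendor's actual pipeline), the sketch below gates retrieved passages against an allowlist of vetted sources before any of them reach the model; every name in it is made up.

```python
# Hypothetical allowlist a hospital (or utility) might maintain for retrieval.
CURATED_SOURCES = {
    "internal-formulary",
    "clinical-guidelines-2024",
    "device-manuals",
}

def filter_retrieved(documents: list[dict]) -> list[dict]:
    """Keep only passages from vetted sources; drop anything from the open web."""
    return [doc for doc in documents if doc.get("source") in CURATED_SOURCES]

retrieved = [
    {"source": "clinical-guidelines-2024", "text": "Recommended dosing ..."},
    {"source": "random-blog", "text": "One weird trick ..."},
]
context = filter_retrieved(retrieved)  # only the vetted passage reaches the model
```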
The Clinician-in-the-Loop: A Permanent Mandate
This brings us to the report's central theme: the indispensable role of human expertise. The findings from the energy sector transform the concept of a "human-in-the-loop" from a transitional safety measure into a permanent, fundamental requirement for any AI system deployed in a critical field. The dream of a fully autonomous AI doctor, for now, remains just that—a dream. The reality is that these tools are best understood as powerful, but flawed, co-pilots.
For healthcare, this means building systems and workflows that don't just allow for clinician oversight but demand it. It requires designing AI tools that are transparent about their confidence levels and can show their work, allowing a human expert to quickly validate or correct their outputs. The goal is not to replace clinical judgment but to augment it with data-processing power, freeing up clinicians to focus on the uniquely human aspects of care: empathy, complex decision-making, and patient communication.
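As a concrete, deliberately simplified illustration of that design principle, the sketch below routes every AI-drafted finding to a clinician and escalates anything with low reported confidence or no supporting evidence. The threshold, field names, and routing labels are assumptions made up for the example, not a real clinical system.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.90  # hypothetical cutoff a clinical team might tune

@dataclass
class DraftFinding:
    text: str            # model-generated suggestion, e.g. a draft summary line
    confidence: float    # model-reported confidence in [0, 1]
    evidence: list[str]  # source passages so the clinician can "check its work"

def route_finding(finding: DraftFinding) -> str:
    """Every finding goes to a human; low-confidence ones are flagged for extra scrutiny."""
    if finding.confidence < REVIEW_THRESHOLD or not finding.evidence:
        return "escalate: detailed clinician review before any use"
    return "queue for routine clinician sign-off"  # still a human decision

draft = DraftFinding("No documented drug allergies.", confidence=0.72, evidence=[])
print(route_finding(draft))  # -> escalated, never auto-accepted
```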
Ignoring this mandate courts disaster. An AI-generated summary of patient records that is 95% accurate may seem impressive, but the 5% it gets wrong could be a critical allergy, a life-threatening comorbidity, or a subtle symptom that points to a different diagnosis entirely. Without rigorous human validation, efficiency gains come at the unacceptable cost of patient safety.
A Collaborative Path to Trustworthy AI
Beyond its cautionary findings, the EPRI report also illuminates a path forward. It highlights the rapid improvement of open-weight models—AI systems whose internal parameters are publicly available. While currently a generation behind their proprietary counterparts, these models offer a crucial advantage for critical industries: flexibility and auditability. Healthcare systems could, in theory, host and fine-tune these models on their own curated, private patient data, creating highly specialized tools without sharing sensitive information with third-party tech companies. This avoids the "black box" problem of many commercial systems and allows for deeper validation.
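A minimal sketch of that idea is below, assuming the open-source Hugging Face transformers and peft libraries; "gpt2" is just a small, freely available stand-in for whatever open-weight model a health system would actually choose, and the data pipeline, training loop, and privacy controls are deliberately omitted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "gpt2"  # placeholder open-weight checkpoint, downloaded and run on-premises
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA adds a small set of trainable adapter weights on top of the frozen base
# model, so fine-tuning on private data can happen entirely inside the
# institution's own infrastructure.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn"],  # attention projection layers in GPT-2
    fan_in_fan_out=True,        # required for GPT-2's Conv1D layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```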
Moreover, the entire effort was born from EPRI's Open Power AI Consortium, a collaborative initiative designed to drive the development of AI tailored for the power sector. This model of pre-competitive collaboration is precisely what healthcare needs. Instead of every hospital system and tech vendor reinventing the wheel, a consortium could establish common benchmarking standards, create secure data-sharing protocols, and collectively invest in the foundational research needed to build safe, effective, and equitable medical AI.
Ultimately, the journey to integrate AI into our most vital services is a marathon, not a sprint. The lessons from the electric grid provide an invaluable map. They teach us that true innovation lies not in the uncritical adoption of powerful new tools, but in the painstaking, collaborative, and humble work of understanding their limitations and ensuring they serve, and never endanger, the people they are meant to help.