Niche AI Platform Leni Outperforms OpenAI, Google on Key Benchmarks

📊 Key Data
  • Leni scored 71.6% on the DRACO benchmark, surpassing competitors in deep research accuracy.
  • Leni rejected 98% of nonsensical queries on BullshitBench, compared to 38% for OpenAI's GPT-5.2 and 48% for Google's Gemini 3.0 Pro.
  • Leni achieved a 77.0% score on the GAIA benchmark, outperforming agents from Genspark, Manus, and OpenAI Deep Research.
🎯 Expert Consensus

Experts say Leni's success points to the growing importance of specialized, high-reliability AI systems tailored to specific industries over generalist models, with accuracy and trustworthiness the deciding factors in professional applications.


By Susan Powell

NEW YORK, NY – May 12, 2026 – In a significant challenge to the prevailing narrative of AI dominance by tech behemoths, Leni, a specialized analytics platform for commercial real estate, announced today it has outperformed systems from OpenAI, Google, Anthropic, and Perplexity across four separate, rigorous AI benchmarks. The results suggest a potential shift in the AI landscape, where purpose-built, architecturally sophisticated systems are proving more reliable than their generalist, large-scale counterparts in critical business applications.

Leni, a company that has raised a comparatively modest $8.5 million since its 2023 launch, secured the top spot on benchmarks designed to test deep research, task completion, and, crucially, an AI's ability to recognize and reject nonsensical queries. The performance places the startup ahead of some of the most well-funded and widely used AI models in the world, raising questions about whether the race for sheer model size overlooks the more pressing enterprise need for accuracy and trustworthiness.

A Gauntlet of Trust and Accuracy

The benchmarks are not simple tests of knowledge but are designed to measure an AI's utility and reliability in real-world professional scenarios. On the DRACO benchmark, developed by Perplexity AI and Harvard to assess whether an AI can produce research a senior analyst would approve, Leni scored 71.6%, surpassing the deep research products of its larger competitors.

Perhaps most tellingly, Leni excelled on BullshitBench (Version 2), a test that evaluates whether an AI will invent a plausible-sounding answer to a fabricated question. Leni correctly identified and pushed back on 98% of these nonsensical prompts. This score stands in stark contrast to the performance of leading generalist models, with research showing OpenAI's GPT-5.2 scoring only 38% and Google's Gemini 3.0 Pro at 48%, highlighting a critical vulnerability in models that prioritize fluency over factual integrity.

Further demonstrating its capability in complex execution, Leni achieved a 77.0% score on the GAIA benchmark, a test from Meta and Hugging Face that requires multi-step reasoning, web browsing, and tool use to complete tasks. This score placed it ahead of agents from Genspark, Manus, and OpenAI Deep Research. The platform also ranked in the top two globally on SpreadsheetBench Verified, a benchmark testing complex, real-world spreadsheet manipulation, by correctly completing 365 of 400 tasks.

These tests measure attributes that are paramount in high-stakes industries like commercial real estate, where, as Leni's announcement notes, "the margin for error is zero."

Architecture Over Models: A Different Engineering Philosophy

Leni's leadership attributes its success not to a revolutionary new model, but to a different engineering philosophy focused on the system surrounding the AI.

"Most teams obsess over models, but the key engineering needed for effective AI adoption, which delivers highly accurate and reliable results for teams, relies on architecture or harness," said Leni CEO and Co-Founder Arunabh Dastidar in a statement. He compared the approach to modern coding tools, which he claims are "98 percent harness and 2 percent models."

This "harness" refers to the purpose-built infrastructure that guides, verifies, and adds context to the outputs of AI models. Leni's platform is model-agnostic, meaning it can leverage the best available large language model for a given task while wrapping it in proprietary guardrails and verification processes. This shifts the user's role from constantly double-checking the AI's work to using a trusted tool.

"We called it years ago and have produced purpose-built infrastructure that can reliably be used for serious work where accuracy and security are crucial," Dastidar added. "It shifts the work from babysitting and guessing to trusted, verifiable output, so teams can move faster with confidence."

Closing the Billion-Dollar 'Trust Gap'

Leni's benchmark victories are more than just technical accolades; they are a direct response to a growing and costly problem in enterprise AI: the 'trust gap.' According to a 2025 EY survey cited by the company, an alarming 99% of companies reported financial losses due to AI-related risks, averaging $4.4 million per company.

This issue is particularly acute in the commercial real estate (CRE) sector. A 2025 JLL survey found that while 92% of CRE firms have piloted AI, a mere 5% report having achieved all their goals. This gap between experimentation and successful implementation points to the failure of general-purpose tools to meet the sector's specific needs for data integration and reliability.

Leni tackles this with its Universal Data Model (UDM), an industry-first standardized data framework for multifamily real estate. Developed over three years, the UDM creates a common language to integrate data from the disparate spreadsheets, PDFs, and proprietary systems that have long defined the siloed CRE industry. This allows the AI to operate on a clean, structured, and comprehensive dataset, drastically improving the accuracy of its analysis for investment and asset management teams.

"If I had to describe Leni's impact, it's simple: faster and easier," said Scott Jones, Vice President of IT at Ram Realty Advisors. "On the asset management side in particular, teams are no longer stuck doing manual work. The data flows directly from the source, and they can trust it."

A New Blueprint for Enterprise AI

Leni's success story may serve as a blueprint for the future of enterprise AI, suggesting a shift away from a single, monolithic AI toward a market of specialized, high-reliability systems tailored for specific industries.

For sectors like finance, law, and medicine—where accuracy is non-negotiable—a system that scores 98% on a test for factual integrity is profoundly more valuable than one that can write a poem in any style but may confidently fabricate information under pressure.

"Trust is the most important part of any AI system that a business actually uses," said Leni's Head of Industry Strategy, Marcio Sahade, a 14-year veteran of real estate giants Tishman Speyer and Hines. "If a team cannot rely on what comes back, they end up redoing the work themselves, and the AI never delivers on its promise."

He argues that the benchmarks Leni topped are a measure of this exact gap—the difference between plausible-sounding output and finished, trustworthy work. As businesses move beyond initial AI hype and confront the real-world costs of unreliability, the demand for platforms that prioritize this distinction is poised to grow. Leni's performance indicates that in the world of professional AI, specialized knowledge and architectural rigor may ultimately be what separates the promise from the reality.

