Protege’s DataLab Aims to Fix AI’s Foundational Data Crisis
- $65 million in funding secured since 2024
- 20x growth in business in 2025
- Collaborations with a majority of the "Magnificent 7" tech giants
Industry experts see DataLab's scientific approach to AI data curation as critical to overcoming the industry's data quality bottleneck and advancing reliable, high-stakes AI applications.
NEW YORK, NY – March 11, 2026
In an industry defined by a relentless race for bigger models and faster chips, a critical third pillar of artificial intelligence has remained underdeveloped: the data itself. Now, AI data platform Protege is launching DataLab, a new research institution dedicated to transforming AI data from a raw commodity into a rigorous scientific discipline. The initiative arrives as leading AI labs find their progress increasingly constrained not by computational power, but by the quality, complexity, and reliability of the data used to train and evaluate their frontier models.
With backing from top-tier investors and early collaborations with a majority of the "Magnificent 7" tech giants, DataLab is positioning itself as an essential force in building the next generation of more capable and trustworthy AI.
“We understand the three core pillars driving AI: models, chips, and data. We are convinced that with the right datasets—the third, underdeveloped pillar—you can push the entire frontier forward,” said Bobby Samuels, CEO of Protege, in a statement. “We created DataLab to treat data as infrastructure, not exhaust. If we want more capable, reliable systems, we need standards, reproducibility, and real scientific discipline at the data layer.”
The Data Quality Bottleneck
For years, the prevailing wisdom in AI development was that more data equated to better performance. This paradigm fueled a massive data grab, with models trained on vast, undifferentiated swaths of the public internet. However, as AI systems become more sophisticated, this approach is revealing its limits. The industry now faces a data quality crisis, where progress is hampered by issues of bias, irrelevance, and a lack of real-world complexity in training datasets.
Industry experts have pointed to this data bottleneck as a primary inhibitor of future breakthroughs. “Data quality has become the defining constraint in frontier AI development, yet investment and innovation have lagged,” said Nikhil Basu Trivedi, Co-Founder and General Partner at Footwork, a key investor in Protege. “That changes with DataLab at Protege, which brings the same level of rigor and expertise to AI data that we have for AI chips and models.”
The challenges are multifaceted. AI developers grapple with data immaturity within organizations, where data is often fragmented, inconsistent, or inaccessible. Static datasets quickly become outdated, rendering models less effective in dynamic, real-time environments. Furthermore, the curation process itself can introduce latent biases, leading to AI systems that are unreliable or unfair, a critical risk as AI enters high-stakes fields like medicine and finance.
A Scientific Approach to Curation
DataLab aims to address these challenges by operating across three core areas: forging scientific partnerships with leading researchers, constructing high-value datasets, and publishing cutting-edge research to establish industry-wide standards.
Leading the institution is Engy Ziedan, Protege’s Co-Founder and Chief Scientific Officer. Ziedan’s background is not in traditional computer science but in economics, having served as an Assistant Professor at Tulane University where she specialized in causal inference and analyzing large, complex datasets, including electronic medical records. This expertise in measuring and correcting for bias is central to DataLab's mission.
“The strength of DataLab is its ability to integrate perspectives that are often siloed,” said Ziedan. “Advancing AI requires more than larger models or more data alone. It requires thinking at the margin, where we weigh the marginal value of a datapoint on learning and the opportunity cost of choosing the wrong dataset. This requires disciplined dataset design, careful evaluation, and a deep understanding of real-world complexity.”
Under her leadership, DataLab assembles teams of machine learning researchers, economists, and domain experts to apply this disciplined methodology. The goal is to move beyond simple data collection and establish reproducible processes for dataset design, construction, and evaluation that result in measurable performance gains and more reliable AI systems.
Powering the AI Frontier
Protege has rapidly established itself as a critical player in the AI ecosystem, securing $65 million in funding since its 2024 founding from prominent venture firms including CRV, Footwork, and Andreessen Horowitz (a16z). The company reported a 20x growth in business in 2025, building a network of over 100 data partners across healthcare and media.
Instead of scraping public data or relying on synthetic generation, Protege’s platform acts as a "connective tissue," enabling a governed exchange between holders of high-quality, proprietary data and the AI developers who need it. This model is proving essential for the industry's largest players. At its launch, DataLab is already collaborating with most of the "Magnificent 7" and other major frontier AI organizations on high-stakes challenges, from analyzing advanced cancers to developing agentic AI and ensuring international representation in healthcare data.
This strategic positioning addresses a core need for AI companies that have exhausted the utility of publicly available data and now require curated, domain-specific, and privacy-compliant datasets to achieve the next level of capability.
From Theory to Real-World Impact in Healthcare
The most tangible evidence of DataLab's impact can be seen in healthcare, a domain where data quality and reliability are matters of life and death. The institution has already released several multimodal benchmark datasets designed to reflect the diagnostic ambiguity and longitudinal context of actual clinical practice.
In a significant step forward, DataLab co-designed MedScribe and MedCode, two multimodal benchmarks for healthcare. For MedScribe, a benchmark for clinical documentation, the team prepared de-identified doctor-patient conversation transcripts to evaluate how well AI models generalize to real-world workflows. For MedCode, developed in partnership with Vals AI, DataLab created a dataset of thousands of diagnosis codes from de-identified patient records to test an AI's ability to perform medical coding under realistic constraints. These benchmarks are crucial for understanding where models fail in clinical reasoning, moving beyond simple recall to true generalization.
DataLab is also curating evaluation-specific datasets for dermatology, oncology diagnostics, and cardiology, aiming to support the development of AI tools capable of meeting rigorous standards like those of the FDA. As AI systems transition from research labs to critical real-world applications, this focus on building a robust, scientifically validated data foundation is becoming a decisive factor for both success and safety. The work being done by DataLab and its partners signals a maturation of the AI industry, one in which the quality of the data is finally given the same priority as the models it powers.