AI's Hidden Bias: How Training Data Skews Reality and Threatens Democracy
- ~90% accessibility: Websites rated “very low” for factual accuracy are roughly 90% accessible to AI crawlers, while sites rated “high” are less than 50% accessible.
- 8x higher blocking: Center-left sites block AI crawlers at roughly eight times the rate of conservative sites and nearly ten times the rate of far-right sites.
- $100M+ in licensing: AI companies have signed content-licensing deals worth hundreds of millions of dollars, but these deals do not correct the underlying data asymmetries.
Experts warn that AI training data's structural bias toward unreliable and ideologically skewed sources threatens democratic discourse, requiring urgent regulatory oversight and transparency measures.
WASHINGTON, DC – February 24, 2026 – The artificial intelligence systems that increasingly shape public discourse are being trained on a diet of information that is structurally skewed toward ideologically conservative and factually unreliable sources, creating a hidden vulnerability at the heart of our digital world. A groundbreaking report released today by PSG Consulting and Innovating for the Public Good (IFPG) reveals that these AI models, which function as powerful “unseen editors,” are at growing risk of political manipulation, posing a significant threat to the health of democracy.
The report, titled AI Large Language Model Training: The Potential Risks of Ideological Skewing, presents first-of-its-kind research conducted by the public affairs firm Dewey Square Group. The analysis documents a stark imbalance in the open-web data available for training Large Language Models (LLMs), the technology behind popular AI tools. This imbalance is not the result of a deliberate conspiracy, but rather the unintended consequence of how different media outlets manage access to their content.
“Political bias and factual inaccuracies in these systems are not peripheral issues — they threaten democracy's strength and sustainability,” the report warns, framing the issue as a critical challenge for policymakers, tech developers, and the public alike.
The Inverted Funnel of Information
At its core, the research exposes what one of its authors calls a “clear inverted funnel” of information quality. The study analyzed the behavior of 27 different AI web crawlers—automated tools that harvest text from the internet—across 153 U.S. news and political websites. These sites were categorized for ideological lean and factual reliability using ratings from the independent organization Media Bias/Fact Check.
The findings are stark: as a media outlet’s factual reliability increases, its accessibility to AI crawlers decreases. Websites rated as having “very low” factual accuracy were found to be approximately 90% accessible to the crawlers. In contrast, sites with a “high” factual rating were less than 50% accessible.
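The report does not publish its measurement tooling, but the kind of accessibility check it describes can be approximated with Python’s standard-library robots.txt parser. In the sketch below, the crawler user-agent tokens are real, published identifiers (OpenAI’s GPTBot, Common Crawl’s CCBot, Anthropic’s ClaudeBot, Google’s Google-Extended); the site URL is a hypothetical placeholder, and the check assumes crawlers honor robots.txt rather than observing actual crawler behavior.

```python
# A minimal sketch of a robots.txt accessibility check, not the
# report's actual methodology. The user-agent tokens are real
# published crawler identifiers; the site URL is a placeholder.
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]

def permitted_crawlers(site: str) -> list[str]:
    """Return the AI crawlers that a site's robots.txt allows to fetch its homepage."""
    rp = robotparser.RobotFileParser()
    rp.set_url(site.rstrip("/") + "/robots.txt")
    rp.read()  # fetch and parse the site's live robots.txt
    return [bot for bot in AI_CRAWLERS if rp.can_fetch(bot, site)]

site = "https://news-site.example"  # placeholder, not a site from the study
allowed = permitted_crawlers(site)
print(f"{site}: {len(allowed)}/{len(AI_CRAWLERS)} AI crawlers permitted")
```

A full replication would also have to account for paywalls and network-level blocking, which a robots.txt check alone does not capture.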
This trend is mirrored along ideological lines. The report found that center-left outlets impose the strictest restrictions, with less than 40% of their data available to AI crawlers; many of these sites, especially the most influential, block access completely. Conversely, far-right websites impose very few restrictions and are nearly 80% accessible. The data shows that center-left sites explicitly block specific AI crawlers at roughly eight times the rate of conservative sites and nearly ten times the rate of far-right sites.
“Our research reveals that the least factually accurate media outlets are the very outlets that are most accessible to AI crawlers,” said Tim Chambers, Principal and Co-Founder of Dewey Square Group, in a statement. “This is concerning, especially as LLMs wield increasing power over the information made available through traditional online search, social media placements, personal assistants and new venues emerging every day.”
The Unseen Editors and 'AI Bothsides-ism'
The report argues that this data asymmetry has profound implications. As LLMs become integrated into search engines and personal assistants, they are becoming the primary lens through which many people encounter facts and narratives. If the training data is fundamentally biased, the AI’s output will inevitably reflect and amplify that bias, potentially warping public understanding on a massive scale.
A new risk category identified in the report is “AI Bothsides-ism.” This phenomenon occurs when AI developers, in an attempt to appear politically neutral, create models that give false equivalence to competing claims. The report points to a 2025 paper from the AI company Anthropic on achieving “political even-handedness” as an example of this trend. Critics worry such an approach could lead an AI to present established vaccine science and anti-vaccine conspiracy theories with equal weight, misleading users under the guise of impartiality.
Compounding the problem is a sharp decline in transparency from major AI developers. While early model releases often included detailed information about training datasets, the report notes that since 2020, major tech companies have offered only vague descriptions. This opacity makes it nearly impossible for independent researchers to audit the models for bias or verify the effectiveness of any mitigation techniques. While AI companies claim that internal processes like fine-tuning can correct for data imbalances, these claims cannot be independently verified without access to the underlying data and methods.
A Market of Asymmetries
The structural skew in AI training data is not born from malicious intent but from a complex interplay of market forces and technical decisions. High-quality news organizations, many of which are center-left, are increasingly putting their content behind paywalls and using robots.txt files to block AI crawlers from scraping their websites. This is often done to protect their intellectual property and explore direct licensing deals with AI companies.
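The mechanism involved is simple: a robots.txt file is a plain-text policy at a site’s root that names which crawlers may fetch which paths. The illustrative file below refuses several AI-training crawlers site-wide while leaving other visitors unaffected; the user-agent tokens are the crawlers’ real published names, though any given publisher’s actual policy will differ.

```
# Illustrative robots.txt: refuse named AI-training crawlers
# site-wide while permitting all other user agents.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
```

Notably, this opt-out binds only crawlers that choose to honor it, which is one reason publishers pair it with paywalls and direct licensing terms.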
While the report documented dozens of such licensing agreements worth hundreds of millions of dollars, it concludes that these deals do not correct the underlying asymmetry. Conservative and far-right content often enters AI training pipelines through two channels: open-web crawling and licensing agreements. In contrast, high-quality center-left content that is blocked from crawlers must rely only on licensing as its pathway into the models. This creates a systemic over-representation of one side of the political and factual spectrum.
This dynamic creates a tension between publishers’ rights to control their content and the broader societal need for AI systems trained on balanced, high-quality information. Without a comprehensive solution, the default state of the open web provides a distorted view of the world for AI models to learn from.
A Call for Regulation and Responsible AI
The report's authors frame their findings as an urgent call for government intervention. They argue that the immense power of these AI tools cannot be left unchecked and that market forces alone have failed to produce a fair and accurate information ecosystem.
“As this technology increasingly impacts every aspect of our lives, it is imperative that we ensure factual distortions and imaginary realities are not allowed to shape public understanding,” stated Page Gardner, President of PSG Consulting and Founder of IFPG. “We either regulate these powerful AI tools or we let them regulate us.”
This call for oversight aligns with a growing global movement to establish guardrails for artificial intelligence. The European Union is implementing its landmark AI Act, which imposes strict requirements on “high-risk” AI systems, including mandates for data quality and transparency. In the United States, agencies like the Federal Trade Commission have signaled their intent to use existing consumer protection laws to combat AI-driven discrimination and deception.
Independent AI ethicists have long warned that biased data is one of the most significant threats posed by the technology, capable of perpetuating and even amplifying societal inequalities. The findings from PSG Consulting provide concrete data demonstrating how this bias is being systematically embedded into the foundational models that will power future technologies. Addressing this requires a multi-pronged approach involving greater transparency from developers, more strategic engagement from publishers, and clear, enforceable rules from regulators to ensure that the AI shaping our future is built on a foundation of fact.
