When AI Fact-Checking Gets the Facts Backward

I am responsible for the agents I deploy

I used a custom-built agentic workflow to fact-check an article I was writing about Apple’s reported $5 billion deal with Google to power Siri with Gemini. The AI verified every claim as accurate. It was wrong about the most important one.

If you are deploying agents and agentic workflows, this case study clearly illustrates why pairing human subject matter experts with agents is going to be required for the foreseeable future. Not only does everyone need a sign on their desk that says, “I am responsible for the agents I deploy,” but every agent you deploy must also be subjected to relentless, continuous improvement workflows.

The $20 Billion Error

I was writing about Apple’s reported deal with Google to integrate Gemini into Siri. My draft contained this sentence: “Google already extracts $20 billion annually from Apple for search.”

The sentence is wrong. Google pays Apple roughly $20 billion annually (per Bloomberg’s 2024 reporting) to be the default search engine on iPhones. Apple receives the money; Google sends it. My draft had the direction reversed. I started typing one construction, changed my mind mid-sentence, got distracted, and came back to finish without noticing I had left the verb pointing the wrong way. It’s a pretty common multi-tasking error for me.

I ran the article through my AI critic system, a 35-page evaluation rubric that checks for accuracy, grammar, structure, and sourcing. Claude searched the web, found multiple sources confirming the $20 billion figure and the Google-Apple search relationship, and marked the claim as verified.

The AI proof-reader missed it. The human proof-reader did not!

If the story ended here, it would just be another “AI still sucks” rant. But there is more to learn. 1) Subject matter experts must evaluate all agentic outputs until management is confident in the quality of the results, and even then, evals and workflows must be constantly revised and updated. 2) Even the clearest instructions may yield unexpected results. Here’s what actually happened…

Entity Matching vs. Relation Extraction

In natural language processing, there are two distinct verification tasks. Entity matching confirms that the components of a claim (names, numbers, dates) appear in reliable sources. Relation extraction confirms that the relationships between those components are correctly represented.

My AI critic performed entity matching:

  • Does “$20 billion” appear in sources about this topic? Yes.
  • Do “Google” and “Apple” appear together in this context? Yes.
  • Is “search” the subject of the transaction? Yes.
  • Conclusion: Verified.

What it did not perform is relation extraction, specifically a subtask called semantic role labeling. SRL identifies the agent (who performs the action) and the recipient (who receives it). In the sentence “Google pays Apple,” Google is the agent and Apple is the recipient. In my erroneous sentence “Google extracts from Apple,” the roles are inverted.

The AI saw matching entities and pattern-matched to a false positive. It never parsed who was paying whom.
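
To make the distinction concrete, here is a minimal Python sketch (not my actual critic code): a naive entity-level check that happily passes my reversed sentence, next to a crude direction check that flags it. The sentence, the fact dictionary, and the keyword heuristics are all illustrative assumptions.

```python
# Minimal, hypothetical sketch -- not the actual critic code.
# Entity matching asks "are the pieces present?"; the direction check
# asks "who is doing what to whom?"

DRAFT = "Google already extracts $20 billion annually from Apple for search."
SOURCE_FACT = {"payer": "Google", "payee": "Apple",
               "amount": "$20 billion", "subject": "search"}

def entity_match(sentence: str, fact: dict) -> bool:
    """Entity matching: every component of the claim appears in the sentence."""
    s = sentence.lower()
    return all(value.lower() in s for value in fact.values())

def direction_check(sentence: str, fact: dict) -> bool:
    """Crude semantic-role check: is the payer the grammatical agent?"""
    s = sentence.lower()
    payer, payee = fact["payer"].lower(), fact["payee"].lower()
    if "pays" in s or "sends" in s:
        # "Google pays Apple ..." -- the payer should come first.
        return s.index(payer) < s.index(payee)
    if "extracts" in s or "receives" in s:
        # "Apple extracts ... from Google ..." -- the payee should come first.
        return s.index(payee) < s.index(payer)
    return False  # Unknown verb: refuse to verify rather than guess.

print(entity_match(DRAFT, SOURCE_FACT))     # True  -- all entities present
print(direction_check(DRAFT, SOURCE_FACT))  # False -- the roles are inverted
```

A production verifier would lean on a dependency parse or a carefully prompted model rather than keyword heuristics, but the contrast holds: the second function asks a question the first one never asks.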

The Agentic Flywheel

The failure was not a bug. It was the system working exactly as designed, revealing a gap I needed to close.

I use an “agentic flywheel” for content production. Context profiles establish identity and voice. Skills define what the agents can do. Pre-prompts and guardrails set boundaries. Agents execute the work. Evals score each draft against defined criteria. Then you revise and repeat.

The key word is repeat. Every failure that surfaces a gap becomes an input to the next iteration. The system improves because it fails in visible, correctable ways.
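
For readers who think in code, the loop is simple enough to sketch. The callables below are hypothetical stand-ins, not my actual tooling; the only point is that eval failures feed the next revision of both the draft and the rubric.

```python
# Hypothetical sketch of the flywheel loop -- the callables are stand-ins,
# not my production workflow.

def flywheel(task, run_agents, run_evals, revise_rubric, rubric, rounds=3):
    """Agents draft, evals score, and every failure updates the rubric."""
    draft = run_agents(task)                      # agents execute the work
    for _ in range(rounds):
        failures = run_evals(draft, rubric)       # evals score against defined criteria
        if not failures:
            break                                 # passed -- a human SME still reviews
        rubric = revise_rubric(rubric, failures)  # the gap becomes a new criterion
        draft = run_agents(task)                  # revise and repeat
    return draft, rubric

# Toy run with stand-in callables:
draft, rubric = flywheel(
    task="fact-check the Gemini/Siri article",
    run_agents=lambda task: f"draft for: {task}",
    run_evals=lambda draft, rubric: [],           # pretend every check passed
    revise_rubric=lambda rubric, failures: rubric + failures,
    rubric=["entity accuracy", "source attribution"],
)
print(draft)
```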

My critic rubric had 35 pages of evaluation criteria. It checked evidence quality, source attribution, date-stamping, and factual accuracy at the entity level. It did not check relational accuracy. That gap was invisible until a real error exposed it.

The Fix

I added explicit relational verification to every critic rubric. The protocol has four steps, with a minimal code sketch after the list:

1. Atomic decomposition. Break every two-party claim into its components: ACTOR → ACTION → RECIPIENT → AMOUNT. My erroneous sentence decomposed as: Google → extracts (receives) → from Apple → $20 billion. The correct relationship: Google → pays → Apple → $20 billion. Same actor, same recipient, same amount. Reversed action.

2. Direction verification. For any claim involving a transaction, payment, acquisition, partnership, lawsuit, or announcement, explicitly verify the direction matches the source. State it out loud: “Source says Google pays Apple. Draft says Google extracts from Apple. Direction: Reversed.”

3. The reverse test. Check that the reversed claim does not match your sources. If “Google pays Apple” verifies and “Apple pays Google” also verifies, your verification process is broken. Both directions cannot be true.

4. Auto-fail trigger. Any relational inversion is now an automatic failure regardless of other scores. No amount of good structure or compelling prose rescues a factual inversion.
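
Here is a minimal sketch of how the four steps might look in code. The decomposition is supplied by hand for clarity, and every name and data structure is an illustrative assumption, not the actual rubric implementation.

```python
# Hypothetical sketch of the relational-verification protocol -- illustrative
# names and structures, not the actual rubric implementation.

from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    actor: str       # who performs the action
    action: str      # normalized to "pays" or "receives"
    recipient: str   # who is on the other end
    amount: str

# Step 1: atomic decomposition (ACTOR -> ACTION -> RECIPIENT -> AMOUNT).
source = Claim("Google", "pays", "Apple", "$20 billion")      # per the reporting
draft  = Claim("Google", "receives", "Apple", "$20 billion")  # "extracts from" = receives

def normalize(claim: Claim) -> Claim:
    """Express every claim as payer -> pays -> payee so directions compare cleanly."""
    if claim.action == "receives":
        return Claim(claim.recipient, "pays", claim.actor, claim.amount)
    return claim

def direction_verified(source: Claim, draft: Claim) -> bool:
    """Step 2: the draft's payment direction must match the source's."""
    return normalize(draft) == normalize(source)

def reverse_test(source: Claim, draft: Claim) -> bool:
    """Step 3: the reversed draft must NOT also verify; if both do, the check is broken."""
    flipped = Claim(draft.recipient, draft.action, draft.actor, draft.amount)
    return not (direction_verified(source, draft) and direction_verified(source, flipped))

print("direction matches source:", direction_verified(source, draft))  # False
print("reverse test ok:", reverse_test(source, draft))                 # True

# Step 4: auto-fail trigger -- a relational inversion fails the draft outright.
if not direction_verified(source, draft):
    print("AUTO-FAIL: relational inversion (source says Google pays Apple).")
```

Run against my erroneous sentence, the sketch trips the auto-fail, which is exactly the check the original 35-page rubric was missing.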

Why AI Verification Might Miss This

Most AI fact-checking systems, including mine, perform what amounts to sophisticated entity matching. The tools excel at confirming that components exist in sources. They struggle to confirm that relationships between components are correctly represented.

This is not a model-specific limitation. Large language models predict likely token sequences based on patterns. “Google,” “$20 billion,” “Apple,” and “search” appearing together is a strong pattern in training data. The direction of the payment is a semantic detail that requires parsing the structure of the claim, not just its components.

Until AI systems reliably perform semantic role labeling as part of verification, humans remain the last line of defense for relational accuracy. A human subject matter expert caught this error because they knew about the Google-Apple search deal. A reader unfamiliar with the relationship would have accepted the reversed claim as verified fact.

Hill Climbing

If you use agents for content verification, ask whether your system simply verifies that entities exist or whether it also checks that the relationships between those entities are correctly represented. If you don’t get a good answer, you have a gap that needs your attention.

The embarrassing part of this story is that I wrote the error. The useful part is that the failure justified our healthy paranoia about the state of our agentic workflows, highlighted why human subject matter experts are currently irreplaceable, and revealed precisely how our agentic critic workflow needed to be updated.

This is how agentic systems actually improve. Better feedback loops matter more than better models. Test, fail, learn, revise. The flywheel turns, the system learns!

Author’s note: This is not a sponsored post. I am the author of this article and it expresses my own opinions. I am not, nor is my company, receiving compensation for it. This work was created with the assistance of various generative AI models.

