AI Enrichment That Does Not Hallucinate: Typed Failure Contracts and Deterministic Scoring

By Alejandro Neckles · June 2026

Ask a language model to find the website for a company called Hastings Custom Millwork, and there is a decent chance it returns hastingscustommillwork.com whether or not that domain has ever existed. The answer looks right. It is formatted right. It is the kind of answer a junior researcher would hand you. And it is invented.

That failure mode is tolerable when a person is checking each result. It is fatal when the job is enriching a CRM: thousands of company records, each needing a website, a category, a location, a size estimate. At that scale nobody is checking each result, which is the entire reason you reached for a model in the first place. So the question that decides whether AI enrichment is an asset or a contamination event is not "how good is the model". It is: what happens, structurally, when the model is wrong?

For most pipelines, the honest answer is "nothing". The wrong value lands in the CRM with the same confidence as a right one, and your sales team spends the next year discovering the difference one awkward phone call at a time.

Why "prompt it better" does not work

The first instinct is to tighten the prompt: tell the model not to guess, tell it to say "unknown" when unsure, tell it to double-check. This helps at the margin and fails as a guarantee, because the model's incentives never change. It was trained to produce plausible completions, and a plausible-looking domain is a perfectly good completion. Instructions reduce the rate of invention. They do not give you a way to detect the inventions that remain.

The second instinct is worse: ask the model to rate its own confidence. Self-reported confidence from a language model is another generated output, subject to exactly the failure you are trying to catch. A model that fabricates a website will cheerfully fabricate an 8 out of 10 to go with it. Letting the model grade its own work converts one unverified claim into two.

Typed failure contracts

The structural fix starts by changing what the model is allowed to say. In a typed pipeline, the model does not return prose. It returns a claim plus the evidence for that claim: the value, the source it came from, and the text in that source that supports it. Then code, not the model, checks the contract.

Every check that can fail has a name. If the model cites a URL, the pipeline fetches it; a URL that does not resolve is rejected as FABRICATED_URL. If the model claims a value came from a page, the pipeline looks for that value in the page text it actually retrieved; a value that is not there is rejected as VALUE_NOT_IN_SOURCE_TEXT. The failure taxonomy grows the same way a test suite grows: every new way the model finds to be wrong becomes a named, machine-checkable rejection.
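To make the contract concrete, here is a minimal sketch in Python. Only the two failure names come from the text above. The Claim shape, the check_claim function, and the injected fetch callable are illustrative assumptions, not the production code, and a real fetcher would make an HTTP request where this one reads from a dictionary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Failure(Enum):
    # The two named rejections described in the article.
    FABRICATED_URL = "FABRICATED_URL"
    VALUE_NOT_IN_SOURCE_TEXT = "VALUE_NOT_IN_SOURCE_TEXT"


@dataclass(frozen=True)
class Claim:
    # Hypothetical claim shape: a value plus the evidence offered for it.
    field_name: str   # e.g. "website"
    value: str        # what the model asserts
    source_url: str   # where the model says it found the value
    quote: str        # supporting text the model says is on that page


# Assumption: a fetcher returns page text, or None when the URL does not resolve.
Fetcher = Callable[[str], Optional[str]]


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so layout differences do not cause misses.
    return " ".join(text.lower().split())


def check_claim(claim: Claim, fetch: Fetcher) -> Optional[Failure]:
    """Return the named failure, or None if the claim survives both checks."""
    page = fetch(claim.source_url)
    if page is None:
        return Failure.FABRICATED_URL
    if normalize(claim.value) not in normalize(page):
        return Failure.VALUE_NOT_IN_SOURCE_TEXT
    return None


# Usage with a stand-in fetcher backed by a dictionary of retrieved pages.
pages = {"https://example.com/about": "Family owned.  Visit us at example.com today."}
claim = Claim("website", "example.com", "https://example.com/about", "Visit us at example.com")
result = check_claim(claim, pages.get)  # None: the claim passes
```

The point of the shape is that the model never gets a vote. It fills in the Claim; check_claim decides. This sketch leaves the quote field unchecked, and a fuller pipeline would presumably verify it the same way and give that rejection its own name.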

Two properties make this different from hoping. First, failures are data. You can count them, sort by them, and watch a model update change the rejection mix. "The model hallucinates sometimes" becomes "FABRICATED_URL is 3% of attempts this week, here are the records". Second, rejection is the default. A claim that does not survive its checks never reaches the CRM. The cost of a failed enrichment is an empty field, and an empty field is honest. It tells a salesperson "we do not know" instead of lying to them in a confident font.
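"Failures are data" can be as simple as the following sketch. It assumes each attempt is logged as a pair of record ID and failure name, with None for an accepted claim. That log shape and the summarize function are hypothetical, chosen only to show the counting.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed log shape: (record_id, failure name, or None when the claim was accepted).
Outcome = Tuple[str, Optional[str]]


def summarize(outcomes: Iterable[Outcome]) -> Tuple[Dict[str, float], Dict[str, List[str]]]:
    """Return each failure's share of all attempts, and the records behind it."""
    counts: Counter = Counter()
    records: Dict[str, List[str]] = defaultdict(list)
    total = 0
    for record_id, failure in outcomes:
        total += 1
        if failure is not None:
            counts[failure] += 1
            records[failure].append(record_id)
    # counts is empty when total is 0, so this never divides by zero.
    rates = {name: n / total for name, n in counts.items()}
    return rates, dict(records)


log = [
    ("rec-1", None),
    ("rec-2", "FABRICATED_URL"),
    ("rec-3", None),
    ("rec-4", "VALUE_NOT_IN_SOURCE_TEXT"),
]
rates, records = summarize(log)
# rates["FABRICATED_URL"] is 0.25, and records["FABRICATED_URL"] is ["rec-2"]
```

Run the same summary before and after a model update and the change in the rejection mix is a diff, not an impression.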

Confidence is computed, never reported

Records that pass their checks still are not all equal. A value confirmed by the company's own site and a directory listing is stronger than a value found once on a page that mentions three businesses. So the pipeline scores each enrichment, and the scoring is deterministic: a function, written by people, over observable signals. Did the cited source resolve? Did the value appear verbatim in the source text? Do independent sources agree? Does the claimed category exist in the controlled taxonomy?

The model contributes claims. Code computes the score. Run the same evidence through twice and you get the same number, which means thresholds mean something: above this score, auto-accept; below that one, auto-reject; in between, queue for a person. When the threshold is wrong you tune a function, not a prompt, and the change is reviewable in version control like any other change to a production system.
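A sketch of what such a scoring function might look like follows. The four signals are the ones listed above. The point values, the two thresholds, and the names score and route are invented for illustration and would need tuning against real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    # Observable signals only; nothing here is self-reported by the model.
    source_resolved: bool
    value_verbatim_in_source: bool
    agreeing_independent_sources: int
    category_in_taxonomy: bool


# Illustrative thresholds (assumptions, not the article's real values).
ACCEPT_AT = 80
REJECT_BELOW = 50


def score(e: Evidence) -> int:
    """Pure function of the evidence: same input, same number, every run."""
    points = 0
    if e.source_resolved:
        points += 30
    if e.value_verbatim_in_source:
        points += 30
    # Cap the agreement bonus so a pile of weak sources cannot dominate.
    points += 15 * min(e.agreeing_independent_sources, 2)
    if e.category_in_taxonomy:
        points += 10
    return points


def route(points: int) -> str:
    if points >= ACCEPT_AT:
        return "auto_accept"
    if points < REJECT_BELOW:
        return "auto_reject"
    return "human_review"


strong = Evidence(True, True, 2, True)
thin = Evidence(True, True, 0, False)
decision_strong = route(score(strong))  # score 100, "auto_accept"
decision_thin = route(score(thin))      # score 60, "human_review"
```

Integer points keep the arithmetic exact, so a record sitting on a threshold lands on the same side every time.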

Verification you can audit, mistakes you can undo

Even a disciplined pipeline earns trust through sampling, not assertion. The pattern we use: draw a seeded random sample of enriched records (seeded so the sample itself is reproducible and nobody can cherry-pick), have a person verify each one against the live world, and treat the sample error rate as the pipeline's error rate until a bigger sample says otherwise.
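Seeded sampling takes only a few lines with Python's standard library. In this sketch the function names and the verdict format are assumptions. The one deliberate design choice is sorting the IDs before drawing, so the sample depends only on the seed and the set of records, not on the order a query happened to return them in.

```python
import random
from typing import Dict, Iterable, List


def seeded_sample(record_ids: Iterable[str], sample_size: int, seed: int) -> List[str]:
    """Draw a reproducible sample: same seed and same records give the same sample."""
    ordered = sorted(record_ids)  # remove any dependence on input order
    rng = random.Random(seed)     # private generator, unaffected by global state
    return rng.sample(ordered, min(sample_size, len(ordered)))


def sample_error_rate(verdicts: Dict[str, bool]) -> float:
    """verdicts maps record ID to True if a person verified the value as correct."""
    if not verdicts:
        raise ValueError("cannot estimate an error rate from an empty sample")
    wrong = sum(1 for correct in verdicts.values() if not correct)
    return wrong / len(verdicts)


ids = [f"rec-{n:04d}" for n in range(1000)]
first = seeded_sample(ids, 50, seed=2026)
again = seeded_sample(reversed(ids), 50, seed=2026)
# first == again: anyone holding the seed can redraw the exact sample and audit it.
```

Recording the seed next to the results is what stops cherry-picking: a reviewer can regenerate the sample and confirm that nothing was swapped out.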

And because every write to the system of record is a batch with an identity, every batch ships with its rollback before it runs. When sampling does surface a systematic problem, the response is an undo command, not a quarter of manual cleanup. Enrichment that cannot be reversed is not automation. It is risk with good throughput.
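One way to read "ships with its rollback before it runs" is sketched below. A plain dictionary keyed by record ID and field name stands in for the CRM, and the Batch class and the prepare, apply_batch, and rollback functions are hypothetical names. A real system of record would do this through its API and would also need to handle edits made after the batch ran.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]   # (record_id, field_name)
Crm = Dict[Key, str]    # in-memory stand-in for the system of record


@dataclass(frozen=True)
class Batch:
    batch_id: str
    writes: Dict[Key, str]
    undo: Dict[Key, Optional[str]]  # prior value per key; None means "was empty"


def prepare(batch_id: str, writes: Dict[Key, str], crm: Crm) -> Batch:
    """Capture the prior state of every key BEFORE anything is written."""
    undo = {key: crm.get(key) for key in writes}
    return Batch(batch_id, dict(writes), undo)


def apply_batch(batch: Batch, crm: Crm) -> None:
    crm.update(batch.writes)


def rollback(batch: Batch, crm: Crm) -> None:
    """Restore each touched field to what it held before the batch ran."""
    for key, previous in batch.undo.items():
        if previous is None:
            crm.pop(key, None)  # the field was empty, so make it empty again
        else:
            crm[key] = previous


crm: Crm = {("rec-1", "website"): "old.example.com"}
before = dict(crm)
batch = prepare(
    "batch-001",
    {("rec-1", "website"): "new.example.com", ("rec-2", "category"): "Millwork"},
    crm,
)
apply_batch(batch, crm)
rollback(batch, crm)
# crm == before: the overwrite is reverted and the new field is empty again.
```

Because the undo data is captured in prepare, the rollback exists before the first write happens, which is the ordering the paragraph above insists on.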

The honest costs

This approach rejects work a looser pipeline would accept, and some of those rejections are false alarms: real values discarded because their evidence was thin. You will enrich fewer records than the demo promised, and the gap is the price of every accepted record meaning something.

It is also engineering, not configuration. Fetching sources, matching values against page text, maintaining a failure taxonomy, building the review queue: budget weeks, not an afternoon with an API key. For a one-time enrichment of a few hundred records, a person with a spreadsheet may genuinely be cheaper. The structure pays for itself when the volume is in the thousands and the output lands in a system your team has to trust daily.

Run in production, on our own data

This is not a proposal. It is how the migration of Sprout IQ, our own business, was done: more than 8,000 company records enriched and scope-filtered under exactly this discipline, with typed failure checks on every machine-made claim, deterministic scoring, seeded spot checks, and staged rollbacks. The CRM that came out the other side is one our own sales motion runs on every day, which is a sharper incentive for honesty than any service agreement.

If a vendor offers you AI enrichment, the test fits in one question: when the model is wrong, show me where the wrongness goes. If the answer is not a named failure, in a log, attached to a record, with a way to undo the batch, then the answer is "into your CRM".

Enrichment you cannot audit is contamination on a schedule.

Neckles IO builds AI data pipelines with verification, scoring, and rollback designed in from the first record.
