Blog · AI · May 12, 2026

AI integration readiness: an honest checklist before you ship

Most "we need AI" requests we hear are really retrieval, data quality, or workflow problems wearing an AI costume. Before recommending a model — let alone a fine-tune — we run a short readiness check. Here it is.

1. Can a human do it consistently?

If a knowledgeable human reads the same input twice and produces two different outputs, an LLM will too. AI is good at scaling judgement that's already well-defined, not at inventing it.

What to do: have two domain experts independently label 50 inputs. Measure their agreement. Under 70% inter-rater agreement and your AI feature will look broken regardless of the model.

2. Is the data actually accessible?

"We have all the data" usually means it lives in five SaaS tools, three databases, and someone's inbox. RAG over scattered, stale, or permissioned data is the most common reason AI features ship buggy.

What to do: write the ingestion plan before the prompt. Where does each source live, how often does it change, who owns refresh, and what's the access boundary per user?

3. Do you have an honest eval set?

If you can't measure quality, you can't ship. "It looked good in the demo" is not an eval. The team that ships the most reliable AI features is the team that writes the most adversarial evals.

What to do: start with 50–100 hand-labelled cases including the boring 80%, the painful 15% (ambiguous, partial info), and the dangerous 5% (off-topic, hostile, PII).

4. What's the worst-case output worth?

A copilot suggestion that's 80% right is great in a writing app and catastrophic in a payments flow. Calibrate the model and the guardrails to the cost of being wrong.

What to do: map each AI surface to a tier — drafting (high tolerance), assist (medium), automation (low). Tier dictates eval thresholds, human review, and rollout pacing.

5. Is the UX honest about uncertainty?

Users trust AI based on the UI, not the model card. Outputs presented as "answers" carry more weight than the same outputs labelled "suggestion (may be wrong)". Most AI feature failures are UX failures.

What to do: design the uncertainty affordances first — confidence ranges, citations, "I don't know" paths, edit/undo. Then plug the model in.

6. Who owns the prod model behaviour?

Once shipped, AI behaviour drifts — model updates, data updates, prompt updates, retrieval index changes. Someone needs to own the eval pipeline, the regression suite, and the cost dashboard.

What to do: name an owner per AI surface. Wire offline evals into CI so prompt PRs can't ship without an eval delta.

7. What's the boring baseline?

Before reaching for an LLM, write down the dumbest version of the feature: a keyword match, a heuristic, a lookup table. Measure it. Often the boring baseline gets you 70% of the value at 1% of the cost.

If you can answer all seven of these clearly, you're ready to build. If three or more are fuzzy, you're not ready for AI — you're ready for two weeks of discovery work, which is cheaper and usually more valuable.

We do this kind of readiness review as a fixed-scope AI engagement. Two weeks, written output, honest go/no-go.