Production AI Review
For live AI features that have stopped improving. Two to three weeks, fixed-price. A structured audit of the eval system, the iteration loop, and the metrics you're trusting — followed by a written plan to get the team back to confident shipping.
When the iteration loop breaks
The AI shipped. Now nobody's sure if changes help
The pattern is consistent. Six to eighteen months ago the team shipped an AI feature. It worked. Users adopted it. The first few iterations improved it noticeably. Then progress slowed. Now changes ship and nobody's sure if they helped, hurt, or did nothing — because the instrument the team is using to measure quality isn't really measuring quality.
The instinct at this point is usually to add more model power, switch vendors, or hire someone senior. Sometimes that's right. Usually what's actually broken is upstream of the model: the eval system, the error-analysis discipline, the iteration loop. Fix those, and the existing model gets noticeably better the next week.
Symptoms you'll probably recognize
- Quality has plateaued. Users notice. The team can't pinpoint why.
- Engineers are afraid to ship changes because the metrics don't tell them whether the product got better or worse.
- Manual QA is the only safety net, and it doesn't scale with the team or the surface area.
- Bug reports describe failure modes nobody on the team can reproduce reliably.
- Every iteration cycle takes weeks because nobody trusts the experimentation pipeline.
- Leadership wants "better AI" but can't tell the team what better means in measurable terms.
What changes by the end
From "we ship and hope" to a measurable iteration loop
Before
Engineers afraid to ship because metrics don't tell the truth, manual QA bottlenecking every change, and leadership asking for "better AI" without a definition the team can act on.
After
A 30/60/90 plan for the eval system you should have, a ranked unblock list your engineers can start on Monday, and a shared definition of what "better" actually means for this product.
What you receive
Documents your team will actually use
Two to three weeks. Fixed-price. Output is structured working artifacts, not slides.
Written diagnosis
A 10–15 page document with my read on where your iteration loop is broken, prioritized by impact. Specific findings, with evidence, organized so your team can argue with them productively.
Eval roadmap
A concrete plan for the eval system you should have in 30, 60, and 90 days. What to build first, what to defer, what tooling to use, what the team needs to learn.
Iteration unblock list
Ranked list of the specific changes — process, infra, ownership — that would get the team back to confident shipping. Each item is scoped tightly enough that an engineer can pick it up and start.
Working session with your team
Closing session where I walk through the diagnosis with your product, engineering, and leadership stakeholders. The goal is shared ownership of the plan, not just delivery of a document.
What we audit
Six places production AI plateaus
Most engagements find serious gaps in three or four. The diagnosis names which ones, with evidence, and ranks them by impact.
The eval system (or absence of one)
What does "good" look like for this product, in numbers? What metrics actually correlate with user-facing quality? Where are the eval gaps that hide regressions? I look at what's in place, what's missing, and what's measuring the wrong thing.
Error analysis discipline
Is the team systematically reviewing failure cases, or chasing whichever bug surfaced loudest this week? Real iteration starts with structured error analysis. I look at how (and whether) failures are being categorized, prioritized, and turned into eval cases.
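To make "categorized, prioritized, and turned into eval cases" concrete, here is a minimal Python sketch of one way to do it. The `Failure` record, the category labels, and the sample data are all hypothetical; a real review would use labels that come out of reading your own failure cases.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """One reviewed failure case (hypothetical record shape)."""
    input_text: str
    output_text: str
    category: str  # label assigned by a human reviewer


def rank_categories(failures):
    """Return (category, count, share) tuples, most frequent first."""
    counts = Counter(f.category for f in failures)
    total = sum(counts.values())
    return [(cat, n, n / total) for cat, n in counts.most_common()]


def to_eval_cases(failures, category):
    """Turn every failure in one category into a regression eval case."""
    return [
        {
            "input": f.input_text,
            "known_bad_output": f.output_text,
            "category": f.category,
        }
        for f in failures
        if f.category == category
    ]


# Hypothetical sample data, for illustration only.
reviewed = [
    Failure("Summarize the refund policy", "Refunds take 90 days", "wrong_fact"),
    Failure("Summarize the shipping policy", "We ship to Mars", "wrong_fact"),
    Failure("Translate to French", "(reply in English)", "ignored_instruction"),
]

for category, count, share in rank_categories(reviewed):
    print(f"{category}: {count} cases ({share:.0%})")
```

Ranking by frequency is only one possible priority order. Weighting each category by user impact would be a reasonable alternative.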
The iteration loop
From idea → change → measurement → ship, how long does one cycle take? Where does it stall? Most plateaued AI products have a broken loop somewhere — a slow eval run, a missing metric, a deployment process that discourages experimentation. I find where the loop is broken.
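One simple way to see where a cycle stalls is to timestamp each stage and compare the gaps. The sketch below assumes you can recover four timestamps per change (from tickets, commits, eval runs, and deploys); the stage names mirror the loop above, and the dates are made up.

```python
from datetime import datetime

# Stage names follow the idea -> change -> measurement -> ship loop.
STAGES = ["idea", "change", "measurement", "ship"]


def stage_durations(timestamps):
    """Hours spent between consecutive stages.

    `timestamps` maps each stage name to the datetime it was reached.
    """
    durations = {}
    for start, end in zip(STAGES, STAGES[1:]):
        delta = timestamps[end] - timestamps[start]
        durations[f"{start} -> {end}"] = delta.total_seconds() / 3600
    return durations


def slowest_transition(timestamps):
    """Name of the transition that took the longest."""
    durations = stage_durations(timestamps)
    return max(durations, key=durations.get)


# Hypothetical cycle: the change is ready in a day, then waits a week for evals.
cycle = {
    "idea": datetime(2024, 1, 1, 9, 0),
    "change": datetime(2024, 1, 2, 9, 0),
    "measurement": datetime(2024, 1, 9, 9, 0),
    "ship": datetime(2024, 1, 10, 9, 0),
}

for transition, hours in stage_durations(cycle).items():
    print(f"{transition}: {hours:.0f} h")
print("slowest:", slowest_transition(cycle))
```

A single cycle proves little. Running the same calculation over the last ten or twenty changes shows whether the stall is consistent.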
Metrics you can actually trust
Vanity metrics, lagging metrics, and metrics that move for the wrong reasons are everywhere in production AI. I audit what's being measured against what's actually predictive of user value, and recommend what to add, change, or stop tracking.
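A rough first check on whether a metric is predictive is to see whether it moves with a user-facing signal at all. The sketch below computes a plain Pearson correlation between a hypothetical offline eval score and a hypothetical thumbs-up rate, one value per release. The series names and numbers are invented, and correlation over a handful of releases is a smell test, not proof.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-release values: the offline score climbs steadily
# while the user-facing signal stays flat.
offline_eval_score = [0.71, 0.74, 0.78, 0.80, 0.83, 0.85]
thumbs_up_rate = [0.62, 0.61, 0.63, 0.60, 0.62, 0.61]

r = pearson(offline_eval_score, thumbs_up_rate)
print(f"correlation with user signal: {r:.2f}")
```

A metric that keeps improving while the user signal does not follow is a candidate for "moves for the wrong reasons" and deserves a closer look before anyone optimizes against it.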
Architecture and infrastructure that fights you
Sometimes the iteration problem is architectural — a prompt that can't be A/B-tested, a RAG layer that can't be inspected, a model swap that requires a deploy. I flag the infra choices that are silently slowing your team down.
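As an illustration of what "a prompt that can be A/B-tested" might look like, here is a minimal Python sketch that keeps prompt variants in data and assigns each user to one deterministically. The variant texts, the experiment name, and the function names are hypothetical, and a production setup would also log the assigned variant next to each output.

```python
import hashlib

# Hypothetical prompt variants, kept in data rather than hard-coded in a call site.
PROMPT_VARIANTS = {
    "control": "Answer the question using only the provided context.",
    "candidate": (
        "Answer using only the provided context. "
        "Say 'I don't know' if the context does not contain the answer."
    ),
}


def assign_variant(user_id, experiment, variants=PROMPT_VARIANTS):
    """Deterministically map a user to one variant name.

    Hashing experiment + user id means the same user always sees the
    same variant, with no assignment table to store.
    """
    names = sorted(variants)
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return names[int(digest, 16) % len(names)]


def build_prompt(user_id, question, experiment="prompt-exp-1"):
    """Return (variant name, full prompt text) for this user."""
    variant = assign_variant(user_id, experiment)
    return variant, f"{PROMPT_VARIANTS[variant]}\n\nQuestion: {question}"


variant, prompt = build_prompt("user-42", "What is the refund window?")
print(variant)
print(prompt)
```

Because the prompt text lives in a dictionary, adding or retiring a variant is a data change. Whether that can ship without a deploy depends on where the dictionary is loaded from.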
Team and ownership patterns
Who owns the eval suite? Who owns model decisions? Who decides when something ships? Plateaued AI products almost always have ownership ambiguity that's invisible until you map it. I map it.
Not a fit
When to skip the Review
- Your AI feature isn't live yet — start with a Strategy Sprint to commit to the right bet first.
- Your team has no production telemetry, no logs, no sample inputs to work from. The Review needs raw material; without it, the diagnosis is guesswork.
- You want me to build the eval system for you. The Review produces the roadmap; implementation goes to your team or a partner.
- You're hoping a more powerful model will fix the plateau. Sometimes it will — but if the iteration loop is broken, a better model just plateaus higher.
Common questions
What teams ask before a Production Review
- How long does the engagement take?
  Two to three weeks end-to-end. Week one is access, audit, and structured interviews. Week two is analysis and the eval roadmap. The optional third week is the closing working session and any iteration on the diagnosis based on what surfaces in that session.
- What does the engagement cost?
  Fixed-price, scoped on the discovery call. The price depends on the surface area of the AI product, the number of stakeholders, and whether the engagement extends across multiple model or product surfaces. It is sized to fit between a Decision Review and a full Strategy Sprint.
- What access do you need to my system?
  Read access to the codebase, eval suite (if one exists), and any logging or observability for the AI features. Sample inputs and outputs for the failure modes you care about. Time with two or three engineers and a product owner. I don't need production write access.
- What if we don't have an eval system at all?
  That's the most common starting point. Roughly half the teams I work with at this stage have ad-hoc spot-checking and call it evaluation. The Production AI Review is structured to handle that case: the diagnosis identifies what evals you should have, and the roadmap gets you there in 30/60/90 days.
- Will you build the eval system for us after?
  Sometimes, if it's a clean fit and the work is in scope. More often I hand the roadmap to your team or to an implementation partner. Most teams have the engineering capacity; what they're missing is the structure and the discipline, which is what the Review provides.
- How is this different from a Strategy Sprint?
  A Strategy Sprint asks "should we bet on this AI initiative and, if so, how?" A Production AI Review assumes you've already bet and the AI is live, and asks "why isn't it getting better, and how do we fix the loop?" Different question, different deliverable.
- Do you sign NDAs?
  Yes. I'm happy to sign yours, or to use a simple mutual NDA if you don't have a standard one.
Start the conversation
Tell me where you're stuck
A few sentences on the AI feature, how long it's been live, and what the symptoms look like. I'll respond personally within a day or two.