Chapter 6 of 8

Track 4: Progressive Automation

Moving workflows up the autonomy stack — augmentation first, autonomy where it has been earned. The wrong order produces confident wrongness at industrial scale.

Failed AI initiatives keep repeating the same pattern. A team builds an automation around a workflow that wasn’t well-encoded, against a knowledge base that wasn’t well-governed, and ships it broadly. The automation is convincing: confident outputs, plausible reasoning, pleasant UX. Then it is wrong about something important, and the team discovers there was no review process to catch it, no eval to surface it, and no ownership to fix it. Trust collapses. The automation gets quietly rolled back. The team concludes “AI didn’t work for us.”

What didn’t work was the sequence. Automation came before the foundations that automation requires. This chapter is about getting the sequence right.

The autonomy stack

There is a useful spectrum that runs from “AI suggests, human does everything” to “AI does, human approves nothing.” Most of the value lives in the middle of that spectrum. Most of the failures live at the fully autonomous end.

Level A — Suggestion. The agent proposes; the human accepts, modifies, or rejects every output. Coding assistants in their default mode. Drafting tools. Most chat-based AI today. Low risk, modest leverage.

Level B — Augmentation. The agent does the routine work; the human reviews and finalizes. The reviewer is doing meaningful work, but the bulk of the production is automated. Code review with AI pre-pass. Spec drafting from a structured brief. Customer-issue triage with AI-proposed routing. Higher leverage, manageable risk.

Level C — Supervised autonomy. The agent operates a workflow end-to-end. A human reviews exceptions or samples, but does not see most outputs. The leverage is real and the risk concentrates: the things humans don’t see can compound into systemic problems before anyone notices.

Level D — Autonomous. The agent operates without per-output review. Humans review only at boundaries — escalations, anomalies, periodic audits. Highest leverage, highest risk. Appropriate only for narrow workflows with strong evals, well-understood failure modes, and reversible outputs.

The discipline is to advance any single workflow up this stack one level at a time, only after the prior level has produced enough evidence to justify the move. Skipping levels is what produces the failure pattern.
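
The one-level-at-a-time rule is simple enough to enforce in code. The sketch below is illustrative only: the `AutonomyLevel` enum, the `promote` function, and the `evidence_ok` flag are hypothetical names, and how the evidence gets judged is left to the promotion criteria later in this chapter.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """The four levels of the autonomy stack, least to most autonomous."""
    SUGGESTION = 1           # Level A
    AUGMENTATION = 2         # Level B
    SUPERVISED_AUTONOMY = 3  # Level C
    AUTONOMOUS = 4           # Level D


def promote(current: AutonomyLevel, target: AutonomyLevel,
            evidence_ok: bool) -> AutonomyLevel:
    """Return the new level, refusing skipped levels and unearned moves."""
    if target - current != 1:
        raise ValueError(
            f"can only advance one level at a time: {current.name} -> {target.name}"
        )
    if not evidence_ok:
        raise ValueError(f"{current.name} has not yet justified a promotion")
    return target
```

A request to move a workflow straight from `AUGMENTATION` to `AUTONOMOUS` raises an error here, regardless of how good the evidence looks.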

The augmentation-first principle

The default for any new AI workflow should be Level B, augmentation. The agent does the work; the human reviews every output. This sounds slow. It is slow, by design.

Two things happen during augmentation that don’t happen at higher autonomy levels:

The reviewer corrects the agent. Every correction is data — about what the agent gets wrong, about what the encoding missed, about where the knowledge base is thin or stale. Aggregated across reviewers, these corrections are the highest-quality input you’ll ever get for improving the system.

The reviewer learns the agent’s failure modes. This sounds soft, but it’s load-bearing. Before you give an agent more autonomy, the people who will be living with its outputs need to have an honest mental model of where it fails. That model is built by reviewing outputs, not by reading docs about the agent.

Until both of these have happened consistently for a workflow — corrections being captured and folded back, failure modes being internalized by the team — the workflow is not ready for Level C. Pushing it there anyway is the failure pattern.
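
Capturing corrections only pays off if they are recorded in a form that can be aggregated. One possible shape is sketched below. The `Correction` record, its fields, and the category labels are assumptions made for illustration, not a schema this playbook prescribes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    reviewer: str
    output_id: str
    category: str  # hypothetical labels, e.g. "encoding_gap", "stale_knowledge"
    note: str


def top_failure_modes(corrections: list[Correction],
                      n: int = 3) -> list[tuple[str, int]]:
    """Count corrections per category across all reviewers, most frequent first."""
    return Counter(c.category for c in corrections).most_common(n)
```

The most frequent categories indicate what to fix first: the encoding, the knowledge base, or both.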

The promotion criteria

A workflow earns its way up the stack. The criteria are not subjective; you can write them down before you build, and you should.

To promote a workflow from Level B to Level C, the criteria look like:

The encoded process has been stable, with no major edits, for at least 30 days.
The eval suite covers the known failure modes and is run before any change.
The error rate on a representative sample is below a defined threshold.
The reviewer’s most recent set of corrections has been folded into the encoding.
A defined exception path exists — there is a specific way the agent escalates when it is uncertain, and there is a specific person who handles escalations.
A rollback path exists — if the workflow starts misbehaving, there is a documented one-step way to revert it to Level B.

To promote from Level C to Level D, add:

A monitoring layer reports on workflow output quality continuously, not on demand.
An audit cadence is in place. Someone reviews a sample of recent outputs at a defined interval.
The cost of any individual mistake is bounded — outputs are reversible, or errors are caught at a downstream checkpoint, or the consequences of a wrong output are recoverable.

If any of these aren’t true, the workflow stays at the current level. The right answer is not to lower the criteria. The right answer is to keep the workflow where it is and invest in the missing piece.
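
Because the criteria can be written down before you build, they can also be written as a checklist that returns whatever is still missing. The following is a minimal sketch under stated assumptions: the `Evidence` fields are hypothetical, and `MAX_ERROR_RATE` is a placeholder for the “defined threshold,” which each team has to set for itself.

```python
from dataclasses import dataclass
from typing import Optional

MAX_ERROR_RATE = 0.02  # assumed value; the text only requires "a defined threshold"
MIN_STABLE_DAYS = 30   # from the Level B to Level C criteria


@dataclass
class Evidence:
    days_since_major_edit: int
    evals_cover_known_failures: bool
    evals_run_before_changes: bool
    sampled_error_rate: float
    corrections_folded_in: bool
    escalation_owner: Optional[str]
    rollback_documented: bool
    continuous_monitoring: bool = False
    audit_interval_days: Optional[int] = None
    mistake_cost_bounded: bool = False


def unmet_for_level_c(e: Evidence) -> list[str]:
    """List the Level B to Level C criteria that are not yet satisfied."""
    checks = {
        "process stable for 30 days": e.days_since_major_edit >= MIN_STABLE_DAYS,
        "evals cover known failure modes": e.evals_cover_known_failures,
        "evals run before any change": e.evals_run_before_changes,
        "error rate below threshold": e.sampled_error_rate < MAX_ERROR_RATE,
        "corrections folded into encoding": e.corrections_folded_in,
        "named escalation owner": bool(e.escalation_owner),
        "documented rollback to Level B": e.rollback_documented,
    }
    return [name for name, ok in checks.items() if not ok]


def unmet_for_level_d(e: Evidence) -> list[str]:
    """Level D requires everything Level C does, plus three more criteria."""
    checks = {
        "continuous monitoring": e.continuous_monitoring,
        "audit cadence defined": e.audit_interval_days is not None,
        "cost of a mistake is bounded": e.mistake_cost_bounded,
    }
    return unmet_for_level_c(e) + [name for name, ok in checks.items() if not ok]
```

An empty list means the workflow is eligible for promotion. A non-empty list is the investment plan: it names the missing pieces rather than inviting anyone to lower the bar.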

Where to start

Pick a workflow that already has an encoded process from Track 1 and is reading from a compiled topic page from Track 2. Most teams are starting from Level A on this workflow — the agent is helping individual operators, but each operator is reviewing every output and re-doing the parts they don’t trust.

Move that workflow to Level B. Specifically:

The agent produces the full first draft using the encoded process and the compiled context.
The reviewer’s job is to review and finalize, not to redo. The reviewer’s standard is “is this acceptable to ship as-is or with minor edits?”, not “is this how I would have done it?”
Capture every meaningful correction the reviewer makes. After two or three weeks, the corrections should be folded back into the encoding.

Run that for thirty to sixty days. At the end of that window, you’ll know whether the workflow is ready for Level C. The signals are clear: the rate of meaningful reviewer corrections drops, the reviewer’s confidence in the agent’s output rises, and the team starts to feel the review step as friction rather than insurance.
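
Of the three signals, only the correction rate is easy to quantify. One rough way to check whether it is falling is sketched below. The helper names and the half-window comparison are assumptions made for illustration; reviewer confidence and the sense of friction still have to be gathered by asking people.

```python
from statistics import mean


def correction_rate(reviewed: int, corrected: int) -> float:
    """Share of reviewed outputs that needed a meaningful correction."""
    if reviewed <= 0:
        raise ValueError("need at least one reviewed output")
    return corrected / reviewed


def is_trending_down(weekly_rates: list[float]) -> bool:
    """True when the later weeks average a lower rate than the earlier weeks."""
    if len(weekly_rates) < 4:
        raise ValueError("need at least four weeks of data")
    half = len(weekly_rates) // 2
    return mean(weekly_rates[half:]) < mean(weekly_rates[:half])
```

Over a thirty-to-sixty-day window this yields four to eight weekly points. That is enough to see a direction, not enough to treat the result as statistically rigorous.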

When that happens, you are ready to consider Level C. Not before.

The two automation failure modes

The first is over-automation: pushing workflows up the autonomy stack before they have earned it. This is usually driven by leadership pressure to “show ROI on AI” and produces the failure pattern at the top of this chapter. The fix is the promotion criteria above. They are explicit, they are public, and they answer the question “why aren’t we further along?” Because we haven’t met the criteria yet, and shortcutting them costs more than waiting.

The second is automation paralysis — workflows that sit at Level A or B forever, even after they’ve earned promotion. This is usually driven by a single bad early experience, or by a culture that overweights the cost of automated mistakes and underweights the cost of unautomated drudgery. The fix is structural: every workflow at Level B should have a quarterly check on whether it has met the promotion criteria, and if it has, the default action is to promote.
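
The structural fix comes down to two rules: the check happens on a schedule, and a clean check defaults to promotion. A small sketch follows. The 90-day interval is an assumed reading of “quarterly,” and both function names are hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumed reading of "quarterly"


def review_due(last_review: date, today: date) -> bool:
    """True once a full review interval has passed since the last check."""
    return today - last_review >= REVIEW_INTERVAL


def quarterly_decision(unmet_criteria: list[str]) -> str:
    """Default to promotion; hold only when specific criteria are unmet."""
    if not unmet_criteria:
        return "promote"
    return "hold; invest in: " + ", ".join(unmet_criteria)
```

The point of the default is that holding a workflow back requires naming a specific unmet criterion, which leaves no room for a vague sense of unease to keep it at Level B.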

What success looks like

Six months in, if Track 4 is going well:

At least one workflow is operating at Level C, with documented promotion evidence.
Multiple workflows are operating at Level B, with active correction loops feeding back into Tracks 1 and 2.
The team has a shared, honest model of where the agents fail, not just where they succeed.
Decisions about where to push next are driven by the promotion criteria, not by enthusiasm or anxiety.

Track 4 is the track where leverage starts to compound visibly. Done right, it is also the track that creates the most defensible competitive position — which is what Chapter 7 is about.
