How to split architectural decisions and routine implementation across different AI models — and why your PRD.json makes it possible

If you’ve adopted a structured PRD.json to guide your AI coding sessions — atomic user stories, binary acceptance criteria, sequential priorities — you’ve already solved the biggest problem: agent drift. The agent knows what to do, in what order, and when to stop.

But there’s a second problem hiding in plain sight: you’re probably using the same model for everything.

That means your most capable (and most expensive) model is writing login forms, styling buttons, and wiring up list views — work that doesn’t require architectural reasoning, just clean execution against a clear spec. It’s the equivalent of paying a senior engineer to write boilerplate. They’ll do it well, but it’s a waste of what makes them valuable.

The fix is model delegation: use your strongest model for decisions that shape the system, and a faster, cheaper model for everything else.

The orchestrator/delegate split

The pattern is simple. In any coding session, there are two fundamentally different types of work:

Architectural work — decisions that affect the structure of the system. Where state lives. How components communicate. What the dependency graph looks like. How error handling flows across layers. These decisions are hard to reverse and expensive to get wrong. They require the model to hold a mental model of the entire system and reason about trade-offs.

Implementation work — building screens, wiring up controllers to existing APIs, writing UI components against a defined spec. This work is high-volume but low-ambiguity: the inputs and outputs are already decided, the patterns are established, and the task is execution, not design. The orchestrator defines the possibility space; the delegate operates within it.

In my workflow, the first type goes to the most capable model available — the orchestrator. The second type gets delegated to a faster model — the delegate. The orchestrator doesn’t just do the hard work; it also defines the work that gets delegated, writing the specs that the delegate will execute against.
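As a minimal sketch of that routing step, assuming a PRD.json with `userStories` entries carrying `priority`, `passes`, and a per-story `implementModel` field (the model names here are placeholders for whatever your strongest and fastest models are):

```python
import json

# Hypothetical routing sketch: split pending stories between an
# orchestrator model and a delegate model by their implementModel field.
ORCHESTRATOR = "opus"   # strongest model (assumed name)
DELEGATE = "sonnet"     # fastest model (assumed name)

def route_stories(prd_path):
    """Partition pending user stories by the model that should run them."""
    with open(prd_path) as f:
        prd = json.load(f)
    # Only stories that haven't passed yet, in priority order.
    pending = [s for s in prd["userStories"] if not s["passes"]]
    pending.sort(key=lambda s: s["priority"])
    # Anything not explicitly delegated stays with the orchestrator.
    orchestrator_queue = [s for s in pending if s.get("implementModel") != DELEGATE]
    delegate_queue = [s for s in pending if s.get("implementModel") == DELEGATE]
    return orchestrator_queue, delegate_queue
```

The defaulting direction matters: a story with no `implementModel` falls back to the orchestrator, so an unreviewed story can never accidentally reach the delegate.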

What this looks like in practice

Here’s a concrete case from a recent project — a mobile e-commerce app where I needed to build the cart and checkout flow.

The work broke down into roughly ten user stories. Some were architectural:

  • Design the reactive cart state management — where the controller lives, how it’s registered globally so multiple screens can access it, how quantity constraints are enforced
  • Restructure the dependency injection — a per-screen binding pattern was creating duplicate instances, so the cart repository needed to move to global registration
  • Design the checkout payment flow — an 11-step sequence involving payment intent creation, Stripe’s PaymentSheet, order confirmation, idempotency key management, and error handling for sold-out items, user cancellations, and double-tap scenarios

And some were pure implementation:

  • Build the cart screen — empty state with icon and CTA, item list with swipe-to-delete, quantity selectors, pinned bottom bar with totals
  • Build the checkout screen — order summary card, total section, pickup info, payment button with loading state
  • Build the success screen — animated checkmark, order details, add-to-calendar integration, navigation guards

The architectural stories required the model to reason about the system as a whole. The idempotency key, for instance, had been incorrectly scoped — each API call was generating a fresh UUID instead of sharing one across the payment intent and order confirmation. Catching that and restructuring it required understanding the payment flow end-to-end. Similarly, moving the cart from per-screen to global DI meant understanding which screens needed access and what the lifecycle implications were.
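The shape of that idempotency fix can be sketched as follows — the `api` helper methods here are hypothetical stand-ins for the project's actual payment client, not a real SDK surface:

```python
import uuid

# Sketch of the idempotency fix described above (hypothetical API helpers).
# The bug was a fresh UUID per API call; the fix scopes one key to the
# whole checkout attempt, so retries and double-taps replay the same
# operation instead of creating duplicate charges or orders.

def checkout(cart, api):
    # One key per checkout attempt, shared by every call in the flow.
    idempotency_key = str(uuid.uuid4())

    # Both calls carry the same key; the backend can detect replays.
    intent = api.create_payment_intent(
        amount=cart.total, idempotency_key=idempotency_key
    )
    # ... present the payment sheet, await confirmation ...
    order = api.confirm_order(
        intent_id=intent["id"], idempotency_key=idempotency_key
    )
    return order
```

The key is generated once, at the boundary of the user action, and threaded through every downstream call — never inside the individual API wrappers.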

The implementation stories, by contrast, were fully specified by the time the delegate received them. The cart screen didn’t require any architectural decisions — the controller existed, the repository existed, the data model existed. The task was: build a screen that displays this data, handles these interactions, and looks like this. A capable but faster model handles that cleanly.

How the PRD.json enables this

This split only works if the delegate model receives unambiguous instructions. If you hand a faster model a vague prompt like “build the checkout screen”, you’re back to the drift problem — the model will make architectural decisions it isn’t qualified to make, or invent requirements that don’t exist.

The PRD.json solves this. Each delegated user story arrives with:

  • A precise description of what to build
  • Acceptance criteria the model can verify mechanically
  • An implicit scope boundary — if it’s not in the criteria, it’s not in scope

The orchestrator’s job isn’t just to solve the hard problems. It’s to write user stories detailed enough that the delegate can execute them without architectural judgment. In practice, this means the orchestrator often rewrites or refines the acceptance criteria for implementation stories before the delegate touches them — adding file paths, specifying widget structures, naming the exact methods to call. This pre-delegation refinement serves the same function as AI-assisted intake filtering in bug reporting — transforming ambiguous inputs into structured, actionable specifications before execution begins.

{
  "id": "US-054",
  "title": "Cart screen",
  "implementModel": "sonnet",
  "acceptanceCriteria": [
    "Empty state: shopping bag icon + 'Start exploring' button routing to /home",
    "Item list: Dismissible with swipe-to-delete, CachedNetworkImage thumbnail, quantity selector with +/- clamped to 1-5",
    "Bottom bar: pinned, shows subtotal and total from CartController, checkout button disabled when cart is empty",
    "Clear cart: confirmation dialog before calling CartController.clearCart()",
    "Verify: flutter analyze — no errors"
  ],
  "priority": 54,
  "passes": false
}

No ambiguity. No architectural decisions left open. The delegate reads this story and executes it. If something isn’t specified — say, the exact animation for swipe-to-delete — the delegate makes a reasonable default choice within Flutter’s standard Dismissible behavior. That’s an acceptable margin of autonomy for implementation work. The underlying principle is the same one that governs all LLM integration: the model operates in a sandbox with no access to your intentions unless you explicitly provide them. The PRD.json is your context injection layer — every detail you include is all the delegate will ever know.
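One way to make that context-injection discipline concrete is to assemble the delegate's prompt entirely from the story's own fields — nothing outside the PRD.json reaches the model. The field names match the example story above; the prompt wording itself is a hypothetical choice:

```python
# Sketch: the delegate's prompt is built purely from the story object.
# If a detail isn't in the acceptance criteria, the delegate never sees it.

def build_delegate_prompt(story):
    criteria = "\n".join(f"- {c}" for c in story["acceptanceCriteria"])
    return (
        f"Implement user story {story['id']}: {story['title']}.\n"
        "Satisfy every acceptance criterion below. Anything not listed "
        "is out of scope.\n\n"
        f"Acceptance criteria:\n{criteria}"
    )
```

The explicit "anything not listed is out of scope" line encodes the implicit scope boundary from the PRD.json directly into the delegate's instructions.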

The principle: match the model to the cognitive load

The argument for model delegation isn’t primarily about cost savings — although those are real and will compound as your project grows. The argument is about fit.

Capable models are better at architectural work not because they write better code, but because they hold larger context, reason about trade-offs more reliably, and are less likely to make structural decisions that need to be reversed later. Using them for implementation work doesn’t just waste money — it wastes their strength. They’ll still produce good code, but you’re not leveraging the capability you’re paying for.

Conversely, faster models are better delegates not because they’re cheaper, but because they’re less likely to over-engineer. Give a highly capable model a simple screen to build, and it might refactor the navigation system along the way because it “noticed an improvement.” A model optimized for execution tends to do exactly what it’s told — which is precisely what you want for well-specified implementation work. This maps to the same principle that governs data quality in AI systems: the precision of the input determines the quality of the output, regardless of the model’s raw capability.

Token pricing changes constantly and varies across providers, so quoting specific numbers here would be misleading within months. The structural principle is stable: split by cognitive load, not by volume. A small number of expensive, high-reasoning calls for architecture plus a large number of fast, cheap calls for implementation will consistently outperform using one model for everything — regardless of what the per-token rates happen to be this quarter.

When NOT to delegate

Not every implementation story should go to the delegate. Here are the cases where I keep everything on the orchestrator:

When the implementation touches shared state. The checkout controller in the example above has error handling paths that affect the cart, the order repository, and navigation simultaneously. Even though it’s “implementation,” the blast radius of a mistake is architectural. It stays with the orchestrator.

When the acceptance criteria require judgment. If you find yourself writing criteria like “handle edge cases appropriately” or “ensure the UX feels responsive,” the story isn’t specified enough to delegate. Either tighten the criteria until they’re binary, or let the orchestrator handle it. The binary verification discipline — the same one that powers systematic post-commit testing of AI-written code — is what makes this judgment possible.

When the story is the first of its kind. The first screen in a new pattern — the first list view, the first form, the first modal flow — sets the template that all subsequent screens will follow. The orchestrator should build the template. The delegate should replicate it.

A useful heuristic: if the delegate would need to read other files to understand what to do, the story isn’t ready for delegation. A well-specified delegated story is self-contained — everything the model needs is in the acceptance criteria.
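The "criteria require judgment" check above lends itself to a mechanical pre-delegation lint. A minimal sketch — the flagged-phrase list is my own illustrative choice, not a definitive vocabulary:

```python
# Sketch of a pre-delegation lint based on the heuristics above.
# Phrases that usually signal non-binary, judgment-laden criteria.
JUDGMENT_PHRASES = [
    "appropriately", "as needed", "feels", "reasonable",
    "edge cases", "where applicable",
]

def delegation_warnings(story):
    """Return criteria that look too vague to hand to the delegate."""
    warnings = []
    for criterion in story["acceptanceCriteria"]:
        lowered = criterion.lower()
        hits = [p for p in JUDGMENT_PHRASES if p in lowered]
        if hits:
            warnings.append((criterion, hits))
    return warnings
```

A story that produces any warnings either gets its criteria tightened until they are binary, or stays with the orchestrator.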

Getting started

If you already have a PRD.json, adding model delegation is straightforward:

  1. Add an implementModel field to each user story. Start with just two values: your strongest model and your fastest model.
  2. Review your stories. Any story that involves structural decisions, new patterns, or cross-cutting changes stays with the orchestrator. Everything else is a candidate for delegation.
  3. Tighten the acceptance criteria on delegated stories. The delegate should be able to complete the story without reading any file that isn’t explicitly referenced in the criteria.
  4. Run the orchestrator first. Let it handle its stories and — critically — review and refine the delegated stories before the delegate starts. The orchestrator’s last job in each phase is to make sure the delegate’s specs are airtight.
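Step 1 can be done conservatively in a few lines: default every story to the orchestrator, and delegate only stories you explicitly opt in by id. A sketch, with model names and the opt-in mechanism as assumptions for illustration:

```python
import json

# Sketch of step 1, done conservatively: everything defaults to the
# orchestrator; only explicitly opted-in story ids go to the delegate.
def add_implement_model(prd_path, delegate_ids,
                        orchestrator="opus", delegate="sonnet"):
    with open(prd_path) as f:
        prd = json.load(f)
    for story in prd["userStories"]:
        story["implementModel"] = (
            delegate if story["id"] in delegate_ids else orchestrator
        )
    with open(prd_path, "w") as f:
        json.dump(prd, f, indent=2)
```

Because delegation is opt-in per story id, widening the delegate's share later is a deliberate edit, never a default.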

Start conservative. Delegate only the most obvious implementation stories — screens with no shared state, utility functions, simple CRUD endpoints. As you build confidence in the pattern, you’ll develop an intuition for where the architectural/implementation boundary falls in your specific project. This iterative confidence-building follows the same trajectory as test-driven pipeline optimization: start with the cases you understand well, verify the results, and widen the boundary as evidence accumulates.

This article is a companion to Stop Your AI Agent from Hallucinating: Use a PRD.json, which covers the foundational pattern of structuring requirements as machine-readable JSON. Model delegation is the natural next step — but the PRD.json comes first. Without structured, atomic user stories, there’s nothing to delegate.