Pattern · 2026-04-26
Picking the model tier for a specific job
A staging job on our side that should have cost a few cents once ran for three hours on a top-tier reasoning model, churning through context, before we noticed the bill. The job was: take a CSV of inbound leads, group them by stated industry, and flag duplicates. It needed string normalization and a fuzzy match. It did not need the model that costs ten times more per token to "think" about whether two rows describe the same company. We had let the default tier drift upward because a different job had benefited from the upgrade. Once a tier wins on one task, the temptation is to use it for everything. That's the mistake.
We keep a per-job tier sheet across our retainer clients. The headline: roughly four out of five jobs we run end up on the cheap tier, and stay there for the life of the project. The other one in five — the jobs where paying more genuinely pays back — share a small set of properties that we can name. Once you can name them, the routing question stops being intuition and starts being a checklist.
Why "always use the smartest model" is the expensive answer
The default modern advice for someone starting out is reasonable: pick the strongest model in the lineup and don't worry about the cost while you're learning. Cost discipline is a later problem. We agree with this for a single user with a hobby project. We disagree with it the moment a workflow runs on a schedule, against a queue, on behalf of someone else.
The shape of the bill changes. A hobby project might call the model fifty times in a week. A studio agent firing every hour for a month, against a backlog of tasks, with each task pulling in a stack of tools and a long context window, will call the model thousands of times. At that volume, a 10x price difference between tiers is not a rounding error. It's the difference between an automation that runs at break-even and an automation that loses money.
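To make the volume argument concrete, here is a back-of-the-envelope sketch. Every number in it is an illustrative assumption (call counts, tokens per call, and both prices are invented for the example), not a figure from our logs or from any vendor's price list.

```python
def monthly_cost(calls_per_month: int, tokens_per_call: int,
                 usd_per_million_tokens: float) -> float:
    """Model spend for one month at a flat per-token price."""
    return calls_per_month * tokens_per_call * usd_per_million_tokens / 1_000_000


# Illustrative assumptions only.
HOBBY_CALLS = 50 * 4          # about fifty calls a week
AGENT_CALLS = 24 * 30 * 5     # hourly runs for a month, five model calls per run
TOKENS_PER_CALL = 20_000      # long context plus tool output
CHEAP_PRICE = 1.0             # USD per million tokens (hypothetical)
TOP_PRICE = 10.0              # a 10x tier gap (hypothetical)

for label, calls in [("hobby", HOBBY_CALLS), ("agent", AGENT_CALLS)]:
    cheap = monthly_cost(calls, TOKENS_PER_CALL, CHEAP_PRICE)
    top = monthly_cost(calls, TOKENS_PER_CALL, TOP_PRICE)
    print(f"{label}: cheap ${cheap:.2f}, top ${top:.2f}, gap ${top - cheap:.2f}")
```

Under these made-up inputs the tier gap is $36 a month for the hobby project and $648 a month for the scheduled agent. The ratio is the same in both cases; only the absolute size of the gap changes, and that is the number the automation has to earn back.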
The trap is that the smartest tier is genuinely better at the hardest version of any task. If you only ever benchmark on the hardest version, the smartest tier always wins. The discipline is to ask: how often does this job actually look like the hardest version? For the dedupe-and-tag CSV job, the answer was: never. Even the messiest version of the job (company names with accented characters, mojibake from copy-paste, multiple corporate aliases) was a fuzzy-match problem the cheap tier solved correctly the first time once we wrote the prompt with one or two examples in it.
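For readers who want to see how small "string normalization and a fuzzy match" really is, here is a minimal sketch using only the Python standard library. It is an illustration of the problem's shape, not the production job: the suffix list, the 0.9 threshold, and the function names are assumptions, and it does not attempt mojibake repair.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "gmbh", "sa", "corp", "co"}


def normalize(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, drop legal-form suffixes."""
    decomposed = unicodedata.normalize("NFKD", name)
    # NFKD splits "é" into "e" plus a combining mark; drop the marks.
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch for ch in no_marks.lower() if ch.isalnum() or ch.isspace())
    return " ".join(tok for tok in kept.split() if tok not in LEGAL_SUFFIXES)


def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Fuzzy match on normalized names; the threshold is an assumed cutoff."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


print(is_duplicate("Société Générale S.A.", "Societe Generale"))  # True
print(is_duplicate("Acme Corp", "Zenith Labs"))                   # False
```

Corporate aliases that share no characters (a trading name versus a parent company, say) are where a string matcher like this stops and a model with a couple of examples in the prompt starts to earn its keep.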
Public release notes from a major commercial provider this year frame the newest top tier as delivering its biggest gains on agentic, multi-step work specifically — not on the routine tasks below it [1]. That is real progress, and it is genuinely useful for the small set of jobs that need it. But "stronger on the hard task" is not the same claim as "cheaper than the next-tier-down for an easy task." The cheap tier still wins on the easy task. The headline gain on agentic work doesn't change the routing logic — it sharpens it.
What the cheap tier actually handles
In our own setup, the cheap tier — whichever vendor's small, fast model we're using for a given client — handles the following without complaint, and we don't second-guess it:
- Structured extraction from a known schema. Pull these six fields out of an email, return JSON. The job is bounded and the model has seen ten thousand emails like it. A bigger model does not produce a more correct address.
- Classification into a small label set. Sentiment-like buckets, "is this support ticket about billing or about access," "is this lead in the studio's geography." The cheap tier with a few examples is at parity with a top tier on these, and it answers in a fraction of a second.
- Summarization of a single document under a few thousand words. A meeting transcript, an email thread, a contract excerpt. The summary the smaller tier produces is not noticeably worse for a human reader. It is sometimes better, because it has less room to wander into commentary.
- First-pass formatting. Convert this paragraph to bullet points. Convert these bullet points to a paragraph. Render this list as a markdown table. There is no thinking required.
- Yes/no decisions with explicit rules in the prompt. "Given this expense and these rules, is this reimbursable?" The cheap tier reads the rule, applies it, returns. The smarter tier reads the same rule and produces three paragraphs explaining its reasoning, which we then have to throw away.
What's striking when we look at our retainer logs is how many of the daily, recurring agent steps fall into one of these five buckets. We've found across our own runs that, project by project, more than 80% of the invocations the agent makes per week land here. Trying to route that volume through a top tier on principle is a way to set money on fire.
The five jobs where paying more pays back
The remaining ~20% are not random. They cluster.
1. Long-context reasoning where the answer depends on connecting two distant points in the input. A 40-page technical document where the question is "does any item in section 9 contradict any item in section 3." The cheap tier loses fidelity over distance. The expensive tier holds the whole thing in working memory and notices the contradiction. We've tested this both ways on every long-document review job we've taken; the cheap tier produces a confidently wrong answer often enough that it's not viable for this category.
2. Multi-step planning where each step changes what the next step should be. Booking, research, debugging — anything where the agent has to decide what to do next based on what it just learned. The cheap tier handles fixed-shape workflows fine. The case where the steps are decided at runtime is where the smarter tier's gains on agentic benchmarks show up in real work [2] — the model has to keep its goal, its progress, and its options all live, and the smaller tier drops one of them under load.
3. Code-heavy work where the cost of a wrong answer is debugging time. A two-line patch the smaller tier gets wrong is twenty minutes of human time. A two-line patch the smarter tier gets right is one minute of review. The math here is not "tokens cost X cents." It's "the human who reviews the output costs Y dollars per minute, and Y dwarfs X." This is the case where we route up most reflexively.
4. Adversarial or ambiguous input where the model has to push back. "Help me phrase this so my client doesn't notice the deadline slipped." The cheap tier will help. The smarter tier will more often catch the framing and offer the harder, more honest version. We've found this difference matters most on copy that goes out under our name.
5. The first pass on a job we've never seen before. Before we know the shape of the work, we use the strongest tier we can to find out where the difficulty actually lives. Once we know — once we've drafted a prompt and seen which subtask is hard and which is mechanical — we drop the mechanical parts to the cheap tier and keep the smart tier only on the hard part. The smart tier's job, in our setup, is often not to run the production workflow. It's to write the production workflow that the cheap tier then runs.
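The five properties above can be written down as a literal checklist. The sketch below is one way to encode it, assuming a step is routed up as soon as any single property applies; the class name, field names, and tier labels are hypothetical, not the contents of our sheet.

```python
from dataclasses import dataclass


@dataclass
class StepProfile:
    """One agent step, described by the five route-up properties."""
    long_range_context: bool = False   # answer links distant parts of a long input
    runtime_planning: bool = False     # next step depends on what the last one found
    costly_code_errors: bool = False   # a wrong answer costs human debugging time
    needs_pushback: bool = False       # ambiguous or adversarial request
    unseen_job: bool = False           # first pass, shape of the work still unknown


def pick_tier(step: StepProfile) -> tuple[str, list[str]]:
    """Return the tier and the properties that justified routing up."""
    reasons = [name for name, flagged in vars(step).items() if flagged]
    return ("top" if reasons else "cheap", reasons)


print(pick_tier(StepProfile()))                       # ('cheap', [])
print(pick_tier(StepProfile(runtime_planning=True)))  # ('top', ['runtime_planning'])
```

Returning the reasons alongside the tier matters more than the tier itself: it gives every route-up decision a one-line justification that can be revisited when the input or the rules change.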
How we actually test, instead of guessing
The temptation when picking a tier is to ask the smart tier to compare itself against the cheap tier. It will tell you it's better. A model judging its own output is, on average, biased toward that answer, so its verdict is not evidence.
What we do instead: take twenty real examples of the job, run both tiers on all twenty, score the outputs blind. Score for the metric that actually matters — not "is this answer plausible" but "would I ship this." If the cheap tier ships fifteen of twenty and the smart tier ships nineteen of twenty, the question is whether the four extra ships are worth the cost delta. In our experience the answer is often no, and we keep the cheap tier with a human-review flag on the outputs that score low on a confidence proxy.
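The blind comparison described above can be sketched in a few lines. This is a minimal harness under stated assumptions: `run_cheap` and `run_top` are placeholder callables standing in for whatever client calls each tier, the verdicts come from a human scorer, and the dollar figures in the usage line are invented.

```python
import random


def blind_pairs(examples, run_cheap, run_top, seed=0):
    """Run both tiers on every example and shuffle the outputs.

    Returns (blinded, key): the scorer sees only (id, output) pairs,
    while key maps each id back to its tier for unblinding afterwards.
    """
    items = []
    for ex in examples:
        items.append(("cheap", run_cheap(ex)))
        items.append(("top", run_top(ex)))
    random.Random(seed).shuffle(items)
    blinded = [(n, output) for n, (_, output) in enumerate(items)]
    key = {n: tier for n, (tier, _) in enumerate(items)}
    return blinded, key


def ship_counts(verdicts, key):
    """Count 'would I ship this' verdicts per tier after unblinding."""
    counts = {"cheap": 0, "top": 0}
    for n, shipped in verdicts.items():
        if shipped:
            counts[key[n]] += 1
    return counts


def cost_per_extra_ship(counts, cheap_cost, top_cost):
    """Extra spend per additional shippable output, or None if top adds none."""
    extra = counts["top"] - counts["cheap"]
    if extra <= 0:
        return None
    return (top_cost - cheap_cost) / extra


# The 15-versus-19 case from the text, with assumed batch costs in USD.
print(cost_per_extra_ship({"cheap": 15, "top": 19}, cheap_cost=0.20, top_cost=2.00))
```

With those assumed costs, each of the four extra shippable outputs costs about $0.45. Whether that is cheap or expensive depends on what a human-review pass on the cheap tier's misses would cost instead, which is the comparison the routing decision actually turns on.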
We've found that the eval matters more than the model choice it informs. When a new tier ships from the same vendor, we re-run the same twenty examples against it before we change any defaults. A new tier doesn't earn the upgrade until it earns it on the actual work. The eval is the asset; the routing decision is just what falls out of it.
What the routing sheet looks like in practice
We don't share the sheet. The sheet is part of the deliverable. But the structure is mundane: a list of agent steps in the project, each one tagged with its tier, the cost-per-run estimate, and a one-line note about what's special about that step. When a client asks "why is the agent suddenly slower this week," we open the sheet and look at the column where a step migrated from cheap to expensive — usually because the input got longer or the rules got more conditional — and decide whether to revert, rewrite the prompt to keep the work small, or accept the new cost.
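The structure is simple enough to sketch without sharing anything real. The row fields below mirror the columns described above; the field names, tier labels, and sample rows are hypothetical, and the migration check assumes two snapshots of the sheet taken a week apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SheetRow:
    step: str                # name of the agent step
    tier: str                # "cheap" or "top"
    cost_per_run_usd: float  # estimate, not a measured bill
    note: str                # what is special about this step


def migrated_up(last_week, this_week):
    """Rows whose step moved from the cheap tier to the top tier."""
    before = {row.step: row for row in last_week}
    return [
        row for row in this_week
        if row.step in before
        and before[row.step].tier == "cheap"
        and row.tier == "top"
    ]


# Invented sample data for illustration.
last_week = [
    SheetRow("extract_fields", "cheap", 0.002, "fixed schema"),
    SheetRow("policy_check", "cheap", 0.003, "explicit rules in prompt"),
]
this_week = [
    SheetRow("extract_fields", "cheap", 0.002, "fixed schema"),
    SheetRow("policy_check", "top", 0.030, "rules became conditional"),
]

for row in migrated_up(last_week, this_week):
    print(f"{row.step}: now {row.tier}, ${row.cost_per_run_usd:.3f}/run ({row.note})")
```

Each row that comes back is a prompt for the three-way choice: revert, rewrite the prompt to keep the work small, or accept the new cost.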
The discipline is not the sheet. The discipline is the habit of treating tier as a per-step decision rather than a project-wide default. If we route the whole project at the top tier "to be safe," we are not running an automation. We are running an expensive concierge.
There's a second move that follows from this, which is not obvious until the first time it bites you: a prompt that worked on the small model often gets worse when you hand it, unchanged, to the bigger one. We've seen public migration guidance from at least one major provider explicitly tell teams to treat a new top-tier model as a new family to tune for rather than a drop-in upgrade, and to start from the smallest prompt that preserves the task contract rather than carrying old instructions forward [2]. That instruction surprised us when we first read it. It shouldn't have. The cost of routing up is rarely just the bill.
References
- [1] openai.com, https://openai.com/index/introducing-gpt-5-5. Accessed 2026-04-26.
- [2] simonwillison.net, https://simonwillison.net/. Accessed 2026-04-26.