Token Cost Engineering
The hidden economics of running AI agents at scale
The Midnight Billing Anomaly
Your product manager is thrilled. The new AI agent automated customer data extraction flawlessly in staging. The latency was acceptable, the accuracy was impressive, and the executive team loved the demo. Two weeks into the production rollout, the VP of Engineering receives a PagerDuty alert. It is not an outage. It is a cloud billing anomaly. The agent is suddenly costing significantly more per transaction than the human operations team it was designed to augment.
Faced with this spike, a junior engineer immediately assumes a sudden surge in user traffic or a potential distributed denial-of-service attack. They start checking network logs and request metrics. A senior engineer knows better. They skip the network traffic and look directly at the agentic reasoning logs.
They know the culprit is an unchecked ReAct (Reasoning and Acting) loop. The agent encountered a poorly formatted document, failed to parse it, appended the failure message to its context window, and tried again. By the fifteenth iteration, the agent was sending 80,000 tokens of accumulated conversation history just to generate a 500-token apology. The system did not crash, but the economics did.
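The compounding is easy to reproduce on paper. The sketch below models a naive retry loop that re-sends its entire accumulated history on every turn; all token counts are illustrative placeholders, not figures from the incident above.

```python
def runaway_loop_tokens(base_context: int, failure_msg: int,
                        reply: int, iterations: int) -> int:
    """Total input tokens billed across every turn of a naive retry loop."""
    context = base_context
    total_input = 0
    for _ in range(iterations):
        total_input += context              # the entire history is re-sent each turn
        context += reply + failure_msg      # the failed attempt and the error get appended
    return total_input
```

With a 2,000-token base context and 800 tokens of new history per turn, fifteen retries bill 114,000 input tokens, nearly four times what fifteen flat passes of the base context would cost.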
Prompt Golfing is a Distraction
When faced with ballooning AI costs, the instinctive reaction across the industry is to rewrite prompts. Engineering teams will spend days on “prompt golfing”: aggressively trimming system instructions, removing helpful adjectives, or deleting few-shot examples to shave off a few hundred tokens. They do this because the organisational incentives favour it: editing a text file is fast, highly visible, and requires zero architectural approval from management.
This is entirely the wrong battlefield.
The reality is that your prompt length is rarely the problem. The true cost driver in modern AI architectures is the unchecked accumulation of context in autonomous loops combined with the massive premium placed on output tokens. Across major providers, output tokens carry a heavy price multiplier compared to input tokens. For example, according to Anthropic’s published pricing as of this writing, Claude 3.5 Sonnet charges five times more for generated output tokens than for ingested input tokens. While specific vendor pricing will inevitably fluctuate over time, this heavy output multiplier remains a structural constant across the industry. This disparity exists because processing inputs (the prefill phase) parallelizes efficiently across GPUs. Generating outputs (the decode phase) is bound by memory bandwidth and must occur sequentially, one token at a time.
When an agent enters a planning loop, it naturally accumulates state. It holds onto system prompts, tool definitions, conversation history, and retrieved document chunks. A ten-cycle ReAct loop does not just consume ten times the tokens of a single pass. Because the context grows at every single turn, the payload inflates continuously. Consider a baseline calculation: if your system prompt and initial task consume 1,000 tokens, step one passes 1,000 tokens to the API. Step two passes the initial 1,000, plus the 200-token reasoning output from step one, plus a 300-token tool result, totalling 1,500 tokens.
By step ten, the agent is reprocessing thousands of tokens of accumulated history from the previous nine steps just to decide what to do next. You are not paying for your system prompt once. You are paying for it repeatedly, while the model generates premium-priced reasoning traces at every step. Being overly frugal with the initial system prompt is like worrying about the cost of the steering wheel on a sports car while leaving the engine running overnight.
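To make the arithmetic concrete, here is a rough cost model for that loop. It assumes the full history is re-sent at every step and uses illustrative per-million-token defaults matching the $3 input / $15 output rates cited above; the helper name is made up for this sketch.

```python
def loop_cost_usd(system_tokens: int, reasoning_out: int, tool_result: int,
                  steps: int, in_price: float = 3.0, out_price: float = 15.0) -> float:
    """Cost in USD of a ReAct loop that re-sends its full history every step.

    in_price and out_price are USD per million tokens (illustrative defaults).
    """
    context = system_tokens
    input_tokens = 0
    output_tokens = 0
    for _ in range(steps):
        input_tokens += context                 # the whole history goes in as input
        output_tokens += reasoning_out          # premium-priced generated tokens
        context += reasoning_out + tool_result  # history grows every turn
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

Running the article’s numbers (1,000-token prompt, 200-token reasoning outputs, 300-token tool results) over ten steps bills 32,500 input tokens against only 2,000 output tokens, about $0.13 per task: even at a 5x output premium, the re-sent history dominates the bill.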
The Orchestration-Execution Architecture
To build sustainable token economics, you must separate the intelligence from the manual labour. Look at how large-scale deployments manage this. In a case study published by the engineering team at Morph, an AI development framework, the team reported a staggering difference in operating costs. A naive implementation running 100 multi-turn calls through a flagship model like Claude 3 Opus cost $6.00 per session. By separating orchestration from execution and implementing intelligent routing to direct easier tasks to a lighter model like Claude 3 Haiku, they reduced the cost to $1.26 per session. That is an almost 80% reduction without sacrificing output quality.
Instead of relying on a single frontier model for every operation, consider splitting the architecture into two distinct layers.
First, you build the Planner. You use a flagship reasoning model strictly for orchestration. It reads the short task description, plans the workflow, and decides which internal tools to call. Because its inputs are highly structured and its outputs are limited to brief routing instructions, its token consumption remains strictly bounded.
Second, you deploy the Executor. You hand off the actual heavy lifting to a cheaper, faster model. The executor performs the repetitive tasks: summarising massive retrieved documents, extracting specific JSON fields, and converting data formats.
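A minimal sketch of this split, assuming a generic `call_llm(model, prompt)` callable that you would wire to your provider’s SDK; the model names are hypothetical placeholders, not real model identifiers.

```python
PLANNER_MODEL = "flagship-reasoner"   # expensive model, orchestration only (hypothetical name)
EXECUTOR_MODEL = "lightweight-fast"   # cheap model for the heavy lifting (hypothetical name)

def run_task(task: str, documents: list[str], call_llm) -> str:
    """call_llm(model, prompt) -> str is injected so any provider SDK can be wired in."""
    # 1. The planner sees only a short task description, never the raw documents.
    plan = call_llm(PLANNER_MODEL, f"Plan the steps to accomplish: {task}")
    # 2. The executor does the token-heavy reading on each document.
    summaries = [
        call_llm(EXECUTOR_MODEL, f"Following this plan:\n{plan}\n\nSummarise:\n{doc}")
        for doc in documents
    ]
    # 3. The planner composes the final answer from compact summaries only.
    return call_llm(PLANNER_MODEL,
                    f"Combine these results for '{task}':\n" + "\n".join(summaries))
```

The flagship model is invoked exactly twice per task regardless of document count, while the per-document work scales on the cheap tier.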
Organisational friction almost always appears here. Product stakeholders will push back against this separation. They will demand the use of the smartest possible model for every single step to guarantee maximum quality, viewing cheaper models as a risk to the user experience.
If your enterprise contract strictly locks you into a single vendor’s flagship model and you lack the mandate to build a routing layer, you cannot deploy a separate Executor. In these highly restricted environments, your only survival tactic is strict context bounding. You must implement a sliding window that forcefully truncates the conversation history passed back to the model, keeping only the most recent tool outputs. Combine this with hard iteration caps to stop the compounding token accumulation before it triggers a billing alert.
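A minimal sketch of that survival tactic: a sliding window that always preserves the system prompt while truncating older turns, plus a hard iteration cap. The message shape and the constants are assumptions for illustration, not any specific vendor’s API.

```python
MAX_ITERATIONS = 8     # hard cap: stop the loop even if the task is unfinished
WINDOW_TURNS = 6       # keep the system prompt plus only the most recent turns

def bounded_history(messages: list[dict], window: int = WINDOW_TURNS) -> list[dict]:
    """Sliding window: the system prompt always survives, older turns are dropped."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-window:]

def run_agent(step_fn, messages: list[dict]) -> dict:
    """Run step_fn(history) -> reply dict until it reports done or the cap trips."""
    for _ in range(MAX_ITERATIONS):
        reply = step_fn(bounded_history(messages))
        messages.append(reply)
        if reply.get("done"):
            return reply
    raise RuntimeError("iteration cap reached; aborting before costs compound")
```

The cap converts an open-ended financial liability into a bounded, observable failure you can alert on.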
Your job as an engineering leader is to reframe this conversation. When you do have the freedom to build a routing layer, using a fast, sub-cent model for execution actually improves system reliability. The massive cost savings generated by the Executor layer allow you to afford multiple automated validation and self-correction cycles. You can afford to have a cheap model double-check its own work, an architectural luxury you could never justify if you were running the flagship model for every single API call.
Negotiating Unit Economics with Product
Building a routing layer requires engineering time and introduces architectural complexity. When a stakeholder insists on shipping the single-model prototype to production because they believe it already works, you need a precise way to communicate the financial liability. You cannot simply complain about the cloud bill. You have to translate token usage into business margins.
Consider using this exact script in your next architecture review:
“If we use the flagship model for the entire workflow, our P95 cost per task will be [X], which scales to [Y] monthly once we hit our user targets. This completely erodes the profit margin of the feature. Instead, we need to introduce a routing layer. We will use the premium model exclusively for the complex planning steps, and route the high-volume reading and formatting steps to a lightweight model. This gives us the exact reasoning power we need while keeping our unit economics viable. If there are concerns about quality degradation, we can take a fraction of the cost savings and use it to run automated evaluation checks on the final output.”
This framing completely disarms the typical quality-over-cost argument. It acknowledges the need for high-level reasoning while demonstrating that using premium models for basic data extraction is an inefficient allocation of computing resources.
Implementing Hard Agent Budgets
Unmonitored token spend behaves like a silent financial leak in your architecture. If you rely entirely on the monthly cloud invoice to tell you how your agents are behaving in production, you are discovering architectural flaws weeks too late.
This week, select one agentic workflow or generative AI pipeline currently running in your codebase. Open your observability dashboard and spend 60 minutes examining your telemetry logs to calculate the P95 token spend per task.
Focus specifically on identifying sessions that stayed under your timeout limits but generated massive, expensive outputs due to internal retry loops. If your calculated P95 token spend exceeds your team’s acceptable baseline, do not attempt to refactor the entire architecture immediately. Instead, file a single ticket to implement a strict maximum-iterations cap before next sprint planning.
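If your telemetry exposes per-session token totals and durations, the P95 calculation and the “quiet but expensive” filter can be as simple as the sketch below; the field names are hypothetical and will differ in your logging schema.

```python
import math

def p95_tokens(session_totals: list[int]) -> int:
    """P95 token spend per task, using the nearest-rank percentile method."""
    if not session_totals:
        raise ValueError("no sessions logged")
    ordered = sorted(session_totals)
    rank = math.ceil(0.95 * len(ordered))   # nearest-rank: the 95th percentile position
    return ordered[rank - 1]

def quiet_but_expensive(sessions: list[dict], token_threshold: int,
                        timeout_s: float) -> list[dict]:
    """Sessions that finished inside the timeout yet burned an outsized token
    budget: the signature of an internal retry loop."""
    return [s for s in sessions
            if s["duration_s"] < timeout_s and s["tokens"] > token_threshold]
```

A session that completes in twelve seconds but consumes 90,000 tokens will never trip a latency alert, which is exactly why it needs its own filter.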


