I spun up an agent army to automatically find and fix bugs across my codebase. Five different coding agents. Multiple $200/month subscriptions. The goal was simple: let the machines do the grunt work.
I burned through 15% of my weekly API quota in under an hour.
The agents found almost 2,000 potential bugs. Investigated 747. Identified 445 unique root causes. Fixed 70.
Total cost: $750 in API tokens.
Everyone talks about what AI agents can do. Almost nobody talks about what it costs to actually run them once you move past demos into production-scale testing. I did. And the numbers reveal something most people building "autonomous" systems don't want to admit.
The Capability Wall Isn't Where You Think It Is
The first surprise: agents are more capable than I expected. But they need proper context management and orchestration to get there.
The system worked. It inspected hundreds of features, filed investigations, triaged bugs, and started fixing them. The problem wasn't that agents couldn't do the work.
The problem was they couldn't tell which work mattered.
Almost 2,000 bugs identified. Then filtered down to 747 investigations. Then 445 unique bugs. Then 70 fixes. That's a 96.5% reduction from initial detection to actual resolution.
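The funnel arithmetic is easy to check. A quick sketch using the (rounded) numbers from the run:

```python
# Bug-triage funnel from the experiment. "detected" is rounded from
# "almost 2,000"; the other counts are as reported.
stages = {
    "detected": 2000,
    "investigated": 747,
    "unique_root_causes": 445,
    "fixed": 70,
}

# Overall reduction from initial detection to actual resolution.
reduction = (1 - stages["fixed"] / stages["detected"]) * 100
print(f"Reduction: {reduction:.1f}%")  # → Reduction: 96.5%
```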
The agents have never used the product. They don't know which exceptions actually get hit and which don't.
Agents can read code. They can spot patterns. They can identify edge cases and potential failures. What they can't do is understand usage. They flag theoretical problems in code paths that never execute in production. They treat a rarely-hit exception in a deprecated feature the same as a critical bug in the checkout flow.
Only about 10% of what agents identify at this scale actually matters.
That's the gap nobody discusses when they demo agent capabilities. The triage-to-fix ratio isn't a minor efficiency problem. It's a fundamental limitation that changes the entire economic model.
The Real Constraint Is Economics, Not Intelligence
Here's what it took to run this experiment:
Five different coding agents: Claude Code, Cursor, Codex, Droid, and Warp. All of them exhausted their usage limits in less than one hour. Not the weekly limits. The 5-hour rolling window limits.
I estimate I can run about 7 cycles per week per account before hitting weekly caps. To run this effectively, I'd need 10 Claude Code and Codex subscriptions at $200/month each.
That's $2,000/month in subscriptions. Just for the accounts. On a small codebase.
And that's actually cheap compared to what production agent systems cost. A single AI agent in production can run $5,000-$50,000+ per month in API fees alone, with token costs representing 70-90% of total spend. Most teams underestimate their agent costs by 3-5x from initial projections.
The $750 experiment was efficient. The problem is it's not scalable without burning significantly more capital.
If running them were cheap, it would usually be worth just fixing more. But it isn't cheap, so I have to prioritize.
This is the part that breaks the "agent armies will replace developers" narrative. Capability isn't the bottleneck anymore. It's pure economics and artificial rate limits.
A coding agent fixing a single bug might consume 50,000-200,000 tokens across planning, file reading, code editing, testing, and verification. Multiply that by 747 investigations and you see why quotas evaporate.
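You can see the scale of the quota problem by multiplying out the per-task range quoted above. The blended price per million tokens below is an assumption for illustration, not a quoted rate:

```python
# Back-of-envelope token spend for the 747 investigations.
investigations = 747
tokens_low, tokens_high = 50_000, 200_000  # tokens per task (range from the text)
price_per_million = 10.0                   # assumed blended $/1M tokens

for label, per_task in (("low", tokens_low), ("high", tokens_high)):
    total_tokens = investigations * per_task
    cost = total_tokens / 1_000_000 * price_per_million
    print(f"{label}: {total_tokens / 1e6:.1f}M tokens ≈ ${cost:,.0f}")
```

At the assumed rate, that range brackets tens of millions of tokens and hundreds to thousands of dollars per run, which is consistent with the $750 bill.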
The Supervision Tax Nobody Mentions
I can't leave the agents alone for more than 20 minutes.
They get stuck on things like not knowing how to comment on GitHub. Or how to request a review. Basic workflow operations that any junior developer learns in week one.
You can't run it while you're sleeping or doing anything else.
This contradicts every "autonomous agent" pitch you've seen. The reality is constant babysitting. Five tools running simultaneously. Each one capable of hitting a roadblock that stops progress until a human intervenes.
My solution: build a meta-layer of agents to manage the agent army.
An orchestrator agent that posts GitHub issues when sub-agents hit problems. An unblocking agent that tries to resolve common issues automatically. A monitoring agent that checks those GitHub issues and spawns new chat sessions to solve them.
About 10% of my $750 spend went to this orchestration layer. The other 90% was actual bug-finding and fixing work.
That 10% orchestration tax is well-documented in production systems. As multi-agent systems scale, coordination among numerous agents creates communication overhead, message congestion, and performance bottlenecks. Enterprises have to invest in orchestration software, skilled engineering teams, and continuous monitoring infrastructure just to keep things running.
But even with orchestration, you still need human supervision. Organizations still require human oversight for actions like accessing sensitive data, making system changes, or granting permissions. The trust isn't there yet for full autonomy in high-stakes scenarios.
The pitch: one engineer can manage 50 agents the way a senior engineer manages 50 teammates.
The reality: you're telling agents how to unblock themselves every 20 minutes, building meta-agents to manage sub-agents, and hoping nothing critical breaks while you're in a meeting.
The Reliability Compounding Problem
A single agent with 95% reliability sounds decent.
Chain five of them together and the system's combined reliability drops to about 77%. Chain ten agents and you're at 60%.
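The compounding is just independent failure probabilities multiplied through the chain:

```python
# Combined reliability of a chain of agents, each independently 95% reliable.
# Assumes the chain fails if any single agent fails.
def chain_reliability(per_agent: float, n_agents: int) -> float:
    return per_agent ** n_agents

print(f"{chain_reliability(0.95, 5):.0%}")   # → 77%
print(f"{chain_reliability(0.95, 10):.0%}")  # → 60%
```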
This is why the bug resolution rate collapsed from 2,000 potential issues to 70 fixes. Every agent in the chain introduces another failure point. More intermediate steps. More opportunities for lossy summarization. More places for context to degrade.
I experienced this firsthand. Agents that could individually solve problems got stuck when orchestrated together. The complexity didn't scale linearly. It compounded.
Some enterprises are paying what researchers call a "swarm tax" for architectures whose advantage comes from spending more computation rather than reasoning more effectively. Without proper baselines, you can't tell if your multi-agent system is actually better or just more expensive.
My 10% orchestration overhead might be understated. What enterprises often underestimate is that orchestration isn't free. Every additional agent introduces communication overhead that eats into the theoretical efficiency gains.
When Is an Experiment Ready to Open-Source?
I'm considering releasing this as open-source. The framework works. The orchestration layer exists. Someone could theoretically point it at their codebase and let it run.
But they'd still need to do the hard parts.
The constant monitoring. The $2,000/month in subscriptions. The quota management. The 20-minute check-ins to unstick agents from basic workflow operations.
They could point the agents at their codebase, and the agents would figure out the code and apply the framework to their use case. But that doesn't spare them the hard work of actually monitoring all of it.
This is the gap between demo-ready and production-ready. Most agent experiments showcase what's possible under ideal conditions with human supervision. Few discuss sustained economics or operational overhead.
More than 40% of today's agentic AI projects could be canceled by 2027 due to unanticipated cost, complexity of scaling, or unexpected risks. The autonomous AI agent market could reach $35 billion by 2030 if orchestrated well. But "if orchestrated well" is doing a lot of work in that sentence.
Open-sourcing the framework would give people the orchestration layer. It wouldn't give them the operational knowledge of when to intervene, how to prioritize the 10% of bugs that matter, or how to structure their workflow so agents don't burn through quotas on theoretical problems.
What This Signals About the Next 12 Months
My answer to the question of what needs solving first is direct: the only thing that limits me personally is the cost and the usage limits.
Not agent capability. Not orchestration complexity. Not supervision requirements.
Pure economics and artificial rate limits.
If Anthropic and OpenAI removed usage caps tomorrow and cut prices in half, I'd run the agent army 24/7. The capability is there. The tooling works. The bottleneck is API economics.
That's a very different story than "agents aren't ready yet."
My final advice for teams considering building their own agent army: never run this on API tokens. You'll get burned instantly. Buy Claude Code and Codex subscriptions at the most expensive tier, in bulk, one per teammate or spread across multiple accounts. But don't pay per token.
Translation: the pricing models aren't built for this use case yet. API tokens are metered for typical usage patterns. Agent armies consume tokens at rates that make per-token pricing unsustainable. Flat-rate subscriptions with usage caps are the only economically viable option right now.
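The gap between the two pricing models is stark at agent-army volumes. Using the token estimates from earlier in this post and the same assumed per-token rate (both illustrative, not quoted prices):

```python
# Metered API pricing vs. flat-rate subscription at agent-army volumes.
# weekly_tokens is the rough upper bound from the 747-investigation run;
# the per-million price is an assumption for illustration.
weekly_tokens = 150_000_000
api_price_per_million = 10.0
flat_rate_monthly = 200.0  # one top-tier subscription

metered_monthly = weekly_tokens * 4 / 1_000_000 * api_price_per_million
print(f"metered: ${metered_monthly:,.0f}/mo vs flat: ${flat_rate_monthly:,.0f}/mo")
```

Under these assumptions, metered pricing runs roughly 30x the cost of a single flat-rate subscription, which is exactly why the caps on those subscriptions become the binding constraint.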
But even those caps get exhausted in hours when you run at scale.
The next 12 months won't be about agents getting smarter. They're already smart enough to find bugs, write fixes, and coordinate across multiple tools. The next 12 months will be about whether API providers adjust their pricing and rate limit models to match how people actually want to use these tools.
Because right now, the gap between "this works" and "this is affordable at scale" is wider than most people building agent systems want to admit.
I proved agents can inspect hundreds of features, triage thousands of potential bugs, and fix dozens of real issues. I also proved it costs $750 for 70 fixes on a small codebase, requires constant human supervision, and exhausts multiple $200/month subscriptions in under an hour.
Both things are true. And the second part is what determines whether agent armies become infrastructure or remain expensive experiments.
