Security

My AI agents were just burning tokens...

May 18, 2026

I ran a full scan against a customer last week. Real authed scan, real Django app, real money on the clock. The scan produced 6 validated findings, which is a fine result for a first pass on a new target.

Then I broke down where the cost actually went:

62.7% of agent token spend produced zero findings.
All 6 findings came from deterministic primitives, such as rate-limit enumeration and JWT analysis.
LLM agents for lead investigation and unauthed API exposure produced nothing.
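
A breakdown like this takes very little code. The following is a minimal sketch, not the actual tooling: it assumes each agent run is logged with its token count and the number of validated findings it produced. `AgentRun` and `zero_finding_token_share` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """One logged agent run (hypothetical record shape)."""
    name: str
    tokens: int    # total tokens the run consumed
    findings: int  # validated findings attributed to the run


def zero_finding_token_share(runs):
    """Return the fraction of token spend that went to runs with no findings."""
    total = sum(run.tokens for run in runs)
    if total == 0:
        return 0.0
    wasted = sum(run.tokens for run in runs if run.findings == 0)
    return wasted / total
```

Calling `zero_finding_token_share(runs)` on a scan's run log gives the share of spend that bought nothing, which is the number worth tracking from scan to scan.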

This was the most uncomfortable scan review I've done in weeks. And it forced a question I'd been avoiding.

My "AI agents" were for-loops wearing LLM costumes

Of the agents in my tool, most weren't doing real reasoning. They were picking a URL from a list, calling a tool (e.g. fetch), and checking the result. A 5-line loop could have done the same work for free. I was spending LLM budget on URL selection that doesn't require an LLM at all.
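
For a sense of scale, here is roughly what that loop looks like. It is a sketch under assumptions: `fetch` and `looks_interesting` are hypothetical callables standing in for whatever HTTP client and result check a real scanner would use.

```python
def scan_urls(urls, fetch, looks_interesting):
    """Fetch each candidate URL and keep the ones whose response passes the check."""
    hits = []
    for url in urls:
        response = fetch(url)            # the "tool call"
        if looks_interesting(response):  # the "check the result" step
            hits.append(url)
    return hits
```

There is no model call anywhere in it. The selection, the tool call, and the check are all plain iteration.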

This isn't unique to my tool. It's the dominant pattern across AI products right now. "Agentic" gets used so loosely that it covers everything from genuine multi-step reasoning to a glorified for-loop with extra tokens.

The real test is to measure: of the findings your tool produces, how many came from agents doing something a deterministic primitive couldn't? For me the answer last week was: none of the 6.

What I did about it

The reflex move is to say "the model wasn't smart enough, switch to a smarter one." That's the wrong move, and it's the move that drives most AI tooling costs to spiral. The right move is to pull the LLM out of the places it isn't earning its keep and put it where it actually can.

Two changes followed:

1. Domain model enrichment -- The recon layer produces a structured map of the application: resources, actions, fields, sample IDs. That's deterministic. But there are things about each resource that the regex-based map can't infer. For example: which actions actually require authentication (versus appearing to)? What semantic verb does the action represent (regenerate, export, upload)? What fields does the request body actually accept in practice?

These require reading response bodies and reasoning about what the application does. That's genuine LLM judgment, applied once per scan, that costs ~1¢ and produces structured data the rest of the pipeline can use deterministically.
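
The key to making that work is validating the LLM's answer into a fixed shape before anything downstream touches it. Below is a minimal sketch, assuming the model is asked to return a JSON array with one object per action. The field names (`requires_auth`, `semantic_verb`, `accepted_fields`) and the verb list are illustrative assumptions, not the tool's real schema.

```python
import json
from dataclasses import dataclass, field

# Hypothetical closed vocabulary; anything else is normalized to "other".
ALLOWED_VERBS = {"create", "read", "update", "delete",
                 "regenerate", "export", "upload", "other"}


@dataclass
class ActionEnrichment:
    resource: str
    action: str
    requires_auth: bool
    semantic_verb: str
    accepted_fields: list = field(default_factory=list)


def parse_enrichment(raw_json):
    """Validate the LLM's JSON output into typed records, rejecting loose types."""
    records = []
    for item in json.loads(raw_json):
        if not isinstance(item["requires_auth"], bool):
            raise ValueError("requires_auth must be a JSON boolean")
        verb = item.get("semantic_verb", "other")
        if verb not in ALLOWED_VERBS:
            verb = "other"
        records.append(ActionEnrichment(
            resource=str(item["resource"]),
            action=str(item["action"]),
            requires_auth=item["requires_auth"],
            semantic_verb=verb,
            accepted_fields=[str(name) for name in item.get("accepted_fields", [])],
        ))
    return records
```

Once the output is in this form, every later stage can filter on `requires_auth` or `semantic_verb` with ordinary code and no further model calls.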

2. Attack suggester with roadmap signals -- Once the application is mapped, an LLM looks at the structured surface and proposes attacks. Not "test SQL injection on every URL" (that's a for-loop) but "this resource has an is_admin field in its update action, that's a mass assignment shape" or "this /api/v1/ai/query endpoint is an AI surface, try prompt injection." The suggester routes to existing primitives when there's a match, and logs proposals for primitives that don't exist yet. After enough customer scans, the most-suggested missing primitive is the next one to build. The product tells me what to build next, based on what real customer applications actually have.
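
The routing half of this is deterministic and small. Here is a sketch of the idea, assuming the suggester emits `(primitive_name, target)` pairs and that existing primitives live in a dict keyed by name. Both the shapes and the names `route_suggestions` and `missing` are hypothetical.

```python
from collections import Counter


def route_suggestions(suggestions, primitives, missing):
    """Split LLM suggestions into runnable work and a tally of missing primitives.

    suggestions: iterable of (primitive_name, target) pairs
    primitives:  dict mapping primitive name to a callable
    missing:     Counter that accumulates across scans
    """
    planned = []
    for name, target in suggestions:
        if name in primitives:
            planned.append((primitives[name], target))
        else:
            missing[name] += 1  # roadmap signal: suggested, but not built yet
    return planned
```

Keeping one `Counter` across customer scans means `missing.most_common(1)` names the primitive that real applications most often call for.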

Both of these are places where the LLM does something a regex genuinely can't. Reading response bodies semantically. Spotting attack-shaped field clusters. Recognizing that /api/v1/ai/* is an AI surface and routing differently. That's judgment, not iteration.

The reframe

What I've stopped trying to do is make my for-loop agents smarter. They're not the problem. The problem is that I was paying LLM rates for work a deterministic primitive should be doing.

What I've started doing is being deliberate about where the LLM actually earns its keep. The pattern I'm landing on: deterministic primitives do the iteration, LLM judgment shapes which iterations to run and what to make of the results. Most of the budget moves to the deterministic side; the LLM cost shrinks, and what remains pays for judgment that a regex couldn't replicate.

I'm still measuring the net result, but early signs are: lower token cost, more findings per dollar, and a backlog of new primitives surfaced by the suggester that I wouldn't have prioritized on my own.

The uncomfortable measurement was worth doing. I'd recommend it to anyone building with AI agents right now: actually count what your agents did versus what they cost. If most of your tokens are paying for URL selection from a candidate set, you've got the same problem I did, and the fix isn't a bigger model.

Week 6 next Tuesday.

About the author

I'm an AWS-certified cloud architect from New York who loves writing about DevSecOps, Infrastructure as Code and Serverless. Having run a tech company myself for years, I love helping other start-ups scale using the latest cloud services.
