I added an AI agent whose only job is to prove a finding is wrong

June 28, 2026

The most useful thing I built in the last two weeks is a step that exists to disagree with the rest of the scanner.

It's an LLM agent I call the skeptic. Its entire job is to take a finding an agent is about to report and try to prove it's a false positive. If it succeeds, the finding gets dropped. If it can't, the finding stays. The interesting question isn't what it does. It's why it has to be a separate thing at all, instead of the scanner just checking its own work.

Why a tool can't check its own work

The obvious objection to building a skeptic is that the scanner should just be more careful in the first place. Why not have the thing that produces a finding also verify it?

Because a checker built from the same logic as the producer shares the producer's blind spots. If the thing that made the mistake checks its own work, it makes the same mistake again. To catch a wrong conclusion you need something with a different vantage point, a different incentive, and the freedom to run new experiments. Neither of the things that produce findings has those.

The deterministic detection rules can't self-check, because re-running a rule is circular. A detection primitive is fixed code: if I see a 200 plus this marker, call it a vulnerability. If that rule is wrong, asking it to re-check just applies the same wrong rule and gets the same wrong answer. It has no model of impact. It can't reason that a 200 on this particular page means nothing, or that a database error string is not the same as actually extracting data. It only knows its marker fired.

The LLM agents can't self-check either, because they already used their best reasoning to conclude the finding was real. Asking the same agent "are you sure?" in the same context, with the same framing, tends to confirm. That's just anchoring. The agent's job was to find bugs, so it leans toward flagging, and a self-check inherits that lean.

The move that actually works is splitting "find" from "refute" into opposite incentives. The producer is recall-biased: it wants to catch every bug, so it errs toward flagging. The skeptic is precision-biased: its only job is to disprove. It's the prosecutor-versus-defense split, or the author-versus-reviewer split. You don't ask the author to review their own code and call it an independent review. The independence is the whole point.

The failure mode replay can't catch

This matters because of a specific gap in how findings normally get validated.

Most security tools validate a finding by replaying it: re-send the request that triggered it, check whether the same response comes back. That catches findings that don't reproduce because the target changed between the scan and the report. Useful, but it asks the wrong question.

If the scanner's logic was wrong from the start, replaying that same logic just agrees with itself. A primitive sees a status 200 on some (hidden) endpoint and calls it a vulnerability, when a 200 there actually means nothing. Re-running the primitive produces the same 200 and the same wrong conclusion. The finding passes validation and lands in the customer's report.

Replay asks "does this still reproduce?" The question that matters is "was this ever real?" You can't answer the second question by re-running the probe that made the mistake. You need the skeptic.
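To make the circularity concrete, here is a minimal sketch. It assumes a hypothetical primitive that flags any 200 response containing a marker string; the names Response, marker_primitive, replay_validate, and the stubbed fake_target are illustrative, not the scanner's actual code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Response:
    status: int
    body: str


def marker_primitive(resp: Response) -> bool:
    """Hypothetical detection rule: a 200 plus a marker string means 'vulnerable'."""
    return resp.status == 200 and "reset" in resp.body.lower()


def replay_validate(send: Callable[[], Response]) -> bool:
    """Replay: re-send the original request and re-apply the same rule."""
    return marker_primitive(send())


def fake_target() -> Response:
    # Stand-in for the live target: the page merely mentions "reset"
    # and returns 200 whether or not anything actually happened.
    return Response(200, "<html>Reset database</html>")


first_pass = marker_primitive(fake_target())
replayed = replay_validate(fake_target)
print(first_pass, replayed)  # both True: replay confirms the rule, not the bug
```

Replay only ever re-evaluates marker_primitive, so a wrong rule passes its own validation every time the target responds the same way.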

How it works

The skeptic runs after a finding has already survived normal validation. It's the last gate before a finding is allowed into a report, and the loop is deliberately adversarial.

Hand the finding to an LLM as a skeptic. It gets the title, the description, the category, the request that was sent, the response that came back, and the evidence. The instruction is hostile to the finding: assume this is a false positive, and find the cheapest test that would prove it's a misinterpretation.

Demand real impact, not a status code. This is the part deterministic code can't easily encode. The prompt forces class-specific proof. An XSS finding has to show a real stolen token, not just an alert() firing. An IDOR has to show another user's private data in the response body, not just a 200. A SQL injection has to show extracted data or a timing difference, not just a database error. A passing status code is never enough on its own.

Propose concrete tests. The LLM proposes one to three specific HTTP requests that would distinguish a real bug from a mirage. This is the part no primitive did. The original detection ran one canned probe; the skeptic designs new, targeted experiments for this specific claim.

Run them, safely. The scanner executes those tests against the live target, inside strict guardrails: only the target's own hosts, no destructive methods (no DELETE, PUT, or PATCH), capped at three probes and ten seconds total. The skeptic gets to investigate, not to attack.

Judge against what came back. The responses go back to the LLM with a final question: given this evidence, is the finding still real, or did one of your own tests contradict it?

Drop only on a clear contradiction. If there's positive evidence the finding is wrong, it's dropped and the contradiction becomes the recorded reason. If the result is ambiguous, the finding is kept. The skeptic is tuned to be conservative, because wrongly dropping a real bug is worse than leaving a borderline finding in for a human to look at.
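The judgment side of the steps above (the hostile framing, the class-specific proof, and the conservative drop rule) could be sketched roughly as follows. The prompt wording, the category keys, the Verdict values, and the function names are all assumptions made for illustration; only the proof requirements themselves are taken from the description above. Requiring a non-empty recorded reason before dropping is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum

# Proof requirements paraphrased from the post; the keys are assumed names.
REQUIRED_PROOF = {
    "xss": "a real stolen token, not just an alert() firing",
    "idor": "another user's private data in the response body, not just a 200",
    "sqli": "extracted data or a timing difference, not just a database error",
}
DEFAULT_PROOF = "concrete impact; a passing status code is never enough on its own"


@dataclass(frozen=True)
class Finding:
    title: str
    description: str
    category: str
    request: str
    response: str
    evidence: str


def build_skeptic_prompt(f: Finding) -> str:
    """Assemble a prompt that is hostile to the finding (illustrative wording)."""
    proof = REQUIRED_PROOF.get(f.category.lower(), DEFAULT_PROOF)
    return "\n".join([
        "Assume this finding is a false positive.",
        "Propose 1 to 3 HTTP requests: the cheapest tests that would prove "
        "it is a misinterpretation.",
        f"To count as real, the finding must show {proof}.",
        f"Title: {f.title}",
        f"Description: {f.description}",
        f"Category: {f.category}",
        f"Request: {f.request}",
        f"Response: {f.response}",
        f"Evidence: {f.evidence}",
    ])


class Verdict(Enum):
    CONTRADICTED = "contradicted"
    AMBIGUOUS = "ambiguous"
    STILL_REAL = "still_real"


def should_drop(verdict: Verdict, contradiction: str) -> bool:
    """Drop only on a clear contradiction that comes with a recorded reason."""
    return verdict is Verdict.CONTRADICTED and bool(contradiction.strip())
```

Everything other than a contradiction with a stated reason falls through to "keep", which is what makes the skeptic conservative by default.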

The whole pass costs about half a cent per finding, which is why it can run as a smart chokepoint at the end instead of being bolted onto every primitive. One reasoning pass over the handful of findings that survived beats thirty self-checks that all share the same blind spots.
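The guardrails on the skeptic's probes (target hosts only, no DELETE, PUT, or PATCH, at most three probes, ten seconds total) might be enforced along these lines. This is a sketch under assumptions: Probe, is_allowed, select_probes, and run_probes are hypothetical names, and send stands in for whatever HTTP client actually issues the request.

```python
import time
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlsplit

BLOCKED_METHODS = frozenset({"DELETE", "PUT", "PATCH"})  # no destructive methods
MAX_PROBES = 3                                           # cap on probes
TOTAL_BUDGET_SECONDS = 10.0                              # shared time budget


@dataclass(frozen=True)
class Probe:
    method: str
    url: str


def is_allowed(probe: Probe, target_hosts: set[str]) -> bool:
    """Reject probes that leave the target or use a blocked method."""
    host = urlsplit(probe.url).hostname  # lowercased; None if the URL has no host
    return host in target_hosts and probe.method.upper() not in BLOCKED_METHODS


def select_probes(proposed: list[Probe], target_hosts: set[str]) -> list[Probe]:
    """Keep only in-scope probes, capped at MAX_PROBES."""
    allowed = [p for p in proposed if is_allowed(p, target_hosts)]
    return allowed[:MAX_PROBES]


def run_probes(
    probes: list[Probe],
    send: Callable[[Probe, float], str],
    clock: Callable[[], float] = time.monotonic,
) -> list[str]:
    """Run probes until the shared time budget is spent.

    `send` is a hypothetical callable that performs the request with the
    given timeout in seconds and returns the response text.
    """
    deadline = clock() + TOTAL_BUDGET_SECONDS
    responses = []
    for probe in probes:
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget exhausted; remaining probes are skipped
        responses.append(send(probe, remaining))
    return responses
```

Filtering happens before anything is sent, so a probe the LLM proposes against another host or with a blocked method never reaches the network. The host set is assumed to hold lowercase hostnames.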

A real example

I ran this against a deliberately vulnerable test target during dogfooding. One finding claimed an unauthenticated database reset, which would be serious if real.

The skeptic probed it and looked at what actually came back. The responses were just identical page reloads, with no confirmation that any reset had executed. Its conclusion: a 200 and a page reload is not proof that the reset ran. False positive, dropped.

In the same run it left a genuine missing-rate-limiting finding completely untouched, because nothing it probed contradicted that one. That's the behavior I want. It doesn't lower the finding count across the board, which would just trade sensitivity for the appearance of precision. It surgically removes the findings it can actively prove are wrong and leaves the rest alone.

The findings that have no other check

There's one class of finding where the skeptic isn't a second opinion. It's the only opinion.

Recon-derived findings (things like header and configuration observations) skip the replay validator entirely, because there's no exploit request to replay. Before the skeptic, those went into the report having faced no adversary at all. Now they get one. For that whole class of finding, the skeptic is the only thing standing between a shaky observation and the customer's report.
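As a rough sketch of that routing, assuming a finding carries either a replayable request or nothing at all (the Finding shape and gates_for are hypothetical, and the example titles are made up):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    title: str
    exploit_request: Optional[str]  # None for recon-derived observations


def gates_for(finding: Finding) -> list[str]:
    """Return the validation gates a finding passes through, in order."""
    gates = []
    if finding.exploit_request is not None:
        gates.append("replay")  # only meaningful when there is a request to re-send
    gates.append("skeptic")     # every finding faces the skeptic last
    return gates


print(gates_for(Finding("Header observation", None)))
print(gates_for(Finding("Injection claim", "GET /item?id=1")))
```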

Why this isn't a license to ship broken detection

The honest caveat: the skeptic is a backstop, not a substitute for getting the detection right.

When it drops a finding, that's a signal to go fix the agent that produced it. Otherwise broken heuristics just live behind the skeptic forever, and the skeptic quietly absorbs the same false positive on every scan instead of the bug getting fixed once at the source. Most of my actual work over the last two weeks was exactly that: fixing primitives at the source so they stop producing false positives in the first place. The skeptic is for the residual: the bugs I haven't found yet, and the cases too contextual to ever encode in deterministic code.

That's the real shape of the answer to "why doesn't the scanner just check itself." It does, where it can, by fixing the detection. And it adds an independent adversary for everything that careful detection can't catch. Defense in depth, two layers with opposite jobs.

The point

There's a cheap way to make an AI security tool feel trustworthy: lean hard on the LLM, let it call things vulnerabilities, and ship whatever it says. That produces reports full of false positives a human has to triage before they're usable, because the model is confidently wrong often enough that someone has to check every finding by hand.

The skeptic is the opposite instinct. It uses an LLM, but to attack the tool's own conclusions rather than to generate them. The model doesn't get to say a finding is real. It has to fail to prove the finding is fake. The verdict comes from HTTP responses that actually came back from the target, not from the model's confidence.

This is running against my own targets right now, on the way to becoming part of the production path for every customer scan. I'm still reviewing every report by hand before it ships. A step whose only job is to prove the scanner wrong is a large part of how I get to stop.

Week 12 next Tuesday.

About the author

Maxim Schram

I'm an AWS-certified cloud architect from New York who loves writing about DevSecOps, Infrastructure as Code and Serverless. Having run a tech company myself for years, I love helping other start-ups scale using the latest cloud services.
