I reverted a week's work because it only worked on select targets

June 1, 2026

Last week I kept catching myself building the same mistake in different variations. Each time the work looked like progress. Each time, the honest answer to "is this actually good?" was "no, it just looks good on the target I built it against."

The clearest example: I wrote a new detection primitive, about 3,000 lines and 30 regression tests, all passing, finding real vulnerabilities on WebGoat (a deliberately vulnerable training app I use as a test target). I was ready to merge it.

Then I asked the question I've started forcing myself to ask: would this fire on a real customer?

No. The primitive was built around WebGoat's specific URL structure for its training lessons. It worked beautifully on WebGoat and would produce exactly zero findings on an actual SaaS app, because no real application is shaped like a security training course. I'd spent a week writing a very elaborate way to pass my own test. I reverted all of it.

This is overfitting, and security tools are full of it

If you've done any machine learning, you know overfitting: a model that performs brilliantly on its training data and falls apart on anything new, because it learned the training set's quirks instead of the underlying pattern.

Security scanners overfit too. A tool gets tested against a standard set of practice apps (WebGoat, Juice Shop, DVWA), gets really good at them, and the whole time it's quietly learning the shape of the test set instead of the shape of real applications. The failure is invisible until you point it at something real. A scanner that scores 90% on the practice apps can score far lower on a customer's actual codebase, because real apps don't have the tidy, well-known vulnerability patterns the practice apps were built to teach.

The temptation is constant because it produces visible wins. A WebGoat-specific detector makes the WebGoat number go up today. The cost is deferred and invisible: the tool gets better at looking good and no better at finding bugs in the apps customers actually run.

The first customers are the subtler trap

Here's the part I didn't expect. The practice apps are the obvious overfitting risk. The harder one is your first real customers.

When you've only scanned a handful of real applications, every fix you make to satisfy those specific apps is also overfitting, just to a sample size of three instead of a sample size of one training app. And it's more dangerous, because tuning to your early customers feels like exactly the thing you're supposed to do. It looks like real customer-driven development.

But "make it work for the three apps I've seen" and "make it work for apps in general" are different goals, and early on they're easy to confuse. A fix that's really a hardcode for one customer's framework will pass every test I have, please the customer in front of me, and silently fail the fourth customer whose stack I never saw. The first few customers can quietly become a new, smaller test set that I overfit to without noticing.

So the rule I made last week applies in both directions. Every new capability has to pass one question before it ships: would this produce a finding on an application I haven't seen? If it only works because the app happens to be shaped like something already in front of me, whether that's WebGoat or a current customer, it's measuring the wrong thing.
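One way to make that question mechanical is a held-out check: run the new detector only against targets it was not built against, and refuse to ship if it finds nothing there. The sketch below is illustrative, not the actual tool. The target names, the paths, the two toy detectors, and the simplification of a target to a list of URL paths are all assumptions.

```python
from typing import Callable, Dict, List, Set

Target = List[str]                       # assumption: a target is just a list of URL paths
Detector = Callable[[Target], List[str]]  # a detector returns the paths it flags


def findings_on_unseen(detector: Detector,
                       targets: Dict[str, Target],
                       built_against: Set[str]) -> Dict[str, List[str]]:
    """Run the detector only on targets it was NOT developed against."""
    return {name: detector(paths)
            for name, paths in targets.items()
            if name not in built_against}


def passes_ship_gate(detector: Detector,
                     targets: Dict[str, Target],
                     built_against: Set[str]) -> bool:
    """True only if at least one unseen target yields a finding.

    With no unseen targets at all, the gate fails rather than passes.
    """
    return any(findings_on_unseen(detector, targets, built_against).values())


# Hypothetical targets: one training app and two made-up SaaS apps.
targets = {
    "training-app": ["/lesson/1/attack", "/lesson/2/attack"],
    "saas-a": ["/admin/users", "/settings/export"],
    "saas-b": ["/api/manage/keys"],
}


def overfit_detector(paths: Target) -> List[str]:
    # Keyed to the training app's (hypothetical) lesson URL shape.
    return [p for p in paths if p.startswith("/lesson/")]


def general_detector(paths: Target) -> List[str]:
    # Keyed to action segments that appear across many applications.
    actions = {"admin", "export", "settings", "manage"}
    return [p for p in paths if actions & set(p.split("/"))]


overfit_ok = passes_ship_gate(overfit_detector, targets, {"training-app"})
general_ok = passes_ship_gate(general_detector, targets, {"training-app"})
```

Here `overfit_ok` is `False` and `general_ok` is `True`. The gate is a necessary condition, not a sufficient one: a detector can fire on unseen targets and still be wrong, and the same held-out treatment has to apply to early customers, not just to the training app.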

The reverted primitive got rebuilt as something general: instead of one training app's URL pattern, it probes for common action endpoints (admin panels, exports, settings, management interfaces) that show up across real applications. WebGoat became incidental. So did any single customer.
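As a rough illustration of the difference, the general version starts from a list of action names rather than from any one app's URL layout. This is a minimal sketch, assuming a hypothetical `COMMON_ACTION_PATHS` list and prefix set. It only builds candidate URLs and sends no requests, and it is not the real primitive.

```python
from typing import List, Sequence
from urllib.parse import urljoin

# Hypothetical, deliberately short list of generic action segments.
COMMON_ACTION_PATHS = ["admin", "export", "settings", "manage"]


def candidate_urls(base_url: str,
                   prefixes: Sequence[str] = ("", "api/")) -> List[str]:
    """Build probe candidates for any base URL, with no app-specific paths."""
    # urljoin treats a base without a trailing slash as a file, so normalize it.
    base = base_url if base_url.endswith("/") else base_url + "/"
    return [urljoin(base, prefix + path)
            for prefix in prefixes
            for path in COMMON_ACTION_PATHS]


urls = candidate_urls("https://example.test")
```

Nothing in the candidate list depends on which application is being scanned, which is the property the rebuilt primitive is after. Probing itself, and deciding what counts as a finding, is left out here.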

Why this matters more for small dev teams

A small dev team buying a pentest doesn't have a security person to sanity-check the report. They can't look at the findings and think "this tool clearly didn't understand our auth flow." They take the report at face value. An overfit tool fails silently on exactly the customers least equipped to catch it. A clean report doesn't mean the app is secure; it means the app didn't happen to match the patterns the tool learned.

The whole value proposition is certainty for teams that can't afford a security hire. An overfit tool gives them false certainty, which is worse than no tool at all.

That's why the shortcut is never worth it, even when it's dressed up as serving a customer. The thing it trades away is the one thing the product is supposed to provide.

Week 7 next Tuesday.

About the author

Maxim Schram

I'm an AWS-certified cloud architect from New York who loves writing about DevSecOps, Infrastructure as Code, and Serverless. Having run a tech company myself for years, I love helping other start-ups scale using the latest cloud services.
