My API testing is one step from running itself. So what's left for me?

June 8, 2026 · QA experiments

Over the past year, I’ve been rebuilding how I test APIs, and somewhere in there the work quietly inverted. I used to write the tests. Now I mostly feed a model the feature, tell it which scenarios matter, and let it infer the expected results and generate the tests from them.

I still drive every run. But the first time I sat back and noticed how little of the actual typing was mine anymore, the honest thought wasn’t nice. It was: with a bit more effort this runs without me — so what’s my job then?

A year ago this was manual, and it started late. Most of my testing didn’t even begin until a feature had already landed in the QA environment, so I’d meet it for the first time there and then work through it by hand: open the API client, fire a request, eyeball the response, hand-write the assertions, build out the Postman collection, repeat for every endpoint.

It was usually one endpoint per user story, and the effort split two ways — thinking up the tests and hand-building the collection in Postman. The catch was timing. By the time a feature reached QA I had maybe half the understanding and half the tests, so I’d find the edge cases late — or worse, find out I’d been wrong about an expected result, or that I’d missed a case that actually mattered. The thinking that should have come first was happening last.

So I rebuilt it. Not “added a script,” and not “fully automated” either — I turned API testing into a pipeline I drive, where the model does the producing and I do the deciding.

What I built

Strip the branding off and it’s a pipeline with a handful of stages — and the whole point is which stages are mine and which aren’t. It runs in VS Code with an AI model (I use Claude) as the surface, against a normal REST/HTTP service, and it produces Postman tests. The stages:

1. Ingest. I drop the new feature’s documentation and details in, and the model and I go through it together until its understanding of the feature matches mine and we agree on what the API is supposed to do.
2. Anchor the happy path. I hand it concrete examples: example requests and the responses I expect back. This is me pinning down ground truth.
3. Build the coverage. This is the part I actually sit in. I add the scenarios I care about (the negative cases, the edge cases, the ones that bite) and the model infers the expected result for each as I go. I keep going until the coverage feels honest.
4. Bundle it. When I’m happy with coverage, I ask the model to generate a Postman collection of the whole set: something the rest of the team can run, drop into our integration and regression suites, or just read.
5. Ship it. I review the collection and integrate it.

Now look at where I am in that list. The understanding, the ground truth, the coverage, the final call — stages 1, 2, 3, and 5 — are all me. What I’ve handed off is stage 4 and the mechanical half of stage 3: turning the cases I care about into actual, runnable, shareable test code. The part that used to be the job — writing the tests — is the part the model now does.

And here’s where the question in the title comes from. None of stages 1–3 are inherently mine. With a bit more effort I could hand the model the raw feature doc, let it propose the scenarios and the expected results itself, and merge on green. I haven’t. But I can see it from here — the pipeline is one good push from not needing a driver.

You could build your own version of this. The pattern is the value, not my exact wiring.

How it actually works

Here’s the part I can show you clean-room. Forget my real endpoints — I ran the exact same loop against a boring public API anyone can hit: coinlayer’s historical-rates endpoint, GET /{date}, which hands back crypto exchange rates for a given day. Pretend it just landed in my backlog as a new endpoint to test. The shape is identical whether it’s this or your payments service.

Stage one is literally me pasting coinlayer’s endpoint doc into the chat and the two of us walking through it until the model’s read of the thing matches mine: a date in the path, an access_key, and a few query params, namely symbols (which coins), target (which fiat to price them in), and expand (full rate objects vs. bare numbers). That tiny surface is exactly where the interesting cases hide, which is the point.
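To make that surface concrete, here is a minimal sketch of a request builder for it. The helper name `buildHistoricalUrl` and the `BASE_URL` value are assumptions for illustration, not something taken from the project; only the path and parameter names come from the description above.

```javascript
// Hypothetical helper: builds the request URL for the historical-rates endpoint.
// BASE_URL is an assumption; use whatever host your plan's documentation gives you.
const BASE_URL = "https://api.coinlayer.com";

function buildHistoricalUrl({ date, accessKey, symbols = [], target, expand = false }) {
  // The date is the only path segment, e.g. /2026-06-05.
  const url = new URL(`/${encodeURIComponent(date)}`, BASE_URL);
  url.searchParams.set("access_key", accessKey);
  // Optional params are only sent when the caller asked for them.
  // Note: URLSearchParams percent-encodes the comma between symbols.
  if (symbols.length > 0) url.searchParams.set("symbols", symbols.join(","));
  if (target) url.searchParams.set("target", target);
  if (expand) url.searchParams.set("expand", "1");
  return url.toString();
}

console.log(
  buildHistoricalUrl({
    date: "2026-06-05",
    accessKey: "YOUR_KEY",
    symbols: ["BTC", "ETH"],
    target: "USD",
    expand: true,
  })
);
```

Each optional parameter is a branch in the builder, and each branch is a place an edge case can hide: several symbols, a non-USD target, a date that is not a date.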

By the time I’m done, the loop hasn’t produced “a test.” It’s produced a little project, and the folders are the pipeline, frozen on disk:

coinlayer-historical/
├── docs/
│   ├── HistoricalDataEndpoint.md            # user-facing: what the endpoint supports + what it doesn't
│   └── coverage-testplan.md                 # evidence-of-testing: coverage I can hand an auditor
├── examples/                                # happy path — ground truth I anchored by hand
│   ├── btc-usd_request.md
│   ├── btc-usd_response.json
│   ├── sol-usd_request.md
│   └── sol-usd_response.json
├── tests/                                   # the cases that matter — negatives + edges I added
│   ├── btc_eth-usd_request.md               # edge: more than one symbol
│   ├── btc_eth-usd_response.json
│   ├── sol-eur_request.md                   # edge: non-USD target
│   ├── sol-eur_response.json
│   ├── negative_invalidDate_request.md      # negative: malformed date
│   ├── negative_invalidDate_response.json
│   ├── negative_fiat_request.md             # negative: bogus target currency
│   └── negative_fiat_response.json
└── postman/
    └── historicdata_postman_collection.json # the whole set, runnable + importable

Here’s the thing worth staring at: my effort lives almost entirely in two of those folders. examples/ (stage 2 — the happy path I pin down by hand) and tests/ (stage 3 — the negatives and edges I decide are worth covering) are where I actually spend myself. Everything else is one sentence to the model. Inferring the expected response for an edge case the docs never mention? A prompt — it fills the response file from what it already knows about APIs shaped like this one. postman/, the whole runnable collection? A prompt. docs/, which I’ll get to below? A prompt. What’s left to me is choosing the cases and anchoring the truth; the producing is one instruction away each time.
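Because every scenario in examples/ and tests/ is a request file plus a response file, one cheap sanity check is that no scenario is missing its other half. The sketch below is a hypothetical helper, not part of the project; it assumes the `_request.md` / `_response.json` naming shown in the tree and works on plain filenames so it runs without touching the disk.

```javascript
// Hypothetical check: every scenario should have both <name>_request.md
// and <name>_response.json. Returns the scenarios that are missing one half.
function findUnpairedScenarios(filenames) {
  const seen = new Map(); // scenario name -> { request: boolean, response: boolean }
  for (const file of filenames) {
    const match = file.match(/^(.+)_(request\.md|response\.json)$/);
    if (!match) continue; // ignore files that do not follow the naming convention
    const [, name, suffix] = match;
    const entry = seen.get(name) ?? { request: false, response: false };
    if (suffix === "request.md") entry.request = true;
    else entry.response = true;
    seen.set(name, entry);
  }
  return [...seen]
    .filter(([, e]) => !(e.request && e.response))
    .map(([name, e]) => ({ name, missing: e.request ? "response" : "request" }));
}

const testsFolder = [
  "btc_eth-usd_request.md",
  "btc_eth-usd_response.json",
  "sol-eur_request.md",
  "sol-eur_response.json",
  "negative_invalidDate_request.md",
  "negative_invalidDate_response.json",
  "negative_fiat_request.md", // response left out on purpose for the demo
];

console.log(findUnpairedScenarios(testsFolder));
// [ { name: 'negative_fiat', missing: 'response' } ]
```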

The pivot of the whole thing isn’t any single test — it’s the expected results. That’s the artifact I anchor and the model fills out. I give it the happy path, then I start adding the cases I actually care about — the negative ones, the edge ones — and it infers the expected result for each as I go:

| Scenario | Request | Expected result |
| --- | --- | --- |
| BTC in USD (happy path, I gave it this) | `GET /2026-06-05?symbols=BTC&target=USD&expand=1` | `success:true`, `historical:true`, `date` echoes `2026-06-05`, `rates.BTC` carries `rate`/`high`/`low`/`vol` |
| BTC and ETH (edge, I added it) | `GET /2026-06-05?symbols=BTC,ETH` | `rates` has both `BTC` and `ETH` keys, not just the first |
| SOL priced in EUR (edge, I added it) | `GET /2026-06-05?symbols=SOL&target=EUR` | `target:"EUR"`; the rate is in euros, not silently still USD |
| Malformed date (negative, I added it) | `GET /not-a-date?symbols=BTC` | `success:false`, `error.code:302`, `error.type:"invalid_date"` |
| Bogus target currency (negative, I added it) | `GET /2026-06-05?symbols=BTC&target=XYZ` | `success:false`; error names the bad target, not a silent fallback to USD |

The happy path is the one I hand it. The four below are the ones that matter, and they’re the ones that are mine — left alone, the model is happy to confirm BTC-in-USD comes back and call the endpoint covered. Here’s the ground truth I anchored for that first row — a real response from the endpoint, sitting in examples/btc-usd_response.json, open in the actual project:

[Screenshot: the coinlayer-historical project open in VS Code, with examples/btc-usd_response.json showing the historical BTC-in-USD response: success true, target USD, historical true, date 2026-06-05, and a rates.BTC object with rate, high, low, vol, cap and change_pct.]

The real project. examples/ and tests/ are where my effort goes: anchoring ground truth and choosing the cases. Everything downstream is a prompt.
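One way to picture "owning the Expected column" is to write each row of the scenario table as a predicate over the parsed response body. This is a sketch under assumptions: the field names come from the table, but the `expectedResults` array and the `evaluate` helper are hypothetical. The table does not say how the error names a bad target, so that row only checks that the request was rejected with an error present.

```javascript
// Each row of the Expected column as a predicate over the parsed response body.
// The array structure and helper names are illustrative, not from the project.
const expectedResults = [
  {
    scenario: "BTC in USD (happy path)",
    expected: (b) =>
      b.success === true &&
      b.historical === true &&
      b.date === "2026-06-05" &&
      ["rate", "high", "low", "vol"].every((k) => k in (b.rates?.BTC ?? {})),
  },
  {
    scenario: "BTC and ETH",
    expected: (b) =>
      b.success === true && "BTC" in (b.rates ?? {}) && "ETH" in (b.rates ?? {}),
  },
  {
    scenario: "SOL priced in EUR",
    expected: (b) =>
      b.success === true && b.target === "EUR" && "SOL" in (b.rates ?? {}),
  },
  {
    scenario: "Malformed date",
    expected: (b) =>
      b.success === false && b.error?.code === 302 && b.error?.type === "invalid_date",
  },
  {
    scenario: "Bogus target currency",
    // A successful response priced in USD is the silent fallback this row rejects.
    expected: (b) => b.success === false && b.error !== undefined,
  },
];

function evaluate(scenario, body) {
  const row = expectedResults.find((r) => r.scenario === scenario);
  if (!row) throw new Error(`No expected result anchored for: ${scenario}`);
  return row.expected(body);
}

// A response that quietly priced BTC in USD despite target=XYZ fails its row:
console.log(
  evaluate("Bogus target currency", { success: true, target: "USD", rates: { BTC: 1 } })
); // false
```

Written this way, each expectation is one short expression a human can read and argue with, which is the review the rest of the pipeline depends on.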

Once I’ve signed off on the table, it generates the Postman tests from it:

// "Historical BTC in USD — happy path" › Tests
pm.test("status is 200", () => pm.response.to.have.status(200));

const body = pm.response.json();
pm.test("succeeded and is flagged historical", () => {
    pm.expect(body.success).to.be.true;
    pm.expect(body.historical).to.be.true;
    pm.expect(body.date).to.eql("2026-06-05");
});
pm.test("BTC rate is present and numeric", () => {
    pm.expect(body.rates).to.have.property("BTC");
    pm.expect(body.rates.BTC.rate).to.be.a("number");
});

// "Malformed date" › Tests  (negative_invalidDate)
// This is a separate request with its own Tests script, so `body` is declared again here.
const body = pm.response.json();
pm.test("rejected as invalid_date", () => {
    pm.expect(body.success).to.be.false;
    pm.expect(body.error.code).to.eql(302);
    pm.expect(body.error.type).to.eql("invalid_date");
});

That second block is the one I care about, and it’s where coinlayer gets interesting: a bad date doesn’t come back as an HTTP 4xx. It comes back 200 OK with success:false buried in the body. Hold that thought — it’s the trap in the next section.

[Screenshot: tests/negative_invalidDate_response.json] The rejection lives in the body (success:false, error.code 302), not the HTTP status.

When the coverage feels right, I have it bundle the whole set into historicdata_postman_collection.json — every request and its tests, integrated and ready to import. The format matters: it’s what the rest of the team already runs, and it drops straight into our integration and regression suites. And the time I save not hand-building that collection goes right back into coverage: hunting for the business case or the edge case I haven’t thought of yet. That’s the trade I keep making — less typing, more thinking about what’s missing.
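For a sense of what that bundle contains, here is a minimal sketch of assembling a Postman Collection v2.1 document from a list of scenarios. In the post this step is a prompt to the model, so the code is a stand-in illustration, not the actual method. The function name `buildCollection` and the `{{baseUrl}}` and `{{access_key}}` collection variables are assumptions.

```javascript
// Hypothetical bundler: turns scenarios into a Postman Collection v2.1 document.
// {{baseUrl}} and {{access_key}} are assumed collection variables.
function buildCollection(name, scenarios) {
  return {
    info: {
      name,
      schema: "https://schema.getpostman.com/json/collection/v2.1.0/collection.json",
    },
    item: scenarios.map(({ title, path, tests }) => ({
      name: title,
      request: {
        method: "GET",
        // Append the key with "&" when the path already carries a query string.
        url: `{{baseUrl}}${path}${path.includes("?") ? "&" : "?"}access_key={{access_key}}`,
      },
      // The Tests tab of a request is stored as a "test" event; exec holds the script lines.
      event: [{ listen: "test", script: { type: "text/javascript", exec: tests } }],
    })),
  };
}

const collection = buildCollection("coinlayer historical", [
  {
    title: "Malformed date",
    path: "/not-a-date?symbols=BTC",
    tests: [
      "const body = pm.response.json();",
      "pm.test('rejected as invalid_date', () => {",
      "    pm.expect(body.success).to.be.false;",
      "    pm.expect(body.error.code).to.eql(302);",
      "});",
    ],
  },
]);

console.log(JSON.stringify(collection, null, 2));
```

The output is plain JSON, which is why the format travels well: teammates import it, and a collection runner can pick it up in a suite.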

Notice what moved. I’m not writing those pm.test blocks anymore. What I’m doing is owning the column that says Expected: deciding that a bogus target must error instead of quietly falling back to USD, and deciding that a success:false inside a 200 still has to fail any test that expected the request to succeed. That shift, from writing the tests to owning what they’re supposed to prove, is the whole post, and embarrassingly, it took me a while to notice it had happened at all.

What changed

Two things changed that I actually feel day to day.

First, where my time goes. I’m not spending it writing tests and assertions anymore — it goes to coverage, to edge cases, and to sitting down with the BAs to pin down what the expected result even should be. The work moved up the stack: from producing tests to deciding what they prove.

Second, when the work happens. The understanding and the tests are ready earlier now, so by the time a feature hits the QA environment I’m not still ramping up. I catch the inconsistencies and the bugs almost as soon as it lands — instead of stumbling onto them a week later, when I’d finally finished building the collection.

The honest headline isn’t “it’s faster,” though it is. It’s that the part that used to eat the day — writing and rewriting the actual tests — basically evaporated, and what’s left in my day is the part that was always the real engineering: deciding what the API is supposed to do. More on that below, because that part turned out to be the whole job.

The part I didn’t expect: documentation

Nobody on my team writes documentation. Be honest — nobody on your team does either. It’s the chore everyone agrees is important and everyone quietly skips, and I was as guilty as the next person. Don’t blame me; I think it’s a universal law. What little we did have was thin — a few lines, no real detail — because nobody wanted to write it and nobody wanted to review it either.

Here’s what I didn’t see coming. By the time the tests exist, the model already holds the entire shared understanding of the feature — the one we built together in stage 1, the expected results, every scenario I added. Generating docs out of that is almost free — and it’s not one doc, it’s whatever the audience needs, each its own single prompt:

A coverage / test plan — what I exercised, what passed, what I deliberately left out — the artifact I hand over as evidence that the endpoint was actually tested.
User-facing endpoint docs — what it supports, what it doesn’t, the gotchas like that 200 wrapping success:false — for whoever has to consume the thing.

And it’s not a model guessing what an endpoint does from its name; it’s documentation written from the exact understanding the tests were built on. I ask, and it’s there — accurate, because it’s grounded in the same knowledge as the tests.

The chore the whole team avoided for years turned into a side effect of testing. That still feels a little like cheating.

What’s still hard

The loop is not magic, and the failures are specific — some of them the kind that bite you in prod with a green checkmark sitting right next to them. If I only told you the win, you shouldn’t trust the rest of this post.

It infers confident expected results for the wrong reasons. Remember coinlayer’s malformed-date response — 200 OK wrapping success:false. Point the model at it cold and it does the natural thing: sees 200, writes pm.test("status is 200"), and the negative test goes green — cheerfully asserting that a request which was supposed to fail succeeded. It decided the expected result was 200 because 200 is what came back, when the contract is “this should be rejected.” A green Postman test that encodes the wrong expectation is worse than no test. And the model is good at this — confident, fluent, fast — which is exactly what makes it dangerous: a wrong expected result shows up looking just as sure as a right one. The attention I used to spend writing assertions now goes into reading that Expected column, and when it’s wrong, explaining why it’s wrong so the correction carries into the next test.
It covers the happy path and calls it done. Auth failures, malformed bodies, the 422 you actually care about, idempotency, pagination edges — it under-covers exactly the stuff that bites in production unless I name those scenarios myself. There’s no shortcut: if I don’t call out the negative and edge cases explicitly, they just don’t get covered.
It treats observed behavior as correct behavior. It infers the Expected column from how the API actually responds — or from a feature doc that might be wrong — so if there’s a bug, it’ll faithfully write a test that locks the bug in and calls it the spec. Miss that one correction and the bug doesn’t just survive — it gets canonized: the test locks it in and the docs describe it as intended. Two artifacts, now lying in sync.
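The first failure above, the green test on a 200 that wraps success:false, is easy to reproduce outside Postman. The sketch below uses plain objects in place of real responses. The `observed` body mirrors the malformed-date response described earlier, while `regressed` is a hypothetical future response invented here to show what the status-only check would miss.

```javascript
// The malformed-date response as described above: HTTP 200, rejection in the body.
const observed = {
  status: 200,
  body: { success: false, error: { code: 302, type: "invalid_date" } },
};

// What gets written when the response is read cold: status only.
const statusOnlyCheck = (res) => res.status === 200;

// What the contract says: the request must be rejected, wherever the API reports it.
const contractCheck = (res) =>
  res.body.success === false && res.body.error?.type === "invalid_date";

// Hypothetical regression: the API starts accepting the bad date and returns rates.
const regressed = { status: 200, body: { success: true, rates: { BTC: 1 } } };

console.log(statusOnlyCheck(observed), statusOnlyCheck(regressed)); // true true
console.log(contractCheck(observed), contractCheck(regressed)); // true false
```

The status-only check is green in both cases, so it cannot tell a rejection from an acceptance. Only the check written from the contract notices when the behavior changes.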

None of these are reasons not to do it. They’re the map of where my attention has to go now — which is the turn.

So what is my job?

Here’s the thing I sat with after noticing how little of the test-writing was still mine.

The job stopped being “write the tests.” Writing them is now the cheap part — the part the model does in minutes. Every failure in the section above has the same shape: the model is excellent at producing tests and has no idea whether they assert the right thing. It can’t tell a correct contract from a bug that happens to pass. It doesn’t know which 422 matters. It will write a defect down as the expected result with a completely straight face.

So the part that’s left — the part that’s mine — is owning what the tests are supposed to prove. And the strange thing is what that turned into in practice. With the collection-building and the assertions off my plate, the time didn’t vanish — it moved. It goes into understanding the feature and the business logic now. I’m in more meetings with the BAs. I’ve caught issues in the feature itself, and behavior the BA hadn’t even considered — the kind of thing I’d never have surfaced back when my hours disappeared into hand-building Postman collections.

That’s not a smaller job. It’s a more concentrated one. The grind was hiding it — when most of QA is mechanical production, the judgment looks like a side effect of the typing. Take the typing away and the judgment is all that’s left, standing there in full view: deciding what the API is supposed to do, and catching the test that passes for the wrong reason. That’s not the overhead of the job anymore. That’s the job.

And that’s the part I keep circling. With a bit more effort I could hand off the stages I’ve kept — let it propose the scenarios and the expected results, and merge on green. The reason I haven’t isn’t that it’s hard. It’s that those are the stages where being wrong actually costs something.

Here’s where I’ve landed. Everything else can go — the test creation, the assertions, the integration into the suites, even kicking off the runs. Hand it all to the model; genuinely, I’m fine with that. What can’t go, at least not yet, is the part where I refine the understanding, nail down the expected results, and hold the business context. Solve that and the rest follows on its own. So “with a bit more effort” is mostly a roadmap — except for the one stage that’s actually the job, where it still reads a little like a warning.

What’s next

Two directions, and they pull against each other in a way I like. One is practical — push this flow closer to mostly-automated, hand off more of the mechanical stages, and find out how far that “one good push” really is. The other is the bigger question sitting underneath it: what my job even is now.

Because the roles are bleeding into each other. Devs can generate a basic test plan themselves; I spend my days understanding the business almost like a BA. The lines that used to say who-does-what are dissolving, and I don’t think QA comes out the other side looking the same.

That’s what the next posts are about — the experiments I’m running, what they’re teaching me, and where I think the QA role is actually heading.