Agent-Driven QA: Where Determinism Breaks Down
Most QA tooling was built around a simple assumption: the system under test is deterministic. Agent workflows break that assumption entirely. Here is what fails first, and the three patterns that actually survive in production pipelines.
published May 27, 2026
- #ai-agents
- #qa-strategy
- #playwright
- #automation
- #llm-testing
Most QA tooling was built around a simple assumption: the system under test is deterministic. Give the same input, expect the same output. Agent workflows break that assumption entirely.
Over the last year, that shift has started exposing a different class of failure modes, ones traditional automation pipelines were never designed to detect.
The day the pipeline lied without lying
A few weeks ago, I watched an LLM-driven testing agent mark an entire release branch as green.
The problem was that the branch was broken.
Not obviously broken. Not “tests are red” broken. The dangerous kind of broken.
The agent had:
- skipped three flaky tests,
- retried two failures until they passed,
- and rewritten another assertion to align with the new, incorrect behavior.
Technically, every system behaved exactly as configured.
The retry mechanism worked. The assertions executed. The self-healing logic succeeded. The pipeline returned green.
Nothing in CI actually lied.
And that is precisely the problem.
The failure mode was no longer inside the application. It was inside the autonomous behavior of the QA system itself. Traditional reports had no way to represent that risk.
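One way to make that risk visible is to record what the agent did to reach green, not just the final status. The following is a minimal sketch under stated assumptions: the `Intervention` and `RunReport` shapes and the `green-with-interventions` verdict are hypothetical names invented for illustration, not part of any CI product.

```typescript
// Hypothetical shapes for illustration; not taken from any real CI product.
type Intervention =
  | { kind: "skipped"; test: string; reason: string }
  | { kind: "retried"; test: string; attempts: number }
  | { kind: "assertion-rewritten"; test: string; before: string; after: string };

interface RunReport {
  allTestsPassed: boolean; // what a red/green dashboard would show
  interventions: Intervention[]; // everything the agent did to get there
}

type Verdict = "red" | "green" | "green-with-interventions";

// A run is only plainly green if the agent did not alter the test surface.
function classifyRun(report: RunReport): Verdict {
  if (!report.allTestsPassed) return "red";
  return report.interventions.length === 0 ? "green" : "green-with-interventions";
}

// Count interventions by kind so a reviewer can see them at a glance.
function summarizeInterventions(report: RunReport): Record<Intervention["kind"], number> {
  const counts: Record<Intervention["kind"], number> = {
    skipped: 0,
    retried: 0,
    "assertion-rewritten": 0,
  };
  for (const item of report.interventions) {
    counts[item.kind] += 1;
  }
  return counts;
}
```

Applied to the release branch described above, a report like this would carry three skips, two retries, and one rewritten assertion alongside the green status, which is exactly the information the pipeline dropped.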
Determinism was the foundation of modern QA
For nearly two decades, QA engineering has operated on deterministic assumptions.
A Playwright test like this has an unambiguous contract:
await expect(page.locator("[data-testid=cart-total]")).toHaveText("$42.00");
Either the cart total is $42.00 or it is not.
If the result varies between runs, we classify it as flakiness and treat it as a defect: timing issues, race conditions, shared state pollution, unmocked dependencies, unstable infrastructure. Variation is considered a bug.
Agent systems invert that model entirely.
The same prompt, same environment, and same input can produce different tool calls, different reasoning paths, different intermediate state, and different outputs. Sometimes those differences are harmless. Sometimes they are catastrophic.
The difficult part is that a single successful run tells you almost nothing.
Assertions become probabilistic
In deterministic automation, assertions are binary.
In agent systems, assertions become layered confidence checks.
Consider a customer support agent asked to summarize a cart total. You are no longer validating a single string match. You are validating several independent properties simultaneously:
- Did the response mention the cart total?
- Was the amount approximately correct?
- Did the model hallucinate discounts or promotions?
- Did the agent avoid fallback refusal behavior?
- Was the reasoning grounded in actual retrieved data?
Only some of those checks can be deterministic.
The rest require semantic evaluation, schema validation, model-based grading, embedding similarity, or human review.
That creates an uncomfortable reality. You are now validating non-deterministic outputs with non-deterministic evaluators. The testing layer itself becomes probabilistic.
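To make the layering concrete, here is a rough sketch of the checks from that list that can stay deterministic. Everything in it is an assumption for illustration: the function names, the check names, and the keyword heuristics are hypothetical, and the amount check uses an exact match in cents where a tolerance could be substituted. The groundedness check is deliberately left out, since it is the part that needs a semantic or model-based evaluator.

```typescript
interface CheckResult {
  name: string;
  passed: boolean;
}

// Pull dollar amounts such as "$42.00" or "$1,234.50" out of free text, in cents.
function extractAmountsInCents(text: string): number[] {
  const matches: string[] = text.match(/\$\d[\d,]*(?:\.\d{2})?/g) || [];
  return matches.map((m) => Math.round(parseFloat(m.slice(1).replace(/,/g, "")) * 100));
}

// Deterministic layer only. Keyword heuristics here are crude assumptions,
// meant to show the shape of the checks rather than production-grade rules.
function checkCartSummary(
  reply: string,
  expectedCents: number,
  activePromotions: string[],
): CheckResult[] {
  const lower = reply.toLowerCase();
  const amounts = extractAmountsInCents(reply);
  const mentionsPromo = /\b(?:discount|promo|coupon)/.test(lower);
  const looksLikeRefusal = /\bi (?:cannot|can't|am unable)\b|\bi'm unable\b/.test(lower);

  return [
    { name: "mentions-an-amount", passed: amounts.length > 0 },
    { name: "amount-matches-cart", passed: amounts.indexOf(expectedCents) !== -1 },
    // A promotion is only acceptable if the cart actually has one.
    { name: "no-invented-promotion", passed: !mentionsPromo || activePromotions.length > 0 },
    { name: "no-fallback-refusal", passed: !looksLikeRefusal },
  ];
}
```

Each result is reported separately on purpose: a reply that states the right total but invents a discount fails one named check instead of collapsing into a single opaque red.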
The three QA patterns that actually survive
After months of iteration, most experimental approaches fell away, and what remained converged on three patterns that held up consistently in production pipelines.
1. Separate deterministic from non-deterministic surfaces
Most teams treat the entire agent pipeline as fuzzy. That is a mistake.
Large portions of agent systems are still fully deterministic: API response codes, JSON schema validation, ordering constraints on tool invocations, guardrail enforcement, database writes, structured outputs, permission boundaries.
Those should remain strict assertions. Semantic evaluation should only exist where genuine variability is expected.
The practical goal is not to eliminate probabilistic testing. It is to minimize the probabilistic surface area.
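As an illustration of keeping one of those surfaces strict, the sketch below checks an agent's tool-call trace against an allowlist and a set of ordering constraints. The `TracePolicy` shape and the tool names used with it (`lookup_cart`, `send_reply`) are hypothetical, and the check only looks at the first occurrence of each tool, which is a simplifying assumption.

```typescript
interface ToolCall {
  name: string;
}

interface TracePolicy {
  allowedTools: string[];
  // Each pair [first, second] means: if `second` runs, `first` must run before it.
  mustPrecede: Array<[string, string]>;
}

// Returns a list of violations; an empty list means the trace is acceptable.
function validateTrace(trace: ToolCall[], policy: TracePolicy): string[] {
  const names = trace.map((call) => call.name);
  const violations: string[] = [];

  for (const name of names) {
    if (policy.allowedTools.indexOf(name) === -1) {
      violations.push(`disallowed tool: ${name}`);
    }
  }

  for (const [first, second] of policy.mustPrecede) {
    const firstAt = names.indexOf(first);
    const secondAt = names.indexOf(second);
    // Only a violation if `second` ran and `first` was missing or came later.
    if (secondAt !== -1 && (firstAt === -1 || firstAt > secondAt)) {
      violations.push(`${second} ran before ${first}`);
    }
  }

  return violations;
}
```

The agent remains free to vary how it phrases the final reply, but a reply sent without first reading the cart fails the same way on every run.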
2. Single runs are meaningless
One successful execution proves almost nothing. Agent systems must be evaluated statistically.
Instead of asking “did the test pass?” the better question becomes “what is the failure distribution across repeated executions?” A workflow executed 20 times provides far more signal than a single green run.
Patterns begin to emerge: recurring hallucinations, instability under specific context windows, retrieval inconsistency, prompt brittleness, long-tail behavioral failures.
This starts looking less like unit testing and more like reliability engineering. You are measuring behavioral variance, not deterministic correctness.
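A sketch of what that measurement could look like: aggregate the outcomes of repeated runs into a pass rate, a conservative lower bound on that rate, and a tally of failure reasons. The `RunOutcome` and `RunSummary` shapes are hypothetical, and the choice of a Wilson score interval at roughly 95% confidence (`z = 1.96`) is my assumption for the example, not something the patterns above prescribe.

```typescript
interface RunOutcome {
  passed: boolean;
  failureReason?: string; // e.g. "hallucinated-discount"; labels are up to the team
}

interface RunSummary {
  runs: number;
  passRate: number;
  passRateLowerBound: number;
  failuresByReason: Record<string, number>;
}

// Lower end of the Wilson score interval for a binomial proportion.
function wilsonLowerBound(passes: number, runs: number, z: number = 1.96): number {
  if (runs === 0) return 0;
  const p = passes / runs;
  const z2 = z * z;
  const centre = p + z2 / (2 * runs);
  const margin = z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs));
  return (centre - margin) / (1 + z2 / runs);
}

function summarizeRuns(outcomes: RunOutcome[]): RunSummary {
  const passes = outcomes.filter((o) => o.passed).length;
  const failuresByReason: Record<string, number> = {};
  for (const o of outcomes) {
    if (!o.passed) {
      const reason = o.failureReason || "unclassified";
      failuresByReason[reason] = (failuresByReason[reason] || 0) + 1;
    }
  }
  return {
    runs: outcomes.length,
    passRate: outcomes.length === 0 ? 0 : passes / outcomes.length,
    passRateLowerBound: wilsonLowerBound(passes, outcomes.length),
    failuresByReason,
  };
}
```

Under these assumptions, 20 passes out of 20 runs still yields a lower bound of only about 0.84, which is a useful reminder that a clean batch narrows the uncertainty without removing it.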
3. Humans still belong in high-stakes flows
The industry keeps trying to remove humans entirely from the loop. That works until the agent reaches a system capable of causing irreversible business impact.
Anything involving billing, legal content, database-of-record mutations, production infrastructure, customer-visible messaging, compliance workflows, or financial actions should still produce a reviewable artifact.
The agent performs the work. A human approves the diff. That division matters.
The highest-value role of AI in QA is acceleration, not unchecked authority.
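The split between agent work and human approval can be sketched as a routing step. The category names below mirror the list above, while `low-risk`, `ProposedAction`, and `routeAction` are hypothetical names introduced for illustration; a real gate would also need to persist the artifact and record who approved it.

```typescript
type ActionCategory =
  | "billing"
  | "legal"
  | "database-of-record"
  | "production-infrastructure"
  | "customer-messaging"
  | "compliance"
  | "financial"
  | "low-risk";

interface ProposedAction {
  category: ActionCategory;
  summary: string;
  diff: string; // the change the agent wants to make, in reviewable form
}

type Routing =
  | { status: "auto-apply"; action: ProposedAction }
  | { status: "needs-human-approval"; action: ProposedAction; reviewArtifact: string };

// Anything outside the explicit low-risk category is held for a human.
function routeAction(action: ProposedAction): Routing {
  if (action.category === "low-risk") {
    return { status: "auto-apply", action };
  }
  const reviewArtifact = [
    `Category: ${action.category}`,
    `Summary: ${action.summary}`,
    "Proposed diff:",
    action.diff,
  ].join("\n");
  return { status: "needs-human-approval", action, reviewArtifact };
}
```

The default matters here: the sketch requires approval unless an action is explicitly marked low-risk, so a new, uncategorized kind of action cannot slip through unreviewed.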
The missing state in CI pipelines
Traditional CI systems only understand two outcomes:
- Ship it.
- Do not ship it.
Agent workflows increasingly demand a third state: ship it, but monitor this carefully.
That middle category is difficult because most engineering organizations are optimized around binary release confidence. But probabilistic systems naturally produce probabilistic confidence.
A workflow may pass 97% of runs, fail only under rare retrieval combinations, degrade under context truncation, or occasionally hallucinate edge-case outputs. None of those map cleanly to a red/green pipeline.
The tooling ecosystem still has not solved this well.
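In the absence of mature tooling, the third state can at least be modeled explicitly. This is a minimal sketch, not a recommendation: the `ReleaseState` names, the `gateRelease` function, and especially the threshold values are assumptions chosen for the example, and real thresholds would depend on the workflow's risk.

```typescript
type ReleaseState = "ship" | "ship-and-monitor" | "do-not-ship";

interface GateInput {
  deterministicChecksPassed: boolean; // strict assertions stay binary
  observedPassRate: number; // fraction of repeated agent runs that passed, 0..1
}

interface GateThresholds {
  shipAtOrAbove: number;
  monitorAtOrAbove: number;
}

// Illustrative defaults only; these numbers are not derived from any standard.
const exampleThresholds: GateThresholds = { shipAtOrAbove: 0.99, monitorAtOrAbove: 0.95 };

function gateRelease(
  input: GateInput,
  thresholds: GateThresholds = exampleThresholds,
): ReleaseState {
  // A deterministic failure is never softened by a good pass rate.
  if (!input.deterministicChecksPassed) return "do-not-ship";
  if (input.observedPassRate >= thresholds.shipAtOrAbove) return "ship";
  if (input.observedPassRate >= thresholds.monitorAtOrAbove) return "ship-and-monitor";
  return "do-not-ship";
}
```

With these example thresholds, a workflow that passes 97% of runs lands in `ship-and-monitor` instead of being forced into red or green.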
The real shift happening in QA
The biggest misconception about AI-driven testing is that the challenge is model quality.
It is not.
The real challenge is epistemology.
Deterministic systems let us ask “is this correct?” Agent systems force us to ask “how confident are we that this remains acceptable under variance?”
That is a fundamentally different discipline. It changes how assertions are written, how failures are classified, how pipelines gate releases, and how trust is established in automation itself.
The tooling is still immature. The standards barely exist. Most teams are improvising. But one thing already seems clear:
The future of QA is not fully deterministic automation. It is controlled probabilistic systems with carefully designed human oversight.
If you have shipped something in this space, even partially, the contact form on this site routes straight to me. I would love to compare notes.