The Reproducibility Problem
Run the same agent on the same task twice and you might get different results. The model might choose different words, make different tool-call decisions, or explore different reasoning paths. This non-determinism is a feature of language models (it enables creativity and natural-sounding output), but it is a challenge for production systems that need predictable behavior.
When a customer reports that an agent produced the wrong result, you want to reproduce the issue. But if the agent doesn't behave the same way twice, reproduction becomes difficult. The bug might not manifest on the second run, leaving you unable to diagnose or fix it.
Sources of Non-Determinism
The primary source of non-determinism is the model's temperature parameter. At temperature zero, model outputs are nearly deterministic, though not perfectly: floating-point arithmetic is non-associative, so results can still vary across hardware and batching configurations. At higher temperatures, outputs vary more with each run.
Tool results can also introduce non-determinism. If the agent queries a database and the data has changed since the last run, the agent receives different information and may make different decisions. Web searches return different results at different times. Even file contents might change between runs.
Context order effects contribute as well. When multiple MCP servers return results, the order in which those results appear in the context can influence the model's decisions. Small changes in context ordering can lead to different reasoning paths.
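One mitigation is to canonicalize the ordering yourself before results enter the context. A minimal sketch, assuming each result arrives as a dict with hypothetical "server" and "payload" fields:

    import json

    def canonicalize_results(results: list[dict]) -> list[dict]:
        # Sort tool results by a stable key so the context is assembled
        # in the same order on every run, regardless of which server
        # responded first. "server" and "payload" are assumed field names.
        return sorted(
            results,
            key=lambda r: (r["server"], json.dumps(r["payload"], sort_keys=True)),
        )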
When Determinism Matters
Not every agent application needs deterministic behavior. Creative tasks, exploratory research, and brainstorming benefit from variability. The model's ability to take different approaches on different runs can surface insights that a deterministic system would miss.
Determinism matters when the agent's output feeds into downstream systems, when results need to be auditable, or when consistency is a user expectation. Financial calculations, report generation, and data processing pipelines all benefit from more deterministic agent behavior; in these settings, "it worked yesterday but gives different results today" is unacceptable.
Techniques for Increasing Consistency
Setting temperature to zero (or near zero) is the most direct approach. This approximates greedy decoding: the model picks the highest-probability token at each step, which reduces output variability. The tradeoff is less creative output, which may or may not matter for your use case.
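As a concrete illustration, here is how that looks with the OpenAI Python client; other providers expose equivalent parameters, and the model name and prompt are purely illustrative:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; use whatever model your agent runs on
        messages=[{"role": "user", "content": "Summarize yesterday's sales data."}],
        temperature=0,   # always pick the highest-probability token
        seed=42,         # best-effort reproducibility where the provider supports it
    )
    print(response.choices[0].message.content)

Even with these settings, treat determinism as best-effort: the provider may change model versions or serving infrastructure underneath you.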
Structured output formatting reduces variability by constraining what the model can produce. Instead of free-form text responses, require the agent to output structured data (JSON, specific templates) that downstream systems can parse reliably.
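A minimal sketch of schema enforcement using Pydantic; the schema fields here are invented for illustration:

    from pydantic import BaseModel, ValidationError

    class ReportSummary(BaseModel):
        # Hypothetical schema for a report-generation agent.
        title: str
        total_usd: float
        row_count: int

    def parse_agent_output(raw_json: str) -> ReportSummary:
        # Reject anything that does not match the schema rather than
        # passing free-form text to downstream systems.
        try:
            return ReportSummary.model_validate_json(raw_json)
        except ValidationError as err:
            raise ValueError(f"agent output failed schema check: {err}") from err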
Caching tool results ensures that repeated runs see the same data. If the agent queries a database at step 3, caching that result means subsequent runs of the same task see identical data, reducing one source of variability.
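One way to implement this is to memoize each tool call on its name and arguments, as in the sketch below; persisting the cache to disk would let you replay a run in a separate process:

    import functools
    import json

    _tool_cache: dict[str, str] = {}

    def cached_tool(fn):
        # Memoize a tool call so replayed runs see exactly the data
        # captured on the first run.
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = fn.__name__ + ":" + json.dumps([args, kwargs], sort_keys=True)
            if key not in _tool_cache:
                _tool_cache[key] = fn(*args, **kwargs)
            return _tool_cache[key]
        return wrapper

    @cached_tool
    def query_database(sql: str) -> str:
        ...  # the real database call would go here (hypothetical tool)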
Explicit planning before execution helps by committing the agent to a specific approach early. Once the plan is set, the execution steps are more constrained and therefore more consistent.
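A minimal sketch of the pattern, where llm stands in for whatever completion call your agent framework provides:

    def run_with_plan(task: str, llm) -> str:
        # Phase 1: commit to a plan before touching any tools.
        plan = llm(f"Write a short numbered plan for this task: {task}")
        # Phase 2: execute each step against the now-fixed plan.
        results = []
        for step in plan.splitlines():
            if step.strip():
                results.append(llm(f"Plan:\n{plan}\n\nExecute only this step: {step}"))
        return "\n".join(results)

Generating the plan at temperature zero and logging it also gives you an artifact to diff when two runs diverge.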
Testing Non-Deterministic Systems
Testing agent-based systems requires a different approach than testing deterministic software. Instead of checking for exact output matches, test for output properties: does the result contain the required information? Is it in the correct format? Does it meet quality criteria?
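A sketch of such a property check, with required fields invented for illustration:

    import json

    def output_property_failures(output: str) -> list[str]:
        # Return a list of violated properties instead of asserting an
        # exact string match against a golden output.
        failures = []
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return ["output is not valid JSON"]
        if "summary" not in data:
            failures.append("missing required 'summary' field")
        if not isinstance(data.get("total"), (int, float)):
            failures.append("'total' is not numeric")
        return failures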
Statistical testing (running the same test case multiple times and checking that the success rate meets a threshold) is more appropriate than single-run pass/fail testing. If an agent produces correct results 95 out of 100 times, that tells you more than a single test run that happened to pass or fail.
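In code, that might look like the following sketch, where run_once is any zero-argument callable that returns True on success:

    def success_rate(run_once, trials: int = 100) -> float:
        # Run the same test case repeatedly and report the pass fraction.
        passes = sum(1 for _ in range(trials) if run_once())
        return passes / trials

    # Example: require at least a 95% pass rate over 100 trials.
    # assert success_rate(my_agent_test) >= 0.95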