Agent Behavior Is Software
An AI agent's behavior is determined by its system prompt, its tool configurations, its guardrail rules, its memory contents, and the model it runs on. Change any of these and you've changed the agent. But unlike application code, these components often live outside version control: prompts in a dashboard, tool configs in an untracked JSON file, guardrails in a separate system.
This makes it hard to answer basic questions. What changed between yesterday and today? When did this specific behavior start? Can we go back to how the agent behaved last week? If your agent's components aren't version-controlled, you're flying blind.
What to Version Control
System prompts and instruction sets: these are the most common source of behavior changes. A small prompt tweak can dramatically change how an agent handles certain situations. Every prompt version should be tracked with a timestamp and a description of why it was changed.
Tool configurations: which MCP servers are connected, what permissions they have, what parameters they use. Adding or removing a tool changes the agent's capabilities. Changing a tool's configuration changes how the agent uses that capability.
Guardrail rules: the boundaries that constrain the agent's behavior. When you loosen or tighten a guardrail, that's a behavior change that should be tracked.
Model version: upgrading from one model version to another can change behavior even if nothing else changes. Track which model version is in use and when it was changed.
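One way to make "a behavior version" concrete is to bundle the four components above into a single record and derive a short fingerprint from it. The sketch below is illustrative only: the BehaviorVersion class, its field names, and the changed_components helper are hypothetical, not part of any particular agent framework, and the example prompt and model names are made up.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class BehaviorVersion:
    """Hypothetical bundle of the components that define agent behavior."""
    system_prompt: str
    tool_config: dict
    guardrails: dict
    model_version: str

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the same content always hashes the same.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def changed_components(old: BehaviorVersion, new: BehaviorVersion) -> list:
    """Return the names of the components that differ between two versions."""
    return [
        f.name for f in fields(BehaviorVersion)
        if getattr(old, f.name) != getattr(new, f.name)
    ]


# Illustrative values only.
v1 = BehaviorVersion(
    system_prompt="You are a support agent.",
    tool_config={"servers": ["tickets"]},
    guardrails={"max_refund": 100},
    model_version="model-a",
)
v2 = BehaviorVersion(
    system_prompt="You are a support agent.",
    tool_config={"servers": ["tickets"]},
    guardrails={"max_refund": 100},
    model_version="model-b",
)

print(changed_components(v1, v2))
```

Logging the fingerprint alongside each agent run makes it possible to tie an observed behavior back to the exact combination of prompt, tools, guardrails, and model that produced it.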
Practical Implementation
The simplest approach: store all agent configuration in a Git repository. Prompts as text files, tool configs as JSON or YAML, guardrails as code or config. Every change goes through a commit with a descriptive message. You get full history, diff capabilities, branching for experiments, and rollback for free.
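As a rough sketch of what this can look like from the agent's side, the loader below reads everything from one checked-out directory. The layout (prompts/system.txt, tools.json, guardrails.json, and a MODEL_VERSION file) and the load_agent_config function are assumptions made for illustration; any layout works as long as every component lives in the repository.

```python
import json
from pathlib import Path


def load_agent_config(repo_root) -> dict:
    """Load all behavior-defining files from a checked-out config repository.

    Assumed (hypothetical) layout:
        prompts/system.txt   system prompt as plain text
        tools.json           tool and MCP server configuration
        guardrails.json      guardrail rules
        MODEL_VERSION        one line naming the model version
    """
    root = Path(repo_root)
    return {
        "system_prompt": (root / "prompts" / "system.txt").read_text(encoding="utf-8"),
        "tools": json.loads((root / "tools.json").read_text(encoding="utf-8")),
        "guardrails": json.loads((root / "guardrails.json").read_text(encoding="utf-8")),
        "model_version": (root / "MODEL_VERSION").read_text(encoding="utf-8").strip(),
    }
```

Because the agent only ever reads from this directory, checking out an older commit and restarting is enough to restore last week's behavior.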
For teams, use pull requests for agent behavior changes just like you would for code changes. "I'm changing the system prompt to handle edge case X better" should be reviewed by someone before it hits production. This catches unintended side effects before they reach users.
Testing Behavior Versions
Version control alone tells you what changed, but not whether the change was an improvement. Pair it with evaluation. Before deploying a new behavior version, run it against a test suite of representative tasks and compare results to the previous version. Did accuracy go up? Did failure rate go down? Did any previously working scenarios break?
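The comparison itself can be very small. The sketch below assumes each evaluation run has already been reduced to a mapping from task name to pass/fail; the compare_runs function and the task names in the usage note are hypothetical, and how you actually run the tasks is up to your evaluation tooling.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Compare pass/fail results of two behavior versions on the same tasks.

    Both arguments map a task name to True (passed) or False (failed).
    Only tasks present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no tasks in common")

    def accuracy(run: dict) -> float:
        return sum(run[task] for task in shared) / len(shared)

    return {
        "baseline_accuracy": accuracy(baseline),
        "candidate_accuracy": accuracy(candidate),
        # Passed before, fails now: these should block a deploy.
        "regressions": [t for t in shared if baseline[t] and not candidate[t]],
        # Failed before, passes now.
        "fixes": [t for t in shared if not baseline[t] and candidate[t]],
    }
```

For example, if the previous version passes "refund", "escalate", and "lookup" but fails "greeting", and the new version fixes "greeting" while breaking "escalate", both score 0.75 accuracy. The regressions list is what reveals that a previously working scenario broke, which is why it is worth reporting separately from the aggregate numbers.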
Search for agent evaluation tools on Skillful.sh to find testing frameworks that can run these comparisons. Combining version control with automated evaluation gives you evidence, not just hope, that a change is an improvement.