Expect Failure at Every Step
Every external call an AI agent makes can fail. The MCP server might be down. The API might rate-limit you. The model might return malformed output. The database might be unreachable. If your agent workflow only works when everything goes right, it doesn't really work at all.
Fault tolerance means designing for the reality that things break. Not as an edge case, but as a normal operating condition. The question isn't "will this fail?" It's "when this fails, what happens next?"
Retry With Backoff
The simplest resilience pattern is retrying failed operations with exponential backoff. API returned a 503? Wait a second and try again. Still failing? Wait four seconds. Then sixteen. Most transient failures resolve themselves within a few retries, and backoff prevents you from hammering an already-struggling service.
The key is knowing what to retry and what not to retry. A 503 (service unavailable) is worth retrying. A 400 (bad request) is not, because sending the same bad request again won't produce a different result. A timeout might be worth one retry. An authentication error needs human intervention, not retries.
Checkpointing and Recovery
For long-running agent workflows, checkpointing saves progress at key milestones. If the agent crashes after completing steps 1 through 7 of a 10-step workflow, it should be able to resume from step 8 instead of starting over. This is especially important for workflows that make external changes (sending emails, creating files, calling APIs) because you don't want to duplicate those side effects.
Good checkpointing records both the agent's state and the results of completed steps. When the agent restarts, it loads the checkpoint, sees what's already done, and picks up where it left off. Several agent frameworks include built-in checkpointing support.
Fallback Strategies
When the primary approach fails, a fallback gives the agent an alternative path. Can't reach the primary database? Try the read replica. Model refusing to generate the output? Try rephrasing the prompt. Web search returning nothing? Try a different search query. Each fallback adds resilience at the cost of some complexity.
The most important fallback is escalation to a human. When the agent has exhausted its automated recovery options, it should surface the problem clearly instead of spinning in circles. A well-designed escalation path is the ultimate fallback.
Circuit Breakers
If an external service is consistently failing, a circuit breaker stops the agent from repeatedly trying to use it. After N consecutive failures, the circuit "opens" and the agent skips that service entirely (using a fallback) for a cooling period. This prevents cascading failures and gives the external service time to recover.
Related Reading
- How AI Agents Handle Partial Failures Gracefully
- The Architecture of Self-Healing AI Agent Systems
- How AI Agents Learn From Failed Task Attempts
Explore agent frameworks on Skillful.sh. Browse MCP servers.