All or Nothing Is the Wrong Default
An agent is compiling a research report. It successfully gathers data from three sources, then the fourth source times out. The default behavior in most systems is to throw an error and return nothing. But the data from those first three sources is still valuable. A well-designed agent returns what it has and clearly notes what's missing.
Partial failure handling is the difference between an agent that's useful in imperfect conditions (which is most conditions) and an agent that only works when everything goes perfectly. Real-world environments are messy: APIs go down, data sources are flaky, network connections drop. Agents that handle this gracefully are agents that actually get used.
Preserving Completed Work
The first principle is simple: never throw away completed work. If the agent processed 7 out of 10 files before hitting an error on file 8, those 7 results should be preserved and available. This requires the agent to write results incrementally rather than waiting until everything is done to produce output.
Incremental result storage pairs well with the checkpointing patterns discussed earlier. Each completed subtask writes its result to persistent storage. If the workflow fails partway through, the completed results are still there, and the workflow can be resumed from where it stopped.
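A minimal sketch of this pattern, assuming a hypothetical `results.jsonl` checkpoint file and a caller-supplied `process` function: each result is written as soon as it is produced, and a rerun skips anything already on disk.

```python
import json
from pathlib import Path

CHECKPOINT = Path("results.jsonl")  # hypothetical checkpoint file

def load_completed():
    """Read back results persisted by any previous run."""
    done = {}
    if CHECKPOINT.exists():
        for line in CHECKPOINT.read_text().splitlines():
            rec = json.loads(line)
            done[rec["item"]] = rec["result"]
    return done

def process_all(items, process):
    """Process items one by one, persisting each result immediately."""
    done = load_completed()
    with CHECKPOINT.open("a") as f:
        for item in items:
            if item in done:
                continue  # completed in an earlier run; don't redo the work
            try:
                result = process(item)
            except Exception as e:
                # Stop here; everything before this point is already saved.
                return done, {"failed_item": item, "error": str(e)}
            f.write(json.dumps({"item": item, "result": result}) + "\n")
            f.flush()  # make sure the result survives a crash
            done[item] = result
    return done, None
```

On failure the caller gets both the completed results and a record of where processing stopped, which is exactly the information a resume needs.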
Communicating What's Missing
Partial results without context are dangerous. If the agent returns a report based on 3 out of 4 data sources without mentioning the missing source, the user might make decisions based on incomplete data without knowing it. Clear communication is essential: "This report includes data from sources A, B, and C. Source D was unavailable. Results may be incomplete."
The format matters too. A structured response that separates "completed" from "failed" subtasks is more useful than a caveat buried somewhere in a paragraph of output. Users should be able to see at a glance what succeeded and what didn't.
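One way to sketch such a structure, using a hypothetical `PartialResult` container that keeps succeeded and failed sources separate and renders the caveat explicitly:

```python
from dataclasses import dataclass, field

@dataclass
class PartialResult:
    """Structured report separating what succeeded from what didn't."""
    completed: dict = field(default_factory=dict)  # source -> data
    failed: dict = field(default_factory=dict)     # source -> error message

    @property
    def is_complete(self) -> bool:
        return not self.failed

    def summary(self) -> str:
        """Human-readable status the agent can attach to its output."""
        lines = [f"Completed: {', '.join(self.completed) or 'none'}"]
        if self.failed:
            lines.append(f"Unavailable: {', '.join(self.failed)}")
            lines.append("Results may be incomplete.")
        return "\n".join(lines)
```

Because the failures are a first-class field rather than prose, downstream code can branch on `is_complete` while the user still gets a readable summary.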
Retry and Continue Patterns
After a partial failure, the agent has three options: retry the failed portion, skip it and continue, or stop and report. The right choice depends on the nature of the failure and the task's requirements. If the failed data source is critical, retry. If it's supplementary, skip and note the gap. If the user specifically needs all sources, stop and report.
Smart retry logic considers whether retrying is likely to succeed. A timeout on a slow API might resolve with a retry. A 404 error won't. Retrying a permanent failure wastes time and should be avoided. Agent frameworks with good error handling make these distinctions automatically.
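The retryable/permanent distinction can be sketched roughly as follows, assuming a hypothetical `SourceError` carrying an HTTP-style status code and a caller-supplied `fetch` function; the set of retryable codes here is a common convention, not a standard:

```python
import time

# Transient server errors and rate limits are worth retrying;
# client errors like 404 are permanent and should fail fast.
RETRYABLE = {408, 429, 500, 502, 503, 504}

class SourceError(Exception):
    def __init__(self, status):
        super().__init__(f"source returned {status}")
        self.status = status

def fetch_with_retry(fetch, source, attempts=3, base_delay=0.5):
    """Retry transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch(source)
        except SourceError as e:
            if e.status not in RETRYABLE:
                raise  # a 404 won't fix itself; don't waste time
            if attempt == attempts - 1:
                raise  # out of attempts; surface the transient error
            time.sleep(base_delay * 2 ** attempt)
```

The exponential backoff gives a slow API time to recover between attempts, while the early `raise` on permanent errors keeps the agent from burning its time budget on requests that can never succeed.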
Degraded Quality Signals
Some systems assign a confidence score to partial results. "This analysis has a confidence of 75% because it's missing data from one key source." This lets the user decide whether the partial result is good enough for their needs or whether they need to wait for a complete result. It's a more nuanced signal than simply "done" or "failed."
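A simple version of such a score, assuming hypothetical per-source weights (equal weights here; a key source could be weighted higher): confidence is the fraction of total weight covered by the sources that succeeded.

```python
# Hypothetical weights reflecting how much each source contributes.
SOURCE_WEIGHTS = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

def confidence(completed_sources, weights=SOURCE_WEIGHTS):
    """Fraction of total source weight covered by completed sources."""
    total = sum(weights.values())
    covered = sum(w for s, w in weights.items() if s in completed_sources)
    return covered / total
```

With equal weights, three of four sources yields a confidence of 0.75, matching the example above; raising the weight on a critical source makes its absence drag the score down harder.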
You can find tools that help implement quality scoring by searching on Skillful.sh. The combination of partial result preservation plus quality signals makes agents substantially more useful in real-world conditions.