Every agent will fail. The question isn't whether — it's how gracefully.
Most agent frameworks treat errors as exceptions. Something breaks, the pipeline stops, a human gets paged. This works for traditional software. For agents, it's a design failure.
Why Agents Fail Differently
Traditional software fails on known boundaries: network timeouts, null references, type mismatches. Agent failures are semantic — the output looks valid but means the wrong thing.
Three categories of agent failure:
- Hard failures: The agent throws an exception or returns nothing. Easy to catch, easy to handle.
- Soft failures: The agent returns a result that's structurally correct but semantically wrong. A refactored function that compiles but changes behavior.
- Cascading failures: A soft failure in step 3 corrupts the context for steps 4-7, producing a chain of increasingly wrong outputs.
Soft and cascading failures account for 78% of agent incidents in our production systems.
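The taxonomy above can be sketched as a per-step classifier. The type shapes and field names here are illustrative assumptions, not the production schema; note that cascading failures are a property of the pipeline as a whole, so a single-output classifier can only flag the hard and soft cases.

```typescript
// Hypothetical output shape for illustration; real agent outputs will differ.
type AgentOutput = { raw: string | null; passesChecks: boolean };

type FailureCategory = "hard" | "soft" | "none";

// Classify one step's output. Hard failures are structural (missing or
// empty); soft failures are structurally valid but fail semantic checks.
function classifyFailure(output: AgentOutput | null): FailureCategory {
  if (output === null || output.raw === null) return "hard"; // exception / empty result
  if (!output.passesChecks) return "soft"; // valid shape, wrong meaning
  return "none";
}
```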
The Self-Healing Pattern
```typescript
interface RecoveryStrategy {
  detect: (output: AgentOutput) => ValidationResult;
  classify: (error: ValidationError) => ErrorCategory;
  recover: (category: ErrorCategory, context: PipelineContext) => RecoveryAction;
  escalate: (failure: UnrecoverableError) => EscalationPath;
}
```
The key insight: detection must be separate from execution. The agent that produced the output cannot reliably validate it. You need a second perspective — either a different agent, a rule-based validator, or a deterministic check.
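One minimal way to enforce that separation is to make the producer and the validator distinct functions, where the validator only ever sees the finished output. This is a sketch under that assumption; `Validator` and `produceAndValidate` are illustrative names, not part of the interface above.

```typescript
type ValidationResult = { ok: boolean; reason?: string };

// A validator is any independent check over the produced output.
type Validator<T> = (output: T) => ValidationResult;

// Run a producer, then validate its output with a second perspective.
// The validator receives only the output, never the producer's state.
function produceAndValidate<T>(
  produce: () => T,
  validate: Validator<T>,
): { output: T; result: ValidationResult } {
  const output = produce();
  return { output, result: validate(output) };
}
```

The same shape works whether `validate` wraps a rule-based check, a deterministic test run, or a call to a second agent.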
Detection Layers
We use three layers of detection, each catching what the previous missed:
Layer 1: Schema Validation
Does the output match the expected structure? This catches hard failures and obvious malformations. Fast, cheap, catches ~40% of issues.
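As a sketch of Layer 1: a hand-rolled structural check for a hypothetical refactoring task. In practice a schema library would replace this; the `RefactorOutput` shape is an assumption for illustration.

```typescript
// Assumed output shape for a refactoring agent (illustrative only).
interface RefactorOutput {
  file: string;
  patch: string;
}

// Type guard: does an untrusted value match the expected structure?
function validateSchema(value: unknown): value is RefactorOutput {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.file === "string" && typeof v.patch === "string";
}
```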
Layer 2: Semantic Checks
Does the output make sense in context? If the agent was asked to refactor a function, does the output still pass the same tests? This catches soft failures. Slower, more expensive, catches ~35% of issues.
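For the refactoring example, Layer 2 can reduce to comparing test results before and after the change. `TestReport` is a stand-in for whatever summary the pipeline's test runner emits; the comparison rule below is one reasonable policy, not the only one.

```typescript
// Summary of a test-suite run (illustrative shape).
type TestReport = { passed: number; failed: number };

// Semantic check for a refactor: the change must not break any test
// that previously passed, and must introduce no new failures.
function behaviorPreserved(before: TestReport, after: TestReport): boolean {
  return after.failed === 0 && after.passed >= before.passed;
}
```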
Layer 3: Cross-Agent Review
A separate agent reviews the output with access to the original intent. This catches subtle semantic drift. Most expensive, catches the remaining ~25%.
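A Layer 3 sketch: the reviewer sees the original intent and the proposed output, never the producer's reasoning. `callReviewer` is a hypothetical stand-in for an LLM call, injected so the review logic stays testable.

```typescript
interface ReviewVerdict {
  approved: boolean;
  notes: string;
}

// Ask an independent reviewer whether the output satisfies the intent.
// The reviewer is passed in, so any agent or stub can fill the role.
function reviewOutput(
  intent: string,
  output: string,
  callReviewer: (prompt: string) => ReviewVerdict,
): ReviewVerdict {
  const prompt =
    `Original intent:\n${intent}\n\n` +
    `Proposed output:\n${output}\n\n` +
    `Does the output satisfy the intent?`;
  return callReviewer(prompt);
}
```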
Recovery Actions
Not all failures deserve the same response:
| Error Category | Action | Cost |
|---|---|---|
| Schema violation | Retry with stricter constraints | Low |
| Semantic drift | Retry with enriched context | Medium |
| Context corruption | Rollback to last checkpoint, re-execute | High |
| Fundamental misunderstanding | Escalate to human | Highest |
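The table above can be expressed as a dispatch function. Category and action names mirror the table; the action payloads are placeholders for whatever the pipeline actually executes.

```typescript
type ErrorCategory =
  | "schema-violation"
  | "semantic-drift"
  | "context-corruption"
  | "misunderstanding";

type RecoveryAction =
  | { kind: "retry"; hint: string }
  | { kind: "rollback" } // restore last checkpoint, then re-execute
  | { kind: "escalate" }; // hand off to a human

// Map each failure category to its recovery action, cheapest first.
function recoveryFor(category: ErrorCategory): RecoveryAction {
  switch (category) {
    case "schema-violation":
      return { kind: "retry", hint: "stricter output constraints" };
    case "semantic-drift":
      return { kind: "retry", hint: "enriched context" };
    case "context-corruption":
      return { kind: "rollback" };
    case "misunderstanding":
      return { kind: "escalate" };
  }
}
```

Keeping the mapping in one exhaustive `switch` means the compiler flags any category added later without a corresponding action.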
Results
After implementing self-healing pipelines across our agent infrastructure:
- Unrecoverable failures dropped from 23% to 3%
- Mean time to recovery: 45 seconds (was: manual, 15+ minutes)
- Human escalation rate: 1 in 50 tasks (was: 1 in 3)
"The goal isn't zero failures. It's zero unhandled failures."
The system doesn't prevent agents from making mistakes. It makes mistakes cheap and recovery automatic. That's the difference between a fragile pipeline and a resilient system.
