
Agent Error Recovery: Building Self-Healing Pipelines

Practical patterns for agents that detect, classify, and recover from failures without human intervention.

February 24, 2026 · 8 min read

Every agent will fail. The question isn't whether — it's how gracefully.

Most agent frameworks treat errors as exceptions. Something breaks, the pipeline stops, a human gets paged. This works for traditional software. For agents, it's a design failure.

Why Agents Fail Differently

Traditional software fails at known boundaries: network timeouts, null references, type mismatches. Agent failures are semantic — the output looks valid but means the wrong thing.

Three categories of agent failure:

  • Hard failures: The agent throws an exception or returns nothing. Easy to catch, easy to handle.
  • Soft failures: The agent returns a result that's structurally correct but semantically wrong. A refactored function that compiles but changes behavior.
  • Cascading failures: A soft failure in step 3 corrupts the context for steps 4-7, producing a chain of increasingly wrong outputs.

Soft and cascading failures account for 78% of agent incidents in our production systems.
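One way to make this taxonomy concrete is a discriminated union plus a classifier. This is an illustrative sketch — the type names, the `classify` signature, and the assumption that a mid-pipeline soft failure poisons all downstream steps are mine, not from any particular framework:

```typescript
// Illustrative failure taxonomy (hypothetical names, not a real framework API).
type AgentFailure =
  | { kind: "hard"; error: Error }                              // exception or empty output
  | { kind: "soft"; output: string; reason: string }            // structurally valid, semantically wrong
  | { kind: "cascading"; originStep: number; affectedSteps: number[] };

// Classify a raw pipeline result into one of the three categories.
// Returns null on success.
function classify(
  output: string | null,
  checksPassed: boolean,
  step: number,
  totalSteps: number,
): AgentFailure | null {
  if (output === null) {
    return { kind: "hard", error: new Error(`step ${step} returned nothing`) };
  }
  if (!checksPassed) {
    // A soft failure mid-pipeline corrupts the context of every later step.
    if (step < totalSteps) {
      const affectedSteps: number[] = [];
      for (let s = step + 1; s <= totalSteps; s++) affectedSteps.push(s);
      return { kind: "cascading", originStep: step, affectedSteps };
    }
    return { kind: "soft", output, reason: "failed semantic checks" };
  }
  return null;
}
```

The discriminated `kind` field lets downstream recovery code switch exhaustively over failure types instead of string-matching error messages.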

The Self-Healing Pattern

interface RecoveryStrategy {
  detect: (output: AgentOutput) => ValidationResult;
  classify: (error: ValidationError) => ErrorCategory;
  recover: (category: ErrorCategory, context: PipelineContext) => RecoveryAction;
  escalate: (failure: UnrecoverableError) => EscalationPath;
}

The key insight: detection must be separate from execution. The agent that produced the output cannot reliably validate it. You need a second perspective — either a different agent, a rule-based validator, or a deterministic check.
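The deterministic-check option can be as simple as a rule-based validator that never shares code with the producing agent. A minimal sketch, with assumed `AgentOutput` and `ValidationResult` shapes:

```typescript
// Sketch of detection separated from execution (types are assumptions).
type AgentOutput = { code: string };
type ValidationResult = { valid: boolean; reason?: string };

// The producing agent never validates itself; an independent,
// deterministic rule does. Example rule: output must be non-empty
// with balanced braces.
function deterministicValidator(output: AgentOutput): ValidationResult {
  if (output.code.trim().length === 0) {
    return { valid: false, reason: "empty output" };
  }
  let depth = 0;
  for (const ch of output.code) {
    if (ch === "{") depth++;
    if (ch === "}") depth--;
    if (depth < 0) return { valid: false, reason: "unbalanced braces" };
  }
  return depth === 0 ? { valid: true } : { valid: false, reason: "unbalanced braces" };
}
```

Because the check is deterministic, it cannot inherit the producer's blind spots — it either passes or it doesn't, regardless of how plausible the output looks.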

Detection Layers

We use three layers of detection, each catching what the previous missed:

Layer 1: Schema Validation

Does the output match the expected structure? This catches hard failures and obvious malformations. Fast, cheap, catches ~40% of issues.
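A hand-rolled type guard is enough to sketch this layer — no schema library assumed, and the `RefactorOutput` shape is illustrative:

```typescript
// Layer 1 sketch: structural validation via a type guard.
// The RefactorOutput shape is a hypothetical example.
interface RefactorOutput {
  file: string;
  patch: string;
  testsToRun: string[];
}

function matchesSchema(x: unknown): x is RefactorOutput {
  if (typeof x !== "object" || x === null) return false;
  const o = x as { file?: unknown; patch?: unknown; testsToRun?: unknown };
  return (
    typeof o.file === "string" &&
    typeof o.patch === "string" &&
    Array.isArray(o.testsToRun) &&
    o.testsToRun.every((t: unknown) => typeof t === "string")
  );
}
```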

Layer 2: Semantic Checks

Does the output make sense in context? If the agent was asked to refactor a function, does the output still pass the same tests? This catches soft failures. Slower, more expensive, catches ~35% of issues.
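For the refactoring case, the semantic check reduces to behavioral equivalence: the refactored function must pass every test the original passed. A minimal sketch, with assumed `Fn` and `TestCase` types:

```typescript
// Layer 2 sketch: behavioral check that a refactor preserves observable
// behavior. The Fn and TestCase shapes are illustrative assumptions.
type Fn = (n: number) => number;
interface TestCase {
  input: number;
  expected: number;
}

// True only if the refactored function matches expected output on every case.
function behaviorPreserved(refactored: Fn, tests: TestCase[]): boolean {
  return tests.every((t) => refactored(t.input) === t.expected);
}
```

In practice the test cases would come from the project's existing suite; here they can be derived by recording the original function's outputs before the refactor.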

Layer 3: Cross-Agent Review

A separate agent reviews the output with access to the original intent. This catches subtle semantic drift. Most expensive, catches the remaining ~25%.
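Structurally, the reviewer only needs the original intent and the produced output — not the producer's reasoning. In this sketch, `reviewAgent` is a hypothetical stand-in for a real model call, and the prompt format is an assumption:

```typescript
// Layer 3 sketch: a second agent reviews output against the original intent.
// `reviewAgent` stands in for a real model call; everything here is illustrative.
interface ReviewVerdict {
  approved: boolean;
  concerns: string[];
}

function crossAgentReview(
  intent: string,
  output: string,
  reviewAgent: (prompt: string) => ReviewVerdict,
): ReviewVerdict {
  // The reviewer sees intent and output, but not the producer's reasoning,
  // so it cannot inherit the producer's blind spots.
  const prompt = `Intent:\n${intent}\n\nOutput:\n${output}\n\nDoes the output satisfy the intent?`;
  return reviewAgent(prompt);
}
```

Keeping the reviewer behind a function parameter makes it trivial to swap in a different model, or a deterministic stub in tests.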

Recovery Actions

Not all failures deserve the same response:

Error Category | Action | Cost
--- | --- | ---
Schema violation | Retry with stricter constraints | Low
Semantic drift | Retry with enriched context | Medium
Context corruption | Roll back to last checkpoint, re-execute | High
Fundamental misunderstanding | Escalate to human | Highest
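The table maps directly onto a dispatcher. Category and action names below are illustrative, chosen to mirror the table rather than taken from any specific codebase:

```typescript
// Recovery dispatch sketch (names are illustrative).
type ErrorCategory =
  | "schema_violation"
  | "semantic_drift"
  | "context_corruption"
  | "misunderstanding";

type RecoveryAction =
  | { type: "retry"; adjustment: "stricter_constraints" | "enriched_context" }
  | { type: "rollback_and_retry"; checkpoint: string }
  | { type: "escalate_to_human" };

function recover(category: ErrorCategory, lastCheckpoint: string): RecoveryAction {
  switch (category) {
    case "schema_violation":
      return { type: "retry", adjustment: "stricter_constraints" };
    case "semantic_drift":
      return { type: "retry", adjustment: "enriched_context" };
    case "context_corruption":
      return { type: "rollback_and_retry", checkpoint: lastCheckpoint };
    case "misunderstanding":
      return { type: "escalate_to_human" };
  }
}
```

An exhaustive switch over the category union means adding a new error category is a compile error until a recovery action is chosen for it — a useful property for a system meant to have zero unhandled failures.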

Results

After implementing self-healing pipelines across our agent infrastructure:

  • Unrecoverable failures dropped from 23% to 3%
  • Mean time to recovery: 45 seconds (was: manual, 15+ minutes)
  • Human escalation rate: 1 in 50 tasks (was: 1 in 3)

"The goal isn't zero failures. It's zero unhandled failures."

The system doesn't prevent agents from making mistakes. It makes mistakes cheap and recovery automatic. That's the difference between a fragile pipeline and a resilient system.

CatoCut
Agent-First Engineering