The term "AI Developer Experience" has been hijacked. Every IDE plugin that autocompletes a function call now claims to revolutionize AI DX. Let's be precise about what this term should mean.
What AI DX Is Not
AI DX is not:
- Better autocomplete in your editor
- A chatbot that answers questions about your codebase
- An AI that writes tests from descriptions
- A copilot that suggests the next line
These are productivity tools. They're useful. They're also not AI DX.
What AI DX Actually Is
AI DX is the entire experience of developing systems where AI agents are first-class participants. It encompasses:
- Spec authoring — How easy is it to declare what an agent should do?
- Observability — Can you see what the agent is thinking, deciding, and doing?
- Debuggability — When something goes wrong, can you trace the causal chain?
- Composability — Can you combine agents into larger workflows without losing control?
- Feedback loops — How quickly can you iterate on agent behavior?
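To make the first of these concrete, here is a minimal sketch of what declarative spec authoring could look like. `AgentSpec` and every field name are illustrative assumptions, not an existing API; the point is that intent becomes data you can inspect.

```typescript
// Hypothetical shape of a declarative agent spec. All names are
// illustrative; what matters is that intent is explicit and inspectable.
interface AgentSpec {
  goal: string;              // what the agent should accomplish
  constraints: string[];     // hard rules the agent must never violate
  tools: string[];           // capabilities the agent is allowed to invoke
  successCriteria: string[]; // how the output will be judged
}

const orderSpec: AgentSpec = {
  goal: "Process the customer's order end to end",
  constraints: ["Never submit an unvalidated order"],
  tools: ["validateOrder", "calculateTotals", "submitOrder"],
  successCriteria: ["Order is submitted", "Totals match line items"],
};
```

A spec like this is something tooling can diff, lint, and score — which is what the rest of this piece is about.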
"Good AI DX means the engineer never has to guess what the agent will do."
The Observability Gap
The biggest AI DX problem today is the observability gap. When a traditional function fails, you get a stack trace. When an agent fails, you get... an incorrect output.
```typescript
// Traditional: clear error chain
function processOrder(order: Order): Result {
  validate(order);      // throws ValidationError
  calculate(order);     // throws CalculationError
  return submit(order); // throws SubmissionError
}
```
```typescript
// Agent: opaque output
async function agentProcessOrder(spec: OrderSpec): Promise<Result> {
  // What happened inside? Why this output?
  // Which parts of the spec influenced which decisions?
  // Where did the reasoning diverge from intent?
  return agent.execute(spec);
}
```
Without answers to these questions, debugging agent systems is archaeology — you're reconstructing intent from artifacts.
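One way out of the archaeology is to make execution leave evidence as it happens. The sketch below is a hypothetical tracing wrapper — `AgentTrace`, `tracedExecute`, and `runAgent` are all invented names, not a real library — showing the shape of the idea: every step records what it saw and did.

```typescript
// Minimal tracing sketch: each step the agent takes leaves a record,
// so a bad output comes with evidence attached. All names are hypothetical.
interface TraceEvent {
  step: string;   // e.g. "input", "decision", "output"
  detail: string; // human-readable description of what happened
}

class AgentTrace {
  readonly events: TraceEvent[] = [];
  record(step: string, detail: string): void {
    this.events.push({ step, detail });
  }
}

// `runAgent` stands in for whatever actually executes the spec.
async function tracedExecute(
  spec: { goal: string },
  runAgent: (s: { goal: string }) => Promise<string>,
  trace: AgentTrace
): Promise<string> {
  trace.record("input", `goal=${spec.goal}`);
  const output = await runAgent(spec);
  trace.record("output", output);
  return output;
}
```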
The Five Layers of Agent Observability
Layer 1: Input Tracing
What exactly did the agent receive? Not just the spec, but the resolved context, the retrieved documents, the conversation history.
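A sketch of what such a snapshot might contain — every field name here is an illustrative assumption. The copies matter: later mutation of the live objects must not corrupt the record.

```typescript
// Hypothetical input snapshot: everything the agent actually received.
interface InputTrace {
  spec: string;
  resolvedContext: Record<string, string>; // variables after resolution
  retrievedDocs: string[];                 // IDs of retrieved documents
  conversationHistory: string[];           // prior turns, in order
}

function snapshotInput(
  spec: string,
  resolvedContext: Record<string, string>,
  retrievedDocs: string[],
  conversationHistory: string[]
): InputTrace {
  // Shallow-copy so later mutation cannot rewrite history.
  return {
    spec,
    resolvedContext: { ...resolvedContext },
    retrievedDocs: [...retrievedDocs],
    conversationHistory: [...conversationHistory],
  };
}
```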
Layer 2: Decision Logging
At each branch point, what options did the agent consider and why did it choose the path it took?
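A decision record might look like the following sketch (names are illustrative). The invariant check is the useful part: a log entry whose chosen path was never among the considered options is itself a bug worth surfacing.

```typescript
// Hypothetical decision log entry for one branch point.
interface DecisionRecord {
  branch: string;     // where in the workflow the choice occurred
  options: string[];  // alternatives the agent considered
  chosen: string;     // the path it actually took
  rationale: string;  // why, in the agent's own terms
}

function logDecision(
  log: DecisionRecord[],
  record: DecisionRecord
): DecisionRecord[] {
  // A choice outside the considered set signals a broken trace.
  if (!record.options.includes(record.chosen)) {
    throw new Error("chosen path must be one of the considered options");
  }
  log.push(record);
  return log;
}
```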
Layer 3: Confidence Mapping
For each output token/decision, how confident was the agent? Where was it uncertain?
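In sketch form, assuming decisions carry scores in [0, 1] (the 0.7 threshold is an arbitrary example value, not a standard):

```typescript
// Map each decision id to a confidence score and surface the uncertain ones.
type ConfidenceMap = Map<string, number>; // decision id -> confidence in [0, 1]

function uncertainDecisions(
  confidences: ConfidenceMap,
  threshold = 0.7
): string[] {
  return Array.from(confidences.entries())
    .filter(([, score]) => score < threshold)
    .map(([id]) => id);
}
```

The uncertain list is exactly where a human review or a spec clarification pays off most.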
Layer 4: Spec Alignment Scoring
How well does the output align with each clause of the original spec? Clause-by-clause scoring.
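A sketch of the scoring shape, with the scorer left pluggable. The keyword-overlap scorer below is deliberately naive and purely illustrative — a real implementation would use a model, not string matching.

```typescript
// Clause-by-clause alignment: each spec clause gets a score for how
// well the output honors it. The scorer is pluggable.
interface ClauseScore {
  clause: string;
  score: number; // in [0, 1]
}

function scoreAlignment(
  clauses: string[],
  output: string,
  scorer: (clause: string, output: string) => number
): ClauseScore[] {
  return clauses.map((clause) => ({ clause, score: scorer(clause, output) }));
}

// Naive illustrative scorer: fraction of clause words present in the output.
const keywordScorer = (clause: string, output: string): number => {
  const words = clause.toLowerCase().split(/\s+/);
  const lower = output.toLowerCase();
  const hits = words.filter((w) => lower.includes(w)).length;
  return words.length ? hits / words.length : 0;
};
```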
Layer 5: Behavioral Diffing
How does this execution differ from previous executions of the same spec? What changed and why?
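Given the decision logs from Layer 2, a behavioral diff reduces to comparing two decision paths. A minimal sketch:

```typescript
// Diff two executions of the same spec by their recorded decision paths.
function diffExecutions(
  previous: string[],
  current: string[]
): { added: string[]; removed: string[] } {
  const prevSet = new Set(previous);
  const currSet = new Set(current);
  return {
    added: current.filter((step) => !prevSet.has(step)),   // new behavior
    removed: previous.filter((step) => !currSet.has(step)), // dropped behavior
  };
}
```

An empty diff under an unchanged spec is the agent-world analogue of a passing regression test.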
Redesigning the Dev Loop
The traditional dev loop: write → run → see error → fix → repeat.
The agent-first dev loop: spec → execute → observe → adjust spec → repeat.
Notice the shift. You're not debugging code. You're refining intent. The feedback loop is about spec quality, not implementation correctness.
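The agent-first loop can be sketched as spec refinement under an alignment score. Every callback here is a stand-in for a real component, and the 0.9 acceptance threshold is an arbitrary assumption.

```typescript
// Sketch of the agent-first dev loop: iterate on the spec, not the code.
// All callbacks are placeholders; the threshold is an example value.
function refineSpec(
  spec: string,
  execute: (spec: string) => string,               // run the agent
  observe: (output: string) => number,             // alignment score in [0, 1]
  adjust: (spec: string, score: number) => string, // rewrite the intent
  maxIterations = 5
): string {
  let current = spec;
  for (let i = 0; i < maxIterations; i++) {
    const score = observe(execute(current));
    if (score >= 0.9) break; // spec expresses intent well enough
    current = adjust(current, score);
  }
  return current;
}
```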
This is AI DX. It's not a feature. It's a paradigm.
