CatoCut

In January 2026, we let an AI agent refactor 140 files across 3 microservices. Zero regressions. Here's exactly how — and the surprises along the way.

The Problem

Our payments service had accumulated 18 months of tech debt. Three services — payments-core, billing-api, and invoice-gen — shared a deprecated data format that needed migration to a new schema. Every engineer estimated 3-4 weeks of manual work.

The Setup

Instead of assigning it to a human, we wrote a spec:

goal: "Migrate all services from PaymentV1 schema to PaymentV2"
scope:
  services: [payments-core, billing-api, invoice-gen]
  file_types: [.ts, .tsx]
constraints:
  - All existing tests must pass after migration
  - No behavioral changes — output byte-identical for same inputs
  - Maximum 5 files changed per commit
  - Each commit independently revertable
verification:
  - Run full test suite after each commit
  - Integration tests after each service complete
  - Diff review by human after full migration

The Execution

The agent worked in three phases:

Phase 1: Analysis (2 hours)

The agent mapped every usage of PaymentV1 across all three services: 140 files, 23 type definitions, 47 conversion functions, 12 API endpoints, 8 database queries.

It also found a hidden dependency between billing-api and a fourth service (notifications) that we hadn't included in the scope. The agent flagged this and paused for human input.

Phase 2: Planning (30 minutes)

34 commits, ordered by dependency graph. Each commit touched at most 5 files and was independently revertable.

Phase 3: Execution (6 hours)

31 out of 34 commits passed on the first try. Three required recovery:

Commit 12: A test asserted on the old schema. Agent fixed the test in a follow-up.
Commit 23: A race condition unrelated to migration but exposed by it. Agent classified as out-of-scope.
Commit 29: Circular import from new structure. Agent restructured imports.

The Numbers

Metric	Human Estimate	Agent Actual
Duration	3-4 weeks	8.5 hours
Files changed	~140	140
Regressions	Unknown	0
Bugs found	0	5
Human time	3-4 weeks	2 hours (review)

Key Takeaway

The spec was the product. The agent was the tool. The 2-hour human review was more valuable than 4 weeks of manual refactoring — the human focused on intent verification, not mechanical execution.

"The best refactoring is the one where the engineer only has to verify, not execute."