In January 2026, we let an AI agent refactor 140 files across 3 microservices. Zero regressions. Here's exactly how — and the surprises along the way.
The Problem
Our payments service had accumulated 18 months of tech debt. Three services — payments-core, billing-api, and invoice-gen — shared a deprecated data format that needed migration to a new schema. Every engineer estimated 3-4 weeks of manual work.
The Setup
Instead of assigning it to a human, we wrote a spec:
goal: "Migrate all services from PaymentV1 schema to PaymentV2"
scope:
services: [payments-core, billing-api, invoice-gen]
file_types: [.ts, .tsx]
constraints:
- All existing tests must pass after migration
- No behavioral changes — output byte-identical for same inputs
- Maximum 5 files changed per commit
- Each commit independently revertable
verification:
- Run full test suite after each commit
- Integration tests after each service complete
- Diff review by human after full migration
The Execution
The agent worked in three phases:
Phase 1: Analysis (2 hours)
The agent mapped every usage of PaymentV1 across all three services: 140 files, 23 type definitions, 47 conversion functions, 12 API endpoints, 8 database queries.
It also found a hidden dependency between billing-api and a fourth service (notifications) that we hadn't included in the scope. The agent flagged this and paused for human input.
Phase 2: Planning (30 minutes)
34 commits, ordered by dependency graph. Each commit touched at most 5 files and was independently revertable.
Phase 3: Execution (6 hours)
31 out of 34 commits passed on the first try. Three required recovery:
- Commit 12: A test asserted on the old schema. Agent fixed the test in a follow-up.
- Commit 23: A race condition unrelated to migration but exposed by it. Agent classified as out-of-scope.
- Commit 29: Circular import from new structure. Agent restructured imports.
The Numbers
| Metric | Human Estimate | Agent Actual |
|---|---|---|
| Duration | 3-4 weeks | 8.5 hours |
| Files changed | ~140 | 140 |
| Regressions | Unknown | 0 |
| Bugs found | 0 | 5 |
| Human time | 3-4 weeks | 2 hours (review) |
Key Takeaway
The spec was the product. The agent was the tool. The 2-hour human review was more valuable than 4 weeks of manual refactoring — the human focused on intent verification, not mechanical execution.
"The best refactoring is the one where the engineer only has to verify, not execute."
