Recoverability
Recoverability is the ability to return to a safe state after something goes wrong, without losing context, duplicating side effects, or trapping users. It is the difference between “errors happen” and “errors don’t become disasters”.
Definition
- Recoverability is a property of both the design and the system: when a flow fails, the user (or operator) can understand the current state and move forward safely.
- Recoverability depends on explicit state models, idempotent actions, clear retries, and manual takeover paths.
- It’s especially critical in payments, identity, logistics, and any multi-step ecosystem journey.
- Recoverability is not only error handling; it is continuity of intent: users can pause, resume, and safely complete the job later.
Why it matters
- Most real-world journeys are interrupted: network issues, timeouts, missing data, policy checks, vendor failures.
- Recoverability reduces abandonment and support cost by turning failures into managed states rather than dead ends.
- In agentic systems, recoverability prevents errors from compounding silently: agents must be safe to interrupt.
- In practice, this is where many digital programs fail: teams understand the concept but lack the operating discipline to define, test, and measure recovery paths.
Common failure modes
- “Something went wrong” with no state, no next step, no recovery option.
- Retries that duplicate side effects (double charges, duplicated requests, repeated notifications).
- Hard resets that erase context and force users to start over.
- Recovery paths that exist only for support, not for users or operators.
- No pause/resume: the flow assumes continuous attention in a discontinuous world.
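The duplicated side effect failure mode is commonly avoided with an idempotency key: the caller attaches one key to a logical request, and a retry with the same key returns the stored result instead of repeating the action. The sketch below is a minimal in-memory illustration. `PaymentLedger`, `charge`, and the key format are hypothetical names chosen for the example, not part of any real payment API.

```python
class PaymentLedger:
    """In-memory stand-in for a payment service (hypothetical)."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first attempt
        self.charges = []   # side effects that were actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {"status": "charged", "account": account, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


ledger = PaymentLedger()
first = ledger.charge("order-123-attempt", "acct-1", 4999)
retry = ledger.charge("order-123-attempt", "acct-1", 4999)  # e.g. after a timeout
```

Here `retry` is the same result as `first` and only one charge is recorded. A production version would need durable storage and a policy for keys that are reused with different parameters; both are outside the scope of this sketch.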
How I design it
- Design a state machine: requested → in progress → awaiting input → actioned → pending confirmation → completed, plus failure/rollback/manual states.
- Make retries safe: idempotency, clear confirmation, and explicit “what will happen if you retry”.
- Provide visible status history and timestamps; show what the system knows and what it needs.
- Design manual takeover: escalation routes, operator actions, and user communication patterns.
- Instrument recovery: measure failure points, time-to-recovery, and rework drivers.
- Provide a “return later” path with preserved context: drafts, pending steps, and what remains to be done.
- Treat it as a repeatable pattern: define it, test it in production, measure it, and evolve it with evidence.
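The state machine and status history described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the state names follow the list above, but the `TRANSITIONS` table, the `Journey` class, and its method names are hypothetical, and a real flow would define its own allowed transitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    REQUESTED = "requested"
    IN_PROGRESS = "in progress"
    AWAITING_INPUT = "awaiting input"
    ACTIONED = "actioned"
    PENDING_CONFIRMATION = "pending confirmation"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLED_BACK = "rolled back"
    MANUAL = "manual takeover"


# Allowed transitions (assumed for illustration only).
TRANSITIONS = {
    State.REQUESTED: {State.IN_PROGRESS, State.FAILED},
    State.IN_PROGRESS: {State.AWAITING_INPUT, State.ACTIONED, State.FAILED},
    State.AWAITING_INPUT: {State.IN_PROGRESS, State.FAILED},
    State.ACTIONED: {State.PENDING_CONFIRMATION, State.FAILED},
    State.PENDING_CONFIRMATION: {State.COMPLETED, State.FAILED},
    State.FAILED: {State.IN_PROGRESS, State.ROLLED_BACK, State.MANUAL},
    State.MANUAL: {State.IN_PROGRESS, State.COMPLETED, State.ROLLED_BACK},
    State.COMPLETED: set(),
    State.ROLLED_BACK: set(),
}


@dataclass
class Journey:
    state: State = State.REQUESTED
    history: list = field(default_factory=list)  # (timestamp, state, note)

    def __post_init__(self):
        self._record(self.state, "created")

    def _record(self, state, note):
        self.history.append((datetime.now(timezone.utc), state, note))

    def move_to(self, new_state, note=""):
        # Reject transitions the model does not allow, so a failure can never
        # leave the journey in an undefined state.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state
        self._record(new_state, note)

    def next_steps(self):
        # What the user or operator can do from here.
        return sorted(s.value for s in TRANSITIONS[self.state])


journey = Journey()
journey.move_to(State.IN_PROGRESS)
journey.move_to(State.FAILED, "vendor timeout")
print(journey.next_steps())  # ['in progress', 'manual takeover', 'rolled back']
```

Because failure is an explicit state with its own outgoing transitions, a failed journey always offers a retry, a rollback, or a manual takeover rather than a dead end. The timestamped `history` is the raw material for both a visible status history and time-to-recovery measurement.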