Recoverability

Recoverability is the ability to get back to a safe state after something goes wrong—without losing context, duplicating side effects, or trapping users. It’s the difference between “errors happen” and “errors don’t become disasters”.

Definition

Recoverability is a property of both the design and the system: when a flow fails, the user (or operator) can understand the current state and move forward safely.
Recoverability depends on explicit state models, idempotent actions, clear retries, and manual takeover paths.
It’s especially critical in payments, identity, logistics, and any multi-step ecosystem journey.
Recoverability is not only error handling; it is continuity of intent: users can pause, resume, and safely complete the job later.

Why it matters

Most real-world journeys are interrupted: network issues, timeouts, missing data, policy checks, vendor failures.
Recoverability reduces abandonment and support cost by turning failures into managed states rather than dead ends.
In agentic systems, recoverability prevents silent compounding errors—agents must be safe to interrupt.
In practice, this is where many digital programs fail: the concept is understood, but the operating discipline is missing.

Common failure modes

“Something went wrong” with no state, no next step, no recovery option.
Retries that duplicate side effects (double charges, duplicated requests, repeated notifications).
Hard resets that erase context and force users to start over.
Recovery paths that exist only for support, not for users or operators.
No pause/resume: the flow assumes continuous attention in a discontinuous world.
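
The duplicated-side-effect failure above is the one most directly addressed in code. What follows is a minimal sketch of an idempotency key, not a production design: `PaymentService`, `charge`, and the key format are hypothetical names chosen for illustration, and the in-memory dictionary stands in for whatever durable store a real service would use.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical in-memory service; a real one would persist keys durably."""

    charges: list = field(default_factory=list)   # side effects actually performed
    _results: dict = field(default_factory=dict)  # idempotency key -> stored result

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a key we have already seen returns the stored result
        # instead of performing the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        result = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-123-attempt", 4_999)
retry = service.charge("order-123-attempt", 4_999)  # e.g. after a client timeout
```

Because the retry reuses the same key, it gets back the original result and only one charge is recorded. The client can therefore retry after a timeout without knowing whether the first attempt succeeded.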

How I design it

Design a state machine: requested → in progress → awaiting input → actioned → pending confirmation → completed, plus failure/rollback/manual states.
Make retries safe: idempotency, clear confirmation, and explicit “what will happen if you retry”.
Provide visible status history and timestamps; show what the system knows and what it needs.
Design manual takeover: escalation routes, operator actions, and user communication patterns.
Instrument recovery: measure failure points, time-to-recovery, and rework drivers.
Provide a “return later” path with preserved context: drafts, pending steps, and what remains to be done.
Treat it as a repeatable pattern: define it, test it in production, measure it, and evolve it with evidence.
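
The state machine and status history described above can be sketched roughly as follows. The state names come from the list; the allowed transitions, the `Journey` class, and the `move_to` and `next_steps` helpers are assumptions made for illustration, and a real flow would define its own transition rules.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class State(Enum):
    REQUESTED = "requested"
    IN_PROGRESS = "in progress"
    AWAITING_INPUT = "awaiting input"
    ACTIONED = "actioned"
    PENDING_CONFIRMATION = "pending confirmation"
    COMPLETED = "completed"
    FAILED = "failed"
    ROLLED_BACK = "rolled back"
    MANUAL = "manual takeover"


# Allowed transitions (assumed for illustration, not taken from the source).
TRANSITIONS = {
    State.REQUESTED: {State.IN_PROGRESS, State.FAILED},
    State.IN_PROGRESS: {State.AWAITING_INPUT, State.ACTIONED, State.FAILED},
    State.AWAITING_INPUT: {State.IN_PROGRESS, State.FAILED},
    State.ACTIONED: {State.PENDING_CONFIRMATION, State.FAILED},
    State.PENDING_CONFIRMATION: {State.COMPLETED, State.FAILED},
    State.FAILED: {State.IN_PROGRESS, State.ROLLED_BACK, State.MANUAL},
    State.MANUAL: {State.IN_PROGRESS, State.ROLLED_BACK, State.COMPLETED},
    State.ROLLED_BACK: set(),
    State.COMPLETED: set(),
}


@dataclass
class Journey:
    state: State = State.REQUESTED
    history: list = field(default_factory=list)  # (timestamp, from, to, note)

    def move_to(self, new_state: State, note: str = "") -> None:
        # Reject transitions the model does not allow, so a failure can
        # never leave the journey in an undefined state.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.history.append((datetime.now(timezone.utc), self.state, new_state, note))
        self.state = new_state

    def next_steps(self) -> list:
        # What the user or operator can do from here.
        return sorted(s.value for s in TRANSITIONS[self.state])


journey = Journey()
journey.move_to(State.IN_PROGRESS)
journey.move_to(State.FAILED, note="vendor timeout")
journey.move_to(State.MANUAL, note="escalated to operator")
```

In this sketch a failure is an ordinary state with its own exits (retry, rollback, manual takeover) rather than a dead end. The timestamped history gives both the visible status trail and the raw data for measuring time-to-recovery.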
