How we handle incidents

Resend is committed to providing reliable and consistent service to our customers. Incidents are inevitable. What matters is how quickly and clearly we respond, and what we learn so the same failure does not happen again.

This document explains how incidents are declared, handled, and closed.

How incidents start

An incident can start in a few ways:

  • A monitor or alert indicates the system is unhealthy
  • The team notices degraded performance or something affecting customers
  • A customer reports degraded service or unexpected behavior

We declare an incident when there is customer impact, degraded service, or a strong signal that customer impact is likely. When in doubt, we declare early. A false positive is cheaper than a late response.

If a provider issue affects our customers, we still treat it as our incident. Our customers experience Resend, not our dependencies.
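
The declaration rule in this section can be sketched as a small predicate. This is only an illustration: the `Signal` fields and the `should_declare` function are hypothetical names, not part of any Resend tooling. The point it shows is that any one of the three conditions is sufficient, and that a provider being at fault does not change the answer.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical summary of what is known when a report comes in."""
    customer_impact: bool = False     # customers are affected right now
    degraded_service: bool = False    # the service is measurably degraded
    impact_likely: bool = False       # strong signal that impact is coming
    caused_by_provider: bool = False  # the fault sits in a dependency


def should_declare(signal: Signal) -> bool:
    """Return True if the signal meets the bar for declaring an incident.

    Any single condition is enough. The provider flag is deliberately
    ignored: a dependency failure that reaches customers is still ours.
    """
    return (
        signal.customer_impact
        or signal.degraded_service
        or signal.impact_likely
    )
```

For example, `should_declare(Signal(impact_likely=True, caused_by_provider=True))` is true, which matches the "declare early" guidance even though no customer has been affected yet.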

How we respond

When an incident is declared, we create or join the incident channel and use it as the source of truth for triage, decisions, and updates.

Every incident should have clear ownership. One person coordinates the response. One person owns customer communication. Early in the incident, the same person may temporarily do both, but ownership should always be explicit.

Our response follows a few principles:

Triage quickly

The first step is to decide whether the report should be accepted as an incident. If not, we close it and document why. If yes, we move into incident mode immediately.

Huddle when needed

Written updates are useful, but they are not always enough. We don't expect the entire company to be following and reading the incident channel. When written updates fall short, we join a huddle to pass context and make decisions fast.

Stay urgent until the incident is resolved

Incidents take priority over normal work until customer impact has been removed and the system is stable. Until both are true, the incident remains the top priority.

Centralize communication

Findings, decisions, and next steps belong in the incident channel so everyone is working from the same context.

Mitigate first

Our first priority is to reduce or eliminate customer impact. Rollbacks, feature flags, traffic shifts, or operational workarounds come before root cause analysis. We can investigate deeply once the system is stable.

Communicate clearly

If customers are impacted, we communicate early and clearly through the appropriate channel. Broad incidents should be reflected on the status page. A single-customer incident may be handled directly.
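
As a rough sketch of the channel choice described above, the function below maps the number of affected customers to a channel. Everything here is an assumption made for illustration: the `Channel` enum, the `pick_channel` name, and in particular the `broad_threshold` default, since the text does not define how many customers make an incident "broad".

```python
from enum import Enum
from typing import Optional


class Channel(Enum):
    STATUS_PAGE = "status page"
    DIRECT = "direct message to the customer"


def pick_channel(affected_customers: int, broad_threshold: int = 2) -> Optional[Channel]:
    """Pick where customer communication should go.

    broad_threshold is a hypothetical cutoff: at or above it the incident
    is treated as broad and belongs on the status page.
    """
    if affected_customers <= 0:
        return None  # no customer impact, so nothing to communicate yet
    if affected_customers >= broad_threshold:
        return Channel.STATUS_PAGE
    return Channel.DIRECT
```

With the default threshold, a single affected customer maps to `Channel.DIRECT` and anything wider maps to `Channel.STATUS_PAGE`. In practice this is a judgment call made by whoever owns customer communication, not a fixed number.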

Closing an incident

An incident is closed only when customer impact has ended and the system is stable.

Before closing an incident, we confirm:

  • Customer-facing behavior is back to normal
  • Backlog is drained or intentionally managed with no ongoing customer impact
  • Dead-letter queues (DLQs) are empty or intentionally draining with no ongoing customer impact
  • Alerts and monitors are green
  • The service is stable
  • Customer communication has been updated and, when needed, a final status update has been sent
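
The checklist above can be expressed as a simple gate: an incident may be closed only when every item holds. The sketch below assumes each item has already been reduced to a yes/no answer by a person or a monitor. The `CloseChecklist` fields and the `blockers` and `can_close` functions are hypothetical names used for illustration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class CloseChecklist:
    """One boolean per item in the closing checklist (hypothetical names)."""
    customer_behavior_normal: bool
    backlog_drained_or_managed: bool
    dlqs_empty_or_draining: bool
    alerts_green: bool
    service_stable: bool
    customer_comms_updated: bool


def blockers(checklist: CloseChecklist) -> List[str]:
    """Return the names of the checks that still fail, in checklist order."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def can_close(checklist: CloseChecklist) -> bool:
    """An incident can be closed only when no check is failing."""
    return not blockers(checklist)
```

Returning the list of failing checks, rather than a bare boolean, makes it easy to post what is still blocking closure in the incident channel.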