At Resend, on-call exists to give every production issue a clear owner.
The goal of on-call is not heroics. The goal is to restore service quickly, escalate early, communicate clearly, and improve the system after every incident.
On-call is for production alerts and customer-impacting reliability issues.
Our on-call rotation runs weekly. Each week, we assign a primary and a secondary on-call engineer.
The primary on-call engineer is the first responder for alerts during the week.
The primary is expected to:
Being primary does not mean solving everything alone. Good on-call engineers ask for help early.
The secondary on-call engineer supports the primary and steps in when needed.
The secondary is expected to:
The secondary should be ready to jump in, not catch up from scratch.
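As an illustration only, a weekly primary/secondary rotation could be sketched as below. The roster names, the `epoch` date, and the rule that the secondary is simply the next person in the roster are assumptions made for this example, not a description of how the rotation is actually scheduled.

```python
from datetime import date


def week_index(day: date, epoch: date) -> int:
    """Whole weeks elapsed between the rotation epoch and `day`."""
    return (day - epoch).days // 7


def assign_week(roster: list[str], day: date, epoch: date) -> dict[str, str]:
    """Pick the primary and secondary for the week containing `day`.

    Assumption: the secondary is the next engineer in roster order,
    which also makes them the following week's primary.
    """
    if len(roster) < 2:
        raise ValueError("a rotation needs at least two engineers")
    i = week_index(day, epoch) % len(roster)
    return {
        "primary": roster[i],
        "secondary": roster[(i + 1) % len(roster)],
    }


# Hypothetical roster and epoch (a Monday), used only for illustration.
ROSTER = ["ana", "bo", "cy"]
EPOCH = date(2024, 1, 1)

this_week = assign_week(ROSTER, date(2024, 1, 10), EPOCH)
```

One consequence of this particular ordering is that each secondary spends a week close to the alerts before becoming primary, which fits the idea that the secondary should not have to catch up from scratch.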
It is fine to have errands or short personal commitments during your week. What matters is that coverage is always clear.
If you expect to be away from your laptop or without internet for a period of time, coordinate with the secondary ahead of time and make sure they explicitly confirm coverage.
If you cannot provide coverage for the week because of travel, illness, PTO, or other commitments, you are responsible for arranging a swap and updating the on-call calendar. Ideally, give at least one week's notice. There should never be ambiguity about who is on call.
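One way to catch ambiguity after a swap is to check the calendar for gaps and overlaps. The sketch below assumes a hypothetical `Shift` record with a start and end time; it does not reflect the format of any real on-call calendar or scheduling tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def coverage_problems(
    shifts: list[Shift], window_start: datetime, window_end: datetime
) -> list[str]:
    """Report gaps and overlaps in primary coverage within a time window."""
    problems: list[str] = []
    covered_until = window_start
    previous = None  # the shift that currently extends coverage furthest
    for shift in sorted(shifts, key=lambda s: s.start):
        if shift.end <= window_start or shift.start >= window_end:
            continue  # entirely outside the window being checked
        if shift.start > covered_until:
            problems.append(
                f"gap: {covered_until.isoformat()} to {shift.start.isoformat()}"
            )
        elif previous is not None and shift.start < covered_until:
            problems.append(f"overlap: {previous.engineer} and {shift.engineer}")
        if shift.end > covered_until:
            covered_until = shift.end
            previous = shift
    if covered_until < window_end:
        problems.append(
            f"gap: {covered_until.isoformat()} to {window_end.isoformat()}"
        )
    return problems
```

An empty result means exactly one engineer is listed at every moment in the window. Running a check like this whenever the calendar changes would surface a half-finished swap before the week starts.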
When an alert fires, the on-call engineer should:
We optimize for restoring service first. A full diagnosis can happen after the system is stable.
If the issue affects customers or degrades performance, follow our incident process.
We prefer safe, reversible actions that reduce impact quickly.
This usually means:
The fastest path to stability is usually better than the most elegant technical fix in the moment.
Once the issue is resolved, the on-call engineer should ensure the operational context is captured and any immediate follow-up work is created.
This includes:
If a formal incident was declared, follow our post-incident review process.
On-call can be stressful. That is normal.
A few things to remember:
We want on-call to be sustainable. If the rotation is too noisy or too stressful, we should improve the system, not normalize the pain.
We evaluate on-call quality through a few simple questions: