It's 3 AM. PagerDuty is screaming. The server is down. What do you do? If you have to think about the answer, you've already failed. Incident Response is about muscle memory and prepared playbooks. Panic causes mistakes; process causes resolution.
Phase 1: Triage (Stop the Bleeding)
Don't try to fix the root cause bug yet. Restore service first.
- Is it a DDoS? Enable Cloudflare "Under Attack" mode or block the malicious IP at the firewall level (see the firewall sketch after this list).
- Is it a bad deployment? Roll back immediately. git revert implies you need to build and deploy again, which is slow. Better: swap the symlink to the previous build folder instantly (a sketch follows this list).
- Is the DB locked? Restart the DB service. Yes, it's crude, but if it buys you 20 minutes of uptime to investigate the slow query, do it. Use that time to analyze, not just sweat.
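A hedged sketch of the firewall option from the DDoS item above: dropping a single abusive source IP at the host level. The address 203.0.113.42 is a documentation placeholder, not a real attacker.

```bash
# Drop all traffic from one abusive source IP (203.0.113.42 is a placeholder).
sudo iptables -I INPUT -s 203.0.113.42 -j DROP

# Equivalent with ufw, inserted at the top of the ruleset so it takes precedence.
sudo ufw insert 1 deny from 203.0.113.42
```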
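And a minimal sketch of the instant symlink rollback, assuming releases are kept as timestamped folders under /var/www/releases and the web server's docroot is the /var/www/current symlink (both paths are assumptions; adjust to your layout):

```bash
# Point "current" at the previous release instead of rebuilding.
# ls -1dt sorts release folders newest-first; sed picks the second entry.
PREVIOUS=$(ls -1dt /var/www/releases/*/ | sed -n '2p')

# -n replaces the existing symlink itself rather than creating a link inside it.
ln -sfn "$PREVIOUS" /var/www/current

# Reload (not restart) the web server so it picks up the change without dropping connections.
sudo systemctl reload nginx
```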
Phase 2: Communication
Silence makes users angry. Transparency builds trust.
- Status Page: Update it immediately. "We are investigating an issue with API latency" is better than silence.
- Social Media: Acknowledge the problem publicly if it's user-facing.
- Internal: Notify the support team so they know what to tell customers. Give them a script.
Phase 3: Root Cause Analysis (RCA)
Once the fire is out, find the arsonist.
- Logs: Check /var/log/syslog and /var/log/nginx/error.log. Look for "OOM Killer" (Out of Memory) messages (example commands below).
- Metrics: Check monitoring graphs (Grafana/Datadog) for spikes in CPU/RAM/IO just before the crash. Did traffic spike? Did disk I/O freeze?
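A few starting points for the log check above, assuming a systemd-based Linux host with the default log locations mentioned in the list:

```bash
# Did the kernel OOM-killer fire? Check the kernel ring buffer and syslog.
dmesg -T | grep -iE "out of memory|oom-kill|killed process"
grep -i "out of memory" /var/log/syslog

# The last errors nginx logged before the crash.
tail -n 200 /var/log/nginx/error.log

# On journald-only systems, the same kernel messages live in the journal.
journalctl -k --since "1 hour ago" | grep -i oom
```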
Phase 4: The Post-Mortem
Write a document. It's not about blame; it's about learning.
- What happened? (The server ran out of RAM).
- Why? (A memory leak in the image resize worker).
- Why was it not detected? (We didn't have alerts for worker RAM usage > 80%).
- Action Items: (Fix the leak in code. Add the alert in Zabbix).
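For illustration only, this is the kind of check that "worker RAM usage > 80%" alert encodes; in practice it belongs in Zabbix or your monitoring stack, and image-resize-worker is an assumed process name:

```bash
#!/bin/sh
# Rough stand-in for a "worker RAM usage > 80%" alert.
# "image-resize-worker" is a hypothetical process name.
TOTAL_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
WORKER_KB=$(ps -C image-resize-worker -o rss= | awk '{sum += $1} END {print sum + 0}')

if [ $((WORKER_KB * 100 / TOTAL_KB)) -gt 80 ]; then
    # Hook this into whatever actually pages you (Zabbix, PagerDuty, email).
    echo "ALERT: image-resize worker is using more than 80% of RAM" >&2
    exit 1
fi
```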
Golden Rule: Blame the Process, not the Person. If a human made a mistake, it's because the system allowed them to.
Tags: Security, Incident Response, Uptime