Reliability, Resilience and Incident Management



Andrew Hatch


@hatchman76

whoami

Australia

22 years

India
USA

5 years

15 years

DevOps Manager
Platform Engineering Manager
SRE Manager
Software Engineering Manager

Photos: Andrew in 1996, cooking hobby

1996 - 1999

some recommended reading

Thinking in Systems
The Tyranny of Metrics
Chaos Engineering
Drift into Failure
Field Guide to Human Error
Leaders Eat Last
Normal Accidents
The Challenger Launch Decision
The DevOps Handbook
The Goal
Turn the Ship Around
The Phoenix Project
The Safety Anarchist
Still Not Safe
Site Reliability Workbook

High Performance Operations

The Customer experience must always be protected -- graceful degradation over outages

The business's ability to generate revenue and maximise profit is paramount

Operational discipline, across the board, makes that happen

Core Principles

Customers Are Not Your Monitoring System When users find the issue before you do, you've already failed. Systems should be detecting degradation before it becomes pain.
Incidents Are Learning Opportunities, Not Just Interruptions The postmortem isn't the end of an incident — it's the beginning of improvement. What matters is what you change next.
New Features Should Never Compromise Core Experiences No launch is worth breaking the flows that drive your revenue and trust. Integrity comes before novelty.
On-Call Is a Core Engineering Discipline It's not a chore or punishment shift — it's a critical skillset. Every engineer needs to be trained, supported, and held accountable for it.

So I read some of your post-mortems... and have some advice and guidance to share

Topics

Metrics
Monitoring
Alerting
Deployment Safety
Incidents
Discipline and Ritual

Topic -> Alerting

Noisy Alerts don't improve safety. They desensitize you to actual problems and increase ops toil. Basically, they are spam.
Alerts don't resolve themselves. If it fired, something changed. Dismissing it is betting against your customer noticing first. (Jim Ockers)
Ownership Every alert should have an owner, playbook, and a threshold. If it doesn't, it's not production-grade.
False Positives Aren't Free They cost trust, time, and sleep — and eventually, they cost incidents.
Fatigue = Leadership Miss Alert fatigue is a management failure. If your team is burnt out on on-call — look at your prioritization.
Observability First Good observability is a product of good alerts and logging. If your alerts are noisy, the problem may be your system design.
Tune and Verify Every alert that fires is an invitation to assess whether it provided value and led to engineering action.
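
The ownership rule above (every alert has an owner, a playbook, and a threshold) lends itself to a mechanical check. The following is a minimal sketch, assuming alert definitions can be loaded into simple records; the `AlertRule` type and its field names are hypothetical and not taken from any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    owner: Optional[str] = None
    playbook_url: Optional[str] = None
    threshold: Optional[float] = None


def missing_fields(rule: AlertRule) -> list[str]:
    """Return the required fields that this alert rule has not filled in."""
    missing = []
    if not rule.owner:
        missing.append("owner")
    if not rule.playbook_url:
        missing.append("playbook_url")
    if rule.threshold is None:  # 0.0 is a valid threshold, so compare to None
        missing.append("threshold")
    return missing


def audit(rules: list[AlertRule]) -> dict[str, list[str]]:
    """Map each alert that is not production-grade to its missing fields."""
    report = {}
    for rule in rules:
        gaps = missing_fields(rule)
        if gaps:
            report[rule.name] = gaps
    return report
```

Running something like `audit` in CI against the alert definitions is one way to stop ownerless alerts from reaching production in the first place.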

Topic -> Metrics

Metrics Must Tell Stories If your metrics don't lead to decisions, they're not helping — they're noise. Every graph should answer: "What would I do if this moved?"
Clean Your Data Dirty data ruins trust in metrics. Separate out synthetic traffic, failed test runs, and spam events.
Detect Regression Fast Metrics should detect regression before customers do. If conversion drops 20%, you should be notified.
Metrics Need Context A flatline isn't good or bad — it depends on what it's measuring. Without historical context and goals, you're guessing
Don't Metric Everything Measuring everything isn't a strategy; it's a mess. Choose a small set of high-signal metrics, invest in making them accurate, and ensure they change behavior.
Traceability is hard, but... Metrics should instrument the user's full journey, not your microservice boundaries, so you don't miss what's actually breaking the experience.
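
Two of the points above, cleaning out synthetic traffic and catching a 20% conversion drop, can be sketched together. This is an illustration under stated assumptions: the `Event` record and its `synthetic` flag are hypothetical, and the drop is measured relative to a baseline rate that the caller supplies.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical funnel event; `synthetic` marks test or probe traffic."""
    converted: bool
    synthetic: bool = False


def conversion_rate(events: list[Event]) -> float:
    """Conversion rate over real traffic only; synthetic events are excluded."""
    real = [e for e in events if not e.synthetic]
    if not real:
        return 0.0
    return sum(e.converted for e in real) / len(real)


def has_regressed(baseline: float, current: float, max_drop: float = 0.20) -> bool:
    """True when `current` is at least `max_drop` (relative) below `baseline`."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - current) / baseline >= max_drop
```

Without the synthetic filter, probe traffic that always converts would inflate the rate and could hide a real regression.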

Topic -> Monitoring

Customer Experience First Your critical customer and revenue flows are your front door. Monitor these above all else
Symptoms Over Causes Good monitoring catches symptoms, not just causes. Don't wait for a DB timeout — alert when the page slows down or checkout stalls.
External ≠ Observable Third-party dependencies must be monitored like first-class systems. If Stripe dies and you find out from social media, you're already behind.
Less is Clearer Too many graphs, overlapping metrics, and unclear alerts turn dashboards into white noise.
Monitoring ≠ Afterthought If monitoring is added after the feature ships, it's already too late. Observability must be part of the design — not the cleanup.
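
As a rough illustration of "symptoms over causes", the sketch below checks what the customer feels (slow pages, stalled checkouts) rather than an internal cause such as a database timeout. The function names, the 2000 ms limit, and the 50% completion floor are all assumptions chosen for the example, not recommended values.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of `samples`, with `pct` between 0 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def checkout_symptoms(
    latencies_ms: list[float],
    completed: int,
    started: int,
    p95_limit_ms: float = 2000.0,   # assumed latency budget
    min_completion: float = 0.5,    # assumed floor for completed/started
) -> list[str]:
    """Return the customer-facing symptoms currently present."""
    symptoms = []
    if percentile(latencies_ms, 95) > p95_limit_ms:
        symptoms.append("slow_pages")
    if started > 0 and completed / started < min_completion:
        symptoms.append("checkout_stalling")
    return symptoms
```

The same check applies unchanged whether the underlying cause is your own database or a third-party dependency, which is the point of alerting on symptoms.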

Topic -> Deployment Safety

Block Untrusted Deploys CI/CD systems should block deploys that fail critical-path tests or have < X% test coverage — require EM override to bypass.
Progressive Rollouts Use progressive rollout strategies (10% → 25% → 60% → 100%) to reduce blast radius. If you're pushing 99% or 100% to prod in one step, that's gambling.
Rollback Must Be Fast Add rollback automation for any failed canary or gating test. If your team is still SSHing to roll back, you don't have a safe deploy process — you have a recovery delay.
Staging ≠ Production "It worked in staging" is not a quality guarantee. Feature flags, data conditions, and user scale all change in prod
Decouple Safety from Speed If you're not using feature flags, you're one bug away from a full rollback. Ship code and control behavior independently
Canaries Aren't Optional Every critical service should have a canary stage — even if it's just 5% of traffic. If you're deploying blind, you're trusting luck, not process.
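
The deploy gate, staged rollout, and automatic rollback described above can be sketched as two small functions. This is a minimal sketch, not a real CI/CD integration: `set_traffic_percent`, `is_healthy`, and `rollback` are hypothetical callables that the caller wires to their own deploy tooling, and the coverage floor is left as a parameter because the slide does not fix a value.

```python
from typing import Callable, Sequence

ROLLOUT_STAGES = (10, 25, 60, 100)  # percent of traffic, as in the stages above


def may_deploy(critical_tests_passed: bool, coverage_pct: float,
               min_coverage_pct: float, em_override: bool = False) -> bool:
    """Gate a deploy; an engineering-manager override bypasses the checks."""
    if em_override:
        return True
    return critical_tests_passed and coverage_pct >= min_coverage_pct


def progressive_rollout(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    rollback: Callable[[], None],
    stages: Sequence[int] = ROLLOUT_STAGES,
) -> bool:
    """Advance through the stages; roll back at the first unhealthy stage.

    Returns True if the rollout reached the final stage, False if rolled back.
    """
    for percent in stages:
        set_traffic_percent(percent)
        if not is_healthy(percent):
            rollback()  # automated, so nobody has to SSH in
            return False
    return True
```

Because the health check runs after every stage, a bad release is stopped while it serves only a fraction of traffic, and the first stage effectively acts as the canary.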

Topic -> Incidents & On-Call

Follow the right Metric Trends MTTA and MTTD are useful. MTTR is almost useless.
Reliability Investment Learning from incidents is critical. If you don't take action or improve anything after an incident, you've wasted it.
Weekly Reviews Set up a weekly recurring incident review with Managers (Engineering AND Product) and Staff+. Choose the highest-impact incidents to discuss first.
You build it, you own it Operations is an engineering responsibility. If you change production, you own it. If your team has an on-call roster, you should be on it.
On-call handover Mandatory weekly meeting for everyone in the team. This is a learning opportunity for everyone. Take attendance if you have to
Roster resilience Primary, Secondary and Overrides need to be catered for. Managers are always escalation points. Escalate up the chain to the top
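
To follow MTTD and MTTA as trends, they first have to be computed consistently. The sketch below assumes each incident record carries three timestamps; the `Incident` type is hypothetical, and the definitions used here (detect = start to detection, acknowledge = detection to a human acknowledging) are one common convention rather than the only one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with the timestamps needed for the means."""
    started_at: datetime
    detected_at: datetime
    acknowledged_at: datetime


def _mean(durations: list[timedelta]) -> timedelta:
    if not durations:
        raise ValueError("no incidents to average")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between impact starting and the system detecting it."""
    return _mean([i.detected_at - i.started_at for i in incidents])


def mean_time_to_acknowledge(incidents: list[Incident]) -> timedelta:
    """Average gap between detection and a responder acknowledging."""
    return _mean([i.acknowledged_at - i.detected_at for i in incidents])
```

Computed per week, these give the weekly incident review a trend line to discuss instead of a single number.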

Topic -> Discipline and Ritual

Planned vs Unplanned If your sprint is 100% features, you're lying to yourself. Unplanned operational work is real and should feed the backlog.
Sprint Hygiene Every sprint should surface on-call-driven work. If incident and alert follow-ups aren't showing up in Jira, you're hiding operational toil and waiting for the same incidents to happen again.
Burndown by Default If you don't have a burn-down chart of unresolved incident action items, you're not learning — you're stalling. Every item should have an owner, ETA, and escalation plan if overdue.
Debt Tagging Every postmortem should tag unresolved debt: infra, process, test, tooling. Action items should have deadlines
Close testing gaps If "add tests" is always your postmortem answer, you need to step back and fix your testing strategy — not just your test files.
Escalate the Incomplete Any incident follow-up not completed within SLA should escalate to the Weekly Incident Management Review — not get silently deprioritized. Delay is risk. Track it like one.
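
The burn-down and escalation rules above reduce to two queries over the list of incident action items. This is a minimal sketch, assuming action items are exported from the tracker into simple records; the `ActionItem` type is hypothetical, and "overdue" here means open and past its due date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical incident follow-up; every item carries an owner and a due date."""
    title: str
    owner: str
    due: date
    done: bool = False


def open_count(items: list[ActionItem]) -> int:
    """Number of unresolved items: one data point on the burn-down chart."""
    return sum(1 for item in items if not item.done)


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items past their due date, oldest first, for the weekly review."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)
```

Recording `open_count` once a day gives the burn-down chart, and the output of `overdue_items` is the escalation agenda for the weekly incident review.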

References & Links

The Site Reliability Workbook
The DevOps Handbook
The Tyranny of Metrics
Chaos Engineering: System Resiliency in Practice
The Phoenix Project
The Field Guide to Human Error
Leaders Eat Last
Normal Accidents
The Challenger Launch Decision
Thinking in Systems: A Primer
The Goal
Turn the Ship Around!
Drift into Failure
The Safety Anarchist
Still Not Safe

Some Talks and Blogs

Academic Paper How Complex Systems Fail - Richard Cook
YouTube Talk Confessions of an SRE Manager
YouTube Talk Learning from Incidents
YouTube Talk Learning more from Complex Systems
Medium Blog Product Reliability — Is it Just a Matter of Perspective?
Medium Blog Improving Incident Learning Part 1
Medium Blog Improving Incident Learning Part 2
Medium Blog Improving Incident Learning Part 3
Medium Blog Improving Incident Learning Part 4

Thank You