Reliability, Resilience and Incident Management

Andrew Hatch

22 years

5 years

15 years

DevOps Manager
Platform Engineering Manager
SRE Manager
Software Engineering Manager

1996 - 1999

The Customer experience must always be protected -- graceful degradation over outages

The business's ability to generate revenue and maximise profit is paramount

Operational discipline, across the board, makes that happen

Customers Are Not Your Monitoring System	When users find the issue before you do, you've already failed. Systems should be detecting degradation before it becomes pain.
Incidents Are Learning Opportunities, Not Just Interruptions	The postmortem isn't the end of an incident — it's the beginning of improvement. What matters is what you change next.
New Features Should Never Compromise Core Experiences	No launch is worth breaking the flows that drive your revenue and trust. Integrity comes before novelty.
On-Call Is a Core Engineering Discipline	It's not a chore or punishment shift — it's a critical skillset. Every engineer needs to be trained, supported, and held accountable for it.

So I read some of your post-mortems... and have some advice and guidance to share

Noisy Alerts don't improve safety	They desensitize teams to real problems and increase ops toil. In effect, they are spam.
Alerts don't resolve themselves	If it fired, something changed. Dismissing it is betting against your customer noticing first.
Ownership	Every alert should have an owner, playbook, and a threshold. If it doesn't, it's not production-grade.
False Positives Aren't Free	They cost trust, time, and sleep — and eventually, they cost incidents.
Fatigue = Leadership Miss	Alert fatigue is a management failure. If your team is burnt out on on-call — look at your prioritization.
Observability First	Good observability is a product of good alerts and logging. If your alerts are noisy, the problem may be your system design.
Tune and Verify	Every alert that fires is an invitation to assess whether it provided value and led to engineering action.
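The ownership rule above (owner, playbook, threshold) can be checked mechanically. What follows is a minimal sketch, assuming alert definitions can be loaded as simple records. The `AlertRule` type and its field names are hypothetical and not taken from any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical alert definition; field names are illustrative."""
    name: str
    owner: Optional[str] = None
    playbook_url: Optional[str] = None
    threshold: Optional[float] = None


def missing_fields(rule: AlertRule) -> list[str]:
    """Return the required fields that this alert rule leaves unset."""
    required = {
        "owner": rule.owner,
        "playbook_url": rule.playbook_url,
        "threshold": rule.threshold,
    }
    # A threshold of 0.0 is a valid value, so only None and "" count as unset.
    return [field for field, value in required.items() if value in (None, "")]


def audit(rules: list[AlertRule]) -> dict[str, list[str]]:
    """Map each alert that is not production-grade to its missing fields."""
    report = {}
    for rule in rules:
        gaps = missing_fields(rule)
        if gaps:
            report[rule.name] = gaps
    return report
```

Running `audit` in CI against the alert definitions is one way to stop unowned alerts from reaching production. Where the definitions actually live is left open here.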

Metrics Must Tell Stories	If your metrics don't lead to decisions, they're not helping — they're noise. Every graph should answer: "What would I do if this moved?"
Clean Your Data	Dirty data ruins trust in metrics. Separate out synthetic traffic, failed test runs, and spam events.
Detect Regression Fast	Metrics should detect regression before customers do. If conversion drops 20%, you should be notified
Metrics Need Context	A flatline isn't good or bad — it depends on what it's measuring. Without historical context and goals, you're guessing
Don't Metric Everything	Measuring everything isn't a strategy; it's a mess. Choose a small set of high-signal metrics, invest in making them accurate, and ensure they change behavior.
Traceability is hard, but...	Metrics should instrument the user's full journey, not your microservice boundaries, so you don't miss what's actually breaking the experience.
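The "conversion drops 20%" example can be sketched as a comparison against a historical baseline. This is a rough illustration only. The function names are hypothetical, and using the mean of recent periods as the baseline is an assumption; a real system would likely account for seasonality.

```python
from statistics import mean


def relative_drop(baseline: float, current: float) -> float:
    """Fractional drop of current versus baseline (0.2 means 20% lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


def should_notify(history: list[float], current: float,
                  max_drop: float = 0.20) -> bool:
    """True when current has fallen at least max_drop below the mean of history."""
    return relative_drop(mean(history), current) >= max_drop
```

Comparing against history rather than a fixed number also gives the metric the context the "Metrics Need Context" point asks for.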

Customer Experience First	Your critical customer and revenue flows are your front door. Monitor these above all else
Symptoms Over Causes	Good monitoring catches symptoms, not just causes. Don't wait for a DB timeout — alert when the page slows down or checkout stalls.
External ≠ Observable	Third-party dependencies must be monitored like first-class systems. If Stripe dies and you find out from social media, you're already behind.
Less is Clearer	Too many graphs, overlapping metrics, and unclear alerts turn dashboards into white noise.
Monitoring ≠ Afterthought	If monitoring is added after the feature ships, it's already too late. Observability must be part of the design — not the cleanup.
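"Symptoms Over Causes" can be illustrated with a check that looks only at what the customer experiences in a critical flow: how slow it is and how often it fails. This is a sketch under assumptions. The limits (2000 ms, 2%) and the alert messages are placeholders, not recommended values.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of samples (pct between 0 and 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def symptom_alerts(latencies_ms: list[float], failed: int, total: int,
                   p95_limit_ms: float = 2000.0,
                   failure_rate_limit: float = 0.02) -> list[str]:
    """Return customer-facing symptoms for a checkout flow, whatever the cause."""
    alerts = []
    if latencies_ms and percentile(latencies_ms, 95) > p95_limit_ms:
        alerts.append("checkout p95 latency above limit")
    if total > 0 and failed / total > failure_rate_limit:
        alerts.append("checkout failure rate above limit")
    return alerts
```

The same check applies whether the slowdown comes from a database, a third-party dependency, or a bad deploy, which is the point of alerting on symptoms.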

Block Untrusted Deploys	CI/CD systems should block deploys that fail critical-path tests or have < X% test coverage — require EM override to bypass.
Progressive Rollouts	Use progressive rollout strategies (10% → 25% → 60% → 100%) to reduce blast radius. If you're pushing 99% or 100% to prod in one step, that's gambling.
Rollback Must Be Fast	Add rollback automation for any failed canary or gating test. If your team is still SSHing to roll back, you don't have a safe deploy process — you have a recovery delay.
Staging ≠ Production	"It worked in staging" is not a quality guarantee. Feature flags, data conditions, and user scale all change in prod
Decouple Safety from Speed	If you're not using feature flags, you're one bug away from a full rollback. Ship code and control behavior independently
Canaries Aren't Optional	Every critical service should have a canary stage — even if it's just 5% of traffic. If you're deploying blind, you're trusting luck, not process.
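The deploy gate, staged rollout and automatic rollback described above fit together roughly as follows. This is a minimal sketch, not a real CI/CD integration. The callbacks `set_traffic`, `is_healthy` and `rollback` are hypothetical hooks into whatever deploy tooling is in use, and `min_coverage` stands in for the "X%" that each team has to choose.

```python
from typing import Callable, Sequence

ROLLOUT_STAGES = (10, 25, 60, 100)  # traffic percentages from the slide


def deploy_allowed(critical_tests_passed: bool, coverage: float,
                   min_coverage: float, em_override: bool = False) -> bool:
    """Gate a deploy; min_coverage is the team's own 'X%' and has no default."""
    if em_override:
        return True
    return critical_tests_passed and coverage >= min_coverage


def progressive_rollout(set_traffic: Callable[[int], None],
                        is_healthy: Callable[[int], bool],
                        rollback: Callable[[], None],
                        stages: Sequence[int] = ROLLOUT_STAGES) -> bool:
    """Advance stage by stage; roll back automatically at the first unhealthy stage."""
    for percent in stages:
        set_traffic(percent)
        if not is_healthy(percent):
            rollback()
            return False
    return True
```

Because rollback is a callback the pipeline invokes itself, nobody needs to SSH anywhere when a canary stage fails.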

Follow the right Metric Trends	MTTA and MTTD are useful. MTTR is almost useless
Reliability Investment	Learning from incidents is critical. If you don't take action or improve anything after an incident, you've wasted it.
Weekly Reviews	Set up a weekly recurring incident review with Managers (Engineering AND Product) and Staff+. Choose the highest impact incidents to discuss first.
You build it, you own it	Operations is an engineering responsibility. If you change production, you own it. If your team has an on-call roster, you should be on it.
On-call handover	Mandatory weekly meeting for everyone in the team. This is a learning opportunity for everyone. Take attendance if you have to
Roster resilience	Primary, Secondary and Overrides need to be catered for. Managers are always escalation points. Escalate up the chain to the top
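For the metric trends above, MTTD (mean time to detect) and MTTA (mean time to acknowledge) can be computed from three timestamps per incident. This is a sketch only. The `Incident` record and its field names are assumptions, and real incident tooling will name these fields differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps needed."""
    started_at: datetime        # when customer impact began
    detected_at: datetime       # when monitoring (or a human) noticed
    acknowledged_at: datetime   # when the on-call engineer acknowledged


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from start of impact to detection."""
    return mean_delta([i.detected_at - i.started_at for i in incidents])


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time from detection to acknowledgement."""
    return mean_delta([i.acknowledged_at - i.detected_at for i in incidents])
```

Tracking both as a weekly trend rather than as single numbers fits the weekly review cadence described above.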

Planned vs Unplanned	If your sprint is 100% features, you're lying to yourself. Unplanned operational work is real and should feed the backlog.
Sprint Hygiene	Every sprint should surface on-call-driven work. If incident and alert follow-ups aren't showing up in Jira, you're hiding operational toil and waiting for the same incidents to happen again.
Burndown by Default	If you don't have a burn-down chart of unresolved incident action items, you're not learning — you're stalling. Every item should have an owner, ETA, and escalation plan if overdue.
Debt Tagging	Every postmortem should tag unresolved debt: infra, process, test, tooling. Action items should have deadlines
Close testing gaps	If "add tests" is always your postmortem answer, you need to step back and fix your testing strategy — not just your test files.
Escalate the Incomplete	Any incident follow-up not completed within SLA should escalate to the Weekly Incident Management Review — not get silently deprioritized. Delay is risk. Track it like one.
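The burndown and escalation rules above reduce to a simple query over open action items. This is a minimal sketch, assuming action items can be exported with an owner, a due date and a completion flag. The `ActionItem` type is hypothetical and not tied to Jira or any other tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical postmortem action item."""
    title: str
    owner: str
    due: date
    done: bool = False


def open_count(items: list[ActionItem]) -> int:
    """Number of unresolved items: one data point on the burndown chart."""
    return sum(1 for item in items if not item.done)


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Unresolved items past their due date, oldest first, for the weekly review."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)
```

Recording `open_count` on a schedule gives the burndown trend, and `overdue_items` supplies the agenda for escalation.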

	The Site Reliability Workbook		The DevOps Handbook
	The Tyranny of Metrics		Chaos Engineering: System Resiliency in Practice
	The Phoenix Project		The Field Guide to Human Error
	Leaders Eat Last		Normal Accidents
	The Challenger Launch Decision		Thinking in Systems: A Primer
	The Goal		Turn the Ship Around!
	Drift into Failure		The Safety Anarchist
	Still Not Safe

	How Complex Systems Fail - Richard Cook
	Confessions of an SRE Manager
	Learning from Incidents
	Learning more from Complex Systems
	Product Reliability — Is it Just a Matter of Perspective?
	Improving Incident Learning Part 1
	Improving Incident Learning Part 2
	Improving Incident Learning Part 3
	Improving Incident Learning Part 4

Questions?