22 years
5 years
15 years
DevOps Manager
Platform Engineering Manager
SRE Manager
Software Engineering Manager
1996 - 1999
The customer experience must always be protected -- graceful degradation over outages
The business's ability to generate revenue and maximise profit is paramount
Operational discipline, across the board, makes that happen
| Customers Are Not Your Monitoring System | When users find the issue before you do, you've already failed. Systems should be detecting degradation before it becomes pain. |
| Incidents Are Learning Opportunities, Not Just Interruptions | The postmortem isn't the end of an incident — it's the beginning of improvement. What matters is what you change next. |
| New Features Should Never Compromise Core Experiences | No launch is worth breaking the flows that drive your revenue and trust. Integrity comes before novelty. |
| On-Call Is a Core Engineering Discipline | It's not a chore or punishment shift — it's a critical skillset. Every engineer needs to be trained, supported, and held accountable for it. |
So I read some of your post-mortems... and have some advice and guidance to share
| Metrics | Monitoring | Alerting | Deployment Safety | Incidents | Discipline and Ritual |
| Noisy Alerts don't improve safety | They desensitize responders to real problems and increase ops toil. They are effectively spam. |
| Alerts don't resolve themselves | If it fired, something changed. Dismissing it is betting against your customer noticing first. |
| Ownership | Every alert should have an owner, playbook, and a threshold. If it doesn't, it's not production-grade. |
| False Positives Aren't Free | They cost trust, time, and sleep — and eventually, they cost incidents. |
| Fatigue = Leadership Miss | Alert fatigue is a management failure. If your team is burnt out on on-call — look at your prioritization. |
| Observability First | Good observability is a product of good alerts and logging. If your alerts are noisy, the problem may be your system design. |
| Tune and Verify | Every alert that fires is an invitation to assess whether it provided value and led to engineering action. |
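The ownership rule above (every alert has an owner, a playbook, and a threshold) can be checked mechanically. A minimal sketch, assuming alert definitions are available as plain dictionaries; the field names `name`, `owner`, `playbook_url`, and `threshold` are hypothetical and not taken from any particular alerting tool.

```python
# Hypothetical field names; adapt to however your alert definitions are stored.
REQUIRED_FIELDS = ("owner", "playbook_url", "threshold")


def missing_fields(alert: dict) -> list[str]:
    """Return the required fields that are absent or empty on one alert."""
    return [f for f in REQUIRED_FIELDS if alert.get(f) in (None, "")]


def not_production_grade(alerts: list[dict]) -> dict[str, list[str]]:
    """Map alert name -> missing fields, for every alert that fails the check."""
    report = {}
    for alert in alerts:
        missing = missing_fields(alert)
        if missing:
            report[alert.get("name", "<unnamed>")] = missing
    return report
```

Run as a CI step over the alert definitions, a non-empty report could fail the build, so an unowned alert never reaches production.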
| Metrics Must Tell Stories | If your metrics don't lead to decisions, they're not helping — they're noise. Every graph should answer: "What would I do if this moved?" |
| Clean Your Data | Dirty data ruins trust in metrics. Filter out synthetic traffic, failed test runs, and spam events so they don't pollute real user data. |
| Detect Regression Fast | Metrics should detect regression before customers do. If conversion drops 20%, you should be notified. |
| Metrics Need Context | A flatline isn't good or bad — it depends on what it's measuring. Without historical context and goals, you're guessing |
| Don't Metric Everything | Measuring everything isn't a strategy — it's a mess. Choose a small set of high-signal metrics, invest in making them accurate, ensure they lead to behavioral impact |
| Traceability is hard, but... | Metrics should instrument the user's full journey, not your microservice boundaries, so you don't miss what's actually breaking the experience. |
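The "conversion drops 20%" rule above can be expressed as a comparison against a recent baseline. A minimal sketch, assuming the baseline is a list of recent conversion-rate samples; the function names and the choice of a simple mean as the baseline are illustrative assumptions, not a recommendation of a specific detection method.

```python
from statistics import mean


def relative_drop(baseline: list[float], current: float) -> float:
    """Fractional drop of `current` below the baseline mean (0.25 means 25% lower)."""
    base = mean(baseline)
    if base <= 0:
        return 0.0  # no meaningful baseline to compare against
    return max(0.0, (base - current) / base)


def should_notify(baseline: list[float], current: float, threshold: float = 0.20) -> bool:
    """True when the drop against the baseline reaches the threshold (default 20%)."""
    return relative_drop(baseline, current) >= threshold
```

The baseline window is where "Metrics Need Context" applies: comparing against the same hour last week, for example, avoids paging on normal daily cycles.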
| Customer Experience First | Your critical customer and revenue flows are your front door. Monitor these above all else |
| Symptoms Over Causes | Good monitoring catches symptoms, not just causes. Don't wait for a DB timeout — alert when the page slows down or checkout stalls. |
| External ≠ Observable | Third-party dependencies must be monitored like first-class systems. If Stripe dies and you find out from social media, you're already behind. |
| Less is Clearer | Too many graphs, overlapping metrics, and unclear alerts turn dashboards into white noise. |
| Monitoring ≠ Afterthought | If monitoring is added after the feature ships, it's already too late. Observability must be part of the design — not the cleanup. |
| Block Untrusted Deploys | CI/CD systems should block deploys that fail critical-path tests or have < X% test coverage — require EM override to bypass. |
| Progressive Rollouts | Use progressive rollout strategies (10% → 25% → 60% → 100%) to reduce blast radius. If your first step pushes 99% or 100% to prod, that's gambling. |
| Rollback Must Be Fast | Add rollback automation for any failed canary or gating test. If your team is still SSHing to roll back, you don't have a safe deploy process — you have a recovery delay. |
| Staging ≠ Production | "It worked in staging" is not a quality guarantee. Feature flags, data conditions, and user scale all change in prod |
| Decouple Safety from Speed | If you're not using feature flags, you're one bug away from a full rollback. Ship code and control behavior independently |
| Canaries Aren't Optional | Every critical service should have a canary stage — even if it's just 5% of traffic. If you're deploying blind, you're trusting luck, not process. |
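Progressive rollout and automated rollback fit together as one loop: shift traffic, check a health gate, and revert on failure. A minimal sketch, assuming two hypothetical callables supplied by your deploy tooling: `set_traffic(percent)` routes that share of traffic to the new version, and `is_healthy(percent)` evaluates the gating checks at that stage.

```python
from typing import Callable, Sequence

STAGES = (10, 25, 60, 100)  # traffic percentages from the rollout guidance above


def progressive_rollout(
    set_traffic: Callable[[int], None],  # hypothetical hook into deploy tooling
    is_healthy: Callable[[int], bool],   # hypothetical health gate per stage
    stages: Sequence[int] = STAGES,
) -> bool:
    """Advance stage by stage; on a failed gate, roll back to 0% and stop."""
    for percent in stages:
        set_traffic(percent)
        if not is_healthy(percent):
            set_traffic(0)  # automated rollback, no SSH required
            return False
    return True
```

In practice `is_healthy` would wait for a soak period and compare error rate and latency against the old version before returning.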
| Follow the right Metric Trends | MTTA (mean time to acknowledge) and MTTD (mean time to detect) are useful. MTTR is almost useless. |
| Reliability Investment | Learning from incidents is critical. If you don't take action or improve anything after an incident, you've wasted it. |
| Weekly Reviews | Set up a weekly recurring incident review with Managers (Engineering AND Product) and Staff+. Discuss the highest-impact incidents first. |
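MTTD and MTTA can be trended for the weekly review from three timestamps per incident. A minimal sketch, assuming MTTD is measured from impact start to detection and MTTA from detection to acknowledgement (definitions vary between teams); the `Incident` fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime       # when customer impact began
    detected: datetime      # when monitoring (or a human) noticed
    acknowledged: datetime  # when a responder took ownership


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: impact start to detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mtta(incidents: list[Incident]) -> timedelta:
    """Mean time to acknowledge: detection to acknowledgement."""
    return mean_delta([i.acknowledged - i.detected for i in incidents])
```

A mean hides outliers, so it is worth looking at the slowest detections individually as well as the trend.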
| You build it, you own it | Operations is an engineering responsibility. If you change production, you own it. If your team has an on-call roster, you should be on it. |
| On-call handover | Mandatory weekly meeting for everyone in the team. This is a learning opportunity for everyone. Take attendance if you have to |
| Roster resilience | Primary, Secondary and Overrides need to be catered for. Managers are always escalation points. Escalate up the chain to the top |
| Planned vs Unplanned | If your sprint is 100% features, you're lying to yourself. Unplanned operational work is real and should feed the backlog. |
| Sprint Hygiene | Every sprint should surface on-call-driven work. If incident and alert follow-ups aren't showing up in Jira, you're hiding operational toil and waiting for them to happen again. |
| Burndown by Default | If you don't have a burn-down chart of unresolved incident action items, you're not learning — you're stalling. Every item should have an owner, ETA, and escalation plan if overdue. |
| Debt Tagging | Every postmortem should tag unresolved debt by type: infra, process, test, or tooling. Every action item should have a deadline. |
| Close testing gaps | If "add tests" is always your postmortem answer, you need to step back and fix your testing strategy — not just your test files. |
| Escalate the Incomplete | Any incident follow-up not completed within SLA should escalate to the Weekly Incident Management Review — not get silently deprioritized. Delay is risk. Track it like one. |
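The burn-down and escalation rules above reduce to two queries over incident action items: what is still open, and what is past its due date. A minimal sketch, assuming action items are exported from your tracker into a simple record; the `ActionItem` fields are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def open_items(items: list[ActionItem]) -> list[ActionItem]:
    """Unresolved items: the count of these per week is the burn-down series."""
    return [i for i in items if not i.done]


def to_escalate(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items past their due date, oldest first, for the weekly review agenda."""
    overdue = [i for i in open_items(items) if i.due < today]
    return sorted(overdue, key=lambda i: i.due)
```

Feeding `to_escalate` straight into the weekly review agenda keeps overdue follow-ups visible instead of silently deprioritized.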
Questions?
https://opsguidance.hatchman76.com