Lots of metrics are typically available, but almost all of them are noise.
Start with the business: what is important to the business? What kinds of failures are existential threats?
Then work your way down and design your metrics and alerts, instead of just throwing stuff at the wall.
I’ve had to push back so many times with teams whose manager at one point said “we need better monitoring / alerting” and they interpreted that to mean more metrics / alerts.
This is rarely the case.
I personally am really fond of just using a few alerts. The important thing is to know that something went wrong. Not necessarily where / why / how something went wrong.
And yes, inertia is real, and false or low-value alerts need to be killed immediately, without remorse. They are SRE's cancer.
As you say, few is better. And a well chosen few.
You then carve out essential user flows from these system diagrams, and only then, when you look at an alert in isolation, can you tie the entire story together: is this alert important, which user workflow does it break, what is the SLO on it?
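To make that concrete, here is a rough sketch of what tying alerts back to flows and SLOs might look like; the alert names, flows, and numbers are made up for illustration:

```python
# Hypothetical mapping from alerts to the user flows and SLOs they protect.
# Names and thresholds are illustrative, not from any real system.
ALERT_CONTEXT = {
    "checkout_5xx_rate_high": {
        "user_flow": "customer completes a purchase",
        "slo": "99.9% of checkout requests succeed over 30 days",
        "page": True,   # breaks a revenue-critical flow -> wake someone up
    },
    "report_export_queue_lagging": {
        "user_flow": "analyst downloads a weekly report",
        "slo": "95% of exports finish within 15 minutes",
        "page": False,  # annoying, but can wait for business hours
    },
}

def triage(alert_name: str) -> str:
    """Answer the three questions: is it important, what breaks, what's the SLO?"""
    ctx = ALERT_CONTEXT.get(alert_name)
    if ctx is None:
        return f"{alert_name}: no user flow attached - why does this alert exist?"
    action = "page on-call" if ctx["page"] else "open a ticket"
    return f"{alert_name}: breaks '{ctx['user_flow']}' (SLO: {ctx['slo']}) -> {action}"
```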
Then I have a second level of this, the superpanic. This is the "true" alert, which means "drop everything, fix this now". Every superpanic triggers stricter routines which intentionally cause friction, such as creating tickets about said superpanic, potentially hosting post-mortems, etc. This additional manual labour encourages tweaking the superpanic thresholds so that they are sometimes more lax, sometimes stricter, depending on the quality of the deployed services + the current load.
What signals a superpanic? Key valuable functionality being offline. Mostly off-site uptime checkers verifying that all primary domains resolve + serve traffic. Also cron-driven integration tests of core functionality. Stuff like that.
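A superpanic signal in that spirit can be as dumb as a cron job hitting the primary domains from the outside and escalating on any failure. A rough sketch, with made-up endpoints and a print() standing in for whatever actually pages people:

```python
#!/usr/bin/env python3
"""Cron-driven check of core functionality: do the primary domains resolve and
serve traffic? Endpoints and the escalation hook are placeholders."""
import sys
import urllib.request

PRIMARY_DOMAINS = [
    "https://example.com/healthz",       # hypothetical endpoints
    "https://app.example.com/login",
]

def check(url: str, timeout: float = 10.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def main() -> int:
    failures = [u for u in PRIMARY_DOMAINS if not check(u)]
    if failures:
        # Replace with whatever triggers the superpanic path (pager, chat hook, ...).
        print("SUPERPANIC: core functionality offline:", ", ".join(failures))
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```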
While this sounds sensible, in my experience it often becomes just a convoluted punishment for the people involved when the alert fires. In general, people are lazy (sorry), and if an alert makes them fill out post-mortem forms and attend mandatory late meetings with management about why something was triggered, whatever the reason, 99% of people will push to remove the alert altogether, or at least lower its priority. I haven't found a solution that doesn't involve a complete overhaul of the organization.
https://en.wikipedia.org/wiki/Nelson_rules
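For those who don't want to click through: a couple of the Nelson rules are easy to sketch over a window of metric samples. The window size and what you do with a violation are up to you; this is just an illustration of the idea:

```python
import statistics

def nelson_violations(samples: list[float]) -> list[str]:
    """Flag two of the Nelson rules over a window of metric samples.
    Rule 1: a point more than 3 sigma from the mean.
    Rule 2: nine or more consecutive points on the same side of the mean."""
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    violations = []
    if sigma > 0 and any(abs(x - mean) > 3 * sigma for x in samples):
        violations.append("rule 1: point beyond 3 sigma")
    run, last_side = 0, 0
    for x in samples:
        side = 1 if x > mean else -1 if x < mean else 0
        run = run + 1 if side == last_side and side != 0 else 1
        last_side = side
        if run >= 9:
            violations.append("rule 2: nine points on one side of the mean")
            break
    return violations
```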
Also, the best alerts come from looking at actual failures you had and not trying to make up "good alerts" from thin air. After you have an outage, figure out what alerts would have caught it, and implement those.
I think alerts are to ops what tests are to dev. You have "unit alerts" for some small thing like the disk usage on a single host, "integration alerts" like literally "does the page load?" and then what you describe are "regression alerts", trying to prevent something that went wrong once from going wrong again. These are great but just like you wouldn't have 100% regression tests, I think it's also smart to try to get ahead of failures and have some common sense alerts defined.
Also, looking at failures others had, plus prior experience from yourself and others, contributes to good alerts. You don't have to wait for a failure to implement most of them. Most of that knowledge is also trained into most LLMs nowadays. Just ask, then also verify the sources, then implement. If you get too many alerts, question whether you really needed them or whether it's just noise. It's constant trimming until you find the perfect alert setup.
ElasticSearch, for example, can be configured using ILM policies to fill up the disk and then start deleting old records. I don't need to be woken up for disk filling up on those nodes.
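For reference, something like the following is roughly what such a policy looks like when pushed over the REST API. The rollover and retention values are made up, and strictly speaking ILM prunes by index age/size rather than watching free disk directly:

```python
import requests

# Sketch: an ILM policy that rolls indices over and deletes old ones, so the
# cluster prunes itself instead of paging someone about disk usage.
# Host, policy name, and the thresholds are placeholders.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "50gb", "max_age": "7d"}
                }
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

resp = requests.put("http://localhost:9200/_ilm/policy/logs-cleanup", json=policy)
resp.raise_for_status()
```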
Even worse are CPU/RAM alerts.
“it’s not X it’s Y”
At this point, when I see this pattern in writing, I assume most if not all of it is AI generated - same with em-dashes.
This is not to discount the idea that alerts are more important than dashboards (I work directly in observability) - but just to say that I personally shut off reading anything else with these patterns because, generally speaking, the rest of the content is just not original or interesting.
It is very common to find things that a majority of people wrongly believe to be X when they are in fact Y.
In such cases, you must point out to them that "it's not X, it's Y".
There are a few alternative ways to formulate this, but the alternatives are typically longer and more complex.
The same happens with em-dashes, which have valid uses and one should not care that there exist some people who are not familiar with the classic ways of using punctuation.
I do not believe that the right solution is to attempt to use more convoluted expressions or inappropriate punctuation in order to avoid being accused of being a clanker.
At the first role I ever had 10+ years ago, we had a TV in our team's office space constantly showing our dashboard for our critical services and health. We still had alerting monitors but it felt like those alarms were for important issues (like sev-2 or worse).
The last couple of roles I've had, we don't constantly look at our dashboards unless our monitors keep ringing us with alerts. We have also had more monitors in general than at that first role I mentioned. Occasionally, if another team asks whether we're affected by something, we'll look at the dashboards we have to make sure we don't have a monitoring gap.
Instead, I would move up a level and start with an SLO for the various "business level" metrics you might care about. Things like "request latency", "successful requests", etc.
Then use the longer lookahead "error budget" burndowns to see where your error budget is being spent, and from there decide 1.) if the SLO needs adjusting, and/or 2.) if an alert is appropriate.
To cleanly answer those questions and iterate, you'll need metrics, dashboards, traces, and logs. So then you're not just making dashboards because "it's best practice"; you're creating them to specifically help you measure whether you're meeting your stated service objectives.
As far as timespans for the error budget consumption, I’ve seen 1 hour -> 1 day -> 1 week. The 1 hour error budget rate would be a page and the others would be low priority.
So you could either keep that as the alerting and/or use the error budget “look ahead” to see if there are more specific alerts you need.
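The burn-rate arithmetic behind those windows is simple enough to sketch. The 14.4x fast-burn threshold is the usual 30-day / 99.9% example, and the observed error ratios here are made up:

```python
def burn_rate(error_ratio: float, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed relative to plan.
    1.0 means 'exactly on budget'; 14.4 over 1h is a common page threshold."""
    budget = 1.0 - slo_target          # e.g. 0.1% of requests are allowed to fail
    return error_ratio / budget

# Illustrative numbers, not from any real system.
windows = {"1h": 0.0050, "1d": 0.0012, "1w": 0.0004}   # observed error ratios
for window, ratio in windows.items():
    rate = burn_rate(ratio)
    severity = "page" if window == "1h" and rate > 14.4 else "ticket" if rate > 1 else "ok"
    print(f"{window}: burn rate {rate:.1f}x -> {severity}")
```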
curl -fsSL https://simpleobservability.com/install.sh | sudo bash -s -- <SERVER KEY>
I don't feel comfortable running this in prod.
Each critical and warning alert should link to an "interactive runbook" - a dashboard that combines text instructions along with graphs showing real-time data.
Doing this at scale, correctly, requires both alerts-as-code and dashboards-as-code, which almost nobody does because nobody treats higher-level configuration languages (jsonnet, CUE...) with the attention and respect they deserve /cries-in-yaml
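A minimal sketch of what alerts-as-code can look like even without jsonnet or CUE: keep the alert definition and its runbook link in one structure and render the Prometheus rule file from it. Metric names, thresholds, and the runbook URL are placeholders:

```python
import yaml  # pip install pyyaml

# Alert definitions live in code alongside their runbook links, then get
# rendered to a Prometheus rule file. Everything below is a placeholder example.
ALERTS = [
    {
        "alert": "CheckoutErrorBudgetBurn",
        "expr": 'sum(rate(http_requests_total{job="checkout",code=~"5.."}[1h]))'
                ' / sum(rate(http_requests_total{job="checkout"}[1h])) > 0.0144',
        "for": "5m",
        "labels": {"severity": "critical"},
        "annotations": {
            "summary": "Checkout is burning error budget fast",
            "runbook_url": "https://runbooks.example.com/checkout-burn",
        },
    },
]

rule_file = {"groups": [{"name": "slo-alerts", "rules": ALERTS}]}
print(yaml.safe_dump(rule_file, sort_keys=False))
```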
When a critical alert fires, kick off a battery of diagnostic checks which you have collected over time to pinpoint the cause.
What if the diagnostic checks don't reveal the issue? There is still value: you know these are not the cause, so no time is wasted re-evaluating them. Where do you get these diagnostic checks from? Well, what's the first thing responding engineers do? Open the CLI and troubleshoot. Those are your diagnostic checks. Collect them, automate them, capture the domain-specific knowledge and democratize it.
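One way to capture that is a small harness that runs the same commands responders would type by hand and attaches the output to the alert or ticket. The specific commands and service name below are placeholders, not a prescription:

```python
import subprocess

# The commands are whatever your responders actually type first; these are
# generic examples, and "myservice" is a hypothetical unit name.
DIAGNOSTIC_CHECKS = [
    ("disk usage", ["df", "-h"]),
    ("kernel log (look for oom-killer)", ["dmesg", "-T"]),
    ("service status", ["systemctl", "status", "myservice", "--no-pager"]),
]

def run_diagnostics() -> str:
    """Run every check, capture output, and build one report for the responder."""
    report = []
    for name, cmd in DIAGNOSTIC_CHECKS:
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            report.append(f"### {name} (exit {out.returncode})\n{out.stdout[-2000:]}")
        except Exception as exc:
            report.append(f"### {name} failed to run: {exc}")
    return "\n\n".join(report)

# Attach run_diagnostics() output to the alert so the responder starts with the
# same evidence every time, even when all the checks come back clean.
```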