Alert-driven monitoring

Posted by khazit 7 days ago

Alert-driven monitoring(simpleobservability.com)

127 points | 44 commentspage 2

hyperadvanced 6 days ago|

I used to believe in alert fatigue, because you’re frequently told to repeat the line: if you have too many alerts, eventually everyone will stop paying attention to them.

I have tons of alerts at work. They go to specialized slack channels that I can look at if I need. We have on call escalation paths for critical ones and housekeeping duties for the ones that require engineers to perform a maintenance task. We have the hell channels that are 99.99% flapping, if you ever need that.

I find that observability in general has an extremely linear marginal reward curve, it basically always justifies the effort you put into setting it up.

prism56 6 days ago||

I work writing analytics and monitoring for industrial equipment. We have hundreds of sensors sending back realtime data.

There was a period of time where people were writing alerts for the sake of it (i.e we have this sensor, when should we alert on it).

Nowadays we're strictly failure mode driven, this has meant lots of sensors aren't used in the analytics. They are however available to the experts to plot them for a more holistic view if required.

nulltrace 6 days ago|

We went through the same switch. Half our alerts had been firing for a while and nobody ever acted on them.

kylemaxwell 6 days ago||

I like the ideas, but either it’s entirely LLM written or the writer has internalized “LLM voice”. At this point that is more distracting than helpful.

thunky 6 days ago||

Do we really need this comment for every article? Who cares if AI wrote it if you like the ideas?

tuetuopay 6 days ago|||

The issue is, there’s not a lot of meat in this article. Anyone who’s done any amount of SRE can perfectly articulate alert fatigue in way less words.

Yet the article doesn’t tackle at all the hard part: making alerts that are actually meaningful. They handwave it instead of giving actual advice. This post is a good intro, but I didn’t "walk away" with anything useful.

This is why, in this case, AI is important. Someone puts in an effort to write a short article (if a bit wordy) that can be used by e.g. beginners or managers? Good! I’m not the target audience. But if it’s the output of AI, what’s the intent?

thunky 6 days ago||

The aricle is a marketing page under a "Winning with us" section right next to a "CEO Page" that describes a CEO pitch. I really don't think this article is very different from thousands of others like it that were published before AI.

tuetuopay 6 days ago||

Huh, thanks, I failed to see it on my phone. Down the trash the article goes then.

loloquwowndueo 6 days ago|||

I care.

thunky 6 days ago||

Nobody cares that you care, because you're not adding anything to the conversation.

loloquwowndueo 6 days ago||

I’m answering a direct question posed by the parent post. What’s your excuse?

thunky 6 days ago||

I wrote the parent post. It was a rhetorical question.

radiator 6 days ago||

But it did not get the answer you expected. So a failed rhetorical question.

thunky 6 days ago||

Exactly.

beachy 6 days ago||

The prompt: "take existing decades-old knowledge about best practices in setting up alerts and spin it into a multi-page article presenting it as somehow novel, to assist our submarine marketing efforts".

jbmsf 6 days ago||

I have some thoughts here.

I work for a startup; we have what I think is a fairly typical setup: metrics ingested from a variety of sources, fed into industry-standard metrics/dashboard solutions, triggering escalations to humans. It's fine and I'm happy we have it, but...

The highest value source of alerting right now is one of our growth marketers who pays close attention to our CRM and product analytics tool and notices when key product funnels are underperforming.

Our next highest value signals are a handful of ad hoc alerting channels, mostly in Slack, either directly from a partner telling us that something suspicious happened on their side (think: fraud) or from in-product instrumentation sent to a channel for non-engineering visibility. Members of our business/product/operations team pay attention in these places and make decisions based on their business context.

After that, our support team is increasingly able to filter customer issues and differentiate between bugs, missing features, etc.

I know someone is going to argue that these are all a sign that we haven't instrumented the right things. Fair, but also misses the point. The decision makers in these flows don't (and won't) live in traditional alerting systems and wouldn't have helped us understand breakages without these other, ad hoc processes.

My theory is that it's relatively easy to offer a technical product that moves alerts around or that manages escalation paths. It's quite hard to design a product that surfaces detail to a non-technical export and that makes it easy to build systematic rules.

nishantjani10 6 days ago|

you are defining the art of setting up SLOs for end user workflows. This is typically achieved with contract monitoring (top down). This article is focused on bottom up approach of fine tuning and setting up alerts

jbmsf 2 days ago||

Thanks for the pointer to SLOs; I've read more than a few posts about them but it never clicked, will look more closely.

My point, I think, is still that the overwhelming focus of the tools I've seen focus on the kind of fine-tuning/setup you are describing and not the things that I find most valuable. And I think that part of the problem is that it's easy to build technology around mechanics than judgement.

huflungdung 6 days ago|

[dead]