Posted by todsacerdoti 12/17/2025
A warning can be safely ignored. A warning might say 'debugging enabled, results cannot be certified' or something similar.
An error should not be ignored: an operation is failing, data loss may be occurring, and so on.
Some users may be okay with that data loss or failing operation; maybe it isn't important to them. If the program continues and doesn't error in the parts that matter to the user, then they can ignore it, but objectively an error is still occurring.
A fatal message cannot be ignored: the system has crashed. It's the last thing you see before shutdown is attempted.
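A minimal sketch of that taxonomy with Python's stdlib logging (logger name and messages are made up; Python spells FATAL as CRITICAL):

    import logging

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("example")

    # WARNING: safe to ignore; the system still does what it should.
    log.warning("debugging enabled, results cannot be certified")

    # ERROR: an operation is failing and data loss may be occurring.
    log.error("failed to persist record %s; data may be lost", "rec-42")

    # CRITICAL: the closest stdlib level to FATAL; the last thing
    # logged before shutdown is attempted.
    log.critical("unrecoverable state, shutting down")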
My rule of thumb is whether anyone needs to act on a message right now.
If yes: ERROR.
If I want to check it tomorrow: WARNING.
If it's useful for debugging: INFO.
Everything else: DEBUG.
The problem with the article's approach is that libraries don't have enough context. A timeout calling an external API might be totally fine if you're retrying, but it's an ERROR if you've exhausted retries and failed the user's request.
We solve this by having libraries emit structured events with severity hints; the application layer then decides the final log level based on business impact. A 500 from a recommendation service? Warning. A 500 from the payment processor? Error.
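A sketch of that split, with hypothetical event and service names (assuming Python's stdlib logging):

    import logging

    log = logging.getLogger("app")

    # Library layer: report a structured event with a severity *hint*,
    # not a final verdict (hypothetical event shape).
    def call_recommendations() -> dict:
        # ...make the request; suppose it came back with a 500.
        return {"event": "upstream_500", "service": "recsys",
                "hint": logging.WARNING}

    # Application layer: the final level depends on business impact.
    CRITICAL_SERVICES = {"payments"}

    def record(evt: dict) -> None:
        if evt["service"] in CRITICAL_SERVICES:
            level = logging.ERROR      # payment processor: real error
        else:
            level = evt["hint"]        # recommendation service: warning
        log.log(level, "upstream failure in %s", evt["service"])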
This post frames the problem almost entirely from a sysadmin-as-log-consumer perspective, and concludes that a correctly functioning system shouldn’t emit error logs at all. That only holds if sysadmins are the only "someone" who can act.
In practice, if there is a human who needs to take action, whether that's a developer fixing a bug, ops resolving an infra issue, or someone coordinating with an external dependency, then it's an error. The solution isn't to downgrade the severity but to route and notify the right owner.
Severity should encode actionability, not just system correctness.
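A sketch of that routing idea, with made-up component names and a print standing in for the real pager:

    # Severity stays ERROR; what changes is who gets notified.
    OWNERS = {
        "checkout-bug": "dev-oncall",        # developer fixing a bug
        "disk-pressure": "infra-oncall",     # infra issue
        "vendor-timeout": "vendor-liaison",  # external dependency
    }

    def notify(component: str, message: str) -> None:
        owner = OWNERS.get(component, "default-oncall")
        print(f"page {owner}: {message}")  # stand-in for a real notifier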
So error messages are to be expected as you progressively add and then clear up edge cases.
Our production system pages oncall for any errors. At night it will only wake somebody up for a large burst of errors. This discipline forces us to look at every ERROR and decide whether it is spurious and out of our control or something we can deal with. At some point our production system will reach a scale where errors are logged constantly and this strategy won't make sense any more. But for now it helps keep our system clean.
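Something like this hypothetical damping rule (all thresholds and hours are made up):

    import datetime
    from collections import deque

    WINDOW_SECONDS = 300   # look-back window for the night-time burst
    NIGHT_BURST = 20       # errors in the window before we page at night
    recent = deque(maxlen=1000)

    def should_page(now: datetime.datetime) -> bool:
        recent.append(now.timestamp())
        if 8 <= now.hour < 22:          # daytime: any error pages oncall
            return True
        cutoff = now.timestamp() - WINDOW_SECONDS
        return sum(1 for t in recent if t >= cutoff) >= NIGHT_BURST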
Did it do what it was supposed to do, but in a different way, or defer it for a later retry? Then WARN.
Did it fail to do what it needed to do? ERROR
Did it do what it needed to do in the normal way because it was totally recoverable? INFO
Did data get destroyed in the process? FATAL
It should be about what the result was, not who will fix it or how, because that might change over time.
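That rubric as a lookup table, with made-up outcome names (Python's stdlib spells FATAL as CRITICAL):

    import logging

    LEVEL_BY_OUTCOME = {
        "done_differently": logging.WARNING,   # or deferred for retry
        "failed": logging.ERROR,               # didn't do what it needed to
        "done_normally": logging.INFO,         # fully recovered
        "data_destroyed": logging.CRITICAL,    # FATAL territory
    }

    def log_outcome(log: logging.Logger, outcome: str, msg: str) -> None:
        log.log(LEVEL_BY_OUTCOME[outcome], msg)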
> Did it fail to do what it needed to do? ERROR
> Did it do what it needed to do in the normal way because it was totally recoverable? INFO
We have a web-facing system (it uses a custom request-response protocol on top of Websocket... it's an old system) that users are routinely trying to, ahem, automate, even though it's technically against ToS, but hey, as long as we don't catch them? Anyway, it's quite common to see user connections that send malformed commands and then get disconnected after we send them a critical_error/protocol_error message; we do have quite extensive validation logic for user commands.
So, how should such errors be logged in your opinion? I know that we originally logged them as errors but very quickly changed to warnings, and precisely for the reasons outlined in TFA: if some kewl haxxor can't figure out how to quote strings in JSON, it's not really something we can fix. We probably should keep the records, just to know that "oh, some script kiddie was trying to hack us during that time period", but nothing more than that; it definitely doesn't warrant the "hey, there are too many errors in the sfo2 location, please take a look" summons at 3:00 AM from the ops team.
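A sketch of that choice, with hypothetical function and field names, assuming the commands are JSON as described:

    import json
    import logging

    log = logging.getLogger("ws.protocol")

    def parse_command(raw: str, client_id: str):
        # A malformed command is the client's bug, not ours: keep a
        # record at WARNING for forensics, then let the caller send
        # critical_error/protocol_error and disconnect.
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            log.warning("protocol_error from %s: %s", client_id, exc)
            return None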
    if error == NULL and operationFailed then
        log error
    otherwise
        let the client side do the error handling (in terms of logging)
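Read literally, that rule might look like this in Python (names are hypothetical):

    import logging

    log = logging.getLogger("server")

    def finish(operation_failed: bool, error_for_client) -> None:
        # Log server-side only when the failure can't be surfaced to
        # the caller; if the client got an error back, logging is its job.
        if error_for_client is None and operation_failed:
            log.error("operation failed with no error returned to client")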
My company now has a log aggregator that scans the logs for errors; when it finds one, it creates a Trello card, uses opus to fix the issue, and then proposes a PR against the card. These then get reviewed, finished up if tweaks are necessary, and merged if appropriate.
Even if your libraries use nothing but exceptions or return codes, you still end up with levels. You still end up with logs that contain information that gets ignored when it shouldn't be, because there's so much noise that people get tired of all the "cries of wolf."
Occasionally one is at a high enough level to know for sure that something needs fixing, and for this I use "CRITICAL", which is my code for "absolutely sure that you can't ignore this."
IMO it's about time AI was looking at the logs to find out whether there's something we really need to be alerted to.