Posted by todsacerdoti 12/17/2025
A warning can be safely ignored. A warning might say 'debugging enabled, results cannot be certified' or something similar.
An error should not be ignored: an operation is failing, data loss may be occurring, and so on.
Some users may be okay with that data loss or failing operation; maybe it isn't important to them. If the program continues and doesn't error in the parts that matter to the user, then they can ignore it, but objectively an error is still occurring.
A fatal message cannot be ignored: the system has crashed. It's the last thing you see before shutdown is attempted.
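A minimal sketch of that taxonomy with Python's stdlib logging (logger name and messages are made up; Python spells FATAL as CRITICAL):

    import logging

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("example")

    # WARNING: safe to ignore; the system still does what it should.
    log.warning("debugging enabled, results cannot be certified")

    # ERROR: an operation is failing and data loss may be occurring.
    log.error("failed to persist record %s; data may be lost", "rec-42")

    # CRITICAL: the closest stdlib level to FATAL; the last thing
    # logged before shutdown is attempted.
    log.critical("unrecoverable state, shutting down")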
My rule of thumb is whether anyone needs to act on a message right now.
If yes: ERROR.
If I want to check it tomorrow: WARNING.
If it's useful for debugging: INFO.
Everything else: DEBUG.
The problem with the article's approach is that libraries don't have enough context. A timeout calling an external API might be totally fine if you're retrying, but it's an ERROR if you've exhausted retries and failed the user's request.
We solve this by having libraries emit structured events with severity hints; the application layer then decides the final log level based on business impact. A 500 from a recommendation service? Warning. A 500 from the payment processor? Error.
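A sketch of that split, with hypothetical event and service names (assuming Python's stdlib logging):

    import logging

    log = logging.getLogger("app")

    # Library layer: report a structured event with a severity *hint*,
    # not a final verdict (hypothetical event shape).
    def call_recommendations() -> dict:
        # ...make the request; suppose it came back with a 500.
        return {"event": "upstream_500", "service": "recsys",
                "hint": logging.WARNING}

    # Application layer: the final level depends on business impact.
    CRITICAL_SERVICES = {"payments"}

    def record(evt: dict) -> None:
        if evt["service"] in CRITICAL_SERVICES:
            level = logging.ERROR      # payment processor: real error
        else:
            level = evt["hint"]        # recommendation service: warning
        log.log(level, "upstream failure in %s", evt["service"])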
This post frames the problem almost entirely from a sysadmin-as-log-consumer perspective, and concludes that a correctly functioning system shouldn’t emit error logs at all. That only holds if sysadmins are the only "someone" who can act.
In practice, if there is a human who needs to take action, whether that's a developer fixing a bug, ops resolving an infra issue, or someone coordinating with an external dependency, then it's an error. The solution isn't to downgrade the severity but to route and notify the right owner.
Severity should encode actionability, not just system correctness.
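A sketch of that routing idea, with made-up component names and a print standing in for the real pager:

    # Severity stays ERROR; what changes is who gets notified.
    OWNERS = {
        "checkout-bug": "dev-oncall",        # developer fixing a bug
        "disk-pressure": "infra-oncall",     # infra issue
        "vendor-timeout": "vendor-liaison",  # external dependency
    }

    def notify(component: str, message: str) -> None:
        owner = OWNERS.get(component, "default-oncall")
        print(f"page {owner}: {message}")  # stand-in for a real notifier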
So error messages are to be expected as you progressively add and then clear up edge cases.
Our production system pages oncall for any errors. At night it will only wake somebody up for a large burst of errors. This discipline forces us to look at every ERROR and decide whether it is spurious and out of our control or something we can deal with. At some point our production system will reach a scale where errors are logged constantly and this strategy won't make sense any more. But for now it helps keep our system clean.
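Something like this hypothetical damping rule (all thresholds and hours are made up):

    import datetime
    from collections import deque

    WINDOW_SECONDS = 300   # look-back window for the night-time burst
    NIGHT_BURST = 20       # errors in the window before we page at night
    recent = deque(maxlen=1000)

    def should_page(now: datetime.datetime) -> bool:
        recent.append(now.timestamp())
        if 8 <= now.hour < 22:          # daytime: any error pages oncall
            return True
        cutoff = now.timestamp() - WINDOW_SECONDS
        return sum(1 for t in recent if t >= cutoff) >= NIGHT_BURST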
Did it do what it was supposed to do, but in a different way, or defer it for a later retry? Then WARN.
Did it fail to do what it needed to do? ERROR
Did it do what it needed to do in the normal way because it was totally recoverable? INFO
Did data get destroyed in the process? FATAL
It should be about what the result was, not who will fix it or how, because that might change over time.
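That rubric as a lookup table, with made-up outcome names (Python's stdlib spells FATAL as CRITICAL):

    import logging

    LEVEL_BY_OUTCOME = {
        "done_differently": logging.WARNING,   # or deferred for retry
        "failed": logging.ERROR,               # didn't do what it needed to
        "done_normally": logging.INFO,         # fully recovered
        "data_destroyed": logging.CRITICAL,    # FATAL territory
    }

    def log_outcome(log: logging.Logger, outcome: str, msg: str) -> None:
        log.log(LEVEL_BY_OUTCOME[outcome], msg)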
> Did it fail to do what it needed to do? ERROR
> Did it do what it needed to do in the normal way because it was totally recoverable? INFO
We have a web-facing system (it uses a custom request-response protocol on top of Websocket... it's an old system) that users are routinely trying to, ahem, automate, even though it's technically against ToS, but hey, as long as we don't catch them? Anyway, it's quite common to see user connections that send malformed commands and then get disconnected after we send them a critical_error/protocol_error message; we do have quite extensive validation logic for user commands.
So, how should such errors be logged in your opinion? I know that we originally logged them as errors but very quickly changed to warnings, and precisely for the reasons outlined in TFA: if some kewl haxxor can't figure out how to quote strings in JSON, it's not really something we can fix. We probably should keep the records, just to know that "oh, some script kiddie was trying to hack us during that time period", but nothing more than that; it definitely doesn't warrant the "hey, there are too many errors in the sfo2 location, please take a look" summons at 3:00 AM from the ops team.
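A sketch of that choice, with hypothetical function and field names, assuming the commands are JSON as described:

    import json
    import logging

    log = logging.getLogger("ws.protocol")

    def parse_command(raw: str, client_id: str):
        # A malformed command is the client's bug, not ours: keep a
        # record at WARNING for forensics, then let the caller send
        # critical_error/protocol_error and disconnect.
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            log.warning("protocol_error from %s: %s", client_id, exc)
            return None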
    if error == NULL and operationFailed then
        log error
    otherwise
        let the client side do the error handling (in terms of logging)
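Read literally, that rule might look like this in Python (names are hypothetical):

    import logging

    log = logging.getLogger("server")

    def finish(operation_failed: bool, error_for_client) -> None:
        # Log server-side only when the failure can't be surfaced to
        # the caller; if the client got an error back, logging is its job.
        if error_for_client is None and operation_failed:
            log.error("operation failed with no error returned to client")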
My company now has a log aggregator that scans the logs for errors; when it finds one, it creates a Trello card, uses opus to fix the issue, and then proposes a PR against the card. These then get reviewed, finished up if tweaks are necessary, and merged if appropriate.
Even if your libraries use nothing but exceptions or return codes, you still end up with levels. You still end up with logs that contain information that gets ignored when it shouldn't be, because there's so much noise that people get tired of all the "cries of wolf."
Occasionally one is at a high enough level to know for sure that something needs fixing, and for this I use "CRITICAL", which is my code for "absolutely sure that you can't ignore this."
IMO it's about time AI was looking at the logs to find out whether there's something we really need to be alerted to.