Log level 'error' should mean that something needs to be fixed

Posted by todsacerdoti 12/17/2025

Log level 'error' should mean that something needs to be fixed(utcc.utoronto.ca)

482 points | 299 commentspage 3

Waterluvian 12/20/2025|

I think this is one of those discussions where there's no one right answer (though there's many wrong answers). All you have to do is pick a reasonable definition, write it down, socialize it, and be consistent when using it.

I think discussions that argue over a specific approach are a form of playing checkers.

HarHarVeryFunny 12/20/2025||

I agree with the sentiment, although not sure if "error" is the right category/verbiage for actionable logs.

In an ideal world things like logs and alarms (alerting product support staff) should certainly cleanly separate things that are just informative, useful for the developer, and things that require some human intervention.

If you don't do this then it's like "the boy that cried wolf", and people will learn to ignore errors and alarms since you've trained them to understand that usually no action is needed. It's also useful to be able to grep though log files and distinguish failures of different categories, not just grep for specific failures.

raldi 12/20/2025||

Yes. Examples of non-defects that should not be in the ERROR loglevel:

* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)

* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)

* Network error

* Downstream service overloaded

* Invalid request

Basically, when you make a request to another service and get back a status code, your handler should look like:

    logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning

(Unless you have an SLO with the service about how often you’re allowed to hit it and they only send 429 when you’re over, which is how it’s supposed to work but sadly rare.)

Hizonner 12/20/2025||

> Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)

So people writing software are supposed to guess how your organization assigns responsibilities internally? And you're sure that the database timeout always happens because there's something wrong with the database, and never because something is wrong on your end?

raldi 12/20/2025||

No; I’m not understanding your point about guessing. Could you restate?

As for queries that time out, that should definitely be a metric, but not pollute the error loglevel, especially if it’s something that happens at some noisy rate all the time.

electroly 12/20/2025|||

I think OP is making two separate but related points, a general point and a specific point. Both involve guessing something that the error-handling code, on the spot, might not know.

1. When I personally see database timeouts at work it's rarely the database's fault, 99 times out of 100 it's the caller's fault for their crappy query; they should have looked at the query plan before deploying it. How is the error-handling code supposed to know? I log timeouts (that still fail after retry) as errors so someone looks at it and we get a stack trace leading me to the bad query. The database itself tracks timeout metrics but the log is much more immediately useful: it takes me straight to the scene of the crime. I think this is OP's primary point: in some cases, investigation is required to determine whether it's your service's fault or not, and the error-handling code doesn't have the information to know that.

2. As with exceptions vs. return values in code, the low-level code often doesn't know how the higher-level caller will classify a particular error. A low-level error may or may not be a high-level error; the low-level code can't know that, but the low-level code is the one doing the logging. The low-level logging might even be a third party library. This is particularly tricky when code reuse enters the picture: the same error might be "page the on-call immediately" level for one consumer, but "ignore, this is expected" for another consumer.

I think the more general point (that you should avoid logging errors for things that aren't your service's fault) stands. It's just tricky in some cases.

lanstin 12/21/2025||

Also everywhere I have worked there are transient network glitches from time to time. Timeout can often be caused by these.

makeitdouble 12/20/2025||||

> the database is owned by a separate oncall rotation

Not OP, but this part hits the same for me.

In the case your client app is killing the DB through too many calls (e.g. your cache is not working) you should be able to detect it and react, without waiting for the DB team to come to you after they investigated the whole thing.

But you can't know in advance if the DB connection errors are your fault or not, so logging it to cover the worse case scenario (you're the cause) is sensible.

raldi 12/20/2025||

I agree that you should detect this, just through a metric rather than putting DB timeouts in the ERROR loglevel.

makeitdouble 12/21/2025||

But what's the base of your metric ?

I feel you're thinking about system wide downtime with everything timing out consistently, which would be detected by the generic database server vitals and basic logs.

But what if the timeouts are sparse and only 10 or 20% more than usual from the DB POV, but it affects half of your registration services' requests ? You need it logged application side so the aggregation layer has any chance of catching it.

On writing to ERROR or not, the hresholds should be whatever your dev and oncall teams decides. Nobody outside of them will care, I feel it's like arguing which drawer the socks should go.

I was in an org where any single error below CRITICAL was ignored by the oncall team , and everything below that only triggered alerts on aggregation or special conditions. Pragmatically, we ended up slicing it as ERROR=goes to the APM, anything below=no aggregation, just available when a human wants to look at it for whatever reason. I'd expect most orgs to come with that kind of split, where the levels are hooked to processes, and not some base meaning.

Hizonner 12/20/2025|||

> No; I’m not understanding your point about guessing. Could you restate?

In the general case, the person writing the software has no way of knowing that "the database is owned by a separate oncall rotation". That's about your organization chart.

Admittedly, they'd be justified in assuming that somebody is paying attention to the database. On the other hand, they really can't be sure that the database is going to report anything useful to anybody at all, or whether it's going to report the salient details. The database may not even know that the request was ever made. Maybe the requests are timing out because they never got there. And definitely maybe the requests are timing out because you're sending too many of them.

I mean, no, it doesn't make sense to log a million identical messages, but that's rate limiting. It's still an error if you can't access your database, and for all you know it's an error that your admin will have to fix.

As for metrics, I tend to see those as downstream of logs. You compute the metric by counting the log messages. And a metric can't say "this particular query failed". The ideal "database timeout" message should give the exact operation that timed out.

zbentley 12/20/2025|||

I wish I lived in a world where that worked. Instead, I live in a world where most downstream service issues (including database failures, network routing misconfigurations, giant cloud provider downtime, and more ordinary internal service downtime) are observed in the error logs of consuming services long before they’re detected by the owners of the downstream service … if they ever are.

My rough guess is that 75% of incidents on internal services were only reported by service consumers (humans posting in channels) across everywhere I’ve worked. Of the remaining 25% that were detected by monitoring, the vast majority were detected long after consumers started seeing errors.

All the RCAs and “add more monitoring” sprints in the world can’t add accountability equivalent to “customers start calling you/having tantrums on Twitter within 30sec of a GSO”, in other words.

The corollary is “internal databases/backend services can be more technically important to the proper functioning of your business, but frontends/edge APIs/consumers of those backend services are more observably important by other people. As a result, edge services’ users often provide more valuable telemetry than backend monitoring.”

raldi 12/20/2025||

But everything you’re describing can be done with metrics and alerts; there’s no need to spam the ERROR loglevel.

zbentley 12/20/2025||

My point is that just because those problems can be solved with better telemetry doesn’t mean that is actually done in practice. Most organizations do are much more aware of/sensitive to failures upstream/at the edge than they are in backend services. Once you account for alert fatigue, crappy accountability distribution, and organizational pressures, even the places that do this well often backslide over time.

In brief: drivers don’t obey the speed limit and backend service operators don’t prioritize monitoring. Both groups are supposed to do those things, but they don’t and we should assume they won’t change. As a result, it’s a good idea to wear seatbelts and treat downstream failures as urgent errors in the logs of consuming services.

jonathrg 12/20/2025||

4xx is for invalid requests. You wouldn't log a 404 as an error

raldi 12/20/2025||

I’m talking about codes you receive from services you call out to.

mewpmewp2 12/20/2025|||

What if user sends some sort of auth token or other type of data that you yourself can't validate and third party gives you 4xx for it? You won't know ahead of time whether that token or data is valid, only after making a request to the third party.

jonathrg 12/20/2025|||

Oh that makes sense.

raldi 12/20/2025||

There are still some special cases, because 404 is used for both “There’s no endpoint with that name” and “There’s no record with the ID you tried to look up.”

makeitdouble 12/20/2025||

> This assumes an error/warning/info/debug set of logging levels instead of something more fine grained, but that's how many things are these days.

Does it ?

Don't most stacks have an additional level of triaging logs to detect anomalies etc ? It can be your New relic/DataDog/Sentry or a self made filtering system, but nowadays I'd assume the base log levels are only a rough estimate of whether an single event has any chance of being problematic.

I'd bet the author also has strong opinions about http error codes, and while I empathize, those ships have long sailed.

jmull 12/20/2025||

I encourage people to think a few moments about what to log and at what level.

You’re kind of telling a story to future potential trouble-shooters.

When you don’t think about it at all (it doesn’t take much), you tend to log too much and too little and at the wrong level.

But this article isn’t right either. Lower-level components typically don’t have the context to know whether a particular fault requires action or not. And since systems are complex, with many levels of abstractions and boxes things live in, actually not much is in a position to know this, even to a standard of “probably”.

twosdai 12/21/2025||

I understand where the author is coming from but frankly I think this design pattern is not really correct.

Obviously this depends on teams, application context and code bases. But "knowing if action needs to be taken" can't be boiled into a simple log level for most cases.

There is a reason most alerting software like pagerduty is just a trigger interface and the logic for what constitutes the "error" is typically some data level query in something like datadog, sumologic, elastic search, or graphana, that either looks for specific string messages, error types, or a collection of metric conditions.

Cool if you want to consider that any error level log needs to be an actionable error but what quickly happens is that some error cases are auto retry able due to infrastructure conditions that the application has completely no knowledge of. And to run some sort of infrastructure query at error write time in code, eg

1. Error is thrown 2. Prior to logging guess/determine if the case can be retired through a few http calls. 3. Log either a warning or an error

Seems to be a complete waste when we could just write some sort of query in our log/metrics management platform of choice which takes into account the infrastructure conditions for us.

mschuster91 12/20/2025||

> If error level messages are not such a sign, I can assure you that most system administrators will soon come to ignore all messages from your program rather than try to sort out the mess, and any actual errors will be lost in the noise and never be noticed in advance of actual problems becoming obvious.

Bold of you to assume that there are system administrators. All too often these days it's "devops" aka some devs you taught how to write k8s yamls.

alexwasserman 12/20/2025||

I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on my monitoring.

eg. log level WARN, message "This error is...", but it then trips an error in monitoring and pages out.

Probably breaching multiple rules here around not parsing logs like that, etc. But it's cropped up so many times I get quite annoyed by it.

dragonwriter 12/20/2025||

> I have been particularly irritated in the past where people use a lower log level and include the higher log level string in the message, especially where it's then parsed, filtered, and alerted on my monitoring.

If your parsing, filtering, and monitoring setup parses strings that happen to correspond to log level names in positions other than that of log levels as having the semantics of log levels, then that's a parsing/filtering error, not a logging error.

jonathrg 12/20/2025||

Stuff like that is a good argument for using structured logging, but even if you are just parsing text logs, surely you can make the parser be a bit more specific when retrieving the log level.

rsanek 12/20/2025||

If something needs to be fixed, why is it just a log? How is someone supposed to even notice a random error log? At the places that I've worked, trying to make alerting be triggered on only logs was always quite brittle, it's just not best practice. Throw an exception / exit the program if it's something that actually needs fixing!

Copenjin 12/20/2025|

> If something needs to be fixed, why is it just a log?

What he meant is that is an unexpected condition, that should have never happened, but that did, so it needs to be fixed.

> How is someone supposed to even notice a random error log?

Logs should be monitored.

> At the places that I've worked, trying to make alerting be triggered on only logs was always quite brittle, it's just not best practice.

Because the logs sucked. It not common practice, it should be best practice.

> Throw an exception / exit the program if it's something that actually needs fixing!

I understand the sentiment, but some programs cannot/should not exit. Or you have an error in a subsystem that should not bring down everything.

I completely agree with the approach of the author, but also understand that good logging discipline is rare. I worked in many places where logs sucked, they just dumped stuff, and had to restructure them.

lanstin 12/21/2025||

While it is fun to have your code run for 500 days without restart, it is a bad architecture. You should be able to move load around from host to host or network to network without losing any work. This involves graceful draining and then shutting down the old.

For impossible errors exiting and sending the dev team as much info as possible (thread dump, memory dump, etc) is helpful.

In my experience logs are good for finding out what is wrong once you know something is wrong. Also if the server is written to have enough but not too much logging you can read them over and get a feel for normal operation.

peanut-walrus 12/20/2025|

Disagree. If you have an error that NEEDS fixing, your program should exit. Error level logs for operation level errors are fine.

More comments...