Posted by BrendanLong 9/3/2025

%CPU utilization is a lie (www.brendanlong.com)
437 points | 167 comments | page 2
kristopolous 9/3/2025|
Tried to explain this in a job interview 5 years ago. They thought I was a bullshitter
bionsystem 9/3/2025|
Happened to me on a different topic, felt bad for way too long; in hindsight I'm pretty sure I dodged a bullet.
kristopolous 9/3/2025||
This was the same interview where some guy was asking me about "big-o" - like the thing that you teach 19-year-olds - and I was saying that parallelization matters, i/o matters, quantization matters, whether you can run it on the GPU matters; these all matter.

The simple "big-o" number doesn't account for whether you need to pass terabytes over the bus for every operation - and on actual computers moving around terabytes, I know, shockingly, this affects performance.

And if you have a dual epyc board with 1,024 threads, being able to parallelize a solution and design things for cache optimization, this isn't meaningless.

It's a weak classifier - if you really think I'm going to be doing a lexical sort in like O(n^3) like some kind of clown, I don't know what you're hiring here.

Found out later he scored me "2/5".

Alright, cool.

kiitos 9/3/2025||
"big o" usually refers to algorithmic complexity, which is something entirely orthogonal to all of the dimensions you mentioned

obviously all of this stuff matters in the end but big-o comes before all of those other things

nomel 7 days ago||
> but big-o comes before all of those other things

If you're attempting to quantify algorithmic scalability with big-o without those in mind, you'll often be wrong. There was a great post here a few years ago going into this: how memory access "complexity" is what usually matters and is what dominantly shapes the scalability curve. It had nice examples showing how the expected big-o scalability curves were often completely wrong outside of toys.

If you're not trying to quantify algorithmic scalability with big-o, then have fun coming up with a fun collection of symbols to put next to your code, and petting your spherical cow!
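
A toy sketch of what I mean (numpy assumed; the numbers are machine-dependent and purely illustrative): both passes below are O(n), but the access pattern, not the big-o, decides the wall time.

    import time
    import numpy as np

    n = 20_000_000
    data = np.arange(n, dtype=np.int64)
    in_order = np.arange(n)              # walk memory sequentially
    shuffled = np.random.permutation(n)  # same n accesses, random order

    for name, idx in [("sequential", in_order), ("random", shuffled)]:
        start = time.perf_counter()
        total = data[idx].sum()          # one gather over n elements either way
        print(f"{name:>10}: {time.perf_counter() - start:.2f}s (sum={total})")

Same element count, same asymptotic complexity, very different curves once the random pass starts missing cache.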

kiitos 6 days ago||
algorithmic complexity is 100% absolutely orthogonal to the stuff you've mentioned

what you're describing is something different than big-o, in the sense that is commonly understood, and what your interviewer almost certainly intended

I understand what you're describing and talking about but it's not big-o

I would guess that you haven't had any kind of formal cs education? no shade but like there are some important topics covered in those curriculums

kristopolous 2 days ago|||
But none of that is the point. This was at a well funded company looking for a high scalability engineer.

So excuse me for thinking that's what they're looking for and answering accordingly.

I literally only wanted to work there to team build, then snipe engineers and spin off into my own thing. So whatever.

kiitos 23 hours ago||
looks like both you and they dodged bullets in this outcome, so all worked out in the end
nomel 5 days ago|||
I have. I understand big-o, I understand that it’s just algorithmic complexity. I understand big-o is not a performance scaling model, because algorithms run on real hardware. That's fine. Some people enjoy petting spherical cows, and some people work with the nuances of reality. That's also fine.
swiftcoder 9/3/2025||
I remember being stuck in a discussion with management one time, that went something like this:

Manager: CPU utilisation is 100% under load! We have to migrate to bigger instances.

Me: but is the CPU actually doing useful work?

(chat, it was not. busy waiting is CPU utilisation too)

kristianp 9/4/2025|
How do you measure the amount of busy waiting?
swiftcoder 9/4/2025||
I don't think there is a good general tool for this. In this specific case, I went spelunking for all the points where we had thread contention over resources, and discovered that for several resources quite a lot of CPU cycles were being expended to no use. The goal is really to eliminate the underlying resource contention - we added per-thread caches in various places, swapped out the logging system, and were able to ~double the system throughput during times when top showed the system to be "fully loaded".
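
A toy illustration of why %CPU alone can't distinguish the two (just a sketch, nothing like the real system): both of these processes peg a core in top, but only one of them produces anything.

    import multiprocessing as mp
    import time

    def real_work(counter):
        # does actual math and counts completed iterations
        deadline = time.time() + 5
        while time.time() < deadline:
            sum(i * i for i in range(1000))
            with counter.get_lock():
                counter.value += 1

    def busy_wait(flag):
        # spins on a flag: 100% "utilization", zero useful work
        while not flag.value:
            pass

    if __name__ == "__main__":
        counter = mp.Value("i", 0)
        flag = mp.Value("b", False)
        worker = mp.Process(target=real_work, args=(counter,))
        spinner = mp.Process(target=busy_wait, args=(flag,))
        worker.start(); spinner.start()
        worker.join()
        flag.value = True
        spinner.join()
        print("useful iterations:", counter.value)  # the spinner contributed nothing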
ChaoPrayaWave 9/3/2025||
These days I treat CPU usage as just a hint, not a conclusion. I also look at response times, queue lengths, and try to figure out what the app is actually doing when it looks idle.
hinkley 9/3/2025||
How many times has hyperthreading been an actual performance benefit in processors? I cannot count how many times an article has come out saying you'll get better performance out of your <insert processor here> by turning off hyperthreading in the BIOS.

It's gotta be at least 2 out of every 3 chip generations going back to the original implementation, where you're better off without it than with.

loeg 9/3/2025||
HT provides a significant benefit to many workloads. The use cases that benefit from actually disabling HT are likely working around pessimal OS scheduler or application thread use. (After all, even with it enabled, you're free to not use the sibling cores.) Otherwise, it is an overgeneralization to say that disabling it will benefit arbitrary workloads.
hedora 9/3/2025|||
There’s some argument that you should jam stuff on to as few hyperthread pairs as possible to improve energy efficiency and cache locality.

Of course, if the CPU governor is set to “performance” or “game mode”, then the OS should use as many pairs as possible instead (unless thermal throttling matters; computers are hard).

mkbosmans 9/3/2025||||
Especially in HPC there are lots of workloads that do not benefit from SMT. Such workloads are almost always bottlenecked on either memory bandwidth or vector execution ports. These are exactly the resources that are shared between the sibling threads.

So now you have a choice of either disabling SMT in the BIOS, or making sure the application correctly interprets the CPU topology and only spawns one thread per physical core. The former is often the easier option, from both a software development and a system administration perspective.
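
The second option boils down to something like this (a sketch assuming psutil is available; MPI launchers and OpenMP runtimes have their own binding knobs for the same thing):

    import os
    from concurrent.futures import ProcessPoolExecutor
    import psutil

    # physical core count, falling back to logical if detection fails
    physical = psutil.cpu_count(logical=False) or os.cpu_count()

    def kernel(chunk_id):
        # stand-in for a bandwidth- or vector-port-bound HPC kernel
        return sum(i * i for i in range(1_000_000))

    if __name__ == "__main__":
        # size the pool to physical cores so the SMT siblings stay idle
        with ProcessPoolExecutor(max_workers=physical) as pool:
            results = list(pool.map(kernel, range(physical)))
        print(f"ran {len(results)} workers on {physical} physical cores "
              f"({os.cpu_count()} logical CPUs)")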

PunchyHamster 9/3/2025|||
HT siblings can still run OS stuff in that case, as that isn't really in contention with the compute threads. Though I can see someone not wanting to bother with pinning.
skeezyboy 9/3/2025|||
>Especially in HPC there are lots of workloads that do not benefit from SMT...So now you have a choice of either disabling SMT in the bios

That's madness. They're cheaper than their all-core equivalent. Why even buy one in the first place if HT slows down the CPU? You're still better off with them enabled.

robocat 9/3/2025|||
> use cases that benefit from actually disabling HT

Other benefits: per-CPU software licencing sometimes, and security on servers that share CPU with multiple clients.

twoodfin 9/3/2025|||
For whatever it’s worth, operational database systems (many users/connections, unpredictable access patterns) are beneficiaries of modern hyperthreading.

I’m familiar with one such system where the throughput benefit is ~15%, which is a big deal for a BIOS flag.

IBM’s POWER would have been discontinued a decade ago were it not for transactional database systems, and that architecture is heavily invested in SMT, up to 8-way(!)

jiggawatts 9/3/2025|||
I've noticed an overreliance on throughput as measured during 100% load as the performance metric, which has resulted in hardware vendors "optimising to the test" at the expense of other, arguably more important metrics. For example: single-user latency when the server is just 50% loaded.
twoodfin 9/3/2025|||
That’s more than fair.

In the system I’m most familiar with, however, the benefits of hyperthreading for throughput extend to the 50-70% utilization band where p99 latency is not stressed.

hinkley 9/3/2025|||
Or p98 time for requests. Throughput and latency are usually at odds with each other.
tom_ 9/3/2025|||
Why do they need so many threads? This really feels like they just designed the cpu poorly, in that it can't extract enough parallelism out of the instruction stream already.

(Intel and AMD stopped at 2! Apparently more wasn't worth it for them. Presumably because the cpu was doing enough of the right thing already.)

ckozlowski 9/3/2025|||
As I recall it, Intel brought about Hyperthreading on Northwood and later Pentium 4s as a way to help with issues in its long pipeline. As I remember it being described at the time, the P4 had 30+ stages in its pipeline. Many of them did not need to be used for a given thread. Furthermore, if the branch prediction engine guessed wrong, the pipeline needed to be cleared and started anew. For a 30+ stage pipeline, that's a lot of wasted clock cycles.

So hyper-threading was a way to recoup some of those losses. I recall reading at the time that it was a "latency hiding technique". How effective it was I leave to others. But it became standard it seems on all x86 processors in time. Core and Core 2 didn't seem to need it (much shorter pipelines) but later Intel and AMD processors got it.

This is how it was explained to me at the time anyways. I was working at an OEM from '02-'05, and I recall when this feature came out. I pulled out my copy of "Inside the Machine" by Jon Stokes which goes deep into the P4 architecture, but strangely I can only find a single mention of hyperthreading in the book. But it goes far into the P4 architecture and why branch misses are so punishing. It's a good read.

Edit: Adding that I suspect instruction pipelines are not so long that adding additional threads would help. I suspect diminishing returns past 2.

justsomehnguy 9/3/2025|||
> As I recall it, Intel brought about Hyperthreading on Northwood and later Pentium 4s as a way to help with issues in its long pipeline.

Well, Intel brought Hyperthreading to Xeon first, and those were quite slow, so the additional thread performance was quite welcome there.

But the GHz race led to the monstrosity of 3.06GHz CPUs, where the improvement in clock speed didn't quite translate into an improvement in performance. And while Northwood fared well GHz/performance-wise (especially considering the disaster of Willamette), Prescott didn't, and mostly showed the same performance in non-SSE/cache-bound tasks[1], so Intel needed to push the GHz further, which required a longer pipeline and brought even more penalty on a prediction miss.

Well, at least this is how I remember it.

[0] https://en.wikipedia.org/wiki/List_of_Intel_Xeon_processors_...

[1] but excelled at room heating; people joked that they didn't even bother with apartment heating in winter, they just left a computer running

bee_rider 9/3/2025|||
Any time somebody mentions the Pentium 4, it feels like a peek at a time-line we didn’t end up going down. Imagine if Intel had stuck to their guns, maybe they could have pushed through and we’d have CPUs with ridiculous 90 stage pipelines, and like 4 threads per core. Maybe frameworks, languages, and programmer experience would have conspired to help write programs with threads that work together very closely, taking advantage of the shared cache of the hyperthreads.

I mean, it obviously didn’t happen, but it is fun to wonder about.

TristanBall 9/3/2025||||
I suspect part of it is licensing games, both in the sense of "avoiding per-core license limits", which absolutely matters when your DB is costing a million bucks, and also in the 'enable the highest PVU score per chassis' sense for IBM's own license farming.

Power systems tend not to be under the same budget constraints as Intel, whether that's money, power, heat, whatever, so the cost/benefit of adding more sub-core processing for incremental gains is likely different too.

I may have a raft of issues with IBM, and AIX, but those Power chips are top notch.

hinkley 9/3/2025||
Yeah that was another thing. If you run Oracle, you gotta turn that shit off in the BIOS, otherwise you're getting charged 2x for 20% more performance.
wmf 9/3/2025||
AFAIK Oracle does not charge extra for SMT.
twoodfin 9/3/2025||||
Low-latency databases are architected to be memory-bandwidth bound. SMT allows more connections to be generating more loads faster, utilizing more memory bandwidth.

Think async or green threads, but for memory or branch misses rather than blocking I/O.

(As mentioned elsewhere, optimizing for vendor licensing practices is a nice side benefit, but obviously if the vendors want $X for Y compute on their database, they’ll charge that somehow.)

wmf 9/3/2025|||
Power does have higher memory latency because of OMI and it supports more sockets. But I think the main motivation for SMT8 is low-IPC spaghetti code.
BrendanLong 9/3/2025|||
To be fair, in most of these tests hyperthreading did provide a significant benefit (in the general CPU stress test, the hyperthreads increased performance by ~66%). It's just confusing that utilization metrics treat hyperthread usage the same as full physical cores.
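
Back-of-the-envelope with that ~66% figure (treating it as a fixed 1.66x pair speedup, which of course varies by workload), and assuming the scheduler fills one thread per physical core before touching siblings: "50%" reported utilization is already roughly 60% of the machine's real throughput.

    SMT_PAIR_SPEEDUP = 1.66  # two siblings vs. one thread on the same core (assumed)

    def real_capacity_fraction(reported_util):
        """Map naive %CPU (siblings counted as full cores) to fraction of max throughput,
        assuming one thread per physical core gets filled before any siblings."""
        if reported_util <= 0.5:
            return 2 * reported_util / SMT_PAIR_SPEEDUP
        return (1 + (2 * reported_util - 1) * (SMT_PAIR_SPEEDUP - 1)) / SMT_PAIR_SPEEDUP

    for u in (0.25, 0.50, 0.75, 1.00):
        print(f"reported {u:.0%} -> ~{real_capacity_fraction(u):.0%} of real capacity")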
bee_rider 9/3/2025|||
Those weird Xeon Phi accelerators had 4 threads per core, and IIRC needed at least 2 running to get full performance. They were sort of niche, though.

I guess in general parallelism inside a core will either be extracted by the computer automatically with instruction-level parallelism, or the programmer can tell it about independent tasks, using hyperthreads. So the hyperthread implementations are optimistic about how much programmers care about performance, haha.

mkbosmans 9/3/2025||
Sort of niche indeed.

In addition to needing SMT to get full performance, there were a lot of other small details you needed to get right on Xeon Phi to get close to the advertised performance. Think of AVX512 and the HBM.

For practical applications, it never really delivered.

tgma 9/3/2025|||
It has a lot to do with your workload, as much as if not more so than the chip architecture.

The primary trade-off is the cache utilization when executing two sets of instruction streams.

hinkley 9/3/2025||
That's likely the primary factor, but then there's thermal throttling as well. You can't run all of the logic units flat out on a bunch of models of CPU.
tgma 9/3/2025|||
May be true for FMA or AVX2 or similar stuff. Outside vector units that sounds implausible. Obviously multi-core thermal throttling is a thing, but that would by far dominate. Hyperthreading should have minimal impact there.
gruez 9/3/2025|||
>but then there's thermal throttling as well. You can't run all of the logic units flat out on a bunch of models of CPU.

That doesn't make any sense. Disabling SMT likely saves negligible amount of power, but disables any performance to be gained from the other thread. If there's thermal budget available, it's better to spend it by shoving more work onto the second thread than to leave it disabled. If anything, due to voltage/frequency curves, it might even be better to run your CPU at lower clocks but with SMT enabled to make up for it (assuming it's amenable to your workloads), than it is to run with SMT disabled.

duped 9/3/2025|||
For me today it's definitely a pessimization, because I have enough well-meaning applications that spawn `nproc` worker threads. Which would be fine if they were the only process running, but they're not.
hinkley 9/3/2025|||
I wrote a little tool for our services that could do basic expressions based on nproc, driven by an environment variable at startup time.

You could do one thread for every two cores, three threads for every two cores, one thread per core ± 1, or both (2n + 1).

Unfortunately the sweet spot based on our memory usage always came out to 1:1, except for a while when we had a memory leak that was surprisingly hard to fix, and we ran n - 1 for about 4 months while a bunch of work and exploratory testing were done. We had to tune in other places to maximize throughput.
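
For the curious, roughly this kind of thing (a sketch, not the actual tool; WORKER_EXPR is a made-up variable name): the env var holds a tiny expression in terms of n = nproc, e.g. "n", "n-1", "n/2+1", or "2*n+1", evaluated with a small whitelist parser instead of eval().

    import ast
    import math
    import operator
    import os

    _OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}

    def _eval(node, n):
        # tiny whitelist-based evaluator so we don't eval() arbitrary env vars
        if isinstance(node, ast.Expression):
            return _eval(node.body, n)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left, n), _eval(node.right, n))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name) and node.id == "n":
            return n
        raise ValueError("unsupported expression")

    def worker_count(expr, n=None):
        n = n or os.cpu_count() or 1
        if not expr:
            return n                      # default: one worker per logical CPU
        value = _eval(ast.parse(expr, mode="eval"), n)
        return max(1, math.floor(value))  # never drop below one worker

    print(worker_count(os.environ.get("WORKER_EXPR"), n=16))  # WORKER_EXPR="n/2+1" -> 9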

toast0 9/3/2025|||
Wouldn't that be about the same badness without hyperthreads? If you're oversubscribed, there might be some benefit to having fewer tasks, but maybe you get some good throughput with two different applications' threads running on opposite hyperthreads.
hinkley 9/3/2025||
Oversubscribing also leads to process migration, which these days leads to memory read delays.
esseph 9/3/2025|||
Intel vs AMD, you'll get a different answer on the hyperthreading question.

https://www.tomshardware.com/pc-components/cpus/zen-4-smt-fo...

toast0 9/3/2025|||
Going from 1 core to 2 hyperthreads was a big bonus in interactivity. But I think it was easy to get early systems to show worse throughput.

I think there are two kinds of loads where hyperthreads are more likely to hurt than help. If you've got a tight loop that uses all the processor execution resources, you're not gaining anything by splitting that in two, it just makes things harder. Or if your load is mostly bound by memory bandwidth without a lot of compute... having more threads probably means you're that much more oversubscribed on i/o and caching.

But a lot of loads are grab some stuff from memory and then do some compute, rinse and repeat. There's a lot of potential for idle time while waiting on a load, being able to run something else during that time makes a lot of sense.

It's worth checking how your load performs with hyperthreads off, but I think default on is probably the right choice.

sroussey 9/3/2025||
Definitely measure both ways and decide.

For many years (still?) it was faster to run your database with hyperthreading turned off and your app server with it turned on.

FpUser 9/3/2025|||
In the old days it made the difference between my multimedia, game-like application not working at all with hyperthreading off and working just fine with it on.
hinkley 9/3/2025||
Yeah when it was one core versus 1.3 cores that's fair. But 3 core machines often did better (or at least more consistently run to run) with HT disabled.
tom_ 9/3/2025||
Total throughput has always seemed better with it switched on for me, even for stuff that isn't hyperthreading friendly. You get a free 10% at least.
Aissen 9/3/2025||
Funny that it talks about matrixprod, which I think is not that relevant as a benchmark — unless you care about x87 performance specifically. I recently sent a pull request to try to address that in a generic manner: https://github.com/ColinIanKing/stress-ng/pull/561

Yet I'm still surprised by this benchmark. On both Zen2 and Zen4 in my tests (5900X from the article is Zen3), matrixprod still benefits from hyperthreading and scales a bit after all the physical cores are filled, unlike what the article results show.

All of this is tangential of course, as I'd tend to agree that CPU utilization% is just an imprecise metric and should only be used as a measure of "is something running".

bob1029 9/3/2025||
I think looking at power consumption is potentially a more interesting canary when using very high core count parts.

I've run some ML experiments on my 5950X and I can tell that the CPU utilization figure is entirely decoupled from physical reality by observing the amount of flicker induced in my office lighting by the PWM noise in the machine. There are some code paths that show 10% utilization across all cores but make the cicadas outside my office window stop buzzing because the semiconductors get so loud. Other code paths show all cores 100% maxed flatline and it's like the machine isn't even on.
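
If you want numbers instead of cicadas, something like this can run next to top on Linux boxes that expose RAPL through powercap (a sketch; the sysfs layout differs by CPU vendor and kernel, energy_uj may need root to read, and counter wraparound is ignored):

    import glob
    import time

    def read_energy_uj(path):
        with open(path) as f:
            return int(f.read())

    domains = glob.glob("/sys/class/powercap/*/energy_uj")
    prev = {d: read_energy_uj(d) for d in domains}

    for _ in range(10):
        time.sleep(1)
        for d in domains:
            cur = read_energy_uj(d)
            watts = (cur - prev[d]) / 1e6  # microjoules over one second -> watts
            prev[d] = cur
            print(f"{d}: {watts:6.1f} W")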

N_Lens 9/3/2025||
This has been my experience running production workloads as well. Anytime CPU% goes over 50-60% it'll spike to 100% rather quickly, and the app/service becomes unusable. Learned to scale earlier than first thought.
morning-coffee 9/3/2025||
The lie is that hyperthread "cores" are equal to real "cores". Maybe this is what happens when a more than 20-year-old technology (hack) becomes ubiquitous and gets forgotten about? (We have to rediscover why our performance measurements don't seem to make sense?)

The other thing I think we have a hard time visualizing is that a processor is only ever either executing (100%) or waiting to execute (0%), and that happens over varying timescales... so trying to assign a % in between inherently means you're averaging over some arbitrary timescale...
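
That averaging is literally how the number gets made. A sketch of the arithmetic from /proc/stat on Linux: busy jiffies over total jiffies between two samples, so the same bursty load reports a different "utilization" depending on which window you pick.

    import time

    def cpu_times():
        # first line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]   # idle + iowait
        return sum(fields), idle

    def utilization(interval):
        total1, idle1 = cpu_times()
        time.sleep(interval)
        total2, idle2 = cpu_times()
        total_delta = max(1, total2 - total1)  # guard against a zero-tick window
        busy = total_delta - (idle2 - idle1)
        return busy / total_delta

    for window in (0.1, 1.0, 5.0):
        print(f"{window:>4}s window: {utilization(window):.1%}")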

fennecfoxy 9/3/2025||
I think it's more about cores, right? % util is just the % of non-idle time across all logical cores as far as I know.

It wouldn't really make sense to include all parts of the CPU in the calculation.

fuzzfactor 9/3/2025|
Windows users try this:

Ctrl-Alt-Del, then launch Task Manager.

In Task Manager, click the "Performance" tab and see the simple stats.

While on the Performance tab, click the ellipsis (...) menu, so you can then open Resource Monitor.

Then close Task Manager.

In Resource Monitor, under the Overview tab, for the CPU click the column header for "Average CPU" so that the processes using the most CPU are shown top-down from most usage to least.

In Overview, for Disk click the Write (B/sec) column header, for Network click Send (B/sec), and for Memory click Commit (KB).

Then under the individual CPU, Memory, Disk, and Network tabs click on the similar column headers. Under any tab now you should be able to see the most prominent resource usages.

Notice how your CPU settles down after a while of idling.

Then click on the Disk tab to focus your attention on that one exclusively.

Let it sit for 5 or 10 minutes then check your CPU usage. See if it's been climbing gradually higher while you weren't looking.
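
For the script-inclined, roughly the same "sort by Average CPU" view is available with psutil (a sketch assuming psutil is installed; works on Windows too):

    import time
    import psutil

    procs = list(psutil.process_iter(["name"]))
    for p in procs:
        try:
            p.cpu_percent(None)          # prime the per-process counters
        except psutil.Error:
            pass

    time.sleep(5)                        # measurement window, like letting it sit a bit

    usage = []
    for p in procs:
        try:
            usage.append((p.cpu_percent(None), p.pid, p.info["name"]))
        except psutil.Error:
            pass

    for cpu, pid, name in sorted(usage, reverse=True)[:10]:
        print(f"{cpu:6.1f}%  {pid:>7}  {name}")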
