Posted by BrendanLong 9/3/2025
The simple "big-o" number doesn't account for whether you need to pass terabytes over the bus for every operation - and on actual computers, moving terabytes around, shockingly, I know, affects performance.
And if you have a dual EPYC board with 1,024 threads, being able to parallelize a solution and design for cache behavior isn't meaningless.
It's a weak classifier - if you really think I'm going to do a lexical sort in like O(n^3) like some kind of clown, I don't know who you think you're hiring here.
Found out later he scored me "2/5".
Alright, cool.
obviously all of this stuff matters in the end but big-o comes before all of those other things
If you're attempting to quantify algorithmic scalability with big-o without those in mind, you'll often be wrong. There was a great post here a few years ago going into this, and how memory access "complexity" is what usually matters and what dominantly shapes the scalability curve. It had nice examples showing how the expected big-o scalability curves were often completely wrong outside of toy examples.
If you're not trying to quantify algorithmic scalability with big-o, then have fun coming up with a fun collection of symbols to put next to your code, and petting your spherical cow!
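To make that memory-access point concrete, here's a minimal toy of my own (not from the post in question), in Python/NumPy: two passes over the same array that are both O(n), but where the access pattern alone dominates the wall time.

    # Toy sketch: identical big-o, very different memory behavior. The
    # sequential pass streams through cache lines; the shuffled pass
    # misses cache on nearly every access.
    import time
    import numpy as np

    N = 20_000_000
    data = np.arange(N, dtype=np.int64)
    orders = {
        "sequential": np.arange(N),
        "random": np.random.permutation(N),
    }

    for name, idx in orders.items():
        t0 = time.perf_counter()
        total = int(data[idx].sum())  # gather N elements in the given order, then reduce
        print(f"{name:>10}: {time.perf_counter() - t0:.3f}s (sum={total})")

Same element count, same asymptotic cost; only the order of the loads changes.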
what you're describing is something different from big-o as it's commonly understood, and as your interviewer almost certainly intended it
I understand what you're describing and talking about but it's not big-o
I would guess that you haven't had any kind of formal cs education? no shade but like there are some important topics covered in those curriculums
So excuse me for thinking that's what they're looking for and answering accordingly.
I literally only wanted to work there to team build, then snipe engineers and spin off into my own thing. So whatever.
(chat, it was not. busy waiting is CPU utilisation too)
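A toy sketch of mine to illustrate: both of these wait for the same event, but the spin loop shows up as a pegged core while the blocking wait shows up as roughly 0%.

    import threading

    def spin_wait(flag: threading.Event) -> None:
        # Busy wait: logically "doing nothing", but counted as ~100% utilization on one core.
        while not flag.is_set():
            pass

    def block_wait(flag: threading.Event) -> None:
        # Blocking wait: the OS parks the thread, so it's counted as ~0% utilization.
        flag.wait()

    if __name__ == "__main__":
        flag = threading.Event()
        threading.Timer(5.0, flag.set).start()  # release the waiter after 5 seconds
        spin_wait(flag)  # watch a core peg in top/Task Manager while this runs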
It's gotta be at least 2 out of every 3 chip generations going back to the original implementation, where you're better off without it than with.
Of course, if the CPU governor is set to “performance” or “game mode”, then the OS should use as many pairs as possible instead (unless thermal throttling matters; computers are hard).
So now you have a choice: either disable SMT in the BIOS, or make sure the application correctly interprets the CPU topology and only spawns one thread per physical core. The former is often the easier option, from both a software development and a system administration perspective.
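A rough sketch of the latter option in Python, assuming the third-party psutil package for topology detection (the names here are just for illustration):

    # Size the worker pool to physical cores, not logical (SMT) threads.
    import os
    from concurrent.futures import ProcessPoolExecutor

    import psutil  # pip install psutil

    physical = psutil.cpu_count(logical=False) or os.cpu_count()  # fall back if detection fails
    pool = ProcessPoolExecutor(max_workers=physical)

On Linux you could go further and pin workers to one hardware thread per core with os.sched_setaffinity, but at that point the BIOS toggle starts looking attractive.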
That's madness. They're cheaper than their all-core equivalent. Why even buy one in the first place if HT slows down the CPU? You're still better off with them enabled.
Other benefits: per-CPU software licensing sometimes, and security on servers that share a CPU between multiple clients.
I’m familiar with one such system where the throughput benefit is ~15%, which is a big deal for a BIOS flag.
IBM’s POWER would have been discontinued a decade ago were it not for transactional database systems, and that architecture is heavily invested in SMT, up to 8-way(!)
In the system I’m most familiar with, however, the benefits of hyperthreading for throughput extend to the 50-70% utilization band where p99 latency is not stressed.
(Intel and AMD stopped at 2! Apparently more wasn't worth it for them. Presumably because the cpu was doing enough of the right thing already.)
So hyper-threading was a way to recoup some of those losses. I recall reading at the time that it was a "latency hiding technique". How effective it was I leave to others, but in time it seems to have become standard on all x86 processors. Core and Core 2 didn't seem to need it (much shorter pipelines), but later Intel and AMD processors got it.
This is how it was explained to me at the time anyways. I was working at an OEM from '02-'05, and I recall when this feature came out. I pulled out my copy of "Inside the Machine" by Jon Stokes which goes deep into the P4 architecture, but strangely I can only find a single mention of hyperthreading in the book. But it goes far into the P4 architecture and why branch misses are so punishing. It's a good read.
Edit: Adding that I suspect instruction pipelines are not so long that adding additional threads would help. I suspect diminishing returns past 2.
Well, Intel brought Hyperthreading to Xeon first[0], and those were quite slow, so the additional thread performance was quite welcome there.
But the GHz race led to the monstrosity of 3.06GHz CPUs, where the improvement in clock speed didn't quite translate into an improvement in performance. And while Northwood fared well GHz/performance-wise (especially considering the disaster of Willamette), Prescott didn't, and mostly showed the same performance in non-SSE/cache-bound tasks[1], so Intel needed to push the GHz further, which required a longer pipeline and brought an even bigger penalty on a prediction miss.
Well, at least this is how I remember it.
[0] https://en.wikipedia.org/wiki/List_of_Intel_Xeon_processors_...
[1] but excelled at room heating; people joked that they didn't even bother heating the apartment in winter, they just left a computer running
I mean, it obviously didn’t happen, but it is fun to wonder about.
Power systems tend not to be under the same budget constraints as Intel's, whether that's money, power, heat, whatever, so the cost/benefit of adding more sub-core processing for incremental gains is likely different too.
I may have a raft of issues with IBM, and AIX, but those POWER chips are top notch.
Think async or green threads, but for memory or branch misses rather than blocking I/O.
(As mentioned elsewhere, optimizing for vendor licensing practices is a nice side benefit, but obviously if the vendors want $X for Y compute on their database, they’ll charge that somehow.)
I guess in general, parallelism inside a core will either be extracted by the hardware automatically via instruction-level parallelism, or the programmer can tell it about independent tasks using hyperthreads. So the hyperthread implementations are optimistic about how much programmers care about performance, haha.
In addition to needing SMT to get full performance, there were a lot of other small details you needed to get right on Xeon Phi to get close to the advertised performance. Think of AVX512 and the HBM.
For practical applications, it never really delivered.
The primary trade-off is the cache utilization when executing two sets of instruction streams.
That doesn't make any sense. Disabling SMT likely saves negligible amount of power, but disables any performance to be gained from the other thread. If there's thermal budget available, it's better to spend it by shoving more work onto the second thread than to leave it disabled. If anything, due to voltage/frequency curves, it might even be better to run your CPU at lower clocks but with SMT enabled to make up for it (assuming it's amenable to your workloads), than it is to run with SMT disabled.
You could do one thread for every two cores, three threads for every two cores, one thread per core ± 1, or both (2n + 1).
Unfortunately the sweet spot based on our memory usage always came out to 1:1, except for a while when we had a memory leak that was surprisingly hard to fix, and we ran n - 1 for about 4 months while a bunch of work and exploratory testing were done. We had to tune in other places to maximize throughput.
https://www.tomshardware.com/pc-components/cpus/zen-4-smt-fo...
I think there are two kinds of loads where hyperthreads are more likely to hurt than help. If you've got a tight loop that uses all of a core's execution resources, you're not gaining anything by splitting it in two; it just makes things harder. Or if your load is mostly bound by memory bandwidth without a lot of compute, having more threads probably means you're that much more oversubscribed on I/O and caching.
But a lot of loads are grab some stuff from memory and then do some compute, rinse and repeat. There's a lot of potential for idle time while waiting on a load, being able to run something else during that time makes a lot of sense.
It's worth checking how your load performs with hyperthreads off, but I think default on is probably the right choice.
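If you want to check, a crude A/B along these lines is usually enough to see which side your workload lands on (my own sketch, with psutil assumed for the physical-core count and a stand-in workload):

    # Run the same mixed compute / random-memory-access workload with one worker
    # per physical core, then one per logical core, and compare wall time.
    import os
    import time
    from concurrent.futures import ProcessPoolExecutor

    import numpy as np
    import psutil  # pip install psutil

    def churn(seed: int) -> float:
        # Stand-in workload: some arithmetic plus scattered memory reads.
        rng = np.random.default_rng(seed)
        data = np.arange(5_000_000, dtype=np.float64)
        idx = rng.integers(0, data.size, size=data.size)
        return float(np.sqrt(data[idx]).sum())

    def run(workers: int, tasks: int = 32) -> float:
        t0 = time.perf_counter()
        with ProcessPoolExecutor(max_workers=workers) as pool:
            list(pool.map(churn, range(tasks)))
        return time.perf_counter() - t0

    if __name__ == "__main__":
        print("physical cores:", run(psutil.cpu_count(logical=False) or os.cpu_count()))
        print("logical cores: ", run(os.cpu_count()))

It's not a real benchmark harness, but it answers "does filling the sibling threads help or hurt this particular load" quickly.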
For many years (still?) it was faster to run your database with hyper threading turned off and your app server with it turned on.
Yet I'm still surprised by this benchmark. On both Zen2 and Zen4 in my tests (5900X from the article is Zen3), matrixprod still benefits from hyperthreading and scales a bit after all the physical cores are filled, unlike what the article results show.
All of this is tangential of course, as I'd tend to agree that CPU utilization% is just an imprecise metric and should only be used as a measure of "is something running".
I've run some ML experiments on my 5950X and I can tell that the CPU utilization figure is entirely decoupled from physical reality by observing the amount of flicker induced in my office lighting by the PWM noise in the machine. There are some code paths that show 10% utilization across all cores but make the cicadas outside my office window stop buzzing because the semiconductors get so loud. Other code paths show all cores 100% maxed flatline and it's like the machine isn't even on.
The other thing I think we have a hard time visualizing is that a processor is only ever either executing (100%) or waiting to execute (0%), and that happens over varying timescales... so trying to assign a % in between inherently means you're averaging over some arbitrary timescale...
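You can see the averaging directly: sample the same machine over different windows and you get different numbers (a small psutil sketch of mine):

    import psutil  # pip install psutil

    # cpu_percent(interval=x) measures busy time over an x-second window;
    # the shorter the window, the spikier the number.
    for window in (0.05, 0.5, 5.0):
        print(f"{window:>5}s window: {psutil.cpu_percent(interval=window):5.1f}%")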
It wouldn't really make sense to include all parts of the CPU in the calculation.
Ctrl-Alt-Del, then launch Task Manager.
In Task Manager, click the "Performance" tab and see the simple stats.
While on the Performance tab, click the ellipsis (...) menu so you can open Resource Monitor.
Then close Task Manager.
In Resource Monitor, under the Overview tab, for CPU click the column header for "Average CPU" so that the processes using the most CPU are shown top-down from most usage to least.
In Overview, for Disk click the Write (B/sec) column header, for Network click Send (B/sec), and for Memory click Commit (KB).
Then under the individual CPU, Memory, Disk, and Network tabs, click the similar column headers. Under any tab you should now be able to see the most prominent resource usage.
Notice how your CPU settles down after a while of idling.
Then click on the Disk tab to focus your attention on that one exclusively.
Let it sit for 5 or 10 minutes then check your CPU usage. See if it's been climbing gradually higher while you weren't looking.
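If clicking through Resource Monitor gets old, a rough scriptable equivalent of that "Average CPU" sort (my sketch, using the third-party psutil package) looks like this:

    import time

    import psutil  # pip install psutil

    procs = list(psutil.process_iter(["name"]))
    for p in procs:
        try:
            p.cpu_percent(None)  # prime the per-process counters
        except psutil.Error:
            pass

    time.sleep(5)  # measurement window, like letting Resource Monitor sit for a bit

    rows = []
    for p in procs:
        try:
            rows.append((p.cpu_percent(None), p.info["name"], p.pid))
        except psutil.Error:
            continue

    for cpu, name, pid in sorted(rows, reverse=True)[:15]:
        print(f"{cpu:6.1f}%  {name} (pid {pid})")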