Posted by gw2 1 day ago
> If you write a program in the design of Node.js — isolating a portion of the problem, pinning it to 1 thread on one CPU core, letting it access an isolated portion of RAM with no data sharing, then you have a design that is making as optimal use of CPU-time as possible
This is... not true as written? You get one second of CPU time per second, not four. Now, it may be quite hard to reach your full four seconds of CPU time per second, usually because of RAM bandwidth issues despite all the caching - and a hyperthreading fake "core" absolutely does not count the same as a separate die core - but the difference is real.
Author does have a point that slicing the work too small has significant overheads. But they've overstated it.
And this is before we get into the real source of parallel FLOPS, the GPU.
(edit: note that there may also be thermal issues and CPU frequency scaling going on; it is usually impossible to run all cores of a modern CPU at their max rated frequency for more than a very short time! But if you've bought a 64-core Ryzen and are only using one core, there's a huge gap there which you're not using)
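For reference, Node's built-in cluster module is the usual way to keep that share-nothing, process-per-core design while still using the whole chip - a minimal sketch (the port number is just an example):

    import cluster from "node:cluster";
    import { createServer } from "node:http";
    import { availableParallelism } from "node:os";

    if (cluster.isPrimary) {
      // One worker process per core: no shared memory, no locks;
      // the primary hands incoming connections out to the workers.
      // (availableParallelism() needs a recent Node; os.cpus().length otherwise.)
      for (let i = 0; i < availableParallelism(); i++) cluster.fork();
    } else {
      createServer((req, res) => {
        // ...per-request work goes here...
        res.end(`handled by pid ${process.pid}\n`);
      }).listen(3000); // example port
    }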
Exactly.
I was messing around with adding multi-threading to a 3D thing, and it slowed the smaller cases down until the work was big enough to overcome the overhead, at which point it sped things up. It was using OpenMP with only a couple of shared loop variables, so probably not as drastic as whatever Node does, but it slowed the common case down enough to not be worth the effort.
The author of TFA needs to go run any renderer in single- and multi-threaded mode, then report back to the class.
Indeed. The whole of modern graphics API architecture hinges on the idea that each of your million or so pixels is a meaningful unit of work that can be done in parallel.
That is certainly not universally true for every scenario, and if you need to sync state between CPU cores very often then your tasks simply don't lend themselves to parallelization. That doesn't mean that multi-threading is inherently the wrong design choice. Of course it will always be a trade-off between the performance gains and the code complexity of your job control.
Well that's... just wrong.
It is true that you can only squeeze 100% of the maximum possible useful compute out of a NUMA system with methods like the ones the article's author suggests. The less coordination there is between cores, the less cross-core or cross-socket communication is needed, all of which is overhead.
Caveat: If a bunch of independent processes are processing independent data, they'll increase cache thrashing at L2 and higher levels. Synchronised threads running the same code more-or-less in lockstep over the same areas of the data can benefit from sharing that independent processes can't. In some scenarios, this can be a huge speedup -- just ask a GPU programmer!
Where the process-per-core argument definitely stops being a good approach is when you start to consider latency.
Literally just this week, I had to help someone working on a Node.js app that needs to pre-cache a bunch of very expensive computations (map tiles over data that changes on an interval).
Because this is CPU-heavy and Node.js is single-threaded, it kills the user experience while it is running. Interactive responses get interleaved with batch actions, and users complain.
This is not a problem with ASP.NET where this kind of work can simply run in a background thread and populate the cache without interfering with user queries!
For similar reasons, Redis replacements that use multi-threading have far lower tail latencies: https://microsoft.github.io/garnet/
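In fairness, Node's worker_threads can cover the map-tile case above - a rough sketch, where the worker file name and message shape are made up:

    import { Worker } from "node:worker_threads";

    // Run the expensive tile rebuild off the event loop so interactive
    // requests keep getting served. "./tile-worker.js" is hypothetical.
    function rebuildTileCache(region: string): Promise<void> {
      return new Promise((resolve, reject) => {
        const w = new Worker("./tile-worker.js", { workerData: { region } });
        w.once("message", () => resolve()); // worker posts a message when done
        w.once("error", reject);
      });
    }

(The batch work still burns a core, but it no longer interleaves with request handling on the main thread.)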
I would say that if you simplify this much, it becomes just plain wrong.
The first mode is the child process: the main process forks an entirely separate instance of Node with its own event loop, which you communicate with over IPC or a network socket.
The second mode (introduced fairly recently) is the ability to spin off worker threads, which have their own event loop but share the worker thread pool of the main process. I think there is a way to share memory between these threads via some special type of buffer, but I have never used them (there's a sketch of it below).
The first mode maps directly to the idea of micro-services, just running on the same machine. This is why, AFAIK, it is not really used in modern cloud-based apps; single-core micro-service instances are used instead. That approach has a higher latency cost but allows cheaper instances and much simpler services - it very much depends on the use case whether that is the correct choice or not.
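For what it's worth, the special buffer mentioned above is SharedArrayBuffer; a minimal sketch, with the file names made up:

    // main.js (hypothetical)
    import { Worker } from "node:worker_threads";

    const shared = new SharedArrayBuffer(4);   // room for one Int32
    const counter = new Int32Array(shared);

    // workerData is structured-cloned, but a SharedArrayBuffer is shared,
    // not copied, so both threads see the same memory.
    const w = new Worker("./counter-worker.js", { workerData: shared });
    w.on("exit", () => {
      console.log("counter is now", Atomics.load(counter, 0)); // tear-free read
    });

    // counter-worker.js (hypothetical, the worker side):
    //   import { workerData } from "node:worker_threads";
    //   Atomics.add(new Int32Array(workerData), 0, 1); // atomic increment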
We have a load balancer in front of that to scale app service instances.
“Why is our app so slow?”
“It’s a mystery. Just scale it out some more!”
That's the fundamental weakness of all these 'best-practice' blog posts: authors are much more willing to pluck a tool from their toolbox and tell us to use that one than they are to give advice on how to pick the right tool for the right job.
Oh, and [2023].
"it brings complexity very few developers understand"
Someone tell this guy about GPUs ...
(but when it works, it's so gratifying! the GPU go very very fast)
Sometimes you want to do something complex in a short time though. If latency matters, sometimes you don't have a choice. Running 10 instances of the same game at 10fps each is not the answer.
Instead of "just throw it on a thread and forget about it" - in a production environment, use the job queue. You gain isolation and observability - you can see the job parameters and know nothing else came across, except data from the DB etc.
> In the 1970s, programming was an elite's task. Today programming is done by uneducated "farmers" and as a result, the care for smart algorithms, memory usage, CPU-time usage and the like has dwindled in comparison.
With threads it was a messaging service that was supposed to offer a persistent queue that could survive restarts. It possibly doubled or trebled the length of the project through multithreading bugs. I wasn't the creator of the code - it was someone without a degree. In the end I had to solve one bug on it that took 3 months to work out and that was just a double-free in some odd circumstance. Nobody wanted to touch it and muggins (i.e. me) was the last person without an excuse!
Asynchronous Python and Python threading are the recent ones I've experienced - JavaScript programmers who decided that Python was trivial to learn and tried to speed everything up with threads (which makes everything worse before 3.13 with a special compilation option that we could not use), and then they made life even worse, to no purpose at all, by using async without knowing that the ASGI system underneath didn't support it properly. Uvicorn does, but Uvicorn wasn't usable in that context.
Apart from creating wonderful opportunities for bugs, they didn't even know how to write async unit tests, so the tests always passed no matter what you did to them.
When trying to help with these issues, I found the attitude to be extremely resistant. There was no way they were going to listen to me - the annoying whippersnapper in one case, or the old-fart programmer in the other. They just knew better.