Posted by luu 4 days ago
> Why you should never suspend a thread in your own process.
This sounds like a good general principle, but suspending threads in your own process is kind of necessary for, e.g., many GC algorithms. Now imagine multiple of those runtimes running in the same process.
I think this is typically done by having the compiler/runtime insert safepoints, which cooperatively yield at specified points to allow the GC to run without mutator threads being active. Done correctly, this shouldn't be subject to the problem the original post highlighted, because it doesn't rely on the OS's ability to suspend threads when they aren't expecting it.
And if you do need to call the GC, you could manually insert function calls every x loop iterations.
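To make the safepoint idea concrete, here is a minimal sketch in Python. It is not how any particular runtime implements it: the `Safepoints` class, `poll`, `retire`, `stop_the_world`, and the `poll_every` parameter are all names invented for this illustration. The point is only that the mutators park themselves at a place of their own choosing, so the "collector" never has to suspend a thread that is in the middle of something.

```python
import threading
import time


class Safepoints:
    """Cooperative stop-the-world: mutators park themselves in poll()."""

    def __init__(self, n_mutators):
        self._cond = threading.Condition()
        self._stop_requested = False
        self._parked = 0
        self._active = n_mutators

    def poll(self):
        # Fast path: unlocked read. A stale False only delays parking
        # until the next poll.
        if not self._stop_requested:
            return
        with self._cond:
            self._parked += 1
            self._cond.notify_all()          # tell the collector we parked
            while self._stop_requested:
                self._cond.wait()
            self._parked -= 1

    def retire(self):
        # A mutator that exits must stop being waited for.
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def stop_the_world(self, work):
        # The condition's lock is held while work() runs, so parked
        # mutators cannot leave poll() until the pause is over.
        with self._cond:
            self._stop_requested = True
            while self._parked < self._active:
                self._cond.wait()
            try:
                return work()
            finally:
                self._stop_requested = False
                self._cond.notify_all()


def demo(n_mutators=2, pauses=3, poll_every=100):
    sp = Safepoints(n_mutators)
    counts = [0] * n_mutators
    done = threading.Event()

    def mutator(i):
        while not done.is_set():
            counts[i] += 1
            if counts[i] % poll_every == 0:   # "every x loop iterations"
                sp.poll()
        sp.retire()

    def world_is_still():
        # Stand-in for GC work: check that no mutator makes progress.
        before = list(counts)
        time.sleep(0.01)
        return before == list(counts)

    threads = [threading.Thread(target=mutator, args=(i,))
               for i in range(n_mutators)]
    for t in threads:
        t.start()
    results = [sp.stop_the_world(world_is_still) for _ in range(pauses)]
    done.set()
    for t in threads:
        t.join()
    return results
```

Calling `demo()` runs two mutator threads and pauses them three times, returning one boolean per pause that says whether the counters stayed frozen. The cost of this design is latency: the pause cannot begin until the slowest mutator reaches its next poll.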
True. Maybe the more precise rule is “only suspend threads for a short amount of time and don’t acquire any locks while doing it”?
The way the .NET runtime follows this rule is it only suspends threads for a very short time. After suspending, the thread is immediately resumed if it is not running managed code (for example, it is in a random native library or a syscall). If the thread is running managed code, the thread is hijacked by replacing either the instruction pointer or the return address with the address of a function that will wait for the GC to finish. The thread is then immediately resumed. See the details here:
https://github.com/dotnet/runtime/blob/main/docs/design/core...
> Now imagine multiple of those runtimes running in the same process.
Can that possibly reliably work? Sounds messy.
(We charged ~$20K and estimated two weeks. We had it working in two hours.)
(My case wasn't solved. It was something about variable delays in getting packets off the network and into userspace but we never got to the bottom of it).
The tricky part is ensuring that the signal handler code is async-signal-safe (which pretty much boils down to "ensure you're not acquiring any locks and be careful about reentrant code"), but at least that only has to be verified for a self-contained small function.
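A small Python illustration of why "don't acquire locks in the handler" matters. Big caveat: Python does not run handlers in a true async-signal context; it defers them to the main thread between bytecode instructions, so this only demonstrates the lock hazard, not async-signal-safety in the C sense. It assumes a POSIX system (for `SIGUSR1`), Python 3.8+ (for `signal.raise_signal`), and that it is called from the main thread. The function name `demo_handler_vs_lock` is made up for this sketch.

```python
import signal
import threading
import time


def demo_handler_vs_lock(signum=signal.SIGUSR1):
    lock = threading.Lock()   # non-reentrant
    outcomes = []

    def handler(_signum, _frame):
        # The handler runs on top of whatever the main thread was doing.
        # If that code holds `lock`, a blocking acquire here would never
        # return, so we only try and record whether it worked.
        got = lock.acquire(blocking=False)
        outcomes.append(got)
        if got:
            lock.release()

    def raise_and_wait(expected_len):
        signal.raise_signal(signum)
        # Give the interpreter a chance to run the handler.
        deadline = time.monotonic() + 1.0
        while len(outcomes) < expected_len and time.monotonic() < deadline:
            time.sleep(0.001)

    previous = signal.signal(signum, handler)
    try:
        with lock:
            raise_and_wait(1)   # handler fires while the lock is held
        raise_and_wait(2)       # handler fires with the lock free
    finally:
        if previous is not None:
            signal.signal(signum, previous)
    return outcomes
```

The first delivery happens while the interrupted code holds the lock, so the handler cannot take it; the second happens with the lock free. In a real handler you usually would not touch the lock at all and would instead set a flag or write a byte to a pipe.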
Is there anything similar to signals on Windows?
[1] https://learn.microsoft.com/en-us/windows/win32/api/processt...
The older API is less like signals and more like cooperative scheduling in that it waits for the target thread to be in an "alertable" state before it runs (the thread executes a sleep or a wait for something).
I wasn’t implying that APCs were new, I was implying that the ability to enqueue special (as opposed to normal) APCs from user-mode is new. And of course, that has always been possible from kernel-mode with NT.
The special APC is nicer because the OS is then aware of what you’re doing: it will perform the user-mode stack changes while transitioning back to user-mode and handle cleanup once the APC queue is drained.
I hope he keeps going, no doubt he could choose to finish up whenever he wants to.
Why was the service holding things up? Because it was waiting on acquiring a lock held by one of its other threads.
What was that other thread doing? It was deadlocked because it tried to recursively acquire an exclusive srwlock (exactly what the docs say will happen if you try).
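For anyone who hasn't hit this before, the same failure mode is easy to reproduce with any non-reentrant lock. The sketch below uses Python's `threading.Lock` as an analogy for an exclusive SRW lock (it is not the Windows API), with `threading.RLock` as the reentrant contrast. The helper name `reacquire_succeeds` and the timeout are assumptions of this example; the timeout exists only so the demo terminates instead of hanging forever like the real thing.

```python
import threading


def reacquire_succeeds(lock, timeout=0.2):
    """Take `lock`, then try to take it again on the same thread."""
    lock.acquire()
    try:
        # With a non-reentrant lock this waits on ourselves. Without
        # the timeout it would be a permanent self-deadlock.
        got = lock.acquire(timeout=timeout)
        if got:
            lock.release()
        return got
    finally:
        lock.release()


if __name__ == "__main__":
    print("Lock :", reacquire_succeeds(threading.Lock()))
    print("RLock:", reacquire_succeeds(threading.RLock()))
```

A plain `Lock` fails the second acquire, while an `RLock` tracks its owner and lets the same thread back in.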
Why was it even trying to reacquire said lock? Ultimately because of a buffer overrun that ended up overwriting some important structures.
Just curious, is this customer a game studio? I have never done any serious system programming but the gist feels like one.
(well, when I did it it was in python on python threads, which I have to assume is a lot worse. Not sure about native threads.)
The article says the thread had been hung for 5 hours. And if you understand the root cause, once it entered into the hung state, then absent some rather dramatic intervention (e.g. manually resuming the suspended UI thread), it would remain hung indefinitely.
The proper solution, as Raymond Chen notes, is to move the monitoring thread into a separate process, which would avoid this deadlock.
Unfortunately, sometimes you don't have the luxury of being able to do this (e.g. on iOS, especially pre-MetricKit). We shipped one such implementation in the Twitter app (which was still there last I checked) and as far as I can tell it's safe, but mostly by accident: I didn't want to pause things for very long, so the code just suspends the thread, grabs register state, then writes the backtrace to a stack buffer before resuming. I originally wanted to grab traces without suspending the process, which is something you can actually "do" because getting register state doesn't require suspension and you need to put guards on your frame decoding anyway ("is this address I am about to dereference actually in the stack?"). But unfortunately, after thinking about it, I added the suspension back because trying to collect a trace from a running thread could give you a fragmented backtrace as the thread modifies its stack out from under you.
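Roughly, the guards on frame decoding look like the following. This is a simulation, not the shipped code: a dict stands in for thread memory, and the layout (saved frame pointer at `fp`, return address at `fp + WORD`, stack growing downward) is an assumption that matches common frame-pointer conventions. `walk_frames`, `WORD`, and `max_frames` are names made up for the sketch.

```python
WORD = 8  # assumed pointer size in bytes


def walk_frames(memory, fp, stack_lo, stack_hi, max_frames=64):
    """Follow a frame-pointer chain, refusing to read outside the stack.

    `memory` maps address -> word and stands in for real memory reads.
    """
    return_addresses = []
    while len(return_addresses) < max_frames:
        # Guard 1: fp must be aligned and both words must lie in the stack.
        if fp % WORD != 0 or not (stack_lo <= fp <= stack_hi - 2 * WORD):
            break
        saved_fp = memory.get(fp)
        ret = memory.get(fp + WORD)
        if saved_fp is None or ret is None:
            break
        return_addresses.append(ret)
        # Guard 2: caller frames live at higher addresses, so the chain
        # must strictly increase. This also rules out cycles.
        if saved_fp <= fp:
            break
        fp = saved_fp
    return return_addresses


if __name__ == "__main__":
    stack = {
        0x1000: 0x1020, 0x1008: 0xAAA,
        0x1020: 0x1040, 0x1028: 0xBBB,
        0x1040: 0x0,    0x1048: 0xCCC,
    }
    print([hex(a) for a in walk_frames(stack, 0x1000, 0x1000, 0x1050)])
```

The guards keep the walker from faulting on garbage, but they cannot tell you whether a frame that passed the checks is stale, which is why the suspension went back in.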