Posted by secure 6 days ago
I've chased elusive but very annoying stability problems (some, of course, due to overclocking during my younger years, when it still had a tangible payoff) often enough on systems I had built that taking this one BIG potential cause out of the equation is worth the few dozen extra bucks I have to spend on ECC-capable gear many times over.
Trying to validate an ECC-less platform's stability is surprisingly hard, because memtest and friends just aren't very reliable at detecting the more subtle problems. Prime95, y-cruncher and Linpack (in increasing order of effectiveness) are better than specialized memory testing software in my experience, but they aren't perfect either.
Most AMD CPUs (but not their APUs with potent iGPUs - there, you will have to buy the "PRO" variants) these days have full support for ECC UDIMMs. If your mainboard vendor also plays ball - annoyingly, only a minority of them enables ECC support in their firmware, so always check for that before buying! - there's not much that can prevent you from having that stability enhancement and reassuring peace of mind.
Quoth DJB (around the very start of this millennium): https://cr.yp.to/hardware/ecc.html :)
This is the annoying part.
That AMD permits ECC is a truly fantastic situation, but whether it's supported by a given motherboard is often a gamble, and worse: it's not advertised even when it is available.
I have an ASUS PRIME TRX40 PRO, and the tech specs say that it can run ECC and non-ECC DIMMs, but not whether ECC will be available to the operating system - merely that the DIMMs will work.
It's much more hit and miss in reality than it should be, though this motherboard was a pricey one: one can't use price as a proxy for features.
EDAC MC0: Giving out device to module amd64_edac
is a pretty reliable indication that ECC is working. See my blog post about it (it was top of HN): https://sunshowers.io/posts/am5-ryzen-7000-ecc-ram/
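If you want to check a running Linux system, something along these lines should be enough (a minimal sketch; it assumes the kernel shipped with the EDAC driver for your memory controller):

# any EDAC driver claiming a memory controller shows up in the kernel log
dmesg | grep -i edac

# the same information is exposed in sysfs; a populated mc0 directory
# means a memory-controller driver is active
ls /sys/devices/system/edac/mc/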
EDAC MC0: Giving out device to module igen6_edac controller Intel_client_SoC MC#0: DEV 0000:00:00.0 (INTERRUPT)
EDAC MC1: Giving out device to module igen6_edac controller Intel_client_SoC MC#1: DEV 0000:00:00.0 (INTERRUPT)
but `dmidecode --type 16` says:
Error Correction Type: None
Error Information Handle: Not Provided
What does
find /sys/devices/system/edac/mc/mc0/csrow* -maxdepth 1 -type f -exec grep --color . {} +
report?
/sys/devices/system/edac/mc/mc0/csrow0/ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label:MC#0_Chan#0_DIMM#0
/sys/devices/system/edac/mc/mc0/csrow0/size_mb:8192
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ue_count:0
/sys/devices/system/edac/mc/mc0/csrow0/mem_type:Unbuffered-DDR3
/sys/devices/system/edac/mc/mc0/csrow0/edac_mode:SECDED
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_dimm_label:MC#0_Chan#1_DIMM#0
/sys/devices/system/edac/mc/mc0/csrow0/dev_type:x16
> find /sys/devices/system/edac/mc/mc0/csrow* -maxdepth 1 -type f -exec grep --color . {} +
It looks like DDR5 supports SECDED by default. :-/
I would expect your particular motherboard to operate with proper SECDED-or-better ECC if you have capable, compatible DIMMs, enable ECC mode in the firmware, and boot an OS kernel that can make sense of it all.
I am writing this message on such an ASUS MB with a Ryzen CPU and working ECC memory. You must check that you actually have a recent enough OS to recognize your Threadripper CPU and that you have installed any software packages required for this (e.g. on Linux "edac-utils" or something with a similar name).
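As a concrete check, something like the following works for me (a sketch; package names vary by distro - on Debian-likes the tools come from the "edac-utils" and "rasdaemon" packages, as far as I remember):

# summary of memory controllers, DIMM labels and error counts
edac-util -v

# or, with rasdaemon running:
sudo ras-mc-ctl --status
sudo ras-mc-ctl --errors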
Some businesses (and governments) try to unify their purchasing, but this seems to make things worse, with the purchasing department both not understanding technology and being outwitted by vendors.
Enterprise also ruins it for small/medium businesses, at least those with dedicated internal IT departments who care about both the technology and the cost. We are left with either unreliable consumer-grade hardware or prohibitively expensive enterprise hardware.
There's very little in between. This market is also underserved on the software/SaaS side, with the SSO tax and whatnot. There's a huge gap between "I'm taking the owner's CC down to Best Buy" and "Enterprise" that gets screwed over.
I've been building my own gaming and productivity rigs for 20 years and I don't think memory has ever been a problem. Maybe survivorship bias, but surely even budget parts aren't THIS bad.
Assuming you can tell, and assuming you don't end up silently corrupting your data before then.
Without knowing how to fix that error, you've lost 200 revisions of work. You can go back and find which revision introduced the problem, take the one before it, and upgrade that to the latest Blender, but all 200 revisions since were made on other versions and can't be backported.
What a silly hypothetical. There are myriad freak occurrences that could make you redo work, and you don't worry about those. Now, I'm not saying single-bit errors don't happen. They just typically don't result in the sort of cascading failure you're describing.
My point is that there are scenarios where corruption in the past puts you in a bind and can cause a lot of lost work, or an expensive diagnostic and recovery process, long after it first occurred. Blender was just one example; it can be much worse with proprietary binary formats, where you have no chance of jumping into a debugger to figure out what's going wrong with an upgrade or export. And maybe the subscription version won't even let you go back to the old release.
> There's a myriad freak occurrences that could make you have to redo work that you don't worry about.
Yes, other sources of corruption, such as software errors, are more likely. It's not that you wouldn't worry about them if you had an unlimited budget and could have people audit the code, etc., but you do have a budget, and ECC is cheap relative to it. That doesn't mean it always makes sense for everyone to pay more for ECC. But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.
Not really what I would call an "asset", but fine.
>It's not that you wouldn't worry about them if you had unlimited budget and could have people audit the code etc.
Hell, I was thinking something way simpler, like your cat climbing on the case and throwing up through the top vents, or you tripping and dropping your ass on your desk and sending everything flying.
>But I can see why people working on gigantic CAD files for nuclear reactor design, etc. tend to have workstations with ECC.
Yeah, because those people aren't buying their own machines. If the credit card is yours and you're not doing something super critical, you're probably better served by a faster processor than by worrying against freak accidents.
And let's say you have archived copies of it with checksums like I suggested, going back through all past revisions.
What's the issue again now, that ECC would have solved? Not to mention that ECC wouldn't help at all with corruption at the disk level anyway.
If the bit flip happened in RAM, the checksum would be computed over the already-corrupted data. ECC corrects single-bit errors while the data sits in RAM.
>Not to mention that ECC wouldn't help at all with corruption at the disk level anyway.
Yes, using ECC without ZFS, btrfs, ReFS, or checksummed file formats is pretty pointless (unless your application never touches storage).
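For completeness, the storage-side half of that looks something like this (a sketch - "tank" and "/data" are placeholder names for a ZFS pool and a btrfs mount):

# ZFS: re-read every block and verify its checksum; problems show up in status
sudo zpool scrub tank
sudo zpool status tank

# btrfs equivalent
sudo btrfs scrub start /data
sudo btrfs scrub status /data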
Also: DDR5 gets some misleading ECC marketing because the memory standard has an on-die error correction scheme built in. Don't fall for it.
A computer with 64 GB of memory is 4 times more likely to encounter memory errors than one with 16 GB of memory.
When DIMMs are new, at the usual amounts of memory for desktops, you will see at most a few errors per year, sometimes only an error after a few years. With old DIMMs, some of them will start to have frequent errors (such modules presumably had a borderline bad fabrication quality and now have become worn out, e.g. due to increased leakage leading to storing a lower amount of charge on the memory cell capacitors).
For such bad DIMMs, the error frequency will keep increasing, and it may reach several errors per day, or even per hour.
For me, a very important advantage of ECC has been the ability to detect such bad memory modules (in computers that have been used for 5 years or more) and replace them before corrupting any precious data.
I also had a case with an HP laptop with ECC where memory errors had become frequent after it was stored for a long time (more than a year) in a rather humid place, which might have caused some oxidation of the SODIMM socket contacts - removing the SODIMMs, scrubbing the sockets and reinserting the modules made the errors disappear.
No. Or well, not exactly. More bits will flip randomly, but if the only difference between the two systems is the amount of installed memory, both will see the same number of memory errors: bit flips in the additional 48 GB do not turn into errors, because that memory is never used. Memory errors scale with memory used, not with memory installed.
94 2025-08-26 01:49:40 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68aea758, cpuid=0x00a50f00, bank=0x00000012
95 2025-09-01 09:41:50 +0200 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=18), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=0, mcgcap=0x0000011c, status=0x9c2040000000011b, addr=0x36e701dc0, misc=0xd01a000101000000, walltime=0x68b80667, cpuid=0x00a50f00, bank=0x00000012
(this is `sudo ras-mc-ctl --errors` output) It's always the same address, and always a Corrected Error (obviously, otherwise my kernel would panic). However, operating my system's memory at this clock and latency boosts x265 encoding performance (just one of the benchmarks I picked when trying to figure out how to handle this particular tradeoff) by about 12%. That is an improvement I'm willing to stomach the extra risk of effectively overclocking the memory modules beyond their comfort zone for, given that I can fully mitigate it by virtue of properly working ECC.
Also: could you not have just bought slightly faster RAM, given the premium for ECC?
And no, because ECC UDIMMs at the speed I run mine at (3600 MT/s) simply do not exist - it is outside of what JEDEC ratified for the DDR4 spec.
DDR4-1600 (PC4-12800)
DDR4-1866 (PC4-14900)
DDR4-2133 (PC4-17000)
DDR4-2400 (PC4-19200)
DDR4-2666 (PC4-21300)
DDR4-2933 (PC4-23466)
DDR4-3200 (PC4-25600) (the highest supported in the DDR4 generation)
What's *NOT* supported are the enthusiast speeds that typically require more than 1.2 V, for example 3600 MT/s, 4000 MT/s and 4266 MT/s.
Also, could you share some relevant info about your processor, mainboard, and UEFI? I see many internet commenters question whether their ECC is working (or ask if a particular setup would work), and far fewer that report a successful ECC consumer desktop build. So it would be nice to know some specific product combinations that really work.
- ASRock B450 Pro4
- ASRock B550M-ITX/ac
- ASRock Fatal1ty B450 Gaming-ITX/ac
- Gigabyte MC12-LE0
There are probably many others with proper ECC support. Vendor spec sheets usually hint at properly working ECC in their firmware if they mention "ECC UDIMM" support specifically.

As for CPUs, that is even easier for AM4: everything that is not based on an APU die can support ECC, while APU-derived parts cannot (beware of SKUs marketed without an iGPU that are really APUs with the iGPU part disabled, such as the Ryzen 5 5500). An exception to that rule are the "PRO"-series APUs, such as the Ryzen 5 PRO 5650G et al., which have an iGPU but also support ECC. The main differences (apart from the integrated graphics) between CPU and APU SKUs are that the latter do not support PCIe 4.0 (APUs are limited to PCIe 3.0) and have a few watts lower idle power consumption.
When I originally built the desktop PC that I still use (after a number of in-place upgrades, such as swapping out the CPU/GPU combo for an APU), I blogged about it (in German) here: https://johannes.truschnigg.info/blog/2020-03-23#0033-2020-0...
If I were to build an AM5 system today, I would look into mainboards from ASUS for proper ECC support - they seem to have it pretty much universally supported on their gear. (Actual out-of-band ECC with EDAC support on Linux, not the DDR5 "on-DIE" stuff.)
This was running at like, 1866 or something. It's a pretty barebones 8th gen i3 with a beefier chipset, but ECC still came in clutch. I won't buy hardware for server purposes without it.
Edit: it's probably because I switched it to "energy efficiency mode" instead of "performance mode" because it would occasionally lock up in performance mode. Presumably with the same root cause.
Last winter I was helping someone put together a new gaming machine... it was so frustrating running into the fake ECC marketing for DDR5 that you mention. The motherboard situation - whether they support it or not, or whether a BIOS update added support, then removed it, then added it back or not - was also really sad. And even worse, IMO, is that you can't actually max out 4 slots on the top-tier mobos unless you're willing to accept a huge drop in RAM speed. That leads to ugly 48 GB sticks and limiting yourself to two of them... In the end we didn't go with ECC for that build, but I was pretty disappointed about it. I'm hoping the next gen will be better; for my own setup running ZFS and such, I'm not going to give up ECC.
Some vendors use Hamming codes with "holes" in them, and you need the CPU to also run ECC (or at least error detection) between RAM and the cache hierarchy.
Those things are optional in the spec, because we can’t have nice things.
I wish AMD would make ECC a properly advertised feature with clear motherboard support. At least DDR5 has some level of ECC.
That is mostly to assist manufacturers in selling marginal chips with a few bad bits scattered around. It's really a step backwards in reliability.
Both the 8700G and the 8700G PRO are readily available in the EU, and the PRO SKU is about 50% more expensive (EUR 120 in absolute numbers): https://geizhals.eu/?cmp=3096260&cmp=3096300&cmp=3200470&act...
Does anyone maintain a list of de-facto ECC support for AMD chips and mainboards? That part-list site only shows official support IIRC, so it won't give you any results.
However, in the past there existed a very few CPU models and motherboards that supported either kind of DIMM, while today this has become completely impossible, as the mechanical and electrical differences between them have increased.
In any case, today, like also 20 years ago, when searching for ECC DIMMs you must always search only the correct type, e.g. unbuffered ECC DIMMs for desktop CPUs.
In general, registered ECC DIMMs are easier to find, because wherever "server memory" is advertised, that is what is meant. For desktop ECC memory, you must be careful to see both "ECC" and "unbuffered" mentioned in the module description.
For out-of-band ECC, e.g. with standard ECC SODIMMs, all the embedded SBCs that I have seen used only CPUs that are very obsolete nowadays, i.e. ancient versions of Intel Xeon or old AMD industrial Ryzen CPUs (AMD's series of industrial Ryzen CPUs are typically at least one or two generations behind their laptop/desktop CPUs).
Moreover all such industrial SBCs with ECC SODIMMs were rather large, i.e. either in the 3.5" form factor or in the NanoITX form factor (120 mm x 120 mm), and it might have been necessary to replace their original coolers with bigger heatsinks for fanless operation.
In-band ECC causes a significant decrease of the performance, but for most applications of such mini-PCs the performance is completely acceptable.
something like that?
In my experience, it's generally unwise to push the platform you're on to the outermost of its spec'd limits. At work, we bought several 5950X-based Zen3 workstations with 128GB of 3200MT/s ECC UDIMM, and two of these boxes will only ever POST when you manually downclock memory to 3000MT/s. Past a certain point, it's silicon lottery deciding if you can make reality live up to the datasheets' promises.
edit: Looks like a lot of Asus motherboards work, and the thing to look for is "unbuffered" ECC. Kingston has some, I see 32GB module for $190 on Newegg.
Doesn't exactly sound like a use case for ECC memory, given that it can't correct these attacks. Interesting though, I'd have thought that virtual addresses would've largely fixed this.
I have followed his blog for years and hold him in high respect so I am surprised he has done that and expected stability at 100C regardless of what Intel claim is okay.
Not to mention that you rapidly hit diminishing returns past 200W with current-gen Intel CPUs, although he mentions caring about idle power usage. Why go from 150W to 300W for a 20% performance increase?
Given the motherboard and RAM will also generate quite some heat, if the case fan profile was conservative (he does mention he likes low noise), could be the insides got quite toasty.
Back when I got my 2080 Ti, I had this issue when gaming. The internal temps would get so hot due to the blanket effect of the padding I couldn't touch the components after a gaming session. Had to significantly tweak my fan profiles. His CPU at peak would generate about the same amount of heat as my 2080 Ti + CPU I had then, and I had the non-Compact case with two case fans.
[1]: https://michael.stapelberg.ch/posts/2025-05-15-my-2025-high-...
I also have a fractal define case with anti noise padding material and dust filters, but my temperatures are great and the computer is almost inaudible even when compiling code for hours with -j $(nproc). And my fans and cooler are much cheaper than his.
That should of course be sound padding...
Intel specifies a max operating temperature of 105°C for the 285K [1]. Also modern CPUs aren't supposed to die when run with inadequate cooling, but instead clock down to stay within their thermal envelope.
[1]: https://www.intel.com/content/www/us/en/products/sku/241060/...
Because CPUs can get much hotter at specific spots on the die, no? Just because you're reading 100 doesn't mean there aren't spots that are way hotter.
My understanding is that modern Intel CPUs have a temp sensor per core + one at package level, but which one is being reported?
Anyway, OP's cooler should be able to cool down 250W CPUs below 100C. He must have done something wrong for this to not happen. That's my point -- the motherboard likely overclocked the CPU and he failed to properly cool it down or set a power limit (PL1/PL2). He could have easily avoided all this trouble.
And yeah, having Arrow Lake running at its defaults is just a waste of energy. Even halving your TDP just loses you roughly 15% performance in highly MT scenarios...
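On Linux you can also do the power-limit part at runtime instead of in the firmware. A rough sketch via the powercap/RAPL sysfs interface (paths can vary per platform, the example values are arbitrary, and the limits reset on reboot):

# current long-term (PL1) and short-term (PL2) package limits, in microwatts
cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
cat /sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw

# cap PL1 at 125 W and PL2 at 150 W
echo 125000000 | sudo tee /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
echo 150000000 | sudo tee /sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw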
I did not overclock this CPU. I pay attention to what I change in the BIOS/UEFI firmware, and I never select any overclocking options.
Also, I have applied thermal paste properly: Noctua-supplied paste, following Noctua’s instructions for this CPU socket.
https://www.techpowerup.com/review/intel-core-ultra-9-285k/2... lists maximum temperature as 88.2C with the previous gen NH-D15 cooler.
When you do not have a bunch of components ready to swap out it is also really hard to debug these issues. Sometimes it’s something completely different like the PSU. After the last issues, I decided to buy a prebuilt (ThinkStation) with on-site service. The cooling is a bit worse, etc., but if issues come up, I don’t have to spend a lot of time debugging them.
Random other comment: when comparing CPUs, a sad observation was that even a passively cooled M4 is faster than a lot of desktop CPUs (typically single-threaded, sometimes also multi-threaded).
And if we are talking about a passively cooled M4 (a MacBook Air, basically), it will throttle quite heavily relatively quickly; you lose at the very least 30%.
So, let's not misrepresent things: Apple CPUs are very power efficient, but they are not magic; if you hit them hard, they still need good cooling. Plenty of people have had that experience with their M4 Max, discovering that if they actually use the laptop as a workstation, it will generate a good amount of fan noise - there is no way around it.
Apple stuff is good because most people actually have bursty workload (especially graphic design, video editing and some audio stuff) but if you hammer it for hours on end, it's not that good and the power efficiency point becomes a bit moot.
I think a lot of it boils down to load profile and power delivery. My 2500VA double conversion UPS seems to have difficulty keeping up with the volatility in load when running that console app. I can tell because its fans ramp up and my lights on the same circuit begin to flicker very perceptibly. It also creates audible PWM noise in the PC which is crazy to me because up til recently I've only ever heard that from a heavily loaded GPU.
For a long time, my Achilles' heel was my Bride's vacuum. Her Dyson pulled enough amps that the UPS would start singing and trigger the auto-shutdown sequence for the half rack. Took way too long to figure out, as I was usually not around when she did it.
You said the right words but with the wrong meaning! On Gigabyte mobo you want to increase the "CPU Vcore Loadline Calibration" and the "PWM Phase Control" settings, [see screenshot here](https://forum.level1techs.com/t/ddr4-ram-load-line-calibrati...).
When I first got my Ryzen 3900X cpu and X570 mobo in 2019, I had many issues for a long time (freezes at idle, not waking from sleep, bios loops, etc). Eventually I found that bumping up those settings to ~High (maybe even Extreme) was what was required, and things worked for 2 years or so until I got a 5950X on clearance last year.
I slotted that into the same mobo and it worked fine, but when I was looking at HWMon etc., I noticed some strange things with the power/voltage. After some mucking about and theorising with ChatGPT (it's way quicker than googling for uncommon problems), it became apparent that the ~High LLC/power settings I was still using were no good. ChatGPT explained that my 3900X was probably a bit "crude" in relative quality, and so it needed the "stronger" power settings to keep itself in order. Then when I swapped to the 5950X, it happened to be more "refined" and thus didn't need to be "manhandled" - and in fact, didn't like being manhandled at all!
But if your UPS (or just the electrical outlet you're plugged into) can't cope - dunno if I'd describe that as cratering your CPU.
Yea, but unfortunately it comes attached to a Mac.
An issue I've often encountered with motherboards is that they have brain-damaged default settings that run CPUs out of spec. You really have to go through it all with a fine-toothed comb and make sure everything is set to conservative, stock, manufacturer-recommended settings. And my stupid MSI board resets everything (every single BIOS setting) to MSI defaults when you upgrade its BIOS.
It looks completely bonkers to me. I overclocked my system to ~95% of what it is able to do with almost default voltages, using bumps of 1-3% over stock, which (AFAIK) is within acceptable tolerances, but it requires hours and hours of tinkering and stability testing.
Most users just set automatic overclocking, have their motherboards push voltages to insane levels, and then act surprised when their CPUs start bugging out within a couple of years.
Shocking!
I'd rather run everything at 90% and get very big power savings and still have pretty stellar performance. I do this with my ThinkStation with Core Ultra 265K now - I set the P-State maximum performance percentage to 90%. Under load it runs almost 20 degrees Celsius cooler. Single core is 8% slower, multicore 4.9%. Well worth the trade-off for me.
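For anyone who wants to replicate this on Linux rather than in the firmware: a minimal sketch, assuming the intel_pstate driver is active (the setting does not survive a reboot):

# confirm which cpufreq driver is in use
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver

# cap the performance ceiling at 90% of the maximum P-state
echo 90 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct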
(Yes, I know that there are exceptions.)
You can always play with the CPU governor / disable high power states. That should be well-tested.
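For example (a sketch; cpupower ships with the kernel tools, and the no_turbo knob assumes the intel_pstate driver):

# switch all cores to the powersave governor
sudo cpupower frequency-set -g powersave

# and/or disable turbo boost entirely
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo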
I think you are confusing with undervolting.
It turned out during the shitcoin craze and then AI craze that hardcore gamers, aka boomers with a lot of time and retirement money on their hands and early millennials working in big tech building giant-ass man caves, are a sizeable demographic with very deep pockets.
The wide masses however, they gotta live with the scraps that remain after the AI bros and hardcore gamers have had their pick.
https://www.pugetsystems.com/blog/2024/08/02/puget-systems-p...
tl;dr: they heavily customize BIOS settings, since many BIOSes run CPUs out-of-spec by default. With these customizations there was not much of a difference in failure rate between AMD and Intel at that point in time (even when including Intel 13th and 14th gen).
Yeah. If Asahi worked on newer Macs and Apple Silicon Macs supported eGPU (yes I know, big ifs), the choice would be simple. I had NixOS on my Mac Studio M1 Ultra for a while and it was pretty glorious.
I had the same issue with my MSI board; the next one won't be an MSI.
My modern-CPU problems are DDR5 and the pre-boot memory training never completing. So a 9700X build of mine that WAS supposed to be located remotely has to sit in my office and have its hand held through every reboot, because you never quite know when it's going to decide it needs to retrain and randomly never come back. Recovery requires pulling the plug from the back, waiting a few minutes, powering back on, and then waiting 30 minutes for 64 GB of DDR5 to do its training thing.
My system would randomly freeze for ~5 seconds, usually while gaming and having a video running in the browser at the same time. Then it would reliably happen in Titanfall 2, and I noticed there were always AHCI errors in the Windows logs at the same time, so I switched to an NVMe drive.
The system would also shut down occasionally (~ once every few hours) in certain games only. Then, I managed to reproduce it 100% of the time by casting lightning magic in Oblivion Remastered. I had to switch out my PSU, the old one probably couldn't handle some transient load spike, even though it was a Seasonic Prime Ultra Titanium.
I have an M1 Max, a few revisions old, and the only thing I can do to spin up the fans is run local LLMs or play Minecraft with the kids on a giant ultra wide monitor at full frame rate. Giant Rust builds and similar will barely turn on the fan. Normal stuff like browsing and using apps doesn’t even get it warm.
I’ve read people here and there arguing that instruction sets don’t matter, that it’s all the same past the decoder anyway. I don’t buy it. The superior energy efficiency of ARM chips is so obvious I find it impossible to believe it’s not due to the ISA since not much else is that different and now they’re often made on the same TSMC fabs.
This isn't really true. On the same process node the difference is negligible. It's just that Intel's process in particular has efficiency problems and Apple buys out the early capacity for TSMC's new process nodes. Then when you compare e.g. the first chips to use 3nm to existing chips which are still using 4 or 5nm, the newer process has somewhat better efficiency. But even then the difference isn't very large.
And the processors made on the same node often make for inconvenient comparisons, e.g. the M4 uses TSMC N3E but the only x86 processor currently using that is Epyc. And then you're obviously not comparing like with like, but as a ballpark estimate, the M4 Pro has a TDP of ~3.2W/core whereas Epyc 9845 is ~2.4W/core. The M4 can mitigate this by having somewhat better performance per core but this is nothing like an unambiguous victory for Apple; it's basically a tie.
> I have an M1 Max, a few revisions old, and the only thing I can do to spin up the fans is run local LLMs or play Minecraft with the kids on a giant ultra wide monitor at full frame rate. Giant Rust builds and similar will barely turn on the fan. Normal stuff like browsing and using apps doesn’t even get it warm.
One of the reasons for this is that Apple has always been willing to run components right up to their temperature spec before turning on the fan. And then even though that's technically in spec, it's right on the line, which is bad for longevity.
In consumer devices it usually doesn't matter, because most people rarely put any real load on their machines anyway, but it's something to be aware of if you actually intend to. For example, there used to be a Mac Mini Server product; people would put significant load on those, and they would eat their internal hard drives because the fan controller was tuned for acoustics over operating temperature.
This anecdote perfectly describes my few-generations-old Intel laptop too. The fans turn on maybe once a month. I don't think it's as power efficient as an M-series Apple CPU, but total system power is definitely under 10W during normal usage (including screen, wifi, etc).
One of the many reasons Snapdragon Windows laptops failed was that both AMD and Intel (Lunar Lake) were able to reach the claimed efficiency of those chips. I still think modern x86 can match ARM in efficiency if someone bothers to tune the OS and scheduler for the most common activities. The M series was based on Apple's phone chips, which were designed from the ground up to run on a battery all these years. AMD/Intel just don't see an incentive to do that, nor does Microsoft.
There is one exception: If I run an idle Windows 11 ARM edition VM on the mac, then the fans run pretty much all the time. Idle Linux ARM VMs don’t cause this issue on the mac.
I’ve never used windows 11 for x86. It’s probably also an energy hog.
What metric ought I to use when buying a CPU these days? Should I care about reviews? I am fine with a mid-range CPU, for what it is worth, and I thought of the AMD Ryzen 7 5700 or AMD Ryzen 5 5600GT, or anything with a similar price tag. They might even be lower-end by now?
Intel is just bad at the moment and not even worth touching.
And it's not bad power quality on the mains, as someone suggested (it's excellent here), or something 'in the air' (whatever that means), if it happens very soon after buying.
I would guess that a lot of it comes from bad firmware/mainboards, etc., like the recent issue with ASRock mainboards destroying Ryzen 9000-series CPUs: https://www.techspot.com/news/108120-asrock-confirms-ryzen-9... Anyone who uses Linux and has dealt with bad ACPI bugs, etc. knows that a lot of these mainboards probably have crap firmware.
I should also say that I had a Ryzen 3700X and 5900X many years back and two laptops with a Ryzen CPU and they have been awesome.
My belief is that it is in the memory controllers and the XMP profiles provided with RAM. It's very easy for the XMP profiles to be overly optimistic, or for the RAM to degrade over time and fall out of spec.
Meanwhile, my Intel systems are solid. Even the 9900K hand-me-down I gave to my partner. There is an advantage to using very old tech. And they're not even slower for gaming: everything is single-core bottlenecked anyway. Only in the past year or so has AMD surpassed Intel in single-core performance, and we are talking single-digit percentage differences for gaming.
I’m glad AMD has risen, but the dialogue about AMD vs intel in the consumer segment is tainted by people who can’t disconnect their stock ownership from reality.
https://www.cpubenchmark.net/cpu_value_alltime.html
CPUs like Intel Core Ultra 7 265K are pretty close to top Ryzens
If your workload is pointer-chasing, Intel's new CPUs aren't great though, and the X3D chips are possibly a good pick (if the workload fits in cache), which is why they get a lot of hype from reviewers who benchmark games and judge the score 90% based on that performance.
The only issues are with an Intel Bluetooth chipset and BIOS auto-detection bugs. Under Linux, the hardware is bug-for-bug compatible with Windows, and I'm down to zero known issues after doing a bit of hardware debugging.
My home server is on a 5600G. I turned it on, installed Home Assistant, Jellyfin, etc., and it has not been off since. It's been chugging along completely unattended, no worries.
Yes, it's in a basement where temperature is never above 21C, and it's almost never pushed to 100%, and certainly never for extended periods of time.
But it's the stock cooler, cheap motherboard, cheap RAM and cheap SSD (with expensive NAS grade mechanical hard drives).
[1] Well, most non-servers are probably laptops today, but the same reasoning applies.
Definitely not that one if you plan to pair it with a dedicated GPU! The 5700X has twice the L3 cache. All Ryzen 5000 parts with an iGPU have only 16 MB, and the 5700 is one of those with the iGPU deactivated.
But see, this is why it is so difficult. I would have never guessed. I would have to research this A LOT, which I am fine with, but you know.
I also have this issue.
A common approach is to go into the BIOS/UEFI settings and check that c6 is disabled. To verify and/or temporarily turn c6 off, see https://github.com/r4m0n/ZenStates-Linux
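Going from memory of that project's README (double-check with --help before relying on it), usage is roughly this; the script pokes MSRs, so the msr kernel module has to be loaded:

sudo modprobe msr
sudo ./zenstates.py --list          # show current P-states and C6 state
sudo ./zenstates.py --c6-disable    # disable C6 until the next reboot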
If I enable virtualisation, the issue can be replicated within 15 minutes of boot.
But with basically half the CPU set to do nothing and all features disabled, it's once a week max.
Which sucks because I basically live in WSL.
I have always run B series because I've never needed the overclocking or additional peripherals. In my server builds I usually disable peripherals in the UEFI like Bluetooth and audio as well.
Twice the memory bandwidth, twice the CPU core count... It's really wacky how they've decided to name things
The Ultra is a pair of Max chips. While the core counts didn't increase from M3 to M4 Max, overall performance is in the neighborhood of 5-25% better. Which still puts the M3 Ultra as Apple's top end chip, and the M5 Max might not dethrone it either.
The uplift in IPC and core counts means that my M1 Max MBP has a similar amount of CPU performance as my M3 iPad Air.
Of course, each generation has some single-core improvements and eventually that could catch up, but it can take a while to catch up to… twice as much silicon.
It is cheaper and more stable, and the performance difference doesn't matter that much either.
On desktop PCs, thermal throttling is often set up as "just a safety feature" to this very day. Which means: the system does NOT expect to stay at the edge of its thermal limit. I would not trust thermal throttling with keeping a system running safely at a continuous 100C on die.
100C is already a "danger zone", with elevated error rates and faster circuit degradation - and there are only this many thermal sensors a die has. Some under-sensored hotspots may be running a few degrees higher than that. Which may not be enough to kill the die outright - but more than enough to put those hotspots into a "fuck around" zone of increased instability and massively accelerated degradation.
If you're relying on thermal throttling to balance your system's performance, as laptops and smartphones often do, then you seriously need to dial in better temperature thresholds. 100C is way too spicy.
If nothing else, it very clearly indicates that you can boost your performance significantly by sorting out your cooling because your cpu will be stuck permanently emergency throttling.
That said, there's a difference between a laptop cpu turbo boosting to 90 for a few minutes and a desktop cpu, which are usually cooler anyway, running at 100 sustained for three hours.
Maybe the pci bus is eating power, or maybe it’s the drives?
Smartphones have no active cooling and are fully dependent on thermal throttling for survival, but they can start throttling at as low as 50C easily. Laptops with underspecced cooling systems generally try their best to avoid crossing into triple digits - a lot of them max out at 85C to 95C, even under extreme loads.
I had an 8th-gen i7 sitting at the thermal limit (~100C) in a laptop for half a decade 24/7 with no problem. As sibling comments have noted, modern CPUs are designed to run "flat-out against the governor".
Voltage-dependent electromigration is the biggest problem and what led to the failures in Intel CPUs not long ago, perhaps ironically caused by cooling that was "too good" --- the CPU saw that there was still plenty of thermal headroom, so it boosted frequency and the accompanying voltage to reach the limit, and went too far with the voltage. If it had hit the thermal limit, it would've backed off on the voltage and frequency.
> I also double-checked if the CPU temperature of about 100 degrees celsius is too high, but no: [..] Intel specifies a maximum of 110 degrees. So, running at “only” 100 degrees for a few hours should be fine.
Secondly, the article reads:
> Tom’s Hardware recently reported that “Intel Raptor Lake crashes are increasing with rising temperatures in record European heat wave”, which prompted some folks to blame Europe’s general lack of Air Conditioning.
> But in this case, I actually did air-condition the room about half-way through the job (at about 16:00), when I noticed the room was getting hot. Here’s the temperature graph:
> [GRAPH]
> I would say that 25 to 28 degrees celsius are normal temperatures for computers.
So apparently a Tom's Hardware article connected a recent heat wave with crashing computers containing Intel CPUs. They brought that up to rule it out by presenting a graph showing reasonable room temperatures.
I hope this helps.
No. High performance gaming laptops will routinely do this for hours on end for years.
If it can't take it, it shouldn't allow it.
Intel's basic 285K spec's - https://www.intel.com/content/www/us/en/products/sku/241060/... - say "Max Operating Temperature 105 °C".
So, yes - running the CPU that close to its maximum really isn't doing stability, or longevity, any favors.
No reason to doubt your assertion about gaming laptops - but chip binning is a thing, and the manufacturers of those laptops have every reason to pay Intel a premium for CPUs which test to better values of X, Y, and Z.
But I just can't bring myself to upgrade this year. I dabble in local AI, where it's clear fast memory is important, but the PC approach is just not keeping up without going to "workstation" or "server" parts that cost too much.
There are glimmers of hope with MR-DIMMs, CUDIMMs, and other approaches, but really boards and CPUs need to support more memory channels. Intel has a small advantage over AMD, but it's nothing compared to the memory speed of a Mac Pro or higher. "Strix Halo" offers some hope with four-memory-channel support, but it's meant for notebooks so isn't really expandable (which would enable à la carte hybrid AI: fast GPUs with reasonably fast shared system RAM).
I wish I could fast forward to a better time, but it's likely fully integrated systems will dominate if the size and relatively weak performance for some tasks makes the parts industry pointless. It is a glaring deficiency in the x86 parts concept and will result in PC parts being more and more niche, exotic and inaccessible.
On the flip-side, though: Running GPT-OSS-120b locally is "cool", but have people found useful, productivity enhancing use-cases which justify doing this over just loading $2000 into your OpenAI API account? That, I'm less sure of.
I think we'll get to the point where running a local-first AI stack is obviously an awesome choice; I just don't think the hardware or models are there yet. Next-year's Medusa Halo, combined with another year of open source model improvements might be the inflection point.
That being said, for AI, HEDT is the obvious answer. Back in the day, it was much more affordable with my 9980XE only costing $2,000.
I just built a Threadripper 9980 system with 192GB of RAM and good lord it was expensive. I will actually benefit from it though and the company paid for it.
That being said, there is a glaring gap between "consumer" hardware meant for gaming and "workstation" hardware meant for real performance.
Have you looked into a 9960 Threadripper build? The CPU isn't TOO expensive, although the memory will be. But you'll get a significantly faster and better machine than something like a 9950X.
I also think besides the new Threadripper chips, there isn't much new out this year anyways to warrant upgrading.
Competitors to NVidia really need to figure things out, even for gaming with AI being used more I think a high end APU would be compelling with fast shared memory.
It seems like large, unchallenged organizations like Intel (or NASA or Google) collect all the top talent out of school. But changing budgets, changing business objectives, frozen product strategies make it difficult for emerging talent to really work on next-generation technology (those projects have already been assigned to mid-career people who "paid their dues").
Then someone like Apple Silicon with M-chip or SpaceX with Falcon-9 comes along and poaches the people most likely to work "hardcore" (not optimizing for work/life balance) while also giving the new product a high degree of risk tolerance and autonomy. Within a few years, the smaller upstart organization has opened up in un-closeable performance gap with behemoth incumbent.
Has anyone written about this pattern (beyond Innovator's Dilemma)? Does anyone have other good examples of this?
I gather it's very difficult and expensive to make a board that supports more channels of RAM, so that seems worth targeting at the platform level. Eight channel RAM using common RAM DIMMs would transform PCs for many tasks, however for now gamers are a main force and they don't really care about memory speed.
How do you sell your systems when their time comes?
- cheap ULV chips like N100, N150, N300
- ultrabook ULV chips (I hope Lunar Lake is not a fluke)
- workstation chips that aren't too powerful (mainstream Core CPUs)
- inexpensive GPUs (a surprising niche, but excruciatingly small)
AMD has been dominating them in all other submarkets.

Without a mainstream halo product, Intel has been forced to compete on price, which is not something they can afford. They have to make a product that leapfrogs either AMD or Nvidia and successfully (and meaningfully) iterate on it. The last time they tried something like that was in 2021 with the launch of Alder Lake, but AMD overtook them with 3D V-Cache in 2022.
I've never overclocked anything and I've never felt I've missed out in any way. I really can't imagine spending even one minute trying to squeeze 5% or whatnot tweaking voltages and dealing with plumbing and roaring fans. I want to use the machine, not hotrod it.
I would rather Intel et al. leave a few percent "on the table" and sell things that work, for years on end without failure and without a lot of care and feeding. Lately it looks like a crapshoot trying to identify components that don't kill themselves.
This is about sane, stable defaults. If you want the extra performance far beyond the CPU's sweet spot, it should be made explicit that you're forfeiting the stability headroom.
Well, that's the issue, isn't it? Both Intel and AMD (resp. their board partners) had issues in recent times stemming from the increasingly aggressive push to the limit for those last few %.
That sounds terrible.
For example, various brands of motherboards are / were known to basically blow up AMD CPUs when using AMP/XMP, with the root cause being that they jacked an uncore rail way up. Many people claimed they did this to improve stability, but overclockers know that that rail has a sweet spot for stability and they went way beyond it (so much so that the actual silicon failed and burned a hole in itself with some low-ish probability).
https://www.computerbase.de/artikel/prozessoren/amd-ryzen-79...
Actually, almost everything you wrote is untrue, and the commenter above already sent you some links.
7800X3D is the GOAT, very power efficient and cool.
And even if you could push it higher, they run very hot compared to other CPUs at the same power usage, due to a combination of AMD's very thick IHS, the compute chiplets being small and power-dense, and the 7000-series X3D cache sitting on top of the compute chiplet, unlike the 9000 series, which has it on the bottom.
The 9800X3D, limited in the same way, will be both mildly more power efficient thanks to its faster cores and run cooler because of the cache location. The only reason it's hotter is that it's allowed to use significantly more power, usually up to 150 W stock - to do that on a 7800X3D, you'd have to remove the IHS if you didn't want to see magic smoke.
I use Arch, btw ;)
https://www.theregister.com/2025/08/29/amd_ryzen_twice_fails...
A sufficient cooler with sufficient airflow is always needed.
The 13900k draws more than 200W initially and thermal throttles after a minute at most, even in an air conditioned room.
I don't think that thermal problems should be pushed to end user to this degree.
So if your CPU is drawing "more than 200W" you're pretty much at the limits of your cooler.
This causes other issues with the laptop, like severe thermal throttling in both the CPU and GPU.
A utility like ThrottleStop allows me to place maximums on power usage so I don't hit TjMax during regular use. That is around 65-70W for the CPU - which can burst to 200+W in its allowed "Performance" mode. Absolutely nuts.
But I agree this should not be a problem in the first place.