Posted by dkechag 9 hours ago
The most insane part here is that the AMD EPYC 4565P can beat the Turins used by the cloud providers, by as much as 2x in single-core performance.
Our tests took 2 minutes on GCP, 1 minute flat on the 4565P with its boost to 5.1 GHz holding steady vs only 4.1 GHz on the GCP ones.
GCP charges $130 a month for 8 vCPUs. ALSO this is for SPOT that can be killed at any moment.
My 4565P is a $500 CPU... 32 vCPUs... racked in a datacenter. The machine cost under $2k.
I am trying hard to convince more people to rack servers themselves, especially for CI actions. The cloud provider charging $130 / mo for 3x less vcpus you break even in a couple months, it doesn't matter if it dies a few months later. On top of that you're getting fully dedicated hardware and 2x the perf. Anyways... glad to see I chose the right CPU type for gcloud, even though nothing comes close to the cost / perf of self-racking.
For €104/mo you can get a 16-core Ryzen 9 7950X3D (basically identical to your 4565p) w/ 128GB DDR5, 2x2TB PCIE Gen4 SSD.
That's not to say you're wrong about dedicated being much better value than VPS on a performance per dollar basis, but the markup that the European companies charge is much, much lower compared to what they'd charge in the US.
In this instance you're looking at a ~17 month payback period even ignoring colo fees. Assuming a ~$100 colo fee that sibling comment suggested, you're looking at closer to 8 years.
It’s fun to start thinking about building your own server and putting it in a rack, but there’s always a lot of tortured math to compare it to completely different cloud hosted solutions.
One of the great things about cloud instances is that I can scale them up or down with the load without being locked into some hardware I purchased. For products I’ve worked on that have activity curves that follow day-night cycles or spike on holidays, this has been amazing. In some cases we could auto scale down at night and then auto scale back up during the day. As the user base grows we can easily switch to larger instances. We can also geographically distribute servers and provide lower latency.
There is a long list of benefits that are omitted when people make arguments based solely on monthly cost numbers. If we’re going to talk about long term dedicated server contracts we should at least price against similar options from companies like Hetzner.
At work we have this day / night cycle. But for some reason we're married to AWS. If we provisioned 24/7/365 a bunch of servers at Hetzner or such to cover the peaks with some margin, it would still be cheaper by a notable margin. Sure, 90% of them would twiddle their thumbs from 10 PM to 10 AM. So what?
Sure, if your clients are completely unpredictable and you'll see 100x traffic without notice, the cloud is great.
But how many companies are actually in that kind of situation? Looking back over a year or two, we're quite reliably able to predict when we'll have more visitors and how many more compared to baseline. We could just adjust the headroom to be able to take in those spikes. And I suppose if you want to save the environment, you could just turn off the Hetzner servers while they sit unused.
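To put a rough shape on that argument, here is a minimal sketch comparing server-hours for a fleet sized for the peak against a fleet that follows the load. The demand curve and the `spare_servers` margin are made-up illustration values, not numbers from anyone's real workload, and the function names are hypothetical.

```python
def fixed_fleet_server_hours(hourly_demand, spare_servers=0):
    """Server-hours consumed by a fleet sized for the peak hour and left on all day."""
    fleet_size = max(hourly_demand) + spare_servers
    return fleet_size * len(hourly_demand)


def autoscaled_server_hours(hourly_demand):
    """Server-hours consumed when the fleet tracks demand hour by hour."""
    return sum(hourly_demand)


def break_even_price_ratio(hourly_demand, spare_servers=0):
    """How many times pricier a cloud server-hour must be before the fixed fleet wins."""
    return (fixed_fleet_server_hours(hourly_demand, spare_servers)
            / autoscaled_server_hours(hourly_demand))


# Hypothetical day: 1 server needed from 22:00 to 10:00, 10 servers from 10:00 to 22:00.
demand = [1] * 10 + [10] * 12 + [1] * 2

print(fixed_fleet_server_hours(demand, spare_servers=2))          # 12 servers * 24 h = 288
print(autoscaled_server_hours(demand))                            # 132
print(round(break_even_price_ratio(demand, spare_servers=2), 2))  # 2.18
```

Under these assumed numbers the always-on fleet burns a bit more than twice the server-hours, so it only comes out cheaper if a comparable cloud server-hour costs more than about 2.2x a dedicated one. Plug in your own curve and prices before drawing conclusions.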
> The cloud provider charging $130 / mo for 3x less vcpus you break even in a couple months, it doesn't matter if it dies a few months later
How do you calculate break even in a couple months if the machine costs $2,000 and you still have to pay colo fees?
If your colo fees were $100 month you wouldn’t break even for over 4 years. You could try to find cheaper colocation but even with free colocation your example doesn’t break even for over a year.
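The arithmetic behind those figures can be sketched as below. It assumes, as this comment does, that the alternative is a single instance at the $130/month GCP price quoted earlier in the thread, and that the $2,000 machine is a pure up-front cost. `payback_months` is a hypothetical helper name; power, bandwidth, and engineer time are deliberately left out.

```python
import math


def payback_months(hardware_cost, cloud_monthly, colo_monthly=0.0):
    """Months until the up-front hardware cost is covered by monthly savings.

    Returns math.inf when colocation costs as much as (or more than) the
    cloud bill, because the purchase then never pays for itself.
    """
    monthly_saving = cloud_monthly - colo_monthly
    if monthly_saving <= 0:
        return math.inf
    return hardware_cost / monthly_saving


free_colo = payback_months(2000, 130)        # about 15.4 months
paid_colo = payback_months(2000, 130, 100)   # about 66.7 months, roughly 5.6 years

print(f"free colo: {free_colo:.1f} months")
print(f"$100/mo colo: {paid_colo:.1f} months ({paid_colo / 12:.1f} years)")
```

The result is very sensitive to what you compare against: pricing the cloud side at several instances to match the core count shortens the payback a lot, which is why the two sides of this thread end up with such different numbers.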
Colo fees are cheap if you need more than just 1U. Even with a $50-100 fee you easily get way more performance and come out ahead within a year.
You originally said “a couple months” but now it’s 6 months and an assumption of $0 colocation fees, which isn’t realistic.
In my experience situations rarely call for precisely 32 cores for a fixed period of 3 years to support calculations like this anyway. We start with a small set of cloud servers and scale them up as traffic grows. Today’s tooling makes it easy to auto scale throughout the day, even.
When trying to rack a server everyone aims higher because it sucks to start running into limits unexpectedly and be stuck on a server that wasn’t big enough to handle the load. Then you have to start considering having at least two servers in case one starts failing.
Racking a single self-built server is great for hobby projects but it’s always more complicated for serving real business workloads.
Everyone: run your scenarios and expectations in a spreadsheet and then use real data to run your CBA. Your case will be unique(ish) so make your case for your situation.
I think you’re misreading. Even the 6 month thread was based on invalid assumptions of $0 colocation fees. Add in even cheap colocation fees and it’s pushed out even further.
That’s not really a nit pick when the claims were based on impossible math. It’s more of a motte-and-bailey where they come in with a “couple of months” claim that sounds awesome on the surface but then fall back to a completely different number if anyone looks at the details.
Let’s not forget that if even three engineers are working on this migration for only a week, your cost is now tens of thousands for a couple hundred euros of savings.
(assuming avg all-in engineer costs in Europe)
It mostly makes no sense to optimise infrastructure for cost; it does make sense to make it faster, since almost all your spend is on engineers.
Spending thousands to save hundreds is not a healthy business.
Not sure where that fear comes from. Cloud challenges can be as or more complex than bare metal ones.
Probably because most developers these days have not known a world without using cloud providers, with AWS being 20 years old now.
Big +1 to this. For what I thought was a modest sized project it feels like an NP-hard problem coordinating with gcloud account reps to figure out what regions have both enough hyperdisk capacity and compute capacity. A far cry from being able to just "download more ram" with ease.
The cloud ain't magic folks, it's just someone else's servers.
(All that said... still way easier than if I needed to procure our own hardware and colocate it. The project is complete. Just delayed more than I expected.)
It’s funny, because AWS did not start this type of business. What they did do is make it possible to pay by the hour. The ephemeral spare compute is what they started.
Yet almost nobody understood the ephemeral part.
You might even be better off running a Mac mini on home fiber, especially for backend processing.
What do you think the typical duty cycle is for a CI machine?
Raw performance is kind of meaningless if you aren't actually using the hardware. It's a lot of up front capex just to prove a point on a single metric.
Our CI runs smaller PR checks during the day when devs make changes. In the “downtime” we run longer/more complex tests such as resilience and performance tests. So typically a machine is utilised 20-22 hours a day, 7 days a week.
If you need single-threaded performance, colo is really the only way to go anyway.
We have two full racks and we're super happy with them.
Don't use Hetzner for anything actually important to you. :(
- sometimes you need to limit the list of available CPU features to allow live migration between different hypervisors
- even if you migrate the virtual machine to the latest state-of-the-art CPU, /proc/cpuinfo won't reflect it (Linux would go crazy if you tried to switch the CPU information on the fly); the frequency boost would be measurable though, just not via /proc/cpuinfo
Similar lower specced machines that were closer to the public internet had boot disk failures, but I had a few of them, so it wasn’t an issue. Spinning metal and all.
One of the db servers dying would have required a next day colo visit… so I never rebooted.
This proc is a hidden gem.
For most workloads it’s not just the most performant, but also the best bang-for-buck.
A year ago I gave a talk about optimizing cloud cost efficiency and I did a comparison of colocation vs cloud over time. You might find it interesting; here is a link to the relevant part: https://youtu.be/UEjMr5aUbbM?si=4QFSXKTBFJa2WrRm&t=1236
TLDR, colocation broke even in 6 and 18 months against on-demand and 3-year reserved cloud instances respectively. But spot instances can actually be quite a bit cheaper than colocation.
You generally don't go to the cloud for the price (unless we are talking Hetzner etc).
Make things look like a complicated black box. Make sure it feels scary to roll your own. Hide the core technical skills behind abstracted skills
This was a really, really good write-up. I appreciated the breadth of VMs tested and the spread of benchmarks. A few random observations:
1. Turin is a beast.
2. The data on price-performance makes Hetzner look really fantastic, especially for small scale projects where region placement doesn’t matter much and big bursty scaling isn’t required.
3. I think the first cloud VM I ever provisioned was on DigitalOcean. I was surprised at how old their fleet was, but I guess they have some limited Emerald Rapids offerings now: https://www.digitalocean.com/blog/introducing-5th-gen-xeon-p...
They're a typical hardware maker unable to focus on software, which is why NVIDIA is now a multi-trillion dollar corporation and AMD is "just" a few hundred billion.
They've focused too much on CPUs and completely dropped the ball on AI and compute accelerators.
It's especially sad considering that the MI300 and related accelerators on paper are competitive with NVIDIA hardware, it's just that they have nowhere near the same software stack, so nobody cares.
We were stuck with Intel; it's nice that we have better CPUs.
And don’t get me started on the valuation of companies riding the AI bubble.
7-Zip benchmark
9800X3D 130 GIPS compression, 134 GIPS decompression.
C8A 21.6 GIPS (21577 MIPS) compression, 9.9 GIPS (9868 MIPS) decompression.
Geekbench 5
9800X3D 16975 multithread, 2474 single thread
C8A 4049 multithread, 2240 single thread
A desktop-class CPU is definitely quicker, single-threaded and multithreaded; no surprises there, since most of these are dual core. The single-threaded performance of the C8A is actually pretty good, but it's also the best of the bunch by a wide margin; most of the CPUs are far behind. Memory performance appears to be atrocious all around.
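For a quick sense of scale, the figures quoted in this comment can be turned into ratios. This is only a sketch over the numbers as posted (7-Zip values converted to GIPS); it says nothing about how the runs were configured, and the dictionary layout and `speedup` helper are just illustrative names.

```python
# (9800X3D score, C8A score) pairs taken from the comment above.
benchmarks = {
    "7-Zip compression (GIPS)": (130.0, 21.577),
    "7-Zip decompression (GIPS)": (134.0, 9.868),
    "Geekbench 5 multithread": (16975, 4049),
    "Geekbench 5 single thread": (2474, 2240),
}


def speedup(desktop_score, cloud_score):
    """Ratio of the desktop score to the cloud VM score (higher favours the desktop)."""
    return desktop_score / cloud_score


ratios = {name: speedup(desktop, cloud) for name, (desktop, cloud) in benchmarks.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

On these numbers the single-thread gap is only about 1.1x, while the multithreaded gaps range from roughly 4x to over 13x, which mostly reflects the difference in core count between a desktop part and a small VM.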
I’d only add that it's very common for gaming computers to be screaming fast, more so than workstation machines or servers, which are a bit more conservative with performance and emphasise correctness (slower cores, slower RAM, more ECC). It's not a lot, but it can feel annoying when you sit on a company issued workstation that cost €10,000 but get worse performance than a €2,500 gaming computer.
Every big corporate I have worked at has lower cost of capital than Amazon, and yet they want to move to AWS. I just don't understand it.
Cloud looks expensive on sticker price, but it buys instant provisioning, autoscaling, managed databases and multi-region DR, and those benefits only pay off if you actually exploit autoscaling, reserved or savings plans, spot fleets and cost tooling like Kubecost or AWS Compute Optimizer to enforce right-sizing and kill zombie instances.
If you want cheap dev and UAT keep them on on-prem metal or cheap colo, but automate with Terraform and run reproducible runtimes like k3s or devcontainers so environments stay consistent and you do not trade lower capex for a creeping operations nightmare.
I think in practice the system administrators are still in the company, now as AWS engineers. They still keep all that platform stuff running, and you're paying AWS for their engineers too, as well as electricity. It has the advantage of being very quick to spin up another box, but machines these days can come with 288 cores, so it's not a big stretch to maintain sufficient surplus and the tools to allow teams to self-service.
Things are in a different place to when AWS first released. AWS ought to be charging a lot less for the compute, memory and storage; their business is wildly profitable at current rates because machines got cheaper per core.
"We Moved from AWS to Hetzner. Cut Costs 89%. Here’s the Catch."
https://medium.com/lets-code-future/we-moved-from-aws-to-het...
ChatGPT tells me "no theory, no fluff" all the time :D
""" No warning. No traffic spike. Just… more money gone.
That’s when I finally looked at Hetzner.
I’ve seen too many backend systems fail for the same reasons — and too many teams learn the hard way.
So I turned those incidents into a practical field manual: real failures, root causes, fixes, and prevention systems.
No theory. No fluff. Just production. """
It's clearly slop, they immediately use effectively the same one again:
""" That last line isn’t a joke. There were charges I genuinely couldn’t explain. Elastic IPs we forgot to release. Snapshots from instances that no longer existed. CloudFront distributions someone set up for testing. """
No, human writers don't repeat this pattern every single paragraph. They use it at most once in a whole article.
I'm just irked that it's being called out for AI slop because "I feel it in my bones!!"
There's a good chance it was written using AI -- should that matter? If the content is wrong/sucks, say that instead. If you're going to dismiss all AI assisted writing: good luck in the next decade.
It’s like suddenly all memes are just the same meme and nobody makes their own memes because “AI does it better”.
The style of writing is an intrinsic part of communication, if you can’t critique that then what is content? We’re not machines sharing pieces of data with each other.
Maintaining and updating your own hardware comes with so much operational overhead compared to magically spinning up and down resources as needed. I don’t think this really needs to be said.
You're never just paying for the hardware.
This benchmark seems to recommend Oracle Cloud, but I’ve heard that Oracle has historically used aggressive licenses and legal terms to keep customers locked-in.
Once we did this, the move was fairly easy (the exception being having to write our own auto-scaling logic as the built-in one is very limited). Overall we reduced our cloud spend (even accounting for the additional staff) by about 40%. Bandwidth is practically free and you are not limited to specific combos of CPU/RAM (so you can easily provision something with 7 cores and 9 GB RAM). Another big factor for us was that compute costs in OCI don't vary by region.
I would not recommend using any other managed services from OCI besides the basics (we tried some and they are not very reliable). We've seen minor issues in networking periodically (private DNS, LBs or interconnectivity between compute instances), but overall I would say the switch has been worth it.
Weirdly they didn’t allow me to add payment info to continue. Even weirder their sales people kept contacting me asking me to come back. When I explained the situation they all tried to fix it and then went radio silent until the next sales rep came along to try to convince me to stay.
I searched Reddit at the time and a lot of other people had the same experience. A lot of other people were bragging about abusing their free tier and trials without consequences. I still don’t know how they decided to permanently close my account (without informing the sales team)
As long as you don't use their exclusive DBaaS, moving away is easier than from other places, as egress traffic is free.
The user experience though, stuff of nightmares...
The account creation process was really confusing, and they kept turning off my instance because usage was not high enough.
It seemed quite outdated/confusing to use last time I tried it a few years ago.
And the pricing is laughably expensive compared to OVH.