Posted by antov825 8/31/2025

Use One Big Server (2022) (specbranch.com)
350 points | 323 comments
ChrisArchitect 8/31/2025|
Previously: https://news.ycombinator.com/item?id=32319147
dang 8/31/2025|
Thanks! Macroexpanded:

Use one big server - https://news.ycombinator.com/item?id=32319147 - Aug 2022 (585 comments)

jeffrallen 8/31/2025||
I work for a cloud provider and I'll tell you, one of the reasons for the cloud premium is that it is a total pain in the ass to run hardware. Last week I installed two servers and between them had four mysterious problems that had to be solved by reseating cards, messing with BIOS settings, etc. Last year we had to deal with a 7-site, 5-country RMA for 150 100Gb copper cables with incorrect coding in their EEPROMs.

I tell my colleagues: it's a good thing that hardware sucks: the harder it is to run bare metal, the happier our customers are that they choose the cloud. :)

(But also: this is an excellent article, full of excellent facts. Luckily, my customers choose differently.)

Nextgrid 8/31/2025|
Fortunately, companies like Hetzner/OVH/etc will handle all this bullshit for you for a flat monthly fee.
ahdanggit 9/1/2025||
I used a colo once a few years ago at a small datacenter in the Midwest, and I was shocked at how unprofessional everything was: machines lying in the hallway, a guy sleeping in one of the offices. They let me set up my server and left me unattended several times; I could have just poked the power button on a nearby server or moved a cable or whatever. It was a 1.5-hour drive away, and I wasn't running anything serious, so I just went with it, but I pulled my stuff out after my 1-year subscription was up.
turtlebits 8/31/2025||
The problem is sizing and consistency. When you're small, it's not cost-effective to overprovision 2-3 big servers (for HA).

And when you need to move fast (or things break), you can't wait a day for a dedicated server to come up, or worse, have your provider run out of capacity (or have to pick a differently specced server).

IME, having to go multi-cloud/multi-provider is a way worse problem to have.

andersmurphy 8/31/2025||
Most industries are not bursty. Overprovisioning is not expensive for most businesses. You can handle 30,000+ updates a second on a $15 VPS.

A multi-node system tends to be less reliable, with more failure points, than a single-box system. Failures rarely happen in isolation.

You can do zero downtime deployment with a single machine if you need to.

Aeolun 9/1/2025||
> A multi-node system tends to be less reliable, with more failure points, than a single-box system. Failures rarely happen in isolation.

Just like a lot of problems exist between keyboard and chair, a lot of problems exist between service A and service B.

The zero downtime deployment for my PHP site consisted of symlinking from one directory to another.
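A minimal sketch of that pattern in Python (hypothetical paths; the whole trick is that rename() is atomic on POSIX, so "current" always points at a complete release tree):

    import os

    def activate_release(release_dir, current_link="/var/www/current"):
        # Point a temporary symlink at the new release, then rename it
        # over the live one. rename() is atomic on POSIX, so "current"
        # is never missing or half-updated while requests are in flight.
        tmp_link = current_link + ".tmp"
        os.symlink(release_dir, tmp_link)
        os.replace(tmp_link, current_link)

    # activate_release("/var/www/releases/2025-08-31")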

andersmurphy 9/1/2025||
Nice!

Honestly, we need to stop promoting prematurely making everything a network request as a good idea.

Nextgrid 9/1/2025||
> we need to stop promoting prematurely making everything a network request as a good idea

But how are all these "distributed systems engineers" going to get their resume points and jobs?

matt-p 8/31/2025||
There are a number of providers who provision dedicated servers via API in minutes these days. Given that a dedicated server starts at around $90/month, it probably does make sense for a lot of people.
winrid 9/1/2025||
A $20 dedicated server from OVH can outperform $144 VPSs from Linode in my testing, on PassMark.
talles 8/31/2025||
Don't forget the cost of managing your one big server and the risk of having such a single point of failure.
Puts 8/31/2025||
My experience after 20 years in the hosting industry is that customers in general have more downtime due to self-inflicted, over-engineered replication or split-brain errors than due to actual hardware failures. One server is the simplest and most reliable setup, and if you have backups and automated provisioning, you can re-deploy your entire environment in less time than it takes to debug a complex multi-server setup.

I'm not saying everybody should do this. There are, of course, a lot of services that can't afford even a minute of downtime. But there are also a lot of companies that would benefit from a simpler setup.

sgarland 9/1/2025|||
Yep. I know people will say, “it’s just a homelab,” but hear me out: I’ve run positively ancient Dell R620s in a Proxmox cluster for years. At least five. Other than moving them from TX to NC, the cluster has had 100% uptime. When I’ve needed to do maintenance, I drop one at a time, and it maintains quorum, as expected. I’ll reiterate that this is on circa-2012 hardware.

In all those years, I’ve had precisely one actual hardware failure: a PSU went out. They’re redundant, so nothing happened, and I replaced it.

Servers are remarkably resilient.

EDIT: 100% uptime modulo power failure. I have a rack UPS and a generator, but I once discovered the hard way that the UPS batteries couldn’t hold a charge long enough to keep the rack up while I brought the generator online.

whartung 9/1/2025||
Seeing as I love minor disaster anecdotes where doing all the "right things" seems to make no difference :)

We had a rack in a data center, and we wanted to put local UPSes on critical machines in the rack.

But the data center went on and on about their awesome power grid (shared with a fire station, so no administrative power loss), on-site generators, etc., and wouldn't let us.

Sure enough, one day the entire rack went dark.

It was the power strip in the data center's rack that failed. All the backup grids in the world can't get through a dead power strip.

(FYI, a family member lost their home due to a power strip, so, again anecdotally: if you have any older power strips (5-7+ years) sitting under your desk at home, you may want to consider swapping them out for new ones.)

sgarland 9/1/2025||
For sure, things can and will go wrong. For critical services, I’d want to split them up into separate racks for precisely that reason.

Re: power strips, thanks for the reminder. I’m usually diligent about that, but forgot about one my wife uses. Replacement coming today.

ocdtrekkie 8/31/2025||||
My single on-premise Exchange server is drastically more reliable than Microsoft's massive globally resilient whatever Exchange Online, and it costs me a couple hours of work on occasion. I probably have half their downtime, and most of mine is scheduled when nobody needs the server anyhow.

I'm not a better engineer, I just have drastically fewer failure modes.

talles 8/31/2025||
Do you develop and manage the server alone? It's quite a different reality when you have a big team.
ocdtrekkie 8/31/2025||
Mostly myself but I am able to grab a few additional resources when needed. (Server migration is still, in fact, not fun!)
api 9/1/2025||||
A lot of this attitude comes from the bad old days of '90s and early-2000s spinning disk. Those things failed a lot. It made everyone think you're going to have constant outages if you don't cluster everything.

Today’s systems don’t fail nearly as often if you use high quality stuff and don’t beat the absolute hell out of SSD. Another trick is to overprovision SSD to allow wear leveling to work better and reduce overall write load.

Do that and a typical box will run years and years with no issues.

motorest 8/31/2025||||
> My experience after 20 years in the hosting industry is that customers in general have more downtime due to self-inflicted, over-engineered replication or split-brain errors than due to actual hardware failures.

I think you misread OP. "Single point of failure" doesn't mean the only failure modes are hardware failures. It means that if anything happens to your node, whether it's a hardware failure, a power outage, someone stumbling over your power/network cable, or even a single service crashing, you have a major outage on your hands.

These types of outages are trivially avoided with a basic understanding of well-architected frameworks, which explicitly address the risk represented by single points of failure.

fogx 8/31/2025||
Don't you think it's highly unlikely that someone will stumble over the power cable in a hosted datacenter like Hetzner? And even if they did, you could just run a provisioned secondary server that jumps in if the first becomes unavailable, and still be much cheaper.
motorest 9/1/2025|||
> Don't you think it's highly unlikely that someone will stumble over the power cable in a hosted datacenter like Hetzner?

You're not getting the point. The point is that if you use a single node to host your whole web app, you are creating a system where many failure modes that otherwise would not even be an issue can easily trigger high-severity outages.

> And even if they did, you could just run a provisioned secondary server (...)

Congratulations, you are no longer using "one big server", thus defeating the whole purpose behind this approach and learning the lesson that everyone doing cloud engineering work already knows.

juped 9/1/2025||
Do you actually think dead simple failover is comparable to elastic kubernetes whatever?
motorest 9/1/2025||
> Do you actually think dead simple failover is comparable to elastic kubernetes whatever?

References to "elastic Kubernetes whatever" are a red herring. You can have a dead simple load balancer spreading traffic across multiple bare-metal nodes.
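To illustrate how little machinery "dead simple" can mean, here is a toy round-robin TCP proxy (a sketch, not production code; the backend addresses are made up, and in practice you'd reach for HAProxy or nginx):

    import asyncio
    import itertools

    # Hypothetical bare-metal backends to rotate through.
    BACKENDS = itertools.cycle([("10.0.0.1", 8080), ("10.0.0.2", 8080)])

    async def pipe(reader, writer):
        # Copy bytes one way until the sending side closes.
        try:
            while data := await reader.read(65536):
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_reader, client_writer):
        host, port = next(BACKENDS)  # round-robin pick
        backend_reader, backend_writer = await asyncio.open_connection(host, port)
        # Shuttle bytes in both directions until either side closes.
        await asyncio.gather(
            pipe(client_reader, backend_writer),
            pipe(backend_reader, client_writer),
        )

    async def main():
        server = await asyncio.start_server(handle, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()

    asyncio.run(main())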

juped 9/1/2025||
Thanks for switching sides to oppose yourself, I guess?
motorest 9/2/2025||
> Thanks for switching sides to oppose yourself, I guess?

I'm baffled by your comment. Are you sure you read what I wrote?

toast0 8/31/2025||||
I don't know about Hetzner, but the failure case isn't usually tripping over power plugs. It's putting a longer server in the rack above/below yours and pushing the power plug out of the back of your server.

Either way, stuff happens. Figuring out your actual requirements around uptime, time to response, and time to resolution is important before you build a nine-nines solution when eight eights is sufficient. :p

kapone 9/1/2025||
> It's putting a longer server in the rack above/below yours and pushing the power plug out of the back of your server

Are you serious? Have you ever built/operated/wired rack scale equipment? You think the power cables for your "short" server (vs the longer one being put in) are just hanging out in the back of the rack?

Rack wiring has been done and done correctly for ages. Power cables on one side (if possible), data and other cables on the other side. These are all routed vertically and horizontally, so they land only on YOUR server.

You could put a Mercedes Maybach above/below your server and nothing would happen.

toast0 9/1/2025||
Yes I'm serious. My managed host took several of our machines offline when racking machines under/over ours. And they said it was because the new machines were longer and knocked out the power cables on ours.

We were their largest customer and they seemed honest even when they made mistakes that seemed silly, so we rolled our eyes and moved on with life.

Managed hosting means accepting that you can't inspect the racks and chide people for not cabling to your satisfaction. And mistakes by the managed host will impact your availability.

kapone 9/2/2025||
I hope that "managed host" got fired in a heartbeat and you moved elsewhere. Because they don't know WTF they're doing. As simple as that.
toast0 9/2/2025||
We did eventually move elsewhere because of acquisition. Of course those guys didn't even bother to run LACP and so our systems would regularly go offline for a bit whenever someone wanted to update a switch. I was a lot happier at the host that sometimes bumped the power cables.

Firing a host where you've got thousands of servers is easier said than done. We did do a quote exercise with another provider that could have supported us, and it didn't end up very competitive ... and it wouldn't have been worth the transition. Overall, there were some derpy moments, but I don't think we would have been happier anywhere else, and we didn't want to rent cages and run our own servers.

icedchai 8/31/2025|||
It's unlikely, but it happens. In the mid-2000s I had some servers at a colo. They were doing electrical work and took out power to a bunch of racks, including ours. Those environments are not static.
Aeolun 9/1/2025||||
In my experience, my personal services have gone down exactly zero times. Actually, that's not entirely true: every time they stopped working, the servers had simply run out of disk space.

The number of production incidents on our corporate mishmash of Lambda, ECS, RDS, Fargate, EC2, EKS, etc.? It's a good week when something doesn't go wrong. Somehow the logging setup is better on the personal stuff too.

talles 8/31/2025||||
I have also seen the opposite somewhat frequently: some team screws up the server, and unrelated stable services that have been running since forever (on the same server) are now affected by the messed-up environment.
jeffrallen 8/31/2025|||
Not to mention the other leading cause of outages: UPSes.

Sigh.

icedchai 8/31/2025||
UPSes always seem to have strange failure modes. I've had a couple fail after a power failure. The batteries died and they wouldn't come back up automatically when the power came back. They didn't warn me about the dead battery until after...
sgarland 9/1/2025||
That’s why they have self-tests. Learned that one the hard way myself.
icedchai 9/1/2025||
My UPS was supposedly "self testing" itself periodically and it still happened!
sgarland 9/1/2025||
Oof, sorry.
ies7 9/1/2025|||
The last 4-5 years taught me that the single point of failure where I can't do a thing is most often Cloudflare, not my on-premise servers.
joek1301 8/31/2025|||
Related: https://brooker.co.za/blog/2024/06/04/scale.html
lelanthran 9/1/2025|||
> Don't forget the cost of managing your one big server

Is that more than, less than, or about the same as having an AWS/Azure/GCP consultant?

What's the difference in labour per hour?

> the risk of having such a single point of failure.

At the prices they charge, I can have two hot failovers in two other datacenters and still come out ahead.

wmf 8/31/2025|||
Don't forget to read the article.
chrisweekly 8/31/2025|||
I'll take a (lone) single point of failure over (multiple) single points of failure.
juped 9/1/2025|||
The predictable cost, you mean, making business planning way easier? And you usually have two, because sometimes kernels do panic or whatever.
justmarc 8/31/2025||
AWS has also been a single point of failure multiple times in history, and there's no reason to believe this will never happen again.
suriya-ganesh 9/1/2025||
I'm a big-server proponent myself, but usually, for one reason or another, there's a need to introduce some socket-style communication to the frontend, and that becomes impossible on a single machine after a certain threshold.

Is there something obvious that I'm missing?

winrid 9/1/2025||
I've had 100k+ users connected to mid-range Linode boxes. Do you have that many?

Even still, at that point you just round-robin to a set of big machines. Easy.

MitPitt 9/1/2025||
Load Balancing, Redundancy and Fail-Over
jiggawatts 9/1/2025||
I'm in the process of breaking up a legacy deployment on "one big server" into something cloud native like Kubernetes.

The problem with one big server is that few customers have ONE (1) app that needs that much capacity. They have many small apps that add up to that much capacity, but that's a very different scenario with different problems and solutions.

For example, one of the big servers I'm in the process of teasing apart has about 100 distinct code bases deployed to it, written by dozens of developers over decades.

If any one of those apps gets hacked and this is escalated to a server takeover, the other 99 apps get hacked too. Some of those apps deal with PII or transfer money!

Because a single big server uses a single shared IP address for outbound comms[1] this means that the firewall rules for 100 apps end up looking like "ALLOW: ANY -> ANY" for two dozen protocols.

Because upgrading anything system-wide on the One Big Server is a massive Big Bang Change, nobody has had the bravery to put their hand up and volunteer for this task. Hence it has been kept alive running 13-year-old platform components, because 2 or 3 of the 100 apps might need some of those components... but nobody knows which two or three apps those are, because testing this is also big-bang and would need all 100 apps tested at once.

It actually turned out that even Two Big (old) Servers in an HA pair aren't quite enough to run all of the apps, so they're being migrated to newer and better Azure VMs.

During the interim migration phase, instead of Two Big Servers there are Four Big Servers... in PRD. And then four more in TST, etc... Each time a SysOps person deploys a new server somewhere, they have to go tell each of the dozens of developers where they need to deploy their apps today.

Don't think DevOps automation will rescue you from this problem! For example in Azure DevOps those 100 apps have 100 projects. Each project has 3 environments (=300 total) and each of those would need a DevOps Agent VM link to the 2x VMs = 600 VM registrations to keep up to date. These also expire every 6 months!

Kubernetes, Azure App Service, AWS App Runner, and GCP App Engine serve a purpose: They solve these problems.

They provide developers with a single stable "place" to dump their code even if the underlying compute is scaled, rebuilt, or upgraded.

They isolate tiny little apps but also allow the compute to be shared for efficient hosting.

They provide per-app networking and firewall rules.

Etc...

[1] It's easy to bind distinct ingress IP addresses on even a single NIC (or multiple), but it's weirdly difficult to split the outbound path. Maybe this is easier on Linux, but on Windows and IIS it is essentially impossible.
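(For what it's worth, on Linux a process can at least choose its outbound source address per connection by binding the socket before connect(). A minimal Python sketch with made-up addresses:)

    import socket

    # Assumes 203.0.113.10 is one of several addresses configured on
    # this host's NIC. Binding before connect() fixes the source IP
    # that the remote end (and any firewall) will see.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("203.0.113.10", 0))        # port 0 = any ephemeral port
    sock.connect(("198.51.100.20", 443))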

mystifyingpoi 9/1/2025|
Finally, someone said it.

> 100 distinct code bases deployed to it

I've worked at a company where the owner would spend money on anything except hosting. The admin guy would end up deploying a new app on whatever VPS had the most RAM free at the time.

Ironically, consolidating this mess onto "one big server", which was my ungrateful job for many months, fixed many issues. Though it was done by slicing the host into tiny KVM virtual machines.

jiggawatts 9/1/2025||
> slicing the host into tiny KVM virtual machines.

That's my other option: a bunch of Azure VM Scale Sets using the tiniest size that will run Windows Server, such as B2as_v2. A handful of closely related apps on each so that firewall rules can be restricted to something sane. Shared Azure Files for the actual app deployments so that devs never need to know the VM names. However, this feels an awful lot like reinventing Kubernetes... but worse.

My job would be sooo much simpler if Microsoft just got off their high horse and supported their own Active Directory in App Service instead of pretending it no longer exists.

SatvikBeri 8/31/2025||
A lot of these articles look at on-demand pricing for AWS. But you're rarely paying on-demand prices 24/7. If you have a stable workload, you probably buy reserved instances or a compute savings plan. At larger scales, you use third party services to get better deals with more flexibility.

A while back I looked into renting hardware, and found that we would save about 20% compared to what we actually paid AWS – partially because location and RAM requirements made the rental more expensive than anticipated, and partially because we were paying a lot less than on-demand prices for AWS.

20% is still significant, but it's a lot less than the ~80% that this and other articles suggest.

vidarh 9/1/2025|
This is usually only true if you lift and shift your AWS setup exactly as-is, instead of looking at what hardware will run your setup most efficiently.

The biggest cost with AWS also isn't compute, but egress - for bandwidth heavy setups you can sometimes finance the entirety of the servers from a fraction of the savings in egress.

I cost-optimize setups, with fees guaranteed to be capped at a proportion of the savings, a lot of the time, and I've yet to see a setup where we couldn't cut the cost far more than that.

SatvikBeri 9/1/2025||
I'd definitely be curious to hear how you'd approach our overall situation. We don't have significant egress costs, nor has any place I've worked with before. Our AWS costs are about 80% EC2 and Fargate, with the rest scattered over various services. Roughly half our spend is on 24/7 reserved instances, while the other half is in bursty analytics workloads.

Our workloads are primarily memory-bound, and AWS offers pretty good options there, e.g. x2gd instances have 16GB RAM per vCPU, while most rental options we found were much more CPU-focused (and charged for it).

Nextgrid 9/1/2025|||
> while most rental options we found were much more CPU-focused

Out of curiosity, have you benchmarked it? I find that AWS "vCPUs" are significantly slower than a core (or even a hyperthread) of a real CPU, and this constrains memory bandwidth too. A single bare-metal box can often replace many EC2 instances.

Another thing to consider is easy access to persistent NVMe drives, something not possible on AWS. Yes, you still need backups, but ideally you will only need those backups once a year or less. I've dealt with extremely complex and expensive solutions on AWS that could be trivially solved by just one persistent machine with NVMe drives (+ a spare for redundancy). Having the data there persistently (at a cheap price per GB) means you avoid having to shuffle data around, or can precompute derived data to speed up lookups at runtime.

If you're actually serious about exploring options to move your infra to bare-metal or hybrid feel free to reach out for a no-obligations call; email in my profile. It seems like you've already optimized it quite well so I'd be curious to see if there is still room for improvement. (Or if you don’t mind, share what your stack is and let others chip in too!)

vidarh 9/3/2025|||
I don't know what to say to this. X2gd instances are horrifically expensive - if you haven't trivially found cheaper machines elsewhere, there are essential details of your requirements you're leaving out.
Havoc 8/31/2025||
>Unfortunately, since all of your services run on servers (whether you like it or not), someone in that supply chain is charging you based on their peak load.

This seems fundamentally incorrect to me? If I need 100 units of peak compute during 8 working hours, I get that from Big Cloud, and they have two other clients needing the same in offset timezones, then in theory the aggregate cost is 1/3 of everyone buying their own peak needs.

Whether big cloud passes on that saving is another matter, but it's there.

i.e. big cloud throws enough small customers together that it doesn't have a "peak" per se, just a pretty noisy average load that is, in aggregate, mostly stable.
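Formally, pooling works because capacity is sized by the peak of the sum rather than the sum of the peaks. In the idealized case above, with three perfectly offset 8-hour peaks of 100 units and d_i(t) as customer i's demand at time t:

    \max_t \sum_{i=1}^{3} d_i(t) = 100
    \qquad \text{vs} \qquad
    \sum_{i=1}^{3} \max_t d_i(t) = 300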

vidarh 9/1/2025||
But they generally don't. Most people don't have large enough daily fluctuations for these demand curves to flatten out enough. And the providers also need enough capacity to handle unforeseen spikes. Which is also why none of them will let you scale however far you want - they still impose limits so they can plan the excess they need.
Havoc 9/1/2025||
> And the providers also need enough capacity to handle unforeseen spikes.

Indeed, but the headroom the cloud needs overall is less than every customer's individual worst-case scenario added up. They'd take a percentage of that total, because statistically a situation where 100% of customers are at 100% of their peak at 100% the same point in time is improbable.

Must admit I'm a little surprised this logic isn't self-evident.

vidarh 9/3/2025||
The logic isn't self-evident because it's irrelevant: even though their total demand doesn't add up this way, the unit cost of the capacity is higher, and so the cost still ends up being far higher.

The unit cost is higher for many reasons, but the two basic ones are margins (exorbitant ones; this is not an efficient market) and the fact that providers also need to charge for the unused capacity held to meet demand from customers behaving in ways they don't on fixed-capacity systems, where spreading workloads over time wherever possible tends to become the norm.

The demand curves when you're charged for total demand over time rather than peak demand are fundamentally different, and so while you're right that the peaks rarely add up in the worst possible way, empirically the peaks end up high enough that cloud compute is expensive even before the exorbitant margins the large cloud providers charge.

namibj 9/1/2025||
In which cloud can I book a machine with a guaranteed (up to the general uptime SLA) end/termination time that's fixed for both parties?
qaq 8/31/2025|
And now consider: 6th-gen EPYC will have 256 cores, and you can have 32 hot-swap SSDs with 10 million+ random write IOPS and 60 million+ random read IOPS in a single 2U box.