Posted by Torq_boi 1 day ago
1 - Cloud – Minimises cap-ex, hiring, and risk, while largely maximising operational cost (it's expensive) and cost variability (usage-based).
2 - Managed Private Cloud - What we do. Still minimal-to-no cap-ex, hiring, risk, and medium-sized operational cost (around 50% cheaper than AWS et al). We rent or colocate bare metal, manage it for you, handle software deployments, deploy only open-source, etc. Only really makes sense above €$5k/month spend.
3 - Rented Bare Metal – Let someone else handle the hardware financing for you. Still minimal cap-ex, but with greater hiring/skilling and risk. Around 90% cheaper than AWS et al (plus time).
4 - Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills, scale, cap-ex, and if you plan to run the servers for at least 3-5 years.
A good provider for option 3 is someone like Hetzner. Their internal ROI on server hardware seems to be around the 3-year mark, after which I assume the machine is either still running with a client or goes into their server auction system.
Options 3 & 4 generally become more appealing either at scale, or when infrastructure is part of the core business. Option 1 is great for startups who want to spend very little initially, but then grow very quickly. Option 2 is pretty good for SMEs with baseline load, regular-sized business growth, and maybe an overworked DevOps team!
[0] https://lithus.eu, adam@
At the core of this are all the 'managed' services - if you have a server box, it's in your financial interest to squeeze as much performance out of it as possible. If you're using something like ECS or serverless, AWS gains nothing by optimizing the servers to make your code run faster - their hard work results in fewer billed infrastructure hours.
This 'microservices' push usually means that instead of having an on-server session where you can serve stuff from a temporary cache, all the data that persists between requests needs to be stored in a db somewhere, all the auth logic needs to re-check your credentials, and something needs to direct the traffic and load-balance these endpoints - and all of this costs money.
I think if you have 4 Java boxes as servers with a redundant DB with read replicas on EC2, your infra is so efficient and cheap that even paying 4x for it rather than going for colocation is well worth it because of the QoL and QoS.
These crazy AWS bills usually come from using every service under the sun.
1) Senior engineer starts on AWS
2) Senior engineer leaves because our industry does not value longevity or loyalty at all whatsoever (not saying it should, just observing that it doesn't)
3) New engineer comes in and panics
4) Ends up using a "managed service" to relieve the panic
5) New engineer leaves
6) Second new engineer comes in and not only panics but outright needs help
7) Paired with some "certified AWS partner" who claims to help "reduce cost" but who actually gets a kickback from the extra spend they induce (usually 10% if I'm not mistaken)
Calling it ransomware is obviously hyperbolic, but there are definitely some parallels one could draw.
On top of it all, AWS pricing is about to go up massively due to the RAM price increase. There's no way it won't, since AWS is over half of Amazon's profit while only around 15% of its revenue.
In theory with perfect documentation they’d have a good head start to learn it, but there is always a lot of unwritten knowledge involved in managing an inherited setup.
With AWS the knowledge is at least transferable and you can find people who have worked with that exact thing before.
Engineers also leave for a lot of reasons. Even highly paid engineers go off and retire, change to a job for more novelty, or decide to try starting their own business.
Unfortunately, there are a lot of things in AWS that can also be messed up, so it might be really hard to work out what is going on. For example, you could have hundreds of Lambdas running with no idea where the original sources are or how they connect to each other, or complex VPC network routing where rules and security groups are shared randomly between services, so a small change can degrade a completely different service (you were hired to help with service X, but after your change some service Y went down that you weren't even aware existed).
"Today, we are going to calculate the power requirements for this rack, rack the equipment, wire power and network up, and learn how to use PXE and iLO to get from zero to operational."
Part of what clouds are selling is experience. A "cloud admin" bootcamp graduate can be a useful "cloud engineer", but it takes some serious years of experience to become a talented on-prem SRE. So it becomes an ouroboros: moving towards clouds makes it easier to move to the clouds.
If by useful you mean "useful at generating revenue for AWS or GCP" then sure, I agree.
These certificates and bootcamps are roughly equivalent to the Cisco CCNA certificate and training courses back in the 90's. That certificate existed to sell more Cisco gear - and Cisco outright admitted this at the time.
That is not true. It takes a lot more than a bootcamp to be useful in this space, unless your definition is to copy-paste some CDK without knowing what it does.
But will the market demand it? AWS just continues to grow.
The number of things that these 24x7 people from AWS will cover for you is small. If your application craps out for any number of reasons that doesn't have anything to do with AWS, that is on you. If your app needs to run 24x7 and it is critical, then you need your own 24x7 person anyway.
Meanwhile AWS breaks once or twice a year.
I've only had one outage I could attribute to running on-prem, meanwhile it's a bit of a joke with the non-IT staff in the office that when "The Internet" (i.e. Cloudflare, Amazon) goes down with news reports etc our own services are all running fine.
Youngsters nowadays start with very polished interfaces and smartphones, so even if the cloud wasn't there it would take them a decade to learn systems design on-the-job, which means it wouldn't happen anyway for most. The cloud nowadays mostly exists because of that dearth of system internals knowledge.
While there are still people around who are able to design from scratch and operate outside a cloud, these people tend to be quite expensive, and many (most?) tend to work for the cloud companies themselves or SaaS businesses, which means there's a great mismatch between demand and supply of experienced system engineers, at least at the salaries that lower-tier companies are willing to pay. And this is only going to get worse. Every year, many more experienced engineers retire than noobs start on the path of systems engineering.
I am sure it happens a multitude of ways but I have never seen the case you are describing.
> 4) Ends up using a "managed service" to relieve the panic
It's not as though this is unique to cloud.
I've seen multiple managers come in and introduce some SaaS because it fills a gap in their own understanding and abilities. Then when they leave, everyone stops using it and the account is cancelled.
The difference with cloud is that it tends to be more central to the operation, so can't just be canceled when an advocate leaves.
What do you think RedHat support contracts are? This situation exists in every technology stack in existence.
I'll give you an alternative scenario, which IME is more realistic.
I'm a software developer, and I've worked at several companies, big and small and in-between, with poor to abysmal IT/operations. I've introduced and/or advocated cloud at all of them.
The idea that it's "more expensive" is nonsense in these situations. Calculate the cost of the IT/operations incompetence, and the cost of the slowness of getting anything done, and cloud is cheap.
Extremely cheap.
Not only that, it can increase shipping velocity, and enable all kinds of important capabilities that the business otherwise just wouldn't have, or would struggle to implement.
Much of the "cloud so expensive" crowd are just engineers too narrowly focused on a small part of the picture, or in denial about their ability to compete with the competence of cloud providers.
This has been my experience as well. There are legitimate points of criticism, but every time I’ve seen someone try to make that argument it’s been comparing significantly different levels of service (e.g. a storage comparison equating S3 with tape) or leaving out entire categories of cost - like the time someone tried to say their bare-metal costs for a two-server database cluster were comparable to RDS despite not even including things like power or backups.
Also, if the cloud systems are architected properly before IT gets hold of them, then they tend to retain their good properties for a long time, especially if others are paying attention to e.g. gitops pull requests.
My current company ended up replacing its (small) operations team in order to get people with cloud expertise. We hired the new team for the skills we needed. It's worked out well.
As far as I know, nothing comes close to Aurora functionality. Even in vibecoding world. No, 'apt-get install postgres' is not enough.
What you’re asking for can mostly be pieced together, but no, it doesn’t exist as-is.
Failover: this has been a thing for a long time. Set up a synchronous standby, then add a monitoring job that checks heartbeats and promotes the standby when needed. Optionally use something like heartbeat to have a floating IP that gets swapped on failover, or handle routing with pgbouncer / pgcat etc. instead. Alternatively, use pg_auto_failover, which does all of this for you (a quick sketch follows below).
Clustering: you mean read replicas?
Volume-based snaps: assuming you mean CoW snapshots, that’s a filesystem implementation detail. Use ZFS (or btrfs, but I wouldn’t, personally). Or Ceph if you need a distributed storage solution, but I would definitely not try to run Ceph in prod unless you really, really know what you’re doing. Lightbits is another solution, but it isn’t free (as in beer).
Cross-region replication: this is just replication? It doesn’t matter where the other node[s] are, as long as they’re reachable, and you’ve accepted the tradeoffs of latency (synchronous standbys) or potential data loss (async standbys).
Metrics: Percona Monitoring & Management if you want a dedicated DB-first, all-in-one monitoring solution, otherwise set up your own scrapers and dashboards in whatever you’d like.
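To make the failover item above concrete, here's a minimal sketch with pg_auto_failover - hostnames, paths, and auth settings are placeholders, and a real setup needs proper TLS and pg_hba rules:

    # Monitor node: tracks health and decides when to promote a standby.
    pg_autoctl create monitor --hostname monitor.internal \
        --pgdata /var/lib/postgresql/monitor --auth scram-sha-256 --ssl-self-signed

    # On each Postgres node: register with the monitor; the second node
    # joins as a standby (synchronous by default).
    pg_autoctl create postgres --hostname pg1.internal \
        --pgdata /var/lib/postgresql/data \
        --monitor 'postgres://autoctl_node@monitor.internal/pg_auto_failover' \
        --auth scram-sha-256 --ssl-self-signed

    # Keep it running (or use the bundled systemd service); it handles
    # heartbeats, promotion, and demoting the old primary on failover.
    pg_autoctl run --pgdata /var/lib/postgresql/data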
What you will not get from this is Aurora’s shared cluster volume. I personally think that’s a good thing, because I think separating compute from storage is a terrible tradeoff for performance, but YMMV. What that means is you need to manage disk utilization and capacity, as well as properly designing your failure domain. For example, if you have a synchronous standby, you may decide that you don’t care if a disk dies, so no messing with any kind of RAID (though you’d then miss out on ZFS’ auto-repair from bad checksums). As long as this aligns with your failure domain model, it’s fine - you might have separate physical disks, but co-locate the Postgres instances in a single physical server (…don’t), or you might require separate servers, or separate racks, or separate data centers, etc.
tl;dr you can fairly closely replicate the experience of Aurora, but you’ll need to know what you’re doing. And frankly, if you don’t, then even if someone built an OSS product that does all of this, you shouldn’t be running it in prod - how will you fix issues when they crop up?
Nobody doubts one could build something similar to Aurora given enough budget, time, and skills.
But that's not replicating the experience of Aurora. The experience of Aurora is I can have all of that, in like 30 lines of terraform and a few minutes. And then I don't need to worry about managing the zpools, I don't need to ensure the heartbeats are working fine, I don't need to worry about hardware failures (to a large extent), I don't need to drive to multiple different physical locations to set up the hardware, I don't need to worry about handling patching, etc.
You might replicate the features, but you're not replicating the experience.
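To be fair to that point, the provisioning side really is tiny. A rough sketch with the AWS CLI (identifiers, instance class, and credentials are placeholders; a real stack also needs subnet groups, security groups, parameter groups, etc.):

    # Cluster: storage, replication, and backups are Amazon's problem.
    aws rds create-db-cluster \
        --db-cluster-identifier myapp-prod \
        --engine aurora-postgresql \
        --master-username dbadmin \
        --master-user-password "$DB_PASSWORD"

    # Writer instance; add more for readers / failover targets.
    aws rds create-db-instance \
        --db-instance-identifier myapp-prod-1 \
        --db-cluster-identifier myapp-prod \
        --db-instance-class db.r6g.large \
        --engine aurora-postgresql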
Managed services have a clear value proposition. I personally think they're grossly overpriced, but I understand the appeal. Asking for that experience but also free / cheap doesn't make any sense.
If ECS is faster, then you're more satisfied with AWS and less likely to migrate. You're also open to additional services that might bring up the spend (e.g. ECS Container Insights or X-Ray)
Source: Former Amazon employee
We used EFS to solve that issue, but it was very awkward, expensive and slow; it's certainly not meant for that.
Microservices are a killer on cost. For each microservice pod you're often running a bunch of sidecars - Datadog, auth, ingress - and you pay massive workload-separation overhead in orchestration, management, monitoring, and of course complexity.
I am just flabbergasted that this is how we operate as a norm in our industry.
My biggest gripe with this is async tasks, where the app goes through numerous hijinks to avoid a 10-minute Lambda processing timeout. Rather than structuring the work into many small, independent batches, or simply using a modest container to do the job in a single shot, a myriad of intermediate steps are introduced to write data to Dynamo/S3/Kinesis + SQS and coordinate it all.
A dynamically provisioned, serverless container with 24 cores and 64 GB of memory can happily process GBs of data transformations.
If you can keep 4 "Java boxes" fed with work 80%+ of the time, then sure EC2 is a good fit.
We do a lot of batch processing and save money over having EC2 boxes always on. Sure we could probably pinch some more pennies if we managed the EC2 box uptime and figured out mechanisms for load balancing the batches... But that's engineering time we just don't really care to spend when ECS nets us most of the savings advantage and is simple to reason about and use.
It runs slower than a bloated pig, especially on a shared hosting node, so it now needs Kubernetes and cloud orchestration to make it “scalable” beyond a few requests per second.
You don’t need colocation to save 4x though. Bandwidth pricing is 10x. EC2 is 2-4x especially outside US. EBS for its iops is just bad.
[0] https://carolinacloud.io, derek@
So in practice cloud has become the more expensive option the second your spend goes over the price of 1 engineer.
I see it from the other direction, when if something fails, I have complete access to everything, meaning that I have a chance of fixing it. That's down to hardware even. Things get abstracted away, hidden behind APIs and data lives beyond my reach, when I run stuff in the cloud.
Security and regular mistakes are much the same in the cloud, but I then have to layer whatever complications the cloud provider comes with on top. The cost has to be much, much lower if I'm going to trust a cloud provider over running something in my own data center.
The main benefit of outsourcing to aws etc is that the CEO isn't yelling at you when it breaks, because their golf buddies are in the same situation.
We figured, "Okay, if we can do this well, reliably, and de-risk it; then we can offer that as a service and just split the difference on the cost savings"
(plus we include engineering time proportional to cluster size, and also do the migration on our own dime as part of the de-risking)
Expect a significant exit expense, though, especially if you are shifting large volumes of S3 data. That's been our biggest expense. I've moved this to Wasabi at about 8 euros a month (vs about $70-80 a month on S3), but I've paid transit fees of about $180 - and it was more expensive because I used DataSync.
Retrospectively, I should have just DIYed the transfer, but maybe others can benefit from my error...
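For anyone in the same boat, the DIY route is roughly an rclone sync between the two buckets (remote names are whatever you configured in rclone; tune the transfer counts to taste):

    # One-off copy from S3 to Wasabi; both remotes use rclone's S3 backend.
    rclone sync s3:my-bucket wasabi:my-bucket --transfers 32 --checkers 64 --progress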
https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
But. Don't leave it until the last minute to talk to them about this. They don't make it easy, and require some warning (think months, IIRC)
Hopefully someone else will benefit from this helpful advice.
Out of interest, how old are you? This was quite a normal expectation of a technical department even 15 years ago.
It’s not rocket science, especially when you’re talking about small amounts of data (small credit union systems in my example).
Even at their peak, Heroku was a niche. If you’d gone to conferences like WWDC or PyCon at the time they’d be well represented, yes, and plenty of people liked them, but it wasn’t a secret that they didn’t cover everyone’s needs or that the pricing was off-putting for many people - and that tended to go up the bigger the company you talked to, because larger organizations have more complex needs and use enough stuff that they already have teams of people with those skills.
Again, 15 years ago, even in moderately large organizations, it was quite common for a product engineer not to be responsible for provisioning all the required services for whatever they were building. And again, it’s not the rule, but it is far from being an exception. Not sure what you’re trying to prove or disprove.
I find it equally disingenuous to suggest that Heroku was only for startups with lavish budgets. Absolutely not true. That’s my only purpose here. Everyone has different experiences but don’t go and push your own narrative as the only one especially when it’s not true.
The world's a lot bigger than startups
Your original statement is factually incorrect.
It's 2026 and banks are still running their mainframe, running windows VMs on VMware and building their enterprise software with Java.
The big boys still have their own datacenters they own.
Sure, they try dabbling with cloud services, and maybe they've pushed their edge out there, and some minor services they can afford to experiment with.
See, turning up a VM, installing and running Postgres is easy.
The hard part is keeping it updated, keeping the OS updated, automating backups, deploying replicas, encrypting the volumes and the backups, demonstrating all of the above to a third-party auditor... and mind that there might be many other things I'm honestly not aware of!
I'm not saying I won't go that path, it might be a good idea after a certain scale, but in the first and second year of a startup your mind should 100% be on "How can I make my customer happy" rather than "We failed again the audit, we won't have the SOC 2 Type I certification in time to sign that new customer".
If deciding between Hetzner and AWS was so easy, one of them might not be pricing its services correctly.
Also, just the availability of these things on AWS has been a real pain - I think every startup got a lot of credits there, so there's a flood of people trying to use them.
One point to keep in mind is that the effort is not constant. Once you reach a certain level of competency and stability in your setup, there is not much difference in time spent. I also felt that self-managed gave us more flexibility in terms of tuning.
My final point is that any investment in databases whether as a developer or as an ops person is long-lived and will pay dividends for a longer time than almost all other technologies.
Take two equivalent machines, set up with streaming replication exactly as described in the documentation, add Bacula for backups to an off-site location for point-in-time recovery.
We haven't felt the need to set up auto fail-over to the hot spare; that would take some extra effort (and is included with AWS equivalents?) but nothing I'd be scared of.
Add monitoring that the DB servers are working, replication is up-to-date and the backups are working.
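The replication-lag part of that monitoring can be a one-liner run from cron on the standby - a sketch, with the threshold and alert destination as placeholders:

    #!/usr/bin/env bash
    # Warn if the standby has not replayed anything from the primary for >60s.
    lag=$(psql -At -c "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int")
    if [ "$lag" -gt 60 ]; then
        echo "standby is ${lag}s behind" | mail -s "PostgreSQL replication lag" ops@example.com
    fi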
Same here. But, I assume you have managed PostgreSQL in the past. I have. There are a large number of software devs who have not. For them, it is not a low-complexity task. And I can understand that.
I am a software dev for our small org and I run the servers and services we need. I use ansible and terraform to automate as much as I can. And recently I have added LLMs to the mix. If something goes wrong, I ask Claude to use the ansible and terraform skills that I created for it, to find out what is going on. It is surprisingly good at this. Similarly I use LLMs to create new services or change configuration on existing ones. I review the changes before they are applied, but this process greatly simplifies service management.
I'd say needing to read the documentation for the first time is what bumps it up from low complexity to medium. And then at medium you should still do it if there's a significant cost difference.
But if you were in my team I'd expect you to have read at least some of the documentation for any service you provision (self-hosted or cloud) and be able to explain how it is configured, and to document any caveats, surprises or special concerns and where our setup differs / will differ from the documented default. That could be comments in a provisioning file, or in the internal wiki.
That probably increases our baseline complexity since "I pressed a button on AWS YOLO" isn't accepted. I think it increases our reliability and reduces our overall complexity by avoiding a proliferation of services.
The main pair of PostgreSQL servers we have at work each have two 32-core (64-vthread) CPUs, so I think that's 128 vCPU each in AWS terms. They also have 768GiB RAM. This is more than we need, and you'll see why at the end, but I'll be generous and leave this as the default the calculator suggests, which is db.m5.12xlarge with 48 vCPU and 192GiB RAM.
That would cost $6559/month, or less if reserved which I assume makes sense in this case — $106400 for 3 years.
Each server has 2TB of RAID disk, of which currently 1TB is used for database data.
That is an additional $245/month.
"CloudWatch Database Insights" looks to be more detailed than the monitoring tool we have, so I will exclude that ($438/month) and exclude the auto-failover proxy as ours is a manual failover.
With the 3-year upfront cost this is $115000, or $192000 for 5 years.
Alternatively, buying two of yesterday's [2] list-price [3] Dell servers which I think are close enough is $40k with five years warranty (including next-business-day replacement parts as necessary).
That leaves $150000 for hosting, which as you can see from [4] won't come anywhere close.
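Recapping that arithmetic (rounded figures as above):

    # RDS reserved 3-year + storage, then extrapolated to 5 years
    echo $(( 106400 + 245 * 36 ))               # ~ $115,000 over 3 years
    echo $(( 106400 * 60 / 36 + 245 * 60 ))     # ~ $192,000 over 5 years
    # Two Dell servers with a 5-year warranty at ~$40k leaves
    echo $(( 192000 - 40000 ))                  # ~ $150,000 for hosting over 5 years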
We overprovision the DB server so it has the same CPU and RAM as our processing cluster nodes — that means we can swap things around in some emergency as we can easily handle one fewer cluster node, though this has never been necessary.
When the servers are out of warranty, depending on your business, you may be able to continue using them for a non-prod environment. Significant failures are still very unusual, but minor failures (HDDs) are more common and something we need to know how to handle anyway.
[1] https://calculator.aws/#/createCalculator/RDSPostgreSQL
[2] https://news.ycombinator.com/item?id=46899042
[3] There are significant discounts if you order regularly, buy multiple servers, or can time purchases when e.g. RAM is cheaper.
I think if it were true that the tuning is easier if you run the infrastructure yourself, then this would be a good point. But in my experience, this isn't the case for a couple reasons. First of all, the majority of tuning wins (indexes, etc.) are not on the infrastructure side, so it's not a big win to run it yourself. But then also, the professionals working at a managed DB vendor are better at doing the kind of tuning that is useful on the infra side.
I can see how the cost savings could justify that, but I think it makes sense to bias toward avoiding investing in things that are not related to the core competency of the business.
With a managed solution, all of that is amortized into your monthly payment, and you're sharing the cost of it across all the customers of the provider of the managed offering.
Personally, I would rather focus on things that are in or at least closer to the core competency of our business, and hire out this kind of thing.
I didn't include labour costs, but the self-hosted tasks (set up of hardware, OS, DB, backup, monitoring, replacing a failed component which would be really unusual) are small compared to the labour costs of the DB generally (optimizing indices, moving data around for testing etc, restoring from a backup).
This part is actually the scariest, since there are like 10 different 3rd-party solutions of unknown stability and maintainability.
AWS charge about $500/month for this, so there's plenty of room to pay a consultant and still come out way ahead.
The flip side is that compliance is a little more involved. Rather than, say, carve out a whole swathe of SOC-2 ops, I have to coordinate some controls. It's not a lot, and it's still a lot lighter than I used to do 10+ years ago. Just something to consider.
The cloud is great when you just need to start and when you do not know what scale you will need. Minimal initial cost and no wasted time over planning things you do not know enough about.
The cloud is horrible for steady-state demand. You are over-paying for your base load. If your demand does not scale that much, you do not benefit from the flexibility. Distance from the edge can cause performance problems. In an effort to “save money” you will chase complexity and bake in total reliance on your cloud provider.
Starting with the cloud makes sense. Just make sure not to engineer a solution you cannot take somewhere else.
As you scale and demand becomes known, you can start to migrate some stuff on premises or to other managed providers.
The great thing about “cloud architecture” is that you can use a hybrid model. You can selectively move parts of the stack. You can host your baseline demand and still rely on the cloud for scalability.
Where you need to spend the money and gain the expertise is in design. Not a giant features waterfall but rather knowing how to build an application and infrastructure that is adaptable and portable as you scale.
Keep it simple but also keep it modular.
At least, that had been my experience.
- 2x Intel Xeon 5218
- 128GB RAM
- 2x960GB SSD
- 30TB monthly bandwidth
I pay around an extra $200/month for "premium" support and Acronis backups, both of which have come in handy, but are probably not necessary. (Automated backups to AWS are actually pretty cheap.) It definitely helps with peace of mind, though.
I have set up encrypted backups to go to my backup server in the office. We have a gigabit service at the office. Critical data changes are backed up every hour and a full backup runs once a day.
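That kind of schedule is easy to wire up with cron plus an encrypting backup tool - a sketch assuming restic over SSH to the office box (paths, host, and retention are placeholders):

    # /etc/cron.d/offsite-backup
    RESTIC_PASSWORD_FILE=/root/.restic-pass
    0 * * * *  root  restic -r sftp:backup@office.example:/srv/backups backup /srv/critical --tag hourly
    30 2 * * * root  restic -r sftp:backup@office.example:/srv/backups backup /srv --tag daily
    0 4 * * *  root  restic -r sftp:backup@office.example:/srv/backups forget --keep-hourly 24 --keep-daily 14 --prune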
There is a world of difference between renting some cabinets in an Equinix datacenter and operating your own.
5 - Datacenter (DC) - Like 4, except also take control of the space/power/HVAC/transit/security side of the equation. Makes sense either at scale, or if you have specific needs. Specific needs could be: specific location, reliability (higher or lower than a DC), resilience (conflict planning).
There are actually some really interesting use cases here. For example, reliability: If your company is in a physical office, how strong is the need to run your internal systems in a data centre? If you run your servers in your office, then there's no connectivity reliability concerns. If the power goes out, then the power is out to your staff's computers anyway (still get a UPS though).
Or perhaps you don't need as high reliability if you're doing only batch workloads? Do you need to pay the premium for redundant network connections and power supplies?
If you want your company to still function in the event of some kind of military conflict, do you really want to rely on fibre optic lines between your office and the data center? Do you want to keep all your infrastructure in such a high-value target?
I think this is one of the more interesting areas to think about, at least for me!
Offices are usually very expensive real estate in city centers and with very limited cooling capabilities.
Then again the US is a different place, they don't have cities like in Europe (bar NYC).
Thank goodness we did all the capex before the OpenAI ram deal and expensive nvidia gpus were the worst we had to deal with.
It sounds like they probably have revenue in the €500mm range today. And given that the bare metal cost of AWS-equivalent bills tends to be a 90% reduction, we'll say a €10mm+ bare metal cost.
So I would say a cautious and qualified "yes". But I know even for smaller deployments of tens or hundreds of servers, they'll ask you what the purpose is. If you say something like "blockchain," they're going to say, "Actually, we prefer not to have your business."
I get the strong impression that while they naturally do want business, they also aren't going to take a huge amount of risk on board themselves. Their specialism is optimising on cost, which naturally has to involve avoiding or mitigating risk. I'm sure there'd be business terms to discuss, put it that way.
(While we’re all speculating)
I wouldn't be surprised if mining is also associated with fraud (e.g. using stolen credit cards to buy compute).
Netflix might be spending as much as $120m (but probably a little less), and I thought they were probably Amazon's biggest customer. Does someone (single-buyer) spend more than that with AWS?
Hetzner's revenue is somewhere around $400m, so it'd probably be a little scary taking on an additional 30% revenue from a single customer, and Netflix's shareholders would probably be worried about the risk of relying on a vendor that is much smaller than they are.
Sometimes, if the companies are friendly to the idea, they could form a joint venture, or maybe Netflix could just acquire Hetzner (and compete with Amazon?), but I think it unlikely Hetzner could take on a Netflix-sized customer, for non-technical reasons.
However, increasing PoP capacity by 30% within 6 months is pretty realistic, so I think they'd probably be able to physically service Netflix without changing too much, if management could get comfortable with the idea.
I'm not convinced.
I assume someone at Netflix has thought about this, because if that were true and as simple as you say, Netflix would simply just buy Hetzner.
I think there lots of reasons you could have this experience, and it still wouldn't be Netflix's experience.
For one, big applications tend to get discounts. A decade ago, I (the company I was working for) was paying Amazon a mere $0,2M a month and getting much better prices from my account manager than were posted on the website.
There are other reasons (mostly from my own experiences pricing/costing big applications, but also due to some exotic/unusual Amazon features I'm sure Netflix depends on) but this is probably big enough: Volume gets discounts, and at Netflix-size I would expect spectacular discounts.
I do not think we can estimate the factor better than 1.5-2x without a really good example/case-study of a company someplace in-between: How big are the companies you're thinking about? If they're not spending at least $5m a month I doubt the figures would be indicative of the kind of savings Netflix could expect.
When I used to compare to aws, only egress at list price costs as much as my whole infra hosting. All of it.
I would be very interested to understand why Netflix does not go the 3/4 route. I would speculate that they get more return from putting money into optimising the costs of creating original content rather than the cloud bill.
I invest in Netflix, which means I'm giving them some fast cash to grow that business.
I'm not giving them cash so that they can have cash.
If they share a business plan that involves them having cash to do X, I wonder why they aren't just taking my cash to do X.
They know this. That's why on the investors calls they don't talk about "optimising costs" unless they're in trouble.
I understand self-hosting and self-building saves money in the long-long term, and so I do this in my own business, but I'm also not a public company constantly raising money.
> When I used to compare to aws, only egress at list price costs as much as my whole infra hosting. All of it.
I'm a mere 0,1% of your spend, and I get discounts.
You would not be paying "list price".
Netflix definitely would not be.
My point is that even if I get a 20x discount on egress it's still nowhere close, since I have to buy everything else - compute and storage are more expensive, and even with 5-10x discounts from list price it's not worth it.
(Our cloud bills are in the millions as well, I am familiar with what discounts we can get)
A little scary for both sides.
Unless we're misunderstanding something I think the $100Ms figure is hard to consider in a vacuum.
I’m not surprised, but you’d think there would be some point where they would decide to build a data center of their own. It’s a mature enough company.
If you're willing to share, I'm curious who else you would describe as being in this space.
My last decade and a half or so of experience has all been in cloud services, and prior to that it was #3 or #4. What was striking to me when I went to the Lithus website was that I couldn't figure out any details without hitting a "Schedule a Call" button. This makes it difficult for me to map my experiences in using cloud services onto what Lithus offers. Can I use terraform? How does the kubernetes offering work? How do the ML/AI data pipelines work? To me, it would be nice if I could try it out in a very limited way as self-service, or at least read some technical documentation. Without that, I'm left wondering how it works. I'm sure this is a conscious decision not to do this, and for good reasons, but I thought I'd share my impressions!
We're not really that kind of product company; we're more of a services company. What we do is deploy Kubernetes clusters onto bare metal servers. That's the core technical offering. However, everything beyond that is somewhat per-client. Some clients need a lot of compute. Some clients need a custom object storage cluster. Some clients need a lot of high-speed internal networking. Which is why we prefer to have a call to figure out specifically what your needs are. But I can also see how this isn't necessarily satisfying if you're used to just grabbing the API docs and having a look around.
What we will do is take your company's software stack, migrate it off AWS/Azure/Google, and deploy it onto our new infrastructure. We will then become (or work with) your DevOps team to support you. This can be anything from containerising workloads to diagnosing performance issues to deploying a new multi-region Postgres cluster - whatever you need done on your hardware that we feel we can reasonably support. We are the ones on-call should NATS fall over at 4am.
Your team also has full access to the Kubernetes cluster to deploy to as you wish.
I think the pricing page is the most concrete thing on our website, and it is entirely accurate. If you were to phone us and say, "I want that exact hardware," we would do it for you. But the real value we also offer is in the DevOps support we provide, actually doing the migration up-front (at our own cost), and being there working with your team every week.
In my current job, I think we're honestly a bit past the phase where I would want to take on a migration to a service like yours. We already have a good team of infrastructure folks running our cloud infrastructure, and we have accepted the lock-in of various AWS managed services. So the high-touch devops support doesn't sound that useful to me (we already have people who are good at this), and replacing all the locked-in components seems unlikely to have good ROI. I think we'd be more likely to go straight to #3 if we decided to take that on to save money.
But I'll probably be a founder or early employee at a new startup again someday, and I'm intrigued by your offering from that perspective. But it seems pretty clear to me that I shouldn't call you up on day 1, because I'm going to be nowhere near $5k a month, and I want to move faster than calling someone up to talk about my needs. I want to self-serve a small amount of usage, and cloud services seem really great for that. But this is how they get you! Once you've started with a particular cloud service, it's always easiest to take on more lock-in.
At some point between these two situations, though, I can see where your offering would be great. But the decision point isn't all that clear to me. In my experience, by the time you start looking at your AWS bill and thinking "crap, that seems pretty expensive", you have better things to do than an infrastructure migration, and you have taken on some lock-in.
I do like the idea of high-touch services to solve the breaking-the-lock-in challenge! I'll certainly keep this in mind next time I find myself in this middle ground where the cloud starts feeling more expensive than it's worth, but we don't want to go straight to #3.
Unfortunately, (successful) startups can quickly get trapped in this option. If they're growing fast, everyone on the board will ask why you'd move to another option in the first place. The cloud becomes a very deep local minimum that's hard to get out of.
Is it still the cheapest after you take into account that skills, scale, cap-ex and long term lock-in also have opportunity costs?
You can get locked into cloud too.
The lock-in is not really long-term, as it's an easy option to migrate off.
It works because bare metal is about 10% the cost of cloud, and our value-add is in 1) creating a resilient platform on top of that, 2) supporting it, 3) being on-call, and 4) being or supporting your DevOps team.
This starts with us providing a Kubernetes cluster which we manage, but we also take responsibility for the services run on it. If you want Postgres, Redis, Clickhouse, NATS, etc, we'll deploy it and be SLA-on-call for any issues.
If you don't want to deal with Kubernetes then you don't have to. Just have your software engineers hand us the software and we'll handle deployment.
Everything is deployed on open source tooling, you have access to all the configuration for the services we deploy. You have server root access. If you want to leave you can do.
Our customers have full root access, and our engineers (myself included) are in a Slack channel with your engineers.
And, FWIW, it doesn't have to be Hetzner. We can colocate or use other providers, but Hetzner offer excellent bang-per-buck.
Edit: And all this is included in the cluster price, which comes out cheaper than the same hardware on the major cloud providers
You're a brave DevOps team. That would cause a lot of friction in my experience, since people with root or other administrative privileges do naughty things, but others are getting called in on Saturday afternoon.
We rent hardware and also some VPS, as well as use AWS for cheap things such as S3 fronted with Cloudflare, and SES for priority emails.
We have other services we pay for, such as AI content detection, disposable email detection, a small postal email server, and more.
We're only a small business, so having predictable monthly costs is vital.
Our servers are far from maxed out, and we process ~4 million dynamic page and API requests per day.
The core services are cheap. S3 is cheap. Dynamo is cheap. Lambda is exceedingly cheap. Not understanding these services on their own terms and failing to read the documentation can lead one to use them in highly inefficient ways.
The "cloud" isn't just "another type of server." It's another type of /service/. Every costly stack I've seen fails to accept this truth.
https://docs.hetzner.com/cloud/technical-details/faq/#what-k...
> Buy and colocate the hardware yourself – Certainly the cheapest option if you have the skills
back then this type of "skill" was abundant. You could easily get sysadmin contractors who would take a drive down to the data-center (probably rented facilities in a real-estate that belonged to a bank or insurance) to exchange some disks that died for some reason. such a person was full stack in a sense that they covered backups, networking, firewalls, and knew how to source hardware.
the argument was that this was too expensive and the cloud was better. so hundreds of thousands of SME's embraced the cloud - most of them never needed Google-type of scale, but got sucked into the "recurring revenue" grift that is SaaS.
If you opposed this mentality you were basically saying "we as a company will never scale this much" which was at best "toxic" and at worst "career-ending".
The thing is these ancient skills still exist. And most orgs simply do not need AWS type of scale. European orgs would do well to revisit these basic ideas. And Hetzner or Lithus would be a much more natural (and honest) fit for these companies.
Even some really old (2000s-era) junk I found in a cupboard at work was all hot-swap drives.
But more realistically in this case, you tell the data centre "remote hands" person that a new HDD will arrive next-day from Dell, and it's to go in server XYZ in rack V-U at drive position T. This may well be a free service, assuming normal failure rates.
Remote hands is a thing indeed. Servers also tend to be mostly pre-built nowadays by server retailers, even when buying more custom-made ones like Supermicro where you pick each component. There aren't that many parts to a generic server purchase. It's a chassis, motherboard, CPU, memory, and disks. The PSU tends to be determined by the motherboard/chassis choice, same with disk backplanes/RAID/IPMI/network/cables/ventilation/shrouds. The biggest work is in making the correct purchase, not in the assembly. Once delivered, you put on the rails, install any additional items not pre-built, put it in the rack and plug in the cables.
It baffles me that my career trajectory somehow managed to insulate me from ever having to deal with the cloud, while such esoteric skills as swapping a hot swap disk or racking and cabling a new blade chassis are apparently on the order of finding a COBOL developer now. Really?
I can promise you that large financial institutions still have datacenters. Many, many, many datacenters!
Software development isn't a typical SME however. Mike's Fish and Chips will not buy a server and that's fine.
plus, infra flexibility removes random constraints that e.g. Cloudflare Workers have
Reality is these days you just boot a basic image that runs containers
[0] Longer list here: https://github.com/alexellis/awesome-baremetal
For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues. Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
For running many slurm jobs on good servers, cloud computing is very expensive, and buying your own hardware can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case you write some more YAML and Terraform and deploy a temporary replacement in the cloud.
Another thing between is colocation, where you put hardware you own in a managed data center. It’s a bit old fashioned, but it may make sense in some cases.
I can also mention that research HPCs may be worth considering. In research, we have some of the world's fastest computers at a fraction of the cost of cloud computing. It's great as long as you don't mind not being root and having to use slurm.
I don't know about the USA, but in Norway you can run your private company's slurm AI workloads on research HPCs, though you will pay quite a bit more than universities and research institutions. But you can also have research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.
I worked at a company with two server farms (essentially a main one and a backup) in Italy, located in two different regions, and we had a total of 5 employees taking care of them.
We didn't hear about them, we didn't know their names, but we had almost 100% uptime and terrific performance.
There was one single person out of 40 developers whose main responsibility was deploys, and that's it.
It cost my company 800k euros per year to run both server farms (hardware, salaries, energy), and it spared the company around 7-8M in cloud costs.
Now I work for clients that spend multiple millions in cloud for a fraction of the output and traffic, and I think employ around 15+ dev ops engineers.
Running full-scale Kubernetes, with multiple databases and services and an expected 99.99% uptime, likely can't be handled by one person.
Why do so many developers and sysadmins think they're not competent enough to host services? It is a lot easier than you think, and it's also fun to solve technical issues you may have.
If you want true reliability, you need redundant physical locations, power, networking. That’s extremely easy to achieve on cloud providers.
It doesn't make sense if you only have a few servers, but if you are renting the equivalent of multiple racks of servers from the cloud and run them for most of the day, on-prem is staggeringly cheaper.
We have a few racks, and we do the "move to cloud" calculation every few years; without fail it comes out at least 3x the cost.
And before the "but you need to do more work" whining I hear from people who have never done it - it's not much more than navigating the forest of cloud APIs and dealing with random black-box issues in the cloud that you can't really debug, just work around.
On cloud it's out of your control when an AZ goes down. When it's your server you can do things to increase reliability. Most colos have redundant power feeds and internet. On prem that's a bit harder, but you can buy a UPS.
If your head office is hit by a meteor your business is over. Don't need to prepare for that.
It is a different skillset. SRE is also under-valued/under-paid (unless one is in FAANGO).
It's also nontrivial once you go past some level of complexity and volume. I have made my career building software, and part of that requires understanding the limitations and specifics of the underlying hardware, but at the end of the day I simply want to provision and run a container. I don't want to think about the security and networking setup; it's not worth my time.
Because those services solve the problem for them. It is the same thing with GitHub.
However, as predicted half a decade ago with GitHub becoming unreliable [0], and as price increases begin to happen, you can see that self-hosting begins to make more sense: you have complete control of the infrastructure and costs, and it has never been easier to self-host.
> it's also fun to solve technical issues you may have.
What you have just seen with coding agents is going to have the same effect on "developers": a decline in skills the moment they become over-reliant on coding agents, to the point where they won't be able to write a single line of code to fix a problem they don't fully understand.
I agree that solving technical issues is very fun, and hosting services is usually easy, but having resilient infrastructure is costly and I simply don't like to be woken up at night to fix stuff while the company is bleeding money and customers.
Speaking as someone who does this, it is very straightforward. You can rent space from people like Equinix or Global Switch for very reasonable prices. They then take care of power, cooling, cabling plant etc.
We also rely on github. It has historically been a good service, but it's getting worse.
(hardware engineer trying to understand wtaf software people are saying when they speak)
The argument made 2 decades ago was that you shouldn't own the infrastructure (capital expense) and instead just account for the cost as operational expense (opex). The rationale was you exchange ownership for rent. Make your headache someone else's headache.
The ping pong between centralized vs decentralized, owned vs rented, will just keep going. It's never an either or, but when companies make it all-or-nothing then you have to really examine the specifics.
The Cloud providers made a lot of sense to finance departments since aside from the promised savings, you would take that cloud expense now and lower your tax rate.
After the passing of the One Beautiful Bill ("OBB"), the law allows you to accelerate CapEx to instead expense it[1], similar to the benefit given by cloud service providers.
This puts way more wind in the sails of the on-prem movement, for sure.
[1] https://www.iqxbusiness.com/big-beautiful-bill-impact-on-cap...
if you are a profitable company paying taxes, you 100% want to defer taxes (part of EBITDA) thus trade earnings for market share.
This is exactly what TCI did [1] with cable
[1] https://www.colinkeeley.com/blog/john-malone-operating-manua...
However, spending a premium on cloud services over what you could with an on-prem capital investment does not help your cash position.
His tenet of frugality would have conflicted, especially since the cloud premium can easily exceed the tax rate - that is to say, paying taxes would have been cheaper.
Section in your linked article about frugality https://www.colinkeeley.com/blog/john-malone-operating-manua...
In any case, spending on this as either opex or capex doesn't help you gain or lose market share. Conserving cash can help, so you'd want to employ the lower-cost option regardless of which line of the financial statement it hits - and it's not going to be cloud if you follow that thought through.
If cost were equal then opex would give a tax advantage, but most companies are valued on EBITDA, so optimizing tax spend still may not be their priority - there are a lot of other methods to avoid taxes. In the environment I've operated in, I choose capex because it conserves cash (it's cheaper) and improves EBITDA optics (it's excluded).
On prem is maybe not the best first step but Colo or dedicated servers gives you a cleaner path to going on-prem if you ever decide to. The cost of growth is too high in cloud.
Learning how to run servers is actually less complicated than all the cloud architecture stuff and doesn’t have to be slower. There’s no one sized fits all, but I believe old boring solutions should be employed first and could be used to run most applications. Technology has a way of getting more complex every year just to accomplish the same tasks. But that’s largely optional.
I say lower your tax bill.
"not the same ting" - nnt
In other words, please explain how it makes sense to lower the tax bill by shifting the expenses to opex when that process involves paying more for the same utility?
The only reason to lower the tax bill is to conserve cash. The article you linked to explains it that way too.
It seems the main issue is that everyone is anchored to AWS so they have no incentive to reduce their prices. Probably same for Azure. I think Google is just risky because they kill products so easily.
That was part of the reason.
The real reason was that requests to the internal infrastructure team in many orgs went nowhere. There was a huge queue, and many teams instead had to find infinite workarounds, including standing up their own. The "cloud" provided a standardized way to at least deal with this mess, e.g. a single source of billing.
> A 1990s VP of IT would look at this post and say, what's new?
Speed. The US lives in luxury but outside of that it often takes a LONG time to get proper servers. You don't just go online. There are many places where you have to talk to a vendor with no list price and the drama continues. Being out of capacity can mean weeks to months before you get anywhere.
Which many places ?
All teams will henceforth expose their data and functionality through service interfaces
Oh man, this is bad advice. Airborne humidity and contaminants will KILL your servers on a very short horizon in most places - even San Diego. I highly suggest enthalpy wheel coolers (KyotoCooling is one vendor - Switch runs very similar units in their massive Nevada desert datacenters) as they remove the heat from the indoor air using outdoor air (and can boost slightly with an integrated refrigeration unit to hit target intake temps) without exchanging the air from one side to the other. This has huge benefits for air quality control and outdoor air tolerance, and a single 500kW heat-rejection unit uses only 25kW of input power (when it needs to boost the AC unit's output). You can combine this with evaporative cooling on the exterior intakes to lower the temps even further at the expense of some water consumption (typically far cheaper than the extra electricity to boost the cooling through an HVAC cycle).
Not knocking the achievement, just speaking from experience: taking outdoor air (even filtered and mixed) into a datacenter is a recipe for hardware failure, and the mean time to failure is highly dependent on your outdoor air conditions. I've run 3MW facilities with passive air cooling, and taking outdoor air directly into servers requires a LOT more conditioning and consideration than is outlined in this article.
Likewise, the impact on server longevity is not a hard boundary but rather an "exposure over time" gradient: exceeding the "low risk" boundary (>-12'C/10'f dew point or >15'C/59'f dry bulb temp) results in lower MTBF than designed. This is defined in ASHRAE TC 9.9 (which server equipment manufacturers conform and build to). This means that if you're running your servers above the high-risk curve for humidity and temperature, you're shortening their life considerably compared to the low-risk curve.
Generally, 15% RH is considered suboptimal and can be dangerous near freezing temperatures - in San Diego in January there were several 90%+RH scenarios that would have been dangerous for servers even when mixed down with warm exhaust air - furthermore, the outdoor air at 76'f during that period means you have limited capacity to mix in warm exhaust air (which btw came from that same 99%RH input air) without getting into higher-than-ideal intake temps.
Any dew points above 62.5'f are considered high risk for servers - as are any intake temps exceeding 32'C/90'f. You want to be on the midpoint between those and 16'C/65'f temps & -12'C/10'f dew point to have no impact on server longevity or MTBF rates.
As a recent example:
KCASANDI6112 - January 2, 2026
High Low Average
Temperature 73.4 °F 59.9 °F 63.5 °F
Dew Point 68.0 °F 60.0 °F 62.6 °F
Humidity 99 % 81 % 96 %
Precipitation 0.12 in -- --
Lastly, air contaminants - in the form of dust (which can be filtered out) and chemicals (which can't, without extensive scrubbing) - are probably the most detrimental to server equipment if not properly managed, and require very intentional and frequent filter changes (typically high-MERV pleated filters changed on a time or pressure-drop signal) to prevent server degradation and equipment risks. The last consideration is fire suppression - permitted datacenters usually require compliance with a separate fire code, such that direct outdoor air exchange without active shutdown and dry suppression is not permitted. This is to prevent a scenario where your equipment catches fire and a constant supply of fresh, oxygen-rich outdoor air turns that into an inferno. Smoke detection systems don't operate well with outdoor-mixed air or any level of airborne particulates.
So - for those reasons - among a few others - open air datacenters are not recommended unless you're doing them at google or meta scale, and in those scenarios you typically have much more extensive systems and purpose-designed hardware in order to operate for the design life of the equipment without issues.
At least a decade of that "long time" involved ordinary servers stuffed with GPUs (not ASICs) -- first for Bitcoin, then for Ethereum (until ~3 years ago).
When I'm launching a project it's easier for me to rent $250 worth of compute from AWS. When the project consumes $30k a month, it's easier for me to rent a colocation.
My point is that a good engineer should know how to calculate all the ups and downs here to propose a sound plan to the management. That's the winning thing.
In 99.999999% of cases management has already decided and is just informing you, because they know better.
Perhaps an exception (yet so far, I've never encountered the situation you describe).
There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.
People obsess about hardware but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage. The cost to optimize is that cost. The hosting cost usually is a rounding error on the staffing cost. And on top of that the amount of responsibilities increases as soon as you own the hardware. You need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem. And it has a non zero cost.
The right mindset for hosting cost is to think of it in FTEs (the cost of one full-time employee for a year). If it's below 1 (true for most startups until they are well into scale-up territory), you are doing great. Most of the optimizations you could make will cost you actual FTEs of work, and 1 FTE pays for quite a bit of hosting - think 10K per month in AWS cost; a good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us; it's not worth spending any amount of my time on. I literally have more valuable things to do.
This flips when the hosting alone starts costing you multiple FTEs. At that point you probably have an additional 5-10 FTEs of staffing anyway just to babysit all of it, so now you can talk about trading some hosting FTEs for a modest amount of extra staffing FTEs and make net gains.
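To make that heuristic concrete, here's a trivial sketch; the $150k fully loaded FTE cost is an assumed figure for illustration, not something from the comment above:

    FTE_ANNUAL_COST = 150_000  # assumed fully loaded cost of one ops FTE per year

    def hosting_cost_in_ftes(monthly_hosting_usd):
        """Express a monthly hosting bill as a fraction of one FTE-year."""
        return monthly_hosting_usd * 12 / FTE_ANNUAL_COST

    for monthly in (1_000, 10_000, 100_000):
        print(f"${monthly:>7,}/mo -> {hosting_cost_in_ftes(monthly):.2f} FTE")
    # ~0.08 FTE: not worth touching; ~0.8 FTE: still below one person;
    # ~8 FTE: trading some staff time for hosting savings can now pay off.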
You rent datacenter space, which is OPEX not CAPEX, and you just lease the servers, which turns a big CAPEX purchase into a monthly OPEX bill.
Running your own DC is a "we have two dozen racks of servers" endeavour, but even just renting DC space and buying servers is much cheaper than getting the same level of performance from the cloud.
> This flips when the hosting alone starts costing you multiple FTEs. At that point you probably have an additional 5-10 FTEs of staffing anyway just to babysit all of it, so now you can talk about trading some hosting FTEs for a modest amount of extra staffing FTEs and make net gains.
YOU NEED THOSE PEOPLE TO MANAGE CLOUD TOO. That's what always gets ignored in these calculations. People go "oh, but we really need like 2-3 ops people to cover the datacenter and have shifts on the on-call", but you need the same thing for cloud too - it just gets dumped on the programmers/devops people on the team rather than handled by separate staff.
We have a few racks, and the hardware-related part is a small fraction of the total workload; most of it is the same work we would do (and do, for a few cloud customers) in the cloud - writing manifests for automation.
Finally, some sense! "Cloud" was meant to make ops jobs disappear, but it just increased our salaries by turning us into "DevOps Engineers", while the company's hosting bill increased fivefold in the process. You will never convince even 1% of devs to learn the ops side properly, therefore you'll still end up hiring ops people, and we will cost you more now. On top of that, everyone who started as a "DevOps Engineer" knows less about ops than those who started in ops and transitioned into being "DevOps Engineers" (or some flavour of it, like SREs or Platform Engineers).
If you're a programmer scared into thinking AI is going to take away your job, re-read my comment.
Just database management is a pretty specialized skill, separate from development or from optimizing the structure of that data... For a lot of SaaS providers, if you aren't at the point where you can afford dedicated DBA/ops staff just for data, that's one reason you might lean on cloud or hybrid operations just for DBMS management, security, and backups. This is low-hanging fruit in terms of cloud offerings, even... but it can shift a lot of burden in terms of operational overhead.
Again, depending on your business and data models.
But it is significantly cheaper and faster
As a hearsay anecdote, that's why some startups have DB servers with hundreds of GB of RAM and dozens of CPUs to run a workload that could be served from a 5-year-old laptop.
Once they are up and running that employee is spending at most a few hours a month on them. Maybe even a few hours every six months.
OTOH you are specifically ignoring that you'll need mostly the same time from a cloud-trained person if you're all-in on AWS.
I expect the marginal cost of one employee over the other is zero.
You should also calculate the cost of getting it up and running. With Google Cloud (I don't actually use AWS), I mainly worry about building docker containers in CI and deploying them to VMs and triggering rolling restarts as those get replaced with new ones. I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
And for production setups: you need people on standby to fix the server in case of hardware issues, including outside office hours. Also, where does the hardware live? What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in stock? Where? Do you pay for security for wherever all that happens? What about cleaning, AC, or a special server room in your building? All that stuff is cost. Some of it is upfront cost. Some of it is recurring cost.
The article is about a company that owns its own data center. The cost they are citing (5 million) is substantial and probably a bit more complete. That's one end of the spectrum.
> I don't worry about booting them. I don't worry about provisioning operating systems or configuration to them. Or security updates. They come up with a lot of pre-provisioned monitoring and other stuff. No effort required on my side.
These are not difficult problems. You can use the same/similar cloud install images.
A 10-year-old nerd can install Linux on a computer; if you're a professional developer, I'm sure you can read the documentation and automate that.
> And for production setups: you need people on standby to fix the server in case of hardware issues, including outside office hours.
You could use the same person who is on standby to fix the cloud system if that has some failure.
> Also, where does the hardware live?
In rented rackspace nearby, and/or in other locations if you need more redundancy.
> What's your process when it fails? Who drives to wherever the thing is and fixes it? What do you pay them to be available for that? What's the lead time for spare components? Do you actually keep those in supply? Where?
It will probably report the hardware failure to Dell/HP/etc automatically and open a case. Email or phone to confirm, the part will be sent overnight, and you can either install it yourself (very, very easy for things like failed disks) or ask a technician to do it (I only did this once with a CPU failure on a brand new server). Dell/HP/etc will provide the technician, or your rented datacentre space will have one for simpler tasks like disks.
It is sad that the knowledge of how easy it really is, is going extinct. The cloud and SaaS companies benefit greatly.
The installation itself was handled by the vendor and datacenter. For hard drive failures, our vendor (who provided the warranty) shipped a drive and had a technician drive to the site. We had to 1. tell the datacenter to expect the package and let the tech in, and 2. be online to run the command to blink the lights on the drive that needed replacing and then verify that the drive came online. This 6-company dance (us, vendor, DC, tech, fedex, HDD manufacturer) was more annoying than just terminating an EC2 instance and recreating it (or having EBS handle drive failures behind the scenes) but it wasn't that bad in the grand scheme of things.
I was not doing the calculation. I was only pointing out that it was not as simple as you make it out to be.
Okay, a few other things that aren't in most calculations:
1. Looking at jobs postings in my area, the highest paid ones require experience with specific cloud vendors. The FTEs you need to "manage" the cloud are a great deal more expensive than developers.
2. You don't need to compare on-prem data center with AWS - you can rent a pretty beefy VPS or colocate for a fraction of the cost of AWS (or GCP, or Azure) services. You're comparing the most expensive alternative when avoiding cloud services, not the most typical.
3. Even if you do want to build your own on-prem rack, FTEs aren't generally paid extra for being on the standby rota. You aren't paying extra. Where you will pay extra is for hot failovers, or machine room maintenance, etc, which you don't actually need if your hot failover is a cheap beefy VPS-on-demand on Hetzner, DO, etc.
4. You are measuring the cost of absolute 0% downtime. I can't think of many businesses that are that sensitive to downtime; even banks tolerate far more downtime than that, sometimes while their IT systems are still up. With requirements that strict you're in territory where a catastrophe could stop the business itself even though the IT systems keep running :-/. What use are the IT systems when the business itself may be down?
The TLDR is:
1. If you have highly paid cloud-trained FTEs, and
2. Your only option other than Cloud is on-prem, and
3. Your FTEs are actually FT-contractors who get paid per hour, and
4. Your uptime requirements are more stringent than national banks',
yeah, then cloud services are only slightly more expensive.
You know how many businesses fall into that specific narrow set of requirements?
If you do it only a few hours every 6 months, you are not maintaining your infrastructure, you are letting it rot (until the need arises, everything must be done at once, and it turns into a massive project).
Here's what TFA says about this:
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out.
and I think they're right. Be careful how you start because you may be stuck in the initial situation for a long time.
The upfront capex does not need to be that high, unless you're running your own AI models. Other than leasing new ones, as a sibling comment stated, you can buy used. You can get a solid Dell 2U with a full service contract (3 years) for ~$5-10K depending on CPU / memory / storage configuration. Or if you don't mind going older - because honestly, most webapps aren't doing anything compute-heavy - you can drop that to < $1K/node. Replacement parts for those are cheap, so buy an extra of everything.
It really depends on the business model as to how well you might support your own infrastructure vs. relying on a new backend instance per client in a cloud infrastructure that has already solved many of the issues at play.
Then you're probably going to need some combination of HIPAA / SOC 2 / PCI DSS certification, regardless of where your servers are physically located. AWS has certified the infrastructure side for you, but that doesn't remove your obligations for the logical side.
> Are you prepared for appropriate data isolation/sharding and controls? Do you have a strategy for scaling database operations per client or across all clients?
Again, you're going to need that regardless of where your servers physically exist.
> vs. relying on a new backend instance per client in a cloud infrastructure
You want to spin up an EC2 per client, and run an isolated copy of the application, isolated DB, etc. inside of it? That sounds like a nightmare to manage, especially if you want or need HA capabilities.
>> vs. relying on a new backend instance per client in a cloud infrastructure
> You want to spin up an EC2 per client, and run an isolated copy of the application, isolated DB, etc. inside of it? That sounds like a nightmare to manage, especially if you want or need HA capabilities.
No... just running a new hosted database instance per client, while (re)using your service/application infrastructure and connecting through a different database host/proxy depending on which client the request is for. Just that utility at the database management layer is probably worth the price of entry for cloud resources if you can't justify and cover the cost of, say, 5+ employees just for the data management infrastructure.
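For what it's worth, the application side of that pattern can be as small as a lookup from tenant to connection string. A minimal sketch, assuming Postgres via psycopg2; the tenant names, DSNs, and table are made up for illustration:

    import psycopg2  # any driver or connection pool works the same way

    # One hosted database instance per client; the application tier is shared.
    TENANT_DSNS = {
        "acme":   "postgresql://app@db-acme.internal/app",
        "globex": "postgresql://app@db-globex.internal/app",
    }

    def connection_for(tenant):
        """Route the request to the tenant's own database instance."""
        try:
            return psycopg2.connect(TENANT_DSNS[tenant])
        except KeyError:
            raise ValueError(f"unknown tenant: {tenant}")

    def plan_for_user(tenant, user_id):
        with connection_for(tenant) as conn, conn.cursor() as cur:
            cur.execute("SELECT plan FROM accounts WHERE user_id = %s", (user_id,))
            return cur.fetchone()

The hard part the cloud providers sell you isn't this lookup; it's provisioning, backing up, and patching each of those instances.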
Or use Citus Postgres, and get sharding by schema for free, so you have both isolation and more or less infinite growth.
I'm not sure why, if you think it would take 5 employees to manage self-hosted DBs, it wouldn't take close to that to manage cloud-hosted ones. The only real difference you're going to have once both are set up is dealing with any possible hardware issues. The initial setup for backups, streaming replication, etc. is a one-time thing, and then it just works. Hire a contractor for that, optionally keeping them on retainer for emergencies if you want.
You still have to deal with DB issues with a managed service: things like schema management, table design, index maintenance, parameter tuning, query optimization are all your responsibility, not the cloud provider’s.
As to 5 dedicated employees for DB systems management... that's just roughly where I would put the breakpoint. Short of that, you're more likely to have people in mixed roles during development, where they spend only part of their time managing migrations for schema changes and most of their time developing features.
The schema, table, and index design etc. are largely done by the developers themselves at the startup level... and even then, it's a problem where the costs scale as op-ex against direct revenue. So having 1000 clients isn't relatively more or less expensive than the first 5-10; it's baked into the model.
Cloud integrations, for example, allow you to simply use a different database instance altogether per customer while sharing the services that use a given DB connection. But actually setting up and managing that type of database infrastructure yourself may be much more resource-intensive from a headcount perspective.
I mention this because having completely separate databases is an abstraction that cloud operations have already solved. You can choose other options, such as more complex data models that isolate or share resources in other ways, but then you have to ask how that complexity affects your services downstream and the overall data complexity across one or all clients.
Harder still, if your data/service is centered around B2B clients of yours that have direct consumer interactions... and then what if the industry is health or finance, where there are even more legal concerns? Figuring a minimal (off-the-top) cost for each of your clients, scaling with the number of users under them, isn't too hard to reason about if you're using a mix of cloud services in concert with your own systems/services.
So yeah... there are definitely considerations in either direction.
The issue with comma.ai is that the company is HEAVILY burdened with Geohot's ideals, despite him no longer even being on the board. I used to be very much into his streams, and he rants about it plenty. A large part of why they run their own datacenter is that they ideologically refuse to give money to AWS or Google (but I guess Microsoft passes their non-woke test).
Which is quite hilarious to me because they live in a very "woke" state and complain about power costs in the blog post. They could easily move to Wyoming or Montana and with low humidity and colder air in the winter run their servers more optimally.
The climates in Wyoming and Montana are actually worse; San Diego's climate extremes are less extreme than those places'. Though moving out of CA is a good idea for power-cost reasons, which is also addressed in the blog.
It's typically going to cost significantly less; it can make a lot of sense for small companies, especially.
You can see quite clearly here that there are so many steps to take. Now a good company would concentrate risk on their differentiating factor, or the specific part they have a competitive advantage in.
It's never about "is the expected cost on premises less than the cloud"; it's about the risk-adjusted costs.
Once you’ve spread risk not only on your main product but also on your infrastructure, it becomes hard.
I would be wary of a smallish company building their own Jira in-house in a similar way.
>Now a good company would concentrate risk on their differentiating factor, or the specific part they have a competitive advantage in.
Yes, but one differentiating factor is always price and you don't want to lose all your margins to some infrastructure provider.
Think of a ~5000 employee startup. Two scenarios:
1. if they win the market, they capture something like ~60% margin
2. if that doesn't happen, they just lose, the VC funding runs out, and then they leave
In this dynamic, infrastructure costs don't change the bottom line of profitability. The risk involved in rolling out their own infrastructure, however, can hurt their main product's very existence.
>Unless on premises helps the bottom line of the main product that the company provides, these decisions don't really matter in my opinion.
Well, exactly. But the degree to which the price of a specific input affects your bottom line depends on your product.
During the dot com era, some VC funded startups (such as Google) made a decision to avoid using Windows servers, Oracle databases and the whole super expensive scale-up architecture that was the risk-free, professional option at the time. If they hadn't taken this risk, they might not have survived.
[Edit] But I think it's not just about cloud vs on-premises. A more important question may be how you're using the cloud. You don't have to lock yourself into a million proprietary APIs and throw petabytes of your data into an egress jail.
But most importantly, there's the pull that companies running their own on-premise infrastructure have on the best talent.
Capex needs work. A couple of years, at least.
If you are willing to put in the work, your mundane computer is always better than the shiny one you don't own.
Of course, creating a VM is still a Terraform commit away (you're not using clickops in prod, surely).
If you want a custom server, one or a thousand, it's at least a couple of weeks.
If you want a powerful GPU server, that's rack + power + cooling (and a significant lead time). A respectable GPU server means ~2 kW of power dissipation and considerable heat.
If you want a datacenter of any size, now that's a year at least from breaking ground to power-on.
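To put that ~2 kW per GPU server into cooling terms, a quick back-of-the-envelope (assuming essentially all of the electrical draw ends up as heat in the room):

    power_kw = 2.0                          # one GPU server, as above
    btu_per_hr = power_kw * 3412            # 1 kW ~ 3,412 BTU/hr
    tons = btu_per_hr / 12_000              # 1 ton of cooling = 12,000 BTU/hr
    print(f"{btu_per_hr:,.0f} BTU/hr ~ {tons:.2f} tons of cooling per server")
    # A rack of ten such servers is ~20 kW, i.e. roughly 5.7 tons of cooling.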
Scale up, prove the market, and establish operations on the credit card; if it doesn't work, the money moves on to more promising opportunities. If the operation is profitable, you transition away from the too-expensive cloud to increase profitability, and use the operation's incoming revenue to pay for it (freeing up more money to chase more promising opportunities).
Personally I can’t imagine anything outside of a hybrid approach, if only to maintain power dynamics with suppliers on both sides. Price increases and forced changes can be met with instant redeployments off their services/stack, creating room for more substantive negotiations. When investments come in the form of saving time and money, it’s not hard to get everyone aligned.
I think the primary reason people over-fixate on the cloud is that they can't do math. So renting is a hedge.
Would love to see people read, write and do more math.
Even spending 10k recurring can be easier administratively, in some organisations, than spending 10k on a one-time purchase that depreciates over a 3-year cycle, because you don't have to go into meetings to debate whether it's actually a 2- or 4-year depreciation or discuss the opportunity cost of locking up capital for 3 years, etc.
Getting things done is mostly a matter of getting through bureaucracy. Projects fail because of getting stuck in approvals far more often than they fail because of going overbudget.
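For anyone wondering what that 2-vs-4-year debate actually changes, a trivial straight-line sketch (using the 10k figure from the comment above; the framing is mine):

    purchase = 10_000  # the one-time purchase from the example above
    for years in (2, 3, 4):
        print(f"{years}-year straight-line depreciation: "
              f"${purchase / years:,.0f} hits the P&L each year")
    # The cash left the building either way; the meetings are about which
    # year's budget each of those slices lands in.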
Of course not.
But we are talking about a cost difference of tens of times, maybe a few hundred. The cloud is not like "most of the time".
If you don’t, you’ll be stuck trying to figure out data centres. Hiring tons of infrastructure experts, trying to manage power consumption. And for what? You won’t sell any more nails.
If you’re a company like Google, having better data centres does relate to your products, so it makes sense to focus on them and build your own.
Now on-prem is cool again.
Makes me wonder whether we’re already setting up the next cycle 10 years from now, when everyone rediscovers why cloud was attractive in the first place and starts saying “on-prem is a bad idea” again.
My entire career I’ve encountered people passionately pushing for on-prem and railing against anything cloud. I can’t remember a time when Hacker News comments leaned pro-cloud because it’s always been about self-hosting.
The few times the on-prem people won out in my career never went exactly as they imagined. Buying a couple servers and setting them up at the colo is easy enough, but the slow and steady drag of maintaining your own infrastructure starts to work its way into every development cycle after that. In my experience, every team has significantly underestimated how all the little things add up to a drag on available time for other work.
The best case for on-prem that I saw was when a company was basically in maintenance mode. Engineers had a lot of extra time to optimize, update, maintain, and cut costs without subtracting from feature development or bug fixes.
The worst cases for on-prem I’ve seen have been funded startups. In this situation it’s imperative that everyone focus on feature development and rapid iteration. Letting some of the engineers get sidetracked with setting up and maintaining their own hosting to save a dollar amount that barely hires 1-2 more engineers but sets the schedule back by many months was a huge mistake.
In my experience, most engineers become less enchanted with rolling their own on premises hosting as they get older. Their work becomes more about getting the job done quickly and to budget, not hyper-optimizing the hosting situation at the expense of inviting more complexity and miscellaneous tasks into their workload.
This is cyclical and I see the main axis of contention as centralized vs de-centralized computing.
Mainframes (network) gave way to mini and microcomputers (PCs). PCs gave way to server farms and web-based applications. Private servers and data centers gave way to the Cloud. Edge computing is again a push towards a more decentralized model.
Like all good engineering problems, where data and applications are hosted involves tradeoffs. Priorities change. Technologies change. But oftentimes, what works in one generation doesn't work in the next. Part of it is the slow march of progress, but I think some of it is just not wanting to use your parents' technology stack and wanting to build your own.
The cloud vs. on-prem tradeoff is one of flexibility, capacity, maintenance, and capex vs opex.
It's a similar story in application development. At one point, we're navigating text forms on a mainframe, the next it's a GUI local application, followed by Electron or Web applications with remote data. We'll cycle back to local-first data (likely on-phone local models).
When you start to hear about the network being the computer again, you'll know we've started to swing back the other way again.
That's pretty much the dogma of the 2010s.
It doesn't matter that my org runs a line-of-business datacentre that is a fraction of the cost of public cloud. It doesn't matter that my "big" ERP and admin servers take up half a rack in that datacentre. MBA dogma says that I need to fire every graybeard sysadmin, raze our datacentre facility to the ground, and move to AWS.
Fun fact: salaries and hardware purchases typically track inflation, because switching costs for hardware are nil and hiring isn't that expensive. Software, on the other hand, usually goes up 5-10% every year, because vendors know that lock-in and switching costs for software are expensive.
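A quick illustration of how those increases compound (the 3% is an assumed inflation-ish rate; the 8% just sits inside the 5-10% range above):

    for label, rate in (("hardware/salaries", 0.03), ("software vendor", 0.08)):
        print(f"{label}: x{(1 + rate) ** 5:.2f} after 5 years")
    # ~1.16x vs ~1.47x - the locked-in line item steadily pulls away.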
AWS has redundant data centres across the world and within each region. A file in S3 will never be lost, even if you store it for a thousand years.
What happens if your city has a tornado and your data centre gets hit? Is your company now dead?
And how much do you spend on all these sysadmins? 200k each? If you’re saving 20k/month by paying 100k/month in salaries, you aren’t saving anything.