
Posted by Torq_boi 1 day ago

Don't rent the cloud, own instead (blog.comma.ai)
1175 points | 491 comments
hbogert 1 day ago|
> Datacenters need cool dry air? <45%

No, low isn't good per se. I worked in a datacenter which in winter had less than 40%; RAM was failing all over the place. Low humidity causes static electricity.

swiftcoder 1 day ago||
The datacenter is in San Diego - a quick Google confirms that external humidity pretty much never drops below 50% there.

Things would be different in a colder climate where humidity goes --> 0% in the winter

mbreese 1 day ago|||
Low is good if you are also adding more humidity back in. If you want to maintain 45-50% (guessing), then you would want <45% environmental humidity so that you can raise it to the level you want. You're right about avoiding static, but you'd still want to try to keep it somewhat consistent.

It is much cheaper to use external air for cooling if you can.
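To make the winter-humidity point concrete, here's a rough sketch of what happens to relative humidity when you pull in cold outside air and heat it. The indoor/outdoor temperatures are illustrative assumptions; the saturation vapor pressure uses the standard Magnus approximation:

```python
import math

def saturation_vapor_pressure(temp_c: float) -> float:
    """Magnus approximation for saturation vapor pressure, in hPa."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))

def rh_after_heating(rh_outside: float, t_outside: float, t_inside: float) -> float:
    """Relative humidity after heating air at constant absolute moisture content."""
    return rh_outside * saturation_vapor_pressure(t_outside) / saturation_vapor_pressure(t_inside)

# 0 C outside air at 80% RH, heated to a 22 C cold aisle:
# the RH lands well under the ~45% static-safe floor, which is why
# winter datacenters in cold climates need humidification.
print(round(rh_after_heating(80, 0, 22), 1))
```

That's why "external air is cheaper" only holds if you also budget for humidification in cold climates.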

hbogert 1 day ago||
Yeah, but the article makes it sound as if lower is better, which it definitely is not. And yes, you need to control humidity, which might mean sometimes lowering it and sometimes raising it with whatever solution you have.

Also, this is where cutting corners does result in lower cost, which was the OP's point to begin with. It just means you won't get as good a datacenter as the people who actually tune this all day and have decades of experience.

CamperBob2 1 day ago||
> Low humidity causes static electricity.

RAM that is plugged in and operating isn't subject to external ESD, unless you count lightning strikes. Where are you getting this?

sgarland 1 day ago||
Note that they're running R630/R730s for storage. Those are 12-year-old servers, and yet they say each one can do 20 Gbps (2.5 GB/s) of random reads. In comparison, the same generation of hardware at AWS ({c,m,r}4 instances) maxes out at 50% of that for EBS throughput on m4, and 70% on r4 - and that assumes carefully tuned block sizes.

Old hardware is _plenty_ powerful for a lot of tasks today.
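As a sanity check on those numbers (the 20 Gbps figure is from the article; the 50%/70% ratios are from the comment, and they do line up with the roughly 10 Gbps and 14 Gbps EBS caps of that instance generation):

```python
server_gbps = 20                  # per-R630/R730 random-read figure quoted from the article
server_gb_s = server_gbps / 8     # bits -> bytes: 2.5 GB/s

m4_ebs_gb_s = 0.50 * server_gb_s  # 50% of that -> 1.25 GB/s (10 Gbps EBS cap)
r4_ebs_gb_s = 0.70 * server_gb_s  # 70% of that -> 1.75 GB/s (14 Gbps EBS cap)

print(server_gb_s, m4_ebs_gb_s, r4_ebs_gb_s)
```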

treesknees 1 day ago||
I’m on a project at work replacing our R430s and R730s. They’ve been absolute tanks with very few hardware failures. That said, my company chooses to have OEM support for replacing failed components and keeping firmware/bios/idrac updated. You can absolutely run these if you’re OK with 3rd party replacements or parting out spare machines. Some industries are more tolerant to this than others.
sgarland 1 day ago||
I ran 3x R620s 24/7/365 in my homelab for ~6 years (well, other than when I moved, or shut one down for a clean-and-inspect, or lost power in excess of what my UPS could handle... thanks, Texas). The only things that failed during that time were a couple of sticks of RAM, and a PSU.
sdbrown 1 day ago||
On a related-different axis, I've consistently seen on-prem GPUs running identical workloads ~35% faster than the same workloads on the same cloud hardware, regardless of intermediate infra stack layering/versioning choices. Weird but I'm not complaining!
jononor 21 hours ago||
They might be undervolting and underclocking the GPUs to improve longevity?
vadepaysa 1 day ago||
I was an on-prem maxi (if that's a thing) for a long time. I've run clusters that cost more than $5M, but these days I am a changed man. I start with a PaaS like Vercel and work my way down to on-prem depending on how important and cost-conscious that workload is.

Pains I faced running BIG clusters on-prem.

1. Supply chain management -- everything from power supplies all the way to GPUs and storage has to be procured, shipped, unpacked, and installed. You need a labor pool and dedicated management.

2. Inventory management -- You also need to manage on-hand inventory for parts that WILL fail. You can expect 20% of your cluster to have some degree of issues on an ongoing basis.

3. Networking and security -- You are on your own defending your network, or you have to pay vendors a ton of money to come in and help you. Even with the simplest of storage clusters, we've had to deal with pretty sophisticated attacks.

When I ran massive clusters, I had a large team dealing with these. Obviously, with PaaS, you don't need anyone.

cheema33 1 day ago||
> I was an on-prem maxi (if that's a thing) for a long time. I've run clusters that cost more than $5M, but these days I am a changed man.

I have had a similar transformation. I still host non-critical services on-prem. They are exceptionally cheap to run. Everything else, I host it on Hetzner.

majormajor 1 day ago||
In addition to those sorts of non-first-hardware-purchase costs, the person writing the check needs to think long and hard about how bad an outage would be, and how much money it makes sense to budget simply for "avoiding outages." And the more important it is not to have any downtime, the more it's going to cost to build up some substitute for cross-datacenter cloud functionality. (You are also likely not going to be as good as AWS at managing and configuring those networks, or at hiring people to do so.)
sys42590 1 day ago||
It would be interesting to hear their contingency plan for any kind of disaster (most commonly a fire) that hits their data center.
sschueller 1 day ago||
Yep, does anyone remember the OVH fire[1][2]?

[1] https://www.techradar.com/news/remember-the-ovhcloud-data-ce...

[2] https://blocksandfiles.com/wp-content/uploads/2023/03/ovhclo...

otherme123 1 day ago|||
I fully lost three small VPS there, and their response was poor: they didn't even refund the time lost, let alone compensate for it (e.g. a couple of months of free VPS). I got better updates from the news than from them - the news was reporting "almost total loss" while they were trying to convince me I'd had the incredibly bad luck that all three of my VPS were in the very small zone affected by the fire. The only way I could recover what I lost was backups on local machines.

When someone points out how safe cloud providers are, as if they have multiple levels of redundancy and are fully protected against even an alien invasion, I remember the OVH fire.

wiether 1 day ago||
OVH VPS is not the same as say, AWS EC2.

It's their "Compute" under "Public Cloud" that is competing against AWS EC2. https://us.ovhcloud.com/public-cloud/compute/

They handled the fire terribly and improved a bit afterwards, but an OVH VPS is just a VM running on a single piece of hardware. Not quite the same thing as the "Compute", which runs on clusters.

AndroTux 1 day ago|||
contingency plan: Don't build your data center out of wood.
srg0 1 day ago||
Plastic is made from the same stuff as gasoline.
direwolf20 1 day ago||
Drain cleaner and hydrochloric acid makes salt water. Water is made of highly explosive hydrogen. Salt is made of toxic chlorine and explosive sodium.
fpoling 1 day ago|||
They use the datacenter for model training, not to serve online users. Presumably, even if it's offline for a week or even a month, it won't be a total disaster as long as they have, for example, offsite tape backups.
instagib 1 day ago|||
Flooding due to burst frozen pipe, false sprinkler trigger, or many others.

Something very similar happened at work. Water valve monitoring wasn’t up yet. Fire didn’t respond because reasons. Huge amount of water flooded over a 3 day weekend. Total loss.

twelvechairs 1 day ago|||
There's only one solution to this problem, and it's two data centres in some way or form.
mbreese 1 day ago||
What's the line from Contact?

> Why build one when you can have two at twice the price?

But if you're building a datacenter for $5M, spending $10-15M for redundant datacenters (even with extra networking costs) would still be cheaper than their estimated $25M cloud costs.
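Roughly, with the figures from this thread (the $25M cloud estimate is from the article; the networking/fit-out overhead for the second site is an illustrative assumption):

```python
cloud_estimate = 25_000_000             # article's estimated cloud cost for the same workload
single_dc = 5_000_000                   # cost of one datacenter, per the article
extra_overhead = 2_500_000              # assumed cross-site networking / duplicated fit-out

redundant_dcs = 2 * single_dc + extra_overhead  # two full sites
savings = cloud_estimate - redundant_dcs

print(redundant_dcs, savings)  # redundancy still comes in millions under the cloud estimate
```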

golem14 1 day ago||
Or build two $2.5MM DCs (if you can parallelize your workload well enough), and in case of disaster you only lose capacity.

You'd need, however, to plan for $1MM+ p.a. in OPEX, because good SREs aren't cheap (nor are the hardware folks building and maintaining machines).

direwolf20 1 day ago||
The plan is to not set it on fire. If your office burns down, you are already screwed.
insuranceguru 1 day ago||
The own vs rent calculus for compute is starting to mirror the market value vs replacement cost divergence we see in physical assets. Cloud is convenient because it lowers OpEx initially, but you lose control over the long-term CapEx efficiency. Once you reach a certain scale, paying the premium for AWS flexibility stops making sense compared to the raw horsepower of owned metal.
seg_lol 1 day ago|
Using "big" cloud providers is often a mistake. You want to use rented assets to bootstrap and then start deploying on instances that are more and more under your control. With big cloud providers, it is easy to just succumb to their service offerings rather than do the right thing. Do your PoC on Hetzner and DigitalOcean then scale with purpose.
pja 1 day ago||
I’m impressed that San Diego electrical power manages to be even more expensive than in the UK. That takes some doing.
butterisgood 1 day ago||
I think this is how IBM is making tons of money on mainframes. A lot of what people are doing with cloud can be done on premises with the right levels of virtualization.

https://intellectia.ai/news/stock/ibm-mainframe-business-ach...

60% YoY growth is pretty excellent for an "outdated" technology.

pu_pe 1 day ago||
> Self-reliance is great, but there are other benefits to running your own compute. It inspires good engineering.

It's easy to inspire people when you have great engineers in the first place. That's a given at a place like comma.ai, but there are many companies out there where administering a datacenter is far beyond their core competencies.

I feel like skilled engineers have a hard time understanding the trade-offs from cloud companies. The same way that comma.ai employees likely don't have an in-house canteen, it can make sense to focus on what you are good at and outsource the rest.

szszrk 1 day ago||
> I feel like skilled engineers have a hard time understanding the trade-offs from cloud companies.

They spend too much time on yet another cloud-native support group call, studying for ThatOneCloudProvider certificates, figuring out single-implementation caveats, standardizing security procedures between cloud teams, and so on.

Yet the people in the article just throw a 1000-line KV store, mkv [0], on a huge raw storage server and call it a day. And it's a legit choice: they did an actual study beforehand and concluded: we don't need redundancy in most cases. At all. I respect that.

[0] https://github.com/geohot/minikeyvalue
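The core trick that keeps a store like that small - deterministically hashing each key to a volume server, so there's no central index to keep consistent - can be sketched in a few lines. This is an illustrative assumption about the scheme, not minikeyvalue's actual code:

```python
import hashlib

def pick_volume(key: bytes, volumes: list[str]) -> str:
    """Map a key to a volume server by hashing it.

    Every client computes the same answer from the key alone,
    so reads and writes need no coordination or lookup table.
    """
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(volumes)
    return volumes[index]

# Hypothetical volume servers:
volumes = ["http://vol1:3001", "http://vol2:3001", "http://vol3:3001"]
print(pick_volume(b"/training/run42/ckpt.bin", volumes))
```

Lose a volume and you lose those keys - which is exactly the "we don't need redundancy" trade-off they studied and accepted.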

Torq_boi 1 day ago||
We actually do have an in-house chef lol.
yomismoaqui 1 day ago||
This quote is gold:

> The cloud requires expertise in company-specific APIs and billing systems. A data center requires knowledge of Watts, bits, and FLOPs. I know which one I'd rather think about.

rudolph9 1 day ago|
> Having your own data center is cool

This company sounds more like a hobby interest than a business focused on solving genuine problems.

HanClinto 1 day ago|||
It kinda does, doesn't it?

Re: the "hobby" part is where I agree with you the most. Where you say it's not solving genuine problems is where I differ the most.

It really feels to me like Comma is staffed by people who recognize that they never stopped enjoying playing with Lego -- their bricks just grew up, and they realized they can:

1) solve real-world problems

2) not be jerks about it

3) get paid to do it

Not everything has to be about optimizing for #3.

I'm a happy paying customer of Comma.ai (Comma four, baby!) -- their product is awesome, extremely consumer-friendly, and I hope they can grow in their success!

adeebshihadeh 1 day ago||
comma four, baby!!

glad you're enjoying it :)

BirAdam 1 day ago||||
To me it sounds more like a return to vertical integration.

This is becoming increasingly common as far as I can tell.

There are benefits either direction, and I think that each company needs to evaluate the pros and cons themselves. Emotional pros/cons are something companies need to evaluate as employee morale can make or break a company. If the company is super technical in culture and they gain something intangible that is boosting the bottom line, having a datacenter as a "cool" factor is probably worth it.

vovavili 1 day ago||||
I'd argue that it is in the long-term interest of any genuinely innovative company to attract intellectually curious talent with some coolness factor.
adeebshihadeh 1 day ago|||
you can have fun while solving real problems. it might even be a requirement
juvoly 1 day ago|
> Cloud companies generally make onboarding very easy, and offboarding very difficult. If you are not vigilant you will sleepwalk into a situation of high cloud costs and no way out. If you want to control your own destiny, you must run your own compute.

Cost and lock-in are obvious factors, but "sovereignty" has also become a key factor in the sales cycle, at least in Europe.

Handling health data, Juvoly is happy to run AI workloads on premises.
