Posted by mpweiher 10/28/2024

How do you deploy in 10 seconds? (paravoce.bearblog.dev)
62 points | 54 comments
notwhereyouare 10/29/2024|
100% this is how my company used to deploy. We had multiple servers, rsync'd code to each server, and cycled IIS. Worked well. Deploys to our farm took just a minute or two, because it would do a double deploy just to be extra sure everything went out.
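In sketch form (host names and paths are made up; assumes the web servers expose SSH and rsync so IIS could be cycled remotely), it was roughly:

    # deploy.sh -- push the build to every web server, then cycle IIS
    for host in web01 web02 web03; do
        rsync -az --delete ./build/ "deploy@$host:/inetpub/wwwroot/app/"
        ssh "deploy@$host" "iisreset"
    done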

Then the BORG came and assimilated us. Our deploys now easily take 45+ minutes to really start shifting traffic.

valbaca 10/29/2024|
What's IIS?
AndrewDucker 10/29/2024|||
Internet Information Server - the MS web server.
pier25 10/30/2024||
Is it still in use? I thought it had been replaced by Kestrel.
AndrewDucker 10/30/2024||
Kestrel does ASP.NET very quickly, because that's all it does. If you want a full web server that does things like static files, SSL, reverse proxying, etc., then you want IIS at the very least sitting in front of it.

See https://stackify.com/kestrel-web-server-asp-net-core-kestrel... - comparison table 3/4 of the way down.

notwhereyouare 10/30/2024|||
As Andrew mentioned, Microsoft's web server. We are a .NET shop.
0xbadcafebee 10/29/2024||
I've regularly gotten CI/CD deploys down to <30 seconds without a ton of fancy caching. You just need to look at what's taking a lot of time, and optimize.

- On commit/push, your build runs once, and stores in an artifact. If nothing has changed, don't rebuild, reuse.

- Your build gets packed into a Docker container once and pushed to a remote registry. If nothing has changed, don't rebuild, reuse (see the sketch at the end of this comment).

- Every test and subsequent stage uses the same build artifact and/or container. Again, this is as simple as pulling a binary or image. Within the same pipeline workspace, it's a file on disk shared between jobs.

- Using a self-hosted CI/CD runner, on the same network and provider as your artifact/container registry, means extremely low-latency, high-bandwidth file transfers. And because it's self-hosted, you don't have to wait for a runner to be allocated, it's waiting for you; it's just connected to remotely and immediately used. K8s runners on autoscaling clusters make it easy to scale jobs in parallel.

- Having each pipeline step use a prebuilt Docker container, and not having each step do a bunch of repetitive stuff (like installing tools, downloading deps...) when they don't need to, is essential. If every single job is doing the same network transfer and same tool install every time, optimize it.

- A kubernetes deploy to production should absolutely take its time to cycle an old and new pod. Half of the point of K8s is to prevent interrupting production traffic and rely on it to automatically resolve issues and prevent larger problems. This means leaning on health checks and ramping traffic for safety. But actually running the deploy part should be nearly instantaneous, a `kubectl apply` or `helm upgrade` should take seconds.
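As a rough sketch (the manifest path and Deployment name are illustrative), the deploy step itself is just:

    # hand the new manifest to the cluster; K8s does the gradual, health-checked cutover
    kubectl apply -f k8s/deployment.yaml
    # optionally block until the rollout finishes so the pipeline reports success/failure
    kubectl rollout status deployment/myapp --timeout=5m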

The only exception to all this is if you (rightly) have a very large test suite that takes a while to go through. You can still optimize the hell out of tests for speed and parallelize them a lot, though.
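To make the "if nothing has changed, reuse" bullets concrete, here's a rough sketch (registry and tag scheme are illustrative): tag images by commit so an unchanged tree maps to an image that already exists, and skip the build entirely:

    # content-addressed tag: same commit -> same image tag
    TAG="ghcr.io/acme/app:$(git rev-parse --short HEAD)"
    if docker manifest inspect "$TAG" > /dev/null 2>&1; then
        echo "$TAG is already in the registry -- reusing it"
    else
        docker build -t "$TAG" . && docker push "$TAG"
    fi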

mikeocool 10/29/2024||
In my experience, actually getting the code to the prod servers and restarting the app is rarely the slow part of CD. These days it mostly seems to be: 1) building all the JavaScript and 2) running the tests.
hellcow 10/30/2024|
Can't help you with the JS compile times, sadly. I think that bed is made. :)

I prefer to run tests locally whenever possible, for instance as a git hook, rather than in a CI instance. If you need auditability for something like PCI, that approach probably won't work, but I think the small web (i.e. most of the web) can do just fine with it.
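For example, a bare-bones pre-push hook (assuming a `make test` target; swap in whatever runs your suite):

    #!/bin/sh
    # .git/hooks/pre-push (chmod +x) -- a failing test suite aborts the push
    if ! make test; then
        echo "Tests failed; push aborted." >&2
        exit 1
    fi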

syndicatedjelly 10/29/2024||
> Every developer knows how to compile locally. Only a few at my company understand Docker's layer caching + mounts and Github Actions' yaml-based workflow configuration in any detail. So when something doesn't work compiling locally, it's also a lot easier for a typical developer to figure out why.

How is this a good excuse? Is it really that difficult for a developer to spend an afternoon understanding GitHub Actions and Docker, at least at a superficial level so they can understand what they're looking at?

HL33tibCe7 10/29/2024|
Understanding these things at a level where you can optimise a CI pipeline is actually quite difficult. The evidence for this is that almost every CI pipeline I've ever seen, in the many companies I have been at, has been horrendously unoptimised and slow.
syndicatedjelly 10/29/2024||
Understanding something and optimizing it are not the same skill set. The author of the article seems to imply that even looking at a pipeline workflow is too daunting for developers, which is what I'm challenging. I personally am working on a horrendously slow pipeline right now, and I agree that it's frustrating to troubleshoot. But ignoring the problem is certainly not the solution - I've slowly but surely chipped away at various aspects and have a much better understanding of our CI/CD now.

In the words of Jake the Dog, "Sucking at something is the first step to getting good at something."

BadBadJellyBean 10/29/2024||
We used to do similar things. Then devs pushed stuff to prod without committing it. Then things broke when pushing the version that was in git. Then I forced everything into Docker. Things got better. If you want to do things fast, invest in local test environments. CI/CD is more than just a way to deploy things. And maybe a little friction before pushing things might push a developer to write a test for the functionality.
hellcow 10/30/2024|
I solved this a different way -- only very senior engineers were allowed to access/deploy to production. Senior engineers (by experience, not title!) had a much better understanding of the full system; they better understood the risks during a deploy and what to watch. They were doing the PR reviews as well.

Many ways to skin the cat. This is just one of them.

BadBadJellyBean 10/30/2024||
Senior engineers made the mess. Because "I just need to fix this now. I'll commit it later". CI/CD gives you stronger guarantees. You can know which code is now in prod. That is so much harder to ensure when there is human intervention.
greggyb 10/29/2024||
Why did this get flagged? Seems very on topic for HN and is not blogspam or clickbait.
MortyWaves 10/30/2024|
I've been noticing quite poor moderation in the last few days, and more generally a lowering of tone and quality over the last year or so.
qudat 10/29/2024||
Here's my one-weird-trick:

On my VM, keep this running:

    while true; do ssh pipe.pico.sh sub deploy-app; docker compose pull && docker compose up -d; done
On my local machine:

    docker buildx build --push -t ghcr.io/abc/app .; ssh pipe.pico.sh pub deploy-app -e
https://pipe.pico.sh/
mdaniel 10/31/2024||
Heh, another feather in the "all you need is PostgreSQL" hat: if you already use PG, then LISTEN/NOTIFY can do what that external host does, and it doubles as a health check for the instance, since if it can't reach PG it's likely in a bad way anyway (situationally, of course, like all solutions coming from Internet commentary).
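Roughly, as a sketch only (a real listener would hold one connection open; this version misses notifications that land between iterations and wakes at most every 30 seconds):

    # on each VM: redeploy when a deploy_app notification shows up
    while true; do
        out=$(psql "$DATABASE_URL" -c "LISTEN deploy_app;" -c "SELECT pg_sleep(30);" 2>&1)
        echo "$out" | grep -q "Asynchronous notification" &&
            docker compose pull && docker compose up -d
    done

    # from the machine that pushed the new image:
    psql "$DATABASE_URL" -c "NOTIFY deploy_app;"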
yjftsjthsd-h 10/29/2024||
Why not just (from the local machine) run

  docker buildx build --push -t ghcr.io/abc/app . && ssh myvm 'cd /my/app/path && docker compose pull && docker compose up -d'

?
qudat 10/30/2024||
That's a really good point!

That might work if you have a single VM but it's a little more complicated when you have an app on multiple instances.

pipe is a multicast pubsub which means you can have many subscribers.

hellcow 10/30/2024||
Oh wow, my blog post. Hi all! I was wondering why I had a huge surge of readership today. I'll be in the comments.
ledgerdev 10/30/2024||
Thanks so much for this post and the other about provisioning. I'm going to try this exactly. Great suggestion about having Caddy just use `try_duration` to minimize downtime.
easterncalculus 10/30/2024||
Curious, how did you follow your readership? I thought bearblog didn't have analytics, but I guess you must be running something.
hellcow 10/30/2024||
They actually do have analytics, but you need to subscribe (it's a small service; if you can, pay a few bucks to support good people!).

I can see a count of readers per day on each post. It also shows counts of devices, browsers, countries, and referrers. Here's what it looks like: https://herman.bearblog.dev/public-analytics/

easterncalculus 10/30/2024||
Thanks!
torvald 10/29/2024||
This is a great approach if you have an idea or startup and just want to get things done -- one I would choose 10 out of 10 times when starting something new. You'll know when it’s time to move on, and you can likely postpone that step a bit longer as well.
oneplane 10/29/2024|
Or (on K8S) you set your drain time to 0, the surge to 9999999% and the PDB to "screw everything". Now your deployments take 2 seconds (the time to pull down your change and run it).

You also just lost all your guardrails and collaborative controls, as well as created a dependency on all engineers being equally capable.

In other words, unless you are DHH and don't have to scale (both in terms of workload and terms of company), this scenario doesn't apply in the real world.

EatFlamingDeath 10/29/2024|
Exactly. I mean, I understand that 45+ minutes to deploy something that takes less than a minute to build is obnoxious, but the pipeline is not always there only to build the app. Deploying in 10 seconds means no safeguards, and that you can send broken code to production. And pipelines are about automation too. Having a sane pipeline that will check formatting, linting, test, build and deploy quickly to a server is not that hard. Well, if you don't care about production being down for a couple of minutes, fine, do the "10-second deployment". But, at least for me, even in really small projects, it doesn't make any sense.
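For the "sane pipeline" part, a rough sketch (commands and names are placeholders; substitute whatever your stack uses):

    #!/bin/sh
    set -e                     # fail fast: any stage failing stops the pipeline
    npm run format:check       # formatting
    npm run lint               # linting
    npm test                   # tests
    docker build -t ghcr.io/acme/app:latest .
    docker push ghcr.io/acme/app:latest
    ssh deploy@prod 'docker compose pull && docker compose up -d'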
oneplane 10/30/2024||
Indeed. Same goes for perceived 'slowness'. At the point where you're deploying to production, it should already be fire-and-forget; your local development, tests, acceptance or whatever else you have should already have been done. There is no "gee I wonder what this looks like once deployed". Or at least, there shouldn't be... (another red flag for the 10 second crowd)