Posted by mpweiher 10/28/2024
If you're hitting this, you need to look at the service as the problem, not blame the infra layer.
k8s can absolutely roll out a deployment in <60s, if not <10s. The bottleneck I see, when it is slow, is slow app termination. If your service takes 5 minutes to terminate, it isn't going to matter what the infrastructure layer does. Sometimes this is failing to handle SIGTERM (resulting in k8s having to fall back on timing out & SIGKILL'ing) … but sometimes the app is just slow to terminate, 'cause bugs. But it's those bugs that should get fixed.
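To make the SIGTERM point concrete, here is a minimal sketch of a handler, assuming a single-threaded Python service with a simple work loop. `ShutdownFlag`, `serve`, and `handle_one_unit` are names made up for this illustration, not from any particular framework.

```python
import signal
import time


class ShutdownFlag:
    """Records that SIGTERM arrived so the main loop can exit cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only set a flag here; the loop below decides when to stop.
        self.requested = True


def serve(handle_one_unit, flag, idle_seconds=0.1):
    """Process units of work until SIGTERM is seen, then return promptly."""
    handled = 0
    while not flag.requested:
        handle_one_unit()  # hypothetical: one request, one queue item, etc.
        handled += 1
        time.sleep(idle_seconds)
    # Flush buffers and close connections here, then let the process exit.
    return handled
```

With something like this, the delay between SIGTERM and exit is bounded by one unit of work plus `idle_seconds`, instead of by the timeout before SIGKILL.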
You can somewhat work around it by setting the surge to 100%. (And … even if you have a fast app, 100% surge might be a good idea, too. As always, it depends. If surging is going to eat up all the available RAM or CPU … maybe not.)
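A rough way to sanity-check the "will surge eat all the RAM" question, as a back-of-the-envelope sketch. It assumes a percentage surge is rounded up to a whole number of pods (which is how k8s treats a percentage maxSurge) and only looks at memory. The function names and the numbers in the usage lines are invented for illustration.

```python
def peak_pods(replicas, max_surge_percent):
    """Pods that can exist at once during a rollout: old ones plus the surge."""
    surge = -(-replicas * max_surge_percent // 100)  # ceiling division
    return replicas + surge


def surge_fits(replicas, max_surge_percent, mem_per_pod_mib, free_mib):
    """True if the extra surge pods fit in the memory that is currently free."""
    extra_pods = peak_pods(replicas, max_surge_percent) - replicas
    return extra_pods * mem_per_pod_mib <= free_mib


# Hypothetical numbers: 4 replicas at 512 MiB each, 100% surge.
print(peak_pods(4, 100))               # 8
print(surge_fits(4, 100, 512, 1024))   # False: needs 2048 MiB extra
print(surge_fits(4, 100, 512, 4096))   # True
```

It ignores CPU, per-node placement, and anything else the scheduler cares about, so treat it as a first filter, not an answer.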
And most importantly: the underlying principles guiding k8s's behavior are going to apply equally to a shell script. app.service doesn't respond to SIGTERM[1]? You're going to have to decide what to do. Surge or not? Same thing. Potential for surge to result in resource depletion…?
> bash script
A service's program/code should generally be owned by root:. A service (generally) does not need the ability to re-write its own code.
> Only a few at my company understand Docker's layer caching + mounts and Github Actions' yaml-based workflow configuration in any detail.
… the working knowledge of either of those two things is not rocket science. The docker caching is probably the worse of the two; but you only need to understand it to speed up builds.
While GHA's YAML isn't pretty … it's also hardly complex. And for the most part, if your action simply defers to a script in the repository (e.g., I keep these in ci/), then it's mostly reproducible locally, too. (And there are some tools out there to run a GHA workflow locally, too, if you need more completeness than "just run the same script as the workflow".)
> Show you how I provision a Debian server using Ansible
I have spent enough years with Ansible to know all the problems with it, and I'd rather not go back to it.
([1] although a vanilla systemd service is going to have an "advantage" in that the default SIGTERM handling is different from a container. So it might look faster, in that a buggy app with no SIGTERM handler will die instantly … but it's probably still a bug, as ax'ing the service is probably also just dropping requests on the floor.)