Posted by codesuki 1 day ago
Over the years CI tools have gone from specialist to generalist. Jenkins was originally very good at building Java projects and not much else, Travis had explicit steps for Rails projects, CircleCI was similarly like this back in the day.
This was a dead end. CI is not special. We realised as a community that in fact CI jobs were varied, that encoding knowledge of the web framework or even language into the CI system was a bad idea, and CI systems became _general workflow orchestrators_, with some logging and pass/fail UI slapped on top. This was a good thing!
I orchestrated a move off CircleCI 2 to GitHub Actions, precisely because CircleCI botched the migration from the specialist to generalist model, and we were unable to express a performant and correct CI system in their model at the time. We could express it with GHA.
GHA is not without its faults by any stretch, but... the log browser? So what, just download the file, at least the CI works. The YAML? So it's not-quite-yaml, they weren't the first or last to put additional semantics on a config format, all CI systems have idiosyncrasies. Plugins being Docker images? Maybe heavyweight, but honestly this isn't a bad UX.
What does matter? Owning your compute? Yeah! This is an important one, but you can do that on all the major CI systems, it's not a differentiator. Dynamic pipelines? That's really neat, and a good reason to pick Buildkite.
My takeaway from my experience with these platforms is that Actions is _pretty good_ in the ways that truly matter, and not a problem in most other ways. If I were starting a company I'd probably choose Buildkite, sure, but for my open source projects, Actions is good.
In game development we care a lot about build systems- and annoyingly, we have vanishingly few companies coming to throw money at our problems.
The few that do, charge a kings ransom (Incredibuild). Our build times are pretty long, and minimising them is ideal.
If, then, your build system does not understand your build-graph then you’re waiting even longer for builds or you’re keeping around incremental state and dirty workspaces (which introduces transient bugs, as now the compiler has to do the hard job of incrementally building anyway).
So our build systems need to be acutely aware of the intricacies of how the game is built (leading to things like UnrealEngine Horde and UBA).
If we used a “general purpose” approach we’d be waiting in some cases over a day for a build, even with crazy good hardware.
Game dev has a serious case of NIH - sometimes for good reasons but in lots of cases it’s because things have been set up in a way that makes changing that impractical. Using UBA as an example - FastBuild, Incredibuild, SNDBS Sccache all exist as either caching or distribution systems. Compiling a game engine isn’t much different to compiling a web browser (which ninja was written for).
I’ve worked at two game studios where we’ve used general purpose CI systems and been able to push out builds in < 15 minutes. Horde and UBA exist to handle how epic are doing things internally, rather than as an inherent requirement on how to use the tools effectively. If you don’t have the same constraints as developing Unreal Engine (and Fortnite) then you don’t have the same needs.
(I worked for epic when horde came online, but don’t any more).
WRT github actions... I agree with OOP, they leave much to be desired, esp when working on high-velocity work. My ci/cd runs locally first and then GHA is (slower) verification, low-noise, step.
The systems I like to design that use GHA usually only use the good parts. GitHub is a fine events dispatcher, for instance, but a very bad workflow orchestrator. So delegate that to a system that is good at that instead
They answer your "so what" quite directly:
>> Build logs look like terminal output, because they are terminal output. ANSI colors work. Your test framework’s fancy formatting comes through intact. You’re not squinting at a web UI that has eaten your escape codes and rendered them as mojibake. This sounds minor. It is not minor. You are reading build logs dozens of times a day. The experience of reading them matters in the way that a comfortable chair matters. You only notice how much it matters after you’ve been sitting in a bad one for six hours and your back has filed a formal complaint.
Having to look mentally ignore ANSI escape codes in raw logs (let alone being unable to unable to search for text through them) is annoying as hell, to put it mildly.
And how do you expect people to even know about this workaround, and how to search for text with it? It's not like the GitHub UI even tells you. Not everyone is a Linux pro.
Nobody is saying it's impossible to get past the ANSI escape codes. People eventually figure out ways to do it. The claim is how much of your time do you want to lose to friction in that process, which you have to repeated frequently. It's insane for it to be this hard.
You have a tool here, which is noted elsewhere: it's "less --raw". Also there's another tool which analyzes your logs and color codes them: "lnav".
lnav is incredibly powerful and helps understanding what's happening, when, where. It can also tail logs. Recommended usage is "your_command 2>&1 | lnav -t".
Except for GitHub charging you monthly to run your own CI jobs on your own hardware.
I start with a Makefile. The Makefile drives everything. Docker (compose), CI build steps, linting, and more. Sometimes a project outgrows it; other times it does not.
But it starts with one unitary tool for triggering work.
And, many of them lack some of the basic features that 'make' has had for half a century.
I use Fastlane extensively on mobile, as it reduces boilerplate and gives enough structure that the inherent risk of depending on a 3rd-party is worth it. If all else fails, it's just Ruby, so can break out of it.
What you're saying is essentially ”Just Write Bash Scripts”, but with an extra layer of insanity on top. I hate it when I encounter a project like this.
The only sane use for Makefiles is running a few simple commands in independent targets, but do you really need make then?
(The argument that "everyone has it installed" is moot to me. I don't.)
I had this fight for some years in my present work and was really nagging in the beginning about the path we were getting into by not allowing the developers to run the full (or most) of the pipeline in their local machines… the project decided otherwise and now we spend a lot of time and resources with a behemoth of a CI infrastructure because each MR takes about 10 builds (of trial and error) in the pipeline to be properly tested.
Yes, there will always be special exemptions: they suck, and we suffer as developers because we cannot replicate a prod-like environment in our local dev environment.
But I laugh when I join teams and they say that "our CI servers" can run it but our shitty laptops cannot, and I wonder why they can't just... spend more money on dev machines? Or perhaps spend some engineering effort so they work on both?
My experience has been that the problems in CI systems come from exactly these differences “works on my machine” followed by “oops, I guess the build machine doesn’t have access to that random DB”, or “docker push fails in our CI environment because credentials/permissions, but it works when I run it just on my machine”
In my experience at work. Anything that demands too much though, collaboration between teams and enforcing hard development rules, is always an unachievable dream in a medium to big project.
Note, that I don't think it's technically unachievable (at all). I just accepted that it's culturally (as in work culture) unachievable.
It's just that management don't see it as worth it, in terms of development cost and limitations it would introduce in the current workflow, to enable the developers to do that.
Where did i mention security?
> in terms of development cost and limitations it would introduce in the current workflow
Well said. "in the current workflow". As in, not "in the development process". Those are unrelated items.
If your CI invocations are anything more than running a script or a target on a build tool (make, etc.) where the real build/test steps exist and can be run locally on a dev workstation, you're making the CI system much more complex than it needs to be.
CI jobs should at most provide an environment and configuration (credentials, endpoints, etc.), as a dev would do locally.
This also makes your code CI agnostic - going between systems is fairly trivial as they contain minimal logic, just command invocations.
It's correct to design CI pipelines in order to offload much of the logic to subsystems, but pipelines will eventually grow in complexity and the CI config system should be designed in order not to get in the way. I don't know buildkite, but Gitlab CI is the best I know. Template and job composition works brilliantly, top-level object being the job and not the stage result in flat, easier to read config files and the packed features are really good, but it's hard to debug, the conditional logic sometimes fails in unexpected ways, it's exhausting to use the predefined variables reference and the permission system for multi project pipelines is abysmal.
I'd argue that this also dovetails very nicely with having common, shared invocations - if you can run "make test" in any repo and have it work, that makes CI code reuse even easier.
As for the complexity comments, that complexity has to go somewhere, and you should look for how to best factor the system so it's debuggable. Sometimes this may mean restructuring how your code is factored or deployed or has failure tolerance so it's easier to test, and this should be thought of as an architecture task early on.
These tool fails are as a consequence of a failure of proper policy.
Tooling and Methodology!
Here’s the thing: build it first, then optimize it. Same goes for compile/release versus compile/debug/test/hack/compile/debug/test/test/code cycles.
That there is not a big enough distinction between a development build and a release build is a policy mistake, not a tooling ‘issue’.
Set things up properly and anyone pushing through git into the tooling pipeline are going to get their fingers bent soon enough, anyway, to learn how the machine mangles digits.
You can adopt this policy of environment isolation with any tool - it’s a method.
Tooling and Methodology!
* you have any remote resources that are needed during build
* for some reason your company doesn't have standardize build images
There are a bunch of failures of a build that have nothing to do with how your build itself works. Asking teams to rebuild all that orchestration logic into their builds is madness. We shouldn’t ask teams to have to replicate tests for features that are in the CI they use.
Integration of code quality gates, documentation checks, linting, cross architecture builds, etc.
Most of this can be solved by doing the builds in a docker image that we also maintain ourselves. Then what remains is the interaction between the ci config for matrices, the tasks/actions to report back quality metrics, the integration with keyvaults to obtain deploy time secrets, etc.
Then there are the soft failures, missing a cache key causing many packages to be downloaded over and over again, or the same for the docker base images, etc.
We fix this for our 1000+ microservices, across hundreds of teams by maintaining a template that all services are mandated to use. It removes whole classes of errors and introduces whatever shenanigans we introduce. But it works for us.
If GHA, Azure Pipelines, etc., would provide a way of running builds locally that would speed up our development greatly.
Until then we have created linting based on CUE to parse the various yamls, resolving references to keystores, key ids, templates, etc., and making sure they exist. I think this is generic enough to open source even.
In the past week I have seen:
- actions/checkout inexplicably failing, sometimes succeeding on 3rd retry (of the built-in retry logic)
- release ci jobs scheduling _twice_, causing failures, because ofc the release already exists
- jobs just not scheduling. Sometimes for 40m.
I have been using it actively for a few years and putting aside everything the author is saying, just the base reliability is going downhill.
I guess zig was right. Too bad they missed builtkite, Codeberg hasn't been that reliable or fast in my experience.
All of that on top of a rock-solid system for bringing your own runner pools which lets you use totally different machine types and configurations for each type of CI job.
Highly, highly recommend.
But yes, Groovy is a much better language for defining pipelines than YAML. Honestly pretty much any programming language at all is better than YAML. YAML is fine for config files, but not for something as complex as defining a CI pipeline.
Like just use an actual programming language!
imo top 10 best admin/devs free software written in past 25 years.
Jenkins is probably a bit like Java, technically it is fine. The problem is really where/who typically uses it and as there is so much freedom it is really easy to make a monster. Where as for Go it is a lot harder to write terrible unmaintainable code compared to Java.
My pet peeve with Github Actions was that if I want to do simple things like make a "release", I have to Google for and install packages from internet randos. Yes, it is possible this rando1234 is a founding github employee and it is all safe. But why does something so basic need external JS? packages?