Posted by codesuki 1 day ago
- Ubuntu useradd command causes 30s+ hang [1]
- Ubuntu: sudo -u some-user unexpectedly ends up with environment variables for the runner [2]
They told you why it takes so long, no? The runners come by default with loads of programming languages installed (Rust, Haskell, Node, Python, .NET, etc.), so it sets all of that up per user add.
I would also question why you're adding users on an ephemeral runner.
We use runners for things that aren't quite "CI for software source code" and that do some "weird" stuff.
For instance, we require that new developer system setup be automated, so we have a set of scripts to do that and a CI runner that runs those scripts.
Microsoft being Microsoft, I guess. Making computing progressively less and less delightful, because your boss sees their buggy crap is right there, so why don't you use it?
That aside, GH Actions doesn’t seem any worse than GitLab. I forget why I stopped using CircleCI. Price maybe? I do remember liking the feature where you could enter the console of the CI job and run commands. That was awesome.
I agree though that yaml is not ideal.
- Intermediate tasks are cached in a Docker-like manner (content-addressed by filesystem and environment). Tasks in a CI pipeline build on previous ones by applying the filesystem of dependent tasks (AFAIU via overlayfs), so you never execute the same task twice. The most prominent example: a feature branch that is up to date with main passes CI on main as soon as it's merged, because every task on main is a cache hit from the CI run on the feature branch (see the sketch after this list).
- Failures: the UI surfaces failures to the top, and because of the caching semantics, you can re-run just the failed tasks without having to re-run their dependencies.
- Debugging: they expose a breakpoint (https://www.rwx.com/docs/rwx/remote-debugging) command that stops execution during a task and allows you to shell into the remote container for debugging, so you can debug interactively rather than pushing `env` and other debugging tasks again and again. And when you do need to push to test a fix, the caching semantics again mean you skip all the setup.
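As an illustration of the caching model (a minimal sketch of the general content-addressing technique, not RWX's actual implementation; all names and values are made up), a task's cache key can be derived from everything that could affect its output:

```python
# Sketch of content-addressed task caching (illustrative only).
# A task's key hashes the command, the environment, a digest of its input
# filesystem, and the keys of the tasks it depends on.
import hashlib
import json

def task_cache_key(command: str, env: dict[str, str],
                   fs_digest: str, dep_keys: list[str]) -> str:
    payload = json.dumps(
        {"command": command, "env": env, "fs": fs_digest, "deps": sorted(dep_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# A merge to main that matches the feature branch byte-for-byte produces the
# same digests, so every task is already a cache hit.
setup = task_cache_key("npm ci", {"NODE_ENV": "test"}, "sha256:abc...", [])
tests = task_cache_key("npm test", {"NODE_ENV": "test"}, "sha256:abc...", [setup])
```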
There's a whole lot of other stuff. You can generate the tasks in a CI pipeline from any programming language of your choice, the concurrency control supports multiple modes, and there's no need for `actions/cache` because of the caching semantics and the incremental caching feature (https://www.rwx.com/docs/rwx/tool-caches).
And I've never had a problem with the logs.
If you have a Dockerfile where a small change in your source results in one particular very large layer that has to be rebuilt, and then you want to fan out and run many parallel tests using that image, what actually happens when you try to run that new fat layer on a bunch of compute, and how is it better than the implied naive solution? That fat layer exists on a storage system somewhere and a bunch of compute nodes need to read it; what happens?
1. We don't gzip layers like Docker does. Gzip is really slow, and it's much slower than the network. Storage is cheap. So it's much faster to transmit uncompressed layers than to transmit compressed layers and decompress them (see the back-of-envelope sketch after this list).
2. We've heavily tuned our agents for pulling layers fast. Disk throughput and IOPS are really important, so we provision those higher than you typically would for running workloads in the cloud. When pulling layers we modify kernel parameters like dirty_ratio to values we've empirically found work well for layer pulls. We make sure we completely saturate our network bandwidth when pulling layers. And so on.
3. This third one is experimental and something we're actively working on improving, but we have our own underlying filesystem which lazily loads the files from a layer instead of pulling tons of (potentially unneeded) files up front. This is similar to AWS's [Seekable OCI](https://github.com/awslabs/soci-snapshotter) but tuned for our particular needs.
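To put some rough numbers on point 1 (all figures below are assumptions for illustration, not RWX measurements): once the network outruns single-stream gunzip, shipping uncompressed layers wins.

```python
# Back-of-envelope: transmit uncompressed vs. transmit gzipped then decompress.
# All figures are assumed for illustration.
layer_bytes = 2 * 1024**3    # 2 GiB uncompressed layer
network_Bps = 10e9 / 8       # 10 Gbit/s link, in bytes/s
gzip_ratio = 0.5             # assume 2:1 compression
gunzip_Bps = 150 * 1024**2   # ~150 MiB/s single-stream decompression

uncompressed_s = layer_bytes / network_Bps
gzipped_s = (layer_bytes * gzip_ratio) / network_Bps + layer_bytes / gunzip_Bps
print(f"uncompressed: {uncompressed_s:.1f}s, gzip path: {gzipped_s:.1f}s")
# With these numbers: ~1.7s uncompressed vs. ~14.5s for the gzip path.
```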
I've been slowly working on improving our documentation to explain these kinds of differentiators that our architecture and container runtime provide, but most of it is unpublished so far. We definitely need to do a much better job of explaining _how_ we are faster and better rather than just stating it :).
The other side of this is that we also made _building_ those layers much much faster. We blogged a little bit about it at https://www.rwx.com/blog/we-deleted-our-dockerfiles but just to hit some quick notes: in RWX you can vary the compute by task, and it turns out throwing a big machine at (e.g.) `npm install` is quite effective. Plus we make using an incremental cache very easy, and layers generated from an incremental cache are only the incremental parts, so they tend to be smaller. And we're a DAG, so you can parallelize your setup in a way that is very painful to do with Docker, even when using multi-stage builds. And our cache registry is global and very hard to mess up, whereas a lot of people misconfigure their Docker caches and have cache misses all over their docker builds. And we have miss-then-hit semantics for caching. Okay, I'm rambling now! But happy to go into more depth on any of this!
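To illustrate the DAG point (a generic sketch of the idea, not RWX's engine; task names and commands are made up), independent setup steps can run concurrently instead of one after another as in a Dockerfile:

```python
# Run a DAG of setup tasks, executing independent steps in parallel.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
import subprocess

# Hypothetical tasks and their dependencies.
deps = {
    "checkout": set(),
    "npm-install": {"checkout"},
    "pip-install": {"checkout"},
    "build-assets": {"npm-install"},
    "run-tests": {"build-assets", "pip-install"},
}
commands = {name: ["echo", name] for name in deps}  # stand-ins for real commands

ts = TopologicalSorter(deps)
ts.prepare()
with ThreadPoolExecutor() as pool:
    in_flight = {}
    while ts.is_active():
        # npm-install and pip-install become ready together and run in parallel.
        for task in ts.get_ready():
            in_flight[pool.submit(subprocess.run, commands[task], check=True)] = task
        done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
        for fut in done:
            fut.result()  # re-raise if the task failed
            ts.done(in_flight.pop(fut))
```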
Back in... I don't know, 2010, we used Jenkins. Yes, that Java thingy. It was kind of terrible (like every CI), but it had a "Warnings Plugin". It parsed the log output with regular expressions and presented new warnings and errors in a nice table. You could click on them and it would jump to the source. You could configure your own regular expressions (yes, then you have two problems, I know, but it still worked).
Then I had to switch to GitLab CI. Everyone was gushing about how great GitLab CI was compared to Jenkins. I tried to find out how to extract warnings and errors from the log: no chance. To this day, I cannot understand how everyone just settled on "Yeah, we just open thousands of lines of log output and scroll until we see the error". Like an animal. So of course, I did what anyone would do: write a little script that parses the logs and generates an HTML artifact. It's still not as good as the Warnings Plugin from Jenkins, but hey, it's something...
I'm sure eventually someone (or an AI) will figure this out again, and everyone will gush about how great that new thing is that actually parses the logs and lets you jump directly to the source...
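For illustration, a minimal sketch of that kind of script (assuming GCC/Clang-style `file:line:col: warning: message` diagnostics; the pattern would need to match your actual toolchain, and this is not the original script):

```python
# Parse a CI log for compiler warnings/errors and emit an HTML table artifact.
import html
import re

# Assumed GCC/Clang-style diagnostics: path:line:col: error|warning: message
PATTERN = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):\d+: (?P<kind>error|warning): (?P<msg>.*)$"
)

def log_to_html(log_path: str, out_path: str) -> None:
    rows = []
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = PATTERN.match(line.strip())
            if m:
                rows.append(
                    f"<tr><td>{html.escape(m['kind'])}</td>"
                    f"<td>{html.escape(m['file'])}:{m['line']}</td>"
                    f"<td>{html.escape(m['msg'])}</td></tr>"
                )
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("<table><tr><th>Kind</th><th>Location</th><th>Message</th></tr>")
        out.writelines(rows)
        out.write("</table>")
```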
Don't get me wrong: Jenkins was and probably still is horrible. I don't want to go back. However, it had some pretty good features I still miss to this day.
My browser can handle tens of thousands of lines of logs, and has Ctrl-F that's useful for 99% of the searches I need. A better runner could just dump the logs and let the user take care of them.
Why most web development devolved into React-like "you can't search for what you can't see" interfaces is a mystery.
I haven't used as many CI systems as the author, but I've used GH Actions, GitLab CI, and CodeBuild, and I've spent a lot of time with Jenkins.
I only touched Buildkite briefly, six years ago; at the time it seemed a little underwhelming.
The CI system I enjoyed the most was TeamCity. Sadly, I only used it at one job for about a year, but it felt like something built by a competent team.
I'm curious what people who have used it over a longer time period think of it.
I feel like it should be more popular.
But I don't know about a competent team; reading their release notes always got me thinking, "How can anyone write code where these bugs are even possible?" But I guess that's why many companies just write nonsense release notes today: to hide their incompetence ;)
Why do you consider TeamCity legacy? The latest release was just 2 months ago: https://www.jetbrains.com/help/teamcity/what-s-new-in-teamci...
>To make TeamCity more approachable for everyone, we’ve launched the pipelines initiative, and are investing heavily in reimagining the familiar UX. Complementing these efforts, we are excited to introduce the TeamCity AI Assistant.
Looks like it's under active development.
All of my customers are on Bitbucket.
One of them does not even use CI. We run tests locally and deploy from a self-hosted TeamCity instance. It's a Django app with server-side HTML generation, so the deploy is copying files to the server and restarting. We implemented a Capistrano-like system in bash, and it's been working since before Covid. No problems.
The other one uses Bitbucket Pipelines to run tests after git pushes on the preproduction and production branches and to deploy to those systems. They use Capistrano because it's a Rails app (with a Vue frontend). For some reason the integration tests don't run reliably on the CI instances or on Macs, so we run them only on my Linux laptop. It's been in production since 2021.
A customer I'm not working with anymore used Travis, and another one used a CI I don't remember. They also ran a build there because they were using Elixir with Phoenix, so we were creating a release and deploying it, not merely copying files. That was the most unpleasant deploy system of the bunch: a lot of wasted time from push to deploy.
In all of those cases the logs are inevitably long, but they don't crash the browser.