Posted by codesuki 1 day ago
- Ubuntu useradd command causes 30s+ hang [1]
- Ubuntu: sudo -u some-user unexpectedly ends up with environment variables for the runner [2]
They told you why it takes so long, no? The runners come by default with loads of programming languages installed (Rust, Haskell, Node, Python, .NET, etc.), so it sets all of that up per user add.
I would also question why you're adding users on an ephemeral runner.
We use runners for things that aren't quite "CI for software source code" and that do some "weird" stuff.
For instance, we require that new developer system setup be automated, so we have a set of scripts to do that and a CI runner that runs those scripts.
Microsoft being Microsoft, I guess. Making computing progressively less and less delightful, because your boss sees their buggy crap is right there, so why don't you use it?
That aside, GH Actions doesn’t seem any worse than GitLab. I forget why I stopped using CircleCI. Price maybe? I do remember liking the feature where you could enter the console of the CI job and run commands. That was awesome.
I agree though that yaml is not ideal.
- Intermediate tasks are cached in a Docker-like manner (content-addressed by filesystem and environment). Tasks in a CI pipeline build on previous ones by applying the filesystem of dependent tasks (AFAIU via overlayfs), so you never execute the same task twice. The most prominent example: a feature branch that is up to date with main passes CI on main as soon as it's merged, because every task on main is a cache hit from the CI run on the feature branch (see the sketch after this list).
- Failures: the UI surfaces failures to the top, and because of the caching semantics, you can re-run just the failed tasks without having to re-run their dependencies.
- Debugging: they expose a breakpoint (https://www.rwx.com/docs/rwx/remote-debugging) command that stops execution during a task and allows you to shell into the remote container for debugging, so you can debug interactively rather than pushing `env` and other debugging tasks again and again. And when you do need to push to test a fix, the caching semantics again mean you skip all the setup.
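As an illustration of the caching model (a minimal sketch of the general content-addressing technique, not RWX's actual implementation; all names and values are made up), a task's cache key can be derived from everything that could affect its output:

```python
# Sketch of content-addressed task caching (illustrative only).
# A task's key hashes the command, the environment, a digest of its input
# filesystem, and the keys of the tasks it depends on.
import hashlib
import json

def task_cache_key(command: str, env: dict[str, str],
                   fs_digest: str, dep_keys: list[str]) -> str:
    payload = json.dumps(
        {"command": command, "env": env, "fs": fs_digest, "deps": sorted(dep_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# A merge to main that matches the feature branch byte-for-byte produces the
# same digests, so every task is already a cache hit.
setup = task_cache_key("npm ci", {"NODE_ENV": "test"}, "sha256:abc...", [])
tests = task_cache_key("npm test", {"NODE_ENV": "test"}, "sha256:abc...", [setup])
```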
There's a whole lot of other stuff. You can generate the tasks in a CI pipeline from any programming language of your choice, the concurrency control supports multiple modes, and there's no need for `actions/cache` because of the caching semantics and the incremental caching feature (https://www.rwx.com/docs/rwx/tool-caches).
And I've never had a problem with the logs.
If you have a Dockerfile where a small change in your source results in one particular very large layer that has to be rebuilt, and then you want to fan out and run many parallel tests using that image, what actually happens when you try to run that new fat layer on a bunch of compute, and how is it better than the implied naive solution? That fat layer exists on a storage system somewhere and a bunch of compute nodes need to read it; what happens?
1. We don't gzip layers like Docker does. Gzip is really slow, and it's much slower than the network. Storage is cheap. So it's much faster to transmit uncompressed layers than to transmit compressed layers and decompress them (see the back-of-envelope sketch after this list).
2. We've heavily tuned our agents for pulling layers fast. Disk throughput and IOPS are really important, so we provision those higher than you typically would for running workloads in the cloud. When pulling layers we modify kernel parameters like dirty_ratio to values we've empirically found work well for layer pulls. We make sure we completely saturate our network bandwidth when pulling layers. And so on.
3. This third one is experimental and something we're actively working on improving, but we have our own underlying filesystem which lazily loads the files from a layer instead of pulling tons of (potentially unneeded) files up front. This is similar to AWS's [Seekable OCI](https://github.com/awslabs/soci-snapshotter) but tuned for our particular needs.
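To put some rough numbers on point 1 (all figures below are assumptions for illustration, not RWX measurements): once the network outruns single-stream gunzip, shipping uncompressed layers wins.

```python
# Back-of-envelope: transmit uncompressed vs. transmit gzipped then decompress.
# All figures are assumed for illustration.
layer_bytes = 2 * 1024**3    # 2 GiB uncompressed layer
network_Bps = 10e9 / 8       # 10 Gbit/s link, in bytes/s
gzip_ratio = 0.5             # assume 2:1 compression
gunzip_Bps = 150 * 1024**2   # ~150 MiB/s single-stream decompression

uncompressed_s = layer_bytes / network_Bps
gzipped_s = (layer_bytes * gzip_ratio) / network_Bps + layer_bytes / gunzip_Bps
print(f"uncompressed: {uncompressed_s:.1f}s, gzip path: {gzipped_s:.1f}s")
# With these numbers: ~1.7s uncompressed vs. ~14.5s for the gzip path.
```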
I've been slowly working on improving our documentation to explain these kinds of differentiators that our architecture and container runtime provide, but most of it is unpublished so far. We definitely need to do a much better job of explaining _how_ we are faster and better rather than just stating it :).
The other side of this is that we also made _building_ those layers much much faster. We blogged a little bit about it at https://www.rwx.com/blog/we-deleted-our-dockerfiles but just to hit some quick notes: in RWX you can vary the compute by task, and it turns out throwing a big machine at (e.g.) `npm install` is quite effective. Plus we make using an incremental cache very easy, and layers generated from an incremental cache are only the incremental parts, so they tend to be smaller. And we're a DAG, so you can parallelize your setup in a way that is very painful to do with Docker, even when using multi-stage builds. And our cache registry is global and very hard to mess up, whereas a lot of people misconfigure their Docker caches and have cache misses all over their docker builds. And we have miss-then-hit semantics for caching. Okay, I'm rambling now! But happy to go into more depth on any of this!
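To illustrate the DAG point (a generic sketch of the idea, not RWX's engine; task names and commands are made up), independent setup steps can run concurrently instead of one after another as in a Dockerfile:

```python
# Run a DAG of setup tasks, executing independent steps in parallel.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
import subprocess

# Hypothetical tasks and their dependencies.
deps = {
    "checkout": set(),
    "npm-install": {"checkout"},
    "pip-install": {"checkout"},
    "build-assets": {"npm-install"},
    "run-tests": {"build-assets", "pip-install"},
}
commands = {name: ["echo", name] for name in deps}  # stand-ins for real commands

ts = TopologicalSorter(deps)
ts.prepare()
with ThreadPoolExecutor() as pool:
    in_flight = {}
    while ts.is_active():
        # npm-install and pip-install become ready together and run in parallel.
        for task in ts.get_ready():
            in_flight[pool.submit(subprocess.run, commands[task], check=True)] = task
        done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
        for fut in done:
            fut.result()  # re-raise if the task failed
            ts.done(in_flight.pop(fut))
```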
Back in... I don't know, 2010, we used Jenkins. Yes, that Java thingy. It was kind of terrible (like every CI), but it had a "Warnings Plugin". It parsed the log output with regular expressions and presented new warnings and errors in a nice table. You could click on them and it would jump to the source. You could configure your own regular expressions (yes, then you have two problems, I know, but it still worked).
Then I had to switch to GitLab CI. Everyone was gushing about how great GitLab CI was compared to Jenkins. I tried to find out how to extract warnings and errors from the log: no chance. To this day, I cannot understand how everyone just settled on "Yeah, we just open thousands of lines of log output and scroll until we see the error". Like an animal. So of course, I did what anyone would do: write a little script that parses the logs and generates an HTML artifact. It's still not as good as the Warnings Plugin from Jenkins, but hey, it's something...
I'm sure eventually someone (or an AI) will figure this out again, and everyone will gush about how great that new thing is that actually parses the logs and lets you jump directly to the source...
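For illustration, a minimal sketch of that kind of script (assuming GCC/Clang-style `file:line:col: warning: message` diagnostics; the pattern would need to match your actual toolchain, and this is not the original script):

```python
# Parse a CI log for compiler warnings/errors and emit an HTML table artifact.
import html
import re

# Assumed GCC/Clang-style diagnostics: path:line:col: error|warning: message
PATTERN = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):\d+: (?P<kind>error|warning): (?P<msg>.*)$"
)

def log_to_html(log_path: str, out_path: str) -> None:
    rows = []
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = PATTERN.match(line.strip())
            if m:
                rows.append(
                    f"<tr><td>{html.escape(m['kind'])}</td>"
                    f"<td>{html.escape(m['file'])}:{m['line']}</td>"
                    f"<td>{html.escape(m['msg'])}</td></tr>"
                )
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("<table><tr><th>Kind</th><th>Location</th><th>Message</th></tr>")
        out.writelines(rows)
        out.write("</table>")
```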
Don't get me wrong: Jenkins was and probably still is horrible. I don't want to go back. However, it had some pretty good features I still miss to this day.
My browser can handle tens of thousands of lines of logs, and has Ctrl-F that's useful for 99% of the searches I need. A better runner could just dump the logs and let the user take care of them.
Why most web development devolved into React-like "you can't search for what you can't see" interfaces is a mystery.
I haven't used as many CI systems as the author, but I've used GH Actions, GitLab CI, and CodeBuild, and I've spent a lot of time with Jenkins.
I only touched Buildkite briefly, six years ago; at the time it seemed a little underwhelming.
The CI system I enjoyed the most was TeamCity. Sadly, I only used it at one job for about a year, but it felt like something built by a competent team.
I'm curious what people who have used it over a longer time period think of it.
I feel like it should be more popular.
But I don't know about a competent team; reading their release notes always got me thinking, "How can anyone write code where these bugs are even possible?" But I guess that's why many companies just write nonsense release notes today: to hide their incompetence ;)
Why do you consider TeamCity legacy? The latest release was just 2 months ago: https://www.jetbrains.com/help/teamcity/what-s-new-in-teamci...
>To make TeamCity more approachable for everyone, we’ve launched the pipelines initiative, and are investing heavily in reimagining the familiar UX. Complementing these efforts, we are excited to introduce the TeamCity AI Assistant.
Looks like it's under active development.
All of my customers are on Bitbucket.
One of them does not even use CI. We run tests locally and deploy from a self-hosted TeamCity instance. It's a Django app with server-side HTML generation, so the deploy is copying files to the server and restarting. We implemented a Capistrano-like system in bash, and it's been working since before Covid. No problems.
The other one uses Bitbucket Pipelines to run tests after git pushes on the preproduction and production branches and to deploy to those systems. They use Capistrano because it's a Rails app (with a Vue frontend). For some reason the integration tests don't run reliably on the CI instances or on Macs, so we run them only on my Linux laptop. It's been in production since 2021.
A customer I'm not working with anymore used Travis, and another one used a CI I don't remember. They also ran a build there because they were using Elixir with Phoenix, so we were creating a release and deploying it, not merely copying files. That was the most unpleasant deploy system of the bunch: a lot of wasted time from push to deploy.
In all of those cases the logs are inevitably long, but they don't crash the browser.