Posted by richtr 8 hours ago
GHA can’t even be called Swiss cheese anymore; it’s so much worse than that. Major overhauls are needed. The best we’ve got is Immutable Releases, which are opt-in on a per-repository basis.
You can pin action versions to their commit hash. Some might say this is a best practice for now. It looks like this, where the comment records which tag the hash is supposed to point to.
Old --> uses: actions/checkout@v4
New --> uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4
There is a tool to sweep through your repo and automate this: https://github.com/mheap/pin-github-action

Like: https://github.com/actions/checkout/tree/11bd71901bbe5b1630c...
So I'm pretty sure that for the same commit hash, I'll be executing the same content.
This article[0] gives a good overview of the challenges, and also has a link to a concrete attack where this was exploited.
[0]: https://nesbitt.io/2025/12/06/github-actions-package-manager...
TravisCI
Jenkins
scripts dir
Etc
Dependabot, too.
The main desiderata with these kinds of action-pinning tools are that they (1) leave a tag comment, (2) leave that comment in a format that Dependabot and/or Renovate understands for bumping purposes, and (3) actually put the full tag in the comment, rather than the cutesy short tag that GitHub encourages people to make mutable (v4.x.y instead of v4).
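To make (1)–(3) concrete, here's an illustrative checker (nothing official, names and regexes invented here) for what a "good" pinned `uses:` line looks like: a full 40-hex-character commit SHA, plus a full-version tag in a trailing comment that bump tools can parse.

```python
import re

# Assumption: a "good" pin is a full 40-hex SHA plus a full-tag comment
# (e.g. "# v4.2.2") that tools like Dependabot/Renovate can parse for bumps.
PINNED = re.compile(
    r"uses:\s*[\w.-]+/[\w.-]+@([0-9a-f]{40})\s*#\s*(v\d+\.\d+\.\d+)"
)

def check_uses_line(line: str) -> list[str]:
    """Return a list of problems with a workflow `uses:` line."""
    problems = []
    if "uses:" not in line:
        return problems  # not an action reference at all
    m = re.search(r"@([^\s#]+)", line)
    ref = m.group(1) if m else ""
    if not re.fullmatch(r"[0-9a-f]{40}", ref):
        problems.append("not pinned to a full commit SHA")
    if not PINNED.search(line):
        problems.append("missing a full-version tag comment (e.g. '# v4.2.2')")
    return problems
```

So `uses: actions/checkout@v4` gets flagged twice, while a SHA-pinned line with a full `# v4.x.y` comment passes clean.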
[1]: https://github.com/suzuki-shunsuke/pinact
Perhaps mixing the CI with the CD made that worse, because deployment and delivery usually have complexities of their own. Back in the day you'd probably use Jenkins for the delivery piece and the E2E nightlies, and use something more lightweight for running your tests and linters.
For that part I feel like all you need, really, is to be able to run a suite of well-structured shell scripts. Maybe if you're in git you follow its hooks convention and execute scripts in a directory named after the repo event or something. Forget about creating reusable 'actions' which depend on running untrusted code.
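Purely as a sketch of that hooks-convention idea (directory layout and names invented here): a runner that executes every script under `<hooks_dir>/<event>.d/` in sorted order, git-hooks style, stopping at the first failure.

```python
import subprocess
from pathlib import Path

def run_event(hooks_dir: str, event: str) -> bool:
    """Run every script in <hooks_dir>/<event>.d/ in sorted (lexical)
    order, like git hooks. Stop at the first non-zero exit code."""
    scripts = sorted(Path(hooks_dir, f"{event}.d").glob("*"))
    for script in scripts:
        result = subprocess.run([str(script)])
        if result.returncode != 0:
            print(f"{script.name} failed with exit code {result.returncode}")
            return False
    return True
```

The numeric prefixes on the scripts (`01-lint.sh`, `02-test.sh`, ...) give you ordering for free, and everything that runs is code you committed to your own repo.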
Provide some baked-in utilities to help with reporting status, caching, saving JUnit files and what have you.
The only thing that remains is setting up a base image with all your tooling in it. Docker does that, and is probably the only bit where you'd have to accept relying on untrusted third parties, unless you can scan them and store your own cached version of it.
I make it sound simpler than it is, but for some reason we accepted distributed, YAML-based balls of mud as the system that is critical to deploying our code and that has unsupervised access to almost everything. And people are now hooking AI agents into it.
These reusable actions are nothing but a convenience feature. This discussion isn't much different from any other supply chain, dependency, or packaging system vulnerability, such as NPM, etc.
One slight disclaimer here is the ability of someone to run their own updated copy of an action when making a PR, which could be used to exfiltrate secrets. That one is NOT related to being dependent on unverified actions, though.
(Re-reading this, it came across as more harsh than I intended... my bad on that. But am I missing something, or is this the same issue that every open-source user-submitted package repository runs into?)
[1] https://app.radicle.xyz/nodes/radicle.dpc.pw/rad%3Az2tDzYbAX...
That's a high bar though. Few things are better than Swiss cheese.
* no ff merge support
* no sha2 commit hash support
ff merge support though....what a world that would be
0_o
I never use GitHub Copilot; it does go down a lot, if their status page is to be believed; I don't really care when it goes down, because it going down doesn't bring down the rest of GitHub. I care about GitHub's uptime ignoring Copilot. Everyone's slice of what they care about is a little different, so the only correct way to speak about GitHub's uptime is to be precise, and probably to focus on the core stuff that tons of people care about and that's been struggling lately: core git operations, website functionality, API access, Actions, etc.
This is definitely true.
At the same time, none of the individual services has hit 3x9 uptime in the last 90 days [0], which is their Enterprise SLA [1] ...
> "Uptime" is the percentage of total possible minutes the applicable GitHub service was available in a given calendar quarter. GitHub commits to maintain at least 99.9% Uptime for the applicable GitHub service.
[0]: https://mrshu.github.io/github-statuses/
[1]: https://github.com/customer-terms/github-online-services-sla
(may have edited to add links and stuff, can't remember, one of those days)
The linked document in my previous comment has more detail.
They're not even struggling to get their average to three 9s, they're struggling to get ANY service to three 9s. They're struggling to get many services to two 9s.
Copilot may be the least stable at one 9, but the services I would consider most critical (Git & Actions) are also at one 9.
On the other hand, the baseline minimal GitHub Enterprise plan with no extra features (no Copilot, GHAS, etc.) runs a medium-sized company $1m+ per annum, not including pay-per-use extras like CI minutes. As an individual I'm not the target audience for that invoice, but I can envisage whoever is footing it wanting a couple of 9s to go with it. As a treat.
Why defend a company that clearly doesn't care about its customers and sees them as a money spigot to suck dry?
The five nines tech people usually talk about is a fiction; the only place where the measure is really real is in networking, specifically service provider networking. Otherwise it's often just various ways of cleverly slicing the data to keep the status screen green. A dead giveaway is a gander at the SLAs and all the ways they're basically worthless for almost everyone in the space.
See also all of the "1 hour response time" SLAs from open source wrapper companies. Yes, within one hour they will create a case and give you a case ID. But that's not how they describe it.
Once you dig into the details, what does it mean to have 5 9s? Some systems have a huge surface area of calls and views. If the main web page is down but the entire backend API is still responding fine, is that "down"? Well, sorta. Or if one misc API that some users only call during onboarding is down, does that count? Well, technically yes.
It depends on your users and what path they use and what is the general path.
Then add in response times to those down items. Those are usually made up too.
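As a toy illustration of how a headline number hides this (all figures invented): weight each path's availability by its traffic share, and a badly broken onboarding API nearly disappears from the aggregate.

```python
# Illustrative only: a headline uptime figure can look fine while a path
# some users depend on is effectively broken. All numbers are made up.
def weighted_availability(paths: dict[str, tuple[float, float]]) -> float:
    """paths maps name -> (share_of_traffic, availability)."""
    return sum(share * avail for share, avail in paths.values())

paths = {
    "web_ui":     (0.50, 0.999),
    "api":        (0.45, 0.999),
    "onboarding": (0.05, 0.90),  # the "one misc API" from the comment above
}
```

Here `weighted_availability(paths)` comes out around 99.4%, a green-looking dashboard, even though every new user is hitting a service with one nine.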
> For us, availability is job #1, and this migration ensures GitHub remains the fast, reliable platform developers depend on
That went about as well as everyone thought back then.
Does anyone else remember back in ~2014-2015, when half the community was screaming at GitHub to "please be faster at adding more features"? I wish we could get back to platforms (or OSes, for that matter) focusing on reliability and stability. Seems those days are long gone.
We have since switched to a self-hosted Forgejo instance. Unsurprisingly, the search works.
The improvements to PR review have been nice though
I dunno; it's probably the worst UX downgrade so far. Almost no PRs are "fully available" on page load; it takes additional clicks and scrolling to "unlock" all the context, which kind of sucks.
It used to be that you loaded the PR diff and actually saw the full diff, except for really large files. You could Ctrl+F and search for stuff; you didn't need to click to expand even small files. Reviewing medium/large PRs on GH today is just borderline obnoxious.
They have somehow found the worst possible amount of context for doing review. I tend to pull everything down to VS Code if I want to have any confidence these days.
That's only a valid sentiment if you only use the big players. Both of those have medium/smaller competitors that have shown (for decades) that they are extremely boring, therefore stable.
I'm at a much smaller outfit now so we have more freedom but I'd dread to think the arguments I would've had at the 4000+ employee companies I was at before.
(Note that "is this company financially viable in the long term future" is an important part of stability. Doesn't matter how rock solid the software is if the startup's bankrupt by the end of next year.)
It's just that everybody is using 100 tools and dependencies, which themselves depend on 50 others, all of which have to be working.
And then on top of all that, their traffic is probably skyrocketing like mad because of everyone else using AI coders. Look at popular projects -- a few minutes after an issue is filed they have sometimes 10+ patches submitted. All generating PRs and forks and all the things.
That can't be easy on their servers.
I do not envy their reliability team (but having been through this myself, if you're reading this GitHub team, feel free to reach out!).
I think this is a really important point that is getting overlooked in most conversations about GitHub's reliability lately.
GitHub was not designed or architected for a world where millions of AI coding agents can trivially generate huge volumes of commits and PRs. This alone is such a huge spike and change in user behavior that it wouldn't be unreasonable to expect even a very well-architected site to struggle with reliability. For GitHub, N 9s of availability pre-AI simply does not mean the same thing as N 9s of availability post-AI. Those are two completely different levels of difficulty, even when N is the same.
it's the 2 nines they aimed for
1-4 incidents per month compared to about 1 daily.
Like, they are down to one 9 of availability and very, very close to losing even that (90.2x%).
This also fits more closely with my personal experience than the 99.900-99.989% range the article indicates...
Though honestly, 99.9% means 8.76h of downtime a year. If we say no more than 20 min of downtime per 3 hours (sliding window), no more than 1h a day, and >50% of downtime falling in (localized) off-working hours (e.g. night, Sat, Sun), then 99.9% is something you can work with. Sure, it would sometimes be slightly annoying, but it should not cause any real issues.
On the other hand, 90.21%... that is 35.73h of outage a year. Probably still fine if, for each location, the working-hour availability is 99.95% and the previous constraints hold. But uh, wtf, that just isn't right for a company of that size.
Days, not hours.
- at most 20 min per 3 hours
- and 99.9% uptime on a yearly basis

as in, your yearly budget of outage is ~8.76h, but that budget shouldn't happen all at once; if there is an outage, it at most delays work by 20 min at a time, and not directly again after you've had a downtime
but I did fumble the 90.21% part, which is ~35.73 days, i.e. over 857 hours...
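The arithmetic in this subthread is easy to sanity-check with a couple of lines:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(uptime: float) -> float:
    """Yearly downtime budget for a given uptime fraction."""
    return (1 - uptime) * HOURS_PER_YEAR

# 99.9%  -> 8.76 hours/year
# 90.21% -> ~857.6 hours/year, i.e. ~35.7 *days*, not hours
```

Which confirms both numbers: three nines buys you under nine hours of outage a year, while 90.21% is over a month of it.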
That’s… one 9 of reliability. You could argue the title understates the problem.
> You don't need every single service to be online in order to use GitHub.
Well, that’s how they want you to use it, so it’s an epic failure of their intended user story. Another way to put this is: ”if you use more GitHub features, your overall reliability goes down significantly and unpredictably”.
Look, I have never been obsessed with nines for most types of services. But the cloud service providers certainly used them as major selling/bragging points, until that got boring and old because of LLMs. Same with security. And GitHub is so far upstream that downstream effects can propagate and cascade quite seriously.
These days it is very common that something like opening the diff view of a trivial PR takes 15-30 seconds to load. Sure, it will eventually load after a long wait or an F5, but it is still negatively impacting my productivity.
It seems that the same metric is about an order of magnitude worse than before.
https://github.com/customer-terms/github-online-services-sla
> GitHub commits to maintain at least 99.9% Uptime for the applicable GitHub service.
... and none of the individual services have hit 99.9% uptime in the last 90 days according to this site. 0_o
If you have ever operated GitHub Enterprise Server, you know it’s a nightmare.
It doesn’t support active-active; it only supports passive standbys. Minor version upgrades can’t be done without downtime and don’t support rollbacks. If you deploy an update and it has a bug, the only thing you can do is restore from backup, leading to data loss.
This is the software they sell to their highest margin customers, and it fails even basic sniff tests of availability.
Data loss for source code is a really big deal.
Downtime for source control is a really big deal.
Anyone who would release such a product with a straight face clearly doesn’t care deeply about availability.
So, the fact that their managed product is also having constant outages isn’t surprising.
I think the problem is that they just don’t care.