Posted by Simpliplant 5 hours ago
does anyone know where these "detailed root cause analysis" reports are shared? is there maybe an archive?
There are also monthly availability reports: https://github.blog/tag/github-availability-report/
Or it could be that the 100x increase in code and activity over the last 12 months is more than they had planned for when they last did capacity planning.
Vibe-coders, many of them here, often boast about the insane amount of KLoC/hour they can generate and merge.
everyone builds off vibes and moves fast! like no, if you are a mature company you don't need to move fast, in fact you need to move slow
the only thing that can kill e.g. github is if they move fast and break things like they have been doing recently
Our health check checks against githubstatus.com to verify 'why' there may be a GHA failure and reports it, e.g.
Cannot run: repo clone failed — GitHub is reporting issues (Partial System Outage: 'Incident with Copilot and Actions'). No cached manifests available.
But if the status page isn't updated, we get more generic responses. Are there better ways that you all employ (other than not using GHA, you silly haters :-))?
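For anyone wanting to try the same thing, here is a rough sketch of that kind of check in Python. It is not our actual code. It assumes githubstatus.com serves the standard Statuspage v2 JSON summary at the URL below, and the names `fetch_summary` and `explain_failure` are made up for illustration.

```python
import json
import urllib.request

# Assumption: githubstatus.com exposes the standard Statuspage v2 JSON API.
STATUS_URL = "https://www.githubstatus.com/api/v2/summary.json"


def fetch_summary(url=STATUS_URL, timeout=5):
    """Fetch the status summary; return None if the status page is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # URLError and timeouts are OSError; malformed JSON is ValueError.
        return None


def explain_failure(action, summary):
    """Build a failure message, naming the reported incident when there is one."""
    if summary is None:
        return f"Cannot run: {action} failed (could not reach the GitHub status page)."
    status = summary.get("status", {})
    if status.get("indicator", "none") == "none":
        # No reported problem: the fault is local, or the status page is lagging.
        return (
            f"Cannot run: {action} failed "
            "(GitHub reports no incident; the status page may be lagging)."
        )
    names = [i.get("name", "unnamed incident") for i in summary.get("incidents", [])]
    detail = status.get("description", "unknown status")
    if names:
        detail += ": " + "; ".join(f"'{n}'" for n in names)
    return f"Cannot run: {action} failed. GitHub is reporting issues ({detail})."
```

Call it as `explain_failure("repo clone", fetch_summary())` when a step fails. Keeping the message builder separate from the network call makes the stale-status-page case easy to test, which is exactly the case that gives the generic responses.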
PRs are a de facto communication and coordination bus between different code review tools; it's all a mess.
LLMs make it worse because I'm pushing more code to github than ever before, and it just isn't set up to deal with this type of workload when it is working well.
Have you ever considered that this is the problem? GH never planned for this sort of pointless and unpaid activity before. Now they have a large increase (I've seen figures of 100x) in activity and they can't keep up.
It doesn't help that almost none of the added activity is actually useful; it's just thousands and thousands of clones of some other pointless product.
The classic "nobody ever gets fired for buying IBM".
If you pick something else and there's an issue, people will complain that your choice was wrong and that you should have gone with the biggest player.
Even if you provide metrics showing your solution's downtime is 1% of the big player's.
Something like Cloudflare is so big and ubiquitous, that, when there's a downtime, even your grandma is aware of it because they talk about it in the news. So nobody will put the blame on the person choosing Cloudflare.
Even if people decide to go back (I had a few customers asking us to migrate to other solutions or to build some kind of failover after the last Cloudflare incidents), it costs so much to find solutions that can replace it at the same service level and to do the migration that, in the end, they prefer to eat the cost of the downtime.
Meanwhile, if you're a regular player in a very competitive market, yes, every downtime will result in lost income, customers leaving... which can hurt quite a lot when you don't have hundreds of thousands of customers.
GitHub is a distributed version control storage hub with additional add-on features. If peeps can’t work around a git server/hub being down, don’t know to have independent reproducible builds or integrations, and aren’t using project software wildly better than GitHub’s, there are issues. And for how much money? A few hundred per dev per year? Forget total revenue, the billions, the entire thing is a pile of ‘suck it up, buttercup’ with ToS to match.
In contrast, I’ve been working for a private company selling patient-touching healthcare solutions and we all would have committed seppuku with outages like this. Yeah, zero downtime or as close to it as possible even if it means fixing MS bugs before they do. Fines, deaths, and public embarrassment were potential results of downtime.
All investments become smart or dumb depending on context. If management agrees that downtime would be lethal my prejudice would be to believe them since they know the contracts and sales perspective. If ‘they crashed that one time’ stops all sales, the 0% revenue makes being 30% faster than those astronauts irrelevant.
Of course, once you have the momentum it doesn't matter nearly as much, at least for a while. If it happens too much though, people will start looking for alternatives.
The key thing to remember is that momentum is hard to redirect, but with enough force (reasons), it will shift.
And the frequency they can tolerate is surprisingly high given that we're talking about the 20th or so outage of 2026 for github. (See: https://news.ycombinator.com/from?site=githubstatus.com)