Posted by birdculture 1 day ago
My favorite hill to die on (externality) is user time. Most software houses spend so much time focusing on how expensive engineering time is that they neglect user time. Software houses optimize for feature delivery, not user interaction time. Yet if I spend one hour making my app one second faster for my million users, I can save 277 user hours per year. But since user hours are an externality, such optimization never gets done.
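(The arithmetic: 1,000,000 users × 1 second saved = 1,000,000 seconds ≈ 277.8 hours, assuming each user hits that code path once a year.)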
Externalities lead to users downloading extra gigabytes of data (wasted time) and waiting for software, all of which is waste that the developer isn't responsible for and doesn't care about.
I don’t know what you mean by software houses, but every consumer-facing software product I’ve worked on has tracked things like startup time and latency for common operations as a key metric.
This has been common wisdom for decades. I don’t know how many times I’ve heard the repeated quote about how Amazon loses $X million for every Y milliseconds of page loading time, as an example.
> Helldivers 2 devs slash install size from 154GB to 23GB
https://news.ycombinator.com/item?id=46134178
A section of the top comment says,
> It seems bizarre to me that they'd have accepted such a high cost (150GB+ installation size!) without entirely verifying that it was necessary!
and the reply to it has,
> They’re not the ones bearing the cost. Customers are.
And Skylines rendering teeth on models miles away https://www.reddit.com/r/CitiesSkylines/comments/17gfq13/the...
Sometimes performance really is ignored.
It cost several human lifetimes, if I remember correctly. Still not as bad as Windows Update, which, taking the time times the wage, sets the GDP of a small nation on fire every year.
Gamedev is very different from other domains, being in the 90th percentile for complexity and codebase size, and the 99th percentile for structural instability. It's a foregone conclusion that you will rewrite huge chunks of your massive codebase many, many times within a single year to accommodate changing design choices, or, if you're lucky, to improve an abstraction. Not every team gets so lucky on every project. Launch deadlines are hit when there's a huge backlog of additional stuff to do, sitting atop a mountain of cut features.
The inverse, however, is bizarre: that they spent potentially quite a bit of engineering effort implementing the (extremely non-optimal) system that duplicates all the assets half a dozen times to potentially save precious seconds on spinning rust, all without validating it was worth implementing in the first place.
The first game was on PS3 and PS4, where they had to deal with spinning disks, and that system would absolutely be necessary.
Also if the game ever targeted the PS4 during development, even though it wasn’t released there, again that system would be NEEDED.
They talk about it being an optimization. They also talk about the bottleneck being level generation, which happens at the same time as loading from disk.
Tell me you don't work on game engines without telling me..
----
Modern engines are the cumulative result of hundreds of thousands of engine-programmer hours. You're not rewriting Unreal in several years, let alone multiple times in one year. Get a grip dude.
Though in this case GitHub wasn't bearing the cost, it was turning a profit...
I think this is uncharitably erasing the context here.
AFAICT, the reason that Helldivers 2 was larger on disk is because they were following the standard industry practice of deliberately duplicating data in such a way as to improve locality and thereby reduce load times. In other words, this seems to have been a deliberate attempt to improve player experience, not something done out of sheer developer laziness. The fact that this attempt at optimization is obsolete these days just didn't filter down to whatever particular decision-maker was at the reins on the day this decision was made.
Are you sure that you’re not the driving force behind those metrics; or that you’re not self-selecting for like-minded individuals?
I find it really difficult to convince myself that even large players (Discord) are measuring startup time. Every time I start the thing I’m greeted by a 25s wait and a `RAND()%9` number of updates that each take about 5-10s.
I took a dig at Word for taking like 10 seconds to start, and some people came back saying it only takes 2, as if that still isn't 2s too long.
Then again, look at how Microsoft is handling slow File Explorer speeds...
A large part of my friend group uses Discord as the primary method of communication, even in an in-person context (I was at a festival a few months ago with a friend, and we would send texts over Discord if we got split up), so maybe it's not a common use case.
It’s closed unless I get a DM on my phone, and then I suffer the 2-3 minute startup/failed update process and quit it again. Not a fan of leaving their broken, resource-hogging app running at all times.
> what you mean by software houses
How about Microsoft? Start menu is a slow electron app.
The decline of Windows as a user-facing product is amazing, especially as they are really good at developing things they care about. The “back of house” guts of Windows have improved a lot, for example. They should just have a cartoon Bill Gates pop up like Clippy and flip you the bird at this point.
It has both indexing failures and multi-day performance issues for mere kilobytes of text!
The falsehoods people believe and spread about a particular topic are an excellent way to tell what public opinion on it is.
Consider spreading a falsehood about Boeing QA getting bonuses based on number of passed planes vs the same falsehood about Airbus. If the Boeing one spreads like wildfire, it tells you that Boeing has a terrible track record of safety and that it’s completely believable.
Back to the start menu. It should be a complete embarrassment to MSFT SWEs that people even think the start menu performance is so bad that it could be implemented in electron.
In summary: what lies spread easily is an amazing signal on public perception. The SMBC comic is dumb.
If your users are trapped due to a lack of competition then this can definitely happen.
This is true for sites that are trying to make sales. You can quantify how much a delay affects closing a sale.
For other apps, it’s less clear. During its high-growth years, MS Office had an abysmally long startup time.
Maybe this was due to MS having a locked-in base of enterprise users. But given that OpenOffice and LibreOffice effectively duplicated long startup times, I don’t think it’s just that.
You also see the Adobe suite (and also tools like GIMP) with some excruciatingly long startup times.
I think it’s very likely that startup times of office apps have very little impact on whether users will buy the software.
They often have a noticeable delay responding to user input compared to local software.
Are they evaluating the shape of that line with the same goal as the stonk score? Time spent by users is an "engagement" metric, right?
Then respectfully, uh, why is basically all proprietary software slow as ass?
A commons would be something owned by nobody that everyone benefits from.
This isn’t what “commons” means in the term ‘tragedy of the commons’, and the obvious end result of your suggestion to take as much as you can is to cause the loss of access.
Anything that is free to use is a commons, regardless of ownership, and when some people use too much, everyone loses access.
Finite digital resources like bandwidth and database sizes within companies are even listed as examples in the Wikipedia article on Tragedy of the Commons. https://en.wikipedia.org/wiki/Tragedy_of_the_commons
The behavior that you warn against is that of a free rider that makes use of a positive externality of GitHub’s offering.
“Commons can also be defined as a social practice of governing a resource not by state or market but by a community of users that self-governs the resource through institutions that it creates.”
https://en.wikipedia.org/wiki/Commons
The actual mechanism by which ownership resolves tragedy of the commons scenarios is by making the resource non-free, by either charging, regulating, or limiting access. The effect still occurs when something is owned but free, and its name is still ‘tragedy of the commons’, even when the resource in question is owned by private interests.
Edit: oh, I do see what you mean, and yes I misunderstood the quote I pulled from WP - it’s talking about non-ownership. I could pick a better example, but I think that’s distracting from the fact that ‘tragedy of the commons’ is a term that today doesn’t depend on the definition of the word ‘commons’. It’s my mistake to have gotten into any debate about what “commons” means, I’m only saying today’s usage and meaning of the phrase doesn’t depend on that definition, it’s a broader economic concept.
Certainly private property is involved in tragedy of the commons. In the classic shared cattle ranching example, the individual cattle are private property, only the field is held in common.
I generally think that tragedy of the commons requires the commons to, well, be held in common. If someone owns the thing that is the commons, it's not a commons but just a bad product. (With, of course, some nitpicking about how things can be de jure private property while being de facto common property.)
In the Microsoft example, Windows becoming shitty software is not a tragedy of the commons, it's just MS making a business decision, because Windows is not a commons. On the other hand, computing in general becoming shitty, because each individual app does attention-grabbing dark patterns that help the individual app's bottom line while hurting the ecosystem as a whole, would be a tragedy of the commons, as user attention is something all apps hold in common and none of them own.
The idea of the tragedy of the commons relies on this feedback loop of having these unsustainably growing herds (growing because they can exploit the zero-cost-to-them resources of the commons). Feedback loops are notoriously sensitive to small parameter changes. MS could presumably impose some damping if they wanted.
Not linearity but continuity, which I think is a well-founded assumption, given that it's our categorization that simplifies the world by drawing sharp boundaries where no such bounds exist in nature.
> The idea of the tragedy of the commons relies on this feedback loop of having these unsustainably growing herds (growing because they can exploit the zero-cost-to-them resources of the commons)
AIUI, zero-cost is not a necessary condition, a positive return is enough. Fishermen still need to buy fuel and nets and pay off loans for the boats, but as long as their expected profit is greater than that, they'll still overfish and deplete the pond, unless stronger external feedback is introduced.
Given that the solution to tragedy of the commons is having the commons owned by someone who can boss the users around, GitHub being owned by MS makes it more of a commons in practice, not less.
You’re fundamentally misunderstanding what tragedy of the commons is. It’s not that it’s “zero-cost” for the participants. All it requires is a positive return that has a negative externality that eventually leads to the collapse of the system.
Overfishing and CO2 emissions are very clearly a tragedy of the commons.
GitHub right now is not. People putting all sorts of crap on there is not hurting github. GitHub is not going to collapse if people keep using it unbounded.
Not surprisingly, this is because it’s not a commons and Microsoft oversees it, placing appropriate rate limits and whatnot to make sure it keeps making sense as a business.
But I would make the following clarifications:
1. A private entity is still the steward of the resource and therefore the resource figures into the aims, goals, and constraints of the private entity.
2. The common good is itself under the stewardship of the state, as its function is guardian of the common good.
3. The common good is the default (by natural law) and prior to the private good. The latter is instituted in positive law for the sake of the former by, e.g., reducing conflict over goods.
I think it's both simpler and deeper than that.
Governments and corporations don't exist in nature. Those are just human constructs, mutually-recursive shared beliefs that emulate agents following some rules, as long as you don't think too hard about this.
"Tragedy of the commons" is a general coordination problem. The name itself might've been coined with some specific scenarios in mind, but for the phenomenon itself, it doesn't matter what kind of entities exploit the "commons"; the "private" vs. "public" distinction itself is neither a sharp divide, nor does it exist in nature. All that matters is that there's some resource used by several independent parties, and each of them finds it more beneficial to defect than to cooperate.
In a way, it's basically a 3+-player prisoner's dilemma. The solution is the same, too: introducing a party that forces all other parties to cooperate. That can be a private or public or any other kind of org taking ownership of the commons and enforcing quotas, or in the case of prisoners, a mob boss ready to shoot anyone who defects.
The "tragedy", if you absolutely need to find one, is only for unrestricted, free-for-all commons, which is obviously a bad idea.
But that doesn't mean the tragedy of the commons can't happen in other scenarios. If we define commons a bit more generously it does happen very frequently on the internet. It's also not difficult to find cases of it happening in larger cities, or in environments where cutthroat behavior has been normalized
That works while the size of the community is ~100-200 people, when everyone knows everyone else personally. It breaks down rapidly after that. We compensate for that with hierarchies of governance, which give rise to written laws and bureaucracy.
New tribes break off old tribes, form alliances, which form larger alliances, and eventually you end up with countries and counties and voivodeships and cities and districts and villages, in hierarchies that gain a level per ~100x population increase.
This is sociopolitical history of the world in a nutshell.
You say it like this is a law set in stone, because this is what happened in history, but I would argue it happened under different conditions.
Mainly, the advantage of an empire over small villages/tribes is not at all that it has more power than the villages combined, but that it can concentrate its power where it is needed. One village did not stand a chance against the empire, and the villages were not coordinated enough.
But today we would have the internet for better communication and coordination, enabling the small entities to coordinate a defense.
Well, in theory, of course. Because we do not really have autonomous small states, but are dominated by the big players. And the small states mostly have the choice of which bloc to align with, or get crushed. But the trend might go towards small again.
(See also cheap drones destroying expensive tanks, battleships etc.)
NETWORK effect is a real thing
Canadians need an anti-imperial, Radio-Canada-run alternative. We aren't gonna be able to coordinate against the empire when the empire has the main control over the internet.
When the Americans come a-knocking, we're gonna wish we had Chinese radios.
Yet we regularly observe that working with millions of people: we take care of our young, we organize, and when we see that some action hurts our environment we tend to limit its use.
It's not obvious why some societies break down early and some go on working.
Those are more like human universals. These behaviors generally manifest to a smaller or larger degree, depending on how secure people feel. But they are extremely local behaviors. And in fact, one of them is exactly the thing I'm talking about:
> we organize
We organize. We organize for many reasons, "general living" is the main one but we're mostly born into it today (few got the chance to be among the founding people of a new village, city or country). But the same patterns show up in every other organizations people create, from companies to charities, from political interests groups to rural housewives' circles -- groups that grow past ~100 people split up. Sometimes into independent groups, sometimes into levels of hierarchies. Observe how companies have regional HQs and departments and areas and teams; religious groups have circuits and congregations, etc. Independent organizations end up creating joint ventures and partnerships, or merge together (and immediately split into a more complex internal structure).
The key factor here is, IMO, for everyone in a given group to be in regular contact with everyone else. Humans are well evolved for living in such small groups - we come with built-in hardware and software to navigate complex interpersonal situations. Alignment around shared goals and implicit rules is natural at this scale. There's no space for cheaters and free-loaders to thrive, because everyone knows everyone else - including the cheater and their victims. However, once the group crosses this "we're all a big family, in it together" size, coordinating everyone becomes hard, and free-loaders proliferate. That's where explicit laws come into play.
This pattern repeats daily, in organizations people create even today.
But if a significant fraction of the population is barely scraping by, then they're not willing to be "good" if it means not making ends meet; and when other people see widespread defection, they start to feel like they're the only ones holding up their end of the deal, and then the whole thing collapses.
This is why the tendency for people to propose rent-seeking middlemen as a "solution" to the tragedy of the commons is such a diabolical scourge. It extracts the surplus that would allow things to work more efficiently in their absence.
It’s easier to explain in those terms than assumptions about how things work in a tribe.
Commons can fail, but the whole point of Hardin calling commons a "tragedy" is to suggest it necessarily fails.
Compare it to, say, driving. It can fail too, but you wouldn't call it "the tragedy of driving".
We'd be much better off if people didn't throw around this zombie term decades after it's been shown to be unfounded.
No it does not. This sentiment, which many people have, is based on a fictional and idealistic notion of what small communities are like, held by people who have never lived in such communities.
Empirically, even in high-trust small villages and hamlets where everyone knows everyone, the same incentives exist and the same outcomes happen. Every single time. I lived in several and I can't think of a counter-example. People are highly adaptive to these situations and their basic nature doesn't change because of them.
Humans are humans everywhere and at every scale.
Nonetheless, the concept is still alive, and anthropogenic global warming is here to remind you of this.
Communal management of a resource is still government, though. It just isn’t central government.
The thesis of the tragedy of the commons is that an uncontrolled resource will be abused. The answer is governance at some level, whether individual, collective, or government ownership.
> The "tragedy", if you absolutely need to find one, is only for unrestricted, free-for-all commons, which is obviously a bad idea.
Right. And that’s what people are usually talking about when they say “tragedy of the commons”.
That seems like an unreasonable bar, and less useful than "does this system make ToC less frequent than that system?"
This is of course a false dichotomy because governance can be done at any level.
Let's Encrypt is a solid example of something you could reasonably model as "tragedy of the commons" (who is going to maintain all this certificate verification and issuance infrastructure?) but then it turns out the value of having it is a million times more than the cost of operating it, so it's quite sustainable given a modicum of donations.
Free software licenses are another example in this category. Software frequently has a much higher value than development cost and incremental improvements decentralize well, so a license that lets you use it for free but requires you to contribute back improvements tends to work well because then people see something that would work for them except for this one thing, and it's cheaper to add that themselves or pay someone to than to pay someone who has to develop the whole thing from scratch.
The jerks get their free things for a while, then it goes away for everyone.
And out of curiosity, aside from costing more for some people, what’s worse exactly? I’m not a heavy GitHub user, but I haven’t really noticed anything in the core functionality that would justify calling it enshittified.
Probably the worst thing MS did was kill GitHub’s nascent CI project and replace it with Azure DevOps. Though to be fair the fundamental flaws with that approach didn’t really become apparent for a few years. And GitHub’s feature development pace was far too slow compared to its competitors at the time. Of course GitHub used to be a lot more reliable…
Now they’re cramming in half baked AI stuff everywhere but that’s hardly a MS specific sin.
MS GitHub has been worse about DMCA and sanctioned country related takedowns than I remember pre acquisition GitHub being.
Did I miss anything?
As for how the site has become worse, plenty of others have already done a better job than I could there. Other people haven't noticed or don't care and that's ok too I guess.
Remember how GTA5 took 10 minutes to start and nobody cared? Lots of software is like this.
Some Blizzard games download a 137 MB file every time you run them and take a few minutes to start (and no, this is not due to my computer).
The number of companies that have this much respect for the user is vanishingly small.
I think companies shifted to online apps because, #1, it solved the copy protection problem. FOSS apps are not in any hurry to become centralized because they don't care about that issue.
Local apps and data are a huge benefit of FOSS and I think every app website should at least mention that.
"Local app. No ads. You own your data."
Native software being an optimum is mostly an engineer fantasy that comes from imagining what you can build.
In reality that means having to install software like Meta’s WhatsApp, Zoom, and other crap I’d rather run in a browser tab.
I want very little software running natively on my machine.
Yes, there are many cases when condoms are indicative of respect between parties. But a great many people would disagree that the best, most respectful relationships involve condoms.
> Meta
Does not sell or operate respectful software. I will agree with you that it's best to run it in a browser (or similar sandbox).
I think this is sad.
In contrast as long as you have a native binary, one way or another you can make the thing run and nobody can stop you.
I know the browser is convenient, but frankly, it's been a horror show of resource usage, vulnerabilities, and pathetic performance.
The idea that somehow those companies would respect your privacy were they running a native app is extremely naive.
We can already see this problem on video games, where copy protection became resource-heavy enough to cause performance issues.
This is what people mean about speed being a feature. But "user time" depends on more than the program's performance. UI design is also very important.
> "The Macintosh boots too slowly. You've got to make it faster!"
Google and Amazon are famous for optimizing this. It's not an externality to them, though; even 10s of ms can equal an extra sale.
That said, I don't think it's fair to add time up like that. Saving 1 second for 600 people is not the same as saving 10 minutes for 1 person. Time in small increments does not have the same value as time in large increments.
2. Monopolies and situations with the principal/agent dilemma are less sensitive to such concerns.
An externality is usually a cost you don't pay (or pay only a negligible amount of). I don't see how pricing it helps justify optimizing it.
Wait times don’t accumulate. Depending on the software, to each individual user, that one second will probably make very little difference. Developers often overestimate the effect of performance optimization on user experience because it’s the aspect of user experience optimization their expertise most readily addresses. The company, generally, will get a much better ROI from implementing well-designed features and having you squash bugs.
Perhaps not everyone cares, but I've played enough Age of Empires 2 to know that there are plenty of people who have felt the value of shaving seconds off this and that to get compound gains over time. It's a concept plenty of folks will be familiar with.
I have to pay less attention to a thing that updates less frequently. Idle games are the best in that respect, because you can check into the game on your own time rather than the game forcing you to pay attention on its time.
First argument: take at least two 0's off your estimate. Most applications will have maybe thousands of users; successful ones will maybe run with tens of thousands. You might get lucky and work on an application that has hundreds of thousands or millions of users, but then you work at a FAANG, not a typical "software house".
Second argument: most users use 10-20 apps in a typical workday, so your application is most likely irrelevant.
Third argument: most users would save much more time by learning how to properly use the applications (or the computer) they rely on daily than from someone optimizing some function from 2s to 1s. But of course that's hard, because they have 10-20 apps daily, plus god knows how many others they use less often. Still, I see people doing super silly stuff in tools like Excel, or not even knowing copy-paste, so this isn't even about command-line magic.
What is the probability of it being used? About 0%, right? Because git is proven and GitHub is free. Engineering aspects are less important.
So how do I start using it if I, for example, want to use it like a decentralized `syncthing`? Can I? If not, what can I use it for?
I am not a mathematician. Most people landing on your repo are not mathematicians either.
We the techies _hate_ marketing with a passion but I as another programmer find myself intrigued by your idea... with zero idea how to even use it and apply it.
I have never been convinced by this argument. The aggregate number sounds fantastic but I don't believe that any meaningful work can be done by each user saving 1 second. That 1 second (and more) can simply be taken by me trying to stretch my body out.
OTOH, if the argument is to make software smaller, I can get behind that since it will simply lead to more efficient usage of existing resources and thus reduce the environmental impact.
But we live in a capitalist world and there needs to be external pressure for change to occur. The current RAM shortage, if it lasts, might be one of them. Otherwise, we're only day dreaming for a utopia.
Not all of those externalizing companies abuse your time, but whatever they abuse can be expressed in a $ amount, and $ can be converted to a median person's time via the median wage. Hell, free time is more valuable than whatever you produce during work.
Say all that boils down to companies collectively stealing 20 minutes of your time each day. 140 minutes each week. 7280 (!) minutes each year, which is 5.05 days, which makes it almost a year over the course of 70 years.
So yeah, don't do what you're doing and sweet-talk the fact that companies externalize costs (privatize the profits, socialize the losses). They're sucking your blood.
A high usage one, absolutely improve the time of it.
Loading the profile page? Isn't done often so not really worth it unless it's a known and vocal issue.
https://xkcd.com/1205/ gives a good estimate.
Even if all you do with it is just stretching, there's a chance it will prevent you pulling a muscle. Or lower your stress and prevent a stroke. Or any number of other beneficial outcomes.
In 24 years of career I've met a grand total of _two_ such people. Both got fired not even 6 months after I joined the company, too.
Who's naive here?
So now you and I have both come across such a manager. Why would you make the claim that most engineers don't come across such people?
The article mentions that most of these projects did use GitHub as a central repo out of convenience so there’s that but they could also have used self-hosted repos.
I've looked into self-hosting a git repo with horizontal scalability, and it is indeed very difficult. I don't have the time to detail it in a comment here, but for anyone who is curious, it's very informative to look at how GitLab handled this with Gitaly. I've also seen some clever attempts to use object storage, though I haven't seen any of those solutions put heavily to the test.
I'd love to hear from others about ideas and approaches they've heard about or tried
From compute POV you can serve that with one server or virtual machine.
Bandwidth-wise, given a 100 MB repo size, that would make it 3.4 GB/s - also easy terrain for a single server.
The git transport protocol is "smart" in a way that is, in some ways, arguably rather dumb. It's certainly expensive on the server side. All of the smartness of it is aimed at reducing the amount of transfer and number of connections. But to do that, it shifts a considerable amount of work onto the server in choosing which objects to provide you.
If you benchmark the resource loads of this, you probably won't be saying a single server is such an easy win :)
Using the slowest clone method they measured 8s for a 750 MB repo and 0.45s for a 40 MB repo. It appears to be linear, so 1.1s for 100 MB should be a valid interpolation.
So doing 30 of those per second only takes 33 cores. Servers have hundreds of cores now (eg 384 cores: https://www.phoronix.com/review/amd-epyc-9965-linux-619).
And remember we're using worst-case assumptions in 3 places (full clone each time, using the slowest clone method, and numbers from old hardware). In practice I'd bet a fastish laptop would suffice.
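A back-of-envelope version of that math, with the 30 clones/second load taken as an assumption from the thread:

```python
# Rough capacity check for serving full clones of a ~100 MB repo.
# Inputs come from the thread: ~1.1 CPU-seconds per clone (interpolated
# from old benchmarks) and an assumed load of 30 clones per second.
clone_cpu_seconds = 1.1
clones_per_second = 30
repo_size_mb = 100

cores_needed = clones_per_second * clone_cpu_seconds            # ~33 cores
bandwidth_gb_per_s = clones_per_second * repo_size_mb / 1000    # ~3 GB/s

print(f"cores: {cores_needed:.0f}, bandwidth: {bandwidth_gb_per_s:.1f} GB/s")
```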
It's a very hacky feeling addon that RKE2 has a distributed internal registry if you enable it and use it in a very specific way.
For the rate at which people love just shipping a Helm chart, it's actually absurdly hard to ship a self contained installation without just trying to hit internet resources.
Explain to me how you self-host a git repo, without spending any money and having no budget, which is accessed millions of times a day from CI jobs pulling packages.
Oh no no no. Consumer-facing companies will burn 30% of your internal team complexity budget on shipping the first "frame" of your app/website. Many people treat Next as synonymous with React, and Next's big deal was helping you do just this.
The answer is in TFA:
> The underlying issue is that git inherits filesystem limitations, and filesystems make terrible databases.
because it's bad at this job, and sqlite is also free
this isn't about "externalities"
Anyone working in government, banking, or healthcare is still out of luck since the likes of Claude and GPT are (should be) off limits.
Also, in case you're not aware, accusing people of shilling or astroturfing is against the Hacker News guidelines.
This is perfectly sensible behavior when the developers are working for free, or when the developers are working on a project that earns their employer no revenue. This is the case for several of the projects at issue here: Nix, Homebrew, Cargo. It makes perfect sense to waste the user's time, as the user pays with nothing else, or to waste Github's bandwidth, since it's willing to give bandwidth away for free.
Where users pay for software with money, they may be more picky and not purchase software that indiscriminately wastes their time.
Windows 11 should not be more sluggish than Windows 7.
> The problem was that go get needed to fetch each dependency’s source code just to read its go.mod file and resolve transitive dependencies.
This article is mixing two separate issues. One is using git as the master database storing the index of packages and their versions. The other is fetching the code of each package through git. They are orthogonal; you can have a package index using git but the packages being zip/tar/etc archives, you can have a package index not using git but each package is cloned from a git repository, you can have both the index and the packages being git repositories, you can have neither using git, you can even not have a package index at all (AFAIK that's the case for Go).
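For the Go case specifically, the eventual separation took the form of the module proxy protocol: version lists and bare go.mod files are served as plain HTTP resources, so resolution no longer needs the source. A rough sketch (module path and version here are arbitrary examples):

```python
# Sketch of resolving a Go dependency's go.mod via the module proxy
# protocol (GOPROXY) instead of cloning its repository.
import urllib.request

PROXY = "https://proxy.golang.org"
module = "golang.org/x/text"   # already lowercase, so no "!"-escaping needed
version = "v0.14.0"

def fetch(path: str) -> str:
    with urllib.request.urlopen(f"{PROXY}/{path}") as resp:
        return resp.read().decode()

# List the known versions of the module (one per line)...
versions = fetch(f"{module}/@v/list").splitlines()

# ...then fetch just the go.mod for one of them: a few hundred bytes,
# enough to resolve transitive requirements without any source code.
print(fetch(f"{module}/@v/{version}.mod"))
```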
It then digresses into implementation details of Github's backend implementation (how is 20k forks relevant?), then complains about default settings of the "standard" git implementation. You don't need to checkout a git working tree to have efficient key value lookups. Without a git working tree you don't need to worry about filesystem directory limits, case sensitivity and path length limits.
I was surprised the author believes the git-equivalent of a database migration is a git history rewrite.
What do you want me to do, invent my own database? Run postgres on a $5 VPS and have everybody accept it as single-point-of-failure?
Unfortunately, when you’re starting out, the idea of running a registry is a really tough sell. Now, on top of the very hard engineering problem of writing the code and making a world-class tool, plus the social one of getting it adopted, I need to worry about funding and maintaining something that serves potentially a world of traffic? The git solution is intoxicating through this lens.
Fundamentally, the issue is the sparse checkouts mentioned by the author. You’d really like to use git to version package manifests, so that anyone with any package version can get the EXACT package they built with.
But this doesn’t work, because you need arbitrary commits. You either need a full checkout, or you need to somehow track the commit a package version is in without knowing what hash git will generate before you do it. You have to push the package update and then push a second commit recording that. Obviously infeasible, obviously a nightmare.
Conan’s solution is I think just about the only way. It trades the perfect reproduction for conditional logic in the manifest. Instead of 3.12 pointing to a commit, every 3.x points to the same manifest, and there’s just a little logic to set that specific config field added in 3.12. If the logic gets too much, they let you map version ranges to manifests for a package. So if 3.13 rewrites the entire manifest, just remap it.
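For illustration only (this is not Conan's actual file format), the shape of that idea is roughly:

```python
# Hypothetical sketch of mapping version ranges to shared manifests,
# with small conditional logic for fields added in later versions.
RANGE_TO_MANIFEST = {
    "3": "manifests/foo-3.x.py",   # every 3.x points at the same manifest
    "4": "manifests/foo-4.x.py",   # a rewrite in 4.0 just remaps the range
}

def manifest_for(version: str) -> str:
    major = version.split(".")[0]
    return RANGE_TO_MANIFEST[major]

def configure(version: str) -> dict:
    # Instead of one manifest per exact version, a shared manifest
    # carries a little logic for version-specific options.
    major, minor = (int(x) for x in version.split(".")[:2])
    options = {"shared": True}
    if (major, minor) >= (3, 12):
        options["new_config_field"] = True   # the field introduced in 3.12
    return options

print(manifest_for("3.13"), configure("3.13"))
```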
I have not found another package manager that uses git as a backend that isn’t a terrible and slow tool. Conan may not be as rigorous as Nix because of this decision but it is quite pragmatic and useful. The real solution is to use a database, of course, but unless someone wants to wire me ten thousand dollars plus server costs in perpetuity, what’s a guy supposed to do?
Every package has its own git repository which for binary packages contains mostly only the manifest. Sources and assets, if in git, are usually in separate repos.
This seems to not have the issues in the examples given so far, which come from using "monorepos" or colocating. It also avoids the "nightmare" you mention since any references would be in separate repos.
The problematic examples either have their assets and manifests colocated, or use a monorepo approach (colocating manifests and the global index).
The thing that scales is dumb HTTP that can be backed by something like S3.
You don't have to use a cloud, just go with a big single server. And if you become popular, find a sponsor and move to cloud.
If money and sponsor independence is a huge concern the alternative would be: peer-to-peer.
I haven't seen many package managers do it, but it feels like a huge missed opportunity. You don't need that many volunteers to peer in order to have a lot of bandwidth available.
Granted, the real problem that'll drive up hosting cost is CI. Or rather careless CI without caching. Unless you require a user login, or limit downloads for IPs without a login, caching is hard to enforce.
For popular package repositories you'll likely see extremely degenerate CI systems eating bandwidth as if it was free.
Disclaimer: opinions are my own.
That can be moved elsewhere / mirrored later if needed, of course. And the underlying data is still in git, just not actively used for the API calls.
It might also be interesting to look at what Linux distros do, like Debian (salsa), Fedora (Pagure), and openSUSE (OBS). They're good for this because their historic model is free mirrors hosted by unpaid people, so they don't have the compute resources.
However a lot of the "data in git repositories" projects I see don't have any such need, and then ...
> Why not just have a post-commit hook render the current HEAD to static files, into something like GitHub Pages?
... is a good plan. Usually they make a nice static website with the data that's easy for humans to read though.
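A sketch of what such a hook could do (the packages/ layout and output names are made up, and it could just as well be a shell script):

```python
#!/usr/bin/env python3
# Hypothetical .git/hooks/post-commit script: render the manifests at
# HEAD into static JSON that a host like GitHub Pages can serve.
import json, pathlib, subprocess

OUT = pathlib.Path("public")
OUT.mkdir(exist_ok=True)

# Every manifest tracked at HEAD under packages/.
files = subprocess.run(
    ["git", "ls-tree", "-r", "--name-only", "HEAD", "packages/"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

index = {}
for path in files:
    # Read the committed content, not the working tree.
    blob = subprocess.run(
        ["git", "show", f"HEAD:{path}"],
        capture_output=True, text=True, check=True,
    ).stdout
    name = pathlib.Path(path).stem
    (OUT / f"{name}.json").write_text(json.dumps({"manifest": blob}))
    index[name] = f"{name}.json"

# One small index file clients can fetch instead of cloning anything.
(OUT / "index.json").write_text(json.dumps(index))
```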
So you need a decentralized database? Those exist (or you can make your own, if you're feeling ambitious), probably ones that scale in different ways than git does.
It’s really important that someone is able to search for the manifest one of their dependencies uses for when stuff doesn’t work out of the box. That should be as simple as possible.
I’m all ears, though! Would love to find something as simple and good as a git registry but decentralized
You could just make a registry hosted as plain HTTP, with everything signed. And a special file that contains a list of mirrors.
Clients request the mirror list and the signed hash of the last entry in the Merkle tree. Then they go talk to a random mirror.
Maybe your central service requires user sign-in for publishing and reading, while mirrors can't publish but don't require sign-in.
Obviously, you'd have to validate that mirrors are up and populated. But that's it.
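Client-side, that could look roughly like this (the URLs, file names, and log format are all assumptions, not an existing service):

```python
# Hypothetical client for the scheme above: small signed metadata from
# the central service, bulk downloads from a random mirror.
import hashlib, json, random, urllib.request

CENTRAL = "https://registry.example.org"   # made-up URL

def get_json(url: str):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# 1. The central service hands out the mirror list and the signed
#    root hash of the publish log (signature check omitted here).
mirrors = get_json(f"{CENTRAL}/mirrors.json")
log_root = get_json(f"{CENTRAL}/log-root.json")   # {"root_hash": ..., "signature": ...}

# 2. The artifact itself comes from any mirror, no sign-in needed.
mirror = random.choice(mirrors)
pkg_path = "packages/foo/1.2.3.tar.gz"
with urllib.request.urlopen(f"{mirror}/{pkg_path}") as resp:
    data = resp.read()

# 3. Verify the download against the hash recorded in the log entry
#    that the signed root commits to.
print("sha256:", hashlib.sha256(data).hexdigest(),
      "(check against the entry under root", log_root["root_hash"] + ")")
```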
You can start by self hosting a mirror.
One could go with signing schemes inspired by: https://theupdateframework.io/
Or one could omit signing altogether, so long as you have a Merkle tree with hashes for all publishing events, and the latest hash entry is always fetched from your server along with the mirror list.
Having all publishing go through a single service is probably desirable. You'll eventually need to do moderation, etc., and hosting your service or a mirror becomes a legal nightmare if there is no moderation.
Disclaimer: opinions are my own.
0: https://github.com/mesonbuild/wrapdb/tree/master/subprojects
Interesting! Do you mind sharing a link to the project at this point?
Julia does the same thing, and going by the Rust numbers in the article, Julia has about 1/7th the number of packages that Rust does[1] (95k/13k = 7.3).
It works fine, Julia has some heuristics to not re-download it too often.
But more importantly, there's a simple path to improve. The top Registry.toml [1] has a path to each package, and once downloading everything proves unsustainable you can just download that one file and use it to download the rest as needed. I don't think this is a difficult problem.
[1] https://github.com/JuliaRegistries/General/blob/master/Regis...
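A sketch of that incremental path (the registry layout is from memory, so treat the exact file and field names as approximate):

```python
# Fetch only the General registry's top-level Registry.toml, then pull
# per-package metadata lazily instead of cloning the whole registry.
import tomllib, urllib.request   # tomllib needs Python 3.11+

RAW = "https://raw.githubusercontent.com/JuliaRegistries/General/master"

def fetch(path: str) -> str:
    with urllib.request.urlopen(f"{RAW}/{path}") as resp:
        return resp.read().decode()

registry = tomllib.loads(fetch("Registry.toml"))

# Registry.toml maps package UUIDs to a name and a path within the repo.
wanted = "Example"
for uuid, entry in registry["packages"].items():
    if entry["name"] == wanted:
        # Only now fetch that one package's version metadata.
        versions = tomllib.loads(fetch(f"{entry['path']}/Versions.toml"))
        print(f"{wanted}: {len(versions)} registered versions")
        break
```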
Another way to phrase this mindset is "fuck around and find out" in gen-Z speak. It's usually practical to an extent but I'm personally not a fan
Building on the same thing people use for code doesn't seem stupid to me, at least initially. You might have to migrate later if you're successful enough, but that's not a sign of bad engineering. It's just building for where you are, not where you expect to be in some distant future
When you fuck around optimizing prematurely, you find out that you're too late and nobody cares.
Oh, well, optimization is always fun, so there's that.
Software engineers always make the excuse that what they're making now is unimportant, so who cares? But then everything gets built on top of that unimportant thing, and one day the world crashes down. Worse, "fixing the problem" becomes near impossible, because now everything depends on it.
But really, the reason not to do it is that there's no need to. There are plenty of other solutions than using Git that work as well or better without all the pitfalls. The lazy engineer picks bad solutions not because they're necessarily easier than the alternatives, but because they're the path of least resistance for themselves.
Not only is this not better, it's often actively worse. But this is excused by the same culture that gave us "move fast and break things". All you have to do is use any modern software to see how that worked out. Slow bug-riddled garbage that we're all now addicted to.
Most software gets to take it to more of an extreme than many engineering fields, since there isn't physical danger. It's telling that the counterexamples always use the potentially dangerous problems like medicine or nuclear engineering. The software in those fields is more stringent.
As opposed to something like using a flock of free blogger.com blogs to host media for an offsite project.
You realize, there are people who think differently? Some people would argue that if you keep working on problems you don't have but might have, you end up never finishing anything.
It's a matter of striking a balance, and I think you're way on one end of the spectrum. The vast majority of people using Julia aren't building nuclear plants.
Refusing to fix a problem that hasn't appeared yet, but has been/can be foreseen - that's different. I personally wouldn't call it unethical, but I'd consider it a negative.
Literally anybody could foresee that, _if_ something scales to millions of users, there will be issues. Some of the people who foresee that could even fix it. But they might spend their time optimizing for something that will never hit 1000 users.
Also, the problems discussed here are not that things don't work, it's that they get slow and consume too many resources.
So there is certainly an optimal time to fix such problems, which is, yes, OK, _before_ things get _too_ slow and consume _too_ many resources, but is most assuredly _after_ you have a couple of thousand users.
It cannot be the case that software engineers are labelled lazy for not building the at-scale solution to start with, but at the same time everyone wants to use their work, and there are next to no resources for said engineer to actually build the at scale solution.
> the path of least resistance for themselves.
Yeah because they're investing their own personal time and money, so of course they're going to take the path that is of least resistance for them. If society feels that's "unethical", maybe pony up the cash because you all still want to rely on their work product they are giving out for free.
I like OSS and everything.
Having said that, ethically, should society be paying for these? Maybe that is what should happen. In some places, we have programs to help artists. Should we have the same for software?
... Should it be concerning that someone was apparently able to engineer an ID like that?
Right now I don't see the problem because the only criterion for IDs is that they are unique.
Apparently it is the former, and most developers independently generate random IDs because it's easy and is extremely unlikely to result in collisions. But it seems the dev at the top of the list had a sense of vanity instead.
https://en.wikipedia.org/wiki/Universally_unique_identifier
> 00000000-1111-2222-3333-444444444444
This would technically be version 2 (the DCE Security version), which would be built from the date-time and MAC address.
But overall, if you allow any yahoo to pick a UUID, it's not really a UUID, it's just some random string that looks like one.
universally unique identifier (UUID)
> 00000000-1111-2222-3333-444444444444
It's unique.
Anyway we're talking about a package that doesn't matter. It's abandoned. Furthermore it's also broken, because it uses REPL without importing it. You can't even precompile it.
https://github.com/pfitzseb/REPLTreeViews.jl/blob/969f04ce64...
https://devblogs.microsoft.com/oldnewthing/20120523-00/?p=75...
This is too naive. Fixing the problem costs a different amount depending on when you do it. The later you leave it the more expensive it becomes. Very often to the point where it is prohibitively expensive and you just put up with it being a bit broken.
This article even has an example of that - see the vcpkg entry.
Mostly to avoid downloading the whole repo/resolve deltas from the history for the few packages most applications tend to depend on. Especially in today's CI/CD World.
It relies on a git repo branch for stable. There are yaml definitions of the packages including urls to their repo, dependencies, etc. Preflight scripts. Post install checks. And the big one, the signatures for verification. No binaries, rpms, debs, ar, or zip files.
What’s actually installed lives in a small SQLite database, and searching for software does a vector search on each package's YAML description.
Semver included.
This was inspired by brew/portage/dpkg for my hobby os.
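To picture the installed-state side, it could be as small as this (the table layout is my guess, not the parent's actual schema):

```python
# Hypothetical installed-packages store for a small package manager:
# one SQLite table, with the package definitions themselves living
# as YAML files in a git branch.
import sqlite3

db = sqlite3.connect("installed.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS installed (
        name     TEXT PRIMARY KEY,
        version  TEXT NOT NULL,        -- semver string from the YAML definition
        source   TEXT NOT NULL,        -- git ref the definition came from
        manifest TEXT NOT NULL         -- the YAML definition as installed
    )
""")
db.execute(
    "INSERT OR REPLACE INTO installed VALUES (?, ?, ?, ?)",
    ("hello", "1.2.3", "refs/heads/stable", "name: hello\nversion: 1.2.3\n"),
)
db.commit()
print(db.execute("SELECT name, version FROM installed").fetchall())
```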
Sure, eventually you run into scaling issues, but that's a first world problem.
As it is, this comment is just letting out your emotion, not engaging in dialogue.
The point being, if you're not sure whether your project will ever need to scale, then it may not make sense to reinvent the wheel when git is right there (and then invent the solution for hosting that git repo, when Github is right there), letting you spend time instead on other, more immediate problems.
I am sure there's value having a vision for what your scaling path might be in the future, so this discussion is a good one. But it doesn't automatically mean that git is a bad place to start.
The issues are only fundamental with that architecture. Using a separate repo for each package, like the Arch User Repos, does not have the same problems.
Nixpkgs certainly could be architected like that and submodules would be a graceful migration path. I'm not aware of discussion of this but guess that what's preventing it might be that github.com tooling makes it very painful to manage thousands of repos for a single project.
So I think the lesson is not that using git as a database is bad, but that using github.com as a database is. PRs as database transactions are clunky, and GitHub Actions isn't really ACID.
The index could be split from the build and the package build defs could live in independent repos (like go or aur).
It would probably take some change to nix itself to make that work and some nontrivial work on tooling to make the devex decent.
But I don't think the friction with nixpkgs should be seen as damning for backing a package registry with git in general.
Much better to start with an API. Then you can have the server abstract the store and the operations - use git or whatever - but you can change the store later without disrupting your clients.
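A minimal sketch of that idea (all names made up):

```python
# The client-facing API stays fixed; the storage behind it can start as
# git and move to a database later without breaking anyone.
from abc import ABC, abstractmethod

class RegistryStore(ABC):
    @abstractmethod
    def list_versions(self, package: str) -> list[str]: ...

    @abstractmethod
    def get_manifest(self, package: str, version: str) -> bytes: ...

    @abstractmethod
    def publish(self, package: str, version: str, manifest: bytes) -> None: ...

class InMemoryStore(RegistryStore):
    """Stand-in backend; a git- or SQL-backed one implements the same API."""
    def __init__(self):
        self._data: dict[tuple[str, str], bytes] = {}
    def list_versions(self, package):
        return sorted(v for (p, v) in self._data if p == package)
    def get_manifest(self, package, version):
        return self._data[(package, version)]
    def publish(self, package, version, manifest):
        self._data[(package, version)] = manifest
```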
This has happened to me a few times now. The last one was a fantastic article about how PG Notify locks the whole database.
In this particular case it just doesn’t make a ton of sense to change course. I'm a solo dev building a thing that may never take off, so using git for plug-in distribution is just a no-brainer right now. That said, I’ll hold on to this article in case I’m lucky enough to be in a position where scale becomes an issue for me.
I don't know if you rely on github.com but IMO vendor lock-in there might be a bigger issue which you can avoid.
Turns out Go modules will not accept a package hosted on my Forgejo instance because it asks for a certificate. There are ways to make go get use SSH, but even with that approach the repository needs to be accessible over HTTPS. In the end, I cloned the repository and used it in my project using a replace directive. It's really annoying.
No, that's false. You don't need anything to be accessible over HTTP.
But even if it did, and you had to use mTLS, there's a whole bunch of ways to solve this. How do you solve this for any other software that doesn't present client certs? You use a local proxy.