Posted by stalfosknight 15 hours ago
This problem existed before AI, but it is now just worse due to the spamming nature of these "contributors". It's another form of endless September where people unfamiliar with the norms of team software development are overwhelming existing project maintainers faster than maintainers can teach them the norms of behaviour.
In the end, some sort of gatekeeping mechanism is needed to avoid overwhelming maintainers, whether it's a reputation system, membership in an in-group, or something else.
The tooling is telling laymen that they built wonderful things that definitely work and perfectly fix bugs and add features.
The tooling gasses them up and is simply wrong in these cases.
If your tool regularly lies, gaslights and produces wrong results, that's a tooling issue.
Can a voltmeter _lie_ to you?
EEs are expected to know when their measurements are wrong. And Professional Engineers are legally accountable for the consequences of such mistakes.
Even if your tool learns to talk and to make decisions, it's still a tool, not a person. You're the person and the one responsible for the decisions you make based on your tools.
Going back from the analogy, the problem is that we conflated software _engineers_ with "coders". A lot of people thought their job was to create code, we gave them a tool to generate a lot of code fast, and they truly think that "more code" = "more good".
I don’t use an adblocker, do read traditional dead tree newspapers and do get exposed to satellite tv channels.
I don’t think I’ve ever seen anyone anywhere telling me how reliable LLMs are.
Pretty sure this tech sells itself to consumers, enterprise sales are what they’ve always been.
The same way we fund other social services here in Europe. If an individual is incapable of caring for themselves, the state is expected to care for them.
Full disclosure: I do high voltage testing for a living.
All of these systems are designed around the core idea of "a human acting irrationally or improperly is not at fault" and, furthermore, that a human can have a bad day and still avoid a mistake. They all steer someone around a possible fault. Hell, dividing the road into lanes is itself a forcing function to avoid traffic collisions!
So, where is the forcing function in large language models? What part of a large language model prevents gross misuse by laymen?
I can think of examples here and there, maybe. OpenAI had to add guard rails to stop people from poisoning themselves with botulism and boron, etc. But the problem here is that the LLM is probabilistic, so there's really no guarantee that those guard rails will hold. I seem to remember there being a paper from a few months back, posted here, showing that AI guardrails cannot be proven to work consistently. In that context, LLMs cannot be considered "safe" or "reliable" enough for use. Eddie Burback has a very, very good video showing an absolute worst case result of this[1], which was posted here last year. Beyond that, off the top of my head, Angela Collier has a really, really good video demonstrating that there's an absolute plethora of people who have succumbed, in large ways or small, to the bullshit AI can spew[2].
I feel like if most developers were actually serious about being an engineering discipline, like we claim, then we wouldn't have all jumped on the LLM bandwagon until it had been properly tested and reached a certain level of reliability. Instead there's a sizable chunk of people saying they've stopped coding by hand entirely, and aren't even reviewing the code! i.e., they've thrown out a forcing function that existed to prevent erroneous PRs being committed! And for some bizarre reason, after about two decades of people talking about type safety and how we need formal verification to reduce error, everyone seems to be throwing "reduction of error" out the window!
[0]: https://en.wikipedia.org/wiki/Behavior-shaping_constraint (if you're curious about the term)
Development can’t be a “serious” engineering discipline because the economics of tech companies doesn’t allow for it. But this has a lot less to do with developers, and significantly more to do with the severe pressure company executives are putting on everyone to use AI, no matter what.
But let’s be honest, many companies have adopted things like root cause analysis and blameless postmortems to deal with infrastructure reliability and reducing incidents. Making systems resilient to human mistakes, making it impossible for the typo to blow up a database, etc. are considered best practices at most places I’ve worked. On the product side, I think it’s absolutely normal to make it hard for a user to take an action that would seriously mess up their account.
The core problem happens when your product idea (say, social media) has vast negative externalities which the company isn’t forced to deal with economically. Whereas in other engineering disciplines, many things are actually safety-related, things you could get sued over. I’m imagining pretty much anything a structural engineer or electrical engineer works on could seriously hurt or kill someone if a bad enough mistake was made.
That just doesn’t apply to software. There is a lot of “life & death” software, but it’s more niche. The reality is that 90% of what the tech industry works on is not capable of physically harming humans, and it’s not really possible to sue over the potential negative consequences of… a dev tooling startup? It’s a very, very different industry than those other engineering disciplines work in.
But, software engineering has actually been extremely successful at minimizing risk from software defects. The worst software-level mistake I could realistically make would… crash my own program. It likely wouldn’t even crash the operating system, since it’s isolated. That lack of trust in what other people might do is codified everywhere in software. On an iPhone, I’m downloading apps written by tens of thousands of other engineers, at essentially no risk to myself at all.
Hell fucking yes it can?
Now you might say: "but the datasheet will give you the tolerances, and the manual will tell you to mind it!"
And yes, that's true. Just like how LLM providers also do: they tell you that outputs may be arbitrarily wrong, and that you should always check for mistakes.
Is this bullshit? Yes. So are metrology tools that have mismatched precision and accuracy, need calibration, and have designs that fail to make you mind either of these, leaving you to dig it out of the datasheet instead. Which just so happens to be a whole lot of them.
It is also absolutely not bullshit of course, because it is a fundamental limitation, just like those properties are for metrology devices. LLMs produce arbitrary natural language. Short of becoming able to perfectly read and predict the users' mind, they'll never be able to make any hard assurances, ever.
Defective devices also exist, and so does incorrect or incomplete documentation.
That’s the behavioral problem.
When AI is assisting a professional, the outcome is vastly different.
(cf https://en.wikipedia.org/wiki/1995_Greater_Pittsburgh_bank_r...)
Hype is bad. Unwarranted hype is worse. Enabling people who can't do a thing to do what they think the thing is, but isn't, because they don't know any better, is inflicting a pox upon the world.
It's a human issue if you don't recognise that the code it's generated is wrong. That will never change no matter how good the tooling gets.
Would anyone use a calculator confidently, if the result was randomly generated?
LLMs spit out a sequence of tokens that is the most probable continuation of the input. LLMs don't lie any more than technical analysis does when it predicts the most likely trend of stock prices. It's up to you how to use this information.
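To make that concrete, here's a toy sketch of that sampling step (made-up vocabulary and scores, not any real model):

    import numpy as np

    def sample_next_token(logits, temperature=1.0):
        # Softmax turns raw scores into a probability distribution.
        probs = np.exp(logits / temperature)
        probs = probs / probs.sum()
        # Sampling means even a low-probability token can come out;
        # nothing in here models whether a token is "true".
        return np.random.choice(len(probs), p=probs)

    # Toy vocabulary with made-up next-token scores.
    vocab = ["works", "fails", "compiles", "lies"]
    logits = np.array([2.0, 1.0, 0.5, -1.0])
    print(vocab[sample_next_token(logits)])

Every word the model emits is chosen this way; "truth" never enters the loop, only probability.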
Something like a big emulator is very complex and has a LOT of motivated users who aren’t going to be able to make quality submissions.
So they get it in volume where it may be nearly impossible to deal with.
Logistically & brand-wise, they're messy to deal with, but they result in a "filter" of sorts that the original project can pick & choose to upstream back into their code.
No one's going to be trusting forks or new projects for a while. The bar for merely generating new code is now too low to give a meaningful signal. Reputation and longevity will likely be useful metrics, hence the AI pull requests will continue to be opened against high-reputation projects that have strong brands. Not unlike Ethereum's switch from proof of work to proof of stake.
Sounds futuristic. Maybe it's an NFT on an agentic blockchain for deep-sea solar farm mining?
Why are they doing that (i.e. being sarcastic)? Who knows.
And I don't think anyone actually trusts any major actor to verify anything, so a fully centralized system is likely out. Otherwise people would be hype about WorldCoin, instead of recognizing it for the stupendously malicious grift that it is.
Curse of knowledge much?
Every model seemingly falls flat in this scope of programming. The PS3 is very complex and the tooling is fairly undocumented in a lot of instances. It doesn't surprise me that most of these AI PRs are nonsense.
If anyone else has attempted writing PS3 homebrew apps using AI and has refined their tooling/systems/automation please let me know how you got the agents to work for you (:
In a complex codebase it’s funny how often they’ll come back with gigantic commits that just make everything worse or accomplish the goal but have 1000 lines of unnecessary complexity.
Every time, they present it with a confident summary. I can see how a junior or just lazy dev would think this is their ticket to becoming a contributor to a repo with some big thing to put on their resume.
I get the impression that the “10x velocity!!!!” claims still only reflect which areas have a sufficient corpus to learn from, rather than any inductive reasoning.
I mean, yes essentially, right? Scraping every blog on the topic to generate a response without any actual coding experience behind it is literally how it was made.
You do realize that’s actually how they work, right? They don’t understand or reason about anything; your prompt and other input are just about trying to guide where the pachinko balls fall in the output.
I guess it's nice people want to help and AI assisted coding can be fine but I can't imagine submitting a PR to a high-profile, much-revered project like that without reviewing and thoroughly testing it myself.
Or maybe it's worse because a lot of them aren't acting in bad faith; they're well-meaning people who just don't know or understand enough to realize they aren't being helpful.
The article unfortunately feels more like a rant than a good exploration of the problem space.
If this is a consistent issue, your contribution would (ideally) be continuously put into a backlog until someone else with no connection to you verifies that it's as bug-free as it appears to be. (Excluding non-obvious security & performance issues)
> Is it that you're not allowed to say Claude ate my homework?
Yes. As the contributor, you should be the first one to look over the code, not someone else.
(At least for any coding LLM that isn’t trained entirely on one company’s own code and also offered by that company. That sort of LLM might be able to make the regurgitation argument work for them.)
Thus any project requiring “full responsibility” by submitters may as well just ban submitters from using LLM-based tooling. That’s the tack I’ve taken for my projects, and a number of large projects have taken that stance too.
(Before someone trots out “Technical enforcement of this is impossible!” be assured that such rules are not negated by a lack of technical enforcement; after all, there’s also no way to technically enforce that you didn’t copy someone else’s code and paste it in. But by thinking a lack of technical enforcement matters, you’re outing yourself as someone who will happily violate rules if they think they won’t get caught.)
The people who can realistically submit a Linux patch that will ever get looked at are already a super select group, thanks to who-you-know network effects.
You can't apply the same system to random open source projects. The best option for people who run random small-to-medium-sized open source projects is just to ban all unsolicited PRs; otherwise you're going to spend way too much effort sorting through the slop.
There’s no need to test the PR when you already asked the AI to not make any mistakes.
A) tests need to pass
B) anything you write needs tests
C) the code quality must adhere to these standards
etc. etc., helping the LLMs that people vibe-code with produce better-quality results.
Without these in place, people who want to help out can't, because they don't understand what's going on.
Adding stuff to these files would allow developers to give guidelines and guardrails for development using these agents.
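For example, such a file might look something like this (a hypothetical sketch; the file name and the specific rules are illustrative, not any particular project's policy):

    # AGENTS.md (hypothetical)
    - All tests must pass before opening a PR: run `make test`
    - Any new code needs accompanying unit tests
    - Match the existing code style and run the project's linter
    - PR descriptions must explain the change in your own words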
Should the barrier of entry be someone who knows how to code? Or should the barrier of entry be someone who is motivated to help with open-source software?
The motivation to help the OSS project should also come with the obligation to learn how the software operates, at least on a conceptual level. The desire to help does not grant people the pass to sledgehammer their way into adding in a feature.
This strikes me as the ideal LLM first contribution/PR, a file explaining the projects standards and testing and structure.
> This strikes me as the ideal LLM first contribution/PR, a file explaining the projects standards and testing and structure.
Why should the project maintainers write such a file, when the info already exists within the README? It is duplicated work at best, and a definitive sign of the incapabilities of the agent to properly parse the project's contribution guidelines.
https://github.com/RPCS3/rpcs3/blob/master/README.md#contrib...
Probably yes? QED, submitting slop PRs is not helping. If "helping" means sticking it through an LLM, the developers can do that themselves, with better insight and guidance. If you must help via an LLM, donate cash for tokens.
If you can't code, and can't donate cash/machine time, help by confirming issue reproductions, design, wikis, documentation, whatever.
And since the training data seems to be very lacking, no amount of markdown would fix that.
I imagine the problem will persist if users continue to submit PRs that pass the harness without being able to validate for themselves that it actually works.
I don’t mean to pile on, but like… are you actually helping if you don’t understand the code you’re fixing, don’t understand the problem you’re addressing, and don’t understand the potential solution you’re submitting for that unknown problem? Or are you just making a lot of distracting noise so you can pat yourself on the back?
I think people need to be a bit more self-critical about what they’re actually up to, and who is actually benefiting from it. Generally, from comments like yours, the answers seem to be “self-aggrandizement” and “no one”, but people really don’t want to think they might be the bad guys.
One of the projects I work on recently had a guy drop by and explain that he wanted to use Claude to clean up our backlog and he absolutely could not fathom why I kept bringing up that we would only accept PRs that reduced our work instead of increasing it. "Do you know what Opus 4.7 is?" "Why are you so close-minded?". Unfortunately it is very hard for these users to understand that the thing they are using has a bar for quality and the bugs that still slip through cannot be solved by waving a magic wand at it.
If these people can make changes to the emulators that will actually make the games more playable for them, the changes don't have to go back into the official project. It works for them and makes things better.
Right now, I've been working on some changes to the mkv container spec to have embedded scripting capable of doing Black Mirror: Bandersnatch in interactive mode, in VLC and mpv. I've already added mutable torrent support to Transmission, and it works. But yeh, if someone took a look at it who really knew the code, they'd see it was AI slop and do a hard pass.
The prestige of being "the one that added feature X to OSS project Y". The things that would've been actually useful (bug diagnostics/troubleshooting, merging duplicate issues & PRs) do not offer the same level of prestige.
Maybe they use Claude or whatever and tell it to fix the problem and then just blindly submit it.
I could see people doing that without knowing enough to be able to compile and test the code, ignoring whether it’s good or not. So they just submit it and hope it gets merged to “fix” the problem, having no understanding of what’s involved or how much of a burden that is.
Now imagine a whole bunch of people doing that for a whole bunch of really complex bugs in 75 different games. It’s not like the PlayStation 3 was a simple system.
Though one plus point: a dev can ask the LLM to:
- Split a PR into logical patches
- Explain each one
From there, ask questions and edit and rebase each one until it makes sense, because it's guaranteed that not all of it will until you do that.
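The mechanical part of that splitting doesn't even need an LLM; plain git does it, something along these lines (standard commands, the base branch name is illustrative):

    git rebase -i origin/main    # mark the oversized commit as "edit"
    git reset HEAD^              # put its changes back into the working tree
    git add -p                   # stage one logical change at a time
    git commit -m "extract: first logical patch"
    # repeat add -p / commit until nothing is left, then:
    git rebase --continue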