Posted by rbanffy 1 day ago
> We find that weaker models’ degradation originates primarily from content deletion, while frontier models’ degradation is attributable to corruption of content.
I think we largely already knew this. This is why we fudge around with harnesses and temperature etc.
It's like how psychopaths are eerie because there's nothing behind their eyes. AI-generated code is eerie because there's nothing between the lines. Code is in some sense theory building, and when you read a human's code you can (mostly) feel their theory working in the background. LLMs have no such theory; the code is just facts strewn about. It's a very weird experience to try to understand it.
I'm looking for a new job.
e.g. setting up better feedback loops, improving CI/CD, breaking changes up at the right scale, etc.
You, I assume, can also then put in more work up front: simulating solutions, writing Lean proofs, and so on?
more engineering, less plumbing
It inserts a pretty unreliable middleman, known for errors and hallucinations, that often just goes down and stops working for reasons we can't control, into a workflow that has worked well for a decade, and we're paying extra to barely break even on the time spent creating new code.
Just because "everyone else is doing it". Not because it's proving to be a boon in productivity.
WAKE UP.
Literally anyone can write a Jira ticket. US engineers are expensive. What do you think will happen when the powers that enacted this policy decide that the ticket-to-merged-into-prod rate is acceptable to them?
I’m sure there are labs out there doing excellent work (especially those focused on theory), but most of the applied research I’ve seen up close and personal is very poor indeed.
We live and learn.
Still a huge fan though.
That is, the LLM should produce a diff, and the user should accept the diff. It seems like a bad pattern to just tell the LLM to edit any long document without that sort of visibility. Same goes for prose as for code.
For example, if there is a code block that needs to be wrapped within another function call, it'll rewrite the entire function call and you'll just have to pray that the re-written code block wasn't subtly changed.
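The diff-then-accept pattern above can be sketched with the standard library; the `total()` function and the `round()` change here are purely illustrative, not from the thread, but they show how a unified diff surfaces the one line the rewrite subtly altered:

```python
import difflib

# Hypothetical "before" block and the LLM's wrapped rewrite.
original = """\
def total(items):
    return sum(i.price for i in items)
"""

rewritten = """\
def total(items):
    return round(sum(i.price for i in items))
"""

# A unified diff pins down the subtle change (the added round())
# instead of forcing you to re-read and re-verify the whole block.
diff = list(difflib.unified_diff(
    original.splitlines(keepends=True),
    rewritten.splitlines(keepends=True),
    fromfile="before", tofile="after",
))
print("".join(diff))
```

Reviewing that two-line diff is the "visibility" being asked for: you accept or reject the exact change, rather than praying nothing else moved.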
I _think_ so far it hasn't introduced any changes...
You can also unit test the function to better assure behavior didn’t change.
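One concrete way to do this is a characterization test: record the current outputs on representative inputs before the agent touches the function, then rerun afterwards. A minimal sketch, where `slugify()` and its cases are hypothetical stand-ins:

```python
# Pin the pre-refactor behavior on known inputs; any drift after the
# LLM's rewrite fails loudly instead of slipping into prod.
def slugify(title: str) -> str:
    # (the version under test, post-refactor)
    return "-".join(title.lower().split())

# Outputs recorded from the implementation *before* the refactor.
CASES = {
    "Hello World": "hello-world",
    "  Spaced  Out  ": "spaced-out",
    "Already-slugged": "already-slugged",
}

def test_behavior_unchanged():
    for raw, expected in CASES.items():
        assert slugify(raw) == expected, (raw, slugify(raw))

test_behavior_unchanged()
```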
Still not an excuse to not read every line of course...
Unit tests give me the confidence that at least those tested logic paths are unaffected.
Sometimes with older codebases one cannot assume the paths have adequate test coverage.
The same is not so easy with free-form text. I have been thinking about this mainly around when agents write or edit plans, but I think figuring out how to do this in general would be a huge breakthrough.
Logical English was one idea I came across, and Runcible (https://runcible.com/) was another I recently stumbled on.