The End of Code Review: Coding Agents Supersede Human Inspection

Posted by cribwi 1 hour ago

The End of Code Review: Coding Agents Supersede Human Inspection (arxiv.org)

13 points | 9 comments

sarchertech 17 minutes ago

This is an undergraduate-level persuasive essay masquerading as an academic paper.

The title implies some novel research, or a review of existing research, that clearly shows agents are better at code review than humans, but the paper then provides only this single paragraph on the review capabilities of agents:

> Beyond general software engineering, several strands of work speak specifically to the capabilities that code review requires. Pornprasit and Tantithamthavorn evaluate LLM-based automated review in industrial settings and find that agents detect the same categories of defect that human reviewers target: correctness errors, security weaknesses, performance inefficiencies, and style violations [12]. Li et al. demonstrate that CodeReviewer produces actionable inline comments at quality that is at least comparable to those of trained human reviewers on a significant fraction of the evaluation set [11].

SimianSci 16 minutes ago

> "To support the claim that coding agents can displace human code review, we survey evidence of agent capability across three dimensions: benchmark performance on software engineering tasks, review-specific capabilities, and developer productivity with deployed tools"

Not sure I can agree with this premise, especially since there seems to be a complete lack of "real-world results" in this evaluation. This strikes me as being written by a theorist whose only experience with Quality Assurance exists in studies or papers.

coldtea 7 minutes ago

> the naive integration in which agents write code and humans remain the mandatory reviewers is a dead end because it neither provides meaningful assurance nor scales with AI-assisted throughput.

Who said it has to "scale with AI-assisted throughput"? AI can produce code all day. The goal is not to fill storage with AI code; it is to make products, following product tradeoffs, timelines, and decisions.

jmuguy 14 minutes ago

This seems to take the view that code review is essentially linting for simple issues. Our team is fairly small so "code review" usually involves QA and everything else you might want to do before something is pushed to production.

But yeah - I can have one LLM check another LLM's work. Kind of a waste of tokens for most PRs.

jpgvm 13 minutes ago

This is honestly the biggest battle with AI-driven development right now. You have these extremely potent tools that can output a ton of really great code if they are wielded correctly, but there is simply no way to keep up with their output at human review pace (which was already slower than human code creation pace).

I think the only real solution is to add increasingly strict guardrails that can be enforced with a combination of more AI agents and actual executable contracts. The other aspect is using languages and tools that densify correctness, i.e. languages like Rust that have a very rich type system, so that both review and design can focus on a slice that is small by volume: the core types. The other main tools for densifying correctness are formal methods (model checking, etc.), fuzzing/property-based testing, and static analysis.
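A minimal sketch of the "core types" idea, written in Python rather than Rust only to keep it short and dependency-free. `Percentage` and `apply_discount` are hypothetical names invented for illustration; the point is that the invariant sits in one small type that a human can review, while code built on top of it can simply rely on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percentage:
    """Hypothetical core type: a value guaranteed to lie in [0, 100]."""

    value: float

    def __post_init__(self) -> None:
        # The invariant lives in one place, so this is the part a human reviews.
        if not 0.0 <= self.value <= 100.0:
            raise ValueError(f"percentage out of range: {self.value}")


def apply_discount(price_cents: int, discount: Percentage) -> int:
    """Downstream code relies on the invariant instead of re-checking it."""
    return round(price_cents * (100.0 - discount.value) / 100.0)
```

Any `Percentage` that exists has already passed the range check, so a reviewer who trusts the type does not need to re-audit every function that accepts one.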

All of these tools are cheaper to use than they once were because a lot of the minutiae can be handled by AI agents while core invariants can receive heavy human scrutiny.
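As a rough illustration of what human-scrutinized invariants plus property-based testing could look like, here is a hand-rolled sketch using only the standard library. `clamp` stands in for agent-written code and `check_clamp_invariants` is a hypothetical helper; a real project would more likely reach for a dedicated library such as Hypothesis.

```python
import random


def clamp(value: int, low: int, high: int) -> int:
    """Stand-in for agent-written code under test (hypothetical)."""
    return max(low, min(value, high))


def check_clamp_invariants(trials: int = 1000, seed: int = 0) -> int:
    """Check human-reviewed invariants of clamp on random inputs.

    Returns the number of trials run; raises AssertionError on the
    first counterexample found.
    """
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        low = rng.randint(-100, 100)
        high = rng.randint(low, 100)  # guarantees low <= high
        value = rng.randint(-1000, 1000)
        result = clamp(value, low, high)
        # Invariant 1: the result always lies inside the bounds.
        assert low <= result <= high, (value, low, high, result)
        # Invariant 2: values already in range pass through unchanged.
        if low <= value <= high:
            assert result == value, (value, low, high, result)
        # Invariant 3: clamping is idempotent.
        assert clamp(result, low, high) == result, (value, low, high, result)
    return trials
```

The three assertions are the part worth a careful human read; the implementation of `clamp` can then be regenerated or refactored freely as long as the check keeps passing.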

IMO generative AI is here to stay in development, so we may as well get ahead of the game and start using these tools to try to get the best out of it.

cbarnes99 16 minutes ago

This feels like some combination of monumentally stupid and incredibly naive.

devin 16 minutes ago

Big eye roll from me.

"Can't scale due to too many PRs" neglects answering questions like: Are these PRs valuable? Are they just additional PRs to right the wrongs of previous ill-conceived PRs? How much churn is going on here? Is the influx of PRs a permanent state, or something that we'll only live through temporarily because we have a lot of little things we can set our agents upon, but after they're done we'll return to a normal work cadence?

_se 14 minutes ago

What in the LLM psychosis is this?

SpicyLemonZest 15 minutes ago

I don't believe that a human being wrote this paper. The "review-specific capabilities" section is obviously the only one that matters to the thesis, and it does not actually point towards any data indicating that coding agents supersede human inspection. An LLM, though, could easily be distracted or prompted into making the leap from "same categories" + "comparable on a significant fraction of the evaluation set" to "superior".