Posted by HellsMaddy 12 hours ago

Claude Opus 4.6 (www.anthropic.com)
1675 points | 709 comments
mlmonkey 7 hours ago|
> We build Claude with Claude.

How long before the "we" is actually a team of agents?

mercat 5 hours ago|
Starting today maybe? https://code.claude.com/docs/en/agent-teams
22c 1 hour ago||
I tried teams; it's a good way to burn all your tokens in a matter of minutes.

It seems that the Claude Code team has not properly taught Claude how to use teams effectively.

One of the biggest problems I saw is that Claude treats team members like real workers: once they finish a task, they should immediately be given the next one. What should really happen is that once an agent finishes a task it should be terminated, and a new agent spawned for the next task.
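
Something like this is what I mean, a minimal sketch of the spawn-per-task pattern against the plain Messages API (this is not the actual teams API, and the model id is a placeholder):

    import anthropic

    client = anthropic.Anthropic()

    def run_task(task: str) -> str:
        # Spawn-per-task: each task gets a brand-new "agent", i.e. a fresh
        # message list, so no stale context (or token bill) carries over.
        reply = client.messages.create(
            model="claude-opus-4-6",  # placeholder model id
            max_tokens=2048,
            system="You are a worker agent. Complete exactly one task, then stop.",
            messages=[{"role": "user", "content": task}],
        )
        return reply.content[0].text

    # The anti-pattern is one long-lived agent that keeps appending task after
    # task to the same message list: its context (and cost) only ever grows.
    for task in ["write the parser", "write tests for the parser"]:
        print(run_task(task))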

DanielHall 9 hours ago||
A bit surprised that the first release wasn't Sonnet 5 after all, since the Google Cloud API had previously leaked Sonnet 5's model snapshot codename.
denysvitali 9 hours ago|
Looks like a marketing strategy to bill more for Opus than Sonnet
silverwind 11 hours ago||
Maybe that's why Opus 4.5 has degraded so much in recent days (https://marginlab.ai/trackers/claude-code/).
jwilliams 9 hours ago|
I’ve definitely experienced a subjective regression with Opus 4.5 the last few days. Feels like I was back to the frustrations from a year ago. Keen to see if 4.6 has reversed this.
apetresc 11 hours ago||
Impressive that they publish and acknowledge the (tiny, but real) drop in performance on SWE-Bench Verified from Opus 4.5 to 4.6. Obviously such a small drop on a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to centre on managing larger context).

But considering how SWE-Bench Verified seems to be the tech press's favourite benchmark to cite, it's surprising that they didn't try to head off the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.

epolanski 9 hours ago||
From my limited testing, 4.6 does deeper analysis of codebases and is better at catching bugs and oddities.

I had two different PRs with some odd edge case (thankfully caught by tests); 4.5 kept running in circles, creating test files and running `node -e` or `python3` scripts all over the place, and couldn't make progress.

In both cases 4.6 thought for around 10 minutes and found a two-line fix for a very complex, hard-to-catch regression in the data flow without having to test anything, just by thinking.

SubiculumCode 11 hours ago||
Isn't SWE-Bench Verified pretty saturated by now?
tedsanders 11 hours ago||
Depends on what you mean by saturated. It's still possible to score substantially higher, but there is a steep difficulty jump that makes climbing above 80%ish pretty hard (for now). If you look under the hood, it's also a surprisingly poor eval in some respects: it only tests Python (a ton of Django), and it can suffer from pretty bad contamination problems because most models, especially the big ones, remember these repos from their training. This is why OpenAI switched to reporting SWE-Bench Pro instead of SWE-Bench Verified.
oytis 8 hours ago||
Are we unemployed yet?
derwiki 5 hours ago|
No? The hardest part of my SWE job is not the actual coding.
codexon 4 hours ago||
Even for coding, it seems to still make A LOT of mistakes.

https://youtu.be/8brENzmq1pE?t=1544

I feel like everyone here is counting chickens before they hatch, with all the doomsday predictions and extrapolating LLM capability out to infinity.

The people overhyping this seem to be either non-technical or just making landing pages.

HacklesRaised 4 hours ago||
I don't think LLMs will make us more stupid; we were already scraping the bottom of the barrel.
ayhanfuat 11 hours ago||
> For Opus 4.6, the 1M context window is available for API and Claude Code pay-as-you-go users. Pro, Max, Teams, and Enterprise subscription users do not have access to Opus 4.6 1M context at launch.

I didn't see any notes but I guess this is also true for "max" effort level (https://code.claude.com/docs/en/model-config#adjust-effort-l...)? I only see low, medium and high.

makeset 9 hours ago|
> it weirdly feels the most transactional out of all of them.

My experience is the opposite: it is the only LLM I find remotely tolerable to have collaborative discussions with, like a coworker, whereas ChatGPT is by far the most insufferable twat, constantly and loudly asking to get punched in the face.

data-ottawa 11 hours ago||
I wonder if I've been in an A/B test with this.

Claude figured out zig’s ArrayList and io changes a couple weeks ago.

It felt like it got better then very dumb again the last few days.

derwiki 5 hours ago||
What companies do you interact with that don’t A/B test?
throwaway2027 9 hours ago||
Do they just have the version ready and wait for OpenAI to release theirs first, or is it the other way around?
lukebechtel 11 hours ago|
> Context compaction (beta).

> Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.

Not having to hand-roll this would be incredible. One of the best Claude Code features, tbh.
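
For anyone still hand-rolling it, the usual shape is something like this (a rough sketch; the ~4 chars/token estimate, the threshold, and the model id are all assumptions, and the summary is just another Messages call):

    import anthropic

    client = anthropic.Anthropic()
    MODEL = "claude-opus-4-6"    # placeholder model id
    THRESHOLD_TOKENS = 150_000   # compact when the transcript nears the window

    def approx_tokens(messages):
        # Crude estimate (~4 chars per token) to avoid an extra API round trip.
        return sum(len(str(m["content"])) for m in messages) // 4

    def compact(messages, keep_last=10):
        # Summarize everything except the most recent turns, then splice the
        # summary back in as a single message so the conversation can continue.
        if len(messages) <= keep_last:
            return messages
        old, recent = messages[:-keep_last], messages[-keep_last:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
        summary = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system="Summarize this transcript so the conversation can be resumed. "
                   "Keep decisions, open tasks, and file/function names.",
            messages=[{"role": "user", "content": transcript}],
        ).content[0].text
        return [{"role": "user", "content": "[Summary of earlier context]\n" + summary}] + recent

    def maybe_compact(messages):
        return compact(messages) if approx_tokens(messages) > THRESHOLD_TOKENS else messages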
