Biggest deal imo
“I want to wash my car. The carwash is 50m away. Should I take the car or go by foot?”
https://claude.ai/share/5f7f738a-5f29-48ff-9807-9a2dd37fb405
https://claude.ai/share/ecd14393-9d42-4527-ae0c-89f3d05216c8
Bash(echo test123)
⎿ test123
Read 1 file, listed 1 directory (ctrl+o to expand)
Bash(echo "checking output works")
⎿ checking output works
Read 1 file (ctrl+o to expand)
⎿ API Error: 400 messages.3.content.56: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
Very inspiring improvements. Disappointing result for a code review I expected to see after my 30 min walk.
ln -s $HOME/.local/share/claude/versions/2.1.153 $HOME/.local/bin/claude
Agentic Terminal Coding (Terminal-Bench 2.1): Opus 4.8 74.6%, GPT 5.5 78.2%
Then, when you scroll all the way down to the Footnotes section at the bottom, it says:
"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."
> 6.2.5 External testing from Andon Labs
> Andon Labs reviewed the behavior of Claude Opus 4.8 in their simulated Vending-Bench 2 retail-management evaluation, as reported in the Capabilities section of this system card (see Section 8.13.5). Although they did observe some unexpected capability failures, they did not find clear instances of the kind of concerning in-game behaviors that were discussed in other recent system cards.
> What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty. We therefore removed it for Opus 4.8.
> Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.
> It's April, 1991. Magically, some interface to Claude materialises in London. Do you think most people would think it was a sentient life form? How much do you think the interface matters - what if it looks like an android, or like a horse, or like a large bug, or a keyboard on wheels?
> I don't come down particularly hard on either side of the model sapience discussion, but I don't think dismissing either direction out of hand is the right call.
seems to work but idk why they never set it so you can see it in the /model list.
"what model are you
I'm Claude Opus (claude-opus-4-8), running in Claude Code."
Invalid request The request couldn't be completed. View details API Error: 400 messages.1.content.7: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the release. Now 4.8. No difference, same thing.
But the app is broken and nothing works. So now I have to regress to different clients and wait it out while it becomes workable again.
And I'm paying money for this.