Posted by mfiguiere 9 hours ago
Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.
I got finally fired.
I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.
I'm using Zed and Claude Code as my harnesses.
However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode
Also I don’t know how “improving our Code Review tool” is going to improve things going forward, two of the major issues were intentional choices. No code review is going to tell them to stop making poor and compromising decisions.
Of course, all their vibe coding is being done with effectively infinite tokens, so...
I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak. I could not use Claude any more, I couldn't get anything done.
Will they restore my account usage limits? Since I no longer have Max?
Is that one week usage restored, or the entire buggy timespan?
In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.
At least tell users when the system prompt has changed.
Agents are not deterministic; they are probabilistic. If the same agent is run it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.
I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.
A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.
It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.
Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
Thank you for the perfect explanation.
Last week in my confusion about the word because Anthropic was using test, eval, and harness in the same sentence so I thought Anthropic made a test harness, I used Google asking "in computer science what is a harness". It responded only discussing test harnesses which solidified my thinking that is what it is.
I wish Google had responded as clearly you did. In my defense, we don't know if we understand something unless we discuss it.
The first tries to answer what happens when I give the models harder and harder arithmetic problems to the point Sonnet will burn 200k tokens for 20minutes. [0]
The other is a very deep dive into the math of a reasoning model in the only way I could think to approach it, with data visualizations, seeing the computation of the model in real time in relation to all the parts.[1]
Two things I've learned are that the behavior of an agent that will reverse engineer any website and the behavior of an agent that does arithmetic are the same. Which means the probability that either will solve their intended task is the same for the given agent and task -- it is a distribution. The other, is that models have a blind spot, therefore creating a red team adversary bug hunter agent will not surface a bug if the same model originally wrote the code.
Understanding that, knowing that I can verify at the end or use majority of votes (MoV), using the agents to automate extremely complicated tasks can be very reliable with an amount of certainty.
Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.
I would not suspect quantization before I would suspect harness changes.