Posted by mfiguiere 12 hours ago
Do researchers know correlation between various aspects of a prompt and the response?
LLM, to me at least, appears to be a wildly random function that it's difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system returned the output. This doesn't appear to be the case for LLM where inputs and outputs are any texts.
Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example of JSON structure with ```, adding a newline or wording I used wildly changed accuracy.
The artificial creation of demand is also a concerning sign.
The real lesson is that an internal message-queuing experiment masked the symptoms in their own dogfooding. Dogfooding only works when the eaten food is the shipped food.
But if we're vibing... This is the kind of bug that should make it back into a review agent/skill's instructions in a more generic format. Essentially if something is done to the message history, check there tests that subsequent turns work as expected.
But yeah, you'd have to piss off a bunch of users in prod first to discover the blind spot.
Resuming it cost 5% of the current session and 1% of the weekly session on a max subscription.
Many of these things have bitten me too. Firing off a request that is slow because it's kicked out of cache and having zero cache hits (causes everything to be way more expensive) so it makes sense they would do this. I tried skipping tool calls and thinking as well and it made the agent much stupider. These all seem like natural things to try. Pity.