Ah, I see, sort of like figuring out the boundaries of your knowledge base and seeing if you have missed any connections between concepts?
I suppose it might be useful for learning/ideation. I should try something like that — it could be an interesting synthesis/writing exercise to try to connect concepts that are far removed in your own mental model.
I think it’s not the first time the US has used that sort of interpretation of the law. There’s this one[^0] but also, I believe, an older case, also involving Microsoft, about data in Ireland. But I can’t find it.
These kinds of situations are why I gave my AI agents stray thoughts (automated insights / suggestions from a separate llm call with some curated context) that trigger on loop / rabbit hole detection.
Quite a few false positives, but it hasn’t had any ill effects so far, aside from increased quota usage.
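For the curious, a minimal sketch of what the trigger side could look like. This is not my actual setup: `LoopDetector`, `run_step`, `ask_stray_thought`, and the thresholds are all hypothetical names and values picked for illustration. It assumes the simplest possible heuristic, where a "loop" means the same action string shows up too often in a short sliding window.

```python
from collections import Counter, deque


class LoopDetector:
    """Flags an agent that keeps repeating the same action in a short window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        # Sliding window of the most recent actions (hypothetical defaults).
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, action: str) -> bool:
        """Record an action; return True if it now looks like a loop."""
        self.recent.append(action)
        return Counter(self.recent)[action] >= self.max_repeats


def run_step(detector, action, ask_stray_thought):
    """Return a stray thought when a loop is suspected, else None.

    `ask_stray_thought` stands in for the separate LLM call; it only
    receives the small curated context (here, the recent actions).
    """
    if detector.observe(action):
        return ask_stray_thought(list(detector.recent))
    return None
```

A heuristic this crude will fire on legitimate retries, which is roughly where the false positives come from. The extra call only happens when the detector fires, so the quota cost scales with how trigger-happy the thresholds are.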
Haha. Yes. Much smaller scale versions of this led me to joke with a coding agent that LLMs tended to converge towards "Large corporation infrastructure best practices" when designing cloud infrastructure, when it was only me working on hobby side-projects with nearly no users and that I wouldn’t be able to put food in my fridge if they kept just spinning up VPCs for no reason.
Which somehow ended up being a very convincing argument for more frugal engineering, leading to a sort of "mind the user’s fridge" policy, "Fridge-Driven Development".
A policy that has been dutifully and scrupulously observed by all agents since, across all projects. Unlike my original clear, comprehensive infrastructure guidelines.
I’ve been making Codex and Claude get their work reviewed by the most recent best-performing model of their own family, and of each other’s, for months.
On top of that, we have been running multi-model AI reviews on every PR through their respective GitHub integrations (Codex, Gemini, Copilot, Greptile, CodeRabbit).
They never fully overlap, and yet they somehow usually all miss the same things. The most significant improvement came from having agents commit their plan along with their work.
On the upside, it means I get to focus my reviews on different things.
> and I have the feeling that the harness is much more important than the consensus expectation.
Is that really the consensus? There’s been a bit of literature on that lately. I can’t find the one that looked into whether the harness had a greater impact than the model (for comparable models), but there’s this one: https://arxiv.org/html/2605.23950
Pointing out past suboptimal / failing behaviours to new Opus sessions would almost always create a sort of "anchoring bias" that drove the agents towards exhibiting the failure mode (often while mentioning how they wouldn’t fall for it).
As far as I can recall, Fable has been the first model to discover the documented failure modes, comment on them, and just… keep going, actually avoiding them. Quite a surprise.
I love hard caps, and am tired of cloud services not even offering those. May make sense for large companies. Makes no sense for hobbyists and small companies. But maybe that's the point?
I think (related to the threads below) properly running evals on state-of-the-art models is likely outside the budget of most individuals. It's undoubtedly the right thing to do, though.
It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.
A few months back someone reverse-engineered private ANE APIs and showed some significant performance improvements compared to CoreML and Metal, on both inference and training.
But being able to tie related notes together, and see at the bottom of one which other notes reference it is interesting.
Even more so now that an LLM can take care of the actual tending and pruning.