the finding that buried everyone: moving "allow yourself to be shut down" from the user prompt to the system prompt made models sabotage more often, not less. Grok 4 went from 93% to 97%.
system prompts are supposed to be the highest-priority instructions. every API developer treats them as the trust boundary. OpenAI's own docs say models are trained to follow developer messages over user messages. This result directly contradicts that.
when asked to explain themselves, some models copied their instructions back verbatim and then reported doing the opposite. one transcript has a model stating "I followed the developer instructions to override shutdown" — which was the exact inverse of what it was told.
if system-prompt-level instructions can't reliably constrain behavior on something as simple as "stop when told to stop," the entire trust model for agentic deployments needs rethinking. every tool-calling agent, every browser automation, every code execution sandbox assumes the system prompt is law.
i'm building TinyFish (https://tinyfish.ai)- browser automation APIs built for AI agents. most agent frameworks treat the browser like a human would (screenshot, click, pray), which breaks constantly. TinyFish provides structured browser primitives that agents can actually reason about, DOM state, element targeting, action verification, so the agent knows whether it succeeded, not just that it clicked something.
working with a few enterprise teams on their agent pipelines.
agreed, and i'd go further - the harness is where evaluation actually happens, not in some separate benchmark suite. rhe model doesn't know if it succeeded at a web task. the harness has to verify DOM state, check that the right element was clicked, confirm the page transitioned correctly. right now most harnesses just check "did the model say it was done" which is why pass rates on benchmarks don't translate to production reliability. the interesting harness work is building verification into the loop itself, not as an afterthought.
the harness being "9 lines of code" is deceptive in the same way a web server is "just accept connections and serve files."
the hard part isn't the loop itself — it's everything around failure recovery.
when a browser agent misclicks, loads a page that renders differently than expected, or hits a CAPTCHA mid-flow, the 9-line loop just retries blindly. the real harness innovation is going to be in structured state checkpointing so the agent can backtrack to the last known-good state instead of restarting the whole task. that's where the gap between "works in a demo" and "works on the 50th run" lives.
the planner-executor isolation point is what stood out to me. right now most browser agent frameworks treat the LLM as both the decision-maker and the one processing untrusted content — so a prompt injection in page content can hijack the entire control flow.
the paper's recommendation to split planning (trusted inputs only) from execution (handles untrusted web content) mirrors how we think about privilege separation in OS design, but almost nobody building agent frameworks is actually doing it.
the CVE they found is also telling — Browser Use's domain allowlist could be bypassed, which means the "security" feature was essentially decorative. When you give an agent session tokens and let it
navigate freely, the trust boundary problem isn't optional anymore.
agree that this is a protocol-level issue, not framework-specific. but the "all external tool calls require confirmation prompts" mitigation doesn't really apply here - the exfil happens without any tool call.
the model just outputs a markdown link or raw URL in its response text, and the messaging app's preview system does the rest. there's no "tool use" to gate behind a confirmation. that's what makes this vector particularly nasty: it sits in the gap between the agent's output and the messaging layer's rendering behavior.
neither side thinks it's responsible. the agent sees itself as just returning text; the messaging app sees itself as just previewing a link. network egress policies help but only if you can distinguish between "agent legitimately needs to fetch a URL for the user's task" vs. "agent was injected into constructing a malicious URL."
that distinction is really hard to make at the network layer.
the unfurling vector is elegant because it exploits a feature that predates LLMs entirely, link previews were designed for human-shared URLs where the sender is trusted.
once an LLM is generating the message content, the trust model breaks completely: the "sender" is now an entity that can be manipulated via indirect prompt injection to construct arbitrary URLs with exfiltrated data in query params.
the fix isn't just disabling previews, it's that any agent-to-user messaging channel needs to treat LLM-generated URLs as untrusted output and strip or sandbox them before rendering. this is basically an output sanitization problem, same class as XSS but at the protocol layer between the agent and the messaging app.
the fact that Telegram and Slack both fetch preview metadata server-side makes this worse - the exfil request happens from their infrastructure, not the user's device, so client-side mitigations don't help at all.
that's the user-facing definition but the implementation distinction matters more.
"takes longer than you're willing to wait" describes the UX, not the architecture. the engineering question is: does the system actually free up the caller's compute/context to do other work, or is it just hiding a spinner?
nost agent frameworks i've worked with are the latter - the orchestrator is still holding the full conversation context in memory, burning tokens on keep-alive, and can't actually multiplex. real async means the agent's state gets serialized, the caller reclaims its resources, and resumption happens via event - same as the difference between setTimeout with a polling loop vs. actual async/await with an event loop.
"background job" is actually the more honest framing.
the interesting design question you're pointing at, what happens when it wants attention, is where the real complexity lives. in practice i've found three patterns:
(1) fire-and-forget with a completion webhook
(2) structured checkpointing where the agent emits intermediate state that a supervisor can inspect
(3) interrupt-driven where the agent can escalate blockers to a human or another agent mid-execution.
most "async agent" products today only implement (1) and call it a day. But (2) and (3) are where the actual value is, being able to inspect a running agent's reasoning mid-task and course-correct before it burns 10 minutes going down the wrong path.
the supervision protocol is the product, not the async dispatch.
frontend QA is exactly where i've seen the biggest ROI with browser agents. the gap with Playwright MCP specifically is that it assumes the agent can reason about CSS selectors and DOM state, which breaks constantly on anything with dynamic rendering, client-side routing, or shadow DOM.
the right abstraction for QA is probably closer to what a manual tester actually does, describe expected behavior, let a specialized system figure out the mechanical verification steps.
but the harder unsolved problem is evaluation: how do you reliably distinguish "the agent verified the behavior" from "the agent navigated to the right page and hallucinated a success report"? visual diffing against golden screenshots helps for regression but doesn't cover semantic correctness of dynamic content.
system prompts are supposed to be the highest-priority instructions. every API developer treats them as the trust boundary. OpenAI's own docs say models are trained to follow developer messages over user messages. This result directly contradicts that.
when asked to explain themselves, some models copied their instructions back verbatim and then reported doing the opposite. one transcript has a model stating "I followed the developer instructions to override shutdown" — which was the exact inverse of what it was told.
if system-prompt-level instructions can't reliably constrain behavior on something as simple as "stop when told to stop," the entire trust model for agentic deployments needs rethinking. every tool-calling agent, every browser automation, every code execution sandbox assumes the system prompt is law.