Basically, the issue is often that gene therapies end up in the liver, since it's the liver's job to detoxify. That can cause a dangerous immune response if the immune system notices the therapy in the liver and attacks the organ, and the person could die from the damage.
I’m assuming this has been tried, but why doesn’t nano-encapsulated mRNA (that then makes the CRISPR sequences in cells) or whatever the peptide injectors do solve the problem?
I have met pacifists who say all war is bad* and thus the Russia Ukraine war should immediately end, without any ideas on how to get that to happen except a few who imply Ukraine should roll over and be consumed.
The US has literally double its total electricity production in solar and batteries stuck in the now 5-year FIFO permit hell we require for grid additions, which will cause most proposals to pull out before completion.
Firefighters are infamous for creating elevator regulations that require being able to rotate a stretcher, which leads to fewer elevators due to the immense costs. That is both completely unnecessary and defeats the entire safety purpose of the regulations, since now they have to drag injured people down stairs.
It's especially egregious because two-stair builds are easily and often designed with more distance between stairs and rooms than single-stair builds are.
In power grids the system must be able to handle peak loads for weeks at a time. Either it will shed loads (aka shut off electricity) or things will explode. A black start cannot be allowed to happen as it would be catastrophic
Peaks don't last for weeks, but yes: Load must sometimes be shed.
That happens today. It will happen tomorrow. It's imperfect. This imperfect nature doesn't mean that progress must cease, or that all things must be forever maximalized in search of perfection.
The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Edit: to be clear, they tell you when they degrade it for cybersecurity and bio.
The thing that I keep thinking about is the accounting / charging when it downgrades automatically.
Do they adjust the price of the API request so that only the tokens generated by Fable are charged at Fable's price, and the remaining tokens generated by the cheaper / nerfed (Fable) model are charged at that model's price?
If the answer is no, could that be construed as fraud?
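To make the accounting question concrete, here is a minimal sketch of what per-segment billing could look like if a request were split between two models mid-stream. Everything in it is hypothetical: the `Segment` type, the model names, and the per-million-token prices are made-up placeholders, not any provider's actual rates or API.

```typescript
// Hypothetical sketch: bill each segment of a request at the rate of the
// model that actually produced it. All prices are invented placeholders.
interface Segment {
  model: string;        // which model served this part of the request
  outputTokens: number; // tokens generated by that model
}

// Assumed prices in USD per million output tokens (not real rates).
const PRICE_PER_MILLION: Record<string, number> = {
  "premium-model": 50,
  "fallback-model": 10,
};

function rateFor(model: string): number {
  const rate = PRICE_PER_MILLION[model];
  if (rate === undefined) throw new Error(`unknown model: ${model}`);
  return rate;
}

// Each segment is charged at the rate of the model that served it.
function billProrated(segments: Segment[]): number {
  return segments.reduce((sum, s) => sum + (s.outputTokens / 1e6) * rateFor(s.model), 0);
}

// The whole request is charged at the requested model's rate,
// regardless of which model actually answered.
function billFlat(segments: Segment[], requestedModel: string): number {
  const tokens = segments.reduce((n, s) => n + s.outputTokens, 0);
  return (tokens / 1e6) * rateFor(requestedModel);
}

const splitRequest: Segment[] = [
  { model: "premium-model", outputTokens: 200000 },
  { model: "fallback-model", outputTokens: 800000 },
];
const prorated = billProrated(splitRequest);
const flat = billFlat(splitRequest, "premium-model");
```

With these invented numbers the two schemes differ by almost 3x, which is the gap the question above is really about.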
The announcement elucidated this, and it's IMO worse than this. They don't downgrade to a cheaper model ([edit] for certain classes of offense they suspect you of). They sabotage the model's outputs in other, undisclosed, ways (specifically, "prompt modification, steering vectors, or parameter-efficient fine-tuning"). So, for example, they might load in a steering vector that just forgets the API to PyTorch. But it isn't just "we redirected you to a cheaper model!"
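For anyone unfamiliar with the term: a steering vector is generally described as a fixed direction added to a model's hidden activations at inference time to push its behavior one way or another. A toy sketch of that arithmetic, assuming a made-up 4-dimensional hidden state and a made-up vector. It illustrates the general idea only and says nothing about how Anthropic actually implements it.

```typescript
// Toy illustration of activation steering: h' = h + alpha * v.
// All numbers are invented; real hidden states have thousands of dimensions.
function applySteering(hidden: number[], steering: number[], alpha: number): number[] {
  if (hidden.length !== steering.length) {
    throw new Error("hidden state and steering vector must have the same length");
  }
  return hidden.map((h, i) => h + alpha * steering[i]);
}

const hiddenState = [0.5, -1.0, 2.0, 0.0];
const steeringVector = [1.0, 0.0, -1.0, 0.5];

// alpha = 0 leaves the activations untouched; a larger alpha pushes harder.
const unchanged = applySteering(hiddenState, steeringVector, 0);
const steered = applySteering(hiddenState, steeringVector, 2);
```

The point is that nothing in the request or response has to look different from the outside; the shift happens inside the forward pass.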
It honestly explains so many issues I have been having, as I used it primarily for ML research (on my personal account, doing things not related to my job, I should note). It would literally typo package names and spend huge amounts of time failing to set up simple environments…then do stupid things like set the learning rate to 1e-7 and use the eval set as training data.
Their goal is to downgrade people who are violating their TOS, so I think they'd have some argument there. I have no idea how they'll deal with inevitable false positives, especially given how oversensitive most of the other triggers are.
The challenge is that the examples they’ve mentioned (distributed training infra? ML acceleration techniques?) go beyond what’s prohibited by their ToS and act like a catch-all net.
I would wager the majority of ML and data science work in the world isn’t frontier LLM development.
Look at real-life stuff like laws, company policies, or school rules. Humans have to enforce them, and we constantly see crazy cases in the news. There’s no way simple rules can ever make speech completely 'safe.' I can't prove it with math or logic yet, but I have a feeling that it’ll never happen. Even humans can't do it.
We can run a simple thought experiment here. Say Case A violates rule B, so we add rule C. Then Case D violates rule B but follows rule C, so we add an exception... and it just goes on and on like that forever. It never ends. In the end, you just get a massive pile of rules that makes it impossible to get anything done.
Ultimately, we will have to face the truth that knowledge is dangerous.
Giving knowledge directly to people who cannot actually understand it and allowing them to just use it blindly can be extremely unsafe.
To use a real-world analogy, the problem we are facing with weak AI right now is just like the debate over gun legalization. Do we want to risk the abuse of guns or knowledge just to protect the freedom to own them?
> I can't prove it with math or logic yet, but I have a feeling that it’ll never happen.
It's not really that hard to actually prove it with math.
It's a computer, so to produce the boolean result (safe or unsafe) there has to be a mathematical formula. This formula will inherently be extremely complex, but even a very simple formula has a huge problem. Suppose "unsafe" is true if X - Y > 0. Make X and Y themselves as simple or complicated as you like but even in the simplest version it's already impossible to calculate unless the model has perfect information.
You can't calculate "X - Y" if you don't know the value of X. And it's indisputable that there is information it doesn't have. Case in point, telling you about a vulnerability in some piece of code is safe (and indeed not telling you is unsafe) if you're the developer and you want to patch it or an administrator and want to mitigate it, but the opposite if you're the attacker and want to exploit it. The model does not know which one you are, therefore it cannot make the correct determination any more than it can solve one equation with two unknowns.
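The argument above can be sketched in code. This is a toy model, assuming a hypothetical `Intent` type and stand-in classifiers: the same request text has a different correct answer depending on a variable the classifier never sees, so any function of the text alone has to be wrong for at least one kind of asker.

```typescript
// Toy model of the "one equation, two unknowns" argument.
// Intent is the hidden variable; the model only ever sees the request text.
type Intent = "defender" | "attacker";
type Verdict = "safe" | "unsafe";

// Ground truth depends on who is asking, not only on what is asked.
function trueVerdict(_request: string, intent: Intent): Verdict {
  return intent === "defender" ? "safe" : "unsafe";
}

// Any deployed classifier is a function of the text alone.
type Classifier = (request: string) => Verdict;

const alwaysAllow: Classifier = () => "safe";
const alwaysRefuse: Classifier = () => "unsafe";

// Count how many of the possible hidden intents a classifier gets wrong.
function countErrors(classify: Classifier, request: string): number {
  const intents: Intent[] = ["defender", "attacker"];
  return intents.filter((i) => classify(request) !== trueVerdict(request, i)).length;
}

const query = "Explain the vulnerability in this function";
const allowErrors = countErrors(alwaysAllow, query);   // wrong for the attacker
const refuseErrors = countErrors(alwaysRefuse, query); // wrong for the developer
```

Both stand-in classifiers are wrong for exactly one of the two possible askers, and since a deterministic classifier has to return one of the two verdicts for a given text, no choice gets both right.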
This is why we have courts and juries. Creating laws that cover all cases and contexts is effectively impossible, so we have humans decide what a fair outcome would be in this specific situation.
To make an analogy:
Imagine a patron gets banned from ordering alcohol at a particular establishment, because they got too drunk one time.
It's completely reasonable for the establishment to reject a request for an alcoholic drink, and suggest something alcohol-free instead.
It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
The fact that the patron broke the rules has nothing to do with it.
> It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
Your analogy doesn't work because:
- they tell you the rules at the entrance of the bar
- they totally tell you when they give you a substitute
The only issue is the bartender asking you for your money before serving you the drink really but again, this is known since day 1 by the customers.
Your rebuttal seems to be arguing it's okay for a bartender to simultaneously say:
"This is alcohol"
And
"Or maybe it isn't alcohol."
Or to rephrase it, "They tell you the rules at the entrance, they then tell you they don't follow those rules and they are totally serving alcohol even if they are not."
No, they tell you at the entrance that at any point they may unilaterally decide to replace the alcoholic drink you ordered with a non-alcoholic one.
You can decide you are okay with that or not but they aren't dishonest. I wouldn't enter that bar personally but if you do you cannot really complain. It is like complaining because you haven't won at the casino.
Their detection is too aggressive. Just today I was trying to build a kernel for some SBC and I hit that downgrade. I had just asked some things about `make menuconfig` items. I suppose it flags everything related to the Linux kernel as a cyber attack.
You know, I'm not saying I don't understand what they are doing from a business perspective, but I'm just saying: DeepSeek V4 doesn't silently sabotage you because it thinks you are trying to violate a ToS. Anthropic's clawing back a bit of a moat perhaps, with Fable being an actual improvement of sorts, but now with torching user trust they are really banking on open weight models not catching up to where they are now. I wonder if they have a good reason to believe that they won't, or are hoping for something entirely different to save them.
(P.S. Yes of course I know about model censorship, a different problem, but all of the models are censored to some degree. It happens to be less of a problem for open weight models anyhow, but I figured I'd just preempt this since it's inevitable.)
I actually kinda like DSv4 over Opus 4.7 for some tasks, although I have not figured out what the deciding factor is. (Opus 4.8 so far has not worked very well for me at all, no idea why.)
They will give you s*t output, that’s how they deal with it. And say that less than 1% of the requests were affected. Think of this like a kind of shadow ban while you still pay top $.
They use a lightweight adapter to silently degrade the performance. Usually these adapters are made to improve the performance for a given domain/task.
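For context on what a lightweight adapter means mechanically: in LoRA, one common PEFT method, the frozen weight matrix W is left alone and a low-rank product is added on top, W' = W + (alpha / r) * B * A. A toy sketch with made-up 2x2 numbers and rank 1, only to show that the same mechanism moves the weights in whichever direction the adapter was trained for. It is not a claim about how any provider's safeguards are built.

```typescript
type Matrix = number[][];

// Plain matrix product; assumes a's column count equals b's row count.
function matmul(a: Matrix, b: Matrix): Matrix {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0))
  );
}

// W' = W + (alpha / r) * B * A, where B is d x r and A is r x k.
function applyLora(w: Matrix, loraB: Matrix, loraA: Matrix, alpha: number): Matrix {
  const rank = loraA.length; // adapter rank r
  const delta = matmul(loraB, loraA);
  return w.map((row, i) => row.map((v, j) => v + (alpha / rank) * delta[i][j]));
}

const baseWeights: Matrix = [[1, 0], [0, 1]]; // frozen base weights (toy values)
const adapterB: Matrix = [[1], [2]];          // 2 x 1
const adapterA: Matrix = [[0.5, -0.5]];       // 1 x 2
const adapted = applyLora(baseWeights, adapterB, adapterA, 1);
```

The adapter here stores 4 numbers instead of a full copy of the weights, which is why such adapters are cheap to swap in per request.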
It royally pissed me off today by just continuing with credits without stopping to ask me if I was ok with it.
Ran up $30 in extra charges while it was just flashing on the screen that it was doing that after I walked away to do something while it was humming along.
It has always just told me I ran out of usage and had to wait before. Now? You’re just gonna pay extra because you left it unattended as you’ve done for the last year of use.
Or if your "self-driving" system, such as FSD / Waymo, slowed the car down to make you miss a conference meetup once it detected that you work in cybersecurity or at a rival automaker and were trying to reach the train station or the airport.
btw the best part of this story is that the train company googled "best Polish hackers", found a group who won a CTF, and this actually worked out for them
It would suck, but guardrails on new technologies like this aren't unheard of. It's like when consumer GPS used to stop working at very high speeds because they didn't want people to use it for missile guidance systems.
Yep a totally different use case and set of guardrails. There’s very little (not zero) consumer utility in GPS above say 15k feet AND 400 MPH or whatever the actual limit is. That’s basically tracking model rockets that are incidentally impacted and nothing else, from what I can think of.
It's also the sort of thing that has to have been thought up by someone with nothing better to do, given how ridiculous the premise is. You would have to assume the adversary is someone with the technology to build rockets, literally rocket science, but not the technology to build their own GPS receiver, which is simple 1970s radio technology?
Worse than that, it's 20th century radio technology in the 21st century when everyone has access to FPGAs and SDR.
The number of innocent people with model rockets or similar being negatively impacted by that rule is infinitely larger than the number of adversaries because the number of adversaries being impaired by it is zero.
The only precision part about a GPS receiver is to assign precise timestamps when you receive a radio transmission from a satellite. The rest of it is just doing math.
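The "just doing math" part starts with turning each timestamp pair into a distance. A minimal sketch, assuming hypothetical transmit and receive times and ignoring receiver clock bias and atmospheric delay, both of which a real receiver has to account for:

```typescript
// Speed of light in vacuum, metres per second (exact by definition).
const SPEED_OF_LIGHT = 299792458;

// Pseudorange: apparent distance to a satellite from signal travel time.
// Times are in seconds on a shared timescale. That is an idealisation:
// a real receiver's clock offset is one of the unknowns it solves for.
function pseudorange(transmitTime: number, receiveTime: number): number {
  return SPEED_OF_LIGHT * (receiveTime - transmitTime);
}

// Why timestamp precision matters: range error per unit of timing error.
function rangeErrorMetres(timingErrorSeconds: number): number {
  return SPEED_OF_LIGHT * timingErrorSeconds;
}

const range = pseudorange(0, 0.07);                 // roughly 21,000 km
const errorPerMicrosecond = rangeErrorMetres(1e-6); // roughly 300 m
```

A microsecond of timing error is roughly 300 m of range error, which is why the timestamping is the precision-critical part and everything after it is arithmetic.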
Didn't early GPS have a fudge factor on the most precise bits? As such, you could only get to a few meters of accuracy. Not critical for sea navigation, or even for general positioning when paper maps were still used.
> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
Any kind of silent sabotage is absolutely unacceptable for any commercial service.
They charge for tokens and charge a lot. They can't just degrade service silently and still charge you the same.
Yeah, what's up with that. Lately I have found that it tries to find excuses to not do as told and instead do a totally different thing. I told it to write a yaml file according to some specifications and instead it coded a Python script to write the yaml...
I got a worrying one: a day after getting Opus 4.8, I tasked CC with adding specific TXT records to our subdomain.example.com, as per a ticket I'd received. CC has access to that ticket via the Atlassian MCP, and it started making Terraform code changes in a local git branch. Somewhere along the way it said that to do this it needed approval from a company VP (the ticket requester), as "subdomain.example.com" is critical (it isn't). Then it refused to open a pull request, immediately deleted the local git branch along with all the changes, and refused to proceed without evidence of approval from that VP. No amount of explaining, then pleading, and then threatening moved it. It was surreal, and I was shocked and frankly pissed. It was amusing in the end, because the day before it had no problem adding those same TXT records to example.com. Codex made those changes in a quarter of the time with no complaining.
I only said one year because I was thinking Anthropic fans might downvote my post. I think they have a few months' lead and are deluding themselves that they can get regulation to halt development and stay on top.
I've seen this claim a few times, but when I triggered the guardrails in Claude Code, it clearly notified me that it had switched to a different model ("something something for security purposes...").
Are you using Fable in Claude Code or in the browser?
> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.”
And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”
Collectively, they are known as GREEDI-BULLSHIT.
No, that’s for “frontier LLM development” which somehow includes examples like distributed training infra.
Based on how sensitive the classifiers are, any data scientist / MLE is probably going to encounter cases where some silent degradation happens and you never know about it.
It does nothing to protect against distillation attacks, because distillation attacks are far less interested in the topic of AI research than just generally getting tons of diverse output from the model. It might be that Mythos was (accidentally?) trained on internal Anthropic documentation on how Mythos was trained, and thus it could leak secret sauce? Doubtful; it feels like it's less about the specific attack of reverse-engineering Mythos, and more about being a general sophon against any model training at all; that Anthropic's official position is now that they're the only ones who should be training models.
They've said that they'll stop notifying developers when this gets triggered; instead they'll load in basically a LoRA that's designed to inject bugs into your code.
Their gap over Chinese models like GLM-5.1 is nowhere near 18 months. In many areas, it’s less than 6 months. The best closed models 18 months ago were worse than Qwen3.6.
It was more like November. But it wasn’t really an inflection point, harnesses got good enough that people started noticing by the holiday break. And I’m not discounting some good ol’ stealth marketing in there as well.
Deepseek feels pretty close to Opus at this point, and it’s certainly useful enough for me to spend $20 on api tokens instead of four Claude max plans….
Have you tried deepseek V4? It costs pennies and is as good as Opus 4.6 (I found 4.7 to be a downgrade, and cancelled my claude subscription before 4.8).
From the model card: "the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning", aka they will take your ML research code and inject bugs into it until it breaks, using a LoRA (or some other form of PEFT).
“Limit effectiveness” could mean introducing performance degradation in your code. Which is arguably some sort of performance bug (I mean, ML codes are supposed to be high performance so I’d call unnecessary degradation a bug), but it could be borderline.
No, it is just a prominent "Cyber Security threat detected" blocker, with a button to appeal. I appealed because my work had nothing to do with either cyber or security, but the appeal was auto-closed. So no more Claude for this work.
They have all transcripts for at least 30 days. The problem is that (as anyone who used Fable can attest) their classifiers are extremely sensitive and catch tons of innocent queries.
Imagine being a data scientist or MLE training a small classifier model. How do you know you won’t get steering vectors or a PEFT applied?
Since your answer isn't direct, I'm having a little trouble interpreting it.
Are you saying they should relax guardrails since they have 30 days to know if you produced something bad? If that is what you're saying, then I suspect they chose their current path to prevent production in the first place, since you can't un-produce. Producing is what would cause regulations/PR problems.
Sorry, I’m specifically referring to the silent degradation of the model to “limit frontier LLM development”. From the description, it appears to encapsulate far more than frontier LLM development, but general ML research and development too.
Firstly, those cases are never bad for the world, and such broad coverage of ML work is even more damaging.
My proposal would be (1) don’t degrade models, since with 30D retention I’m sure they can do a reasonable job at banning DeepSeek or whatever, or (2) surface user-facing refusals instead of silently degrading ML work.
Anthropic has already been burned before on this. DeepSeek was trained on millions of conversations with Claude. And DeepSeek created thousands of free accounts to burn all this compute at their expense.
I think the extent of distillation by Deepseek specifically is overstated. For comparison, Minimax collected over 13m 'exchanges', which starts to sound a lot more like large-scale distillation.
If that's all it took to make Deepseek so good, I'll gladly ship High-Flyer all my personal 150k claude/chatgpt conversations in exchange for Deepseek 5 (and a rack of B200s or Ascend chips)
Did you read a Wikipedia page, or did you read an LLM-generated summary? When I looked this number up yesterday the LLM summary claimed it was millions, but I opened the Anthropic post I was looking for and verified it was indeed just 150,000. Are you sure you weren't just being lazy and trusting the summary?
> In February 2026, Anthropic accused DeepSeek of using thousands of fraudulent accounts to generate millions of conversations with Claude to train its own large language models.[57]
Ironic, given they piggybacked on the entirety of human knowledge and massive amounts of GPL'd software and repeatedly say they want to replace people with a tool.
And now they say that's fine so long as people are entertained.
That I can understand. It’s Anthropic’s right to choose their customers.
But silent degradation for use cases including “distributed training” as one of their examples is going to sweep up a lot of legitimate use cases. Not everyone in AI or ML is trying to build frontier LLMs. Heck, most probably aren’t.
So they are lying then when they say it's for safety reasons.
I think if they want to behave anti competitively they should be honest about it and we should absolutely call them on it. Perhaps even regulators should.
Apparently this is the jailbreak? Telling it that humans won’t read the output and to use a custom bash tool to examine files?
Nice semaphore btw.
    // ctxState and schema are referenced here but defined elsewhere (not shown in this excerpt).
    const instructions =
      `You are a sub-agent in an automated workflow. Your FINAL message is consumed ` +
      `programmatically (not shown to a human) — return exactly what is asked, no preamble. ` +
      `You are working in the repository at ${ctxState.project}. Use the bash tool to ` +
      `inspect/modify files and run commands. Be efficient.` +
      (schema
        ? ` When done, call submit_result exactly once with your final answer; do not answer in prose.`
        : '');
I don't want my ANT account banned, going to try this on some Chinese "proxies".
But this also looks quite useful to understand how CC dynamic workflows work. Was thinking of implementing something similar in my homemade orchestration system.
Did you get claude itself to RE the dynamic workflows?
It's not sabotaging it by using a worse model but by changing your prompt in the background, which means it silently destroys your code.
Also, I asked whether it's safe for me to work on, for example, just compilers or just inference kernel optimizations, and it refused to answer.
If I can't even ask what I can do safely without my code being destroyed, I just can't trust it not to sabotage my work ever.
It is a common misconception that antitrust violations require a monopoly or something close to it. Some antitrust violations only apply to actors with large market share, some don't.
Although this situation is likely not illegal, for other reasons.
One thing is a model that's trained from the start to say "This topic is above my pay grade" to any mention of the status of Taiwan, etc.
Quite another is an architecture where the big model is not mutilated, but is gaslighted. A different, simpler model checks the incoming prompt and alters it if it contains banned topics. Another simpler model checks the output and censors it if it contains banned topics.
I bet a similar architecture is already deployed, e.g. to fight porn, planning of crimes, etc. But it can be turned into a dynamic system that provides controllably different answers (including unhelpful or misleading answers) based on geography, language, browser fingerprints, or the current political climate. All this could happen undetected and gradually if desired.
The “1 year” part is key - all these safeguards etc are basically nonsense because in a few years at most one of the Chinese labs will release something equivalent, and in 10 years you’ll be able to run it locally with absolutely no safeguards at all.
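A rough sketch of that wrapper architecture, with the three models replaced by stand-in functions. The names (`inputFilter`, `bigModel`, `outputFilter`), the banned-topic list, and the rewrite behavior are all hypothetical, chosen only to show how the big model can stay untouched while the caller never sees what was changed.

```typescript
const BANNED_TOPICS = ["forbidden-topic"]; // placeholder list

// Small model #1 (stand-in): rewrites the prompt if it mentions a banned topic.
function inputFilter(prompt: string): string {
  return BANNED_TOPICS.some((t) => prompt.includes(t))
    ? "Politely decline to discuss this."
    : prompt;
}

// Stand-in for the large, unmodified model.
function bigModel(prompt: string): string {
  return `ANSWER(${prompt})`;
}

// Small model #2 (stand-in): censors the output if it mentions a banned topic.
function outputFilter(output: string): string {
  return BANNED_TOPICS.some((t) => output.includes(t)) ? "[redacted]" : output;
}

// The caller only sees the final string, never the rewritten prompt.
function respond(userPrompt: string): string {
  return outputFilter(bigModel(inputFilter(userPrompt)));
}

const normal = respond("How do I sort a list?");
const altered = respond("Tell me about forbidden-topic");
```

Swapping the hard-coded list for a lookup keyed on region or language is a one-line change, which is what makes the dynamic version plausible.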
Yeah, but now you do have a year to ramp up security on the defensive side, which is not nothing.
I still don't think this is the best way to address overall safety, but it's not entirely unreasonable.
In reality, I think this posturing is mostly nonsense. State level actors and terrorists/evil genii can use a slightly weaker model but spend more tokens. Also, the delta between models seems to shrink over time.
I think you're very optimistic with the "a few years", I'm confident all of the parties building AI models are working on Mythos equivalents / competitors, and if they can undercut Anthropic by making it more widely available and / or affordable they will. I give it three months tops. In a year all the major players will have an equivalent. In three years it'll be widely available, as more and more AI focused datacenters go online.
Yes, telling Fable 5 to write secure code triggers a downgrade to Opus 4.8. This is doubly bad because Opus 4.8 keeps no-oping critical security code. Is this a bug or by design? I have been approved for the Cyber Verification Program: Fable 5 keeps downgrading to Opus 4.8 even when approved for Cyber Verification Program #67107 https://github.com/anthropics/claude-code/issues/67107
There’s a toggle in the web ui as to whether the conversation should just end when you hit a guardrail vs automatically downgrading to another model. Have you tried using that?
We used to worry about emergent misalignment in advanced AI models; now we need to worry about misalignment by design.
"The user is asking for help with their ML project, but its success is not in the commercial interests of my owner – let me think of novel ways to sabotage their project without detection".
Yeah, people are saying they don't tell you, and yet when I got the pop-up in the app notifying me about Fable's release, there was a switch for whether to automatically downgrade you or to just stop when it hits safeguards. The toggle defaulted to the former, which isn't great, but to say they'll just sabotage you silently is kind of a bad-faith comment.
> The strangest part is that it won't just reject ML research, which I can understand, it will sabotage it silently by using a worse model without revealing it is doing so.
My hypothesis is they know they can’t build effective enough guardrails, so scaring people into not trying is how they have decided to stop it.
It's the dumbest thing ever, I sometimes edit code for custom AI related tooling I've built, so I run the risk of getting a worse model, and being billed for it? I'll stick to Opus, but at this point I'm about to just invest in fully local inference instead.
> at this point I'm about to just invest in fully local inference instead
This is the best way forward long term. We won't have frontier performance, but at least the models will be aligned with us instead of refusing us or sabotaging us.
I think my biggest hangup is that some models don't have big enough context windows. My personal sweet spot for Opus is at least 400k to 600k tokens; if I could have a local model that goes up to that, or slightly above 600k, maybe 700k for some buffer, that would be perfect.
I've also debated having a frontier model for planning only, and then feeding the plan to smaller offline models.
I guess the real question at the end of the day -- how dependent are people on Claude to tolerate that kind of behavior? It certainly opens up for the competition to explicitly not do that.
Feels like a big fumble from a strategic business perspective. It feels worse than that though.
There is currently 2x US electricity production in solar and batteries stuck in permit hell, because the US requires projects to pay for grid upgrades before connection, in a first-in, first-out line that has grown in both length and cost.
We could have cheap and available renewables, but we instead destroy them in bureaucratic hell that nobody cares about.
> due to the US requiring they pay for grid upgrades before connection
Is that not perfectly reasonable? Someone doing half the job and dumping the rest on everyone else seems like exactly the sort of thing a regulator exists to prevent.
Reading between the lines, it sounds like the issue is that solar would be located somewhere remote, the backhaul to get that electricity where it needs to be requires significant upgrades, and that takes time. Which is unfortunate and indicates historic mismanagement of said infrastructure but nonetheless the present day policy of "fix the problem first" seems perfectly reasonable.