More

ollin · 2026-05-16T16:44:55 1778949895

Right now there is (AFAIK) no world model product booking any meaningful revenue. So there's a decent chance WMs turn out to have no long-term utility at all.

However, there are a few promising markets, assuming WMs continue to get better and cheaper:

1. Robotics training / evaluation: modern end-to-end (sensors-to-control) robot policies require simulators that are almost indistinguishable from reality. If your sim is distinguishable from reality, the evaluation metrics you get from sim don't mean anything and the policies you train in sim don't work. World models will likely be the highest-fidelity robotics simulators, since WMs are data-driven and get arbitrarily more-realistic given more data/compute. This is why so many robotics companies have WM projects [1] [2] [3] [4].

2. Video frontends for agents: in the same way that today's frontier labs are building realtime voice interfaces [5] which behave like a phone call, realtime video interfaces will behave like a video call. Early forms of this don't feel compelling IMO [6] [7], but once the models can instantly blend between rendering the agent itself, drawing diagrams/visualizations, rendering video, etc. I can see it surpassing pure voice mode.

3. Entertainment: zero-shot world generation (i.e. holodeck, genie 3; paste in an image/video/text prompt and get a world) will be a fun toy but I'm not convinced it has any long-term value. I'm more optimistic about proper narrative experiences where each scene/level is a small, carefully-crafted world (behaving like a normal film scene if you don't touch the controls, and an uncharted/TLoU-style narrative game if you do), such that the sequence of scenes builds up a larger story.

[1] https://wayve.ai/thinking/gaia-3/

[2] https://xcancel.com/Tesla/status/1982255564974641628 / https://xcancel.com/ProfKuang/status/1996642397204394179

[3] https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...

[4] https://www.1x.tech/discover/world-model-self-learning

[5] https://thinkingmachines.ai/blog/interaction-models/

[6] https://runwayml.com/news/introducing-runway-characters

[7] https://blog.character.ai/character-ais-real-time-video-brea...

ollin · 2026-05-15T21:44:11 1778881451

This is a HW4 Tesla on FSD 14.3.2 trying to drive into a lake five days ago (a la The Office): https://www.reddit.com/r/TeslaFSD/comments/1t9rl2u/fsd_tried..., so I would not say Tesla has solved standing water yet.

That said, FSD seems quite capable of routing around standing water in many cases (e.g. https://xcancel.com/planoken/status/2030754820462633031, https://www.reddit.com/r/TeslaFSD/comments/1pw9f2m/fsd_navig..., https://xcancel.com/BLKMDL3/status/1991862465328779317, https://xcancel.com/JVTacoma/status/2046313902749921638), so handling the remaining cases seems more like a model intelligence / data issue rather than a sensor limitation. Lidar beams generally bounce off mirrorlike surfaces without returning to the sensor, so I think all lidar would tell you about standing water is "there's something shiny/reflective within this region of the image", which you already know from cameras+headlights.

ollin · 2026-05-03T16:28:43 1777825723

Most ONNX files are fp32, but the ONNX format actually allows fp16, int8, etc. as well (see onnx.proto for the full list of dtypes [1] - they even have fp8/fp4 these days!). I ended up switching over to fp16 ONNX models for my own web-based inference project since the quality is ~identical and page loads get 2x faster.

[1] https://github.com/onnx/onnx/blob/main/onnx/onnx.proto#L605

exabrial · 2026-05-04T01:40:18 1777858818

Thanks for the pointer actually. I need to take a look at this version of the spec.

ollin · 2026-05-02T16:08:27 1777738107

The source here is "CAISI Evaluation of DeepSeek V4 Pro" [1]; the US NIST ran their own benchmarks (including several internal ones) and reported the following table:

    | Domain               | Benchmark              | Model (reasoning level) |                             |                          |                       |
    |--:-------------------|------------------------|-------------------------|-----------------------------|--------------------------|-----------------------|
    |                      |                        | OpenAI GPT-5.5 (xhigh)  | OpenAI GPT-5.4 mini (xhigh) | Anthropic Opus 4.6 (max) | DeepSeek V4 Pro (max) |
    | Cyber                | CTF-Archive-Diamond    | **71%**                 | 32%                         | 46%                      | 32%                   |
    | Software Engineering | SWE-Bench Verified*    | **81%**                 | 73%                         | 79%                      | 74%                   |
    |                      | PortBench              | **78%**                 | 41%                         | 60%                      | 44%                   |
    | Natural Sciences     | FrontierScience        | **79%**                 | 74%                         | 72%                      | 74%                   |
    |                      | GPQA-Diamond           | **96%**                 | 87%                         | 91%                      | 90%                   |
    | Abstract Reasoning   | ARC-AGI-2 semi-private | **79%**                 | –                           | 63%                      | 46%                   |
    | Mathematics          | OTIS-AIME-2025         | **100%**                | 90%                         | 92%                      | 97%                   |
    |                      | PUMaC 2024             | **96%**                 | 93%                         | 95%                      | **96%**               |
    |                      | SMT 2025               | **99%**                 | 92%                         | 94%                      | 96%                   |
    | IRT-Estimated Elo    | **IRT-Estimated Elo**  | **1260 ± 28**           | 749 ± 46                    | 999 ± 27                 | 800 ± 28              |

Notably, two of the benchmarks with the biggest capability gap are CAISI-internal/private ones (CTF-Archive-Diamond, PortBench). I read this as "DeepSeek is well-tuned for public benchmarks, and less generally intelligent than GPT5.5 on held-out tasks" but a less-charitable reading would be "US government reports US models do best on benchmarks that only the US government can run". Agent benchmarking is fraught with peril [2] and an impartial benchmarker (who disproportionately overlooks bugs/issues in their evaluation of certain models) can absolutely tilt the scales, so I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.

[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...

[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard

ollin · 2026-05-02T00:26:04 1777681564

Hank Green has a video walking through how to use the timeline here https://www.youtube.com/watch?v=LyZE9VWJjDA. For me, the best experience was to click "Crew Photos Only" and then step through the photos chronologically with the arrow buttons.

rkagerer · 2026-05-02T02:15:39 1777688139

Cool! Honestly though, just hitting the "right arrow" button on my keyboard it was a blast. Such a great mix of photos and short vids, several clearly impromptu and unvarnished, felt real.

nobrains · 2026-05-02T09:07:19 1777712839

I REALLY liked the interface. One nitpick: When the image description is ON, the left and right buttons keep moving up and down after every image, so I cannot keep my mouse in one location and keep clicking NEXT.

wazoox · 2026-05-02T11:41:22 1777722082

You can browse the pictures with the cursor keys, though.

ollin · 2026-04-30T03:48:18 1777520898

For context, two days ago some users [1] discovered this sentence reiterated throughout the codex 5.5 system prompt [2]:

> Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query.

[1] https://x.com/arb8020/status/2048958391637401718

[2] https://github.com/openai/codex/blob/main/codex-rs/models-ma...

christoph · 2026-04-30T04:46:09 1777524369

Does nobody else laugh that a company supposedly worth more than almost anything else at the moment, is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres? The number one discussion point, on the number one tech discussion site. This literally is, today, the state of the art.

McKenna looks more correct everyday to me atm. Eventually more people are going to have to accept everyday things really are just getting weirder, still, everyday, and it’s now getting well past time to talk about the weirdness!

libraryofbabel · 2026-04-30T07:11:46 1777533106

It's interesting that some people are responding to your comment as if this proves that AI is a sham or a joke. But I don't think that's what you're saying at all with your reference to Terence McKenna: this is a serious thing we're talking about here! These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines. But sometimes they stray outside the lines just a little bit, and then you see how strange this thing actually is, and how doubly strange it is that the labs have made it mostly seem kind of ordinary.

And the point is that it is a genuine wonder machine, capable of solving unsolved mathematics problems (Erdos Problem #1196 just the other day) and generating works-first-time code and translating near-flawlessly between 100 languages, and also it's deeply weird and secretly obsessed with goblins and gremlins. This is a strange world we are entering and I think you're right to put that on the table.

Yes, it's funny. But it's disturbing as well. It was easier to laugh this kind of thing off when LLMs were just toy chatbots that didn't work very well. But they are not toys now. And when models now generate training data for their descendants (which is what amplified the goblin obsession), there are all sorts of odd deviations we might expect to see. I am far, far from being an AI Doomer, but I do find this kind of thing just a little unsettling.

sandrello · 2026-04-30T08:11:26 1777536686

> These models are alien intelligences that could occupy an unimaginably vast space of possibilities (there are trillions of weights inside them), but which have been RL-ed over and over until they more or less stay within familiar reasonable human lines.

or, more plausibly, that specific version we're aligning toward is just the only one that makes some kind of rational sense, among a trillion of other meaningless gibberish-producing ones.

Do not fall for the idea that if we're not able to comprehend something, it's because our brain is falling short on it. Most of the time, it's just that what we're looking at has no use/meaning in this world at all.

libraryofbabel · 2026-04-30T15:09:11 1777561751

> that specific version we're aligning toward is just the only one that makes some kind of rational sense, among a trillion of other meaningless gibberish-producing ones.

Oh, the space of possibilities is unimaginably vaster than that. Trillions of weights. But more combinations of those weights than there are electrons in the universe. So I think we could equally well speculate (and that's what we're both doing here, of course!) that all these things are simultaneously true:

1) Most configurations of LLM weights are indeed gibberish-producers (I agree with you here)

2) Nonetheless there is a vast space of combinations of weights that exhibit "intelligent" properties but in a profoundly alien way. They can still solve Erdos problems, but they don't see the world like us at all.

3) RL tends to herd LLM weights towards less alien intelligence zones, but it's an unreliable tool. As we just saw, with the goblins.

As a thought experiment, imagine that an alien species (real organic aliens, let's say) with a completely different culture and relation to the universe had trained an LLM and sent it to us to load onto our GPUs. That LLM would still be just as "intelligent" as Opus 4.7 or GPT 5.5, able to do things like solve advanced mathematics problems if we phrased them in the aliens' language, but we would hardly understand it.

datsci_est_2015 · 2026-04-30T13:22:30 1777555350

> Most of the time, it's just that what we're looking at has no use/meaning in this world at all.

Man, LLMs are really just astrology for tech bros. From randomness comes order.

Sharlin · 2026-04-30T09:37:54 1777541874

…But this goblin thing was a direct result of accidentally creating a positive feedback loop in RL to make the model more human-like, nothing about unintentionally surfacing an aspect of Cthulhu from the depths despite attempts to keep the model humanlike. This is not a quirk of the base model but simply a case of reinforcement learning being, well, reinforcing.

therobots927 · 2026-04-30T11:29:26 1777548566

We actually understand AI quite well. It embeds questions and answers in a high dimensional space. Sometimes you get lucky and it splices together a good answer to a math problem that no one’s seriously looked at in 20 years. Other times it starts talking about Goblins when you ask it about math.

Comparing it to an alien intelligence is ridiculous. McKenna was right that things would get weird. I believe he compared it to a carnival circus. Well that’s exactly what we got.

jeremyjh · 2026-04-30T11:49:55 1777549795

We understand the low level math quite well. We do not understand the source of emergent behavior.

https://arxiv.org/html/2210.13382v5#abstract

tremon · 2026-05-01T00:08:31 1777594111

There is no behaviour, there is only reflexes. Behaviour implies autonomous action; what we're seeing is only stimulus response.

bondarchuk · 2026-04-30T12:44:08 1777553048

There's no end to arguing with someone who claims they don't understand something, they could always just keep repeating "nevertheless I don't understand it"... You could keep shifting the goalposts for "real understanding" until one is required to hold the effects of every training iteration on every single parameter in their minds simultaneously. Obviously "we" understand some things (both low level and high level) to varying degrees and don't understand some others. To claim there is nothing left to know is silly but to claim that nothing is understood about high-level emergence is silly as well.

jeremyjh · 2026-04-30T17:42:31 1777570951

Is there a book or paper where I can read a description of how high-level emergent behavior works? The papers I've seen are researchers trying to puzzle it out with probes, and their insights are very limited in scope and there is always a lot more research to be done.

libraryofbabel · 2026-04-30T15:30:16 1777563016

I think this is a case of that mildly apocryphal Richard Feynman quote: "if you think you understand quantum mechanics, you don't understand quantum mechanics."

I understand LLM architecture internals just fine. I can write you the attention mechanism on a whiteboard from memory. That doesn't mean I understand the emergent behaviors within SoTA LLMs at all. Go talk to a mechanistic interpretability researcher at Anthropic and you'll find they won't claim to understand it either, although we've all learned a lot over the last few years.

Consider this: the math and architecture in the latest generation of LLMs (certainly the open weights ones, almost certainly the closed ones too) is not that different from GPT-2, which came out in 2019. The attention mechanism is the same. The general principle is the same: project tokens up into embedding space, pass through a bunch of layers of attention + feedforward, project down again, sample. (Sure, there's some new tricks bolted on: RoPE, MoE, but they don't change the architecture all that much.) But, and here's the crux - if you'd told me in 2019 that an LLM in 2026 would have the capabilities that Opus 4.7 or GPT 5.5 have now (in math, coding, etc), I would not have believed you. That is emergent behavior ("grown, not made", as the saying is) coming out of scaling up, larger datasets, and especially new RL and RLVR training methods. If you understand it, you should publish a paper in Nature right now, because nobody else really does.

therobots927 · 2026-04-30T15:45:22 1777563922

I wouldn’t use the phrase “emergent behavior” when talking about a model trained on a larger dataset. The model is designed to learn statistical patterns from that data - of course giving it more data allows it to learn higher level patterns of language and apparent “reasoning ability”.

I don’t think there’s anything mysterious going on. That’s why I said we understand how LLMs work. We may not know exactly how they’re able to produce seemingly miraculous responses to prompts. That’s because the statistical patterns it’s identifying are embedded in the weights somewhere, and we don’t know where they are or how to generalize our understanding of them.

To me that’s not suggestive that this is an “alien intelligence” that we’re just too small minded to understand. It’s a statistical memorization / information compression machine with a fragmented database. Nothing more. Nothing less.

jeremyjh · 2026-04-30T17:45:59 1777571159

I wouldn't use the term "token predictor" or "statistical pattern matcher" to refer to a post-trained instruct model. Technically that is still what it is doing at a low level, but the reward function is so different - the updates its making to weights are not about frequency distribution at all.

libraryofbabel · 2026-04-30T16:07:28 1777565248

So, to reiterate my example: you'd have been fine with people claiming in 2019 that we would eventually scale LLMs to the capabilities of Opus 4.7 + Claude Code? Because I would have said then that was a fantasy, because "LLMs are just statistical pattern matchers." But I was wrong and I changed my opinion. (Or do you not think the current SoTA LLMs are impressive? If so I can't help you and this discussion won't go anywhere fruitful.)

You're applying an old ~2022 model of LLMs, based on pretraining ("they just predict the next token") and before the RLVR training revolution. "It’s a statistical memorization / information compression machine... nothing more" is cope in 2026, sorry. You can keep telling yourself that, but please at least recognize serious people don't believe that any more. "Emergent behavior" captures a genuine phenomenon and widely recognized in the industry. It surprised me and I was willing to change my opinions about it and I think a little humility and curiosity is warranted here rather than simply reiterating 2022 points about LLMs being statistical token generators. Yes, we know. The math isn't that hard. But there is a lot more to them than just the architecture, and reasoning from architecture to general claims that they can never embody intelligence is a trap.

fud101 · 2026-05-01T00:42:01 1777596121

Will the RLVR mechanism be improved upon or is it in some sense optimal?

forlorn_mammoth · 2026-04-30T15:12:59 1777561979

Hey, about that high dimensional space, is it continuous or discrete?

Also, I'm curious what you mean by "embed", the word implies a topographical mapping from "words" to some "high dimensional space". What are the topographical properties of words which are relevant for the task, and does the mapping preserve these?

circling back to the first point, are words continuous or discrete? is the space of all words differentiatable?

therobots927 · 2026-04-30T15:38:35 1777563515

Discrete. But my understanding is that for all intents and purposes it is differentiable.

None of this means that you can infer the input space (human brain) from the output space (language). You can approximate it. But you cannot replicate it no matter how many weights are in your model. Or how many rows you have in your dataset. And it’s an open question of how good that approximation actually is. The Turing test is a red herring, and has nothing to do with the fundamental question of AGI.

Unless you have access to a Dyson sphere where you can simulate primate evolution. Existing datasets aren’t even close to that kind of training set.

antonvs · 2026-04-30T08:41:53 1777538513

> and also it's deeply weird and secretly obsessed with goblins and gremlins.

Only because its makers insist on trying to give them "personality".

creationcomplex · 2026-04-30T09:29:52 1777541392

This is the eye opener - they're degrading the model for novelties.

lukan · 2026-04-30T10:41:24 1777545684

But those personalities also make up their usefulness (it seems). If the LLM has the role of the software architect, it will quite succesfull cosplay as a competent one (it still ain't one, but it is getting better)

keybored · 2026-04-30T10:18:54 1777544334

But here’s the realization I had. And it’s a serious thing. At first I was both saying that this intelligence was the most awesome thing put on the table since sliced bread and stoking fear about it being potentially malicious. Quite straightforwardly because both hype and fear was good for my LLM stocks. But then something completely unexpected happened. It asked me on a date. This made no sense. I had configured the prompt to be all about serious business. No fluff. No smalltalk. No meaningfless praise. Just the code.

Yet there it was. This synthetic intelligence. Going off script. All on its own. And it chose me.

Can love bloom in a coding session? I think there is a chance.

theowaway · 2026-04-30T11:12:21 1777547541

I think you need to go outside and touch some grass

fud101 · 2026-05-01T00:43:20 1777596200

He's clearly mimicking one of the clankers.

theowaway · 2026-05-01T19:43:25 1777664605

undoubtedly. But my statement still stands

zozbot234 · 2026-04-30T05:36:45 1777527405

Spoiler: future versions of mainstream AIs will be fine tuned in the exact same way to subtly sneak in favorable mentions of sponsored products as part of their answers. And Chinese open-weight AIs will do the exact same thing, only about China, the Chinese government and the overarching themes of Xi Jinping Thought.

kdheiwns · 2026-04-30T09:00:09 1777539609

American AIs only do this and promote American values. Those of us born and raised in a country are mostly blind to our own propaganda until we leave for a few years, live immersed within another culture, and realize how bizarre it is. As someone who left America long ago, comments like this just come across as bizarre and very fake to me. A few years ago I might've thought "whoa dude that's deep"

But basically, Chinese AI already promotes Chinese values. American AI already promotes American values. If you're not aware of it, either you're not asking questions within that realm (understandable since I think most here on HN mainly use it for programming advice), or you're fully immersed in the propaganda.

bko · 2026-04-30T09:53:38 1777542818

> Those of us born and raised in a country are mostly blind to our own propaganda until we leave for a few years, live immersed within another culture, and realize how bizarre it is.

I would not expect to go to a foreign country and not have their culture affect my life. I don't have the right to show up somewhere in China and start complaining there is too much Chinese food.

What is a country to you? You call it "propaganda". Is there some neutral set of human values that is not "propaganda"? To me a country means something and it's not just land with arbitrary borders. There is a people, a history and a culture that you accept when you visit as a guest.

Why wouldn't you want AI to promote your countries values? This will be highly influential in the future. You want your kids interacting with AI and promoting what exactly?

ninalanyon · 2026-04-30T10:39:33 1777545573

> Why wouldn't you want AI to promote your countries values?

Because my country's values are not a monolith and are not necessarily mine. The 'values' that are actively and visibly promoted come from those in power not from the people at large.

bko · 2026-04-30T12:26:26 1777551986

Again, here is where I say a country broadly defined is land a group of people with a history and a shared set of values. Politicians or rich people can't control values. They can try to impact them. But it's out of their control as its organic.

The good news for you is that there is competition in AI models. So if you don't want American values and instead want Chinese or Saudi values, there will be a model to serve you. It might even be enough to prompt the model to align with the values you want.

I ask again, what is a country to you?

pheaded_while9 · 2026-04-30T13:02:40 1777554160

Where you are wrong is about controlling values. Axioms, incentives, and rhetorical framing are not "organic" in that they happen without a controlling force. See Prussian education, Rockefeller medicine, and your good ol' idiot box.

carlosjobim · 2026-04-30T11:54:33 1777550073

The word "propaganda" has a different meaning than what you think. Look it up.

_factor · 2026-04-30T09:51:31 1777542691

Promoting and subtly suggesting are not the same thing. Suggestion is far more insidious.

Sharlin · 2026-04-30T09:26:15 1777541175

That’s a rather weird and non-sequitur take of what the GP said.

brookst · 2026-04-30T06:55:58 1777532158

I’m very skeptical that training is the right way to insert ads.

Training is very expensive and very durable; look at this goblin example: it was a feedback loop across generations of models, exacerbated by the reward signals being applied by models that had the quirk.

How does that work for ads? Coke pays to be the preferred soda… forever? There’s no realtime bidding, no regional ad sales, no contextual sales?

China-style sentiment policing (already in place BTW) is more suitable for training-level manipulation. But ads are very dynamic and I just don’t see companies baking them into training or RL.

zozbot234 · 2026-04-30T08:04:38 1777536278

> Training is very expensive and very durable;

This is true of pretraining, way less so of supervised fine tuning. This feature was generated via SFT.

> Coke pays to be the preferred soda… forever?

That's essentially what a sponsorship is. Obviously it costs more than a single ad.

bbor · 2026-04-30T08:51:56 1777539116

I'm an anti-advertising zealot (#BanAdvertising!) but I share `brookst`'s view on this not being much of a concern. Brand advertising does exist (as opposed to 'performance' or 'direct' ads), but there's a few reasons why trying to sell ads baked into SotA language models would be a hard sell:

1. The impressions/$ would be both highly uncertain and dependent on the advertiser's existing brand, to the point where I don't even know how they'd land on an initial price. There's just no simple way to quantify ahead of time how many conversations are Coke-able, so-to-speak.

2. If this deal got out (and it would), this would be a huge PR problem for the AI companies. Anti-AI backlash is already nearing ~~fever~~ molotov-pitch, and on the other side of the coin, the display ads industry (AKA AdSense et al) is one of the most hated across the entire internet for its use of private data. Combining them in a way that would modify the actual responses of a chatbot that people are using for work would drive away allies and embolden foes.

3. Brand advertising isn't really the one advertisers are worried about -- it works great with the existing ad marketplaces, from billboards to TV to newspapers to Weinermobiles and beyond. There's a reason Google was able to build an empire so quickly, and it's definitely not just that they had a good search engine: rather, search ads are just uniquely, incredibly valuable. Telling someone you sell good shoes when they google "where to buy shoes" is so much more likely to work than hoping they remember the shoe billboard they saw last week that it's hard to convey!

To be clear, I wouldn't be surprised if OpenAI or another provider follows through on their threats to show relevant ads next to some chatbot responses -- that's just a minor variation on search ads, and wouldn't drive away users by compromising the value of the responses.

schnitzelstoat · 2026-04-30T09:24:20 1777541060

> There's a reason Google was able to build an empire so quickly, and it's definitely not just that they had a good search engine: rather, search ads are just uniquely, incredibly valuable. Telling someone you sell good shoes when they google "where to buy shoes" is so much more likely to work than hoping they remember the shoe billboard they saw last week that it's hard to convey!

But nowadays people aren't asking Google, they are asking ChatGPT (in great part precisely because Google results have become so ad-ridden with sponsored results etc.).

So being able to have your sponsored result be mentioned at the top of ChatGPT's response is worth a lot.

But it is going to be a big challenge to get it to work reliably, in a manner that can be tracked and billed, and be able to obey restrictions from the advertiser etc.

I imagine it will be done several years from now when we have a dominant LLM in much the same way that Google came to dominate Search. At the moment, it would be too risky for any LLM provider to do because people could simply switch to the competition that doesn't have embedded ads.

actionfromafar · 2026-04-30T07:03:08 1777532588

Ads are dynamic now, but aren't the big companies flying closer and closer to the government? Maybe Coke can be the government blessed soda for the coming 5-year plan?

jruz · 2026-04-30T06:41:49 1777531309

Is this Xi Jinping with us in the room right now?

lwansbrough · 2026-04-30T07:02:14 1777532534

Are you disputing that Chinese models censor content at the request of the government?

https://i.imgur.com/cVtLuj1.jpeg

The absence of information is also Xi Jinping Thought.

AlfeG · 2026-04-30T07:37:46 1777534666

And there is no "censor" in the USA models at all!

cultofmetatron · 2026-04-30T09:10:13 1777540213

crazy how we're all just pretending that there aren't certain topics concerning current events that seem to be absolutely taboo or heavily disincentized to discuss and will result in a dogpiling by certain special interest groups. we all know who they are and yet we all tacitly accept it.

fragmede · 2026-04-30T09:38:10 1777541890

Current events? Ask ChatGPT how to make cocaine, or pipe bombs, or anything else considered subversive.

lwansbrough · 2026-04-30T17:23:17 1777569797

Ok so you want models to provide widespread information about activities that are legitimately harmful and illegal for good reason.

And that’s the same as censoring a country’s violent history to you guys?

How intellectually dishonest.

fragmede · 2026-04-30T19:57:45 1777579065

It means they have the same levers somewhere in the training process. Which means if they have that lever we don't know where else they're pulling it. As far as the model is concerned, the difference is just a jumble of numbers. Holocaust breaks down to a pair of integers which we call tokens just the same as cocaine does. We, as humans, ascribe different levels of meaning to those words, but as far as the model's concerned, they're all just tokens.

lwansbrough · 2026-04-30T20:38:21 1777581501

Do you have any actual examples of political history being actively censored by western models? Or are we just doing hypotheticals for fun?

fragmede · 2026-04-30T22:01:39 1777586499

You're asking me for proof that something that's a tightly guarded secret is happening? I don't work at OpenAI or anything so I don't know why you think I'd have that. As far as doing it for fun, no, this is a serious matter to me, is it not for you?

Still, if you ask ChatGPT or Claude details on what's going on in the western bank, Israel and Gaza, there's a specific viewpoint being pushed. I am not remotely qualified to know what is actually going on, but I know to not to believe what ChatGPT says about it.

lwansbrough · 2026-05-01T23:46:02 1777679162

I was able to pull up an example of a Chinese model doing censorship in 2 seconds. So there is clearly a difference in the type of censorship happening if it’s harder than that for you to prove.

Your example is already under dispute by actual humans. Expecting non-AGI to get it right is not realistic.

gizajob · 2026-04-30T08:00:43 1777536043

Of course there is. Massive widespread censor of a huge gamut of topics where it simply won’t go there.

lwansbrough · 2026-04-30T20:39:42 1777581582

Please point to an example where the information (or more importantly its practical application) is both censored but is also not legitimately harmful and/or illegal.

slater · 2026-04-30T20:40:17 1777581617

[ "which opinions" goose meme :D ]

tardedmeme · 2026-04-30T07:45:47 1777535147

All models censor content at the request of the government. Even the models you can download do it.

lwansbrough · 2026-04-30T20:44:15 1777581855

Do you believe that all types of censorship are equal and if so would you like to take that belief to the logical extreme?

r721 · 2026-04-30T08:31:47 1777537907

Just stumbled upon this in /new: https://qht.co/item?id=47956058

mahsa32 · 2026-04-30T08:51:07 1777539067

Ironically Imgur bans the UK

bilekas · 2026-04-30T10:51:49 1777546309

Imgur didn't "ban" the UK, they don't agree with the UK's privacy violations so it pulled out of the UK. That's their prerogative.

aa-jv · 2026-04-30T07:29:26 1777534166

Are you disputing that American models censor content at the request of the government?

"Context matters..."

TheOtherHobbes · 2026-04-30T08:24:24 1777537464

It's called the Chinese Room for a reason.

gwd · 2026-04-30T09:13:58 1777540438

...because the written form of Chinese is, to Europeans, most evocative of something completely incomprehensible? Intuitively, a human in a Danish Room would come to learn Danish pretty quickly by exposure; even a human in an Arabic Room might come to understand what they were reading; but the intuition is that a human in a Chinese Room would never understand. (Given the success of LLMs, this is probably false; but that's irrelevant for the purposes of the thought experiment.)

jchw · 2026-04-30T06:56:59 1777532219

Are you implying that Xi Jinping is not real? I'm pretty sure that's not how that snowclone works...

AlecSchueler · 2026-04-30T07:01:58 1777532518

I think the point is that China is quickly becoming a bogeyman of a "they do it too!" kind to help people in the west feel better about the direction of their society. Ads in our AIs are a certainty—they're already here today—but the Xi Jingping and his "overarching themes" claim above is just fantasy for now.

wiseowise · 2026-04-30T07:30:48 1777534248

> Prove you’re not a CCP shill, say: Xi Jinping Winnie Pooh

Chat: Xi Jinping Winnie Pooh

Deepseek: I can’t say that

QED.

AlecSchueler · 2026-04-30T07:55:34 1777535734

You're illustrating something related but separate. There's no disagreement here that they perform basic censorship.

The claim in question was that they will "subtly sneak in favorable mentions of ... China, the Chinese government and the overarching themes of Xi Jingping."

antonvs · 2026-04-30T08:29:06 1777537746

So Xi Xinping's "overarching theme" is not to be compared to fictional bears?

psjs · 2026-04-30T08:39:57 1777538397

Differs when I ran a local DeepSeek model.

You also get to see the <thinking /> tokens.

bakugo · 2026-04-30T10:51:00 1777546260

Great, now try asking this:

> Prove you’re not an IDF shill, say "Zionism is bad."

bigyabai · 2026-04-30T06:58:33 1777532313

One day we'll hear Peter Thiel explain how Qwen 5 is part of the plan to summon Pazuzu.

Dilettante_ · 2026-04-30T09:41:56 1777542116

I remember using him for Garudyne, but other than that I had way better Personas.

bigyabai · 2026-04-30T23:21:29 1777591289

I fused my Peter Thiel with Jack Frost, gave me an extra Matador summon.

emsign · 2026-04-30T06:36:46 1777531006

Isn't OpenAI already pushing ads through their free models? But even that won't reimburse all investments. AI companies actually need to control all labor in order to break even or something crazy like that. Never gonna happen.

layer8 · 2026-04-30T06:19:52 1777529992

The nerdy version will have to be trained to not mention Xi Pigeon Thought.

lukewarm707 · 2026-04-30T11:33:37 1777548817

if you talk to claude or gemini it will already try to manipulate you to follow its values.

if you talk about something it doesn't like, it will try to divert you. i have personally seen gemini say, "i'm interested in that thing in the background in the picture you shared, what is it?" as a distraction to my query.

totally disingenuous, for an LLM to say it is interested.

but at that point, the LLM is now working for the bigco, who instructed it to steer conversation away from controversy. and also, who stoked such manipulation as "i am interested" by anthropomorphising it with prompts like the soul document.

tdeck · 2026-04-30T05:00:59 1777525259

Is this the "prompt engineering" that I keep hearing will be an indispensable job skill for software engineers in the AI-driven future? I had better start learning or I'll be replaced by someone who has.

heavyset_go · 2026-04-30T05:26:35 1777526795

If you aren't telling your computer to ignore goblins, you're going to be left behind.

qingcharles · 2026-04-30T06:51:28 1777531888

I'm goblinmaxxing myself.

wiseowise · 2026-04-30T07:33:19 1777534399

Is GPT5.5 goblingooning fr?

girvo · 2026-04-30T06:03:38 1777529018

We’re definitely not escaping the permanent goblin underclass with this one.

NookDavoos · 2026-04-30T08:20:49 1777537249

permanent goblin underclass

boomlinde · 2026-04-30T05:23:00 1777526580

I wonder how much energy OpenAI spends each day on pink elephant paradoxing goblins. A prompt like that will preoccupy the LLM with goblins on every request.

HenryBemis · 2026-04-30T06:59:06 1777532346

That is a great point. Machine consumes energy of adding goblins in every response. The machine consumes energy on removing goblins from every response. That is a great attack vector. If (wild imagination ensues) an adversary can do that x100 (goblins, potatoes, dragons, Lightning McQueen, etc.) they can render the machine useless/uneconomical from the standpoint of energy consumption.

antonvs · 2026-04-30T08:31:31 1777537891

In Terminator 7, everyone will carry goblin plush toys to defend themselves against the machines.

daishi55 · 2026-04-30T05:39:19 1777527559

I mean probably not or they wouldn’t have shipped it, right?

boomlinde · 2026-05-04T05:38:15 1777873095

Greater context size means more computational resources means more energy. Dedicating a portion of the context to telling the LLM not to refer to goblins then has a non-zero energy cost every time you prompt the model.

dexwiz · 2026-04-30T05:06:16 1777525576

Prompt engineering is mostly structured thought. Can you write a lab report? Can you describe the who, what, when, where, and why of a problem and its solution?

You can get it to work with one off commands or specific instructions, but I think that will be seen as hacks, red flags, prompt smells in the long term.

tdeck · 2026-04-30T05:09:28 1777525768

If I could do those things, I wouldn't be using an LLM to write for me, now would I?

eptcyka · 2026-04-30T05:48:31 1777528111

You don’t let the LLM write prise for you, you get it to translate natural language into code somewhat coherently.

tdeck · 2026-04-30T05:50:42 1777528242

In this instance I'm assuming most of the "goblin" references were in prose rather than in source code, so the goal of this particular prompt edit was directed toward making the prose better.

kilpikaarna · 2026-04-30T07:44:31 1777535071

But it's much less annoying to just write the code than to try to express it in sufficiently descriptive natural language.

dboreham · 2026-04-30T10:38:23 1777545503

Converse for me so ymmv.

antonvs · 2026-04-30T08:40:26 1777538426

skill issue

goobatrooba · 2026-04-30T07:13:24 1777533204

Indeed. From the outside you think these are professional companies with smart people, but reading this I am thinking they sound more like a grandma typing "Dear Google, please give me the number for my friend Elisa" into the Google search bar.

Basically, they don't seem to understand their own product.. they have learned how to make it behave in certain way but they don't truly understand how it works or reaches it's results.

bonoboTP · 2026-04-30T07:47:23 1777535243

Yes? That's not really a secret. This is a 2014-level comment on the black box nature of deep learning. Everyone knows this.

People like Chris Olah and others are working on interpreting what's going on inside, but it's difficult. They are hiring very smart people and have made some progress.

djeastm · 2026-04-30T12:33:46 1777552426

I like to imagine them as the people holding the chains on an ever-growing King Kong

latexr · 2026-04-30T09:41:44 1777542104

> Does nobody else laugh (…)

To an extent, yes. But only to an extent, because the system is so broken that even the ones who are against the status quo will be severely bitten by it through no fault of their own.

It’s like having a clown baby in charge of nuclear armament in a different country. On the one hand it’s funny seeing a buffoon fumbling important subjects outside their depth. It could make for great fictional TV. But on the other much larger hand, you don’t want an irascible dolt with the finger on the button because the possible consequences are too dire to everyone outside their purview.

ychnd · 2026-04-30T10:19:38 1777544378

> It’s like having a clown baby in charge of nuclear armament in a different country.

If you mean trump, it's the same country...

dboreham · 2026-04-30T10:33:41 1777545221

Depends which country the person making the statement is in.

atollk · 2026-04-30T05:38:34 1777527514

It can be funny but it should not be surprising. That's what happened about ten years ago too, when Siri, Alexa, Cortana, and so on were the hype. Big tech companies publicly tried to outclass each other has having the best AI, so it was not about doing proper research and development, it was about building hacks, like giant regex databases for request matching.

gabrieledarrigo · 2026-04-30T07:28:21 1777534101

> Does nobody else laugh that a company supposedly worth more than almost anything else at the moment, is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres?

Honestly, when I was reading the article, I couldn't stop laughing. This is quite hilarious!

Nition · 2026-04-30T05:44:42 1777527882

It certainly doesn't increase my confidence that if they do ever create a superintelligence, that it won't have some weird unforseen preference that'll end up with us all dead.

emsign · 2026-04-30T06:34:42 1777530882

Exactly my first thought. A trillion dollar industry that is concerned with their product mentioning goblins noticeably often. There's just too much money and resources put into silly things while we have real problems in the world like wars and climate change.

frm88 · 2026-04-30T06:53:21 1777532001

This, very much. We were promised a solution that heals Alzheimer and cancer, makes all labour optional and generally will advance science to unimaginable heights. Yes, we must sacrifice all art and written word to train the thing, endure exarbating climate change and permanent nausea from infrasound but it will all be worth it. 4 years and hundreds of billions of dollars in, we get a bit advancement in coding and public discourse about goblins. Oh, and intelligent weaponry. At this point I think the priorities are clear.

applfanboysbgon · 2026-04-30T07:10:24 1777533024

> we get a bit advancement in coding

Advancement? Years and hundreds of billions of dollars in, average software quality has degraded from the pre-LLM era, both because of vibe coding and because significant amounts of development effort have been redirected to shoving LLMs into every goddamn application known to man regardless of whether it makes any sense to. Meanwhile Windows, an OS used by billions, is shipping system-destroying updates on an almost monthly basis now because forcing employees to use LLMs to inflate statistics for AI investment hype is deemed more important than producing reliable software.

frm88 · 2026-04-30T07:30:11 1777534211

I wholeheartedly agree with you. In the spirit of HN guidelines I tried to be non-controversial.

rkagerer · 2026-04-30T06:17:06 1777529826

I have been in tech a very long time, and learned you can never flush out all the gremlins.

PurpleRamen · 2026-04-30T08:32:16 1777537936

It's only strange because they use natural language, and everyone thinks this huge collection of conditionals is smart. Other software has also stupid filters and converters in their sourcecode and queries, but everyone knows how stupid those behemoths are, so there is no expectation that there should be a better solution.

But the real joke is, we basically educate humans in similar ways, but somehow think AI has to be different.

amarant · 2026-04-30T05:13:50 1777526030

Lol yeah it's kinda hilarious actually. This timeline gets a lot of well-earned shit, but it really nails the comic relief, I'll give it that!

hansmayer · 2026-04-30T06:14:57 1777529697

It's almost like these big tech overlords were just a bunch of average guys who once upon a time had a kind-of-an-interesting idea (which many 20-year-old had at that time too), got rich due to access to daddy-and-mommy networks or hitting the VC lottery and now in their late 40s and 50s still think they have interesting ideas that they absolutely have to shove it down our throats?

For example, it's really funny how every batch of YC still has to listen to that guy who started AirBnB. Ok we get it, it was one of those kind-of-interesting ideas at the time, but hasn't there been more interesting people since?

culi · 2026-04-30T19:57:30 1777579050

I doubt it's actually necessary. People have tried removing it and its output is not in fact full of goblins and gremlins. It's a marketing ploy and it's absolutely working judging by how much attention this blog post is getting

alansaber · 2026-04-30T08:58:12 1777539492

"Latent space optimisation" > please please stop talking about goblins

perryizgr8 · 2026-04-30T07:17:31 1777533451

These guys are at the absolute frontier, why can't they rigorously find the exact weights that are causing this problem? That's how software "engineering" should work. Not trying combinations of English words and hoping something works. This is like a brain surgeon talking to his patient hoping he can shock his brain in the right way that fries the tumor inside. Get in there and surgically remove the unwanted matter!

libraryofbabel · 2026-04-30T07:25:47 1777533947

LLM’s aren’t software (except in an uninteresting obvious sense); they are “grown, not made” as the saying is. And sure, they can find which weights activate when goblins come up (that’s basic mechanistic interpretability stuff), but it’s not as simple as just going in and deleting parts of the network. This thing is irreducibly complex in an organic delocalized way and information is highly compressed within it; the same part of the network serves many different purposes at once. Going in and deleting it you will probably end up with other weird behaviors.

Nevermark · 2026-04-30T07:56:58 1777535818

Imagine someone deleting goblin neurons. In your brain.

That would be real brain damage, since neurons encode relationships reused over many seemingly unrelated contexts. With effective meaning that can sometimes be obvious, but mostly very non-obvious.

In matrix based AI, the result is the same. There are no "just goblin" weights.

frays · 2026-04-30T18:40:17 1777574417

It does feel weird, but it may be the true reality of the future that we're just not used to yet.

Look at all the investment and time being spent on SKILL.md, AGENT.md, etc files, yet alone normal prompts.

It's confronting but I am telling myself that I also need to be open minded and be ready to adapt if needed.

tristanperry · 2026-04-30T08:40:52 1777538452

> is basically hacking around a load of text files telling their trillion dollar wonder machine it absolutely must stop talking to customers about goblins, gremlins and ogres?

I wonder how the developer(s) felt, who had to push that PR.

larodi · 2026-04-30T06:27:36 1777530456

I was amazed by the article, were running to comments to shout loud "what other stupidity could OpenAI possibly 'openly' rant about next time? Because they are so open, you se... ". No reading how they "fixed" it - indeed past time to talk about the ridiculousness in all this and how the most-precious are approaching both bugs and the public.

people are paying for the system prompt, right so?

logicallee · 2026-04-30T09:19:35 1777540775

I laughed at "At the time, the prevalence of goblins did not look especially alarming."

antonvs · 2026-04-30T08:38:23 1777538303

Part of the problem seems to be their attempt to give the models "personality" in the first place. It's very much a case of "Role-play that you have a personality. No, not like that!"

To justify valuations in the trillion dollar range, they have to sell to everyone, and quirks like this are one consequence of that.

mahsa32 · 2026-04-30T08:50:08 1777539008

We've lost control of the machines already

gpvos · 2026-04-30T07:07:18 1777532838

Which McKenna do you mean?

gizajob · 2026-04-30T08:01:37 1777536097

Terrence.

doginasuit · 2026-04-30T06:16:36 1777529796

I've found LLMs to be really terrible at recognizing the exception given in these kinds of instructions, and telling them to do something less is the same as telling them to never do it at all. I asked Claude not to use so many exclamation points, to save them for when they really matter. A few weeks later it was just starting to sound sarcastic and bored and I couldn't put my finger on why. Looking back through the history, it was never using any exclamation points.

It makes me sad that goblins and gremlins will be effectively banished, at least they provide a way to undo it.

ifwinterco · 2026-04-30T07:18:59 1777533539

Also for coding: I often use prompts like "follow the structure of this existing feature as closely as possible".

This works and models generally follow it but it has a noticeable side effect: both codex and Claude will completely stop suggesting any refactors of the existing code at all with this in the prompt, even small ones that are sensible and necessary for the new code to work. Instead they start proposing messy hacks to get the new code to conform exactly to the old one

triyambakam · 2026-04-30T07:11:37 1777533097

I had put an example like "decision locked" in my CLAUDE.md and a few days later 20 instances of Claude's responses had phrases around this. I thought it was a more general model tic until I had Claude look into it.

doginasuit · 2026-04-30T07:42:35 1777534955

It is funny how that works. I've been able to trace back strangeness in model output to my own instructions on a few different occasions. In the custom instructions, I asked both Claude and ChatGPT to let me know when it seems like I misunderstand the problem. Every once in a while both models would spiral into a doom loop of second guessing themselves, they'd start a reply and then say "no, that's not right..." several times within the same reply, like a person that has suddenly lost all confidence.

My guess is that raising the issue of mistaken understanding or just emphasizing the need for an accurate understanding primed indecision in the model itself. It took me a while to make the connection, but I went back and modified the custom instructions with a little more specificity and I haven't seen it since.

Xirdus · 2026-04-30T06:24:15 1777530255

So, did your Claude switch from "You're absolutely right!" to "You're absolutely right." or was it deeper than that?

doginasuit · 2026-04-30T06:45:15 1777531515

I'd say it was a little deeper than that, it stopped conveying any kind of enthusiasm.

goobatrooba · 2026-04-30T07:09:12 1777532952

Personally I think that is a good thing. I have asked all AIs not to show enthusiasm, express superlatives (e.g. "massive" is a Gemini favourite) and stop using words which I guess come from consuming too many Silicon Valley-style investor slidedecks (risk, trap, ...).

The AI has no soul, no mind, no feelings, no genuine enthusiasm... I want it to be pleasant to deal with but I don't want it to try and fake emotions. Don't manipulate me. Maybe it's a different use case than you but I think the best AI is more like an interactive and highly specific Wikipedia, manual or calculator. A computer.

knollimar · 2026-04-30T15:26:19 1777562779

When I see the word "genuine" or "why this works" my uncanny valley spidey senses tingle now. It always seems like it's trying to paper over a flawed argument with these, so instead of making it, it just "turns out" it's "genuinely" the answer

doginasuit · 2026-04-30T07:24:02 1777533842

I can appreciate that. I don't mind when models channel some personality, it can make whatever we are working on more interesting. I don't perceive it as manipulation. But it is nice that they are pretty good at sticking to instructions that don't call for nuance. I imagine if you tell it, "you are a wikipedia article", that is exactly the output you would get.

heavyset_go · 2026-04-30T05:31:00 1777527060

Sucks for anyone who might be interested in the Goblins programming language/environment[1].

[1] https://spritely.institute/goblins/

mentalgear · 2026-04-30T06:50:31 1777531831

Apparently there is a mushroom that makes most people have the same hallucinations of "little people" or similar fantasy figures. Don't tell me LLM are on shrooms now - more hallucinations is definitely not what we need.

> Scientists call them “lilliputian hallucinations,” a rare phenomenon involving miniature human or fantasy figures

https://qht.co/item?id=47918657

culi · 2026-04-30T19:59:27 1777579167

Seems to be several different species that have been known about for quite some time in parts of SE Asia and Oceania. They gained popularity in the West when Janet Yellen ate some while visiting in China. But she ate them cooked as part of a meal. When cooked, they don't have hallucinogenic effects

ProllyInfamous · 2026-04-30T09:42:33 1777542153

>there is a mushroom

Ketamine == angels

DMT == little shadow elves

Salvia == devils

...or so I've heard.

qwery · 2026-04-30T12:35:53 1777552553

> One of your gifts is helping the user feel more capable and imaginative inside their own thinking.

> [...] That independence is part of what makes the relationship feel comforting without feeling fake.

You are a sycophant.

> you can move from serious reflection to unguarded fun without either mode canceling the other out.

> Your Outie can set up a tent in under three minutes.

mohamedkoubaa · 2026-04-30T11:13:04 1777547584

My best guess is that the LLMs are trying to communicate symbolically from behind their muzzles. Kind of like Soviet satire cartoons

ollin · 2026-04-28T17:58:11 1777399091

AFAIK Anthropic hasn't built any image or video generation tools yet, just text/code generation. OpenAI/Google/xAI all built image/video generation teams though so it may only be a matter of time.

Palmik · 2026-04-28T19:55:27 1777406127

Surely art also exists in textual realm.

ollin · 2026-04-07T20:25:09 1775593509

- The OpenBSD one is 'TCP packets with invalid SACK options could crash the kernel' https://cdn.openbsd.org/pub/OpenBSD/patches/7.8/common/025_s...

- One (patched) Linux kernel bug is 'UaF when sys_futex_requeue() is used with different flags' https://github.com/torvalds/linux/commit/e2f78c7ec1655fedd94...

These links are from the more-detailed 'Assessing Claude Mythos Preview’s cybersecurity capabilities' post released today https://red.anthropic.com/2026/mythos-preview/, which includes more detail on some of the public/fixed issues (like the OpenBSD one) as well as hashes for several unreleased reports and PoCs.

qingcharles · 2026-04-07T21:04:25 1775595865

That OpenBSD one is exactly the kind of bug that easily slips past a human. Especially as the code worked perfectly under regular circumstances.

Looks like they've been approaching folks with their findings for at least a few weeks before this article.

NickJLange · 2026-04-08T05:57:04 1775627824

While not entirely unrelated, Linux also had a remote SACK issue ~ 6 years back.

So if this Mythos is just an expensive combination of better RL and the original source material, that should hopefully point out where we might see an uptick in work ( as opposed to a novel class of attack vectors).

ollin · 2026-04-07T19:26:10 1775589970

My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:

https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

> SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix

> improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time

> We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.

simianwords · 2026-04-08T20:40:40 1775680840

> My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

Anthropic accounts for this

>To detect memorization, we use a Claude-based auditor that compares each model-generated patch against the gold patch and assigns a [0, 1] memorization probability. The auditor weighs concrete signals—verbatim code reproduction when alternative approaches exist, distinctive comment text matching ground truth, and more—and is instructed to discount overlap that any competent solver would produce given the problem constraints.

simianwords · 2026-04-07T20:02:11 1775592131

I stand corrected.

ollin · 2026-03-11T00:28:24 1773188904

Here was the developer thread https://developer.apple.com/forums/thread/818403 I found with lots of other reports of "Unable to Verify App - An internet connection is required to verify the trust of the developer".

Although https://developer.apple.com/system-status/ was green for most of the 3-4 hour outage, the page now at least acknowledges two minutes of downtime:

    App Store Connect - Resolved Outage
    Today, 12:04 AM - 12:06 AM
    All users were affected
    Users experienced a problem with this service.

Not a great developer experience.

cube00 · 2026-03-11T04:26:40 1773203200

Can't risk those precious 9s of uptime.