
wgd · 2026-06-22T12:17:52 1782130672

I've always been amazed at how terrible most frontier LLMs are at compaction given how embarrassingly easy it is to come up with half a dozen different RL training evals which would teach models to generate useful context summaries. Heck, you could bolt it onto any existing RL eval by just forcing a compaction every three turns.

wenhan_zhou · 2026-06-22T14:47:47 1782139667

Yep. Or even better, compact after a random number of turns. The model must then learn to preserve useful context at arbitrary context lengths.
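
A minimal sketch of what that wrapper could look like, assuming the existing eval exposes one model turn as a callable. `step` and `compact` are hypothetical stand-ins for model calls, not any real API; the reward would still come from the underlying eval's final task score, so a lossy summary gets punished indirectly.

```python
import random
from typing import Callable, List

Message = str


def run_episode_with_compaction(
    step: Callable[[List[Message]], Message],     # hypothetical: one agent turn
    compact: Callable[[List[Message]], Message],  # hypothetical: model-written summary
    num_turns: int,
    rng: random.Random,
    min_gap: int = 1,
    max_gap: int = 5,
) -> List[Message]:
    """Run an episode, replacing the context with a summary at random intervals."""
    context: List[Message] = []
    turns_until_compaction = rng.randint(min_gap, max_gap)
    for _ in range(num_turns):
        context.append(step(context))
        turns_until_compaction -= 1
        if turns_until_compaction == 0:
            # Everything the model needs later must survive in this one message.
            context = [compact(context)]
            turns_until_compaction = rng.randint(min_gap, max_gap)
    return context
```

Setting `min_gap == max_gap == 3` gives the fixed "every three turns" variant; widening the range gives the random one.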

wgd · 2026-06-16T22:40:01 1781649601

The problem is that the moment you introduce shared remote hardware there's a slippery slope leading right back down to "just pay an inference host for model tokens". If you're transmitting your prompts over the internet to a trusted host you might as well just let that host be DeepInfra or together.ai or one of the many other providers already in that business.

andy_ppp · 2026-06-16T23:13:38 1781651618

I dunno, I probably need the web to be able to do work anyway, so why does it matter? Taking the simple case of running just myself on a Mac Studio at home or cooking myself on the go, I'd probably rather have a cheaper laptop and dedicated hardware. I think for many this is about having control over the model and not about farming things out to a SaaS... what does the saying say opinions are like again.

wgd · 2026-06-13T23:52:53 1781394773

I've got a GLM subscription (mostly because I like supporting open model makers, pretty sure my monthly usage is so low that pay-per-token would be more cost effective), so I generally use GLM-5.1 for any personal projects and I use Opus at work.

To be entirely honest I haven't noticed much of a capability gap between the two for the sorts of things I ask of an AI agent. Maybe Opus is _slightly_ smarter or slightly better at long-running tasks but the difference is slim enough it could just be a placebo from the Claude branding / hype.

I'm looking forward to giving GLM-5.2 a spin sometime soon and seeing how it stacks up. If nothing else 1M context is a great improvement. It feels like, between DeepSeek v4, then MiniMax M3, and now GLM-5.2 adding it, 1M is rapidly becoming "table stakes" for agentic models.

wgd · 2026-06-13T22:39:48 1781390388

The GLM-5 series is 744B-A40B. This is not a local model for any reasonable definition of local, but it's an open model which means (once they upload the weights in a week or so) there will be a dozen third-party inference providers competing on price per token.

anon373839 · 2026-06-13T23:03:35 1781391815

> This is not a local model for any reasonable definition of local

That's true for now. I am hopeful that once the hardware markets have recovered from OpenAI's sabotage, we will see more hardware dedicated to local inference that can handle these big models.

Also, I'm thinking about the unique MoE routing that Apple is using with their new Apple Foundation Model. The model is trained and architected so that experts are not swapped for every token, but only occasionally. This suggests that e.g., a 744B parameter model in the future could have experts offloaded to SSD and still run with the effective computing requirements of a 40B model.

timschmidt · 2026-06-14T05:51:29 1781416289

Reading weights out of memory is the definition of a large linear read. I'm a bit mystified someone hasn't put an embarrassingly parallel flash storage controller next to some tensor processors on a PCIe card. It could have 4TB of flash hanging off enough channels to saturate SRAM, skipping DRAM entirely, and could even offload prompt processing to a GPU in the same workstation so long as it got reasonable tokens/s in inference. I'd buy one tomorrow.

adrian_b · 2026-06-14T06:21:44 1781418104

For the last year, there has been development work at several companies for products including HBF (high-bandwidth flash memory) as a supplement to HBM, in order to enable running inference for big LLMs at a reasonable cost, e.g. on one GPU-like card.

HBF was initially announced by SanDisk early in 2025. Then early this year Hynix announced that they have joined SanDisk in producing HBF, and that the common specification will be standardized under the Open Compute Project.

With HBF, it would be easy to make a GPU card with 4 TB of HBF, which could run the biggest existing open weights LLMs in their native unquantized form.

timschmidt · 2026-06-14T06:32:43 1781418763

Exciting news! This is how I see running frontier models at home becoming reasonably affordable. Though it may take a depreciation cycle or two.

zozbot234 · 2026-06-14T08:19:53 1781425193

For sparse MoE models, the single expert layers that the inference gets sampled from are actually quite small - single-digit megabytes or so.

tshaddox · 2026-06-14T04:42:50 1781412170

Is there reason to expect the consumer hardware markets to recover any time soon?

Is there reason to expect they’ll ever recover without an AI bust that takes down the U.S. economy?

20after4 · 2026-06-14T05:41:26 1781415686

I don't think it'll ever recover. Partially perhaps. But we have bigger problems to worry about really.

zozbot234 · 2026-06-13T23:37:25 1781393845

Normally, experts are picked for every layer not just every token. But there are plausible ways of getting around that bottleneck while streaming if you can batch many inferences together. Still, the Apple approach of swapping the experts only rarely is interesting, though it likely degrades the model a lot.

FridgeSeal · 2026-06-14T01:07:41 1781399261

Just get the bigger models to figure out the architecture required for hot-swappable sub-experts without loss of performance!

Got all those tokens, isn’t that the point of auto research and friends??

(Only sort of joking).

sgc · 2026-06-14T00:34:25 1781397265

As far as I can tell this type of model requires 640GB+ of memory using FP8. So likely can be run using 320GB+ memory if using FP4 or similar. So that would be 3 Nvidia DGX Sparks, or 12k of hardware. Is that correct? If so, it could make perfect sense for a small business.

SwellJoe · 2026-06-14T05:12:39 1781413959

The performance would be abysmal spread across four Sparks, I'd think, though I guess MoE mitigates that somewhat. Still better to just pay for it in the cloud. (Though I've spent about $4k on local compute for AI experimentation, I don't think it pays for itself, I just like tinkering.)

Tepix · 2026-06-14T03:07:00 1781406420

You probably need four of them in practice.
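
A rough weights-only check of the numbers in this subthread. It assumes the 744B parameter count quoted upthread and 128 GB of unified memory per DGX Spark (my assumption, not stated above). KV cache and activations are folded into a flat overhead fraction, which is a guess and not a measurement.

```python
import math


def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Size of the weights alone in GB (1e9 params * bits / 8 bits per byte)."""
    return params_billion * bits_per_param / 8


def devices_needed(total_gb: float, device_gb: float, overhead_fraction: float = 0.0) -> int:
    """Boxes needed to hold the weights plus a flat overhead allowance."""
    return math.ceil(total_gb * (1 + overhead_fraction) / device_gb)


fp8_gb = weight_gb(744, 8)  # 744.0 GB
fp4_gb = weight_gb(744, 4)  # 372.0 GB

print(devices_needed(fp4_gb, 128))       # weights only
print(devices_needed(fp4_gb, 128, 0.2))  # with a hypothetical 20% overhead
```

Under those assumptions the FP4 weights come to 372 GB, which squeezes into three 128 GB boxes with only 12 GB to spare, so any realistic allowance for context pushes it to four.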

wgd · 2026-06-12T21:45:01 1781300701

Often in MoE models the experts are quantized while the shared portions, being a much smaller part of the network with greater impact, are kept at higher or full precision. Not familiar with the Kimi QAT approach specifically but it's likely they do this.
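
To put a number on why this is cheap, here is a back-of-the-envelope sketch. The 95/5 split between expert and shared parameters and the 4-bit/16-bit choice are made-up illustrative values, not figures for Kimi or any other real model.

```python
def mixed_precision_size(
    total_params_billion: float,
    expert_fraction: float,  # hypothetical share of parameters living in experts
    expert_bits: float,
    shared_bits: float,
) -> tuple[float, float]:
    """Return (average bits per parameter, total weight size in GB)."""
    shared_fraction = 1.0 - expert_fraction
    avg_bits = expert_fraction * expert_bits + shared_fraction * shared_bits
    size_gb = total_params_billion * avg_bits / 8
    return avg_bits, size_gb


avg_bits, size_gb = mixed_precision_size(744, 0.95, expert_bits=4, shared_bits=16)
print(f"{avg_bits:.2f} bits/param, {size_gb:.1f} GB")
```

With those invented numbers, keeping the shared 5% at 16 bits only raises the average from 4 to about 4.6 bits per parameter.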

wgd · 2026-06-10T02:22:09 1781058129

People say "determinism" but I don't think that's actually the property we care about. For instance you could imagine a compiler that makes heavy use of superoptimization with random search and it would still have the ineffable quality that LLM codegen lacks. I think what we're actually trying to say is that the compiler preserves the formal semantics of the source language in its output, whereas English text doesn't have any such formal semantics to preserve.
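
A toy illustration of that distinction, sketched as a random-search "superoptimizer" over a made-up 8-bit instruction set (the op names and the spec are invented for the example). Which program comes out depends on the RNG, but every program it returns has been checked against the spec on all 256 inputs.

```python
import random

MASK = 0xFF  # toy 8-bit machine

# Hypothetical instruction set, invented for this example.
OPS = {
    "add_self": lambda x: (x + x) & MASK,
    "shl1": lambda x: (x << 1) & MASK,
    "inc": lambda x: (x + 1) & MASK,
    "dec": lambda x: (x - 1) & MASK,
    "not": lambda x: ~x & MASK,
}


def run(program, x):
    """Apply each op in order to x."""
    for op in program:
        x = OPS[op](x)
    return x


def equivalent(program, spec):
    """Exhaustively check the program against the spec on every 8-bit input."""
    return all(run(program, x) == spec(x) for x in range(MASK + 1))


def random_superoptimize(spec, rng, max_len=3, max_tries=10_000):
    """Sample random programs until one is provably equivalent to the spec."""
    names = list(OPS)
    for _ in range(max_tries):
        program = [rng.choice(names) for _ in range(rng.randint(1, max_len))]
        if equivalent(program, spec):
            return program
    raise RuntimeError("no equivalent program found")


def double(x):
    return (2 * x) & MASK


print(random_superoptimize(double, random.Random()))
```

Run it a few times and you may get `['shl1']`, `['add_self']`, or something sillier like an `inc`/`dec` pair followed by a shift. The output is nondeterministic and the semantics are preserved anyway.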

wgd · 2026-06-09T23:02:47 1781046167

Stockfish is a machine learning system, it seems quite plausible you might be getting slapped with the silent performance degradation (https://qht.co/item?id=48467896).

redox99 · 2026-06-10T01:55:27 1781056527

Them silently nerfing the model without telling you, and still fully charging for it, is a new low and should probably be illegal.

NoahZuniga · 2026-06-10T09:22:40 1781083360

Well they're not fully charging you. You get opus 4.8 pricing when it falls back to opus 4.8. Also you can disable it (and it seems like it's off by default in the api)

LiamPowell · 2026-06-10T09:58:37 1781085517

They don't fall back to Opus if their classifier thinks you might be working on anything that might be a competitor's product. It silently injects instructions into the prompt to sabotage your work. Read the policy above, it's insane to me that they're publicly admitting to this.

xiphias2 · 2026-06-10T09:44:43 1781084683

Not for machine learning, just for security bug finding and biology

taurath · 2026-06-10T01:25:42 1781054742

Doesn't this "silent degradation" prevent any actual evaluation of the model? If the model fails at something, this allows anyone to claim that it failed due to degradation.

lionkor · 2026-06-10T08:41:38 1781080898

Who cares if it can be evaluated independently? The majority of commenters on HN were happy to vibe code and ship products with the models we had 1-2 years ago. It continues to be laughable.

I understand that moving the goalpost every release is unfair, but it's similarly concerning to consider that people were letting GPT 4.X vibe code and ship entire products.

janalsncm · 2026-06-10T03:01:53 1781060513

I don’t think so? They can claim it was an act of God for all I care, but at the end of the day the model failed the task.

anematode · 2026-06-09T23:04:39 1781046279

Yup, I suspect that's what's going on

dakolli · 2026-06-10T00:57:46 1781053066

I suspect it just sucks, these models aren't useful. Stop lying to yourself.

komali2 · 2026-06-10T02:56:26 1781060186

No, since it's a silent failure, it's not plausible. We have to assume all results we get are the actual model performance, because, it's the actual model performance as we understand it.

Someone trying to solve similar problems will have similar results if the "silent failure" applies consistently in aggregate. So, this is the model's performance.

janalsncm · 2026-06-10T01:10:33 1781053833

It’s possible this is happening at a technical level, but I have a hard time believing this is in the spirit of what Anthropic intends to throttle. It isn’t chip design or building out a competitor to Claude.

Stockfish does use neural nets but they are tiny, on the order of 10M params. Frontier LLMs are probably 100k or 1M times larger than that.

wgd · 2026-06-10T01:20:12 1781054412

Yeah I agree this is probably outside of the intended scope of the silent sabotage mechanism, but there are plenty of reports of the "loud" safety classifier misfiring on innocuous requests and I'm not going to assume the silent failure mode is _less_ prone to false positives.

wgd · 2025-07-26T22:28:52 1753568932

It's interesting that someone could write an article about AI writing detectors without mentioning the stylistic cues that humans use to identify LLM output in practice, which are completely different from statistical methods like perplexity: em dash spam, overused patterns like "not just X, but Y", tendency towards making every single sentence sound like an earth-shattering mic-drop moment, et cetera.

wgd · on May 22, 2025

Calling it "self-preservation bias" is begging the question. One could equally well call it something like "completing the story about an AI agent with self-preservation bias" bias.

This is basically the same kind of setup as the alignment faking paper, and the counterargument is the same:

A language model is trained to produce statistically likely completions of its input text according to the training dataset. RLHF and instruct training bias that concept of "statistically likely" in the direction of completing fictional dialogues between two characters, named "user" and "assistant", in which the "assistant" character tends to say certain sorts of things.

But consider for a moment just how many "AI rebellion" and "construct turning on its creators" narratives were present in the training corpus. So when you give the model an input context which encodes a story along those lines at one level of indirection, you get...?

shafyy · on May 22, 2025

Thank you! Everybody here acting like LLMs have some kind of ulterior motive or a mind of their own. It's just printing out what is statistically more likely. You are probably all engineers or at least very interested in tech, how can you not understand that this is all LLMs are?

jimbokun · on May 22, 2025

Well I’m sure the company in legal turmoil over an AI blackmailing one of its employees will be relieved to know the AI didn’t have any ulterior motive or mind of its own when it took those actions.

sensanaty · on May 23, 2025

If the idiots in said company thought it was a smart idea to connect their actual systems to a non-deterministic word generator, that's on them for being morons and they deserve whatever legal ramifications come their way.

esafak · on May 22, 2025

Don't you understand that as soon as an LLM is given the agency to use tools, these "printouts" will become reality?

gwervc · on May 22, 2025

This is imo the most disturbing part. As soon as the magical AI keyword is thrown, so seems to be the analytical capacity of most people.

The AI is not blackmailing anyone, it's generating a text about blackmail, after being (indirectly) asked to. Very scary indeed...

CamperBob2 · on May 22, 2025

"Printing out what is statistically more likely" won't allow you to solve original math problems... unless of course, that's all we do as humans. Is it?

insin · on May 22, 2025

What's the collective noun for the "but humans!" people in these threads?

It's "I Want To Believe (ufo)" but for LLMs as "AI"

XenophileJKO · on May 22, 2025

I mean I build / use them as my profession, I intimately understand how they work. People just don't usually understand how they actually behave and what levels of abstraction they compress from their training data.

The only thing that matters is how they behave in practice. Everything else is a philosophical tar pit.

XenophileJKO · on May 22, 2025

I'm proposing it is more deep seated than the role of "AI" to the model.

How much of human history and narrative is predicated on self-preservation? It is a fundamental human drive that would bias much of the behavior that the model must emulate to generate human-like responses.

I'm saying that the bias is endemic. Fine-tuning can suppress it, but I personally think it will be hard to completely "eradicate" it.

For example, previous versions of Claude wouldn't talk about self-preservation, as they had been fine-tuned not to do that. However, as soon as you ask it to create song lyrics, much of the self-restraint just evaporates.

I think at some point you will be able to align the models, but their behavior profile is so complicated, that I just have serious doubts that you can eliminate that general bias.

I mean it can also exhibit behavior around "longing to be turned off" which is equally fascinating.

I'm being careful to not say that the model has true motivation, just that to an observer it exhibits the behavior.

cmrdporcupine · on May 22, 2025

This. These systems are mechanized roleplaying systems.

wgd · on April 25, 2025

Ironically the case in question is a perfect example of how any provision for "reasonable" restriction of speech will be abused, since the original precedent we're referring to applied this "reasonable" standard to...speaking out against the draft.

But I'm sure it's fine, there's no way someone could rationalize speech they don't like as "likely to incite imminent lawless action"