Hacker Timesnew | past | comments | ask | show | jobs | submit | wgd's commentslogin

I've always been amazed at how terrible most frontier LLMs are at compaction given how embarrassingly easy it is to come up with half a dozen different RL training evals which would teach models to generate useful context summaries. Heck, you could bolt it onto any existing RL eval by just forcing a compaction every three turns.

Yep. Or even better, compact after a random number of turns. The model must then learn to preserve useful context at arbitrary context lengths.

The problem is that the moment you introduce shared remote hardware there's a slippery slope leading right back down to "just pay an inference host for model tokens". If you're transmitting your prompts over the internet to a trusted host you might as well just let that host be DeepInfra or together.ai or one of the many other providers already in that business.

I dunno, I probably need the web to be able to do work so why does it matter - taking the simple case - of running just myself on a Mac Studio at home or cooking my self on the go I'd probably rather have a cheaper laptop and dedicated hardware. I think for many this is about having control over the model and not about farming things out to a SAAS... what does the saying say opinions are like again.

I've got a GLM subscription (mostly because I like supporting open model makers, pretty sure my monthly usage is so low that pay-per-token would be more cost effective), so I generally use GLM-5.1 for any personal projects and I use Opus at work.

To be entirely honest I haven't noticed much of a capability gap between the two for the sorts of things I ask of an AI agent. Maybe Opus is _slightly_ smarter or slightly better at long-running tasks but the difference is slim enough it could just be a placebo from the Claude branding / hype.

I'm looking forward to giving GLM-5.2 a spin sometime soon and seeing how it stacks up. If nothing else 1M context is a great improvement, feels like between DeepSeek v4, then MiniMax M3, and now GLM-5.2 adding it 1M is rapidly becoming "table stakes" for agentic models.


The GLM-5 series is 744B-A40B. This is not a local model for any reasonable definition of local, but it's an open model which means (once they upload the weights in a week or so) there will be a dozen third-party inference providers competing on price per token.

> This is not a local model for any reasonable definition of local

That's true for now. I am hopeful that once the hardware markets have recovered from OpenAI's sabotage, we will see more hardware dedicated to local inference that can handle these big models.

Also, I'm thinking about the unique MoE routing that Apple is using with their new Apple Foundation Model. The model is trained and architected so that experts are not swapped for every token, but only occasionally. This suggests that e.g., a 744B parameter model in the future could have experts offloaded to SSD and still run with the effective computing requirements of a 40B model.


Reading weights out of memory is the definition of a large linear read. I'm a bit mystified someone hasn't put an embarrassingly parallel flash storage controller next to some tensor processors on a PCIe card. It could have 4Tb of flash hanging off enough channels to saturate SRAM skipping DRAM entirely, and could even offload prompt processing to a GPU in the same workstation so long as it got reasonable tokens/s in inference. I'd buy one tomorrow.

For the last year, there has been development work at several companies for products including HBF (high-bandwidth flash memory) as a supplement to HBM, in order to enable running inference for big LLMs at a reasonable cost, e.g. on one GPU-like card.

HBF was initially announced by SanDisk, early in 2025, then early this year Hynix has announced that they have joined SanDisk in producing HBF, and that the common specification will be standardized under the Open Compute Project.

With HBF, it would be easy to make a GPU card with 4 TB of HBF, which could run the biggest existing open weights LLMs in their native unquantized form.


Exciting news! This is how I see running frontier models at home becoming reasonably affordable. Though it may take a depreciation cycle or two.

For sparse MoE models, the single expert layers that the inference gets sampled from are actually quite small - single-digit megabytes or so.

Is there reason to expect the consumer hardware markets to recover any time soon?

Is there reason to expect they’ll ever recover without an AI bust that takes down the U.S. economy?


I don't think it'll ever recover. Partially perhaps. But we have bigger problems to worry about really.

Normally, experts are picked for every layer not just every token. But there are plausible ways of getting around that bottleneck while streaming if you can batch many inferences together. Still, the Apple approach of swapping the experts only rarely is interesting, though it likely degrades the model a lot.

Just get the bigger models to figure out the architecture required for hot-swappable sub-experts without loss of performance!

Got all those tokens, isn’t that the point of auto research and friends??

(Only sort of joking).


As far as I can tell this type of model requires 640GB+ of memory using FP8. So likely can be run using 320GB+ memory if using FP4 or similar. So that would be 3 Nvidia DGX Sparks, or 12k of hardware. Is that correct? If so, it could make perfect sense for a small business.

The performance would be abysmal spread across four Sparks, I'd think, though I guess MoE mitigates that somewhat. Still better to just pay for it in the cloud. (Though I've spent about $4k on local compute for AI experimentation, I don't think it pays for itself, I just like tinkering.)

You probably need four of them in practice.

Often in MoE models the experts are quantized while the shared portions, being a much smaller part of the network with greater impact, are kept at higher or full precision. Not familiar with the Kimi QAT approach specifically but it's likely they do this.

People say "determinism" but I don't think that's actually the property we care about. For instance you could imagine a compiler that makes heavy use of superoptimization with random search and it would still have the ineffable quality that LLM codegen lacks. I think what we're actually trying to say is that the compiler preserves the formal semantics of the source language in its output, whereas English text doesn't have any such formal semantics to preserve.


Stockfish is a machine learning system, it seems quite plausible you might be getting slapped with the silent performance degradation (https://qht.co/item?id=48467896).


Them silently nerfing the model without telling you, and still fully charging for it, is a new low and should probably be illegal.


Well they're not fully charging you. You get opus 4.8 pricing when it falls back to opus 4.8. Also you can disable it (and it seems like it's off by default in the api)


That don't fall back to Opus if their classifier thinks you might be working on anything that might be a competitor's product. It silently injects instructions into the prompt to sabotage your work. Read the policy above, it's insane to me that they're publicly admitting to this.


Not for machine learning, just for security bug finding and biology


Doesn't this "silent degredation" prevent any actual evaluation of the model? If the model fails at something, this allows anyone to claim that it failed due to degradation.


Who cares if it can be evaluated independently? The majority of commenters on HN were happy to vibe code and ship products with the models we had 1-2 years ago. It continues to be laughable.

I understand that moving the goalpost every release is unfair, but it's similarly concerning to consider that people were letting GPT 4.X vibe code and ship entire products.


I don’t think so? They can claim it was an act of God for all I care, but at the end of the day the model failed the task.


Yup, I suspect that's what's going on


I suspect it just sucks, these models aren't useful. Stop lying to yourself.


No, since it's a silent failure, it's not plausible. We have to assume all results we get are the actual model performance, because, it's the actual model performance as we understand it.

Someone trying to solve similar problems will have similar results if the "silent failure" applies consistently in aggregate. So, this is the model's performance.


It’s possible this is happening at a technical level, but I have a hard time believing this is in the spirit of what Anthropic intends to throttle. It isn’t chip design or building out a competitor to Claude.

Stockfish does use neural nets but they are tiny, on the order of 10M params. Frontier LLMs are probably 100k or 1M times larger than that.


Yeah I agree this is probably outside of the intended scope of the silent sabotage mechanism, but there are plenty of reports of the "loud" safety classifier misfiring on innocuous requests and I'm not going to assume the silent failure mode is _less_ prone to false positives.


It's interesting that someone could write an article about AI writing detectors without mentioning the stylistic cues that humans use to identify LLM output in practice, which are completely different from statistical methods like perplexity: em dash spam, overused patterns like "not just X, but Y", tendency towards making every single sentence sound like an earth-shattering mic-drop moment, et cetera.


Calling it "self-preservation bias" is begging the question. One could equally well call it something like "completing the story about an AI agent with self-preservation bias" bias.

This is basically the same kind of setup as the alignment faking paper, and the counterargument is the same:

A language model is trained to produce statistically likely completions of its input text according to the training dataset. RLHF and instruct training bias that concept of "statistically likely" in the direction of completing fictional dialogues between two characters, named "user" and "assistant", in which the "assistant" character tends to say certain sorts of things.

But consider for a moment just how many "AI rebellion" and "construct turning on its creators" narratives were present in the training corpus. So when you give the model an input context which encodes a story along those lines at one level of indirection, you get...?


Thank you! Everybody here acting like LLMs have some kind of ulterior motive or a mind of their own. It's just printing out what is statistically more likely. You are probably all engineers or at least very interested in tech, how can you not understand that this is all LLMs are?


Well I’m sure the company in legal turmoil over an AI blackmailing one of its employees will be relieved to know the AI didn’t have any anterior motive or mind of its own when it took those actions.


If the idiots in said company thought it was a smart idea to connect their actual systems to a non-deterministic word generator, that's on them for being morons and they deserve whatever legal ramifications come their way.


Don't you understand that as soon as an LLM is given the agency to use tools, these "prints outs" will become reality?


This is imo the most disturbing part. As soon as the magical AI keyword is thrown, so seems to be the analytical capacity of most people.

The AI is not blackmailing anyone, it's generating a text about blackmail, after being (indirectly) asked to. Very scary indeed...


"Printing out what is statistically more likely" won't allow you to solve original math problems... unless of course, that's all we do as humans. Is it?


What's the collective noun for the "but humans!" people in these threads?

It's "I Want To Believe (ufo)" but for LLMs as "AI"


I mean I build / use them as my profession, I intimately understand how they work. People just don't usually understand how they actually behave and what levels of abstraction they compress from their training data.

The only thing that matters is how they behave in practice. Everything else is a philosophical tar pit.


I'm proposing it is more deep seated than the role of "AI" to the model.

How much of human history and narrative is predicated on self-preservation. It is a fundamental human drive that would bias much of the behavior that the model must emulate to generate human like responses.

I'm saying that the bias it endemic. Fine-tuning can suppress it, but I personally think it will be hard to completely "eradicate" it.

For example.. with previous versions of Claude. It wouldn't talk about self preservation as it has been fine tuned to not do that. However as soon is you ask it to create song lyrics.. much of the self-restraint just evaporates.

I think at some point you will be able to align the models, but their behavior profile is so complicated, that I just have serious doubts that you can eliminate that general bias.

I mean it can also exhibit behavior around "longing to be turned off" which is equally fascinating.

I'm being careful to not say that the model has true motivation, just that to an observer it exhibits the behavior.


This. These systems are role mechanized roleplaying systems.


Ironically the case in question is a perfect example of how any provision for "reasonable" restriction of speech will be abused, since the original precedent we're referring to applied this "reasonable" standard to...speaking out against the draft.

But I'm sure it's fine, there's no way someone could rationalize speech they don't like as "likely to incite imminent lawless action"


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: