I love everything about this direction except for the insane inference costs. I don’t mind the training costs, since models are commoditized as soon as they’re released. Although I do worry that if inference costs drop, the companies training the models will have no incentive to publish their weights, because inference revenue is where they recoup the training cost.
Either way… we badly need more innovation in inference price per performance, on both the software and hardware side. It would be great if software innovation unlocked inference on commodity hardware. That’s unlikely to happen, but today’s bleeding edge hardware is tomorrow’s commodity hardware so maybe it will happen in some sense.
If Taalas can pull off burning models into hardware with a two month lead time, that will be huge progress, but still wasteful because then we’ve just shifted the problem to a hardware bottleneck. I expect we’ll see something akin to gameboy cartridges that are cheap to produce and can plug into base models to augment specialization.
But I also wonder if anyone is pursuing even more radical ideas, like reverting to analog computing and leveraging voltage differentials in clever ways. It’s too big-brain for me, but intuitively it feels like wasting entropy to collapse a voltage spike to a 0 or 1.
> I love everything about this direction except for the insane inference costs.
If this direction holds true, the ROI math works out cheaper.
Instead of employing 4 people (Customer Support, PM, Eng, Marketing), you will have 3-5 agents, and the whole ticket flow might cost you ~$20.
But I hope we won't go this far, because when things fail every customer will be impacted, and there will be no one left who understands the system well enough to fix it.
I worry about the costs from an energy and environmental impact perspective. I love that AI tools make me more productive, but I don't like the side effects.
The environmental impact of AI is greatly overstated. The average person would make a bigger positive impact on the environment by reducing their meat intake by 25% than by giving up flying and AI use combined.
This is the wrong way to see it. If a technology gets cheaper, people will use more and more and more of it. If inference costs drop, you can throw far more reasoning tokens at a problem, plus a combination of many, many agents, to increase accuracy or creativity and so on.
No company at the moment has enough money to operate with 10x the reasoning tokens of their competitors, because they're bottlenecked by GPU capacity (or other physical constraints). Maybe in lab experiments, but not for generally available products.
And I sense you would have to throw orders of magnitude more tokens to get meaningfully better results (if anyone has access to experiments with GPT-5 class models geared up to use marginally more tokens with good results, please call me out though).
No, the key (novel) element here is the two-tiered approach to sandboxing and inter-agent communication. That’s why he spends most of the post talking about it and only a few sentences on which models he selected.
RFC 1459 originally stipulated that messages not exceed 512 bytes in length, inclusive of the trailing CR-LF, which meant the actual usable length for message text was less. When the protocol's evolution was re-formalized in 2000 via RFCs 2810-2813, the 512-byte limit was kept.
However, most modern IRC implementations support a subset of the IRCv3 protocol extensions, which allow up to 8192 bytes for "message tags" (i.e. metadata) and keep the 512-byte message length limit purely for historical and backwards-compatibility reasons, for old clients that don't support the v3 extensions to the protocol.
So the answer, strictly speaking, is yes. IRC does still have message length limits, but practically speaking it's because there's a not-insignificant installed base of legacy clients that will shit their pants if the message lengths exceed that 512-byte limit, rather than anything inherent to the protocol itself.
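In practice that means clients have to budget for the prefix the server prepends plus the trailing CR-LF and split long messages themselves. Here's a minimal sketch of that kind of chunking in plain Python (not any particular IRC library's API; the prefix budget is a made-up illustrative number):

    # Rough sketch: split a long PRIVMSG so each line stays under the classic
    # 512-byte limit, which counts "PRIVMSG <target> :", the text, and CRLF.
    # prefix_budget is a guess at the ":nick!user@host " the server prepends.
    MAX_LINE = 512

    def split_privmsg(target: str, text: str, prefix_budget: int = 100) -> list[str]:
        overhead = len(f"PRIVMSG {target} :\r\n") + prefix_budget
        payload = MAX_LINE - overhead
        data = text.encode("utf-8")
        chunks = []
        while data:
            piece, data = data[:payload], data[payload:]
            # Naive byte split; a real client would avoid cutting multi-byte chars.
            chunks.append(f"PRIVMSG {target} :" + piece.decode("utf-8", "ignore"))
        return chunks

    for line in split_privmsg("#channel", "x" * 1200):
        print(len(line.encode("utf-8")) + 2, "bytes on the wire (plus server prefix)")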
I bet there’s gonna be a banger of a Mac Studio announced in June.
Apple really stumbled into making the perfect hardware for home inference machines. Does any hardware company come close to Apple in terms of unified memory and single machines for high throughput inference workloads? Or even any DIY build?
When it comes to the previous “pro workloads,” like video rendering or software compilation, you’ve always been able to build a PC that outperforms any Apple machine at the same price point. But inference is unique because its performance scales with memory bandwidth, and you can’t assemble that by wiring together off-the-shelf parts in a consumer form factor.
It’s simply not possible to DIY a homelab inference server better than the M3+ for inference workloads, at anywhere close to its price point.
They are perfectly positioned to capitalize on the next few years of model architecture developments. No wonder they haven’t bothered working on their own foundation models… they can let the rest of the industry do their work for them, and by the time their Gemini licensing deal expires, they’ll have their pick of the best models to embed with their hardware.
> But inference is unique because its performance scales with high memory throughput, and you can’t assemble that by wiring together off the shelf parts in a consumer form factor.
Nvidia outperforms Mac significantly on diffusion inference and many other forms. It’s not as simple as the current Mac chips are entirely better for this.
Given that most of mine, and probably yours, and probably most of the world's computers are in fact made in China one way or another (some to a higher degree than others), I'm guessing most of us trust our hardware enough to continue using it.
Where are you gonna find Apple hardware with 128GB of memory at an enthusiast-compatible price?
The cheapest Apple desktop with 128GB of memory shows up as costing $3499 for me, which isn't very "enthusiast-compatible"; it's about 3x the minimum salary in my country!
Seems I misunderstood what an "enthusiast" is. I thought it was about someone "excited about something", but it seems the typical definition includes them having a lot of money too, my bad.
Right, I think maybe we're then talking about "upper class enthusiasts" or something in reality then? I understood that to just be about the person, not what economic class they were in; maybe I misunderstood.
A 128GB 2TB Dell Pro Max with Nvidia GB10 is about $4200, a Mac Studio with 128GB RAM and 2TB storage is $4100. So pretty comparable. I think Dell's pricing has been rocked more by the RAM shortage too.
From the spec sheets I’m looking at, it is not. I’m seeing models of the Dell Pro Max with 128 GB of DDR5-6400 as CAMM2, then a separate memory of up to 24 GB on the GPU. CAMM2 does not make the memory unified.
You're not looking at the right thing. Dell's naming is horrible. Dell Pro Max with GB10 (https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...). It's a very different computer than what you're looking at and has 128GB LPDDR5X unified memory.
AFAIK, for the unified memory bandwidth, it depends mostly on the CPU: for M4 Max (I think it's the default today?) it does ~550 GB/s, while GB10 does ~270 GB/s, so about a 2x difference between the two. For comparison, RTX Pro 6000 does 1.8 TB/s, pretty much the same as a 5090, which is probably the fastest/best GPU a prosumer could reasonably get.
It has a HDMI port and its USB-C ports also support display out. But I believe most who buy it intend to use it headless. The machine runs Ubuntu 24.04 and has a slightly customised Gnome (green accents and an nvidia logo in GDM) as its desktop.
Jeff Geerling doing that 1.5TB cluster using 4 Mac Studios was pretty much all the proof needed to show that the Mac Pro is struggling to find any place any more.
But those Thunderbolt links are slower than modern PCIe. If there's actually an M5-based Mac Studio with the same Thunderbolt support, you'll be better off, e.g. for LLM inference, streaming read-only model weights from storage (as we've seen with recent experiments) than pushing the same amount of data via Thunderbolt. It's only if you want to go beyond local memory constraints (e.g. larger contexts) that the Thunderbolt link becomes useful.
Why everyone wants to live in dongle/external cabling/dock hell is beyond me. PCIe cards are powered internally with no extra cables. They are secure. They do not move or fall off of shit. They do not require cable management or external power supplies. They do not have to talk to the CPU through a stupid USB hub or a Thunderbolt dock. Crappy USB HDMI capture on my Mac led me to running a fucking PC with slots to capture video off of a 50 foot HDMI cable, that then streamed the feed to my Mac from NDI, because it was more reliable than the elgarbo capture dongle I was using. This shit is bad. It sucks. It's twice the price and half the quality of a Blackmagic Design capture card. But, no slots, so I guess I can go get fucked.
For anything that's even somewhat in the consumer space rather than pure workstation/professional, the main reason is that dongles can be used with a laptop but add-in cards can't. When ordinary consumer PCs (or even office PCs) are in the picture, laptops are a huge chunk of the target audience.
The market segments that can afford to ignore laptops and only target permanently-installed desktops are mostly those niches where the desktop is installed alongside some other piece of equipment that is much more expensive.
Wasn't streaming models from storage into limited memory a case where it was impressive that you could make the elephant dance at all?
If you want to get usable speeds from very large models that haven't been quantized to death on local machines, RDMA over Thunderbolt enables that use case.
Consumer PC GPUs don't have enough RAM, enterprise GPUs that can handle the load very well are obscenely expensive, Strix Halo tops out at 128 Gigs of RAM and is limited on Thunderbolt ports.
The bad performance you saw was with very limited memory and very large models, so streaming weights from storage was a huge bottleneck. If you gradually increase RAM, more and more of the weights are cached and the speed improves quite a bit, at least until you're running huge contexts and most of the RAM ends up being devoted to that. Is the overall speed "usable"? That's highly subjective, but with local inference it's convenient to run 24x7 and rely on non-interactive use. Of course scaling out via RDMA on Thunderbolt is still there as an option, it's just not the first approach you'd try.
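A rough back-of-envelope for that, assuming one full pass over the weights per token (all numbers are illustrative guesses, not benchmarks):

    # Time per token ~= bytes served from RAM / RAM bandwidth
    #                 + bytes streamed from storage / storage bandwidth.
    # Illustrative numbers: 100 GB of weights, ~500 GB/s RAM, ~7 GB/s NVMe.
    def tokens_per_sec(model_gb, cached_fraction, ram_gbps=500, ssd_gbps=7):
        in_ram = model_gb * cached_fraction
        from_disk = model_gb - in_ram
        return 1 / (in_ram / ram_gbps + from_disk / ssd_gbps)

    for frac in (0.5, 0.9, 0.99, 1.0):
        print(f"{frac:.0%} of weights cached -> ~{tokens_per_sec(100, frac):.1f} tok/s")

The storage term dominates until nearly all of the weights fit, which is why adding RAM helps so much.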
The proposition of a Mac Pro in the Apple Silicon world wasn't necessarily about performance, it was about the existence of the PCIe slots. I don't think AI becoming a workload for pro Macs means the Mac Pro doesn't have a place; people who were using Mac Pros for audio or video capture didn't stop doing that media work and switch to AI as a profession. That market just wasn't big enough to sustain the Mac Pro in the first place, and Apple has finally acknowledged that fact.
I had a U-Audio PCI card in a Mac Pro during the Intel era of Macs. It was a chip to run their software plugins, and the plugins are top of the line. I have a U-Audio box that runs over Thunderbolt now. I know there are people who need device slots, but they're vanishingly few. I'm disappointed that this category of machine is going away, but it stopped being for me in the Apple Silicon era.
So many peripherals now come in external boxes that communicate _incredibly quickly_ over Thunderbolt 4/5 that the need for PCIe is marginal, while the cost to support it is significant.
> Apple really stumbled into making the perfect hardware for home inference machines
For LLMs. For inference with other kinds of models, where the amount of compute needed relative to the amount of data transfer is higher, Apple is less ideal and systems with lower memory bandwidth but more FLOPS shine. And if things like Google’s TurboQuant work out for efficient KV-cache quantization, Apple could lose a lot of that edge for LLM inference too, since that would reduce the amount of data shuffling relative to compute.
I'm not a big fan of reducing computing as a whole to just inference. Apple has done quite a bit besides that and it deserves credit. Mac Pro disappearing from the product line is a testament to it, that their compact solutions can cover all needs, not just local inference, to a degree that an expandable tower is not required at all.
> Mac Pro disappearing from the product line is a testament to it
Apple removing or adding something from their product line tells us very little; for all we know, they have a new version ready to be launched next month, or whatever. Unless you work at Apple and/or have internal knowledge, this is all just guessing, not a "testament" to anything.
It's hilarious that not a single one of these has pricing listed anywhere public.
I don't think they expect anyone to actually buy these.
Most companies looking to buy these for developers would ideally have multiple people share one machine and that sort of an arrangement works much more naturally with a managed cloud machine instead of the tower format presented here.
Confirming my hypothesis, this category of devices is more or less absent from the used market. The only DGX workstation on eBay has a GPU from 2017, several generations ago.
'Important' people in organizations get them. They either ask for them, or the team that manages the shared GPU resources gets tired of their shit and they just give them one.
Nvidia doesn’t list prices because they don’t sell the machines themselves. If you click through each of those links, the prices are listed on the distributor’s website. For example the Dell Pro Max with GB10 is $4,194.34 and you can even click “Add to Cart.”
Because that's a different price point, that's getting near 100K, and the availability is very limited. I don't think they're even selling it openly, just to a bunch of partners...
The MSI workstation is the one that has some pricing floating around. Some distributors seem to be quoting USD 96K, with a wait time of 4 to 6 weeks [0]. Others say 90K and are also out of stock [1].
The interesting question is whether they'll lean into it intentionally (better tooling, more ML-focused APIs) or just keep treating it as a side effect of their silicon design.
I think we’ll see a much more robust ecosystem develop around MLX now that agentic coding has reduced the barrier of porting and maintaining libraries to it.
Agreed. I’m planning on selling my 512GB M3 Ultra Studio in the next week or so (I just wrenched my back so I’m on bed-rest for the next few days) with an eye to funding the M5 Ultra Studio when it’s announced at WWDC.
I can live without the RAM for a couple of months to get a good price for it, especially since Apple don’t sell that model (with the RAM) any more.
But it feels really good to have more RAM than you can think of a use for.
I have a faint memory of an interview ages ago with Stroustrup, I think, where he mentioned as an aside that he was using a workstation with 3.2 GB of storage and 4 GB of RAM :)
Just out of curiosity, where do you think is the best place to sell a machine like that with the lowest risk of being scammed, while still getting the best possible price?
> Just out of curiosity, where do you think is the best place to sell a machine like that with the lowest risk of being scammed, while still getting the best possible price?
There are none currently on eBay.co.uk, so I'm going to try there. I'll also try some of the reddit UK-specific groups.
As far as not being scammed - it's a really high value one-off sale, so it'll either be local pickup (and cash / bank-transfer at the time, which happens in seconds in the UK) or escrow.com (for non-eBay) with the buyer paying all the fees etc.
I'd prefer local pickup because then I have the money, the buyer can see it working, verify everything to their satisfaction etc. etc.
> Wish you a speedy recovery for your back!
Thank you :) It is a little better today. Sitting down is now tolerable for short periods... :)
Doesn't escrow.com charge a minimum fee of around 50 dollars/pounds?
I do know that Escrow.com is one of the most reputable escrow platforms. On a more personal note, I would love to know an escrow service where I could sell the spare domains I have (I got some .com/.net domains for $1 during a provider's deal). Is there a particular escrow service that doesn't charge a lot, so I could get a few dollars from selling them? Some of those domains aren't being used by me.
> Thank you :) It is a little better today. Sitting down is now tolerable for short periods... :)
I am wishing you speedy recovery as well. A cowboy gotta have a strong back :-)
According to the calculator, it’d be about £280 assuming the purchase cost was £11k. I think that’s probably an upper-bound on the sale-price, though I can see bids of $20k on eBay.com for the same model.
I sold a domain via escrow.com a long time ago now (20 years or so) but the buyer paid fees, so I don’t know what they charge for that. You could try the calculator they have though (https://www.escrow.com/fee-calculator)
As to a better or cheaper homelab: it depends on the build. AMD AI Max builds do exist, and they also use unified memory. I could argue the competition was, for a long time, selling much more affordable RAM, so you could get a better build outside Apple Silicon.
The typical inference workloads have moved quite a bit in the last six months or so.
Your point would have been largely correct in the first half of 2025.
Now, you're going to have a much better experience with a couple of Nvidia GPUs.
This is for two reasons: reasoning models require a pretty high number of tokens per second to do anything useful, and we are seeing small quantized and distilled reasoning models working almost as well as the ones needing terabytes of memory.
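The footprint math behind that is simple: weight memory is roughly parameter count times bits per weight, ignoring KV cache and activations (rough sketch, illustrative sizes):

    # Weight memory ~= params * bits_per_weight / 8 (KV cache and activations ignored).
    def weight_gb(params_billion, bits):
        return params_billion * bits / 8

    for params, bits in [(70, 16), (70, 4), (8, 4)]:
        print(f"{params}B @ {bits}-bit -> ~{weight_gb(params, bits):.0f} GB of weights")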
For now. In a few years it will be part of every day life, because people will see Apple users enjoying it without thinking about it. You won’t consider it a “home inference machine,” just a laptop with more capabilities than any other vendor offers without a cloud subscription.
> Apple really stumbled into making the perfect hardware for home inference machines
Apple are winning a small battle in a market that they aren’t very good in. If you compare the performance of a 3090 and above vs any Apple hardware, you would be insane to go with the Apple hardware.
When I hear someone say this it’s akin to hearing someone say Macs are good for gaming. It’s such a whiplash from what I know to be reality.
Or another jarring statement - Sam Altman saying Mario has an amazing story in that interview with Elon Musk. Mario has basically the minimum possible story to get you to move the analogue sticks. Few games have less story than Mario. Yet Sam called it amazing.
It’s a statement from someone who just doesn’t even understand the first thing about what they are talking about.
Sorry for the mini rant. I just keep hearing this apple thing over and over and it’s nonsense.
The Framework Desktop is quite cool, but those Ryzen Max CPUs are still a pretty poor competitor to Apple's chips if what you care about is running an LLM. Ryzen Max tops out at 256 GB/s of memory bandwidth, whereas an M4 Max can hit 560 GB/s.
So even if the model fits in the memory buffer on the Ryzen Max, you're still going to hit something like half the tokens/second just because the GPU will be sitting around waiting for data.
Personally, I'd rather have the Framework machine, but if running local LLMs is your main goal, the offerings from Apple are very compelling, even when you adjust for the higher price on the Apple machine.
128GB is the max RAM that the current Strix Halo supports, with ~250GB/s of bandwidth. The Mac Studio goes up to 256GB at that price point, with ~900GB/s of memory bandwidth. They are in different categories of performance; even performance-per-dollar is worse (~$2700 for the Framework Desktop vs $7500 for a Mac Studio M3 Ultra).
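A crude way to read those bandwidth numbers, assuming decode is memory-bandwidth-bound (one pass over the active weights per token; ignores compute, batching, and MoE sparsity, and the figures are just the rough ones from this thread):

    # Upper bound on decode speed: tokens/sec ~= bandwidth / bytes read per token.
    def max_tok_per_sec(bandwidth_gbps, model_gb):
        return bandwidth_gbps / model_gb

    for name, bw in [("Strix Halo ~250 GB/s", 250),
                     ("M4 Max ~550 GB/s", 550),
                     ("M3 Ultra ~900 GB/s (per above)", 900)]:
        print(f"{name}: ~{max_tok_per_sec(bw, 40):.0f} tok/s on a 40 GB model")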
That won’t work for the home hobbyist: 2.4 kW of GPUs alone, plus a 350 W Threadripper Pro with enough PCIe lanes to feed them. You’re looking at close to twice the capacity of an average US household electrical circuit just to run the machine under load.
A cluster of 4 of Apple’s M3 Ultra Mac Studios, by comparison, will consume around 1,100 W under load.
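For rough context on the circuit claim (assuming a standard US 15 A / 120 V branch circuit and the 80% continuous-load rule; a 20 A circuit changes the math):

    # Rough circuit arithmetic; assumptions only, not electrical advice.
    usable_w = 15 * 120 * 0.80          # ~1440 W continuous on a 15 A / 120 V circuit
    gpu_build_w = 2400 + 350            # GPU + Threadripper figures from above
    mac_cluster_w = 1100                # 4x M3 Ultra Studio figure from above
    print(gpu_build_w / usable_w)       # ~1.9x one circuit
    print(mac_cluster_w / usable_w)     # ~0.8x one circuit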
I don't think Apple just stumbled into it, and while I totally agree that Apple is killing it with their unified memory, I think we're going to see a pivot from NVidia and AMD. The biggest reason, I think, is: OpenAI has committed to an enormous amount of capex it simply cannot afford. It does not have the lead it once did, and most end-users simply do not care. There are no network effects. Anthropic at this point has completely consumed, as far as I can tell, the developer market: the one market that is actually passionate about AI. That's largely due to a huge advantage of the developer space: end users cannot tell if an "AI" coded it or a human did. That's not true for almost every other application of AI at this point.
If the OpenAI domino falls, and I'd be happy to admit if I'm wrong, we're going to see a near-catastrophic drop in RAM prices and in hyperscalers' demand to, well... scale. That massive drop will be completely and utterly OpenAI's fault, for attempting to bite off more than it can chew. In order to shore up demand, we'll see NVidia and AMD start selling directly to consumers. We, developers, are consumers and drive demand at the enterprises we work for based on what keeps us both engaged and productive... the end result being: the ol' profit flywheel spinning.
Both NVidia and AMD are capable of building GPUs that absolutely wreck Apple's best. A huge reason for this is Apple needs unified memory to keep their money maker (laptops) profitable and performant; and while, it helps their profitability it also forces them into less performant solutions. If NVidia dropped a 128GB GPU with GDDR7 at $4k-- absolutely no one would be looking for a Mac for inference. My 5090 is unbelievably fast at inference even if it can't load gigantic models, and quite frankly the 6-bit quantized versions of Qwen 3.5 are fantastic, but if it could load larger open weight models I wouldn't even bother checking Apple's pricing page.
tldr; competition is as stiff as it is vicious. Apple's "lead" in inference exists only because NVidia and AMD are raking in cash selling to hyperscalers. If that cash cow goes tits up, there's no reason to assume NVidia and AMD won't definitively pull the rug out from under Apple.
> A huge reason for this is Apple needs unified memory to keep their money maker (laptops) profitable and performant
None of the things people care about really get much out of "unified memory". GPUs need a lot of memory bandwidth, but CPUs generally don't and it's rare to find something which is memory bandwidth bound on a CPU that doesn't run better on a GPU to begin with. Not having to copy data between the CPU and GPU is nice on paper but again there isn't much in the way of workloads where that was a significant bottleneck.
The "weird" thing Apple is doing is using normal DDR5 with a wider-than-normal memory bus to feed their GPUs instead of using GDDR or HBM. The disadvantage of this is that it has less memory bandwidth than GDDR for the same width of the memory bus. The advantage is that normal RAM costs less than GDDR. Combined with the discrete GPU market using "amount of VRAM" as the big feature for market segmentation, a Mac with >32GB of "VRAM" ended up being interesting even if it only had half as much memory bandwidth, because it still had more than a typical PC iGPU.
The sad part is that DDR5 is the thing that doesn't need to be soldered, unlike GDDR. But then Apple solders it anyway.
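For reference, the bandwidth arithmetic is just bus width times transfer rate; ballpark numbers, not official specs:

    # Peak bandwidth ~= (bus width in bits / 8) * transfer rate in MT/s.
    def bandwidth_gbs(bus_bits, mtps):
        return bus_bits / 8 * mtps / 1000  # GB/s

    print(bandwidth_gbs(128, 6400))    # dual-channel desktop DDR5-6400: ~102 GB/s
    print(bandwidth_gbs(512, 8533))    # 512-bit wide bus, M4 Max class: ~546 GB/s
    print(bandwidth_gbs(512, 28000))   # 512-bit GDDR7, 5090 class: ~1792 GB/s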
> Not having to copy data between the CPU and GPU is nice on paper but again there isn't much in the way of workloads where that was a significant bottleneck.
Isn't that also because that's the world we've optimized workloads for?
If the common hardware had unified memory, software would have exploited that I imagine. Hardware and software is in a co-evolutionary loop.
> tldr; competition is as stiff as it is vicious-- Apple's "lead" in inference is only because NVidia and AMD are raking in cash selling to hyperscalers. If that cash cow goes tits up, there's no reason to assume NVidia and AMD won't definitively pull the the rug out from Apple.
These companies always try to preserve price segmentation, so I don’t have high hopes they’d actually do that. Consumer machines still get artificially held back on basic things like ECC memory, after all...
Can we also stop giving Apple some prize for unified memory?
It was the way of doing graphics programming on home computers, consoles and arcades, before dedicated 3D cards became a thing on PC and UNIX workstations.
Can we please stop treating this like some 2000s Mac vs PC flame war where you feel the need go full whataboutism whenever anyone acknowledges any positive attribute of any Apple product? If you actually read back over the comments you’re replying to, you’ll see that you’re not actually correcting anything that anyone actually said. This shit is so tiring.
> That boundary is deliberate: the public box has no access to private data.
Challenge accepted? It’d be fun to put this to the test by putting a CTF flag on the private box at a location nully isn’t supposed to be able to access. If someone sends you the flag, you owe them 50 bucks :)
The gap in your example is that a human had to realize the system is broken so that he could nudge the agent into fixing it. He can fix that gap by updating the agent to recognize when the system breaks. This now becomes the level at which he debugs… did the agent recognize the failure and self-heal, or not?
And at that point, if the autonomous system breaks, realized it’s broken, and fixes itself before you even notice… then do you need to care whether you learn from it? I suppose this could obfuscate some shared root cause that gets worse and worse, but if your system is robust and fault-tolerant _and_ self-heals, then what is there to complain about? Probably plenty, but now you can complain about one higher level of abstraction.
I’m from the US but reside indefinitely in the UK, and I’ve dodged all this crap (age verification, disabling advanced protection) by simply remaining in the US App Store. It has some downsides like I can’t download the Vodafone UK app but nowadays most apps are available globally.
And herein lies the absurdity of the whole legal framework in the first place. Does it apply to tourists? Residents? Citizens? Citizens traveling abroad?