It's likely overfit to common harnesses and iteration patterns, so it struggles with formatting tool calls and JSON in our testing, which uses our own harnesses (although there is a lot of overlap with the tools found in any coding harness, like bash, apply_patch, etc.).
We didn't love the results because it draws negative scrutiny to our benchmark, but the results are real and done at scale and I think DeepSeek V4 Pro's inability to do agentic work outside of environments it was trained on is an important thing to measure, especially when so many other models can generalize to new environments just fine.
Google models also struggle with tools, but they have very strong initial answers, so there is more potential for them to bridge the gap with some better post-training.
I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?
Consumers were complaining about the standard 8GB with the early 2020 refresh of MacBook Pros, many OSes ago. Sure, it might be workable for many tasks (as evidenced by the recent sales of the MacBook Neo), but users with a mere 8GB shouldn't have expectations of LLM performance. Even 16GB feels like a stretch.
On a Mac they are the same thing; they're shared. Of course you need some amount for the OS, but if you have an Apple Silicon Mac with 24GB of RAM, you can likely run a 16GB model.
Which most people as a matter of fact don't use. The majority of people with laptops have separate memory pools, their VRAM is nowhere near that, and even on most gaming laptops you aren't getting 16GB of VRAM.
They already provide E2B and E4B that run on (much) smaller devices, including tablets and phones. This fills the gap in the middle. The bigger Gemma 4 models are excellent for their size, but at 8-bit quantization they need about 64GB of VRAM or unified memory. 48GB for 6-bit. Any lower quantization than that, they start to get notably dumber. So, a 12B is interesting for that middle ground.
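For a rough sense of the arithmetic, here is a back-of-the-envelope sketch. It assumes memory is dominated by the weights and scales linearly with bits per weight; it ignores KV cache and runtime overhead, so real requirements will be higher. The function name is made up for illustration, and the 12B size is just the one mentioned above.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width, and
    nothing else (KV cache, activations, runtime) is counted.
    """
    return params_billion * bits_per_weight / 8


# A 12B model at a few common quantization levels.
for bits in (8, 6, 4):
    print(f"12B @ {bits}-bit: ~{weight_memory_gb(12, bits):.1f} GB")
```

Under those assumptions a 12B comes out at roughly 12 GB of weights at 8-bit and 9 GB at 6-bit, which is why it is at least plausible on a 16GB device. The 64GB and 48GB figures quoted for the bigger models follow the same 8:6 ratio.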
Surely they must know the current hurdles, but clearly they know that all the relevant people are monitoring the market for the proper hardware to get and 16GB will be an entry point.
Do you have a collection of these benchmark apps saved anywhere? I'd be particularly interested in seeing the relative cost differences between different models in a use case like this.
But I just vibe-coded a handy list of all the tests I did (unfortunately without the commentary I usually leave in social media posts -- I should add those at some point): https://senko.net/vibecode-bench/
It's always been this way. America has just been able to coast on being the only remaining major economy after WW2, and exploited the rest of the world instead. That exploitation of the rest of the globe has been mostly optimized now, so those shareholder returns are now coming at the expense of the 90% of Americans who aren't sitting at the table.
I think this is quite reductive. America certainly benefited from being one of 2 major powers after WW2. But unlike the USSR, it invested in the world heavily. It rebuilt Allied and Axis manufacturing, and did a lot to revitalize the world economy. They got rich in the process, but it's not like they did nothing. They invented the internet and cured a ton of diseases. They set up a global order of trade that generates real prosperity.
I guess while I agree that American shareholders do reap incredible benefits, coasting is not really something America does. America is more than just shareholders, too. You don't grow the world economy by coasting, and you don't make up 25% of the world's nominal GDP while only making up ~1/25 of its population by coasting. It's inconsistent with reality.
I don't want to rant politically, but I think you are confusing Donald Trump's America Alone with all previous Republicans and Democrats. The idea of us abandoning all alliances and deals for ???? is relatively novel and stupid. The brand of tax cuts and austerity (which is Republican-coded, in my opinion) is what destroyed the UK economy and greatly damaged the US economy (we were more resistant to full austerity than any EU country; the EU made bad economic decisions by rigidly requiring balanced budgets). But we never did abandon our allies completely and step off the world stage. This is what sets a Republican like Reagan apart from Trump. I think they are both terrible on the economy, but Trump is just a disaster for US alliances and soft power. If the roles were swapped, I think there is an incredible likelihood we would have lost the Cold War. It's just not the same Republican party; it's a far, far worse strategy.
I know it's also been argued that WW2 is what allowed the US to coast, but I'd point to the New Deal programs that created demand and made things cheap (like electricity; think TVA). This arguably fueled American industry as it rushed to fill the WW2 demand. All this investment happened before WW2 even started, and I think it doesn't get nearly enough credit for allowing the US to (1) not become fascist and (2) step in and support the large demand of WW2 and the post-war period. If we hadn't invested in the nation, we would not have met post-war demand and would be much poorer today.
Finally, I don't want to be nihilistic or depressed. If we can observe that we did things better at one point, we can still choose a future that is better, regardless of what mistakes we make today. We can make better choices, albeit limited to today's options, that will actually allow the US to raise people higher out of poverty. I don't agree with the idea of betting against American ingenuity; in the long term that bet tends to lose.
Parts of these cities worse off than the third world? Have you been to a third world country? Or Seattle, for that matter?
The commonly scapegoated cities in the United States are not experiencing third world conditions. Appalachia is experiencing third world conditions. Hollowed out rust belt cities in the Midwest are experiencing third world conditions. These areas are not run by lefty politicians. The United States has a systemic problem, not a local one.
And yes, the systemic problem is that there are a tiny number of ultra wealthy people with wildly outsized influence on the government of the United States, doing everything they can to reduce the amount they need to pay in taxes while simultaneously ensuring they extract the maximum amount of profit from the US government's wildly excessive expenditures.
Indeed. Thankfully - as has been proven time and time again in America - if leniency is given to those who abuse their power, they will absolutely never ever decide to abuse their power again.
What kind of mindset do you need to have where you think the only way to prevent someone from doing something is via the threat of imprisonment after the fact? The vast majority of people don't do this, and that's because they don't have the power to do it, not because they don't want to.
That looks like a really nice hackathon! That said, the fact that they probably had a majority of the best NixOS developers in the world under one roof and they weren't solely focused on NixOS error messages is borderline criminal...
It doesn't have to do anything interesting - it's completely fascinating all on its own. If you understand anything about the math and science behind LLMs, you'll understand that this is an achievement worthy of sharing to a community like HN.
That being said, small models like these have plenty of use cases. They allow for extra "slack" to be introduced into a programmatic workflow in a compute constrained environment. Something like this could help enable the "ever present" phone assistant, without scraping all your personal data and sending it off to Google/OpenAI/etc. Imagine if keywords in a chat would then trigger searches on your local data to bring up relevant notes/emails/documents into a cache, and then this cache directly powers your autocomplete (or just a sidebar that pops up with the most relevant information). Having flexible function calling in that loop is key for fault tolerance and adaptability to new content and contexts.
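To make that loop a bit more concrete, here is a toy sketch of the keyword-to-cache part. Everything in it is hypothetical: the documents, the trigger list, and the names `RelevanceCache` and `on_chat_message` are invented for illustration. A fixed keyword set stands in for the small model here; in the setup I'm imagining, the model's function calling would decide what to search for instead.

```python
from collections import OrderedDict

# Hypothetical on-device data; stands in for local notes/emails/documents.
LOCAL_DOCS = {
    "note-1": "Flight to Berlin leaves Friday at 9am, gate B12",
    "email-7": "Invoice for the Berlin conference hotel attached",
    "note-3": "Grocery list: eggs, milk, coffee",
}

# Assumed trigger list; a small model would pick search terms instead.
TRIGGER_KEYWORDS = {"berlin", "invoice", "grocery"}


class RelevanceCache:
    """Small LRU cache of documents surfaced by recent chat keywords."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self._items = OrderedDict()

    def add(self, doc_id, text):
        # Insert or refresh the entry, then evict the least recently used.
        self._items[doc_id] = text
        self._items.move_to_end(doc_id)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)

    def ids(self):
        return list(self._items)


def on_chat_message(message, cache):
    """Search local docs for any trigger keyword found in the message."""
    words = {w.strip(".,!?").lower() for w in message.split()}
    for keyword in sorted(words & TRIGGER_KEYWORDS):
        for doc_id, text in LOCAL_DOCS.items():
            if keyword in text.lower():
                cache.add(doc_id, text)


cache = RelevanceCache(capacity=2)
on_chat_message("Did you book Berlin yet?", cache)
print(cache.ids())
```

The cache is what the autocomplete or sidebar would read from. Nothing leaves the device, and the only part that needs any intelligence is deciding what to look up.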
> Something like this could help enable the "ever present" phone assistant, without scraping all your personal data and sending it off to Google/OpenAI/etc
OK so show me what that's for. Show me something useful you can do with that ability.
> Imagine if keywords in a chat would then trigger searches on your local data to bring up relevant notes/emails/documents into a cache, and then this cache directly powers your autocomplete (or just a sidebar that pops up with the most relevant information).
I'm really trying but.. idgi? I truly cannot imagine how this would improve my life in any way...
> Its cool. Enjoy it.
No. It sounds like a useless complication on my watch. I don't fucking care if it can tell me the phase of the moon. I can look up at the sky and see the moon and know what phase it is.
EDIT: You say:
> If you understand anything about the math and science behind LLMs, you'll understand that this is an achievement worthy of sharing to a community like HN.