I enjoyed reading this at the start, the language is very... inspiring. By the end, I was disappointed. I don't disagree with what they're saying, but the opening style and statements made me expect some more specific or groundbreaking conclusions.
The point seems to be that generative AI just generates stuff, and that real discovery requires variation, evaluation and selective retention.
The call to arms seems based on the assumption that people only ever talk about generative AI as discovery machines themselves. I think it's pretty widely accepted that's not the case by everyone apart from cliche out-of-touch CEOs.
But the talk makes me realise that generative AI are incredible tools to do the discovery cycle with, and this is what I imagine professionally successful AI users are doing: variation, evaluation and selective retention of their inputs and outputs to generative AI.
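To make that cycle concrete, here's a toy sketch of variation, evaluation and selective retention. Everything in it is made up for illustration: `vary` stands in for "ask the generator for alternatives", `evaluate` stands in for whatever check you actually trust (tests, benchmarks, a human), and the integer "candidates" are just a placeholder for prompts or outputs.

```python
from typing import Iterable, List


def vary(candidate: int) -> List[int]:
    """Variation: produce nearby alternatives (a stand-in for a generator)."""
    return [candidate - 1, candidate + 1]


def evaluate(candidate: int, target: int) -> int:
    """Evaluation: higher is better (a stand-in for tests or review)."""
    return -abs(candidate - target)


def discovery_cycle(seeds: Iterable[int], target: int, rounds: int, keep: int = 2) -> List[int]:
    """Repeat vary -> evaluate -> retain the best `keep` candidates."""
    population = list(seeds)
    for _ in range(rounds):
        pool = set(population)
        for candidate in population:
            pool.update(vary(candidate))
        # Selective retention: best score first, ties broken by value.
        ranked = sorted(pool, key=lambda c: (-evaluate(c, target), c))
        population = ranked[:keep]
    return population


best = discovery_cycle(seeds=[0], target=5, rounds=5)
print(best)  # [5, 4]
```

The generator only ever does the `vary` step; the discovery comes from the loop around it.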
Sorry, this post gets me irrationally irritated and makes me want to shake you and shout.
That website is 95% not you, it's AI, and I feel that's causing you to way over-represent the value of it in your response here, or you're completely misunderstanding what the person you're responding to is asking. If you put all of your effort into that site, without AI, it would be infinitely more valuable and useful.
The person you responded to asked for specific things, including:
- objective, unbiased measurements, but all that page has is a side by side visual comparison of outputs.
- their different generations, but all you included was the outputs
- details on the prompts and little things people are adding because they feel they need to, but you didn't include any of that
This is slop, it's the exact sort of self-confirming fluffy AI stuff that other engineers, either inexperienced or over-invested in AI, will look at briefly, skim, see quick visual validation, and nod, noting down how much better Fable must be without getting any actual data.
Sorry, it's early, and maybe this is a misplaced rant, but the person you responded to specifically asked for precise, quantitative things precisely because everything else is fluffy slop like this, and people don't even recognise they're doing it any more.
Check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person that has advance notice and access to models.
Fable just got announced and I did a rush out article because people are curious. I released the post mere hours afterwards and it takes time to create the output, slice into videos, make a wordpress article on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.
If you check the links my previous articles have all the juicy stuff you are criticising me for not having with little preparation.
How is a side by side direct comparison NOT precise?
I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.
I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.
Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.
My good lord Tezza. You still have a calm and composed response after that sort of insult being thrown at you. Haven't seen one this bad for quite some time on HN. I hope you have a great day.
It reads like an unhinged rant about AI and the engineers who use it, with the entitled tone of people who think they have permission to insult someone's competence and work because AI was used.
In my opinion, if one cannot express themselves civilly, they should refrain from commenting.
I disagree. I wouldn't consider it unhinged. I'm clearly aware of my own frustration. It's also relatively civil, since I was able to temper it with appropriate apologies and acknowledgements. Many other people agree and support the sentiment of what I'm saying.
AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.
It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.
I found the website you ranted about interesting, comparing the quality of the visualization between the different models.
I don't think it was "a huge waste of time" or needed your rant.
You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.
What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.
This is slop, in the sense that it looks like a lot of useful work and effort, and AI is heavily involved, and it was offered up when the opposite was requested, meaning it's not at all helpful in this context.
I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.
I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.
The issue here, I feel, is that LLMs are increasingly leading people to think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.
You think it was civil when the comment started with:
> this post gets me irrationally irritated and makes me want to shake you and shout
Yes, criticism of my work would not generally be a personal insult.
However, if you were to call my work 'slop', and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level. This is not a civil or respectful way to talk to someone.
> You think it was civil when the comment started with:
>> this post gets me irrationally irritated and makes me want to shake you and shout
Did you read the rest of the comment? The rest of it is civil. It's normal for people to start by saying something like "this makes me frustrated" as a preface to indicate their feelings, and then not actually act frustrated and instead calmly work through their thoughts. That is a meatspace social convention (not just an online one) - are you not aware of it?
> However, if you were to call my work 'slop'
And, as previously established, if you use AI, it's not your work.
> and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level
...and those are still criticisms of your work, not yourself.
The actual problem here is that you are taking offense to things that are not offensive, not that the parent poster was being uncivil. Thinking that calling someone "inexperienced" is a personal insult is absolutely insane. That's a wildly miscalibrated sense of how social dynamics work and what it actually means to insult someone.
simonw's pelicans probably wouldn't get posted in response to a request for a more quantitative analysis.
You and others are right though, that there's potentially interesting or enjoyable stuff in there (maybe I should have led with that?). It's just that a large volume of it is not useful in response to a question specifically looking for more quantitative or detailed usage analysis.
You just responded to a comment giving a specific advantage that applies to Rust and a few other languages, and which doesn't apply to a large number of other languages (single static binary).
How can you still not see any advantage? Or was the point of your comment to say that you think the only real motivation is self or Rust promotion, suggesting some dishonesty amongst the people you're responding to?
> When I returned from five months of paternity leave in early 2024, the org needed someone at my level to lead GenAI work in developer experience and I was the available person. It wasn't a bet so much as an opportunity I recognised when it showed up. I took enough time to satisfy myself that this wave was different from previous hype cycles before I committed, but once I did, I had to drop most of the rest of my work on developer experience to make space for it. For a while I was the only engineer at any seniority dedicated to this, and the depth I built up happened by necessity as much as by design.
Followed by....
> Senior engineers in AI-forward orgs are doing more leveraged, more hands-on, more meeting-heavy work simultaneously, with the human-focused parts of the role paying for it. The build cost collapsed, the alignment cost rose, the thinking time disappeared, and the productivity gains got captured by output volume rather than output quality. I
What is the job of a senior/lead engineer if not to take the uninformed hype chasing of the senior business, and deploy it in a way that makes things better?
I can't help but feel this senior engineer is talking far too casually about how - under their watch as senior AI engineer chap - engineers spend more time on throwaway code, have less personal development/1-to-1s, and didn't improve code quality. They haven't even mentioned the added financial token cost.
The only thing standing in the way of greedy hype-chasing CEOs and a post-apocalyptic wasteland is engineers taking their crazy requests and not making the world worse, and it sounds like the author has failed here. I think it's very positive and frank to share their experience, but I'm surprised they don't seem to see their role in it.
> It's not an engineer's job to fix upper management delusions, and engineering is poorly equipped for that in any case.
What is the job of engineering leadership in your mind then, if not to take requirements from the business and convert them into technical solutions which improve things for the business? The requirement here was "AI to improve DX"; the outcome was that developers' lives are worse in many creative and future-impacting ways.
I sincerely would love to know if the author or their employer consider what they've done as successful or not.
I like work in this area, and this is really helpful, thanks. I actively avoid cloud based LLMs and mainly use 4b - 30a3b param local models. This means I don't really have a good grasp of SOTA LLM performance or accuracy, but I know what to expect when dealing with local models, and where the pain points are.
I've only skimmed the post and read the abstract and in some places you make a nod to how simple tweaks can make something 10x faster/slower, but then all of your metrics and data seem to focus 100% on accuracy. You need to address speed.
Specifically for agentic workflows and local models, accuracy around function/tool calling hasn't been a problem for me now for about 6 - 12 months, personally, since around QwenCoder3. The main issue is context management and the impact on timing, since agents will often swap prompts and break prompt caching and similar timing improvements.
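A rough illustration of why swapping prompts hurts, assuming the usual prefix-style caching where only the tokens after the first changed position need reprocessing. The token lists and function names here are hypothetical, and real inference servers differ in the details.

```python
from typing import Sequence


def shared_prefix_len(previous: Sequence[str], current: Sequence[str]) -> int:
    """Number of leading tokens the two prompts have in common."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


def tokens_to_reprocess(previous: Sequence[str], current: Sequence[str]) -> int:
    """Tokens that cannot be served from a prefix cache of `previous`."""
    return len(current) - shared_prefix_len(previous, current)


history = ["system_a", "tools", "user_1", "reply_1"]

# Appending a turn keeps the whole cached prefix valid.
appended = history + ["user_2"]
print(tokens_to_reprocess(history, appended))  # 1

# Swapping the system prompt changes token 0, so nothing after it is reusable.
swapped = ["system_b", "tools", "user_1", "reply_1", "user_2"]
print(tokens_to_reprocess(history, swapped))  # 5
```

With real prompts that's the difference between re-reading a few dozen tokens and re-reading the entire context on every agent step.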
It looks like your work adds layers and wrappers like guard rails and retries. This would make my local model experience - specifically for agents - unusable because of the delays it would add.
I really appreciate and respect the work you've done, and apologies if you have already addressed this head on, but with so little talk about the impact on timing here, I feel like you're hiding something or overinflating the actual real world improvements here - what are your thoughts?
It's also mildly concerning me that nobody else has raised this - am I doing something wrong here, or is everyone else just not actually using local models in real life?! Talk to me about your speed experiences!
I agree accuracy maybe isn't the best word here, I used it as it was used in the original post, mainly as a catchall for "everything but speed", so fidelity, perplexity, etc.
I also agree that if I spent more time using cloud based LLMs, I would very much find local LLMs less capable and useful. Comparison is the thief of joy though, and I'd rather feel blissful ignorance towards SOTA LLMs than a dependence on them.
Before taking a local focus approach, LLMs increasingly left me feeling a mixture of FOMO, sadness and futility towards the future of software and tech. I assume it's 100% a me problem, but it has its benefits :)
No, I'm a fan of local as well. For me though, there is just such a fascination that I can have something like this sitting on my own hard drive. It's okay that it's not a "frontier model".
Hi! Latency is definitely a factor in any system, and the dashboard and paper do report elapsed time - but at the workflow level.
On a per-call basis, the wrappers are pure Python ifs and such, easily measured in ms, and frankly negligible compared to the LLM call itself, which will be on the order of seconds.
Where timing gets interesting is that forge will slow down workflows because the retries mean you don't error right away. Bare runs were failing fast in my experience. But on a per-call basis there's very little overhead.
I haven't detailed it simply because the order of magnitude of a single LLM call is so much higher than all the overhead put together.
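For anyone who wants the shape of that tradeoff, here's a minimal sketch. It is not forge's actual code: `call_with_retries`, the fake flaky model, and the 2-second per-call figure are all assumptions for illustration. The point is that the wrapper itself is a loop and an `if`, and the extra wall-clock time comes from the additional model calls.

```python
from typing import Callable, Optional, Tuple

SIMULATED_CALL_SECONDS = 2.0  # assumed latency of one LLM call, illustrative only


def call_with_retries(
    call: Callable[[], str],
    validate: Callable[[str], bool],
    max_retries: int = 2,
) -> Tuple[Optional[str], int]:
    """Call the model until the output validates or retries run out.

    Returns (result, attempts); result is None if every attempt failed.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        result = call()
        if validate(result):
            return result, attempts
    return None, attempts


def make_flaky_model(fail_first: int) -> Callable[[], str]:
    """Fake model: returns junk for the first `fail_first` calls, then a tool call."""
    state = {"calls": 0}

    def model() -> str:
        state["calls"] += 1
        if state["calls"] <= fail_first:
            return "not json"
        return '{"tool": "search"}'

    return model


def is_tool_call(text: str) -> bool:
    """Crude guard rail: does the output look like a JSON object?"""
    return text.startswith("{") and text.endswith("}")


def estimated_seconds(attempts: int) -> float:
    """Workflow time is dominated by the number of model calls."""
    return attempts * SIMULATED_CALL_SECONDS


result, attempts = call_with_retries(make_flaky_model(fail_first=1), is_tool_call)
print(result, attempts, estimated_seconds(attempts))  # {"tool": "search"} 2 4.0
```

A bare run of the same flaky model would have failed after one call (about 2 seconds in this toy); the wrapped run takes two calls but comes back with a usable result.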
Hi! Thanks for the response. Like I mentioned, I only skimmed, and it sounds like there's more to it than I understand, so I'll take a deeper look and see how it feels in practice.
> Where timing gets interesting is that forge will slow down workflows because the retries mean you don't error right away. Bare runs were failing fast in my experience. But on a per-call basis there's very little overhead.
> I haven't detailed it simply because the order of magnitude of a single LLM call is so much higher than all the overhead put together.
Yeah, that makes sense and seems fair. Those sorts of delays are almost an inevitability: you're not trying to improve speed, but by improving reliability, you can obviously increase overall throughput.
Having watched the demo video too now, automating retries etc would be helpful for me. It's impressive to see how quick the models run on better hardware, and the performance improvements are impressive, even if the overall run takes longer sometimes because it does more correct things. Thanks again!
Yup, confirming what pamcake said, 30b with 3b active.
I have a laptop with a broken screen and an RTX2060 at my disposal. I can run 12b - 14b dense usably, just, although I think 4b - 8b dense models give me the best tradeoff of speed and usefulness.
Larger MOE models with more parameters (20b+) but fewer active (2 - 3b) are sometimes a little bit slower, but are often far more capable.
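Some back-of-envelope numbers behind that, as a sketch only. It assumes 4-bit quantised weights and ignores KV cache, activations and runtime overhead, so treat the outputs as rough lower bounds rather than measurements. The general idea is that all of a MoE model's weights have to live somewhere (VRAM or offloaded to system RAM), while per-token compute scales with the active parameters.

```python
def weight_gigabytes(total_params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight storage in GB for a quantised model (weights only)."""
    return total_params_billions * bits_per_weight / 8.0


def active_fraction(active_params_billions: float, total_params_billions: float) -> float:
    """Share of the weights used for each token in a MoE model."""
    return active_params_billions / total_params_billions


# 12b dense: every parameter is stored and used for every token.
print(weight_gigabytes(12))    # 6.0

# 30b total, 3b active: much more to store, far less compute per token.
print(weight_gigabytes(30))    # 15.0
print(active_fraction(3, 30))  # 0.1
```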
Personally, I don't see this as people punching someone who's down. This is the sort of real life experience and necessary context from actual technical users that I come to HN comments for.
Someone is just asking to get Google's side and explaining why they want that, which seems reasonable since we're in a post where Google is being punched/blamed for this, and it sounds like it isn't Railway's first questionable outage.
You seem to be arguing that vibecoding photoshop wasn't possible up until 2 months ago, with GPT 5.4/5.5.
That's a very, very weird take on many, many levels. Could you elaborate a bit about where that view came from, how often you use AI, what's your career etc.?
Good point. I guess I feel they're still getting into position there, and haven't really had their opportunity to blossom yet. The average western citizen's experience of war is still just slightly increased food and fuel prices.
For what purpose? Murder? Arson? It's amazing how often people say things like "no one is above the law" whenever it's convenient, then totally flip the script when it's not.
why is it that you give a pass to the violence and death in the dozens, hundreds, thousands and millions at the hands of billionaires who regularly kill for profit...
yet balk at someone deciding to fight back in kind and on an exponentially smaller scale, comparatively speaking?
I went the opposite way: I started with UHK, then went for a ZSA moonlander, but settled on a kbdcraft Israfel, which is a relatively cheap, split ortholinear.
I felt most of the extra functionality and polish that I guess makes up the massive costs of UHK and ZSA wasn't actually necessary. It was cool and fun and useful to try a bunch of different stuff, but then over time, I wanted things to be simple and small which UHK and ZSA Moonlander aren't (ZSA voyager wasn't at the time).
All I'm saying is if you've got comfortable with a cheap Corne, I think you might feel underwhelmed if you spend a lot on something a lot fancier.