Hacker Times | WASDx's comments

I think this is inevitable. Sooner or later, model-specific ASICs will make economic sense. We're already seeing it happen with Taalas/Cerebras, so I think it's sooner than 5 years. And inference is an order of magnitude faster, which is amazing.

> distributed LLM inference

This seems extremely inefficient given the data transfer between model layers when the model is distributed. I found a project called Petals that claims up to 4 tok/s for a 180B model, although its repository hasn't been updated in two years.

https://petals.dev/


For token generation, yes: because current-gen LLMs are autoregressive, you need to add the inter-node latency for every single token.

For prompt processing it would work though, and it could work for diffusion LLMs as well.
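To put rough numbers on that, here is a back-of-the-envelope sketch. The hop count, latency and compute figures are made up for illustration, and bandwidth is ignored entirely; only the shape of the two formulas matters.

```python
def decode_time_s(num_tokens, hops, hop_latency_s, compute_per_token_s):
    """Autoregressive decoding: each token must cross every hop
    before the next token can even start."""
    return num_tokens * (compute_per_token_s + hops * hop_latency_s)


def prefill_time_s(num_tokens, hops, hop_latency_s, compute_per_token_s):
    """Prompt processing: all tokens travel through the nodes as one batch,
    so the hop latency is paid once. Compute is pessimistically kept serial."""
    return num_tokens * compute_per_token_s + hops * hop_latency_s


# Hypothetical setup: 8 network hops at 50 ms each, 10 ms compute per token.
decode = decode_time_s(100, hops=8, hop_latency_s=0.05, compute_per_token_s=0.01)
prefill = prefill_time_s(100, hops=8, hop_latency_s=0.05, compute_per_token_s=0.01)
print(f"decode:  {decode:.1f} s for 100 tokens")
print(f"prefill: {prefill:.1f} s for 100 tokens")
```

With these assumed values, generation is dominated by network latency (roughly 41 s) while prompt processing is dominated by compute (roughly 1.4 s).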


I like this one, although its data seem to overlap with ECI.

https://artificialanalysis.ai/trends


https://chatjimmy.ai/ from Taalas also feels like that.


I think their "code" ranking is biased towards visual aesthetics more than raw coding as the voters are just asked which generated website they prefer.


I've had mostly problem-free experiences with IntelliJ (an Ultimate-only feature, I think). One click finds declarations both in business code and buried deep in libraries.


Following the code via an IDE is indeed easy in javaland, but if you don't have a breadcrumb yet... A Spring Boot app you didn't architect yourself is indeed annoying to navigate.

Everything can be an entry point and it's often non-obvious how things are structured.

More opinionated frameworks that require routes and consumers to be centrally managed are generally easier to figure out from the filesystem.

But if you've got an IDE like IntelliJ, you get the entry point tool, which lists all endpoints. Consumers are more annoying...


gemma-4-31B-it-assistant is a 0.5B model. So its performance would likely be comparable to other models of that size.


I think this is the future. When models start converging at "really good" (which I think is already happening) then burning them into ASIC silicon is the natural next step.

Harnesses can keep improving with a fixed model and the throughput opens up new possibilities like doing 10x more "thinking" or exploring parallel paths and picking the best.
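"Exploring parallel paths and picking the best" is essentially best-of-N sampling. A minimal sketch of what a harness could do with the extra throughput; `toy_generate` and `toy_score` are hypothetical stand-ins for a real model call and a real verifier:

```python
import random


def best_of_n(generate, score, prompt, n):
    """Sample n candidates for the same prompt and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-ins: a real harness would call the model and a verifier
# (tests, a reward model, a linter, ...) instead of these toy functions.
rng = random.Random(0)


def toy_generate(prompt):
    return f"{prompt}-{rng.randint(0, 999)}"


def toy_score(candidate):
    return int(candidate.rsplit("-", 1)[1])


best = best_of_n(toy_generate, toy_score, "draft", n=10)
print(best)
```

The cost scales linearly with n, which is why this only gets attractive once tokens are cheap and fast.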


I was impressed enough by AI finding vulnerabilities in source code, but doing it in binary executables is just amazing. This has so much potential, good and bad.

And yet another lesson to not treat data as instructions. Sanitize all user input!


Transformers were literally designed for translation.

As we have known for a while, they ended up being really good at translating source to source or text to source. It shouldn't be too surprising they are also really good at understanding the asm version too.

Doesn't make it any less impressive, but maybe less surprising.


Creating a custom tuple class to use as key could be faster though. Nested map lookups have less efficient memory access patterns.
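A small sketch of the two layouts, in Python for brevity. `PairKey` and the sample data are made up for illustration, and whether the flat version is actually faster depends on the language and runtime, so measure before switching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairKey:
    """Hypothetical two-part key. A frozen dataclass gets __eq__ and __hash__
    generated from its fields, so it can be used as a dict key."""
    first: str
    second: int


# Nested maps: two hash lookups, each landing in a different table.
nested = {"a": {1: "x", 2: "y"}, "b": {1: "z"}}

# Flat map with a composite key: one hash lookup in a single table.
flat = {
    PairKey(first, second): value
    for first, inner in nested.items()
    for second, value in inner.items()
}

value_nested = nested["a"][2]
value_flat = flat[PairKey("a", 2)]
print(value_nested, value_flat)
```

The flat layout also makes "does this pair exist" a single membership check instead of two.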

