
1. There's nothing particularly special about Transformers, except that they were the first architecture to scale.

Transformers scale well because they leverage GPUs well. The drawback is the attention mechanism, which is RAM-hungry and limited to a finite autoregressive window.

2. Transformers aren't "computers" in the Turing machine sense, given their finite state. You could make that argument for models with hidden states, though, like RNNs/LSTMs, SSMs/Mamba, and RWKV. The issue with those is that the hidden state makes them harder to train at scale.
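
To make the contrast concrete, here's a minimal NumPy sketch (toy dimensions, nothing from a real model): the RNN carries a fixed-size hidden state across an arbitrarily long stream, while the transformer has to materialize its whole finite window to attend over it.

  import numpy as np

  rng = np.random.default_rng(0)
  d = 8                                      # toy model width
  Wh = rng.normal(size=(d, d)) / np.sqrt(d)
  Wx = rng.normal(size=(d, d)) / np.sqrt(d)

  # RNN: constant O(d) state, can keep consuming tokens indefinitely
  h = np.zeros(d)
  for x in rng.normal(size=(10_000, d)):     # stream of arbitrary length
      h = np.tanh(Wh @ h + Wx @ x)           # state size never grows

  # Transformer: attention needs the whole (finite) window at once
  n = 512                                    # fixed context length
  X = rng.normal(size=(n, d))
  scores = X @ X.T / np.sqrt(d)              # (n, n) score matrix
  A = np.exp(scores - scores.max(-1, keepdims=True))
  A /= A.sum(-1, keepdims=True)              # row-wise softmax
  out = A @ X                                # memory is O(n^2), not O(d)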



> 2. Transformers aren't "computers" in the Turing machine sense, given their finite state

That's true, but then the things we normally think of as "computers" today also aren't, because of their finite memory. I think the author is implying that, as with linearization, we can approximate transformers as Turing-complete over a small operating range (one that's much larger for "real" computers).


Well, scaling is a big deal, since that is where a lot of the power comes from. The way transformers leverage parallel hardware by processing tokens in parallel also makes them more general than just sequence-to-sequence, maybe more like graph-to-graph, which is what has allowed them to also be used for things like vision.
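
One way to see the "more general than sequence-to-sequence" point: self-attention without positional encodings is permutation-equivariant, i.e. it treats its input as a set, which is why the same machinery carries over to image patches. A quick NumPy check (toy shapes, my own illustration):

  import numpy as np

  def self_attention(X):
      # plain single-head attention, no positional information
      scores = X @ X.T / np.sqrt(X.shape[1])
      A = np.exp(scores - scores.max(-1, keepdims=True))
      A /= A.sum(-1, keepdims=True)
      return A @ X

  rng = np.random.default_rng(0)
  X = rng.normal(size=(6, 4))        # 6 "tokens" -- or image patches
  perm = rng.permutation(6)

  # permuting the inputs just permutes the outputs the same way
  assert np.allclose(self_attention(X[perm]), self_attention(X)[perm])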

There's a common sentiment that data/scale is more important than architecture, and that other architectures would perform just as well if you scaled them up (to a degree that is practical, of course), but I'm not sure that is totally true. Of course scale is vital to performance, but scaling the wrong architecture isn't going to get you there. For example, you need the ability to attend to specific tokens, and munging the sequence history into a single vector the way RNNs do is going to cap performance, which is why you see hybrid architectures like Jamba (Mamba + Transformer).
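
To illustrate the "attend to specific tokens" point with a toy example (made-up numbers, not from any paper): when a query matches one key, softmax attention can put essentially all of its weight on that single token, something a fixed-size RNN summary vector can't do.

  import numpy as np

  rng = np.random.default_rng(1)
  d, n = 16, 100
  keys = rng.normal(size=(n, d))
  query = 4.0 * keys[42]                   # query strongly matches token 42's key

  scores = keys @ query / np.sqrt(d)
  w = np.exp(scores - scores.max())
  w /= w.sum()                             # softmax attention weights

  print(w[42])                             # ~1.0: nearly all mass on token 42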

I think there is something special about Transformers - something that the developers accidentally got right! The key-based attention mechanism and the way attention heads in adjacent layers can pair up to form induction heads seems to be what makes them unreasonably effective at learning from language...

It's a shame there hasn't been much (any?) discussion of this that I'm aware of - perhaps just because the detailed architecture was more "accidental" than strategic. It seems the history is that Jacob Uszkoreit had the basic idea of language being more hierarchical than sequential, and therefore amenable to parallel processing (with a multi-layer architecture), but in initial implementations he wasn't able to get the performance to beat other contemporary approaches such as convolution. Noam Shazeer was apparently the wizard who took the basic idea, threw a lot of inspiration/experience (plus the kitchen sink?) at it, and was able to make it perform - apparently coming up with this specific attention mechanism. There was then an ablation process to simplify the architecture.


I have a blog post coming in a few days on this topic.

But basically, I agree, architecture doesn't really matter. The number of parameters and the training data matter.

The tradeoff transformers make is that you saturate the hardware really efficiently (so: easy to parallelize & scale), at the cost of the n^2 scaling of the attention mechanism.

This turns out to be a great tradeoff if what you're doing is scaling models to 7B+ parameters.
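
Back-of-envelope on that n^2 term (assuming fp16 scores, per head per layer; real kernels like FlashAttention avoid materializing the full matrix, but the compute still scales quadratically):

  # bytes for one (n, n) fp16 attention-score matrix
  for n in (1_024, 8_192, 131_072):
      print(f"n = {n:>7,}: {2 * n * n / 2**30:8.2f} GiB")
  # n =   1,024:     0.00 GiB
  # n =   8,192:     0.12 GiB
  # n = 131,072:    32.00 GiB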


Uszkoreit has mentioned that the global/quadratic attention was (paraphrasing) considered overkill, but the brute-force parallelism this simple approach allowed made that irrelevant.

But of course that changes when you scale up the context size enough, and the fix is simple, since the key insight of the architecture was the hierarchical, tree-like nature of language and thus its dependence mostly on local (within-branch) context, not global context. In Google's Big Bird attention (from their ELMo/BERT Muppet era!) they basically use sliding-window local attention, augmented with a fixed number of global tokens with global attention, plus some random attention to further-back non-local tokens. This mixture of attention patterns performs almost as well as global attention. Part of the reason (aside from the mostly local nature of language) is that as you ascend the transformer-layer hierarchy, receptive field sizes increase (same as they do in a CNN), so even with the attention gaps of random attention, there is still visibility at higher layers.
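
A rough sketch of that mixed pattern as a boolean attention mask (window size, global-token count, and random-link count here are made-up toy values, not Big Bird's actual hyperparameters):

  import numpy as np

  def bigbird_style_mask(n, window=3, n_global=2, n_random=2, seed=0):
      rng = np.random.default_rng(seed)
      mask = np.zeros((n, n), dtype=bool)
      for i in range(n):
          lo, hi = max(0, i - window), min(n, i + window + 1)
          mask[i, lo:hi] = True                         # sliding-window local attention
          mask[i, rng.choice(n, size=n_random)] = True  # random long-range links
      mask[:n_global, :] = True                         # global tokens attend everywhere
      mask[:, :n_global] = True                         # ...and everyone attends to them
      return mask

  m = bigbird_style_mask(64)
  print(m.sum(), "of", m.size, "entries kept")          # O(n) entries vs n^2 for full attention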


> There's nothing particularly special about Transformers, except that it was the first architecture to scale

This seems like a problematic framing. Is that not the definition of particularly special?


Only in the sense of "has an attribute others we tried before didn't have". Not "uniquely possesses that attribute".


That doesn’t make it any less important or remarkable though?

I guess I’m struggling to understand the push to trivialize it.

Transformers came onto the scene and the entire space exploded. Setting aside any technical/theoretical lack of uniqueness, it’s hard to ignore the real world results.


>That doesn’t make it any less important or remarkable though?

In the historical sense?

Because in the practical sense, if it's just the first to have this attribute, but we find others with the same attribute (as we apparently did), then it's not really important or remarkable anymore.


Transformers are special because theoretically you can't make something "weaker" than a transformer without it losing some accuracy: https://arxiv.org/abs/2209.04881.


> Despite a remarkable amount of algorithmic effort on Boolean satisfiability (SAT) and related problems, to date no one has invented an algorithm with faster-than-exponential (O(2^n)) running time; indeed, there is no polynomial-time algorithm for SAT unless P = NP.

  [1] https://www.hindawi.com/journals/complexity/2018/7982851/
  [2] http://www.cs.cornell.edu/~sabhar/publications/learnIJCAI03.pdf
[1] shows how to replace a numerical problem-solving process with a SAT (circuit) based one and obtain an exponential speedup.

Let me quote [2]: "We also show that without restarts but with a new learning scheme, clause learning can provide exponentially smaller proofs than regular resolution, which itself is known to be much stronger than ordinary DPLL."

So, in my opinion, there are algorithms that work in O(2^(n(1-ε))) time.


Yes, Transformers are probably close to the Pareto frontier in terms of efficiency if you want to train something that takes a sequence as input.

But they're not inherently *special*. There's a bunch of other model types around that Pareto frontier. Transformers are just good at saturating memory bandwidth, which is the hardware frontier.



