Hacker Times | in-silico's comments

> But it is another good example that "AI" is just glorified search and there is not reasoning or thinking going on behind the covers

A bold claim given that the current top post on HN is "An OpenAI model has disproved a central conjecture in discrete geometry": https://qht.co/item?id=48212493


While there is a limit to the amount of information you can fit in a fixed-size state, the theoretical ceiling is pretty high.

A Hebbian associative matrix (one of the simplest and weakest memory constructions) can store about 0.7 bits of information per parameter. If you have a state with 300M parameters (the size of a Llama 3 8B KV cache at 10K context length), and a context with 2.1 bits of entropy per token (a reasonable estimate), then the state can encode 100M tokens' worth of information (300M × 0.7 bits = 210M bits, divided by 2.1 bits per token).

Real models obviously aren't powerful enough to operate at the limit, but you can see why this is a promising research direction.
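
For anyone who wants to plug in their own numbers, here is a minimal sketch of the back-of-the-envelope estimate above. The function name `token_capacity` is made up for illustration, and the inputs are just the rough figures from this comment, not measured values.

```python
def token_capacity(state_params: float, bits_per_param: float, bits_per_token: float) -> float:
    """Idealized number of tokens' worth of information a fixed-size state could hold."""
    total_bits = state_params * bits_per_param  # storage capacity of the state
    return total_bits / bits_per_token          # divide by information content per token

# Figures from the comment: 300M parameters, 0.7 bits/parameter, 2.1 bits/token.
capacity = token_capacity(300e6, 0.7, 2.1)
print(f"{capacity:,.0f} tokens")
```

This is an upper-bound style estimate only; it says nothing about whether a trained model gets anywhere near it.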


> context with 2.1 bits of entropy per token

Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter, and tokens encode a lot more than that: sometimes full words, and with multimodal even more. If KV cache embeddings are storing more than just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.


> Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter

The reference I always go back to is the GPT-3 paper. The cross-entropy loss (an upper bound for entropy) got down to 1.75 nats (2.5 bits). I took 2.1 because 2.5 is an upper bound and I wanted the estimate to end up as a round number.
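
The nats-to-bits conversion is just a division by ln 2. A quick sketch (the helper name `nats_to_bits` is mine, not from any paper):

```python
import math

def nats_to_bits(nats: float) -> float:
    """Convert an information quantity from nats (base e) to bits (base 2)."""
    return nats / math.log(2)

# Cross-entropy loss quoted above: 1.75 nats per token.
loss_bits = nats_to_bits(1.75)
print(f"{loss_bits:.2f} bits per token")
```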

> If KV cache embeddings are storing more than just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.

Here's the thing: the concepts that the model stores in the KV cache are a deterministic function of the input tokens. Similar to the data processing inequality, this implies that no entropy is actually added.

Looking at it mechanically, a sufficiently powerful model only needs to encode the tokens and can recompute concepts later as needed.
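
A toy illustration of the "deterministic function adds no entropy" point, under made-up assumptions: eight equally likely token ids, and a hypothetical "concept" that is a fixed function of the token. The empirical entropy of the concepts is no larger than that of the tokens, and storing both together costs no more than storing the tokens alone.

```python
import math
from collections import Counter

def entropy_bits(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical setup: 8 equally likely token ids.
tokens = list(range(8))
# A "concept" that is a deterministic function of the token (illustrative only).
concepts = [t // 2 for t in tokens]

h_tokens = entropy_bits(tokens)                       # H(X)
h_concepts = entropy_bits(concepts)                   # H(f(X)) <= H(X)
h_joint = entropy_bits(list(zip(tokens, concepts)))   # H(X, f(X)) == H(X)

print(h_tokens, h_concepts, h_joint)
```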


While 100 million tokens sounds like a lot, think about it for a bit, and you’ll see why it is basically nothing. Try to cram a human lifetime of sounds, smells, video and other sensory data into 100 million tokens. Heck, try to process the video of a single series into that window. It just won’t work, it won’t scale, and it is laughable compared to contextual memory. I’m not saying that to belittle the authors of the paper, but the reality is that this has very little to do with transient long-term memory.

You don't remember a lifetime of smells. You don't have any memories from huge swaths of time. There are entire years of your life compressed down to vibes and a handful of events you largely misremember.

That’s a very weak argument. Memories are not exact replicas of experiences. We know that many memories are retained through a lifetime, particularly the ones from early childhood. Unlike computers, we always reconstruct memories from several modalities. Even if we remember largely on vibes, as you say (which is not true when you look into the neuroscience), the sheer amount of information is overwhelming. Again, try to run a 90-minute movie through an LLM memory system. It won’t be able to tell you the plot. That’s before you even feed it sound. Even 100M tokens is not enough for that. You, on the other hand, will largely remember the movies you liked and their major plot lines, and from there be able to reconstruct their scenes. I think the engineers working on memory vastly underestimate the capacity problem of discrete states.

blah blah we know that blah neuroscience blah blah blah.

This isn't an argument you are making. It's just an assertion that you could make an argument if you were so inclined, but won't be doing so at this time; that "science" is obviously on your side, but you can't be bothered to say how, or even give enough detail for someone to check what you are referring to. I can do that too; see the first sentence of this reply.

I don't know how LLM memory systems work. I do know that you don't have a lifetime of remembering everything with high precision. Not only do most people not remember the plot of most of the movies they have seen, they can't reliably list most of the movies they have seen. Not everyone has a good memory. My point is that it's not valid to reference a false model of how human memory works as a reason some specific LLM memory implementation isn't useful for solving some problems.


Exactly, and for a given task you don't need to recall what your friend's brother's name is to do a git commit and push. There's a pull toward more context to make these things better, but also a pull toward making them execute effectively in a small context when appropriate.

I'm more on team small tasks because of my love of Unix piping. I keep telling folks that, as an old Linux dude, seeing subagents work together for the first time felt like learning to pipe sed and awk for the first time. I realized how powerful these could be, and we still seem to be going in that direction.


I think you underestimate just how much information 100M words-ish is. It's like a 300,000-page novel. That's a 50-foot (~15 meter) thick book.

Surely with (much less than) 300K pages you could describe every meaningful detail of a video series' plot. You don't need to remember the exact pixel values.

You can also scale the numbers up. I specifically chose a relatively small model and short context length as a reference, so 100x bigger is not out of the question. At that point, with a 10B token capacity, you are looking at all of English Wikipedia in a single state.
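
Rough arithmetic behind the book comparison, as a sketch. The words-per-page and sheet-thickness constants are my own ballpark assumptions, not measured values, so treat the output as an order-of-magnitude check.

```python
WORDS_PER_PAGE = 333        # assumption: a fairly dense novel page
PAGES_PER_SHEET = 2         # printed on both sides
SHEET_THICKNESS_MM = 0.1    # assumption: typical paper thickness

def book_size(words: float):
    """Return (page count, thickness in meters) for a book with the given word count."""
    pages = words / WORDS_PER_PAGE
    thickness_m = pages / PAGES_PER_SHEET * SHEET_THICKNESS_MM / 1000
    return pages, thickness_m

pages, thickness_m = book_size(100e6)   # 100M words-ish
scaled_tokens = 100e6 * 100             # the "100x bigger" case from the comment

print(f"{pages:,.0f} pages, {thickness_m:.1f} m thick, {scaled_tokens:.0e} tokens when scaled")
```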


They basically just added DeltaNet hypernetworks to existing LLMs.

Nothing super novel or groundbreaking, but a moderately interesting read.


> Keep in mind that all LLMs are trained first on text and then fine-tuned on code.

No, they are trained on a mixture of text and code from the start.


This is true after pretraining, but reinforcement learning allows the model to discover strategies and ideas that weren't in its training corpus.


Are you perhaps thinking of transfer learning, i.e. where training on one subject can be applied to another? RL is more about coercing models in particular directions.


This is not what RL does, and please stop anthropomorphizing statistical modelling, as the model certainly does not discover ideas.


What does RL do then if not discover strategies and solutions that weren't in its training data?


RL adjusts the learned probabilities to conform to a secondary source other than the raw training data, for example (but not exclusively) human feedback. Putting it in extremely simplified terms: If, owing to the training data, the learned probability for "green people are _" is 70% to be followed by "inferior", you may use RL to massage this, de-scoring it every time it produces "green people are inferior to red people" and up-scoring it every time it produces "green people are an ethnic group originating from Greenland". Doing this will adjust its learned probability for that sequence of tokens.

At most, RL can be described as injecting information from a secondary source. It is not extending a model's programming to do anything other than what it was already doing, probability-based token prediction. It simply alters the probabilities.
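
To make "it simply alters the probabilities" concrete, here is a toy policy-gradient (REINFORCE-style) sketch over two candidate completions. Everything in it is hypothetical: the rewards, the learning rate, and the two-way softmax stand in for a real model and a real reward source, and this is not how any particular lab's RL pipeline is implemented.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

completions = ["disfavored completion", "preferred completion"]
logits = [math.log(0.7), math.log(0.3)]  # starts at 70% / 30%, as in the example above
rewards = [-1.0, 1.0]                    # hypothetical secondary signal (e.g. feedback)
learning_rate = 0.5                      # arbitrary choice for the sketch
rng = random.Random(0)

for _ in range(200):
    probs = softmax(logits)
    i = rng.choices(range(len(completions)), weights=probs)[0]  # sample a completion
    # REINFORCE update: d log p_i / d logit_j = 1[i == j] - p_j
    for j in range(len(logits)):
        indicator = 1.0 if i == j else 0.0
        logits[j] += learning_rate * rewards[i] * (indicator - probs[j])

final = softmax(logits)
print({c: round(p, 3) for c, p in zip(completions, final)})
```

The mechanism is still token (here, completion) probabilities; only the numbers move, driven by the reward rather than by the original corpus.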


What about things like AlphaZero and Atari gameplay, where the model has zero prior knowledge and learns superhuman ability purely using RL?

With sufficient RL sampling/training, there's no reason an LLM couldn't similarly develop entirely new skills, especially in verifiable domains like math and code.

> It simply alters the probabilities.

Yes? What else would a learning system do besides alter its behavior? (And you can just sample with argmax or pseudo-randomly if you think probabilities are a problem.)


Functionally, i.e. focusing only on input and output, a model can certainly discover an idea. That’s not anthropomorphism.

Similarly, people often object to using words like “reasoning” and “understanding” in relation to models, but again, functionally, models observably demonstrate both of those qualities - you can test for them and measure their proficiency.

The fact that this discovery, training, and understanding is implemented in terms of a statistical model isn’t really relevant. If it were, you could similarly argue that humans don’t discover, reason, or understand, we just process chemical and electrical signals through our biological neural network.


I wonder how different their method actually is from other sub-quadratic sparse attention methods like Reformer [1] and Routing Transformer [2].

[1]: https://arxiv.org/abs/2001.04451

[2]: https://arxiv.org/abs/2003.05997


Using the logarithmic mean of your range, about 3 kg of CO2 per day, and the fact that the average car emits about 0.2 kg of CO2 per km, a typical day of Gemini coding produces about the same amount of CO2 as a 15 km (~9 mile) round-trip commute by car.
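
The conversion above as a small sketch, using only the two figures quoted in this comment (3 kg of CO2 per day, 0.2 kg of CO2 per km); the function name is made up for illustration.

```python
KM_PER_MILE = 1.609344

def equivalent_driving_km(co2_kg: float, car_kg_per_km: float) -> float:
    """Distance a car would drive to emit the same mass of CO2."""
    return co2_kg / car_kg_per_km

distance_km = equivalent_driving_km(3.0, 0.2)
distance_miles = distance_km / KM_PER_MILE
print(f"{distance_km:.0f} km (~{distance_miles:.0f} miles)")
```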


You can't average it like that because it's not a uniform distribution. (And a place has to be very high in renewables, on the order of 95%, before the emissions aren't dominated by the fossil component.) I don't know what electricity source or region the average datacenter uses.


The ARC-AGI benchmark is basically this already


That is not at all the intention of the ARC team. By the ARC team's definition, passing any single ARC-AGI benchmark does not mean that AGI has been achieved. Instead, AGI would be considered achieved when we are no longer able to come up with new benchmarks that the AI systems do not immediately do well on.


People are trying to solve it with software too, even if you don't hear about it.

The most high-profile example is the latest set of Qwen models, which replace most of the attention mechanisms with Gated DeltaNet (which uses constant memory with respect to sequence length).
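
To show what "constant memory with respect to sequence length" means, here is a minimal sketch of a plain delta-rule associative state. Note the assumptions: this is the ungated delta rule with hand-picked orthonormal keys and a write strength of 1, not the gated variant or the actual Qwen implementation, and the function names are hypothetical.

```python
def delta_update(state, key, value, beta=1.0):
    """Delta-rule write: state <- state + beta * (value - state @ key) key^T."""
    predicted = [sum(row[j] * key[j] for j in range(len(key))) for row in state]
    return [
        [state[i][j] + beta * (value[i] - predicted[i]) * key[j] for j in range(len(key))]
        for i in range(len(value))
    ]

def read(state, key):
    """Retrieve the value currently associated with a key: state @ key."""
    return [sum(row[j] * key[j] for j in range(len(key))) for row in state]

d_k, d_v = 2, 3
state = [[0.0] * d_k for _ in range(d_v)]  # fixed d_v x d_k matrix, whatever the sequence length

k0, k1 = [1.0, 0.0], [0.0, 1.0]            # orthonormal keys (illustrative)
state = delta_update(state, k0, [1.0, 2.0, 3.0])
state = delta_update(state, k1, [4.0, 5.0, 6.0])
state = delta_update(state, k0, [7.0, 8.0, 9.0])  # overwrites the old value stored under k0

value_for_k0 = read(state, k0)
value_for_k1 = read(state, k1)
state_shape = (len(state), len(state[0]))
print(value_for_k0, value_for_k1, state_shape)
```

Three writes went in, but the state is still a single 3x2 matrix; a KV cache would have grown by one key/value pair per token instead.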

Test-time training architectures are also getting a lot of attention, and have shown great performance in the academic setting. It's only a matter of time before we start getting open TTT models.


Modern KV caches can contain up to 1 million tokens (~3000 pages of text). It's not that short; it's like 48 straight hours of reading.


Yes and no, it's not just text, it's images, video, etc, and it's not just the pages of content, it's also all the "thinking" as well. Plus the models tend to work better earlier on in the context.

I regularly get close to filling up context windows and have to compact the context. I can do this several times in one human session of me working on a problem, which you could argue is roughly my own context window.

My point though was that almost nothing of the model's knowledge is in the context, it's all in the training. We have no functional long term memory for LLMs beyond training.


The KV cache isn't memory, it's the saved state of the computation so far, kept so that inference can resume from the point where the last generated output is concatenated with the next input. It's entirely about saving compute and has nothing to do with memory.

This really confuses how stupid LLMs are: they're just text logs as output and text logs as input; hence the goblins are just tokens that seem to problematically be more probable in the output.

But the KV cache is a thing made to keep a session from having to run through the entire inference again. The only sense in which you can call it "memory" is that there are no random perturbations in the KV cache, while there may be in re-running the chat, which ends up being non-deterministic. You can think of it as a deterministic seed that keeps a conversation from its normal non-deterministic output.
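
A toy sketch of the compute-saving view described above: single-head attention where the keys and values for earlier inputs are either recomputed from scratch or pulled from a cache. The `project` function is a hypothetical stand-in for learned projections, and real models have many layers and heads, but the point that the cached and recomputed outputs agree carries over.

```python
import math

def project(x, scale):
    """Hypothetical stand-in for a learned Q/K/V projection."""
    return [scale * xi for xi in x]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]

def recompute_from_scratch(inputs):
    """No cache: rebuild every key and value, then attend from the last input."""
    keys = [project(x, 0.5) for x in inputs]
    values = [project(x, 2.0) for x in inputs]
    return attend(project(inputs[-1], 1.0), keys, values)

class KVCache:
    """Keeps keys/values from earlier steps so each new step only projects one input."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        return attend(project(x, 1.0), self.keys, self.values)

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
for x in inputs:
    cached_out = cache.step(x)
full_out = recompute_from_scratch(inputs)
print(cached_out, full_out)
```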

