Right. The objective is "correctly predict the entire training set," where that training set contains essentially everything. So the objective becomes: speak every human language and every programming language, understand every topic, master every weird sub-genre of culture. That's an inherently inefficient training objective if all you want is an AI that can do a few specific tasks. It's the whole insight behind models specialized for summarization, text extraction, patch merging, etc.
And don't forget the noise. If you look at the Anthropic papers, it's clear from the examples they give that the dataset is still incredibly noisy even after extensive cleaning efforts. A lot of those parameters are being wasted on predicting garbage text from HTML scraping gone wrong.
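To make the "scraping gone wrong" point concrete, here's a minimal sketch (the HTML and the cleaning step are both made up for illustration) of how naive tag-stripping leaves boilerplate in the extracted text, which then ends up as prediction targets in a pretraining corpus:

```python
import re

# Hypothetical page: one sentence of real content surrounded by chrome.
raw_html = """
<nav>Home | About | Login</nav>
<div class="cookie-banner">We use cookies. Accept all</div>
<p>The actual article text lives here.</p>
<footer>&copy; 2023 Example Corp. All rights reserved.</footer>
"""

# A common naive "cleaning" pass: strip tags, collapse whitespace.
text = re.sub(r"<[^>]+>", " ", raw_html)
text = re.sub(r"\s+", " ", text).strip()

print(text)
# Nav links, the cookie banner, the footer, and even the raw &copy;
# entity all survive, so the model spends capacity predicting them.
```

Real pipelines do far more than this, but the examples in published dataset audits suggest plenty of this kind of residue still gets through.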