I don't know about that. LLMs have been trained mostly on text. If you add photos, audio, and video, and later even 3D games or 3D video, you get massively more data than plain text alone, maybe by orders of magnitude. And that is certainly something that could improve cognition in general. Getting to AGI without audio, video, and 3D perception seems like a non-starter. And even if we think AGI is not the goal, further improvements from these new training datasets are certainly conceivable.
That's been done already for years. OpenAI were training on bulk AI-transcribed YouTube videos back in the GPT-4 era. Modern models are all multimodal, co-trained on audio and image tokens together with text.
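For what it's worth, here's a minimal sketch of what co-training on image and text tokens means mechanically: both modalities get projected into the same embedding space and concatenated into one sequence before the transformer. All the names and dimensions below are made up for illustration, not any particular model's internals:

    import torch
    import torch.nn as nn

    d_model, vocab_size, patch_dim = 512, 32000, 16 * 16 * 3

    # Each modality gets its own projection into the shared embedding space.
    text_embed = nn.Embedding(vocab_size, d_model)  # discrete text tokens -> vectors
    patch_embed = nn.Linear(patch_dim, d_model)     # flattened 16x16 RGB patches -> vectors

    text_ids = torch.randint(0, vocab_size, (1, 20))  # a 20-token text snippet
    patches = torch.randn(1, 64, patch_dim)           # 64 image patches

    # One interleaved sequence; a decoder-only transformer consumes it as a
    # single stream. (In practice images/audio are often discretized into
    # codebook tokens first so the usual next-token loss applies everywhere.)
    sequence = torch.cat([text_embed(text_ids), patch_embed(patches)], dim=1)
    print(sequence.shape)  # torch.Size([1, 84, 512])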
The AI companies are not only out of such data; their access to it is shrinking as the platforms hosting it (YouTube, for one) wall them off.
Also, even if we lacked the data for Chinchilla-optimal scaling, that wouldn't be the same as being unable to scale at all; it would just require larger models and more flops than we would prefer.
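To put rough numbers on that, here's a toy calculation using the parametric loss fit from Hoffmann et al. (2022), L(N, D) = E + A/N^alpha + B/D^beta with compute C ~ 6ND. The constants are the published fits, but treat the exact figures as illustrative:

    # Chinchilla parametric loss fit (Hoffmann et al. 2022, Approach 3).
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        return E + A / N**alpha + B / D**beta

    def flops(N, D):
        return 6 * N * D  # standard approximation: ~6 flops per param per token

    # The compute-optimal Chinchilla point: 70B params on 1.4T tokens.
    N_opt, D_opt = 70e9, 1.4e12
    target = loss(N_opt, D_opt)

    # Cap the data at half the optimal amount and solve
    # A / N**alpha = target - E - B / D_cap**beta
    # for the N that still reaches the same loss.
    D_cap = D_opt / 2
    N_needed = (A / (target - E - B / D_cap**beta)) ** (1 / alpha)

    print(f"optimal:     N={N_opt:.2e}, D={D_opt:.2e}, flops={flops(N_opt, D_opt):.2e}")
    print(f"data-capped: N={N_needed:.2e}, D={D_cap:.2e}, flops={flops(N_needed, D_cap):.2e}")

In this toy run, halving the data still reaches the same loss, but it takes a ~345B-parameter model and roughly 2.5x the flops. Which is exactly the point: a data shortfall raises the price of further scaling rather than capping it.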