
There is no actual distinction between the "real" thing and "mimicking".

The datasets behemoth LLMs are trained on include a lot of noise that derails progress. They also contain a lot of irrelevant knowledge that the LLM has to learn or memorize, so an obscene number of parameters is required.

When you're not trying to teach a language model the sum total of human knowledge and you provide a high quality curated dataset, the scale barrier is much lower.

https://arxiv.org/abs/2305.07759



Why don't people just train LLMs on pure Wikipedia, arXiv, and other scientific websites? That would reduce the noise and cut down on hallucinations, no?


That is essentially Phi. The results are promising.

But LLMs do gain useful "emergent" properties from training on massive lower quality datasets.


It has to learn the meaning of words, including implicit associations, and to do that it needs to see approximately all the English text ever. We don't know how to balance this with only feeding it useful knowledge.


It doesn't necessarily have to see approximately all the English text ever. Real people don't learn English like that, for example.

It's just that given what we know about neural networks, it's often easier and simpler and more effective to increase the amount of training data than to change anything else.


An LLM has nothing in common with a human. An LLM works in ways that have nothing in common with the human brain.


Yes, LLMs and human brains share at most some faint similarities.

Nevertheless, human feats can act as an existence proof of what is possible. Including of what might be possible for a neural network.

(I'm not sure whether a large language model necessarily needs to be a neural network in the sense of a bunch of linear transformations interleaved with some simple non-linear activation functions. But for the sake of strengthening your argument, let's assume that we are assuming this restrictive definition of LLM.)


Real people might not speak every dialect of English. They may not be well versed in local grammatical oddities.


Doesn't seem to be much of a problem in practice?

If someone knew all the math and science in Wikipedia, for example, I think they'd probably be forgiven for not knowing every regionalism.


Unfortunately models aren't always good at knowing what they don't know ("out of distribution data") so it could lead to confidently wrong answers if you leave something out.

And if you want it to be superhuman then you're by definition not capable of knowing what's important, I guess.
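The "confidently wrong on out-of-distribution data" point can be made concrete with a toy example. This is a minimal sketch, not a real model: a 1-D nearest-centroid classifier on made-up data, which happily returns its closest label even for a query nowhere near anything it was trained on.

```python
# Toy nearest-centroid classifier (made-up data): it always returns its
# closest label, even for a query far outside the training distribution.
def classify(centroids, x):
    return min(centroids, key=lambda label: abs(centroids[label] - x))

centroids = {"cat": 1.0, "dog": 5.0}  # 1-D "training distribution"

print(classify(centroids, 1.2))     # "cat" -- in distribution, reasonable
print(classify(centroids, 1000.0))  # "dog" -- far out of distribution, but
                                    # it still answers without hesitation
```

Nothing in the interface signals that the second answer is meaningless; that's the failure mode being described.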


Btw, models like GPT-4 can express that they are not confident.

But that looks like a "smeared out" probability distribution over the next token, not like the text an unsure human produces.
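One common way to quantify that "smearing" is the entropy of the next-token distribution. A quick sketch with invented probabilities (not taken from any real model):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions (made-up numbers).
confident = [0.97, 0.01, 0.01, 0.01]  # mass concentrated on one token
smeared   = [0.25, 0.25, 0.25, 0.25]  # mass spread evenly across tokens

print(entropy(confident))  # low entropy: the model is "sure"
print(entropy(smeared))    # high entropy: the distribution is smeared out
```

High entropy signals uncertainty at the token level, which is a different thing from the model writing "I'm not sure" in its output text.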


It's been done before. "Textbooks are all you need".

https://arxiv.org/abs/2306.11644


I don't think it would know how to have a conversation if it's never been exposed to one before.


That’s what fine-tuning is about. It learns the language, concepts, etc. from the main dataset and is then tweaked by continuing to train on a smaller, high-quality, hand-curated dataset. That’s how it learns to generate conversational responses by default instead of needing a complicated prompt.
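The two-phase idea can be sketched with a stand-in model. Here a bigram counter plays the role of the network and both corpora are made up; the point is only that a small curated set, trained with extra weight afterwards, steers the default behaviour.

```python
from collections import Counter, defaultdict

def train(model, corpus, weight=1):
    """Accumulate (weighted) bigram counts from a list of sentences."""
    for line in corpus:
        words = line.split()
        for a, b in zip(words, words[1:]):
            model[a][b] += weight

def next_word(model, word):
    """Greedy next-word prediction from the counts."""
    counts = model.get(word)
    return max(counts, key=counts.get) if counts else None

model = defaultdict(Counter)

# Phase 1: large, noisy "pretraining" corpus (invented).
pretrain = ["the cat sat on the mat", "the dog ran", "question the cat barked"]
train(model, pretrain)

# Phase 2: small curated set, weighted to nudge default behaviour
# toward conversational responses.
finetune = ["question answer", "question answer please"]
train(model, finetune, weight=10)

print(next_word(model, "question"))  # fine-tuning overrides the noisy prior
```

A real LLM continues gradient descent on the curated data rather than reweighting counts, but the shape of the pipeline is the same: big generic corpus first, small high-quality corpus second.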



