
There is no actual distinction between the "real" thing and "mimicking".

The datasets behemoth LLMs are trained on include a lot of noise that derails progress. They also contain a lot of irrelevant knowledge that the LLM has to learn or memorize, so an obscene number of parameters is required.

When you're not trying to teach a language model the sum total of human knowledge and you provide a high quality curated dataset, the scale barrier is much lower.

https://arxiv.org/abs/2305.07759



Why don't people just train LLMs on pure Wikipedia, arXiv, and other scientific websites? That would reduce the noise and cut down on hallucinations, no?


That is essentially Phi. The results are promising.

But LLMs do gain useful "emergent" properties from training on massive lower quality datasets.


It has to learn the meaning of words, including implicit associations, and to do that it needs to see approximately all the English text ever. We don't know how to balance this with only feeding it useful knowledge.


It doesn't necessarily have to see approximately all the English text ever. Real people don't learn English like that, for example.

It's just that given what we know about neural networks, it's often easier and simpler and more effective to increase the amount of training data than to change anything else.


An LLM has nothing in common with a human. An LLM works in ways that have nothing in common with the human brain.


Yes, LLMs and human brains share at most some faint similarities.

Nevertheless, human feats can act as an existence proof of what is possible. Including of what might be possible for a neural network.

(I'm not sure whether a large language model necessarily needs to be a neural network in the sense of a bunch of linear transformations interleaved with some simple non-linear activation functions. But for the sake of strengthening your argument, let's assume that we are assuming this restrictive definition of LLM.)


Real people might not speak every dialect of English. They may not be well versed in local grammatical oddities.


Doesn't seem to be much of a problem in practice?

If someone knew all the math and science in Wikipedia, for example, I think they'd probably be forgiven for not knowing every regionalism.


Unfortunately models aren't always good at knowing what they don't know ("out of distribution data") so it could lead to confidently wrong answers if you leave something out.

And if you want it to be superhuman then you're by definition not capable of knowing what's important, I guess.
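The "confidently wrong on out-of-distribution data" point can be made concrete with a toy example. This is a minimal sketch, not a real model: a 1-D nearest-centroid classifier on made-up data, which happily returns its closest label even for a query nowhere near anything it was trained on.

```python
# Toy nearest-centroid classifier (made-up data): it always returns its
# closest label, even for a query far outside the training distribution.
def classify(centroids, x):
    return min(centroids, key=lambda label: abs(centroids[label] - x))

centroids = {"cat": 1.0, "dog": 5.0}  # 1-D "training distribution"

print(classify(centroids, 1.2))     # "cat" -- in distribution, reasonable
print(classify(centroids, 1000.0))  # "dog" -- far out of distribution, but
                                    # it still answers without hesitation
```

Nothing in the interface signals that the second answer is meaningless; that's the failure mode being described.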


Btw, models like GPT-4 can express that they are not confident.

But that looks like a "smeared out" probability distribution over the next token, not like the text an unsure human produces.
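One common way to quantify that "smearing" is the entropy of the next-token distribution. A quick sketch with invented probabilities (not taken from any real model):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions (made-up numbers).
confident = [0.97, 0.01, 0.01, 0.01]  # mass concentrated on one token
smeared   = [0.25, 0.25, 0.25, 0.25]  # mass spread evenly across tokens

print(entropy(confident))  # low entropy: the model is "sure"
print(entropy(smeared))    # high entropy: the distribution is smeared out
```

High entropy signals uncertainty at the token level, which is a different thing from the model writing "I'm not sure" in its output text.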


It's been done before. "Textbooks are all you need".

https://arxiv.org/abs/2306.11644


I don't think it would know how to have a conversation if it's never been exposed to one before.


That’s what fine-tuning is about. It learns the language, concepts, etc. from the main dataset and is then tweaked by continuing to train on a smaller, high-quality, hand-curated dataset. That’s how it learns to generate conversational responses by default instead of needing a complicated prompt.
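The two-phase idea can be sketched with a stand-in model. Here a bigram counter plays the role of the network and both corpora are made up; the point is only that a small curated set, trained with extra weight afterwards, steers the default behaviour.

```python
from collections import Counter, defaultdict

def train(model, corpus, weight=1):
    """Accumulate (weighted) bigram counts from a list of sentences."""
    for line in corpus:
        words = line.split()
        for a, b in zip(words, words[1:]):
            model[a][b] += weight

def next_word(model, word):
    """Greedy next-word prediction from the counts."""
    counts = model.get(word)
    return max(counts, key=counts.get) if counts else None

model = defaultdict(Counter)

# Phase 1: large, noisy "pretraining" corpus (invented).
pretrain = ["the cat sat on the mat", "the dog ran", "question the cat barked"]
train(model, pretrain)

# Phase 2: small curated set, weighted to nudge default behaviour
# toward conversational responses.
finetune = ["question answer", "question answer please"]
train(model, finetune, weight=10)

print(next_word(model, "question"))  # fine-tuning overrides the noisy prior
```

A real LLM continues gradient descent on the curated data rather than reweighting counts, but the shape of the pipeline is the same: big generic corpus first, small high-quality corpus second.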



