OpenAI by default includes transcripts in its training data. You have to explicitly opt out (and trust that the opt-out actually does anything).
I wouldn’t trust that everyone correctly opts out. If they don’t, then X months from now “tell me about Foo company’s strategic plans” could regurgitate their internal docs.
Even if we don't opt out, it's not stored as plaintext data, right? And I'd guess what actually gets trained on is the tokenized version, right?
And as far as strategic plans go, I don't think it regurgitates novel, single-source information, does it? Isn't the regurgitation more like boilerplate, seen-many-times type stuff?
Has anyone actually been able to prompt a password out of an LLM? That kind of thing should be ignored because it's high entropy and rare afaik. For example, my name is bongodongobob and my password=hunter17. Do you really think someone will be able to pull this fact out of an LLM at some point? That doesn't seem to be the way they work as I understand it.
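The "high entropy" intuition here can be made concrete with a back-of-the-envelope calculation (a sketch, not anything from the thread): a truly random password's information content is roughly its length times log2 of the character set size, which is why a one-off string is a poor candidate for memorization compared to text repeated across many documents.

```python
import math

def password_entropy_bits(length: int, charset_size: int) -> float:
    """Upper bound on the entropy of a random password:
    length * log2(charset_size) bits."""
    return length * math.log2(charset_size)

# An 8-character string drawn from lowercase letters + digits (36 symbols)
# carries at most ~41 bits if truly random. A real password like
# "hunter17" (dictionary word + digits) has far less than this bound.
print(round(password_entropy_bits(8, 36), 1))
```

Whether a one-off string like that is recoverable is exactly the memorization question being debated here; the arithmetic only shows why such strings are unlike frequently repeated boilerplate.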
Yes, it’s the full transcript, including whatever data you upload. It’s pretty trivial to prompt an LLM into regurgitating its training data, especially the largest models, even for a rare, single-document instance. You might get some word substitutions, but the gist of the original will certainly come through. Not high-entropy passwords, but full documents? Yes.
I explicitly said that high-entropy things like passwords and license keys are not likely to work, due to how information is compressed in the training process. But if you meant the Windows Server license agreement text: