OpenAI by default includes transcripts in its training data. You have to explicitly opt out (and trust that the opt-out actually does anything).
I wouldn’t trust that everyone correctly opts out. If they don’t, then X months from now “tell me about Foo company’s strategic plans” could regurgitate their internal docs.
Even if we don't opt out, it's not stored as plaintext data, right? And I'd guess what actually gets trained on is the tokenized version, right?
And as far as strategic plans go, I don't think it regurgitates novel, single-source information, does it? Isn't the regurgitation more like boilerplate, seen-many-times type stuff?
Has anyone actually been able to prompt a password out of an LLM? That kind of thing should be ignored because it's high entropy and rare afaik. For example, my name is bongodongobob and my password=hunter17. Do you really think someone will be able to pull this fact out of an LLM at some point? That doesn't seem to be the way they work as I understand it.
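The "high entropy" intuition here can be made concrete with a back-of-the-envelope calculation (a sketch, not anything from the thread): a truly random password's information content is roughly its length times log2 of the character set size, which is why a one-off string is a poor candidate for memorization compared to text repeated across many documents.

```python
import math

def password_entropy_bits(length: int, charset_size: int) -> float:
    """Upper bound on the entropy of a random password:
    length * log2(charset_size) bits."""
    return length * math.log2(charset_size)

# An 8-character string drawn from lowercase letters + digits (36 symbols)
# carries at most ~41 bits if truly random. A real password like
# "hunter17" (dictionary word + digits) has far less than this bound.
print(round(password_entropy_bits(8, 36), 1))
```

Whether a one-off string like that is recoverable is exactly the memorization question being debated here; the arithmetic only shows why such strings are unlike frequently repeated boilerplate.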
Yes, it’s the full transcript, including whatever data you upload. It’s pretty trivial to prompt an LLM into regurgitating its training data, especially the largest models, even for a rare, single-document instance. You might get some word substitutions, but the gist of the original will certainly come through. Not high-entropy passwords, but full documents? Yes.
I explicitly said that high-entropy things like passwords and license keys are not likely to work, due to how information is compressed in the training process. But if you meant the Windows Server license agreement text: