Since you brought up that context, do you happen to know how it works?
I tested it, and it definitely works like a history: e.g., you can feed it to a model and ask a question about what you just talked about, and it works.
Does Ollama essentially just convert the conversation so far into a binary representation, then convert it back to text and feed it back to the model ahead of your newest query? Or is it doing something more involved?
I wish it were better documented; I have a ton of questions about how it functions in practice that I'll end up trying to reverse engineer.
Like, what’s the lifetime of this context? If I load a new model into memory then reload the original, is that context valid? Is it valid if the computer restarts? Is it valid if the model gets updated?
I'm not an LLM expert, but based on your explanation that it's related to embeddings (and another comment that Ollama loads it directly into memory), I'm guessing that the model updating its weights definitely invalidates the context. I'm not so sure about the other cases, like unloading and reloading the same model.
It's easier to reason about if you understand what it is. That binary data is basically a big vector representation of a textual context's contents. I suspect it doesn't matter which model you use with the context binary, as Ollama handles providing it to the model.
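For what it's worth, the mechanics of passing the context back are visible in Ollama's REST API: a `/api/generate` response includes an opaque `context` field, and you resend it with the next prompt to continue the conversation. A minimal sketch of building those payloads (the model name and the fake response here are placeholders, and no server is actually contacted):

```python
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str, context=None) -> dict:
    """Build a /api/generate payload. `context` is the opaque list
    returned by a previous response; passing it back resumes the chat."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if context is not None:
        payload["context"] = context
    return payload

# First turn: no context yet.
first = build_request("llama3", "My name is Ada.")

# A real response would carry a "context" list; this one is made up
# purely to show the round trip.
fake_response = {"response": "Nice to meet you, Ada.", "context": [101, 202, 303]}

# Second turn: hand the returned context back so the model "remembers".
second = build_request("llama3", "What is my name?", fake_response["context"])
```

You'd POST each payload (e.g. with `requests.post(OLLAMA_URL, json=second)`) and keep threading the latest `context` forward turn by turn.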
Any particular resources you’d recommend to learn more about this?
So if I’m understanding it correctly, there’s one consistent way that Ollama will vectorize a set of text. Perhaps there are various ways one could, but Ollama chooses one.
What about multimodality? I.e., taking a context from a prompt to llava to identify an image, then asking further questions about the contents of that image. Any non-llava model would definitely hallucinate, but would llava?
Multimodal models use embeddings as well; the difference is that they've been trained to associate the same position in latent space with a piece of text and with the image that text describes, so the model can relate a textual response to an image and vice versa. A lot of models use CLIP, an embedding method from OpenAI.
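The shared-latent-space idea can be sketched without loading a real model: both encoders map their inputs to vectors in the same space, and cosine similarity tells you how well a caption matches an image. The short vectors below are toy stand-ins for real CLIP embeddings (which would be hundreds of dimensions), chosen just to illustrate the comparison:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: a trained multimodal model would place the caption
# "a photo of a dog" near the dog image and far from the car image.
text_emb_dog = [0.9, 0.1, 0.0]    # text encoder output (toy)
image_emb_dog = [0.85, 0.15, 0.05]  # image encoder output (toy)
image_emb_car = [0.0, 0.2, 0.95]    # unrelated image (toy)

match = cosine(text_emb_dog, image_emb_dog)      # high: same region of space
mismatch = cosine(text_emb_dog, image_emb_car)   # low: different region
```

With a real CLIP model the comparison is the same, just over learned embeddings; that's how a caption gets ranked against candidate images.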