Since you brought up that context, do you happen to know how it works?
I tested it, and it definitely works like a history: e.g., you can feed it to a model and ask a question about what you just talked about, and it works.
Does Ollama essentially just convert the conversation so far into a binary representation, then convert it back to text and feed it back to the model ahead of your newest query? Or is it doing something more involved?
I wish it were better documented; I have a ton of questions about how it functions in practice that I'll end up trying to reverse engineer.
Like, what’s the lifetime of this context? If I load a new model into memory then reload the original, is that context valid? Is it valid if the computer restarts? Is it valid if the model gets updated?
I'm not an LLM expert, but based on your explanation that it's related to embeddings (and another comment that Ollama loads it directly into memory), I'm guessing that the model updating its weights definitely invalidates the context. I'm not so sure about the other cases, like unloading and reloading the same model.
It's easier to reason about if you understand what it is. That binary data is basically a big vector representation of a textual context's contents. I suspect it doesn't matter which model you use with the context binary, as Ollama handles providing it to the model.
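For what it's worth, the mechanics of passing the context back are visible in Ollama's REST API: a `/api/generate` response includes an opaque `context` field, and you resend it with the next prompt to continue the conversation. A minimal sketch of building those payloads (the model name and the fake response here are placeholders, and no server is actually contacted):

```python
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str, context=None) -> dict:
    """Build a /api/generate payload. `context` is the opaque list
    returned by a previous response; passing it back resumes the chat."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if context is not None:
        payload["context"] = context
    return payload

# First turn: no context yet.
first = build_request("llama3", "My name is Ada.")

# A real response would carry a "context" list; this one is made up
# purely to show the round trip.
fake_response = {"response": "Nice to meet you, Ada.", "context": [101, 202, 303]}

# Second turn: hand the returned context back so the model "remembers".
second = build_request("llama3", "What is my name?", fake_response["context"])
```

You'd POST each payload (e.g. with `requests.post(OLLAMA_URL, json=second)`) and keep threading the latest `context` forward turn by turn.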
Any particular resources you’d recommend to learn more about this?
So if I’m understanding it correctly, there’s one consistent way that Ollama will vectorize a set of text. Perhaps there are various ways one could, but Ollama chooses one.
What about multimodality? I.e., taking a context from a prompt to llava to identify an image, then asking further questions about the contents of that image. Any non-llava model would definitely hallucinate, but would llava?
Multimodal models use embeddings as well; the difference is that they've been trained to associate the same position in latent space with a piece of text and with the image that text describes, so the model can relate a textual response to an image and vice versa. A lot of models use CLIP, an embedding method from OpenAI.
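The shared-latent-space idea can be sketched without loading a real model: both encoders map their inputs to vectors in the same space, and cosine similarity tells you how well a caption matches an image. The short vectors below are toy stand-ins for real CLIP embeddings (which would be hundreds of dimensions), chosen just to illustrate the comparison:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: a trained multimodal model would place the caption
# "a photo of a dog" near the dog image and far from the car image.
text_emb_dog = [0.9, 0.1, 0.0]    # text encoder output (toy)
image_emb_dog = [0.85, 0.15, 0.05]  # image encoder output (toy)
image_emb_car = [0.0, 0.2, 0.95]    # unrelated image (toy)

match = cosine(text_emb_dog, image_emb_dog)      # high: same region of space
mismatch = cosine(text_emb_dog, image_emb_car)   # low: different region
```

With a real CLIP model the comparison is the same, just over learned embeddings; that's how a caption gets ranked against candidate images.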