
Mine is also an M1. Just use llama3; it's 8b and quantized by default.


I will try it out, curious to see how it will work with 8gb of memory haha. Thanks for the heads up!


Do you happen to have any handy guides/docs/references for absolute beginners to follow?


The absolute easiest way is https://github.com/Mozilla-Ocho/llamafile

Just download a single file and run it.


Ollama is not as powerful as llama.cpp or raw pytorch, but it is almost zero effort to get started.

brew install ollama; ollama serve; ollama pull dolphin-llama3:8b-v2.9-q5_K_M; ollama run dolphin-llama3:8b-v2.9-q5_K_M

https://ollama.com/library/dolphin-llama3:8b-v2.9-q5_K_M

(It may need to be Q4 or Q3 instead of Q5 depending on how the RAM shakes out. But the Q5_K_M quantization (k-quantization is the term) is generally the best balance of size vs performance vs intelligence if you can run it, followed by Q4_K_M. Running Q6, Q8, or fp16 is of course even better but you’re nowhere near fitting that on 8gb.)
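As a rough sanity check on "how the RAM shakes out," you can estimate the weight memory from the parameter count times the quantization's bits per weight. A minimal sketch (the bits-per-weight figures are approximate averages for llama.cpp k-quants, and real usage needs extra RAM for the KV cache and runtime on top of the weights):

```python
# Rough memory estimate for quantized model weights.
# Bits-per-weight values are approximate averages; actual GGUF files
# vary slightly, and inference needs additional RAM for the KV cache
# and runtime overhead beyond the weights themselves.
BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "fp16": 16.0,
}

def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for q in ("Q3_K_M", "Q4_K_M", "Q5_K_M", "fp16"):
    print(f"8B at {q}: ~{weight_gb(8, q):.1f} GB")
```

This is why Q5_K_M (~5.7 GB of weights for an 8B model) is a tight squeeze on 8gb once the OS and KV cache take their share, and fp16 (~16 GB) is out of the question.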

https://old.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

Dolphin-llama3 is generally more compliant and I'd recommend it over the base model. It's been fine-tuned to strip out the dumb "sorry, I can't do that" refusals, and it turns out this also increases the quality of the results (refusal training constrains the space of outputs the model will generate, which degrades quality even on benign requests).

https://erichartford.com/uncensored-models

https://arxiv.org/abs/2308.13449

Most of the time you will want to look for an "instruct" model. If it doesn't have the instruct suffix, it's normally a "fill in the blank" base model that continues whatever pattern it sees in the input, rather than generating a textual answer to a question. Ollama typically pulls the instruct variants into its repos, though.
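Once you've pulled an instruct model, you can also script against it: Ollama serves a local REST API on port 11434 with a /api/generate endpoint. A minimal sketch using only the stdlib (the model name is just an example; `ollama serve` must be running for the live call to work, so it's left commented out):

```python
import json
import urllib.request

def build_request(model: str, prompt: str) -> dict:
    # Payload for Ollama's /api/generate endpoint. stream=False makes
    # it return a single JSON object instead of a token stream.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> str:
    # Requires a running `ollama serve` on the default port.
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_request("dolphin-llama3:8b-v2.9-q5_K_M", "Why is the sky blue?")
print(payload)
# Live call (needs ollama serve running):
# print(generate("dolphin-llama3:8b-v2.9-q5_K_M", "Why is the sky blue?"))
```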

(Sometimes you will see this even with instruct models, especially if they're misconfigured. When non-dolphin llama3 first came out, I played with it and got answers formatted like Stack Overflow or Quora responses, complete with "scores" and so on, either as the full output or mixed in. Presumably a misconfigured model, or a non-instruct model pulled in by mistake, or something.)

Dolphin-mixtral:8x7b-v2.7 is where things get really interesting imo. I have 64gb and 32gb machines, and so far Q6 and Q4_K_M respectively are the best options for those. dolphin-llama3 is reasonable, but dolphin-mixtral gives a richer, better response.

I’m told there’s better stuff available now, but I'm not sure what a good choice would be for 64gb and 32gb machines if not mixtral.

Also, just keep an eye on r/LocalLLaMA in general, that's where all the enthusiasts hang out.


Ollama is llama.cpp plus Docker. If you can do without Docker, it's faster.


No, the ollama default quantisation is 4 bit


I meant 8b -> 8 billion parameters, rather than 70b


Ah sorry!



