35B-A3B model is at ~25 t/s. For comparison, on an A100 (~RTX 3090 with more memory) they run at 41 t/s and 97 t/s, respectively.
I haven't tested the 27B model yet, but 35B-A3B often goes off the rails after 15k-20k tokens of context. You can have it do basic things reliably, but certainly not at the level of "frontier" models.
Sorry for the delay - so it installs https://github.com/Blaizzy/mlx-vlm and other components and sets up the commands - you don't need to use it but we thought it might be easier for folks
Why use --fit on on an M4? My understanding was that given the unified memory, you should push all layers to the GPU with --n-gpu-layers all. Setting --flash-attn on and --no-mmap may also get you better results.
Meaningless question, fit will put everything on the gpu if it fits. Fa is default on. No-mmap is not an inference tradeoff and if you do turn it off you need to turn on direct io via -dio
What he should actually do is enable speculative decoding
But isn't the prefill speed the bottleneck in some systems* ?
Sure, it's an order of magnitude faster (10x on Apple Metal?) but there's also an order of magnitude more tokens to process, especially for tasks involving summarization of some sort.
But point taken that the parent numbers are probably decode
* Specifically, Mac metal, which is what parent numbers are about
oMLX makes prefill effectively instantaneous on a Mac.
Storing an LRU KV Cache of all your conversations both in memory, and on (plenty fast enough) SSD, especially including the fixed agent context every conversation starts with, means we go from "painfully slow" to "faster than using Claude" most of the time. It's kind of shocking this much perf was lying on the ground waiting to be picked up.
Open models are still dumber than leading closed models, especially for editing existing code. But I use it as essentially free "analyze this code, look for problem <x|y|z>" which Claude is happy to do for an enormous amount of consumed tokens.
But speed is no longer a problem. It's pretty awesome over here in unified memory Mac land :)
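For anyone wondering what "an LRU KV cache of your conversations" looks like in principle, here is a minimal sketch. It is not oMLX's actual implementation: the class name `PrefixLRUCache`, the opaque `state` value standing in for real KV tensors, and the linear prefix scan are all assumptions made for illustration, and the SSD tier is left out entirely. The idea is only that a new request reuses the longest cached token prefix (for example the fixed agent context) and prefills just the remainder.

```python
from collections import OrderedDict


class PrefixLRUCache:
    """Toy LRU cache mapping token prefixes to an opaque 'KV state' (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # tuple(tokens) -> state, oldest first

    def put(self, tokens, state):
        key = tuple(tokens)
        self._entries[key] = state
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def longest_prefix(self, tokens):
        """Return (n_cached_tokens, state) for the longest cached prefix, else (0, None)."""
        tokens = tuple(tokens)
        best = None
        for key in self._entries:
            if len(key) <= len(tokens) and tokens[:len(key)] == key:
                if best is None or len(key) > len(best):
                    best = key
        if best is None:
            return 0, None
        self._entries.move_to_end(best)  # a hit refreshes recency
        return len(best), self._entries[best]
```

With a hit of `n` tokens, only `tokens[n:]` would still need prefill, which is why a long shared system prompt stops costing anything after the first request.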
To put that into context, some tags count as upvotes, others count as downvotes, and "Troll" is a downvote. So to have your post labelled as "Troll" with a positive score, it has to have enough upvotes to compensate for the penalty from the "Troll" votes, but without having another tag dominate. 5 is the maximum score.
"Score: 5, Troll" is therefore the mark of a very successful troll.
They also have "Underrated" and "Overrated" which apply points but do not act as tags. So I guess the easiest way to get +5 Troll is to have many Troll and Underrated votes, if it works the way I think it does.
Hash functions and PRNGs are closely related: they share many properties and they can be built from the same algorithmic components, so for many kinds of PRNGs there are corresponding kinds of hash functions and vice-versa.
Nevertheless, the purposes of hash functions and PRNGs are different and complementary.
A PRNG receives a short fixed-length value (the seed) and it expands it into a long pseudo-random sequence of arbitrary length.
A hash function receives a long input sequence of arbitrary length and it generates a short fixed-length pseudo-random value.
Good PRNGs are injective functions and good hash functions are surjective functions.
Normally the design methods for PRNGs and for hash functions should be presented together, because it is easy to interconvert algorithms for one of them with algorithms for the other. For instance, given a good hash function one could make a PRNG by computing the hashes of a sequence of numbers or the hashes of a sequence of strings of increasing length, and given a good PRNG one could make a hash function by accumulating somehow the input into a PRNG seed and taking the first generated number, or better by using input chunks as seeds and then accumulating the first generated numbers into a single value.
However, for a successful conversion between PRNG and hash function algorithms, the source algorithm may have to be overdesigned, to guarantee good enough properties even after the conversion.
When an algorithm is designed directly as a hash function or as a PRNG, with clearly specified requirements, it can be designed only as good as strictly necessary, thus enabling better performance.
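The two conversions described above can be sketched roughly as follows. This is only an illustration under my own assumptions: the function names `hash_prng` and `prng_hash` are made up, SHA-256 stands in for "a good hash function", and Python's `random.Random` (Mersenne Twister) stands in for "a good PRNG". Neither result should be treated as cryptographically sound.

```python
import hashlib
import random


def hash_prng(seed: bytes, count: int):
    """Hash -> PRNG: yield `count` 64-bit integers by hashing seed || counter."""
    for i in range(count):
        digest = hashlib.sha256(seed + i.to_bytes(8, "big")).digest()
        yield int.from_bytes(digest[:8], "big")


def prng_hash(data: bytes, chunk_size: int = 8) -> int:
    """PRNG -> hash: use each input chunk (mixed with the running value)
    as a seed and keep the first generated number as the new running value."""
    acc = len(data)
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        seed = int.from_bytes(chunk, "big") ^ acc
        acc = random.Random(seed).getrandbits(64)
    return acc
```

Re-seeding a Mersenne Twister for every 8-byte chunk is far more work than a purpose-built hash would do, which is a small demonstration of the overdesign cost mentioned above.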
Well, that's technically also a deterministic random number generator! (I want to say it's not a great one, but... that's apparently context-dependent!)
If your input is i.i.d. random, then truncating works great. E.g., if your keys are UUIDs, then truncating can work well.
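A tiny sketch of the truncation idea, assuming version-4 (random) UUIDs, whose low-order bits are random. The helper name `bucket_for` and the default of 8 bits are hypothetical choices for the example.

```python
import uuid


def bucket_for(key: uuid.UUID, bits: int = 8) -> int:
    """Use the low `bits` bits of the UUID as the bucket index (truncation as the 'hash')."""
    return key.int & ((1 << bits) - 1)


# Usage: spread random keys over 256 buckets without computing any real hash.
buckets = [0] * 256
for _ in range(10_000):
    buckets[bucket_for(uuid.uuid4())] += 1
```

This would not hold for keys with structure in the truncated bits (sequential IDs, timestamps), which is where a real hash function earns its cost.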
Another use:
Suppose you write a tool like rmlint that is looking for duplicate files. Generally, you compute some hash for each file, see if you got any duplicates, and then compare the relevant files directly.
A traditional hash like crc or sha256 takes O(n) to compute. But for files you can start with some cheaper hashes, like file length. After all, files of different length can't have the same content. Taking the first few bytes of your file is another cheap 'hash' you can compute.
Only when these cheap 'hashes' show that you have a potential duplicate, do you go and pay for a more expensive hash.
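The staged approach above could look something like this. It is a sketch, not how rmlint actually works: the function names, the 4096-byte head size, and the choice of SHA-256 for the expensive stage are my assumptions.

```python
import hashlib
import os
from collections import defaultdict


def _group(paths, key):
    """Group paths by key(path) and keep only groups with more than one member."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def _head(path, n=4096):
    """Cheap 'hash': the first n bytes of the file."""
    with open(path, "rb") as f:
        return f.read(n)


def _sha256(path):
    """Expensive hash: reads the whole file, O(file size)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    """Return groups of paths that agree on size, then head bytes, then SHA-256."""
    candidates = [list(paths)]
    for key in (os.path.getsize, _head, _sha256):  # cheapest stage first
        candidates = [sub for g in candidates for sub in _group(g, key)]
    return candidates
```

Each stage only runs on files that survived the previous one, so most files never get fully read. A final byte-by-byte comparison of the surviving groups (e.g. with `filecmp.cmp(a, b, shallow=False)`) is left out here.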
The author emphasizes accessibility and coherence as benefits, but another interesting one is composability, which does not emerge naturally in the world of UI. Create a UI for a pair of websites like a command line for grep and wc. LLMs already provide that, but under the natural language interaction primitive. UI could allow for branded experiences, ad delivery and whatnot in ways that natural language doesn't.