Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 plus was just marginally better than Kimi 2.5. But looking at the stats I'll definitely give GLM 5.1 a try now. [edit: even though looking at it, it's not cheap and has a much smaller context size.And I can't tell about tool use.]
I've spent probably over100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others that reached out to me) are not useless either. I use this myself regularly when choosing and comparing new models. I honestly beleive it is providing value to the conversation.
Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.
Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock solid opus experience.
Their benchmark is chock-full of things like that: It's deeply flawed and is essentially rating how LLMs perform if you exert yourself trying to hold them entirely the wrong way.