How does it compare to Kimi 2.5 or Qwen 3.6 Plus?

eis · 2026-04-07T18:17:45 1775585865

The blog post has a benchmark comparison table with these two in it

jaggs · 2026-04-07T19:34:02 1775590442

Thanks, I missed that. It's very interesting. They're quite close, but I found Qwen 3.6 Plus was just marginally better than Kimi 2.5. But looking at the stats I'll definitely give GLM 5.1 a try now. [edit: even though, looking at it, it's not cheap and has a much smaller context size. And I can't tell about tool use.]

DeathArrow · 2026-04-07T18:15:43 1775585743

Compared to Kimi 2.5 or Qwen 3.6 Plus I don't know, but I ran GLM 5 (not 5.1) side by side with Qwen 3.5 Plus and it was visibly better.

XCSme · 2026-04-08T00:12:51 1775607171

General intelligence (not coding) comparison: https://aibenchy.com/compare/z-ai-glm-5-medium/z-ai-glm-5-1-...

BoorishBears · 2026-04-08T04:19:53 1775621993

Is there really no rule that discourages 99% of your interactions with HN from being peddling some useless slop benchmark?

XCSme · 2026-04-08T08:05:47 1775635547

If it's relevant to the discussion, I hope not.

I've spent probably over 100 hours working on this benchmarking site/platform, and all tests are manually written. For me (and many others who reached out to me), it's not useless either. I use this myself regularly when choosing and comparing new models. I honestly believe it is providing value to the conversation.

Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.

jaggs · 2026-04-08T10:10:16 1775643016

It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

BoorishBears · 2026-04-08T18:16:58 1775672218

This one's even more interesting

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???

jaggs · 2026-04-08T19:03:23 1775675003

Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock-solid Opus experience.

BoorishBears · 2026-04-09T08:10:19 1775722219

Their benchmark is chock-full of things like that: It's deeply flawed and is essentially rating how LLMs perform if you exert yourself trying to hold them entirely the wrong way.