
I got similar results for most models, with Gemini 3 Flash (with reasoning) being the most consistent and reliable: https://aibenchy.com

I also noticed the same thing: some models reason correctly but draw the wrong conclusions.

And MiniMax M2.5 just reasons forever (filling the entire reasoning context) and then gives wrong answers. This is why it's #1 on OpenRouter: it burns through tokens.


