
Huh? It says: "Cosine's Genie achieves a SOTA score of 43.8% on the new SWE-bench Verified benchmark" with a link to https://www.swebench.com/

But the SWE-bench leaderboard (linked to in the post) doesn't show Cosine Genie at all, instead showing Amazon at the top with 38.8% accuracy.

If true, it seems wild that Cosine, a 10-person startup with $2.5m in funding, can actually achieve SOTA results over Google, Amazon, Microsoft, Anthropic, OpenAI, etc. It's not like the big players are overlooking this problem of using LLMs to automate software engineering. Does anyone have more color on this? I see some speculation online that the training data can easily get contaminated with benchmark questions, but not much careful evidence.



It comes with an asterisk. Here's their comment from Cosine's website:

Note SWE-Bench has recently modified their submission requirements, now asking for the full working process of our AI model in addition to the final results - their condition to have us appear on the official leaderboard. This change poses a significant challenge for us, as our proprietary methodology is evident in these internal processes. Publicly sharing this information would essentially open-source our approach, undermining the competitive advantage we’ve worked hard to develop. For now, we’ve decided to keep our model’s internal workings confidential. However we’ve made the model’s final outputs publicly available on GitHub for independent verification. These outputs clearly demonstrate our model’s 30% success rate on the SWE-Bench tasks.


It looks like they didn't want to make a public submission in order to avoid disclosing the model internals: https://cosine.sh/blog/genie-technical-report#:~:text=SWE%2D....


"verified" being the (rather confusing) keyword here.

https://openai.com/index/introducing-swe-bench-verified/


Your link doesn't actually point to the leaderboard. This one does: https://www.swebench.com/ (click on the "Verified" tab). I don't see any entry for Genie.



