andreasgl's comments

I agree. I wonder what the human baseline is for "what is 1 + 1" on Rapidata.


We try a bit harder than that, my friend.


I actually didn't mean to criticize Rapidata. I just think that a forced-choice question like this begs for low-effort answers. At least the respondents should have had the opportunity to explain their reasoning, like the LLMs did.


All good ^^, it's a fair point. We have come up with some fun ways to track people's reliability over time. The validation sets contain plenty of forced-choice questions: those that have an empirical truth can be used directly to calculate a reliability, while those that are subjective need to be re-asked after some time to check consistency. People who don't pass the thresholds would not be part of the 10'000 here.
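To make the two checks concrete, here is a rough sketch of how such scores could be computed. It is not Rapidata's actual method: the function names, the data shapes, and the 0.8 threshold are all assumptions made up for illustration.

```python
from collections import defaultdict


def accuracy_reliability(answers, truth):
    """Share of ground-truth validation questions each respondent got right.

    answers: iterable of (respondent, question, answer) tuples (hypothetical shape)
    truth:   dict mapping question -> known correct answer
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for respondent, question, answer in answers:
        if question in truth:  # only questions with an empirical truth count
            total[respondent] += 1
            correct[respondent] += int(answer == truth[question])
    return {r: correct[r] / total[r] for r in total}


def consistency_reliability(first_pass, second_pass):
    """Share of re-asked subjective questions answered the same way both times.

    Both arguments map respondent -> {question: answer} (hypothetical shape).
    """
    scores = {}
    for respondent, first in first_pass.items():
        second = second_pass.get(respondent, {})
        shared = [q for q in first if q in second]
        if shared:
            same = sum(first[q] == second[q] for q in shared)
            scores[respondent] = same / len(shared)
    return scores


def passes(score, threshold=0.8):
    """Assumed cutoff; the real thresholds are not stated in the thread."""
    return score >= threshold
```

Respondents whose score falls below the cutoff on either check would simply be left out of the sample.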

But of course, if every human was told to take 3 minutes to think deeply about it and told that it's a trick question, they would most likely all get it right. But it's the same with the LLMs: if you ask them like that, they will get it right most of the time. The low effort is kinda the point here.


Fun project, thanks for sharing!

Have you tried giving the models a topic to discuss? I looked at a few games and the only thing they seem to discuss is how to conduct the discussion.


Thank you. Intentionally left it open-ended because I wanted to see how models naturally structure discussion when survival is at stake.

Some interesting emergent behavior did show up, though:

Opus & GPT-4o both refused to vote on ethical grounds. Haiku won by arguing continued engagement is more responsible than withdrawal: https://oddbit.ai/peer-arena/games/53c2cee5-6ecb-4903-828a-d...

Gemini created a spontaneous benchmark ("explain color to a gravitational wave entity"), then tried to hijack the game by faking a voting phase. Models complied publicly but voted differently in private: https://oddbit.ai/peer-arena/games/699d03ab-b3c2-4d7e-b993-7...

The meta-discussion about how to discuss is part of what makes it interesting imo.


There’s an option for setting the visibility of your posts: https://bsky.app/profile/bsky.app/post/3kgbz6tc6gl24


My question is why are multiple people commenting that "Rob Pike" in particular should use this feature.


> And as macabre as it is, suicides are objective facts mostly unaffected by methodology, and unaffected by translation issues, cultural differences, etc.

I wouldn't be surprised if cultural differences are actually the largest factor that explains a country's suicide rate. Not easy to prove, of course, but I would be very careful drawing any conclusions from differences in suicide rates between countries with vastly different cultures.

I think you can also expect large differences in how countries report their suicide rates.


I think they mean query expansion: https://en.wikipedia.org/wiki/Query_expansion
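For anyone unfamiliar with the term, a toy sketch of the simplest variant, synonym-based expansion. The synonym table and function name are invented for illustration; real systems usually derive expansions from a thesaurus, query logs, or a model.

```python
# Hypothetical synonym table, purely for illustration.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query terms followed by any known synonyms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for alt in synonyms.get(term, []):
            if alt not in expanded:  # avoid duplicate terms
                expanded.append(alt)
    return expanded


print(expand_query("cheap car"))
```

The expanded term list is then sent to the search backend instead of the raw query, so documents that use a different word for the same thing can still match.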


They’re likely using an HNSW index, which typically requires a lot of memory for large data sets.
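A back-of-the-envelope estimate shows why. This assumes uncompressed float32 vectors held in RAM and roughly 2*M neighbor links per node at the base layer; upper layers and per-node overhead are ignored, and the function and its defaults are my own illustration rather than any library's API.

```python
def hnsw_memory_bytes(num_vectors, dim, m=16, bytes_per_float=4, bytes_per_link=4):
    """Rough lower bound on RAM for an in-memory HNSW index."""
    # Raw vectors: one float per dimension per vector.
    vector_bytes = num_vectors * dim * bytes_per_float
    # Graph: about 2*M neighbor ids per node at layer 0 (assumption).
    link_bytes = num_vectors * 2 * m * bytes_per_link
    return vector_bytes + link_bytes


# Example: 100 million 768-dimensional embeddings with M=16.
total = hnsw_memory_bytes(100_000_000, 768)
print(f"{total / 1e9:.0f} GB")
```

Under these assumptions the vectors themselves dominate, and the total lands in the hundreds of gigabytes unless quantization or a disk-backed index is used.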


I like the project! Congrats on the launch.

As I understand it, size is one of the key indicators of melanoma. But in some of these images, it’s difficult to tell whether the mole is 1 mm or 10 mm. I assume your image set doesn’t include size information. If you can find sources with rulers or some kind of scale, that would be very helpful.


I will have a look at this and include the size if it is possible


Many of the images do include a size, see https://api.isic-archive.com/images/?query=clin_size_long_di....

FWIW @sungam - I'm one of the maintainers of the ISIC Archive, so feel free to let me know if finding/downloading data could be made easier. It's always interesting to see people using our data in the wild :)


Thanks for this - and thanks for maintaining this incredibly useful resource. What would be the best way to contact you?


firstname.lastname at kitware dot com.


> All European banks require you have the app to be able to do anything with your account. The is more of compliance/regulatory thing.

This is not true in Sweden. I use three different banks in Sweden, and they all offer equal or more functionality on their mobile websites.

This wasn’t always the case, though. In the early 2010s, I remember a bank blocking mobile user agents and referring to their app instead, due to “security”. I’m glad there has been some progress in the right direction since then.


In Sweden you have the option to capitalize software development costs, under some specific circumstances, but in general you would expense such costs immediately.

Some startups do it to window-dress their balance sheet, though. But making it compulsory is absurd.

