Cool. So we benefit from prediction markets surfacing insider information about Trump's plans in the Iran conflict, and unknown insiders making hundreds of millions on that information, with massive trades minutes before each announcement, benefited the people watching prices in the oil market? That doesn't seem right.
If the result is statistically significant, it just barely makes it. 84.8% isn't that much higher than 80.8% and they had only 250 prompts, if I'm reading this right.
In a field where progress is measured in tenths of a percentage point, that's not true. Think of it this way: the error rate drops from 19% to 15%, or from roughly 1 in 5 to 1 in 6.
Statistical significance is about whether an effect can reliably be said to have been measured at all; it's not about whether or not the effect itself would be significant in the sense of moving some other needle.
The ~5% improvement reported here might just be an artefact of the data collection or random variation, rather than a consistent repeatable change.
I know what significance means, and I also know that getting it from a p-value is nonsensical.
> The ~5% improvement reported here might just be an artefact of the data collection or random variation, rather than a consistent repeatable change.
You're questioning method or data representativeness, not significance. 250 samples is just about enough for a 5% difference in NHST (stddev is around .4, so 1.64 sigma is .4/15.8*1.64=0.04 for one-sided testing).
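The back-of-the-envelope arithmetic above can be checked in a few lines. This is only a sketch of that one-sample normal approximation: the function name is made up for illustration, and 0.808 and 250 are the figures quoted in this thread. It treats the baseline as a fixed value; if both accuracies were measured on separate, independent sets of prompts, the threshold would be about √2 times wider.

```python
import math

def one_sided_threshold(p: float, n: int, z: float = 1.64) -> float:
    """Smallest accuracy gain over a fixed baseline p that clears a
    one-sided z-test on n samples (normal approximation)."""
    std_dev = math.sqrt(p * (1.0 - p))   # per-prompt stddev, about 0.4 at p = 0.8
    std_err = std_dev / math.sqrt(n)     # sqrt(250) is about 15.8
    return z * std_err

threshold = one_sided_threshold(p=0.808, n=250)
observed = 0.848 - 0.808
print(f"threshold: {threshold:.3f}, observed gain: {observed:.3f}")
```

With these inputs the threshold comes out at roughly 0.04, the same size as the reported gain, which is why the result is borderline either way.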
That would matter if we were asking the AI to generate code open-loop: someone probably already wrote something close to what you asked for in Python. But if the agent generates code, tries to compile it, sees the detailed error messages and acts on those messages to refine the code, it's going to produce a higher quality result. rustc produces really good diagnostics. And there's a lot of Rust code online now, even if there's so much more Python and JavaScript/TypeScript.
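To make the closed-loop idea concrete, here is a minimal sketch of such a loop. Everything in it is hypothetical: `generate` stands in for whatever model call the agent makes, `compile_fn` for whatever invokes the compiler, and the round limit is an arbitrary assumption. It is not how any particular agent is implemented.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: generate(task, feedback) returns source code,
# compile_fn(source) returns (ok, diagnostics).
Generate = Callable[[str, Optional[str]], str]
Compile = Callable[[str], Tuple[bool, str]]

def refine_until_compiles(task: str, generate: Generate, compile_fn: Compile,
                          max_rounds: int = 5) -> Optional[str]:
    """Ask for code, compile it, and feed diagnostics back until it builds."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        source = generate(task, feedback)
        ok, diagnostics = compile_fn(source)
        if ok:
            return source
        feedback = diagnostics  # next attempt sees the compiler's messages
    return None  # gave up after max_rounds failed builds
```

The open-loop case is the same function with `max_rounds=1`: one attempt, no diagnostics fed back.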
LLMs don't actually semantically parse the error messages. They will generate the most likely sequence resulting from the error message based on their training data, so you're back to the training data argument.
Except the presence of errors, mistakes, contradictions, and doubling-back causes LLMs to have substantially worse output, especially without dedicated sub-agents who have been instructed about that deficiency and know to process that kind of crap into better prompts to insert into a different LLM with pristine, error-free context. Without hard numbers we're both just pissing into the wind, but it's entirely plausible that the higher rate of errors matters more than the fact that those errors are more ergonomic. Anecdotally, my LLM work is a _lot_ more productive when I have it draft the thing in Python and translate it into Rust since it wastes so much time on the tiniest of syntactic mistakes.
The strongest evidence that something like MOND isn't the answer is that in some galaxy collisions, the visible matter and the dark matter appear to separate: the collision disrupts the visible matter and the dark matter appears to pass right through, uninterrupted, and we see galaxy remnants that look like they don't have dark matter. If MOND or some other modification of gravity were the answer we'd never see this kind of sorting.
If by some miracle someone managed to create this, and a critical mass of people somehow discovered it and used it, at some point they'd burn out, sell it, and it would turn into the same shit that we see everywhere else.
That is why egcs was launched, to get around the inability of the old team to do gcc releases. The issues had little to do with ideology and were about fixing a broken process and replacing it with something that had a hope of working.
I looked at it and it is impressively lightweight. It would help if it could collapse duplicate notifications; right now the notifications page is filled with repeats even though I'm not all that popular on fedi.
If the navigation simulates what would happen if we follow links to SPA#pos1, SPA#pos2, etc so that if I do two clicks within the SPA, and then hit Back three times I'm back to whatever link I followed to get to the SPA, I guess it's OK and follows user expectations. But if it is used as an excuse to trap the user in the SPA unless they kill the tab, not OK.
> From the browsers perspective those are the same thing though.
If the browser only allows adding at most one history item per click, I should be able to go back to where I entered a given site with at most that many back button clicks.
At a first glance, this doesn't seem crazy hard to implement? I'm probably missing some edge cases, though.
Some browser APIs (such as playing video) are locked behind a user interaction. Do the same for the history API: make it so you can't add any items to history until the user clicks a link, and then you can only add one.
That's not perfect, and it could still be abused, but it might prevent the most common abuses.
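A toy model of that rule, to show the bookkeeping involved. This is a plain Python simulation, not a browser API: the class and method names are invented, and "clicks do not accumulate" is my own assumption about how the budget would work.

```python
class GatedHistory:
    """Toy model of the proposed rule: each user click grants permission
    to add exactly one history entry; pushes without a click are refused."""

    def __init__(self, entry_url: str) -> None:
        self.entries = [entry_url]   # the page the user arrived on
        self._budget = 0             # pushes allowed right now

    def user_click(self) -> None:
        self._budget = 1             # assumption: clicks do not accumulate

    def push(self, url: str) -> bool:
        if self._budget == 0:
            return False             # script tried to push without a click
        self._budget = 0
        self.entries.append(url)
        return True

history = GatedHistory("https://example.com/app")
history.user_click()
history.push("https://example.com/app#pos1")   # accepted
history.push("https://example.com/app#trap")   # refused: no new click
print(len(history.entries))
```

Under this model the history can never grow faster than the user clicks, so the number of Back presses needed to leave is bounded by the number of clicks made inside the site.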
Clearview again. ICE is using it too, and their people think it is an oracle that is always correct, so that when someone shows a passport card or a RealID showing that they are someone else, a US citizen or permanent resident, they are usually accused of having a fake ID. It's a flawed tool and it misidentifies people sometimes.