> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.
Yuck. At that point don't publish a benchmark; that attitude explains why the results are useless too.
-
Edit since I'm not able to reply to the below comment:
"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.
I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.
Why not? I described this in more detail in other comments.
Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external api's, parsing documents, etc.
Most models get this right. Also, this is just one failure mode of Claude.
Like I said in the edit, when people want specific formatting they ask for well known formats: Markdown, XML, JSON
I don't even need to debate if the benchmark is useful, it doesn't pass a sniff test: GPT-5.4 is not worse than Gemini 2.5 Flash in any way that matters to most users. In your benchmark it's meaningfully worse.
The questions do ask specifically to respond with the answer only, with an example format given in many cases.
Note that all reasoning models are tested with "medium" reasoning.
The benchmarks are questions/data processing tasks that an average user will likely ask, not coding questions (I didn't add any coding tests yet).
Gemini models also tend to be very consistent. Asking the same question will likely give the same result.
The two models you mention scored the same, the only difference is that Gemini was better at domain-specific questions (i.e. you ask something quite technical/niche).
If Gemini 2.5 Flash and GPT 5.4 perform the same for you, I'm glad.
It's not a useful finding for the rest of the world, and I sure hope non-technical people aren't being taken in by a steaming pile that implies those two LLMs perform similarly (among many other ridiculous findings), but c'est la vie.
Nowadays anyone can vibecode a "benchmark" with zero understanding of the domain, so what more should I expect?
I've always wondered if this (kinda widespread?) theory stems from most people thinking that "infinity" includes every possible option, which is not true.
Mathematician here, so educated layman on the physics but expert on infinity if you like.
Mathematically, "infinity" doesn't imply every possible option. But in terms of quantum physics, yes, it kind of does include every possible option. There is a kind of joke classroom exercise in quantum physics class to calculate the probability that a piano would instantaneously rematerialize a meter away from its previously observed location. It's 10^-[ridiculous number], but still, that's not zero.
The physical reconfiguration of a person's brain needed to make them break out singing is a much smaller deviation, so it's comparatively likely. So 10^-[somewhat less ridiculous number].
The bigger issue with all those non-zero probabilities is they're meaningless while you still experience actual time as a human...but become pretty damn significant when you experience no time after you die.
So tiny probabilities become essentially guarantees unless the heat death of the universe is so thorough as to erase the slight probability that the whole thing pops back into existence.
I think an example would be the two-body problem. The orbit keeps its eccentricity, so it does not explore different eccentricities, even though orbits with different eccentricities can have the same total energy.
(But I just looked that up too, because this concept is mostly used/assumed in statistical physics.)
Doesn't infinity include every possible option ("possible" meaning that it can happen within the rules of physics)? If the model of the universe is one where events happen with some probability, then if the probability is nonzero and the number of universes is infinite, the event should happen in some of the universes.
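The arithmetic behind that intuition is simple to check. A toy sketch in Python (the function name and numbers are mine, purely illustrative), assuming independent trials with a fixed nonzero probability p:

```python
# If an event has fixed probability p per independent trial, the chance
# it happens at least once in n trials is 1 - (1 - p)**n, which tends
# to 1 as n grows -- so "nonzero probability + unbounded trials"
# really does mean the event eventually happens almost surely.
def prob_at_least_once(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for n in (10, 1_000, 1_000_000):
    print(n, prob_at_least_once(1e-6, n))
```

With p = 10^-6, a million trials already puts the probability of at least one occurrence above 63%; the limit as n goes to infinity is exactly 1.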
While Microsoft is ending support for Windows 10 completely, Apple is just stopping feature upgrades. Apple usually supports old OS versions for years to come, especially when it's the only supported version for a lot of devices. So no, Intel Macs don't need to be retired.
The most painful parts are (1) it's a bit hot and loud under load; (2) you need to patch modern software like git, likely with little hope to upstream; (3) waiting hours for those "simple" things to compile - which, in the end, tells us something important about what we'd consider "simple" nowadays.
For both retro and previous-generation hardware, security is the most important concern. Patches for PowerPC kept coming until 2011 or so (that's almost 10 years after that particular machine was released). I'd expect the Intel Macs to keep getting official patches until 2030, and in the meantime I wouldn't be surprised to find community efforts to extend that. "Sorbet Leopard" was a thing for PPC Macs, the Hackintosh community is much stronger than back then.
> the Hackintosh community is much stronger than back then
Yeah but they'll be stuck on macOS 26. That's effectively the planned end of that community, they're not interested in running old versions of macOS on PCs.
People are patching newer macOS versions to run on older hardware (like OpenCore), running older OSes on PCs as they see fit, and all Macs allow downgrading (10.15 runs on the final 2019 models). I speculate that the community will settle around some version that strikes a decent balance between stability, features, and ease of patching.
Sure, but that community is interested in running the latest version of macOS on a PC. When Apple releases macOS 27 next year, they will have to think long and hard about their next move. Do I buy a Mac to keep my ability to run the latest version of macOS? Or do I tolerate that I'm running an old version of macOS, the first one with the new design that wasn't really finished in that version to boot?
I give it ten years until the websites of that community straight up disappear.
Those are all hobby projects for 20-30yro machines, few of which are left around. There are millions of Intel Macs in excellent shape. Someone will carry the mantle.
We're not talking about Intel Macs. Those are here forever as collectables. I'm talking about the continuing relevance of hackintoshes. Those will soon join the Intel Macs in the annals of history, and disappear as a relevant community.
Only for security vulnerabilities that "Apple is aware may have been actively exploited". And almost never for any bug fixes (and sadly, Apple now tends to push off bug fixes to the next major release/"n+1" rather than fix bugs in the major version in which they were introduced).
I don't understand why I'm downvoted. I don't think it's acceptable to keep a machine with known vulnerabilities "not yet actively exploited" for "most common uses". The defense of Apple here goes too far.
Replayability means something different in this context. First, we do know the backdoor passes the payload to system(), so in general it's as if an attacker has access to bash, presumably as root since it's sshd.
Replayability means, if someone were to catch a payload in action which did use the exploit, you can’t resend the attacker’s data and have it work. It might contain something like a date or other data specific only to the context it came from. This makes a recorded attack less helpful for developing a test… since you can’t replay it.
> It might contain something like a date or other data specific only to the context it came from.
In all these modern protocols, including SSHv2 / SecSH (Sean Connery fans at the IETF evidently) both parties deliberately introduce random elements into a signed conversation as a liveness check - precisely to prevent replaying previous communications.
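The liveness-check idea can be sketched in a few lines of Python. This is a toy model, not SSH's actual handshake, and all names here are mine: the verifier issues a fresh random nonce, the prover authenticates it (HMAC standing in for a real signature), and a captured response is useless later because every new exchange uses a different nonce.

```python
import hashlib
import hmac
import secrets

KEY = b"shared-secret"  # stand-in for the real session/authentication key

def challenge() -> bytes:
    """Verifier picks fresh randomness for each exchange."""
    return secrets.token_bytes(16)

def respond(nonce: bytes) -> bytes:
    """Prover authenticates the specific nonce it was given."""
    return hmac.new(KEY, nonce, hashlib.sha256).digest()

def verify(nonce: bytes, response: bytes) -> bool:
    return hmac.compare_digest(respond(nonce), response)

# Live exchange succeeds:
n1 = challenge()
r1 = respond(n1)
assert verify(n1, r1)

# Replaying the recorded response against a new challenge fails,
# because the new nonce (almost surely) differs from the old one:
n2 = challenge()
assert not verify(n2, r1)
```

The design point is that the signed material binds the response to this conversation; anything an eavesdropper records is dead weight the moment the next exchange picks new randomness.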
TLS 1.3's zero round-trip (0-RTT) mode cannot do this, which is why the spec basically says you'd better be damn sure you've figured out exactly why it's safe to use, including every weird replay scenario and why it's technically sound in your design, or else you must not enable it. We may yet regret the whole thing and just tell everybody to refuse it.
What could be done, I think, is patch the exploit to log the payload (and perhaps some network state?) instead of executing it, so it could be analysed in the unlikely case that the owner of the key would still try their luck using it after discovery, on a patched system.
What it does: it's full RCE, remote code execution, it does whatever the attacker decides to upload. No mystery there.
it does whatever the decrypted/signed payload tells the backdoor to execute - it's sent along with the key.
The backdoor is just that - a backdoor to let in that payload (which will have come from the attacker in the future when they're ready to use this backdoor).
You are using someone else's proprietary technology, so you have to deal with their limitations. If you don't like it, there are endless alternatives.
"Wrongly denied" in this case depends on your point of view; clearly DALL-E didn't want this combination of words created, but you have no right to have these prompts fulfilled.
I'm the last one to defend large monolithic corps, but if you go to one expecting to be free to do whatever you want, you're already starting from a very warped expectation.
I don’t feel like it truly matters since they’ll release it and people will happily fine-tune/train all that safety right back out.
It sounds like a reputation/ethics thing to me. You probably don’t want to be known as the company that freely released a model that gleefully provides images of dismembered bodies (or worse).
I'm not saying it's bad, but it's definitely different than the others.