> It is not my fault if Claude outputs something like "*1*, *1*", adding markdown highlighting, when most other models respect the required format correctly.
Yuck. At that point don't publish a benchmark; that attitude explains why the results are useless too.
-
Edit since I'm not able to reply to the below comment:
"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.
I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.
Why not? I described this in more detail in other comments.
Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external api's, parsing documents, etc.
Most models get this right. Also, this is just one failure mode of Claude.
Like I said in the edit, when people want specific formatting they ask for well known formats: Markdown, XML, JSON
I don't even need to debate if the benchmark is useful, it doesn't pass a sniff test: GPT-5.4 is not worse than Gemini 2.5 Flash in any way that matters to most users. In your benchmark it's meaningfully worse.
The questions do ask specifically to respond with the answer only, with an example format given in many cases.
Note that all reasoning models are tested with "medium" reasoning.
The benchmarks are questions/data processing tasks that an average user will likely ask, not coding questions (I didn't add any coding tests yet).
Gemini models also tend to be very consistent. Asking the same question will likely give the same result.
The two models you mention scored the same, the only difference is that Gemini was better at domain-specific questions (i.e. you ask something quite technical/niche).
If Gemini 2.5 Flash and GPT 5.4 perform the same for you, I'm glad.
It's not a useful finding for the rest of the world, and I sure hope non-technical people aren't being taken in by a steaming pile that implies those two LLMs perform similarly (among many other ridiculous findings), but c'est la vie.
Nowadays anyone can vibecode a "benchmark" with zero understanding of the domain, so what more should I expect?
I've always wondered if this (kinda widespread?) theory stems from most people thinking that "infinity" includes every possible option, which is not true.
Mathematician here, so educated layman on the physics but expert on infinity if you like.
Mathematically, "infinity" doesn't imply every possible option. But in terms of quantum physics, yes, it kind of does include every possible option. There is a kind of joke classroom exercise in quantum physics class to calculate the probability that a piano would instantaneously rematerialize a meter away from its previously observed location. It's 10^-[ridiculous number], but still, that's not zero.
The physical reconfiguration of a person's brain needed to make them break out singing is a much smaller deviation, so it's comparatively likely. So 10^-[somewhat less ridiculous number].
The bigger issue with all those non-zero probabilities is they're meaningless while you still experience actual time as a human...but become pretty damn significant when you experience no time after you die.
So tiny probabilities become essentially guarantees unless the heat death of the universe is so thorough as to erase the slight probability that the whole thing pops back into existence.
I think an example would be the two-body problem. The orbit keeps its eccentricity, so it does not explore different eccentricities, even though orbits with different eccentricities can have the same total energy.
(But I just looked that up too, because this concept is mostly used/assumed in statistical physics.)
Doesn't infinity include every possible option ("possible" meaning that it can happen within the rules of physics)? If the model of the universe is one where events happen with some probability, then if the probability is nonzero and the number of universes is infinite, the event should happen in some of the universes.
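The arithmetic behind that intuition is simple to check. A toy sketch in Python (the function name and numbers are mine, purely illustrative), assuming independent trials with a fixed nonzero probability p:

```python
# If an event has fixed probability p per independent trial, the chance
# it happens at least once in n trials is 1 - (1 - p)**n, which tends
# to 1 as n grows -- so "nonzero probability + unbounded trials"
# really does mean the event eventually happens almost surely.
def prob_at_least_once(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for n in (10, 1_000, 1_000_000):
    print(n, prob_at_least_once(1e-6, n))
```

With p = 10^-6, a million trials already puts the probability of at least one occurrence above 63%; the limit as n goes to infinity is exactly 1.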
While Microsoft is ending support for Windows 10 completely, Apple is just stopping feature upgrades. Apple usually supports old OS versions for years to come, especially when it's the only supported version for a lot of devices. So no, Intel Macs don't need to be retired.
The most painful parts are (1) it's a bit hot and loud under load; (2) you need to patch modern software like git, likely with little hope to upstream; (3) waiting hours for those "simple" things to compile - which, in the end, tells us something important about what we'd consider "simple" nowadays.
For both retro and previous-generation hardware, security is the most important concern. Patches for PowerPC kept coming until 2011 or so (that's almost 10 years after that particular machine was released). I'd expect the Intel Macs to keep getting official patches until 2030, and in the meantime I wouldn't be surprised to find community efforts to extend that. "Sorbet Leopard" was a thing for PPC Macs, the Hackintosh community is much stronger than back then.
> the Hackintosh community is much stronger than back then
Yeah but they'll be stuck on macOS 26. That's effectively the planned end of that community, they're not interested in running old versions of macOS on PCs.
People are patching newer macOS versions to run on older hardware (like OpenCore), running older OSes on PCs as they see fit, and all Macs allow downgrading (10.15 runs on the final 2019 models). I speculate that the community will settle around some version that strikes a decent balance between stability, features, and ease of patching.
Sure, but that community is interested in running the latest version of macOS on a PC. When Apple releases macOS 27 next year, they will have to think long and hard about their next move. Do I buy a Mac to keep my ability to run the latest version of macOS? Or do I tolerate that I'm running an old version of macOS, the first one with the new design that wasn't really finished in that version to boot?
I give it ten years until the websites of that community straight up disappear.
Those are all hobby projects for 20-30yro machines, few of which are left around. There are millions of Intel Macs in excellent shape. Someone will carry the mantle.
We're not talking about Intel Macs. Those are here forever as collectables. I'm talking about the continuing relevance of hackintoshes. Those will soon join the Intel Macs in the annals of history, and disappear as a relevant community.
Only for security vulnerabilities that "Apple is aware may have been actively exploited". And almost never for any bug fixes (and sadly, Apple now tends to push off bug fixes to the next major release/"n+1" rather than fix bugs in the major version in which they were introduced).
I don't understand why I'm downvoted. I don't think it's acceptable to keep a machine with known vulnerabilities "not yet actively exploited" for "most common uses". The defense of Apple here goes too far.
Replayability means something different in this context. First, we do know the backdoor passes the payload to system(), so in general it's as if an attacker has access to bash, presumably as root since it's sshd.
Replayability means, if someone were to catch a payload in action which did use the exploit, you can’t resend the attacker’s data and have it work. It might contain something like a date or other data specific only to the context it came from. This makes a recorded attack less helpful for developing a test… since you can’t replay it.
> It might contain something like a date or other data specific only to the context it came from.
In all these modern protocols, including SSHv2 / SecSH (Sean Connery fans at the IETF evidently) both parties deliberately introduce random elements into a signed conversation as a liveness check - precisely to prevent replaying previous communications.
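The liveness-check idea can be sketched in a few lines of Python. This is a toy model, not SSH's actual handshake, and all names here are mine: the verifier issues a fresh random nonce, the prover authenticates it (HMAC standing in for a real signature), and a captured response is useless later because every new exchange uses a different nonce.

```python
import hashlib
import hmac
import secrets

KEY = b"shared-secret"  # stand-in for the real session/authentication key

def challenge() -> bytes:
    """Verifier picks fresh randomness for each exchange."""
    return secrets.token_bytes(16)

def respond(nonce: bytes) -> bytes:
    """Prover authenticates the specific nonce it was given."""
    return hmac.new(KEY, nonce, hashlib.sha256).digest()

def verify(nonce: bytes, response: bytes) -> bool:
    return hmac.compare_digest(respond(nonce), response)

# Live exchange succeeds:
n1 = challenge()
r1 = respond(n1)
assert verify(n1, r1)

# Replaying the recorded response against a new challenge fails,
# because the new nonce (almost surely) differs from the old one:
n2 = challenge()
assert not verify(n2, r1)
```

The design point is that the signed material binds the response to this conversation; anything an eavesdropper records is dead weight the moment the next exchange picks new randomness.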
TLS 1.3's zero round-trip (0-RTT) mode cannot do this, which is why the spec basically says you'd better be damn sure you've figured out exactly why it's safe to use, including every weird replay scenario and why it's technically sound in your design, or else you must not enable it. We may yet regret the whole thing and just tell everybody to refuse it.
What could be done, I think, is patch the exploit to log the payload (and perhaps some network state?) instead of executing it, so it could be analysed in the unlikely case that the owner of the key would still try their luck using it after discovery, on a patched system.
What it does: it's full RCE, remote code execution, it does whatever the attacker decides to upload. No mystery there.
it does whatever the decrypted/signed payload tells the backdoor to execute - it's sent along with the key.
The backdoor is just that - a backdoor to let in that payload (which will have come from the attacker in the future when they're ready to use this backdoor).
You are using someone else's proprietary technology, so you have to deal with their limitations. If you don't like it, there are endless alternatives.
"Wrongly denied" in this case depends on your point of view; clearly DALL-E didn't want this combination of words created, but you have no right to have these prompts fulfilled.
I'm the last one to defend large monolithic corps, but if you go to one expecting to be free to do whatever you want, you're already starting from a very warped expectation.
I don’t feel like it truly matters since they’ll release it and people will happily fine-tune/train all that safety right back out.
It sounds like a reputation/ethics thing to me. You probably don’t want to be known as the company that freely released a model that gleefully provides images of dismembered bodies (or worse).
I'm not saying it's bad, but it's definitely different than the others.