Similarly, I wish that on x86, REP MOVSB (and REP STOSB for fills) were the fastest way to copy memory, because they only take a few bytes in the icache. But fast memcpys end up being hundreds of bytes, working with larger words while handling alignment at the start and end.
The real problem (as with the CRC above) is that the fastest version for any given CPU may not be the fastest for any other. It's really short-sighted not to spend the area on some of these features (AES, CRC32, memcpy), because invariably you end up with a long-term optimization problem: in 5-10 years any given application has to run on one of a half dozen different CPUs rather than the one it was built for with -mtune=native, and that likely results in suboptimal perf on the latest CPUs, because the microarchitecture designers can't be constrained to assure that the newer core runs any given instruction sequence proportionally faster than the previous one. (I.e. overall perf may go up, but maybe something like the non-temporal store or the polynomial multiply doesn't keep up.)
And this is really the CISC vs. RISC argument, and why all these RISC CPUs have grown CISC-like instructions. If you want top perf in general code, you assure the REP STOS and MOVS sequences (or whatever) run the fastest microcoded version possible on a given core. But Intel sort of messed this up in the P6->Nehalem timeframe (IIRC when they added the fast-string flag) until they rediscovered this fact. IIRC Andy Glew admitted it was a bit of an oversight, combined with a release/area issue on the original PPro, that they intended to fix, but then it took 10 years.
It still is in general situations (i.e. not the microbenchmarks where ridiculously bloated, unrolled "optimised" implementations may have a very slight edge). I believe the Linux kernel uses it for this reason.
The kernel is a bit of a special case: a syscall very likely starts off with a cold I$, and there's also a lot of extra overhead if you insist on using SIMD registers.
In general I agree with you though, optimizing memcpy implementations only against microbenchmarks is dumb.
With ERMS it’s definitely not going to be slow, so it’s a good choice when you’re in a constrained environment (high instruction cache pressure, can’t use vector instructions).