Also not taking hard stances, but both cases are suspect. Power efficiency becau...

stingraycharles · on May 22, 2022

If you’re writing optimized code, hardly ever would you evaluate one CRC check at a time. You would process them in chunks, as the OP stated, but would just let a compiler do the auto-vectorization.

This is even more true in the case of CRC, where there’s clearly almost always one branch that wins: this is perfect for branch prediction, which would mean the whole “if eq” condition is preemptively skipped.

saagarjha · on May 22, 2022

The compiler probably isn’t going to be able to autovectorize a CRC unless you help it out.

dottrap · on May 22, 2022

I think power efficiency has a lot more variables now so it is not easy to know if consumption is linear with time. CPUs now dynamically throttle themselves, plus now Apple has advertised that its M1 cores are divided up between high-performance and high-efficiency efficiency cores, let alone how the underlying chip itself may consumer power differently for implementing different instructions.

So for a hypothetical example, it could be that using general purpose SIMD triggers the system to throttle up the CPUs and/or move to the high performance CPUs, whereas the dedicated CRC instructions might exist on the high-efficiency cores and not trigger any throttling.

I've forgotten all my computer architecture theory, but if I look back at Ohm's law and look at power, the equation is P = I^2 • R. Handwaving from my forgotten theory a bit here, ramping up the CPUs increases current, and we see that it is a squared factor. So by cutting the time by say a factor of 3 does mean you are done 3 times faster (which is a linear component), you still have to contend that you have a squared component in current which may have been increased.

I have no clue if the M1 actually does any of this, but merely stating that it is not obvious what is happening in terms of power efficiency. We've seen other examples of this. For example, I've read that Intel's AVX family instruction generally increases the power consumption and frequency of when utilized, but non-obviously, it often runs at a lower frequency when in 256 or 512 wide forms compared to the lesser widths (which then requires more work on the developer to figure out what is the optimal performance path as wider isn't necessarily faster). And as another example, when Apple shipped 2 video cards in their Macbooks, some general purpose Mac desktop application developers who cared about battery life were tip-toeing around different high level Apple APIs (e.g. Cocoa, Core Animation, etc.) because some APIs under the hood automatically triggered the high performance GPU to switch on (and eat power), while these general purpose desktop applications didn't want or need the extra performance (at the cost of eating the user's battery).

saagarjha · on May 22, 2022

> whereas the dedicated CRC instructions might exist on the high-efficiency cores

M1 has a heterogeneous ISA, FWIW.

astrange · on May 22, 2022

Homogenous? The P and E cores have all the same instructions. You won’t get suddenly moved from one to the other or hit emulations.

saagarjha · on May 22, 2022

Oops, yes, that's what I meant. Thanks for catching that!