
Wait, where in glibc is SIMD used anyway?


All string functions: not surprising since recent processors have the crazy PCMPxSTRy instructions that are basically a hardware implementation of strcmp, strchr, memchr, strspn etc.

memcpy and memset use SSE on some generations, but these days are best inlined by the compiler as "rep movsb/stosb".
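For illustration, a minimal `rep movsb` copy can be written with GNU inline assembly on x86-64. This is a sketch, not glibc's or any compiler's actual implementation:

```c
#include <stddef.h>

/* Sketch of a "rep movsb" memcpy via GNU inline assembly (x86-64 with
 * GCC/Clang only). The CPU's microcode moves the bytes, and on CPUs with
 * fast-string support it copies wide chunks internally. */
static void *repmovsb_memcpy(void *dst, const void *src, size_t n) {
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n) /* rdi, rsi, rcx */
                     :
                     : "memory");
    return ret;
}
```

The attraction is code size: three registers and a two-byte instruction, with the microarchitecture deciding how wide the actual copies are.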


I'd better hope AVX-512 isn't used on simple string functions, as it causes so much heat that Intel CPUs have to reduce their clock speed significantly, affecting other running processes as well.

https://www.tcm.phy.cam.ac.uk/~mjr/IT/clocks.html


The instructions and instruction set referenced in the comment you're replying to (PCMPxSTRy and SSE) are not AVX-512; they were introduced long before any AVX-512 chips were manufactured.


> it causes so much heat Intel CPUs have to reduce their clock speed significantly,

How about: don't buy the affected CPUs if you care. The last thing people who write libraries should care about is a few select CPUs that underperform on a generally useful feature.


Nope. At this point, given the state of the market, and given what AVX-512 does to begin with (so maybe the "at this point" will last forever), it makes very little sense to use it everywhere. It would just be actively detrimental to most users.


There’s nothing wrong with dropping the clock speed when the whole point of SIMD instructions is that they execute on multiple data. As long as the clock speed drop is outweighed by the multiplicative speedup from doing 4x/8x/whatever operations in parallel, it’s fine.
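As a back-of-the-envelope model of that trade-off (all numbers below are hypothetical, not measured on any real CPU):

```c
/* The wider vector unit wins only if the lane-count gain outweighs the
 * downclock: net = (reduced_clock / base_clock) * lane_ratio, and > 1.0
 * means the SIMD path still comes out ahead for this process. */
double net_speedup(double base_ghz, double simd_ghz, double lane_ratio) {
    return (simd_ghz / base_ghz) * lane_ratio;
}
/* e.g. a 3.0 GHz core dropping to 2.4 GHz for 2x-wide vectors:
 * net_speedup(3.0, 2.4, 2.0) is about 1.6, so the downclock pays off
 * for the vectorised code itself. */
```

Note the model only accounts for the process issuing the wide instructions, which is exactly the objection raised in the replies.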


My point is that the clock speed is also dropped for other processes. So it's not a simple heuristic of "oh, if we have more data than X we can use AVX-512 instead of 256-bit AVX", because you do not know what other processes are doing, and they will be slowed down as well.

This type of heuristic only works if you are the sole user of the server, e.g. a database server.


The problem is that the architecture in question would take forever, in CPU-time terms, to switch back and forth, so a sprinkle of heavy AVX-512 was the worst of all worlds. In addition, some CPUs would emulate 512-bit operations with half-width micro-ops during voltage ramp-up. Yes, it sounds crazy.


No, it is not fine at all. If you have some AVX-512 instructions but not enough of them, your whole CPU will run slower, mostly on non-AVX-512 instructions, and maybe even your whole package; the end result will just be slower. That's not to say those processors are bad, but AVX-512 on them is for niche workloads.


> memcpy and memset use SSE on some generations, but these days are best inlined by the compiler as "rep movsb/stosb".

Why? I’ve usually seen this get compiled to some version of “mov byte ptr, inc”.


You want to copy more than one byte at a time (unrolling by 4, 8 or more; and for memset, replicating the stored byte by multiplying it by 0x01010101...). In recent processors, rep movsb and rep stosb do all that in microcode, and are also able to copy or fill the destination one cache line at a time.
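A portable sketch of the non-SIMD version of that multiplication trick (tail and alignment handling deliberately simplified; not glibc's actual code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate the byte across a 64-bit word by multiplying with
 * 0x0101010101010101, then store whole words instead of single bytes. */
void wordwise_memset(void *dst, unsigned char c, size_t n) {
    uint64_t pattern = (uint64_t)c * 0x0101010101010101ULL;
    unsigned char *p = dst;
    while (n >= 8) {
        memcpy(p, &pattern, 8);  /* compilers lower this to one 8-byte store */
        p += 8;
        n -= 8;
    }
    while (n--)                  /* leftover tail, one byte at a time */
        *p++ = c;
}
```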



That's not a very good example; this code is a generic (although non-portable) optimized C implementation, not something making use of fancy processor instructions. It is even less probable that the compiler understands it and manages to optimize it into dedicated instructions when it is written that way.


Read the bug again! The feature is about "transparently" loading libraries that are put in specific subdirectories based on the CPU so that a single configuration can work fast everywhere. Intel's Clear Linux wrote about that in https://clearlinux.org/news-blogs/transparent-use-library-pa....
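For illustration, on a current glibc (2.33 or later) the same idea shows up as `glibc-hwcaps` subdirectories that the dynamic loader probes before the generic library path; the exact layout shown here is an assumption about a typical x86-64 distribution:

```shell
# ld.so searches per-capability subdirectories first, so a distribution
# can ship e.g. an AVX2-tuned build of a library without any
# configuration changes (directory names assume glibc >= 2.33, x86-64):
ls /usr/lib/glibc-hwcaps/
# x86-64-v2  x86-64-v3  x86-64-v4
```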


I know that it is used at least for the implementation of functions like memcpy.


Compilers do autovectorization (with varying degrees of success) so potentially in lots of places?
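For example, a plain loop like this one is a typical autovectorisation candidate (the function is illustrative, not from glibc); with `-O3` (plus e.g. `-mavx2`) GCC and Clang will usually compile it to SIMD instructions:

```c
/* "restrict" promises the arrays don't alias, which is what makes the
 * vectorised transformation legal for the compiler. */
void add_arrays(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}
```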


This is about the library not the compiler.


A library is compiled with a compiler.


This check is the opposite of compiler autovectorisation. Its point is specifically for the library to dispatch between implementations (e.g. to an explicitly vectorised one).


Re-reading this, the check is orthogonal to autovectorisation. The opposite of autovectorisation would be manually written SIMD assembly / intrinsics code.


Yes, and it can make sense to dispatch between compiler-vectorised and non-simd implementations.


It does make sense, the issue at hand is how it's done: by checking for the CPU vendor / product line rather than asking the CPU if it has the operations we want.
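A sketch of what capability-based dispatch can look like with GCC/Clang's `__builtin_cpu_supports` (x86 only): the CPU is asked whether it has the feature, rather than matching vendor or model strings. The "tuned" branch here just returns plain `memcpy` as a stand-in; a real library would point it at an AVX2 implementation:

```c
#include <string.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

memcpy_fn pick_memcpy(void) {
    __builtin_cpu_init();                    /* populate CPU feature flags */
    if (__builtin_cpu_supports("avx2"))
        return memcpy;                       /* hypothetical AVX2-tuned version */
    return memcpy;                           /* generic fallback */
}
```

glibc's own mechanism for this is the ifunc resolver, which runs a check like the above once at relocation time.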


Yep - capability-based switching makes the most sense when you are just interested in what code can successfully execute, and not making decisions based on known perf characteristics of microarchitectures.

But this subthread was about "where in glibc is SIMD used anyway?", and the answer is: potentially, in all code.



