Hacker News

It's because CUDA performs better. It's not nice, but it's the situation we're living in. In particular, AMD support and performance are lacking.


Are you certain that the story is as simple as "CUDA performs better"? It's common folklore, but I have seen little evidence. The only situations I know of where CUDA performs better are when CUDA-specific features are used (if they are relevant for whatever problem is at hand). Also, CUDA libraries (like cuBLAS or cuFFT) tend to be more efficient than their OpenCL equivalents, which is likely because much more work has gone into them. I have also noted that the CUDA compiler is willing to use less accurate (but faster) floating-point instructions by default (e.g. for inverse square root), where you need to pass options to the OpenCL compiler for it to do the same. This will matter for some programs.

In fact, I have run tens of thousands of lines of essentially equivalent CUDA and OpenCL code (automatically generated) on the same hardware, and performance was in all cases very similar[0]. If anything, CUDA was slightly slower on average (but in the cases I investigated, this came down to incidental differences, such as the CUDA compiler not unrolling some loops as aggressively).

[0]: https://futhark-lang.org/blog/2019-02-08-futhark-0.9.1-relea...


Did you compare the performance of nvrtc vs offline nvcc compiler?


No; the code we would need to generate would be rather different. Would you expect a significant difference? When we researched nvrtc before implementing this, we couldn't find any concrete evidence that nvrtc generates slower code.


Sure, I don't mind a CUDA backend as a first-class citizen. I'm talking about having my code sprinkled with the word "cuda" all over. Why can't I write code that is a bit more abstract and potentially compilable to different backends? That is, think about the primitives instead of getting tightly married to CUDA forever. AMD performance might not be good today, but how about 10 years later? How about using TPUs instead? Or FPGAs (if someone creates a backend for them)?


Well, one problem is that you're reading an NVIDIA marketing post on an NVIDIA blog talking in particular about the lowest levels of the stack targeting NVIDIA hardware. Higher level abstractions can and do just work across different hardware backends (not as well as we'd like, but we have some thoughts on how to improve that).


It doesn't perform better than what you can do in Vulkan. It's simply more entrenched.



