High-Performance GPU Computing in the Julia Programming Language (2017)

maleadt · on Oct 28, 2019

Author here, happy to answer any questions! We've been developing and maintaining this toolchain for a while now, so the relevant packages (CUDAnative.jl for kernel programming, CuArrays.jl for a GPU array abstraction) are much more mature. Our focus has recently been on implementing a common base of array operations that can be used across devices (GPU, CPU, etc), so that users can develop using the base CPU array type, quickly benefit from a GPU by switching to CuArrays, only to rely on specific CUDA-specific functionality from CuArrays/CUDAnative when they need custom functionality.

adamnemecek · on Oct 28, 2019

Julia is one of my fav languages. For numerical computing, neither python + numpy, nor matlab come even close. The interop is nuts. To call, say numpy fft, you just do

using PyCall

np = pyimport("numpy")

res = np.fft.fft(rand(ComplexF64, 10))

No casting back and forth. This is a toy example, julia ofc has fftw bindings.

Interop with C++, MATLAB, Mathematica etc is similarly simple.

siproprio · on Oct 28, 2019

In theory Julia is supposed to be fantastic.

In practice, things either don't exist, or are poorly implemented:

Plotting simple things take 30 seconds.

And that's if you don't count the time it takes to `] add Plots`, especially on Windows!

And the REPL is broken.

And the editor is slow and annoying (Juno or vscode).

And documentation ranges from poor (no examples, buggy between platforms, broken links due to version updates) to non-existent. For example, lots of tutorials will often link to broken links to official documentation, links that one time were thought to be working but now aren't.

And so on...

ddragon · on Oct 28, 2019

Your 1st, 2nd and 4th points seem to be fundamentally the same, which is compile time latency making interactive use slow. That's definitely a problem for a language trying to solve the two language problem of having high interactivity and high performance at the same time, and the compiler team [1] are now focusing on that issue on versions 1.4 and beyond hopefully it will get to the point where it isn't a problem anymore.

Documentation is always a problem (especially for smaller packages), but I don't feel I had more issue with Julia than other languages (with a few exceptions, some languages managed to have an exceptional culture in terms of great documentation). Most of the issues were due to the fact that Julia just got to 1.0 a year ago, and the breaking changes made so most documentation became outdated, but this will only become less of a problem since the language became stable.

Julia has one of the best REPL of any language I used, and thankfully I didn't meet with any problems with it. Might be a good idea to create an issue on the github.

[1] https://discourse.julialang.org/t/compiler-work-priorities/1...

siproprio · on Oct 28, 2019

Nope, they are not the same points!

1. It's about compiler latency. Compiling happens when you call the function. That indeed can and should be improved.

But there are other things that contribute to the experience being shitty.

2. Is about adding a package, when adding a package it downloads all the dependencies (which includes cairo, and WinRPM on windows which all have problems of their own).

4. Is about the poor atom experience - I'm not a fan of electron apps myself - and about the slowness of the LSP on vscode, which just does not provide a good experience on vs code and things are often broken, especially on the latest versions.

If you think all these things are "compiler latency" then perhaps you're part of the problem.

ddragon · on Oct 28, 2019

Fair enough, you mentioned the speed of adding a package and I assume it was the build time after the download (which is mostly git). I have to confess that I don't consider the speed of the initial build essential since it's a one time thing (correctly and efficiently building and handling dependencies is, which it does very well in my opinion). Also, I think they are testing now a way to deliver binary libraries within the deps which can improve the situation with the external dependencies.

Visual Studio Code is definitely fast enough in my machine (especially with a Revise.jl workflow), which is why I also assumed it was stuff like running part of the code or using the Language Server Protocol, which will hit the same compilation lag. I agree that Atom is slow, and it's one of the reason I don't use it.

Though I'm clearly biased since my experience is entirely in Linux, it's possible that the Windows experience is just worse.

siproprio · on Oct 28, 2019

The VS Code extension took a long time to support newer versions of the language.

What happens in between figuring out how to make basic tooling work and actually getting something done is frustration.

Building is so clearly not a one-time thing if you use the language to explore solutions, and play with things.

socialdemocrat · on Oct 28, 2019

Yes plotting in Julia is slow upon first invocation due to the JIT. Annoys me too, but to say the REPL is broken is profoundly puzzling to me. It is the best REPL I have ever used. It beats anything I have used for Python, Ruby, JavaScript, Lua etc.

Also your documentation issue is also strange. Yes certain things don’t exist but I would say the Julia docs is quite well made. In particular if you use the REPL documentation I find it much better than Python. Tends to be quite nice examples, color coding etc.

Oreb · on Oct 28, 2019

> It is the best REPL I have ever used. It beats anything I have used for Python, Ruby, JavaScript, Lua etc.

This is true, but that's a _very_ low bar to pass. Julia is a Lisp, and deserve to be compared to other Lisps rather than to lesser languages. Every Common Lisp or Scheme I have used has a vastly superior REPL experience than Julia. Even Clojure is better.

Don't get me wrong: I love Julia, and I hope it will eventually replace Python as the main language for scientific computing, data science and machine learning. But the REPL experience, at this point, leaves a lot to be desired. I'm sure it will improve in the future.

StefanKarpinski · on Oct 28, 2019

I’m curious what specifically you would want improved in the REPL.

siproprio · on Oct 28, 2019

Does the REPL on Windows have all the features and niceties and quality of life of the REPL on bash or other OSes?

If so (which isn't), then we can start suggesting new features, perhaps better text editing capabilities, or introspection, better access to documentation.

StefanKarpinski · on Nov 4, 2019

The REPL on all platforms is the same. There’s an issue with old buggy versions of cmd.exe, but that’s only on such old versions of Windows that they’re not even supported by Microsoft anymore.

siproprio · on Oct 28, 2019

The shell is not broken because it's slow.

It's broken because it has poor and puzzling exceptions, because output lags on Windows from gtk bugs (known for years, never fixed, huge GitHub discussion),and because the shell mode is often broken.

cbkeller · on Oct 28, 2019

Things are still improving, but I already find it quite usable in practice. Yes, precompilation can take a while, but once you start using the language regularly you hardly notice it, since you only need to re-precompile after installing updates -- which ends up being a small proportion of the time. Same with "time to first plot", since I almost always have a session already running already these days.

zzleeper · on Oct 28, 2019

Similar experience here. Did a comparison of a bunch of statistical tools (R, Matlab, Julia, Python, etc.) on small-ish datasets. Used the latest versions in all cases, in Windows 10. All but Julia ran the regressions in <1s, while Julia took 20+ seconds, mostly importing the required libraries and just starting up.

Sure, the usual answer is "well its an initial cost, its faster after that" but not all my code would otherwise take days to run. As long as "using CSV" takes 10 seconds, I'm out.

xiaodai · on Oct 28, 2019

This is a common issue that people encounter. I am glad that compilation latency is now the top priority in terms of Julia-compiler work. Hoping to see something interesting from there

mirekrusin · on Oct 28, 2019

Low latency development flow is slightly different in Julia. You should startup you process and reload updated code with revise. You won't have this problem. Obviously compilation performance improvements will be very welcome when they arrive but it's not a deal breaker because of this revise based flow.

alpaca128 · on Oct 28, 2019

I just tried Revise. Yes, it's very nice and usable for programs without dependencies, but including a Plots example program that normally takes Julia 1 minute and 20 seconds took 19 minutes. There's only so much time I'm willing to wait for that.

Currently I'm mainly using Jupyter Notebook and that is by far the best experience I've had(it's like Revise but much, much faster). But to me it seems Jupyter Notebook wasn't designed with code outside of a single isolated file in mind, which makes it cumbersome in some cases.

I like the language, but I hope the situation improves soon. Editing code in the browser is not that much fun.

mirekrusin · on Oct 28, 2019

That sounds crazy long, maybe something is not right, have you tried reaching at https://discourse.julialang.org for help for example?

stabbles · on Oct 28, 2019

> And the REPL is broken.

You should really back up that claim, because in my experience it's absolutely great

xiaodai · on Oct 28, 2019

Sounds like road to maturity problems. The key is to think, "are these issues solvable?" if they are it's only a matter of time. The next thing is when those things are fixed, does Julia offer something above and beyond what Python and R can do (easily). If the answer is for you, thne it's a matter of whether Julia provides value now for you.

systems · on Oct 28, 2019

Just pointing out that this is an article from 2017 and was discussed before on hn https://qht.co/item?id=15564639

vasili111 · on Oct 28, 2019

I always glad to see topics about Julia. I think it has good potential to replace several other languages with one better language.

sytelus · on Oct 28, 2019

Why this infrastructure is so tightly coupled with CUDA? CUDA is very specific and closed APIs for NVidia hardware only. Programming languages should focus on more general primitives that might work on NVidia or TPUs or something else. PyTorch also has CUDA all over in its APIs and its frustrating to see such tight binding with closed one company API. Also take a look at OpenCL.

maleadt · on Oct 28, 2019

Our view is that to get performance out of a system (here CUDA), it's better not to start abstracting it right away. So we have CUDAnative.jl and CUDAdrv.jl for fairly low-level CUDA programming, albeit in a high-level language. However, with CuArrays.jl we implement the Julia array interface for CUDA GPUs. That means you can write array code for one platform (CPU using Base.Array) and start using hardware accelerators by just switching the array type (CUDA GPU using CuArray). Of course, real-life applications might still need to use CUDA specific functionality for one reason or another, but at least you can get most of the way without platform-specific programming.

darknoon · on Oct 28, 2019

It's because CUDA performs better. It's not nice, but it's the situation we're living in. Particularly AMD support and performance are lot.

Athas · on Oct 28, 2019

Are you certain that the story is as simple as "CUDA performs better"? It's common folklore, but I have seen little evidence. The only situations I know of when CUDA performs better is when CUDA-specific features are used (if they are relevant for whatever problem is at hand). Also, CUDA libraries (like cuBLAS or cuFFT) tend to be more efficient than their OpenCL equivalent, which is likely because much more work has gone into them. I have also noted that the CUDA compiler is willing to use less accurate (but faster) floating-point instructions by default (for things like e.g. inverse square root), where you need to pass options to the OpenCL compiler for it to do the same. This will matter for some programs.

In fact, I have run tens of thousands of lines of essentially equivalent CUDA and OpenCL code (automatically generated) on the same hardware, and performance was in all cases very similar[0]. If anything, CUDA was actually slower than average (but in the cases I investigated, this was down to arbitrary differences like the CUDA compiler not unrolling some loops as aggressively and such).

[0]: https://futhark-lang.org/blog/2019-02-08-futhark-0.9.1-relea...

llukas · on Oct 28, 2019

Did you compare the performance of nvrtc vs offline nvcc compiler?

Athas · on Oct 28, 2019

No; the code we would need to generate would be rather different. Would you expect a significant difference? When we did research on nvrtc before implementing this, we couldn't find any concrete information that nvrtc should generate slower code.

sytelus · on Oct 28, 2019

Sure, I don't mind CUDA backend as first class citizen. I'm talking about having my code sprinkled with word "cuda" all over. Why can't I write my code that is bit more abstract and potentially compilable to different backends? That is, think about the primitives instead of tightly getting married to cuda forever. AMD performance might not be good today but how about 10 years later? How about using TPUs instead? or FPGAs (if someone creates backend for it)?

KenoFischer · on Oct 28, 2019

Well, one problem is that you're reading an NVIDIA marketing post on an NVIDIA blog talking in particular about the lowest levels of the stack targeting NVIDIA hardware. Higher level abstractions can and do just work across different hardware backends (not as well as we'd like, but we have some thoughts on how to improve that).

shmerl · on Oct 28, 2019

It doesn't perform better than what you can do in Vulkan. It's simply more entrenched.

jlebar · on Oct 28, 2019

> Why this infrastructure is so tightly coupled with CUDA?

It's not. It uses LLVM, which can easily target AMD GPUs. (Whether the Julia folks have invested in making this work, I dunno, but it's not Extremely Hard.)

Understandably nvidia gives you the wrong impression.

vchuravy · on Oct 28, 2019

We are indeed interested in targeting AMD GPUs. There is a prototype backend available at https://github.com/JuliaGPU/AMDGPUnative.jl and we are closely following the status of SPIR-V and Intel GPUs in LLVM.

The focus on CUDA comes from the fact that most HPC systems for scientific computing are using Nvidia GPUs. That is finally slowly changing.

pjmlp · on Oct 28, 2019

Because Khronos up to a little while lived on a bubble that we have to use C, write our own compiler and linking logic to use GPGPUs and collect debugging toolchains from each OEM.

Only when they started getting a beating of PTX bytecode and multi-language deployment on CUDA did they woke up and came up with SPIR (later SPIR-V) and SYCL, which still isn't widely deployed.

shmerl · on Oct 28, 2019

Because Nvidia likes lock-in. It totally doesn't have to. Today we have Vulkan for general purpose GPU programming.

llukas · on Oct 29, 2019

OpenCL, hip, Vulkan, what tomorrow? Or alternatively, there were cl* libraries, roc* libraries and hip* libraries for AMD? Which ones are supported?

CUDA doesn't require rewriting code with ${OSS} framework of the year, every year. They need to earn that lock-in with future compatibility guarantees which none of OSS projects has.

shmerl · on Oct 29, 2019

Whatever it is, as long as it's not tied to one GPU only, it could be promising. Something that's tied to Nvidia or anyone else exclusively is not good, and surely isn't democratizing anything.

> CUDA doesn't require rewriting code with ${OSS} framework of the year, every year.

How so? Change the GPU from Nvidia, and you are forced to rewrite code. That's the whole point of lock-in, it's a tax on developers. CUDA doens't guarantee you anything, if you don't stick with their GPUs.

Vulkan on the the other hand has conformance requirements.

dlphn___xyz · on Oct 28, 2019

there are specific benefits of cuda over opencl: see https://arxiv.org/vc/arxiv/papers/1005/1005.2581v1.pdf

sytelus · on Oct 28, 2019

Yes, but is cuda going to keep its edge 10 years down the line? Do I want to hardcode my algorithms so tightly with today's cuda APIs? Can there be better more generic primitives that are agnostic of propitiatory cuda APIs but would support it as backend without too much perf hit?

xiaodai · on Oct 28, 2019

If you want performance then yeah. If you are after hypothetical performance in future which may not even materialise, then the choice is yours. Everyone knows where the sensible ground is. Which, unfortunately, is CUDA only

shaklee3 · on Oct 28, 2019

AMD has a search and replace library that's API compatible with many cuda functions now. It hasn't caught on yet, but if they release decent hardware soon, it might.

refresh-creds · on Oct 28, 2019

That paper is already about 10 years old so I think you are being trolled.

idnefju · on Oct 28, 2019

OpenCL isn't in a good place. CUDA has become the industry standard.

m4r35n357 · on Oct 28, 2019

Julia is presented as a simple language, but is is anything but that in practice.

ddragon · on Oct 28, 2019

Julia is not presented as a simple language, it's presented as a "I want everything" language [1], a Python-Ruby-Perl-C-Fortran-Lisp-Matlab crossover with it's own unique spice. Which is completely opposite from something like Go. You can start programming knowing only one of Julia's inspiration, for example programming Julia like Python, but if you want all the language brings you'll have to dive in a lot of the other sides (which might clash a little with the cleverness of the compiler, as it will accept such varied styles it will not guide you to the one through way of idiomatic Julia code).

Still the Julia team did a great job in making all those diverse features feel part of one connected philosophy instead of an ad hoc pile of functionality, even if it does take a little while to fully internalize it.

[1] https://julialang.org/blog/2012/02/why-we-created-julia

shmerl · on Oct 28, 2019

> The performance possibilities of GPUs can be democratized by providing more high-level tools that are easy to use by a large community of applied mathematicians and machine learning programmers.

How exactly CUDA is "democratizing" anything, if it's tied to Nvidia? Vulkan backend would make more sense for that purpose.

rrss · on Oct 28, 2019

That sentence explains perfectly well what it means by democratizing, and how is independent of the platform being tied to nvidia.

shmerl · on Oct 28, 2019

Can you elaborate please? I was under the impression that CUDA is tied to Nvidia, unless you mean there are now working shims for other GPUs.

Athas · on Oct 28, 2019

CUDA is ultimately an API. AMD even has a converter for transforming CUDA cuda to something more portable[0].

While it would be better in a democratic sense for GPUs to be accessed using a fully free API, having an easily usable proprietary API is still more democratic than a difficult-to-use API (especially when, as here, the easy-to-use layer is actually fully free, and can perhaps be retargeted to fully free lower layers later).

[0]: https://gpuopen.com/compute-product/hip-convert-cuda-to-port...

shmerl · on Oct 28, 2019

It still looks like porting idea, not like a shim that makes CUDA run on AMD. So I'd say CUDA is still locked to Nvidia. AMD are trying to ease up the transition to portable options - that's surely good, but it's not a full fledged lock-in unlocking.

I'd say, Nvidia are being hypocritical here, with this whole "democratizing" claim. They are direct beneficiaries of the lock-in they are advancing with it.

pjmlp · on Oct 28, 2019

It allows us to use any programming language with PTX backend.

OpenCL on the other hand is C FTW and now kind of supports C++ if one has luck with the drivers.

From that point of view is democratizing GPGPU programming to anyone that doesn't want to deal with either C or C++.

shmerl · on Oct 28, 2019

That's not democratizing GPU computing, that's "democratizing" Nvidia lock-in. Totally different thing, so their claim was hypocritical, since they made it sound like a general thing.

pjmlp · on Oct 29, 2019

It is easy to sort out, Khronos just needs to accept that a large majority of developers want productive SDKs, not raw specifications based on C, and with luck some C++ as well.

I also don't see you complain that so far the only mature SYCL SDK is available from Codeplay, thus making it a single vendor "standard". At least until Intel (One API) and others actually come one with their SYCL extensions, because naturally nothing that Khronos does can be without extensions and its multiple execution paths.