
No. As the sibling comments said: say you have a program which spends 10% of its time in Python code between these numpy calls. The code is still not scalable, because you can run at most 10 such threads in a Python process before you hit the hard limit imposed by the GIL.

There is no need to eliminate the 10% or shrink it to 5% or whatever; people happily pay a 10% overhead for convenience. But being limited to 10 threads is a showstopper.



Python has no hard limit on threads or processes, and the pure Python code in between calls into C extensions is irrelevant because you would not apply multithreading to it. Even if you did, the optimal number of threads would vary based on workload and compute. No idea where you got 10.


> No idea where you got 10.

Because of the GIL, at most one thread in a Python process can be running pure Python code at any moment. If I have a computation that spends 10% of its time in pure Python and 90% in C extensions, I can launch at most 10 threads and still expect roughly linear scalability, because 10 * 10% = 100%.
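A toy sketch of that ceiling (illustrative numbers only; time.sleep stands in for a GIL-releasing C call, and a busy loop stands in for the GIL-held pure Python part):

    # Each task: ~0.9 s of work that releases the GIL (time.sleep drops it),
    # ~0.1 s of pure Python that holds it. The GIL-held parts serialize, so
    # total time is roughly max(1.0, n * 0.1) seconds: flat up to ~10 threads,
    # then growing linearly with the thread count.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def task(_):
        time.sleep(0.9)                        # stands in for a numpy / C extension call
        end = time.perf_counter() + 0.1
        while time.perf_counter() < end:       # stands in for GIL-held pure Python work
            pass

    for n in (1, 10, 20, 40):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(task, range(n)))
        print(f"{n:3d} threads: {time.perf_counter() - start:.2f} s")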

> the pure Python code in between calls into C extensions is irrelevant because you would not apply multithreading to it

No. There is a very important use case where the entire computation, driven by Python, is embarrassingly parallel, and you'd want to parallelize at that level instead of relying on internal parallelization inside each of your C extension calls. So the pure Python code in between the calls into C extensions MUST BE SCALABLE. The C extension code may not launch any threads at all.


This is your original comment, which as stated is simply incorrect:

> numpy isn't written in Python. However, there is a scalability issue: they can only drive so many threads (not 1, but not many) in a process due to GIL.

Now you have concocted this arbitrary example of why you can't use multithreading that has nothing to do with your original comment or my response.

> instead of having internal parallelization in each your C extensions call ... C extensions code may not launch thread at all.

I don't think you understood my comment - or maybe you don't understand Python multithreading. If a C extension is single threaded but releases the GIL, you can use multithreading to parallelize it in Python. e.g. `ThreadPool(processes=100)` will create 100 threads within the current Python process and it will soak all the CPUs you have -- without additional Python processes. I have done this many times with numpy, numba, vector indexes, etc.
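A minimal sketch of that pattern (the array sizes and thread count here are arbitrary illustrative values):

    # One Python process, one thread pool. numpy's matmul releases the GIL
    # while it runs, so the threads can occupy multiple cores concurrently.
    import numpy as np
    from multiprocessing.pool import ThreadPool

    def work(seed):
        rng = np.random.default_rng(seed)
        a = rng.standard_normal((1000, 1000))
        b = rng.standard_normal((1000, 1000))
        return float(np.linalg.norm(a @ b))    # GIL released inside the matmul

    with ThreadPool(processes=16) as pool:     # 16 threads in this one process
        results = pool.map(work, range(64))
    print(len(results))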

Even for your workload, using multithreading for the GIL-free code in a hierarchical parallelization scheme would be far more efficient than naive multiprocessing.


> This is your original comment, which as stated is simply incorrect

I apologize if I can't make you understand what I said, but I still believe I said it clearly and that it is simply correct.

Anyway, let me try to mansplain it again. "numpy isn't written in Python" - and numpy releases the GIL. So as long as Python code calls into numpy, the thread running the numpy call can proceed without the GIL, and the GIL can be taken by another thread to run Python code. I can easily have 100 threads running in numpy without any GIL issue. BUT, they eventually need to return to Python and retake the GIL. Say that for each such thread, out of every 1 second there are 0.1 seconds where it needs to run pure Python code (and must hold the GIL). Please tell me how to scale this beyond 10 threads.

> Now you have concocted this arbitrary example of why you can't use multithreading that has nothing to do with your original comment or my response.

The example is not arbitrary at all. This is exactly the problem people are facing TODAY in ANY DL training pipeline written in PyTorch.

I have a Python thread driving the GPUs asynchronously; it barely runs any Python code. All good, no problem.

Then I need to load and preprocess data [1] for the GPUs to consume. I need very high velocity when changing this code, so it looks like a stupid script: it reads data from storage and then does some transformation using numpy / whatever else I decided to call. Unfortunately, since dealing with the data is largely where the magic happens in today's DL, the code spends non-trivial time (say, 10%) in pure Python manipulating dicts in between all the numpy calls.

Compared to what happens on the GPUs this is pretty lightweight; it is not latency-sensitive and I just need good-enough throughput, and there are always enough CPU cores alongside the GPUs, so ideally I just dial up the concurrency. And then I hit the GIL wall.
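Roughly the shape of the code in question (all names here are hypothetical; the point is the pure-Python dict munging interleaved with the numpy calls):

    # Hypothetical sketch of the preprocessing step described above. The numpy
    # calls release the GIL, but the dict manipulation and control flow between
    # them are pure Python and serialize across threads under the GIL.
    import numpy as np

    def preprocess(sample: dict) -> dict:
        arr = np.frombuffer(sample["raw"], dtype=np.float32)   # GIL released in numpy
        arr = (arr - arr.mean()) / (arr.std() + 1e-6)          # GIL released in numpy
        out = {}                                               # pure Python from here on
        for key, value in sample["meta"].items():
            out["feat_" + key] = value
        out["data"] = arr
        return out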

> I don't think you understood my comment - or maybe you don't understand Python multithreading. If a C extension is single threaded but releases the GIL, you can use multithreading to parallelize it in Python. e.g. `ThreadPool(processes=100)` will create 100 threads within the current Python process and it will soak all the CPUs you have -- without additional Python processes. I have done this many times with numpy, numba, vector indexes, etc.

I don't think you understood what you did before, and you already wasted a lot of CPUs.

I never talked about speeding up any SINGLE computation. Maybe you are one of those HPC gurus who only care about solving single very big problem instances? Otherwise I have no idea why you are even talking about hierarchical parallelization after I already said that a lot of these problems are embarrassingly parallel and that they are important.

[1] Why don't I simply do it once and store the result? Because that's where the actual research is happening and what a lot of the experiments are about. Yeah, not model architectures, not hyper-parameters. Just how you massage your data.


Your original comment as written claims that numpy cannot scale using threads because of the GIL. You admit that is wrong, but somehow can't read your comment back and understand that it says that. What you really meant was that combinations of pure Python and numpy don't scale trivially using threads, which is true but not what you wrote. You were actually just thinking of your PyTorch-specific use case, which you evidently haven't figured out how to scale properly, and oversimplified a complaint about it.

> I don't think you understood what you did before and you already wasted a lot of CPUs.

No CPUs were wasted lol. You are clearly confused about how threads and processes in Python work. You also don't seem to understand hierarchical parallelization, which is simply a pattern that works well in cases where you can better maximize parallelism using a combination of processes and threads.
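A sketch of that pattern with illustrative worker counts: a few processes sidestep the GIL for the pure-Python portions, and each process runs a thread pool for the numpy calls that release the GIL.

    # Hierarchical parallelization sketch (illustrative counts only): processes
    # for the GIL-bound pure Python work, plus a thread pool inside each process
    # for numpy calls that release the GIL.
    import numpy as np
    from multiprocessing import Pool
    from multiprocessing.pool import ThreadPool

    def thread_task(seed):
        rng = np.random.default_rng(seed)
        return float(rng.standard_normal((500, 500)).sum())   # numpy work, GIL released

    def process_task(seeds):
        with ThreadPool(processes=8) as threads:               # threads inside each process
            return threads.map(thread_task, seeds)

    if __name__ == "__main__":
        chunks = [list(range(i, i + 8)) for i in range(0, 64, 8)]
        with Pool(processes=4) as procs:                       # processes for the GIL-bound parts
            results = procs.map(process_task, chunks)
        print(sum(len(r) for r in results))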

There are probably better ways to address your preprocessing problem, but I get the impression you're one of those people only incidentally using Python out of necessity to run PyTorch jobs, and you're either frustrated by that or haven't yet come to the realization that you need to learn how to optimize your Python compute workload, because PyTorch doesn't do everything for you automatically.



