Should we enable boost threads in bootstrap?
Sorry for not replying directly to the existing thread, but I've only just subscribed to cctbxbb. Since I started the discussion of the Global Interpreter Lock (GIL) yesterday, I thought I should give a quick run-down of why it's important and where/how it can be released.

In brief, the much-maligned GIL exists because much of the Python API proper (in particular, reference counting) is not thread-safe. Its purpose is to ensure that only one thread can ever be using Python objects at any given time. In practice this means that a naive implementation of Python threads gives you the worst of all possible worlds: parallel logic, but slower-than-single-threaded performance. Python regularly tries to swap between all the threads it has running, but only the one that currently holds the GIL can make progress. So if one thread runs a method that takes 10 seconds without releasing the GIL, *all* threads will hang for 10 seconds.

To be clear, this *doesn't* prevent you from using C++ threads, as long as those threads are not acting on Python objects.

The thing is, in most production code the heavy-duty computation is *not* done on Python objects - it's done in C++ on objects that Python simply holds a pointer to. All such functions can safely release the GIL as long as they reacquire it before returning to Python. In the context of the example above, that means all other threads are able to continue doing their own thing while that 10-second function is running.

So why use threads rather than multiprocessing? Three key reasons:
- threads are trivial to set up and run in the same way on Linux, macOS and Windows
- threads share memory by default, making life much easier when they need to communicate regularly
- it turns out that OpenCL/CUDA do not play at all well with forked processes

A real-world example: in ISOLDE I'm currently bringing together two key packages with their own Python APIs: ChimeraX for the molecular graphics, OpenMM for MD simulation. Since this is an interactive application, speed is of the essence. Graphics performance needs to be independent of simulation performance (to allow the user to rotate/translate etc. smoothly no matter how fast the simulation runs), so parallelism is mandatory. There is constant back-and-forth communication (the simulation needs to update coordinates; interactions need to be sent back to the simulation), which is easier and faster in shared memory. The simulation runs on a GPU. My initial implementation used Python's multiprocessing module (which relies on os.fork()) - this worked on Linux under the very specific circumstances where the GPU had not previously been used for OpenCL or CUDA by the master process, but failed on the Mac and is of course impossible on Linux. Switching from multiprocessing to threading (with no other changes to my code) gave me an implementation that works equally well on Linux and Mac with no OS-specific code, and should work just as well on Windows once I get around to doing a build. The performance is effectively equal to what I was getting from multiprocessing, since all the major libraries I'm using (ChimeraX, OpenMM, NumPy) release the GIL in their C++ calls. At the moment, though, CCTBX functions don't - but there's no reason why they can't. See https://wiki.python.org/moin/boost.python/HowTo#Multithreading_Support_for_m....
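For reference, the approach described on that wiki page boils down to a small RAII guard around the CPython thread-state calls. A minimal sketch (the class name is mine, not existing cctbx code):

#include <boost/python.hpp>  // pulls in Python.h for the thread-state API

// Scoped GIL release: while an instance of this is alive, the calling
// thread does not hold the GIL, so it must not touch any Python objects.
class gil_release
{
public:
  // Release the GIL, saving this thread's Python state...
  gil_release() { thread_state_ = PyEval_SaveThread(); }
  // ...and reacquire it before control returns to Python.
  ~gil_release() { PyEval_RestoreThread(thread_state_); }
private:
  PyThreadState* thread_state_;
};

Dropping an instance of this at the top of a wrapped C++ function's body releases the GIL for the duration of the call and guarantees it is reacquired on every exit path, which is all that is needed for the behaviour described above.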
A simple toy example using Python threads to speed up a heavyish computation:

from multiprocessing.pool import ThreadPool
import numpy
from math import ceil
from time import sleep

def f(a, target, start_i, end_i):
    # NumPy releases the GIL inside its ufuncs, so each worker's slice can
    # be computed in parallel with the others.
    arr = a[start_i:end_i]
    target[start_i:end_i] = numpy.exp(numpy.cos(arr)+numpy.sin(arr))

def test_threads(a, num_threads):
    # Split the array into num_threads contiguous chunks and hand one chunk
    # to each worker thread.
    l = len(a)
    ret = numpy.empty(l, a.dtype)
    stride = int(ceil(l/num_threads))
    with ThreadPool(processes=num_threads) as p:
        for i in range(num_threads):
            start = stride*i
            end = stride*(i+1)
            if end > l:
                end = l
            p.apply_async(f, (a, ret, start, end))
        p.close()
        p.join()
    return ret

a = numpy.random.rand(50000000)
target = numpy.empty(len(a), a.dtype)

%timeit f(a, target, 0, len(a))
2.35 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_threads(a, 1)
2.55 s ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_threads(a, 2)
1.5 s ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit test_threads(a, 3)
1.2 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
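And to tie the two halves together: a heavyish C++ function exposed through Boost.Python can release the GIL in exactly the same way NumPy does, so that plain Python threads calling it overlap. A hypothetical sketch (heavy_sum and the module name are invented for illustration; real cctbx wrappers would of course operate on the existing array types):

#include <boost/python.hpp>
#include <cmath>
#include <cstddef>

// Same RAII guard as in the earlier sketch, repeated here so this
// fragment compiles on its own.
class gil_release
{
public:
  gil_release() { thread_state_ = PyEval_SaveThread(); }
  ~gil_release() { PyEval_RestoreThread(thread_state_); }
private:
  PyThreadState* thread_state_;
};

// A deliberately heavy pure-C++ loop. With the guard in place, several
// Python threads calling this at once genuinely run in parallel; without
// it they would serialize on the GIL exactly as described above.
double heavy_sum(std::size_t n)
{
  gil_release guard;
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    total += std::exp(std::cos(double(i)) + std::sin(double(i)));
  return total;
}

BOOST_PYTHON_MODULE(gil_demo_ext)
{
  boost::python::def("heavy_sum", heavy_sum);
}

Calling gil_demo_ext.heavy_sum from a ThreadPool like the one above would then scale with the number of threads (up to the number of physical cores), just like the NumPy-based example.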
Correction (Tristan Croll, replying to his own message of 2017-09-27 11:54): "... and is of course impossible on *Windows*."