Removal of Boost.Thread support
Hi fellow cctbx developers, I am going to ditch support for Boost.Thread and therefore to remove the SF computation parallelised with Boost.Thread as well. Parallelisation at the C++ level should be done with OpenMP: Boost.Thread was merely an early experimentation that was kept alive for no particularly good reasons. If anybody disagrees with that move, please voice your opinion now. Best wishes, Luc J. Bourhis
I have been lurking on this mailing list for a bit. I am very interested in, and have some practical experience with, OpenMP and Nvidia CUDA programming. I work on such projects both to make use of modern hardware on typical single-user machines and because I find it fun. I have found OpenMP to be rather easy to set up and gain good speedup with, but it is generally very difficult to get close to the maximum theoretical performance (N cores giving a speedup of N) for relatively short computations (less than 1 second).
I have several questions (that I know may not have simple answers):
0) Is there a public roadmap or recent plan of how to proceed?
1) Does the cctbx developers community take kindly to others meddling in the code?
2) For which types of machines would one be trying to tune cctbx's OpenMP code? In general, the tradeoffs are different for machines with a small number of cores versus a massive shared-memory platform (1000s of cores).
3) What is the primary motivation? (e.g. easy-to-extend code that makes use of more cores because they are there, or highly efficient methods that scale very well -- 12 cores should give as close as possible to a 12x speedup with respect to 1 core?)
--Jeff
Hi Jeffrey,
You can now include CUDA code in the cctbx as well. You can call your CUDA routines from a C++ function that has Python bindings generated by boost.python.
There should be some example code available on how to do this. An --enable-cuda flag during config is required.
P
Sent from my iPad
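To make the mechanism Peter describes concrete, here is a minimal sketch (not actual cctbx code; the module and function names are invented for illustration) of a Boost.Python extension exposing a C++ entry point. In a real --enable-cuda build the function body would copy data to the device and launch a CUDA kernel; a plain CPU stub stands in for it here.

    // Hypothetical sketch: Boost.Python binding around a C++ entry point.
    // In a CUDA-enabled build, the body of sum_of_squares() would launch a
    // kernel; the stub below only mimics the interface so the binding itself
    // can be shown.
    #include <boost/python.hpp>

    double sum_of_squares(int n) {
      double s = 0.0;
      for (int i = 0; i < n; ++i) s += double(i) * double(i);  // CPU stand-in
      return s;
    }

    BOOST_PYTHON_MODULE(example_cuda_ext) {
      boost::python::def("sum_of_squares", sum_of_squares);
    }

From Python the extension would then be imported like any other module, e.g. "from example_cuda_ext import sum_of_squares".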
On Wed, Aug 15, 2012 at 7:21 AM, Jeffrey Van Voorst
I have several questions (that I know may not have simple answers): 0) Is there a public roadmap or recent plan of how to proceed?
Nope. Most of the speed improvements we've discussed here at Berkeley have focused on better algorithms and optimization methods. There is some interest (mostly Peter) in porting the direct structure factor calculations to GPUs, which would potentially make this method accessible for macromolecules, but it's a long-term project.
1) Does the cctbx developers community take kindly to others meddling in the code?
Since CCTBX is an open-source project, we generally welcome meddling, as long as a) you talk to us first, b) you don't break anything.
2) For which types of machines would one be trying to tune cctbx's OpenMP code? In general, the tradeoffs are different for machines with a small number of cores versus a massive shared memory platform (1000s of cores).
Small machines (where "small" means "2-64 cores"). Very few calculations that we do are suitable for massive shared-memory systems.
3) What is the primary motivation? (e.g. have easy to extend code that make use of more cores because they are there? or highly efficient methods that scale very well -- 12 cores should give as close as possible to 12x speedup with respect to 1 core)
I think a lot of the OpenMP support currently in CCTBX was experimental - it seemed like an easy thing to try. The main goal for us at Berkeley was (and still is) to make Phenix faster; once it was obvious that OpenMP wouldn't help very much, we sort of lost interest. We've had far more luck with cruder parallelization using Python multiprocessing (although this is very situational). (A secondary problem is that OpenMP is incompatible with the multiprocessing module, so we don't distribute OpenMP builds of either CCTBX or Phenix.) The best use of OpenMP that I can think of would be to parallelize the direct summation code, which is so inefficient that Amdahl's Law shouldn't be as big of a buzz-kill as it was for the parallelized FFT calculations. -Nat
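To illustrate why the direct summation is such a natural OpenMP target, here is a minimal sketch (not the actual cctbx code; form factors, ADPs and symmetry are deliberately left out, and all names are illustrative): each reflection is independent, so a single parallel-for over reflections involves no shared writes at all.

    // Illustrative sketch of OpenMP-parallelised direct summation:
    // F(hkl) = sum_j occ_j * exp(2*pi*i * (h*x_j + k*y_j + l*z_j))
    #include <cmath>
    #include <complex>
    #include <vector>

    struct Reflection { int h, k, l; };
    struct Scatterer  { double x, y, z, occupancy; };

    std::vector<std::complex<double> >
    direct_summation(std::vector<Reflection> const& hkl,
                     std::vector<Scatterer> const& scatterers)
    {
      const double two_pi = 2.0 * 3.14159265358979323846;
      std::vector<std::complex<double> > f_calc(hkl.size());
      #pragma omp parallel for
      for (long i = 0; i < (long)hkl.size(); ++i) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < scatterers.size(); ++j) {
          double phase = two_pi * (hkl[i].h * scatterers[j].x +
                                   hkl[i].k * scatterers[j].y +
                                   hkl[i].l * scatterers[j].z);
          sum += scatterers[j].occupancy *
                 std::complex<double>(std::cos(phase), std::sin(phase));
        }
        f_calc[i] = sum;  // each thread writes a distinct element: no race
      }
      return f_calc;
    }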
My 2 cents... One possibility would be to use vendors' FFT libs. Another possibility is to use multiple processes/threads and communicate via zeromq. zmq is hyped a lot, but ipython uses it for multiprocessing and pyzmq is fairly easy to get on *nix platforms. I haven't checked into its availability on MS Windows. --Jeff
On Wed, Aug 15, 2012 at 1:17 PM, Jeffrey Van Voorst
One possibility would be to use vendors' FFT libs.
This doesn't really help us - the FFT isn't enough of a limiting step in refinement etc., which is why we don't distribute the OpenMP-parallelized builds. And honestly, the single biggest thing that could be done to speed up the FFT is to make it take advantage of crystallographic symmetry instead of working in P1, but that's a huge task.
Another possibility is to use multiple processes/threads and communicate via zeromq. zmq is hyped a lot, but ipython uses it for multiprocessing and pyzmq is fairly easy to get on *nix platforms. I haven't checked into its availability on MS Windows.
It's not clear to me where this would actually make a difference - most of the code either isn't inherently parallel, or is so easily split up that we can use the multiprocessing module (or even a queuing system). The direct summation is the only embarrassingly parallel routine that I'm aware of that's actually a huge bottleneck. -Nat
At a high level, the major benefit of pyzmq over multiprocessing would be in instances where parts could be distributed among multiple machines or the process(es) are event-driven. Therefore, the benefit is application-dependent, and pyzmq would require installing and testing another 3rd-party lib.
--Jeff
On Thu, Aug 16, 2012 at 4:58 AM, Luc Bourhis
A secondary problem is that OpenMP is incompatible with the multiprocessing module
Could you enlighten me as to the exact nature of the problem?
Well, you were the one who figured out the explanation, so I'm just paraphrasing (poorly) what you told me, but: conventional multithreading does not play nice with OpenMP threads. Since Python multiprocessing (and/or threading, which is used by multiprocessing) involves multiple threads (an implementation detail that will not be obvious to the casual Python programmer, thanks to the GIL), it crashes OpenMP. I can dig up the original emails which explained this if you want. -Nat
I thought multiprocessing was to get around the issues of the GIL and use processes instead of threads, but provide the same (or as similar as possible) api as the threading module. At least, that is what is stated on the multiprocessing module page; I haven't tried multiprocessing + OpenMP, so I cannot verify that the statements on the multiprocessing page are indeed correct.
--Jeff
On Thu, Aug 16, 2012 at 11:09 AM, Jeffrey Van Voorst
I thought multiprocessing was to get around the issues of the GIL and use processes instead of threads, but provide the same (or as similar as possible) api as the threading module.
Correct, the problem is that we use the multiprocessing module to launch jobs from the Phenix GUI. -Nat
On 15/08/2012 21:04, Nathaniel Echols wrote:
The best use of OpenMP that I can think of would be to parallelize the direct summation code, which is so inefficient that Amdahl's Law shouldn't be as big of a buzz-kill as it was for the parallelized FFT calculations.
You can use OpenMP for the direct summation code for the structure factors calculation; it's working very well. But OpenMP is not the only thing you can look at: I can work 4 times faster on a single core compared to cctbx by using optimised trigonometric functions [1] and SIMD-friendly code. By the way, is double precision really necessary? The code I have written is about 3 times faster in single precision than in double precision. However, I don't know how similar the calculation of the gradients is, so I don't know if the same kind of method is applicable. I am using the structure factors calculation as a playground for optimization :).
Pascal
[1] Sleef library: http://shibatch.sourceforge.net/
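To make Pascal's point about single precision and SIMD-friendly code concrete, a sketch of the kind of inner loop he alludes to is shown below (illustrative only, not his code; the vectorised SLEEF calls themselves are not shown, and all names are assumptions). Flat float arrays and a branch-free loop body let the compiler vectorise, and "#pragma omp simd" (OpenMP 4.0) makes the intent explicit.

    // Illustrative single-precision, SIMD-friendly accumulation of the A and B
    // parts of one structure factor; hx[j] holds the precomputed phase
    // 2*pi*(h.x_j) and occ[j] the occupancy of scatterer j.
    #include <cmath>
    #include <cstddef>

    void f_calc_one_reflection(std::size_t n_scatterers,
                               const float* hx, const float* occ,
                               float& a_out, float& b_out)
    {
      float a = 0.0f, b = 0.0f;
      #pragma omp simd reduction(+:a,b)
      for (std::size_t j = 0; j < n_scatterers; ++j) {
        a += occ[j] * std::cos(hx[j]);
        b += occ[j] * std::sin(hx[j]);
      }
      a_out = a;
      b_out = b;
    }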
Hi Jeffrey,

to rephrase what has already been written by Nat, a significant portion of CPU time is often spent in algorithms that are not easily parallelised. Let's elaborate on two examples so as to give you a more precise idea of the issues at hand: one that is not favorable, the computation of protein structure factors and their derivatives, and one that is more favorable, the computation of the normal matrix in least-squares refinement.

I. Protein structure factors and their derivatives

It's roughly done in 2 steps:
1. the electron density is sampled on a grid as follows: for each X-ray scatterer S, compute the contribution of S to the grid nodes in some neighbourhood of the location of S;
2. use FFT to compute the structure factors from the sampled density.

For derivatives, just replace electron density and structure factors by their derivatives in that description.

The FFT is very amenable to parallelisation but the sampling in step 1 is much more difficult. Indeed, parallelising the loop as described with a mere OpenMP directive is hopeless: each grid point gets contributions from several scatterers, thus requiring careful locking to prevent race conditions, but the resulting parallel code is then hopelessly inefficient. Having each thread work on its private grid puts too much strain on memory: a no-go as well. We even tried to parallelise the inner loop over the grid points, again with a mere OpenMP directive. But that was predictably hopeless as well: the outer loop has many iterations whereas the inner loop has comparably very few, thus making the overhead of firing the threads lethal.

Since the CPU time spent in the sampling step is significant, the benefit of parallelising the FFT step is severely tarnished by Amdahl's law. One would need to parallelise the sampling step, and that is definitely difficult to do. At the very least, it would require a major reorganisation of a very tricky piece of code, i.e. lots of work and lots of testing. For example, one could think of looping over the grid points and, for each of them, adding up the contributions of neighbouring scatterers. Such a loop would be amenable to parallelisation with a simple OpenMP directive, but this new algorithm requires a preparation phase building data structures to efficiently find those neighbouring atoms. So a significant rewrite, as stated.

II. Computation of the normal matrix in least-squares refinement

Here, for each reflection hkl, we compute the gradient g of the structure factor Fc(hkl) wrt the crystallographic parameters and then perform

A = A + g g^T    (#)

where A is the so-called normal matrix, which will be used to compute the shift of parameters bringing the model closer to a minimum of the least-squares fit of the modelled Fc against the observed structure factors. Those operations (#), called symmetric rank-1 updates if you are mathematically minded, are very amenable to parallelisation, and also to vectorization actually. Since they dominate the CPU time, speeding them up with parallelisation and vectorization is worth the effort, even though Amdahl's law will still spoil the speedup if e.g. the computation of those gradients is not parallelised.

Unfortunately, the use of the normal matrix is only feasible for structures with up to around 1000 atoms in the asymmetric unit. Moreover, even for small- to medium-size structures, this particular algorithm is more the exception than the norm: hotspots easy to parallelise are definitely rare in crystallographic computing.

Best wishes,
Luc Bourhis
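For the favorable case II above, a minimal sketch of a parallel symmetric rank-1 update is shown below (illustrative only, not the cctbx implementation; the names and the packed-triangle layout are assumptions). Parallelising over rows of the packed upper triangle keeps each thread's writes disjoint, so no locking is needed; a dynamic schedule balances the shrinking row lengths.

    // Illustrative OpenMP rank-1 update A += g g^T on a packed upper triangle.
    // a_packed has n*(n+1)/2 elements; A(i,j) for j >= i is stored at
    // a_packed[i*n - i*(i-1)/2 + (j - i)].
    #include <cstddef>
    #include <vector>

    void add_rank_1_update(std::vector<double>& a_packed,
                           std::vector<double> const& g)  // gradient for one hkl
    {
      const long n = (long)g.size();
      #pragma omp parallel for schedule(dynamic)
      for (long i = 0; i < n; ++i) {
        std::size_t row_start = (std::size_t)(i*n - i*(i-1)/2);  // offset of A(i,i)
        for (long j = i; j < n; ++j) {
          a_packed[row_start + (j - i)] += g[i] * g[j];  // rows are thread-private
        }
      }
    }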
Hi
When you have (very) high resolution data, structure factor calculation via direct summation is preferred over FFT-based approaches. In those cases, hardware acceleration via GPU could provide significant benefits.
Peter
Sent from my iPad
Peter, do you have a specific example illustrating it? I'd like to see what exactly the benefits are. In my experience, tuning the parameters of the FFT-based Fcalc and gradients calculation (finer density sampling, larger atomic radii) makes the difference between FFT-based and direct summation invisible at resolutions up to 0.37 Å (I did not look higher).
Pavel
participants (6)
- Jeffrey Van Voorst
- Luc Bourhis
- Nathaniel Echols
- Pascal
- Pavel Afonine
- Phzwart