On Wed, Aug 15, 2012 at 7:21 AM, Jeffrey Van Voorst wrote:
I have been lurking on this mailing list for a bit. I am very interested in, and have some practical experience with, OpenMP and Nvidia CUDA programming. I work on such projects both to make use of modern hardware on typical single-user machines, and because I find it fun. I have found OpenMP rather easy to set up and to gain a good speedup with, but it is generally very difficult to get close to the maximum theoretical performance (N cores giving a speedup of N), especially for relatively short computations (less than 1 second).
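The gap Jeffrey describes between ideal and observed speedup is what Amdahl's Law predicts: any serial fraction of the runtime caps the achievable speedup. A quick back-of-the-envelope in Python (the 95% figure is an illustrative assumption, not a measured number):

```python
def amdahl_speedup(p, n):
    """Predicted speedup on n cores when fraction p of the runtime parallelizes.

    Amdahl's Law: S(n) = 1 / ((1 - p) + p / n)
    """
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the runtime parallelized, 12 cores give well under 12x:
print(round(amdahl_speedup(0.95, 12), 2))  # -> 7.74
```

For sub-second computations the serial fraction (thread startup, memory allocation, I/O) tends to dominate, which is why short kernels rarely approach the ideal N-fold speedup.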
I have several questions (that I know may not have simple answers): 0) Is there a public roadmap or recent plan of how to proceed?
Nope. Most of the speed improvements we've discussed here at Berkeley have focused on better algorithms and optimization methods. There is some interest (mostly Peter) in porting the direct structure factor calculations to GPUs, which would potentially make this method accessible for macromolecules, but it's a long-term project.
1) Does the cctbx developers community take kindly to others meddling in the code?
Since CCTBX is an open-source project, we generally welcome meddling, as long as a) you talk to us first, and b) you don't break anything.
2) For which types of machines would one be trying to tune cctbx's OpenMP code? In general, the tradeoffs are different for machines with a small number of cores versus a massive shared memory platform (1000s of cores).
Small machines (where "small" means "2-64 cores"). Very few calculations that we do are suitable for massive shared-memory systems.
3) What is the primary motivation? (e.g. easy-to-extend code that makes use of more cores simply because they are there? or highly efficient methods that scale very well -- 12 cores giving as close as possible to a 12x speedup over 1 core?)
I think a lot of the OpenMP support currently in CCTBX was largely experimental - it seemed like an easy thing to try. The main goal for us at Berkeley was (and still is) to make Phenix faster; once it became obvious that OpenMP wouldn't help very much, we sort of lost interest. We've had far more luck with cruder parallelization using the Python multiprocessing module (although this is very situational). A secondary problem is that OpenMP is incompatible with the multiprocessing module, so we don't distribute OpenMP builds of either CCTBX or Phenix as a result.

The best use of OpenMP that I can think of would be to parallelize the direct summation code, which is so inefficient that Amdahl's Law shouldn't be as big of a buzz-kill as it was for the parallelized FFT calculations.

-Nat
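The "cruder parallelization" Nat mentions can be sketched with the standard-library multiprocessing module: independent, coarse-grained tasks farmed out to worker processes, so Amdahl's serial fraction stays tiny. This is a minimal sketch of the pattern only; `score_model` is a hypothetical stand-in, not actual cctbx or Phenix code:

```python
from multiprocessing import Pool

def score_model(trial):
    # Stand-in for one independent, coarse-grained unit of work
    # (e.g. scoring one trial model); here it just squares its input.
    return trial * trial

if __name__ == "__main__":
    # Each task runs in a separate process, so there is no shared state
    # to manage - the trade-off is pickling overhead per task, which is
    # why this only pays off for coarse-grained work.
    pool = Pool(processes=4)
    results = pool.map(score_model, range(8))  # order is preserved
    pool.close()
    pool.join()
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because each worker is a separate process rather than a thread, this sidesteps the OpenMP/multiprocessing incompatibility entirely, at the cost of only exploiting parallelism between tasks, not within one.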