On Wed, Aug 15, 2012 at 1:17 PM, Jeffrey Van Voorst wrote:
> One possibility would be to use vendors' FFT libs.
This doesn't really help us: the FFT isn't enough of a limiting step in refinement and related calculations, which is why we don't distribute the OpenMP-parallelized builds. And honestly, the single biggest thing that could be done to speed up the FFT would be to make it take advantage of crystallographic symmetry instead of working in P1, but that's a huge task.
> Another possibility is to use multiple processes/threads and communicate via zeromq. zmq is hyped a lot, but IPython uses it for multiprocessing and pyzmq is fairly easy to get on *nix platforms. I haven't checked into its availability on MS Windows.
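[For reference, a minimal sketch of the kind of inter-process messaging pyzmq provides; the REQ/REP socket pair, endpoint, and message contents below are arbitrary illustration, not anything taken from the existing code:]

    # Minimal pyzmq round trip between two processes (REQ/REP pattern).
    # Endpoint, socket types, and message contents are arbitrary examples.
    import zmq

    def worker(endpoint="tcp://127.0.0.1:5555"):
        context = zmq.Context()
        socket = context.socket(zmq.REP)
        socket.bind(endpoint)
        while True:
            task = socket.recv_json()                 # wait for a task message
            socket.send_json({"total": sum(task["values"])})  # stand-in for real work

    def client(endpoint="tcp://127.0.0.1:5555"):
        context = zmq.Context()
        socket = context.socket(zmq.REQ)
        socket.connect(endpoint)
        socket.send_json({"values": [1.0, 2.0, 3.0]})
        return socket.recv_json()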
It's not clear to me where this would actually make a difference: most of the code either isn't inherently parallel, or is so easily split up that we can use the multiprocessing module (or even a queuing system). The direct summation is the only embarrassingly parallel routine I'm aware of that is actually a huge bottleneck.

-Nat
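[For illustration, the kind of split the multiprocessing module already handles for an embarrassingly parallel job; direct_sum_chunk below is a hypothetical stand-in for the real direct-summation kernel, not the actual code:]

    # Illustrative only: farm chunks of an embarrassingly parallel summation
    # out to worker processes with multiprocessing.Pool.
    import multiprocessing

    def direct_sum_chunk(chunk):
        # hypothetical stand-in for the real per-chunk direct-summation work
        return sum(x * x for x in chunk)

    def parallel_direct_sum(data, n_procs=4):
        chunk_size = max(1, len(data) // n_procs)
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        pool = multiprocessing.Pool(processes=n_procs)
        try:
            partial_sums = pool.map(direct_sum_chunk, chunks)
        finally:
            pool.close()
            pool.join()
        return sum(partial_sums)

    if __name__ == "__main__":
        print(parallel_direct_sum(list(range(1000000))))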