Hi Graeme,

I had another look at the code in parallel_map(). I think it may be possible to adapt it to retain the stack traces of individual qsub jobs.

In libtbx/easy_mp.py, compare line 627 with line 718. Both do

    result = res()

but the latter is guarded by a try/except block. Any exception there is the result of a child process dying, and its stack trace is added as the third member of the parmres tuple, which is passed on to the user.

I think something similar could be done with parallel_map(). So I suggest fashioning a parallel_map2() function that is a copy of parallel_map() but with the added exception handler around the result = res() statement. As I have no access to a qsub cluster, I can't test whether this would work.

Regards,
Rob

On 03/04/2018 13:23, [email protected] wrote:
Hi Rob
I think this is true … sometimes
It sets up the qsub every time, but does not always use it - at least it works on my MacBook with no qsub ;-)
That said, the question remains why exception reports are bad for parallel map… we *are* using preserve_exception_message…
Cheers Graeme
On 3 Apr 2018, at 13:20, Dr. Robert Oeffner wrote:
Hi Graeme,
Just had a look at the code in dials/util/mp.py. It seems that you are using parallel_map() on a cluster using qsub. Unfortunately multi_core_run() is not designed for that. It only runs on a single multi-core PC.
Sorry,
Rob
On 03/04/2018 12:44, [email protected] wrote:
Thanks Rob,

I could not dig out the thread (and the mail list thing does not have search that I could find). I’ll talk to the crew about swapping this out for dials.*, though it is possibly quite a big change?

Cheers
Graeme

On 3 Apr 2018, at 12:26, Dr. Robert Oeffner
wrote:
Hi Graeme,

I recall we've been here before:
http://phenix-online.org/pipermail/cctbxbb/2017-December/001807.html

I believe the solution is to use easy_mp.multi_core_run() instead of easy_mp.parallel_map(). The first function preserves the stack traces of individual processes, unlike easy_mp.parallel_map().

Regards,
Rob

On 03/04/2018 07:16, [email protected] wrote:
Folks,

Following up on user reports again of errors within easy_mp: all that gets logged is “something went wrong”, i.e.

Using multiprocessing with 10 parallel job(s)
Traceback (most recent call last):
  File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 613, in <module>
    halraiser(e)
  File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 611, in <module>
    script.run()
  File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 341, in run
    reflections = integrator.integrate()
  File "/home/user/bin/dials-installer/modules/dials/algorithms/integration/integrator.py", line 1214, in integrate
    self.reflections, _, time_info = processor.process()
  File "/home/user/bin/dials-installer/modules/dials/algorithms/integration/processor.py", line 271, in process
    preserve_exception_message = True)
  File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 171, in multi_node_parallel_map
    preserve_exception_message = preserve_exception_message)
  File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 53, in parallel_map
    preserve_exception_message = preserve_exception_message)
  File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/easy_mp.py", line 627, in parallel_map
    result = res()
  File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/result.py", line 119, in __call__
    self.traceback( exception = self.exception() )
  File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 115, in __call__
    self.raise_handler( exception = exception )
  File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/mainthread.py", line 100, in poll
    value = target( *args, **kwargs )
  File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 91, in __call__
    preserve_exception_message = self.preserve_exception_message)
  File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/easy_mp.py", line 627, in parallel_map
    result = res()
  File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/result.py", line 119, in __call__
    self.traceback( exception = self.exception() )
  File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 86, in __call__
    raise exception
RuntimeError: Please report this error to [email protected]: exit code = -9

I forget why it was decided that keeping the proper stack trace was a bad thing, but could this be revisited? It would greatly help to see it in the output of the program (if, as is the case here, I do not have the user data).

My email-fu is not strong enough to dig out the previous conversation.

Cheers
Graeme

--
Robert Oeffner, Ph.D.
Research Associate, The Read Group
Department of Haematology, Cambridge Institute for Medical Research
University of Cambridge
Cambridge Biomedical Campus
Wellcome Trust/MRC Building
Hills Road
Cambridge CB2 0XY
www.cimr.cam.ac.uk/investigators/read/index.html
tel: +44(0)1223 763234
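For illustration, here is a minimal sketch of the guard suggested at the top of this thread, i.e. wrapping the result = res() call in a try/except and handing the stack trace back as the third member of a tuple. It is not the libtbx code and has not been run on a qsub cluster: threads from Python's standard concurrent.futures stand in for the jobs, fut.result() plays the role of res(), and the names work() and guarded_map() are hypothetical.

```python
import traceback
from concurrent.futures import ThreadPoolExecutor


def work(x):
    # Hypothetical job: fails for one input so there is a trace to report.
    if x == 3:
        raise ValueError("bad input: %d" % x)
    return x * x


def guarded_map(func, iterable, max_workers=4):
    """Return a list of (argument, result, stack_trace) tuples.

    stack_trace is None on success. On failure, result is None and
    stack_trace holds the formatted traceback of the failed job.
    """
    args = list(iterable)
    out = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(func, a) for a in args]
        for arg, fut in zip(args, futures):
            try:
                # Stand-in for "result = res()": re-raises the job's exception.
                result = fut.result()
                out.append((arg, result, None))
            except Exception:
                # Keep the trace instead of letting it abort the whole map.
                out.append((arg, None, traceback.format_exc()))
    return out


results = guarded_map(work, range(5))
for arg, result, trace in results:
    if trace is not None:
        print("job %r failed:\n%s" % (arg, trace))
```

With this shape the caller sees every job's outcome, and one failed job no longer hides the others. Whether a job killed by the queueing system (exit code = -9) leaves any useful trace behind is a separate question that this sketch does not answer.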