Hi Graeme,

If all you want to do is print the stack trace to stdout/stderr, then the following lines should do it:

from libtbx.scheduling import stacktrace
stacktrace.enable()
parallel_map(...)

This will install a custom excepthook and print the propagated stack trace to the console when the program crashes.
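For reference, the general mechanism here is Python's sys.excepthook. The sketch below is stdlib-only and is not libtbx's actual code; the 'child_traceback' attribute is an assumed name used purely for illustration of how a hook can prefer a stored child-process traceback over the local one:

```python
import sys


def install_child_traceback_hook():
    # Install an excepthook that, when the exception carries a stored
    # child-process traceback (assumed attribute 'child_traceback'),
    # prints that text instead of the local traceback. Otherwise it
    # falls back to the previously installed hook.
    original = sys.excepthook

    def hook(exc_type, exc_value, exc_tb):
        child_tb = getattr(exc_value, "child_traceback", None)
        if child_tb is not None:
            sys.stderr.write(child_tb)
        else:
            original(exc_type, exc_value, exc_tb)

    sys.excepthook = hook
```

This is only the shape of the idea; stacktrace.enable() handles the details of transporting the traceback text from the child for you.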

If you want to accumulate potentially multiple tracebacks, you need to do what Rob suggests (although I would not actually copy the code, but instead create a wrapping function that fails silently and stores the tracebacks). It is not immediately clear to me whether this has any advantage over the previous approach, but let me know if you need some help with such a generic solution.
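A minimal stdlib-only sketch of such a wrapper (names are illustrative, and the list comprehension stands in for the actual parallel_map call):

```python
import traceback


def fail_silently(func):
    # Wrap func so that an exception does not propagate; instead the
    # formatted traceback is returned, tagged, for later inspection.
    def wrapper(*args, **kwargs):
        try:
            return ("ok", func(*args, **kwargs))
        except Exception:
            return ("error", traceback.format_exc())
    return wrapper


@fail_silently
def work(x):
    if x == 2:
        raise ValueError("boom")
    return x * x


# Stand-in for parallel_map(work, range(4)):
results = [work(x) for x in range(4)]
errors = [tb for status, tb in results if status == "error"]
```

After the run you can report all accumulated tracebacks in one go instead of dying on the first one.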

BW, Gabor


On Tue, Apr 3, 2018 at 1:46 PM, Dr. Robert Oeffner <rdo20@cam.ac.uk> wrote:
Hi Graeme,

Had a look at the code again in parallel_map(). I think it may be possible to adapt it to retain stack traces of individual qsub jobs.
In libtbx/easy_mp.py compare the lines 627 with 718. Both are doing
        result = res()
But the latter is guarded by a try/except block. Any exception is the result of a child process dying, and its stack trace is added as the third member of the parmres tuple, which is passed on to the user.

I think something similar could be done with parallel_map(), so I suggest fashioning a parallel_map2() function which is a copy of parallel_map() but with the added exception handler around the result = res() statement.
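The guarded pattern would look roughly like this. This is a self-contained sketch, not easy_mp's actual code: the zero-argument 'res' objects are mimicked here by plain callables that raise when the child process has died:

```python
import traceback


def guarded_collect(result_callables):
    # Call each deferred result; instead of letting the first failure
    # kill the whole map, keep (result, traceback) pairs so the caller
    # sees which jobs died and why.
    collected = []
    for res in result_callables:
        try:
            collected.append((res(), None))
        except Exception:
            collected.append((None, traceback.format_exc()))
    return collected


def ok():
    return 42


def dead_child():
    raise RuntimeError("exit code = -9")


pairs = guarded_collect([ok, dead_child])
```

In parallel_map2() the except branch would attach the traceback to the returned tuple, as the qsub code path at line 718 already does.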

As I have no access to a qsub cluster I can't test whether this would work.

Regards,

Rob



On 03/04/2018 13:23, Graeme.Winter@Diamond.ac.uk wrote:
Hi Rob

I think this is true … sometimes

It sets up the qsub every time, but does not always use it - at least it works on my MacBook with no qsub ;-)

That said, the question remains why exception reports are bad for parallel_map… we *are* using preserve_exception_message…

Cheers Graeme


On 3 Apr 2018, at 13:20, Dr. Robert Oeffner <rdo20@cam.ac.uk> wrote:

Hi Graeme,

Just had a look at the code in dials/util/mp.py. It seems that you are using parallel_map() on a cluster via qsub. Unfortunately multi_core_run() is not designed for that; it only runs on a single multi-core machine.

Sorry,

Rob


On 03/04/2018 12:44, Graeme.Winter@Diamond.ac.uk wrote:
Thanks Rob, I could not dig out the thread (and the mailing list archive does not have a search function that I could find)
I’ll talk to the crew about swapping this out for dials.* - though that is possibly quite a big change?
Cheers Graeme
On 3 Apr 2018, at 12:26, Dr. Robert Oeffner <rdo20@cam.ac.uk> wrote:
Hi Graeme,
I recall we've been here before,
http://phenix-online.org/pipermail/cctbxbb/2017-December/001807.html
I believe the solution is to use easy_mp.multi_core_run() instead of easy_mp.parallel_map(). The former preserves the stack traces of individual processes, unlike easy_mp.parallel_map().
Regards,
Rob
On 03/04/2018 07:16, Graeme.Winter@Diamond.ac.uk wrote:
Folks,
Following up again on user reports of errors within easy_mp - all that gets logged is “something went wrong”, i.e.
  Using multiprocessing with 10 parallel job(s)
Traceback (most recent call last):
   File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 613, in <module>
     halraiser(e)
   File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 611, in <module>
     script.run()
   File "/home/user/bin/dials-installer/build/../modules/dials/command_line/integrate.py", line 341, in run
     reflections = integrator.integrate()
   File "/home/user/bin/dials-installer/modules/dials/algorithms/integration/integrator.py", line 1214, in integrate
     self.reflections, _, time_info = processor.process()
   File "/home/user/bin/dials-installer/modules/dials/algorithms/integration/processor.py", line 271, in process
     preserve_exception_message = True)
   File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 171, in multi_node_parallel_map
     preserve_exception_message = preserve_exception_message)
   File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 53, in parallel_map
     preserve_exception_message = preserve_exception_message)
   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/easy_mp.py", line 627, in parallel_map
     result = res()
   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/result.py", line 119, in __call__
     self.traceback( exception = self.exception() )
   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 115, in __call__
     self.raise_handler( exception = exception )
   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/mainthread.py", line 100, in poll
     value = target( *args, **kwargs )
   File "/home/user/bin/dials-installer/modules/dials/util/mp.py", line 91, in __call__
     preserve_exception_message = self.preserve_exception_message)
   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/easy_mp.py", line 627, in parallel_map
     result = res()
   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/result.py", line 119, in __call__
     self.traceback( exception = self.exception() )
   File "/home/user/bin/dials-installer/modules/cctbx_project/libtbx/scheduling/stacktrace.py", line 86, in __call__
     raise exception
RuntimeError: Please report this error to dials-support@lists.sourceforge.net: exit code = -9
I forget why it was decided that keeping the proper stack trace was a bad thing, but could this be revisited? It would greatly help to see it in the program output (when, as is the case here, I do not have the user's data)
My email-fu is not strong enough to dig out the previous conversation
Cheers Graeme
--
Robert Oeffner, Ph.D.
Research Associate, The Read Group
Department of Haematology,
Cambridge Institute for Medical Research
University of Cambridge
Cambridge Biomedical Campus
Wellcome Trust/MRC Building
Hills Road
Cambridge CB2 0XY
www.cimr.cam.ac.uk/investigators/read/index.html
tel: +44(0)1223 763234






_______________________________________________
cctbxbb mailing list
cctbxbb@phenix-online.org
http://phenix-online.org/mailman/listinfo/cctbxbb