Re: [cctbxbb] [Cctbx-cvs] SF.net SVN: cctbx:[25333] trunk/libtbx/env_config.py
Hi,

I just spent some time tracking software crashes to this change. Is setting the default to en_US really appropriate and what we want? In particular it affects the output of downstream, external software we run from within python.

What is the unicode issue you hint at in the commit message?

-Markus

Dr Markus Gerstel MBCS
Postdoctoral Research Associate
Tel: +44 1235 778698
Diamond Light Source Ltd.
Diamond House
Harwell Science & Innovation Campus
Didcot
Oxfordshire
OX11 0DE

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: 07 September 2016 00:54
To: [email protected]
Subject: [Cctbx-cvs] SF.net SVN: cctbx:[25333] trunk/libtbx/env_config.py

Revision: 25333
          http://sourceforge.net/p/cctbx/code/25333
Author:   bkpoon
Date:     2016-09-06 23:54:29 +0000 (Tue, 06 Sep 2016)
Log Message:
-----------
Unicode support: set LC_ALL in dispatchers to the one in the user's environment (if available, and supports UTF-8), otherwise use the default setting of en_US.UTF-8; fixes unicode issue with python in Linux (e.g. os.path functions do not work correctly with unicode if LC_ALL=C

Modified Paths:
--------------
    trunk/libtbx/env_config.py

Modified: trunk/libtbx/env_config.py
===================================================================
--- trunk/libtbx/env_config.py  2016-09-06 21:15:34 UTC (rev 25332)
+++ trunk/libtbx/env_config.py  2016-09-06 23:54:29 UTC (rev 25333)
@@ -945,6 +945,15 @@
   def write_bin_sh_dispatcher(self, source_file, target_file, source_is_python_exe=False):
+
+    # determine LC_ALL from environment (Python UTF-8 compatibility in Linux)
+    LC_ALL = os.environ.get('LC_ALL') # user setting
+    if (LC_ALL is not None):
+      if ( ('UTF-8' not in LC_ALL) and ('utf8' not in LC_ALL) ):
+        LC_ALL = None
+    if (LC_ALL is None):
+      LC_ALL = 'en_US.UTF-8' # default
+
     f = target_file.open("w")
     if (source_file is not None):
       print >> f, '#! /bin/sh'
@@ -975,7 +984,7 @@
     print >> f, '#'
     print >> f, _SHELLREALPATH_CODE
     print >> f, 'unset PYTHONHOME'
-    print >> f, 'LC_ALL=C'
+    print >> f, 'LC_ALL=' + LC_ALL
     print >> f, 'export LC_ALL'
     print >> f, 'LIBTBX_BUILD="$(shellrealpath "$0" && cd "$(dirname "$RESULT")/.." && pwd)"'
     print >> f, 'export LIBTBX_BUILD'

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

------------------------------------------------------------------------------
_______________________________________________
Cctbx-cvs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cctbx-cvs

--
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
Hi Markus,
There is an issue with non-ASCII paths (unicode type) and basic Python
functions if the locale (like 'C') does not support UTF-8. Without UTF-8
support, these functions try to convert the unicode type into a str type
with the 'ascii' encoding, which triggers a UnicodeEncodeError. I attached
a script that tests it. The unicode path should fail for libtbx.python
before my change and pass after my change. Alternatively, change the LC_ALL
setting in the build/bin/libtbx.python dispatcher directly (if the en_US
locale is available, en_US will fail and en_US.UTF-8 will work).
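The attached script is not preserved in the archive. As a minimal sketch of the kind of check described above, the snippet below only demonstrates the encoding step itself. It assumes a made-up path name, and it relies on the fact that Python 2 on Linux selects an ASCII filesystem encoding under LC_ALL=C, so os.path functions given a unicode path effectively attempt the first conversion shown here.

```python
# -*- coding: utf-8 -*-
# Sketch only: the path is a hypothetical example, not the attached test script.
from __future__ import print_function

path = u'/tmp/caf\xe9/data.mtz'  # hypothetical path with one non-ASCII character

def can_encode(text, codec):
  """Return True if text can be converted to bytes with the given codec."""
  try:
    text.encode(codec)
    return True
  except UnicodeEncodeError:
    return False

# 'ascii' stands in for a locale without UTF-8 support (such as 'C')
print('ascii:', can_encode(path, 'ascii'))
print('utf-8:', can_encode(path, 'utf-8'))
```

The ASCII conversion is the one that raises UnicodeEncodeError; with a UTF-8 locale the same path encodes without error.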
An additional wrinkle is that LC_ALL=C works fine on my mac (OS X 10.10.5).
Also, there is a "C.UTF-8" locale on Ubuntu, but not on CentOS.
Basically, to support non-ASCII paths (unicode type) in basic Python
functions, any locale with UTF-8 or utf8 will work. The en_US part is not
that important.
What are the errors that you get? I ran the regression tests for dials
(libtbx.run_tests_parallel module=dials) and dials_regression
(module=dials_regression) and everything passes except for one test in
dials_regression (dials_regression/test.py). But the error seems to be
about a goniometer object. Do you have the en_US locale installed?
Right now, I'm just checking if LC_ALL is set in the user environment and
using that if it has the extra UTF-8 part. I can also check the LANG
environment variable. That might work better for users that do not have
the en_US locale installed.
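A rough sketch of what checking LANG as a fallback could look like is given below. The function name choose_lc_all and its parameters are hypothetical and not part of env_config.py; the UTF-8 test is the same substring check used in the commit.

```python
import os

def choose_lc_all(environ=None, default='en_US.UTF-8'):
  """Return the first UTF-8 capable value of LC_ALL or LANG, else the default.

  Hypothetical helper for illustration; not the code in env_config.py.
  """
  if environ is None:
    environ = os.environ
  for name in ('LC_ALL', 'LANG'):
    value = environ.get(name)
    # same substring test as in the committed change
    if value is not None and ('UTF-8' in value or 'utf8' in value):
      return value
  return default

# LC_ALL lacks UTF-8 here, so the LANG setting is used instead
print(choose_lc_all({'LC_ALL': 'C', 'LANG': 'de_DE.UTF-8'}))
```

With this ordering, a user whose LC_ALL is unset or not UTF-8 capable but whose LANG names a UTF-8 locale would get that locale rather than en_US.UTF-8.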
--
Billy K. Poon
Research Scientist, Molecular Biophysics and Integrated Bioimaging
Lawrence Berkeley National Laboratory
1 Cyclotron Road, M/S 33R0345
Berkeley, CA 94720
Tel: (510) 486-5709
Fax: (510) 486-5909
Web: https://phenix-online.org
On Thu, Sep 8, 2016 at 2:26 AM,
_______________________________________________ cctbxbb mailing list [email protected] http://phenix-online.org/mailman/listinfo/cctbxbb
Hi Billy,
Thanks for your mail.
As usual, it’s $insertYear and unicode is still not a solved problem :(
I ran into UnicodeEncode/DecodeErrors, but I am now happy that your change only exposed underlying issues in my code (outside of the cctbx/dials/xia2 repositories). I have sprinkled some forced UTF-8 encoding on top, and everything appears to be working fine now.
As for the changed output: one example is the default wget output, which puts the name of the file it writes to disk in ``quotes’’, and the quote characters it uses follow the LC_ALL encoding. Fortunately we don’t really care about fancy formatting, so this is not a real problem.
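For cases where the exact output of an external tool does matter, one possible approach (an assumption for illustration, not something done in the code discussed here) is to override LC_ALL only for the child process. The helper name run_with_c_locale is hypothetical, and a child Python process stands in for the external tool.

```python
import os
import subprocess
import sys

def run_with_c_locale(command):
  """Run command with LC_ALL=C in the child environment and return its raw output."""
  env = dict(os.environ)  # copy, so the parent environment is left untouched
  env['LC_ALL'] = 'C'
  return subprocess.check_output(command, env=env)

# Stand-in for an external tool: a child Python that reports the LC_ALL it sees.
output = run_with_c_locale(
  [sys.executable, '-c', 'import os; print(os.environ["LC_ALL"])'])
print(output)
```

The dispatcher setting still applies to the calling Python process; only the child sees LC_ALL=C.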
-Markus
Dr Markus Gerstel MBCS
Postdoctoral Research Associate
Tel: +44 1235 778698
Diamond Light Source Ltd.
Diamond House
Harwell Science & Innovation Campus
Didcot
Oxfordshire
OX11 0DE
From: [email protected] [mailto:[email protected]] On Behalf Of Billy Poon
Sent: 08 September 2016 19:45
To: cctbx mailing list
Cc: [email protected]
Subject: Re: [cctbxbb] [Cctbx-cvs] SF.net SVN: cctbx:[25333] trunk/libtbx/env_config.py
Hi Markus,
There is an issue with non-ASCII paths (unicode type) and basic Python functions if the locale (like 'C') does not support UTF-8. Without UTF-8 support, these functions try to convert the unicode type into a str type with the 'ascii' encoding, which triggers a UnicodeEncodeError. I attached a script that tests it. The unicode path should fail for libtbx.python before my change and pass after my change. Alternatively, change the LC_ALL setting in the build/bin/libtbx.python dispatcher directly (if the en_US locale is available, en_US will fail and en_US.UTF-8 will work).
An additional wrinkle is that LC_ALL=C works fine on my mac (OS X 10.10.5). Also, there is a "C.UTF-8" locale on Ubuntu, but not on CentOS.
Basically, to support non-ASCII paths (unicode type) in basic Python functions, any locale with UTF-8 or utf8 will work. The en_US part is not that important.
What are the errors that you get? I ran the regression tests for dials (libtbx.run_tests_parallel module=dials) and dials_regression (module=dials_regression) and everything passes except for one test in dials_regression (dials_regression/test.py). But the error seems to be about a goniometer object. Do you have the en_US locale installed?
Right now, I'm just checking if LC_ALL is set in the user environment and using that if it has the extra UTF-8 part. I can also check the LANG environment variable. That might work better for users that do not have the en_US locale installed.
--
Billy K. Poon
Research Scientist, Molecular Biophysics and Integrated Bioimaging
Lawrence Berkeley National Laboratory
1 Cyclotron Road, M/S 33R0345
Berkeley, CA 94720
Tel: (510) 486-5709
Fax: (510) 486-5909
Web: https://phenix-online.org
On Thu, Sep 8, 2016 at 2:26 AM,
Hi Markus,
Great!
Just to let you know of some additional quirks Rob and I found about
unicode. Windows filesystems do not seem to like UTF-8, so you should use
the to_str and to_unicode functions in libtbx/utils.py if you want to
handle non-ASCII filenames on Windows. They default to 'mbcs' for the
encoding codec on Windows.
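The signatures of to_str and to_unicode are not shown in this thread, so the following is only a hypothetical stand-in for the idea of a platform-dependent default codec. The names encode_name, decode_name and DEFAULT_CODEC are made up, and the choice of 'utf-8' on non-Windows platforms is an assumption.

```python
import sys

# Assumption: 'mbcs' on Windows as described above, 'utf-8' everywhere else.
DEFAULT_CODEC = 'mbcs' if sys.platform == 'win32' else 'utf-8'

def encode_name(text, codec=DEFAULT_CODEC):
  """Convert a text file name to bytes; hypothetical helper, not libtbx code."""
  if isinstance(text, bytes):
    return text  # already encoded, leave unchanged
  return text.encode(codec)

def decode_name(data, codec=DEFAULT_CODEC):
  """Convert a byte file name to text; hypothetical helper, not libtbx code."""
  if isinstance(data, bytes):
    return data.decode(codec)
  return data  # already text, leave unchanged

name = u'caf\xe9.pdb'  # made-up non-ASCII file name
round_trip = decode_name(encode_name(name, 'utf-8'), 'utf-8')
print(round_trip == name)
```

Routing every conversion through one pair of helpers keeps the codec decision in a single place instead of scattering explicit encode and decode calls.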
--
Billy K. Poon
Research Scientist, Molecular Biophysics and Integrated Bioimaging
Lawrence Berkeley National Laboratory
1 Cyclotron Road, M/S 33R0345
Berkeley, CA 94720
Tel: (510) 486-5709
Fax: (510) 486-5909
Web: https://phenix-online.org
On Fri, Sep 9, 2016 at 1:59 AM,
participants (2)
- Billy Poon
- markus.gerstel@diamond.ac.uk