Re: [cctbxbb] Unstable platform-dependent sort order for flex.sort_permutation
Hi Oleg,
The permutations would lead to the same sequence of Miller indices, but in the context of an unmerged reflection array (i.e. miller.array.sort_permutation) the data associated with those Miller indices would not necessarily be in the same order. As the dataset is then split into two half-datasets, this difference in sort order leads to a different value of the calculated correlation coefficient between those two half-datasets:
https://github.com/cctbx/cctbx_project/blob/master/cctbx/miller/__init__.py#L4700
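To make the point concrete, here is a minimal pure-Python sketch (not cctbx code). The key array and the two permutations are the ones reported in this thread; the `data` values and the helper `select` are made up for illustration only.

```python
# Keys and the two platform-dependent permutations reported in this thread.
keys = [7, 1, 1, 5, 1, 3, 3, 7, 1, 7, 3, 3, 5, 7, 5, 5, 5, 1, 1, 1]
perm_mac = [1, 2, 4, 8, 17, 18, 19, 5, 6, 10, 11, 3, 12, 14, 15, 16, 0, 7, 9, 13]
perm_linux = [19, 1, 2, 18, 4, 17, 8, 11, 10, 6, 5, 12, 14, 15, 16, 3, 9, 7, 13, 0]

# Hypothetical per-observation data (e.g. intensities); the values are invented.
data = [100.0 + i for i in range(len(keys))]


def select(values, perm):
    """Reorder `values` according to the permutation `perm`."""
    return [values[i] for i in perm]


# Both permutations sort the keys into the same sequence ...
print(select(keys, perm_mac) == select(keys, perm_linux) == sorted(keys))  # True

# ... but the data attached to the tied keys come out in a different order.
print(select(data, perm_mac)[:7])    # [101.0, 102.0, 104.0, 108.0, 117.0, 118.0, 119.0]
print(select(data, perm_linux)[:7])  # [119.0, 101.0, 102.0, 118.0, 104.0, 117.0, 108.0]
```

Anything downstream that depends on the position of an observation within a group of equal keys will therefore see different input on the two platforms.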
Cheers,
Richard
Dr Richard Gildea
Data Analysis Scientist
Tel: +441235 77 8078
Diamond Light Source Ltd.
Diamond House
Harwell Science & Innovation Campus
Didcot
Oxfordshire
OX11 0DE
________________________________
From: [email protected] [[email protected]] on behalf of [email protected] [[email protected]]
Sent: 18 November 2016 15:59
To: cctbx mailing list
Subject: Re: [cctbxbb] Unstable platform-dependent sort order for flex.sort_permutation
Hi Richard,
It is a bug only if the permutations do not lead to the same sorted sequence. Otherwise, you cannot expect to get the same sorting permutation for collections with redundant data on different platforms, or even between different versions of a compiler.
Cheers,
Oleg.
Richard & Oleg,
I'm not in the office right now, but I think the concept you are looking for is the "stable sort", which sorts items while preserving the original relative order of equal elements. I believe the C++ standard library exposes this function, so it would be possible to create a corresponding flex.stable_sort_permutation for our array family.
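As a rough illustration of what a stable sort permutation returns, here is a pure-Python sketch. It is a stand-in, not the cctbx implementation: the function name `stable_sort_permutation` is hypothetical, and it relies only on the fact that Python's built-in sorted() is guaranteed to be stable.

```python
def stable_sort_permutation(values):
    """Return the indices that sort `values`, keeping equal values in input order."""
    # sorted() is stable, so indices of tied values stay in ascending order.
    return sorted(range(len(values)), key=values.__getitem__)


a = [7, 1, 1, 5, 1, 3, 3, 7, 1, 7, 3, 3, 5, 7, 5, 5, 5, 1, 1, 1]
perm = stable_sort_permutation(a)
print(perm)
# [1, 2, 4, 8, 17, 18, 19, 5, 6, 10, 11, 3, 12, 14, 15, 16, 0, 7, 9, 13]
```

For this input the result coincides with the permutation reported for the Mac build in this thread: within each group of equal values the indices increase.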
Nick
Nicholas K. Sauter, Ph. D.
Senior Scientist, Molecular Biophysics & Integrated Bioimaging Division
Lawrence Berkeley National Laboratory
1 Cyclotron Rd., Bldg. 33R0345
Berkeley, CA 94720
(510) 486-5713
________________________________
From: [email protected] on behalf of [email protected]
Sent: 18 November 2016 14:12:29
To: [email protected]
Subject: [cctbxbb] Unstable platform-dependent sort order for flex.sort_permutation
Hi,
I've been trying to diagnose why the miller array cc_anom calculations seem to be platform dependent. I've narrowed this down to a variation in the sort order returned by flex.sort_permutation, which is called from within miller.array.sort("packed_indices"). The following code demonstrates the problem and the different output I get on Mac and Linux:
Mac:
    from scitbx.array_family import flex
    a = flex.size_t([7, 1, 1, 5, 1, 3, 3, 7, 1, 7, 3, 3, 5, 7, 5, 5, 5, 1, 1, 1])
    print(list(flex.sort_permutation(a)))
    [1, 2, 4, 8, 17, 18, 19, 5, 6, 10, 11, 3, 12, 14, 15, 16, 0, 7, 9, 13]
Linux:
    from scitbx.array_family import flex
    a = flex.size_t([7, 1, 1, 5, 1, 3, 3, 7, 1, 7, 3, 3, 5, 7, 5, 5, 5, 1, 1, 1])
    print(list(flex.sort_permutation(a)))
    [19, 1, 2, 18, 4, 17, 8, 11, 10, 6, 5, 12, 14, 15, 16, 3, 9, 7, 13, 0]
Is it known/expected for the sort order to be platform dependent, or is this a bug?
Here is the relevant code for flex.sort_permutation():
https://github.com/cctbx/cctbx_project/blob/master/scitbx/array_family/sort.h
Cheers,
Richard
-- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
_______________________________________________ cctbxbb mailing list [email protected] http://phenix-online.org/mailman/listinfo/cctbxbb
Hi Richard,
I guess you need to extend it to also use the reflection intensity in the comparator if you really need reproducible sets :).
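A small pure-Python sketch of this idea, with made-up values and a hypothetical helper name (`sort_permutation_with_tiebreak`); it is not cctbx code. Sorting on the pair (index, intensity) leaves ties only between identical pairs, so the sorted sequence no longer depends on the input order or on how the sort algorithm handles ties.

```python
import random

# Made-up keys and intensities; note the exact duplicate pair (1, 3.5).
indices = [7, 1, 1, 5, 1, 3, 3, 7, 1, 7]
intensities = [12.0, 3.5, 9.1, 4.4, 3.5, 8.0, 2.2, 6.3, 7.7, 1.9]


def sort_permutation_with_tiebreak(primary, secondary):
    """Permutation that sorts by `primary`, breaking ties with `secondary`."""
    return sorted(range(len(primary)), key=lambda i: (primary[i], secondary[i]))


def ordered_pairs(primary, secondary):
    """Return the (primary, secondary) pairs in tie-broken sorted order."""
    perm = sort_permutation_with_tiebreak(primary, secondary)
    return [(primary[i], secondary[i]) for i in perm]


# Present the same observations in a different input order.
rng = random.Random(1)
order = list(range(len(indices)))
rng.shuffle(order)
shuffled_indices = [indices[i] for i in order]
shuffled_intensities = [intensities[i] for i in order]

# The sorted result is the same whatever the input order was.
print(ordered_pairs(indices, intensities)
      == ordered_pairs(shuffled_indices, shuffled_intensities))  # True
```

The permutation itself may still differ for identical pairs, but the reordered data do not, which is what matters for reproducible half-datasets.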
Cheers,
Oleg.
The permutations would lead to the same sequence of Miller indices, but in the context of an unmerged reflection array (i.e. miller.array.sort_permutation) the data associated with those Miller indices would not necessarily be in the same order. As the dataset is then split into two half-datasets, this difference in sort order leads to a different value of the calculated correlation coefficient between those two half-datasets:
I guess you need to extend it to also use the reflection intensity in the comparator if you really need reproducible sets :).
Interesting question! The simplest solution is definitely the one advocated by Nick, a stable sort. But keeping the original order of the indices is arbitrary as far as the intensities are concerned, especially since that original order is most likely quite arbitrary in the first place. Sorting first on the indices and then on the intensities: is that any less arbitrary for computing the correlation?
On 18 Nov 2016, at 17:10, [email protected] wrote:
a different value of the calculated correlation coefficient between those two half datasets:
What does split_unmerged do again?
Hi,
Thanks all for the responses. I think the simplest solution would be to add an option stable=False to flex.sort_permutation that can optionally use std::stable_sort in place of std::sort.
Luc: split_unmerged splits an unmerged dataset into two random half datasets, in order to calculate the correlation coefficient between the two half datasets:
https://github.com/cctbx/cctbx_project/blob/master/cctbx/miller/merge_equiva...
See also Karplus, P. A., & Diederichs, K. (2012). Linking crystallographic model and data quality. Science, 336(6084), 1030-1033:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3457925/
Cheers,
Richard
On 18 Nov 2016, at 17:55, [email protected] wrote:
Luc: split_unmerged splits an unmerged dataset into two random half datasets, in order to calculate the correlation coefficient between the two half datasets:
Oh, so you mean that, using the same random seed on macOS and on Linux, you did not get the same CC?
Hi Luc,
Yes, the random numbers generated by the Mersenne Twister were the same; however, the input tmp_array was in a different sort order, which meant that the output half datasets were platform-dependent:
https://github.com/cctbx/cctbx_project/blob/master/cctbx/miller/__init__.py#...
I have just committed the necessary changes to add the parameter stable(=False) to flex.sort_permutation(). miller.array.sort_permutation sets stable=True when calling flex.sort_permutation(). This looks to have made the CC1/2 calculations platform-independent.
Cheers,
Richard
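The mechanism can be sketched in pure Python as follows. This is a simplified stand-in, not the cctbx split_unmerged code: the `data` values and the helpers `split_halves` and `stable_permutation` are hypothetical, and the split is done over the whole array rather than per group of equivalent reflections. The keys and the two permutations are those reported earlier in the thread.

```python
import random

keys = [7, 1, 1, 5, 1, 3, 3, 7, 1, 7, 3, 3, 5, 7, 5, 5, 5, 1, 1, 1]
data = [100.0 + i for i in range(len(keys))]  # invented intensities

perm_mac = [1, 2, 4, 8, 17, 18, 19, 5, 6, 10, 11, 3, 12, 14, 15, 16, 0, 7, 9, 13]
perm_linux = [19, 1, 2, 18, 4, 17, 8, 11, 10, 6, 5, 12, 14, 15, 16, 3, 9, 7, 13, 0]


def split_halves(values, seed=0):
    """Assign each value to one of two halves using a seeded random generator."""
    rng = random.Random(seed)  # same seed gives the same sequence of choices
    halves = ([], [])
    for value in values:
        halves[rng.randrange(2)].append(value)
    return halves


def stable_permutation(values):
    """Sort permutation that keeps equal values in their input order."""
    return sorted(range(len(values)), key=values.__getitem__)


sorted_mac = [data[i] for i in perm_mac]
sorted_linux = [data[i] for i in perm_linux]

# Same seed, same random choices, but different input order: the halves differ.
print(split_halves(sorted_mac) == split_halves(sorted_linux))  # False

# A stable permutation fixes the input order, and with it the split.
sorted_stable = [data[i] for i in stable_permutation(keys)]
print(split_halves(sorted_stable) == split_halves(sorted_mac))  # True
```

The second comparison is True only because the permutation reported for the Mac build happens to be the stable one for this input.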
I have just committed the necessary changes to add the parameter stable(=False) to flex.sort_permutation(). miller.array.sort_permutation sets stable=True when calling flex.sort_permutation(). This looks to have made the CC1/2 calculations platform-independent.
Thank you, that’s a generally useful addition. Sorry for not getting the context of your question in the first place!
You might be interested in an alternative method of calculating CC(1/2) from variances, rather than from explicit half-sets, described tersely in this paper:
Assmann G, Brehm W, Diederichs K. Identification of rogue datasets in serial crystallography. Journal of Applied Crystallography. 2016 Jun;49(3):1021–8.
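A rough Python sketch of how a CC(1/2) estimate can be obtained from variances alone, with no random split. It follows a textbook argument and is an assumption about the approach, not a transcription of the paper's estimator: if the true intensities have variance var_tau and a half-dataset mean has error variance var_eps, the correlation between the two half-dataset means is var_tau / (var_tau + var_eps), and the variance of the fully merged means is var_y = var_tau + var_eps / 2. The function name and the synthetic data are hypothetical.

```python
import random
import statistics


def cc_half_from_variances(groups):
    """Estimate CC(1/2) from groups of repeated observations, one group per unique reflection."""
    # Variance of the merged (mean) intensities across unique reflections.
    var_y = statistics.variance([statistics.fmean(g) for g in groups])
    # Error variance of a half-dataset mean: sample variance divided by n/2, averaged.
    var_eps = statistics.fmean(
        [statistics.variance(g) / (len(g) / 2.0) for g in groups]
    )
    # var_tau = var_y - var_eps/2, so var_tau/(var_tau + var_eps) becomes:
    return (var_y - 0.5 * var_eps) / (var_y + 0.5 * var_eps)


# Synthetic data: true intensities with sigma 10, four observations each with sigma 5.
rng = random.Random(0)
groups = []
for _ in range(2000):
    true_intensity = rng.gauss(100.0, 10.0)
    groups.append([true_intensity + rng.gauss(0.0, 5.0) for _ in range(4)])

cc = cc_half_from_variances(groups)
expected = 10.0 ** 2 / (10.0 ** 2 + 2 * 5.0 ** 2 / 4)  # about 0.89
print(round(cc, 3), round(expected, 3))
```

Because no half-datasets are formed, the result does not depend on the order of the observations, so the sort-order issue discussed in this thread does not arise.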
participants (5)
- Luc Bourhis
- Nicholas Sauter
- oleg@olexsys.org
- Phil Evans
- richard.gildea@diamond.ac.uk