Re: [cctbxbb] Question about sigma(I) calculation in merge_equivalents
Following up on this: since the use of the "internal sigma" is
controversial, I've made it possible to disable this and simply use
the "external sigma" instead. This is now what will be done for
calculating merging statistics (in
iotbx/command_line/merging_statistics.py, and higher-level code in
Phenix), for the sake of consistency with other software. Since I
have no idea which method is actually better or how to test the effect
of such a change on downstream programs, I'm leaving the previous
behavior as default for now.
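
For readers following the thread, here is a minimal sketch of what such a switch amounts to for a single group of equivalent reflections. It assumes the textbook inverse-variance weighting (weights 1/sigma^2); the thread does not spell out the exact cctbx expressions, and the names `merge_group` and `use_internal_variance` are illustrative choices for this sketch, not a statement about the actual cctbx interface.

```python
import math

def merge_group(intensities, sigmas, use_internal_variance=True):
    """Merge one group of equivalent observations (illustrative sketch).

    Returns (weighted mean intensity, sigma of the merged intensity).
    Assumes weights w_i = 1 / sigma_i**2.
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    sum_w = sum(weights)
    mean = sum(w * i for w, i in zip(weights, intensities)) / sum_w

    # "External" variance: propagated from the quoted sigmas only.
    external_var = 1.0 / sum_w

    n = len(intensities)
    if not use_internal_variance or n < 2:
        return mean, math.sqrt(external_var)

    # "Internal" variance: from the observed scatter about the weighted mean.
    scatter = sum(w * (i - mean) ** 2 for w, i in zip(weights, intensities))
    internal_var = scatter / ((n - 1) * sum_w)

    # The contested behavior: take the larger of the two.
    return mean, math.sqrt(max(internal_var, external_var))

if __name__ == "__main__":
    obs, sig = [100.0, 140.0], [10.0, 10.0]
    print(merge_group(obs, sig, use_internal_variance=True))
    print(merge_group(obs, sig, use_internal_variance=False))
```

With the flag off, the merged sigma depends only on the quoted sigmas, which is the behavior described in the thread for the other data processing programs.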
-Nat
On Mon, Sep 10, 2012 at 5:37 AM, Kay Diederichs wrote:
Dear Keitaro, Phil, Luc,
after looking into the question of how to merge equivalent reflections (in the context of XDS data processing), I am now convinced that cctbx should _not_ use max(internal sigma, external sigma) by default. Why?
- it is not obvious to the "end user" of the data that come out of cctbx that this formula was applied (I am not aware that the above formula is clearly documented)
- the data processing programs that I know of do not use this formula, but rather use the "external sigma". I don't imply that the "external sigma" is better, but it violates the principle of least surprise to deviate from a proven procedure.
- if the observations (that are merged in cctbx) come from one of these data processing programs then there is a chance that the error model was already adjusted such that the reduced chi**2 is near 1 (at least this is the case for XDS and SCALA). If the sigmas of the merged data then are adjusted upwards again (by the formula above), I have a strong feeling that this leads to an inconsistency.
- using max() seems like an ad-hoc way and it lacks a clear rationale.
I would like to see examples of comparison of downstream crystallographic calculations (most importantly, experimental phasing) using different ways of calculating the sigma of the merged data. Until this proves the superiority of the above formula, I believe it should be an option, not a default.
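
The point about the reduced chi**2 can be illustrated with a small simulation. This is only a sketch under stated assumptions: observations are drawn from a Gaussian whose width equals the quoted sigma (so the error model is correct by construction), weights are 1/sigma^2, and the function names are made up for this example. Under that weighting, the ratio internal variance / external variance of a group is exactly its reduced chi**2, so taking the max can only push the merged sigma upwards on average.

```python
import math
import random

def reduced_chi2(intensities, sigmas):
    """Reduced chi**2 of one group about its weighted mean.

    With weights 1/sigma**2 this equals internal variance / external variance.
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    sum_w = sum(weights)
    mean = sum(w * i for w, i in zip(weights, intensities)) / sum_w
    scatter = sum(w * (i - mean) ** 2 for w, i in zip(weights, intensities))
    return scatter / (len(intensities) - 1)

def mean_inflation(n_obs, n_groups=5000, true_i=100.0, sigma=10.0, seed=0):
    """Average of sigma_max / sigma_external over simulated groups."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_groups):
        obs = [rng.gauss(true_i, sigma) for _ in range(n_obs)]
        chi2 = reduced_chi2(obs, [sigma] * n_obs)
        # max(internal, external) / external, expressed as a sigma ratio.
        total += math.sqrt(max(chi2, 1.0))
    return total / n_groups

if __name__ == "__main__":
    for n in (2, 3, 5, 10, 20):
        print(f"multiplicity {n:2d}: mean inflation factor {mean_inflation(n):.3f}")
```

Even though the simulated sigmas are right, the average factor stays above 1, and it is largest for small groups. Whether that matters for downstream calculations is the open question raised above; the simulation does not answer it.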
thanks,
Kay
Keitaro Yamashita yamashita at castor.sci.hokudai.ac.jp Thu Sep 6 09:20:34 PDT 2012
Dear Luc,
Thank you for your explanation! I understand better than before.
Usually, the sigmas given by data processing programs are already corrected based on their error model: the sigmas are adjusted to match the actual scatter. I think using the internal variance amounts to correcting the sigmas a second time. Is that a valid approach? If the internal variance is bigger, does it suggest that the error model is not perfect?
I calculated external/internal variances using lysozyme (a standard sample in protein crystallography) data. Intensities and sigmas were determined by XDS.
I attached two plots, where "Imean", "wsigma", and "sigma" are the averaged intensity, internal sigma, and external sigma, respectively. One plot is a histogram of wsigma/sigma by multiplicity. For lower multiplicity, we can see extreme discrepancies. (Note that the vertical axes are not on the same scale.) The other plot is wsigma/sigma vs intensity. Extreme discrepancies can be seen at lower intensities. I hope it is interesting for you.
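
The attached plots are not reproduced here. As a rough stand-in for the multiplicity trend only, the sketch below tabulates the spread of wsigma/sigma (internal over external sigma) for simulated groups. It uses synthetic Gaussian data with correct sigmas, not the lysozyme data, assumes weights 1/sigma^2, and says nothing about the intensity dependence; the function names are hypothetical.

```python
import math
import random

def internal_over_external(intensities, sigmas):
    """Ratio wsigma / sigma (internal sigma over external sigma) for one group."""
    weights = [1.0 / s ** 2 for s in sigmas]
    sum_w = sum(weights)
    mean = sum(w * i for w, i in zip(weights, intensities)) / sum_w
    scatter = sum(w * (i - mean) ** 2 for w, i in zip(weights, intensities))
    n = len(intensities)
    internal_var = scatter / ((n - 1) * sum_w)
    external_var = 1.0 / sum_w
    return math.sqrt(internal_var / external_var)

def ratio_spread(multiplicity, n_groups=2000, seed=1):
    """Return the 5th and 95th percentile of wsigma/sigma for simulated groups."""
    rng = random.Random(seed)
    ratios = sorted(
        internal_over_external(
            [rng.gauss(100.0, 10.0) for _ in range(multiplicity)],
            [10.0] * multiplicity,
        )
        for _ in range(n_groups)
    )
    return ratios[int(0.05 * n_groups)], ratios[int(0.95 * n_groups)]

if __name__ == "__main__":
    for m in (2, 4, 8, 16):
        lo, hi = ratio_spread(m)
        print(f"multiplicity {m:2d}: 5%..95% of wsigma/sigma = {lo:.2f} .. {hi:.2f}")
```

In this idealized setting the ratio scatters widely at low multiplicity purely from sampling noise, which is one possible contribution to the discrepancies described above; real data may of course show additional effects.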
Crystals uses only the external variance by default, the reasoning being that the internal variance, being based on sample statistics, is almost always too unreliable because groups of equivalent reflections are too small.
Then, I think it would be nice if we could choose the method in cctbx. I mean, an option to choose "use the bigger one" or "always use the external/internal variance" would be nice to have.
I am looking forward to your comment.
Best regards, Keitaro
2012/9/6 Luc Bourhis:
Dear Keitaro,
But it is still unclear to me why it takes the greatest of the "internal" variance and "external" variance. Is it based on some tests using real data? or is it theoretically superior to using always external variance?
Those are good questions and to be honest I do not know for sure the answer to them. As seems common in applied statistics, the treatment starts with by-the-book methods relying on a well-defined theory, but at the end there is always a completely heuristic twist. Particularly true in crystallography, I would argue. But let me try to give some rationales.
It seems to me that the internal and external variance should not differ too much. Let's consider the two ways this may not be true.
1. The quoted intensities of a group of equivalent reflections have a small spread, leading to a small internal variance, but the quoted sigmas are comparatively big, resulting in an external variance significantly bigger than the internal one. This is a possible event but an unlikely one: the statistical intuition in that case is to say that the small internal variance is a fluke and to use the external one instead.
2. An external variance significantly smaller than the internal one should ring an alarm bell. Indeed, a small external variance means that the small quoted sigmas strongly suggest that the intensities cannot spread too much from their assumed common true value, whereas the comparatively bigger internal variance blatantly contradicts that. Thus either the intensities or the sigmas have not been correctly determined. Crystallographers seem to err on the side of trusting the data here, i.e. to disregard the sigmas, and therefore to choose the internal variance.
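
The two cases above can be made concrete with made-up numbers. This is a sketch assuming weights 1/sigma^2 and the usual definitions of the two variances of a weighted mean; the intensities, sigmas, and the helper name are invented for illustration.

```python
import math

def internal_and_external_sigma(intensities, sigmas):
    """Return (internal sigma, external sigma) of the weighted mean of one group."""
    weights = [1.0 / s ** 2 for s in sigmas]
    sum_w = sum(weights)
    mean = sum(w * i for w, i in zip(weights, intensities)) / sum_w
    scatter = sum(w * (i - mean) ** 2 for w, i in zip(weights, intensities))
    n = len(intensities)
    internal_var = scatter / ((n - 1) * sum_w)
    external_var = 1.0 / sum_w
    return math.sqrt(internal_var), math.sqrt(external_var)

# Case 1: small spread, large quoted sigmas -> external sigma is the larger one.
case1 = internal_and_external_sigma([100.0, 101.0, 99.0], [10.0, 10.0, 10.0])

# Case 2: large spread, small quoted sigmas -> internal sigma is the larger one.
case2 = internal_and_external_sigma([80.0, 100.0, 120.0], [2.0, 2.0, 2.0])

if __name__ == "__main__":
    for label, (s_int, s_ext) in (("case 1", case1), ("case 2", case2)):
        chosen = "internal" if s_int > s_ext else "external"
        print(f"{label}: internal={s_int:.3f} external={s_ext:.3f} -> max picks {chosen}")
```

Taking the max therefore falls back on the quoted sigmas in case 1 and on the observed scatter in case 2, which matches the heuristic described in the two points.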
I would like to know how this method affects further crystallographic process.
I am afraid I do not have experience with your domain, protein crystallography. I know that the small molecule program Crystals uses only the external variance by default, the reasoning being that the internal variance, being based on sample statistics, is almost always too unreliable because groups of equivalent reflections are too small. Since Crystals is as well accepted as ShelXL for producing publishable structures, it answers your question in at least one corner of crystallography, unfortunately not yours.
I think it could be a simple and interesting exercise to take a representative protein dataset of yours, then to print the redundancy, the internal variance, and the external variance. Actually I would be surprised if such a study has not already been done and published. Perhaps some of the gurus on this forum can shed more light on that subject.
Best wishes,
Luc
-- Kay Diederichs http://strucbio.biologie.uni-konstanz.de email: [email protected] Tel +49 7531 88 4049 Fax 3183 Fachbereich Biologie, Universität Konstanz, Box M647, D-78457 Konstanz
This e-mail is digitally signed. If your e-mail client does not have the necessary capabilities, just ignore the attached signature "smime.p7s".
_______________________________________________ cctbxbb mailing list [email protected] http://phenix-online.org/mailman/listinfo/cctbxbb
participants (2): Kay Diederichs, Nathaniel Echols