Question about sigma(I) calculation in merge_equivalents

Keitaro Yamashita

5 Sep 2012

1:01 p.m.

Dear CCTBX developers,

I have a question about how CCTBX calculates sigma(averaged I) when merging.

In cctbx/miller/merge_equivalents.h (line 594 in phenix-dev-1148), sigma(averaged I)^2 is calculated as max(mv.gsl_stats_wvariance()/values.size(), 1/mv.sum_weights()), i.e. it takes the larger of the two values.

I know the latter expression: it is simply sigma(averaged I)^2 = 1/sum(w), where w = 1/sigma(I)^2. I believe this equation is widely used in other crystallographic programs, e.g. XDS.

The former is, according to the comment in the code, an emulation of gsl_stats_wvariance, which is documented on the GSL website: http://www.gnu.org/software/gsl/manual/html_node/Weighted-Samples.html

I don't know how this "wvariance" is derived. Is it a better estimate of sigma(averaged I)? Why is the maximum of the two taken? I would like to know the theory behind this sigma calculation in CCTBX.

Best regards,
Keitaro
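For readers who want to see the two candidates side by side, here is a minimal Python sketch of the expression quoted in the question. It is not the cctbx code: the function name `merged_variance` is hypothetical, the weighted variance uses the unbiased form given in the GSL "Weighted Samples" documentation, and falling back to the external value for a single observation is an assumption made only for this illustration.

```python
def merged_variance(intensities, sigmas):
    """Return (mean, internal, external, chosen) for one group of equivalents.

    Hypothetical helper for illustration; not part of cctbx.
    """
    weights = [1.0 / s ** 2 for s in sigmas]  # w = 1/sigma(I)^2
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    mean = sum(w * x for w, x in zip(weights, intensities)) / sum_w

    # "External" variance of the weighted mean: 1/sum(w)
    external = 1.0 / sum_w

    n = len(intensities)
    if n < 2:
        # Assumption: with one observation there is no scatter to measure.
        return mean, None, external, external

    # Unbiased weighted sample variance, GSL-style:
    #   sum(w) / (sum(w)^2 - sum(w^2)) * sum(w * (x - mean)^2)
    wss = sum(w * (x - mean) ** 2 for w, x in zip(weights, intensities))
    wvariance = sum_w / (sum_w ** 2 - sum_w2) * wss

    # "Internal" variance of the mean: sample variance divided by n
    internal = wvariance / n
    return mean, internal, external, max(internal, external)
```

With equal sigmas the weighted variance reduces to the ordinary sample variance, which makes the function easy to check by hand on a small group.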


Luc Bourhis

5 Sep

3:41 p.m.


Dear Keitaro,

> I would like to know the theory of this sigma calculation in CCTBX.

That theory is written up in a file you can find in the source distribution of the CCTBX, cctbx/miller/equivalent_reflection_merging.tex. For your convenience I have also attached the resulting PDF. Best wishes, Luc J. Bourhis

Keitaro Yamashita

6 Sep

9:12 a.m.


Dear Luc,

Thank you for your kindness. I didn't know that TeX documents are included in cctbx! I'm trying to understand it.

But it is still unclear to me why it takes the greater of the "internal" variance and the "external" variance. Is it based on tests using real data? Or is it theoretically superior to always using the external variance?

I would like to know how this method affects further crystallographic processing.

Keitaro
_______________________________________________ cctbxbb mailing list [email protected] http://phenix-online.org/mailman/listinfo/cctbxbb

Luc Bourhis

11:27 a.m.


Dear Keitaro,

> But it is still unclear to me why it takes the greatest of the "internal" variance and "external" variance. Is it based on some tests using real data? or is it theoretically superior to using always external variance?

Those are good questions and, to be honest, I do not know the answer to them for sure. As seems common in applied statistics, the treatment starts with by-the-book methods relying on a well-defined theory, but at the end there is always a completely heuristic twist. This is particularly true in crystallography, I would argue. But let me try to give some rationale.

It seems to me that the internal and external variances should not differ too much. Let's consider the two ways in which this may not be true.

1. The quoted intensities of a group of equivalent reflections have a small spread, leading to a small internal variance, but the quoted sigmas are comparatively big, resulting in an external variance significantly bigger than the internal one. This is a possible event but an unlikely one: the statistical intuition in that case is to say that the small internal variance is a fluke and to use the external one instead.

2. An external variance significantly smaller than the internal one should ring an alarm bell. Indeed, a small external variance means that the small quoted sigmas strongly suggest that the intensities cannot spread too much from their assumed common true value, whereas the comparatively bigger internal variance blatantly contradicts that. Thus either the intensities or the sigmas have not been correctly determined. Crystallographers seem to err on the side of trusting the data here, i.e. to disregard the sigmas, and therefore to choose the internal variance.
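The two cases can be made concrete in a special case. The derivation below assumes, purely for illustration, that all n equivalents carry the same quoted sigma and that the internal variance uses the GSL form of the weighted variance quoted earlier in the thread; the symbols s^2 (ordinary sample variance) and the int/ext subscripts are notation introduced here.

```latex
% Assumption: n equivalents with identical quoted sigma, so w_i = 1/\sigma^2.
\begin{align*}
\sigma^2_{\mathrm{ext}} &= \frac{1}{\sum_i w_i} = \frac{\sigma^2}{n}, \\
\sigma^2_{\mathrm{int}} &= \frac{1}{n}\,
  \frac{\sum_i w_i}{\left(\sum_i w_i\right)^2 - \sum_i w_i^2}\,
  \sum_i w_i \left(I_i - \bar{I}\right)^2
  = \frac{1}{n}\,\frac{\sum_i \left(I_i - \bar{I}\right)^2}{n-1}
  = \frac{s^2}{n}, \\
\frac{\sigma^2_{\mathrm{int}}}{\sigma^2_{\mathrm{ext}}} &= \frac{s^2}{\sigma^2}.
\end{align*}
```

Under that assumption, case 1 corresponds to s^2 < sigma^2 and case 2 to s^2 > sigma^2. Since s^2 is itself a noisy estimate when n is small, ratios far from 1 are expected to occur by chance more often at low multiplicity.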

> I would like to know how this method affects further crystallographic process.

I am afraid I do not have experience with your domain, protein crystallography. I know that the small-molecule program Crystals uses only the external variance by default, the reasoning being that the internal variance, being based on sample statistics, is almost always too unreliable because groups of equivalent reflections are too small. Since Crystals is as well accepted as ShelXL for producing publishable structures, that answers your question in at least one corner of crystallography, unfortunately not yours.

I think it could be a simple and interesting exercise to take a representative protein dataset of yours, then to print the redundancy, the internal variance, and the external variance. Actually, I would be surprised if such a study has not already been done and published. Perhaps some of the gurus on this forum can shed more light on that subject.

Best wishes,
Luc
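The exercise suggested above could be sketched in plain Python as follows. This is only an outline under stated assumptions: `variance_table` and `DEMO` are hypothetical names, the input is assumed to be unmerged observations whose Miller indices have already been mapped to the asymmetric unit, the demo numbers are made up, and no cctbx API is used.

```python
from collections import defaultdict


def variance_table(observations):
    """Group (hkl, intensity, sigma) tuples by hkl and return one row per
    unique reflection: (hkl, multiplicity, internal, external).

    internal is None when a reflection was measured only once.
    """
    groups = defaultdict(list)
    for hkl, intensity, sigma in observations:
        groups[hkl].append((intensity, sigma))

    rows = []
    for hkl, obs in sorted(groups.items()):
        n = len(obs)
        w = [1.0 / s ** 2 for _, s in obs]  # w = 1/sigma(I)^2
        sum_w = sum(w)
        mean = sum(wi * i for wi, (i, _) in zip(w, obs)) / sum_w
        external = 1.0 / sum_w  # variance of the mean from quoted sigmas
        internal = None
        if n > 1:
            # GSL-style unbiased weighted variance, divided by n
            wss = sum(wi * (i - mean) ** 2 for wi, (i, _) in zip(w, obs))
            sum_w2 = sum(wi * wi for wi in w)
            internal = sum_w / (sum_w ** 2 - sum_w2) * wss / n
        rows.append((hkl, n, internal, external))
    return rows


# Made-up observations, for demonstration only.
DEMO = [
    ((1, 0, 0), 100.0, 5.0),
    ((1, 0, 0), 110.0, 5.0),
    ((1, 0, 0), 90.0, 5.0),
    ((2, 0, 0), 50.0, 4.0),
]

if __name__ == "__main__":
    for hkl, n, internal, external in variance_table(DEMO):
        shown = "n/a" if internal is None else f"{internal:.2f}"
        print(hkl, n, shown, f"{external:.2f}")
```

Binning the resulting rows by multiplicity or by mean intensity would then show where the two estimates disagree most.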

Keitaro Yamashita

4:20 p.m.


Dear Luc,

Thank you for your explanation! I understand better than before.

Usually, the sigmas given by data processing programs have already been corrected based on their error model: the sigmas are adjusted to match the actual scatter. I think using the internal variance amounts to re-correcting the sigmas. Is that a valid approach? If the internal variance is bigger, does that suggest the error model is not perfect?

I calculated the external and internal variances using lysozyme (a standard sample in protein crystallography) data. Intensities and sigmas were determined by XDS.

I attached two plots, where "Imean", "wsigma", and "sigma" are the averaged intensity, the internal sigma, and the external sigma, respectively. One plot is a histogram of wsigma/sigma by multiplicity. At lower multiplicity, we can see extreme discrepancies. (Note that the vertical axes are not all on the same scale.) The other plot is wsigma/sigma vs. intensity. Extreme discrepancies can be seen at lower intensities. I hope it is interesting for you.

> Crystals use only the external variance by default, the reasoning being that the internal variance being based on sample statistics is almost always too unreliable because groups of equivalent reflections are too small.

Then I think it would be nice if we could choose the method in cctbx. I mean, an option to choose between "use the bigger one" and "always use the external/internal variance" would be nice to have.

I am looking forward to your comments.

Best regards,
Keitaro

Phil Evans

7 Sep

3:53 p.m.


I think this is a hard question and I don't know the answer. Here are a few of my thoughts.

Phil
