Model based outlier calculation in mmtbx.scaling.outlier_rejection
Dear cctbx developers,

I am interested in the implementation of model-based reflection outlier rejection. While reading mmtbx/scaling/outlier_rejection.py (lines 244-351), I noticed a possible discrepancy between what log_message explains and what the code actually does. The log_message in the code says:

Outliers are rejected on the basis of the assumption that a scaled log likelihood differnce 2(log[P(Fobs)]-log[P(Fmode)])/Q\" is distributed according to a Chi-square distribution (Q\" is equal to the second derivative of the log likelihood function of the mode of the distribution). The outlier threshold of the p-value relates to the p-value of the extreme value distribution of the chi-square distribution.

whereas the actual p_value is calculated for each hkl as p_value = 1 - erf(sqrt(LLG))**N, where LLG = log p(F=Fbar | Fmodel) - log p(F=Fobs | Fmodel) and N is the number of reflections. Here, Fbar is the F that gives the maximum value of p(F | Fmodel). At the very least, Q (the second derivative of p(F=Fbar | Fmodel)) is not used in the actual calculation.

Could someone please explain the meaning of the actual calculation? Why take the square root, and why raise the erf() result to the power of N?

Thank you very much,
Keitaro
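For readers following the question, here is a minimal sketch of the formula as it is described in the message above. It is not the code from mmtbx/scaling/outlier_rejection.py; the function name `outlier_p_value` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def outlier_p_value(llg, n_reflections):
    """Evaluate p = 1 - erf(sqrt(LLG))**N as quoted in the message.

    llg            -- log p(F=Fbar | Fmodel) - log p(F=Fobs | Fmodel);
                      non-negative because Fbar maximizes p(F | Fmodel)
    n_reflections  -- N, the number of reflections

    Hypothetical helper for illustration, not the mmtbx implementation.
    """
    if llg < 0:
        raise ValueError("LLG must be non-negative")
    return 1.0 - math.erf(math.sqrt(llg)) ** n_reflections


if __name__ == "__main__":
    # The same LLG looks far less surprising once N reflections are considered.
    for n in (1, 100, 10000):
        print(n, outlier_p_value(8.0, n))
```

With N = 1 the expression reduces to 1 - erf(sqrt(LLG)); larger N makes the p-value for a given LLG larger, so fewer reflections fall below a fixed threshold.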
Hi Keitaro,

Peter Zwart wrote the code, so I hope he will comment on this further. My understanding is that this is essentially based on Randy Read's paper: Read, R. J. (1999). Acta Cryst. D55, 1759-1764.

Pavel
_______________________________________________
cctbxbb mailing list
[email protected]
http://phenix-online.org/mailman/listinfo/cctbxbb
Hi,
It has been a while since I wrote this, and you could potentially be right
that I forgot to divide by the second derivative. I'll have a look.
P
-- ----------------------------------------------------------------- P.H. Zwart Research Scientist Berkeley Center for Structural Biology Lawrence Berkeley National Laboratories 1 Cyclotron Road, Berkeley, CA-94703, USA Cell: 510 289 9246 BCSB: http://bcsb.als.lbl.gov PHENIX: http://www.phenix-online.org SASTBX: http://sastbx.als.lbl.gov -----------------------------------------------------------------
Dear all,
Thank you for your replies. I have already had a look at the Read (1999)
paper, but I could not find a direct explanation of this implementation
(or of what the message in the code explains).
Thanks to advice from a friend, I now understand that what the code does
is something like a likelihood-ratio test. The square root is taken
because the cumulative distribution function of the chi-square
distribution with one degree of freedom is erf(sqrt(x/2)), so x = 2*LLG
gives erf(sqrt(LLG)). However, I still
do not understand why it is raised to the power of N (**N).
I would be grateful if you could explain the reason.
Best regards,
Keitaro
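The chi-square identity mentioned in this message can be checked numerically with the standard library alone. The sketch below assumes only that a chi-square variable with one degree of freedom is the square of a standard normal variable; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_1dof_cdf_via_normal(x):
    """P(Z**2 <= x) for standard normal Z, i.e. P(|Z| <= sqrt(x))."""
    z = math.sqrt(x)
    return 2.0 * NormalDist().cdf(z) - 1.0


def chi2_1dof_cdf_via_erf(x):
    """Closed form quoted in the thread: erf(sqrt(x/2))."""
    return math.erf(math.sqrt(x / 2.0))


if __name__ == "__main__":
    for llg in (0.5, 2.0, 4.5):
        x = 2.0 * llg  # likelihood-ratio statistic, 2 * LLG
        print(llg, chi2_1dof_cdf_via_normal(x), math.erf(math.sqrt(llg)))
```

Substituting x = 2*LLG into erf(sqrt(x/2)) gives erf(sqrt(LLG)), which matches the term that appears in the p_value expression.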
The reason for the power of N might have to do with extreme value statistics.
Tomorrow I'll take a detailed look.
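One possible reading of the extreme value remark, offered here as an assumption rather than a confirmed explanation of the code: if N independent values each have cumulative distribution function F(x), the largest of them has cumulative distribution function F(x)**N, so 1 - F(x)**N is the probability that the largest of N values is at least x. The sketch below compares that formula with a small simulation for chi-square values with one degree of freedom; all names are hypothetical.

```python
import math
import random


def max_chi2_p_value(x, n):
    """P(largest of n independent chi-square(1) values >= x) = 1 - F(x)**n."""
    return 1.0 - math.erf(math.sqrt(x / 2.0)) ** n


def simulated_max_p_value(x, n, trials=5000, seed=0):
    """Monte Carlo estimate of the same probability.

    Each chi-square(1) value is drawn as the square of a standard normal.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        largest = max(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        if largest >= x:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    x, n = 7.0, 50
    print("analytic :", max_chi2_p_value(x, n))
    print("simulated:", simulated_max_p_value(x, n))
```

Under this reading, the power of N asks how surprising a given deviation is for the most extreme of N reflections, rather than for a single reflection considered in isolation.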
participants (4)
- Keitaro Yamashita
- Pavel Afonine
- Peter Zwart
- Peter Zwart