Cross-validation when test set is minuscule
Hi everyone,

Right now we have one of those very difficult Rfree situations where it's impossible to generate a single meaningful Rfree set. Since we're in a bit of a hurry with this structure it would be good if someone could point me in the right direction. We have crystals with 1542 non-H atoms in the asymmetric unit that diffract to only 3.6 Å in P65, which gives us a whopping 2300 reflections in total. 5% of this is only about 100 reflections. Luckily the protein is only a single point mutation of a wild type that has been solved to much better resolution, so we know what it should look like and I simply want to investigate the effect of different levels of conservatism in the refinement, e.g. NCS in xyz and B, group B-factors, reference model, Ramachandran restraints etc. However since the quality criterion for this is Rfree I'm not able to do this.

I believe the correct approach is k-fold statistical cross-validation, but can someone remind me of the correct way to do this? I've done a bit of Googling without finding anything very helpful.

Thanks
Derek
________________________________________________________________________
Derek Logan
Associate Professor
Dept. of Biochemistry and Structural Biology
Centre for Molecular Protein Science
Lund University, Box 124, 221 00 Lund, Sweden
tel: +46 46 222 1443
mob: +46 76 8585 707
http://www.cmps.lu.se
www.maxlab.lu.se/crystal
www.saromics.com
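For reference, the arithmetic in the question and the basic idea of a k-fold split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `kfold_free_flags` is a hypothetical helper, not a function from Phenix or any other crystallographic package, and k = 10 is an arbitrary choice. In k-fold cross-validation every reflection belongs to exactly one fold, so each reflection is used as a test reflection in exactly one of the k refinements.

```python
import random


def kfold_free_flags(n_reflections, k, seed=0):
    """Hypothetical helper: give every reflection a fold number 0..k-1.

    Folds are as equal in size as possible and assigned at random.
    Fold i is the test set of refinement i; all other folds are the
    working set for that refinement.
    """
    flags = [i % k for i in range(n_reflections)]
    random.Random(seed).shuffle(flags)
    return flags


n_reflections = 2300  # total number of reflections quoted in the question

# Size of a single test set at the two commonly used fractions.
for fraction in (0.05, 0.10):
    n_test = round(n_reflections * fraction)
    print(f"{fraction:.0%} test set: {n_test} reflections")

# A 10-fold split: ten disjoint test sets that together cover all data.
k = 10  # assumption, chosen only for illustration
flags = kfold_free_flags(n_reflections, k)
fold_sizes = [flags.count(fold) for fold in range(k)]
print("fold sizes:", fold_sizes)
```

The flags produced here are plain integers; to use them in an actual refinement they would have to be written into the reflection file with the refinement program's own tools.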
Hi Derek,

choosing 5% for the free set is not a dogma. I always use 10%, and that's what CNS was doing for years. In your case this will make about 200. Not a whole lot, but better than 100.

You can generate several (say 10-50) different test sets and independently refine the model against each of them (from the very beginning). Then make a note of the differences (in model, R-factors). Those differences will be uncertainties likely due to the different test sets used.

I realize it may be tedious to do 10-50 refinements for each model parametrization and refinement strategy that you want to test. In this case I would simply reduce the choices down to the most reasonable ones given the resolution and model quality:

- Use individual B-factor refinement. With the type of restraints we have, this is OK in most cases. Switch to group B refinement only if you have strong reasons to believe that individual B refinement isn't good for your case.
- Use torsion NCS.
- Use Ramachandran plot restraints only to keep (preserve) good conformations during refinement, not to fix bad ones (outliers). That is: in case of an outlier, fix it manually first, then refine with Ramachandran restraints so that it does not become an outlier again.
- If you have a good higher-resolution model, you can use it as a reference model, if needed.

In future we will investigate using ideas recently published in Acta D that suggest ways to overcome the problem of too small test sets.

Pavel
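The repeated-test-set procedure described above could be organized roughly as follows. This is a minimal sketch, not a Phenix workflow: `random_test_sets` and `summarize` are hypothetical helper names, the 10% fraction and the count of 10 sets are taken from the ranges mentioned in the message, and the refinements themselves are assumed to be run separately, each from the same starting model.

```python
import random
import statistics


def random_test_sets(n_reflections, fraction, n_sets, seed=0):
    """Hypothetical helper: draw n_sets independent random test sets.

    Each set holds the indices of the reflections excluded from one
    refinement. Unlike a k-fold split, the sets may overlap.
    """
    rng = random.Random(seed)
    size = round(n_reflections * fraction)
    return [
        frozenset(rng.sample(range(n_reflections), size))
        for _ in range(n_sets)
    ]


def summarize(rfree_values):
    """Mean and sample standard deviation of Rfree over the test sets."""
    return statistics.mean(rfree_values), statistics.stdev(rfree_values)


# Ten different 10% test sets for the 2300 reflections in this data set.
test_sets = random_test_sets(2300, 0.10, n_sets=10)
print([len(test_set) for test_set in test_sets])

# After refining once per test set, collect the resulting Rfree values
# for one strategy in a list and pass that list to summarize(). The
# standard deviation estimates the uncertainty due to test-set choice.
```

When two refinement strategies are compared this way, a difference in mean Rfree that is smaller than the spread across test sets is hard to interpret as a real improvement.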
_______________________________________________ phenixbb mailing list [email protected] http://phenix-online.org/mailman/listinfo/phenixbb
One more item I forgot to mention: if necessary, you may want to do weight optimization.

Pavel