You cannot compare R-factors that were computed using different sets of reflections, so the above comparison is not valid. The same applies to your "Longer version". Let's compare apples with apples.
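To make the objection concrete, here is a toy sketch (plain numpy on synthetic numbers, not Phenix code and not real data) of how the conventional R-factor, R = sum(|Fobs - k*Fcalc|) / sum(Fobs), shifts when weak reflections are dropped even though the model is unchanged:

import numpy as np

def r_factor(f_obs, f_calc):
    # Conventional R = sum(|Fobs - k*Fcalc|) / sum(Fobs),
    # with k a simple least-squares scale between the two arrays.
    k = np.sum(f_obs * f_calc) / np.sum(f_calc * f_calc)
    return np.sum(np.abs(f_obs - k * f_calc)) / np.sum(f_obs)

# Synthetic "data": one model, one dataset, two reflection selections.
rng = np.random.default_rng(0)
f_calc = rng.gamma(shape=2.0, scale=50.0, size=10000)
f_obs = np.abs(f_calc + rng.normal(0.0, 10.0, size=10000))   # constant-error noise

keep_all = np.ones(f_obs.size, dtype=bool)                   # selection A: every reflection
keep_strong = f_obs > np.percentile(f_obs, 10)               # selection B: weakest 10% rejected

print(r_factor(f_obs[keep_all], f_calc[keep_all]))           # R over the full set
print(r_factor(f_obs[keep_strong], f_calc[keep_strong]))     # typically lower, same model

The two printed values differ only because of which reflections were counted, which is the point being made here.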
Read the example. It is the same set of reflections. I set it up that way precisely because getting the same set of F's is not easy (phenix.refine's reflection utilities won't output F if I apply a <= truncation on intensity). Specifically: Phenix's default behavior with that mtz is to take Imean, throw away anything with Imean < 0, and convert the rest to F. The mtz file also contained F's out of CCP4's TRUNCATE for all the data; phenix.refine throws part of that data away internally. So I have three selection methods (sketched in the snippet below):

1. The TRUNCATE F data, which I have to force phenix.refine to use via labels='F,SIGF'
2. The F data that phenix.refine converts from Imean
3. A subset of #1, reduced by the selection criteria used in #2

So I take a PDB file refined against option #1 and compare it to #3. Set #3 contains all the free-R reflections that #2 has, but the F values differ because TRUNCATE modifies them. That's about as fair a comparison as I can find: a PDB file refined against #2 and compared to #2, versus a PDB file refined against #1 and compared to #3. The only difference between #1 and #3 is that #1 also contains the F's for reflections with Imean < 0, as altered by TRUNCATE.
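For clarity, a schematic of the three selections, written with plain numpy arrays as stand-ins for the real mtz columns (the synthetic numbers and the simple sqrt(Imean) conversion are assumptions for illustration, not phenix.refine's or TRUNCATE's actual internals):

import numpy as np

# Stand-ins for the mtz contents: Imean for every reflection, plus the
# F,SIGF that TRUNCATE wrote for every reflection (including Imean < 0).
rng = np.random.default_rng(1)
i_mean = rng.normal(loc=400.0, scale=300.0, size=5000)   # some Imean < 0, as for weak data
f_truncate = np.sqrt(np.clip(i_mean, 1.0, None))         # crude stand-in for TRUNCATE's F

keep = i_mean >= 0                   # phenix.refine's default rejection of Imean < 0

f1 = f_truncate                      # #1: TRUNCATE F's, forced in via labels='F,SIGF'
f2 = np.sqrt(i_mean[keep])           # #2: F converted from the surviving Imean (sqrt assumed)
f3 = f_truncate[keep]                # #3: #1 restricted to exactly the reflections in #2

print(f1.size, f2.size, f3.size)     # #2 and #3 index the same reflections; #1 has more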
Comparing R-factors in this case does not tell you that one refinement is better or worse than the other. It doesn't tell you anything, because the R-factor is not a meaningful measure when you are dealing with two different datasets (datasets containing different numbers of reflections).
This would mean that the whole thing is inherently untestable because of phenix.refine's rejection criteria: there will always be a difference in data count. Propose a better experiment.

Phil