sculptor : "Wrong alignment format:"
(starting a new thread?) trouble with alignment format. tried this :
fubar.fas : Sorry: Unknown file(s): fubar.fas
fubar.fasta : Sorry: Wrong alignment format: fubar.fasta
i compared to one that worked yesterday and it's not obvious : first sequence is the target, the pdb sequence is in there (somewhere), i changed a ">sp" to ">gi".... if anyone knows what i'm missing here,...
-Bryan
On Fri, Nov 19, 2010 at 10:27 AM, Bryan Lepore
trouble with alignment format. tried this :
fubar.fas : Sorry: Unknown file(s): fubar.fas
fubar.fasta : Sorry: Wrong alignment format: fubar.fasta
i compared to one that worked yesterday and it's not obvious : first sequence is the target, the pdb sequence is in there (somewhere), i changed a ">sp" to ">gi"....
The extension ".fas" is not recognized, but ".fasta" should be. My experience with the FASTA parser (for simple sequences, at least) is that it's extremely strict about what is an acceptable format, but that most of the programs/servers that generate/distribute these files are much looser. Could you send me the file (off-list, of course) for debugging? -Nat
It will almost certainly be the header not being in the right format. I will change the parser to be a more tolerant one, as sculptor does not need the database records stored on the header anyway. As a workaround, just change the extension to .ali, which will have the same effect. BW, Gabor
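To illustrate what a "more tolerant" FASTA reader could look like, here is a minimal sketch in Python. It is not sculptor's parser; the function name `parse_fasta_tolerant` and the example records are made up for illustration. The only idea it demonstrates is treating everything after ">" as an opaque label, so that ">sp|..." and ">gi|..." headers are handled identically.

```python
def parse_fasta_tolerant(text):
    """Parse FASTA text into (header, sequence) pairs.

    The header is kept as an opaque string: no database-record
    structure (">sp|...", ">gi|...") is expected or validated.
    """
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines anywhere
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line.replace(" ", ""))  # join wrapped sequence lines
        else:
            raise ValueError("sequence data found before the first '>' header")
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-sequence alignment with differently styled headers.
example = """>gi|12345|hypothetical target
MKT-AYIAK
QRQISFVK
>sp|P00000|EXAMPLE_HUMAN
MKTWAY-AK
QRQ-SFVK
"""
records = parse_fasta_tolerant(example)
```

Each record comes back as a plain `(header, sequence)` tuple, with gap characters preserved so the alignment columns stay intact.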
Sculptor is Python only, so OpenMP will not help. In principle, we could make it parallel, but sculptor generally takes only a couple of seconds to run, so there was not much point in making it use multiple CPUs. However, in certain cases it takes much longer for no apparent reason. There is actually a new version out there (I checked it in yesterday) that should be faster. Is this what you are running (0.3.0)?
On Nov 19 2010, Bryan Lepore wrote:
On Fri, Nov 19, 2010 at 4:59 PM, Dr G. Bunkoczi
wrote: change the extension to .ali
...like magic.
(it's still running though) ... if i recompile with --openmp will sculptor use >1 processor?
-Bryan _______________________________________________ phenixbb mailing list [email protected] http://phenix-online.org/mailman/listinfo/phenixbb
-- ################################################## Dr Gabor Bunkoczi Cambridge Institute for Medical Research Wellcome Trust/MRC Building Addenbrooke's Hospital Hills Road Cambridge CB2 0XY ##################################################
Is this what you are running (0.3.0)?
yes - Nat hooked me up with dev-586 :^)
-Bryan
Hmm, this is disappointing, I thought the speed issue had been resolved... Could you point me to an example that takes very long? I can have another go at finding the bottleneck. BW, Gabor
Hi,
finally got back around to this one, but it's about speed now, not format :
On Mon, Nov 22, 2010 at 5:07 AM, Dr G. Bunkoczi
Is this what you are running (0.3.0)?
yes (via dev-590)
Could you point me to an example that takes very long? I can have another go at finding the bottleneck.
i could - but if i told you i have 190 sequences or 189529 characters (via `wc`) in the alignment, does that indicate anything? -Bryan
Hi Bryan,
yes, it could be. Sculptor has to find the sequence corresponding to the protein model: it first aligns the chain sequence with every sequence in the alignment and then picks the best match. This can take some time. On my machine, searching a 190-sequence alignment takes about 5 min. However, if you have several chains in your model and you want all of them to be processed, the total time will be about 5 min multiplied by the number of protein chains.
Now, I am wondering what you are trying to achieve by using such a large alignment. If this is something you consider routine, I will spend some time speeding up the calculation.
Obviously, you must be trying to extract as much information from the sequence alignment as possible, and I am not sure the sequence similarity calculation as implemented in sculptor is optimal for this (right now, sculptor just takes the minimum of all pairwise substitution scores at a given position). This works well for a pairwise sequence alignment, but for a 190-sequence alignment it just results in gap scores everywhere. Could you also give some advice on how this is best calculated? Would it be better to calculate the average?
Best wishes, Gabor
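The difference between taking the minimum and taking the average of the pairwise substitution scores at each position can be seen in a small sketch. This is not sculptor's code: the scoring function `pair_score` is a toy stand-in (match 1, mismatch 0, gap -1) for a real substitution matrix, and the sequences are invented. It assumes all sequences are already aligned to the same length.

```python
from statistics import mean

GAP = "-"


def pair_score(a, b, match=1.0, mismatch=0.0, gap=-1.0):
    """Toy substitution score; a real matrix such as BLOSUM62 would go here."""
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def column_scores(target, others, combine):
    """Score each target position against the same column of every other sequence.

    `combine` reduces the list of pairwise scores to one number,
    e.g. the built-in `min` or `statistics.mean`.
    """
    scores = []
    for i, residue in enumerate(target):
        per_pair = [pair_score(residue, seq[i]) for seq in others]
        scores.append(combine(per_pair))
    return scores


# Hypothetical aligned sequences; the first is the target.
target = "MKTAY"
others = ["MKTAY", "MRTAY", "M-TAY", "MKT-Y"]

min_scores = column_scores(target, others, min)
mean_scores = column_scores(target, others, mean)
```

With the minimum, a single gap anywhere in a column drags that position down to the gap score, which is why a large alignment ends up with gap scores almost everywhere. The average degrades more gradually, at the cost of hiding isolated gaps.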
Hi Bryan, on second thought, I have put a quick pre-filtering into the alignment search, and now the 190-alignment takes only 2 s instead of 5 min. Could you try this version and let me know whether the speed issue has gone away? This will not speed up the calculation if all sequences in the alignment are almost identical. BW, Gabor
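The thread does not say how the pre-filter works, so the following is only a sketch of one common approach: run a cheap similarity test first and do the expensive comparison only for sequences that pass it. Everything here is an assumption made for illustration, including the k-mer test, the 0.5 threshold, the use of `difflib.SequenceMatcher` as a stand-in for a real pairwise alignment, and the example sequences.

```python
from difflib import SequenceMatcher


def kmers(seq, k=3):
    """Set of all length-k substrings of an ungapped sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cheap_similarity(a, b, k=3):
    """Fraction of a's k-mers that also occur in b (fast, order-insensitive)."""
    ka = kmers(a, k)
    if not ka:
        return 0.0
    return len(ka & kmers(b, k)) / len(ka)


def expensive_similarity(a, b):
    """Stand-in for a full pairwise alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def best_match(chain, alignment, threshold=0.5):
    """Return (name, score) of the best-matching sequence, or None if none pass."""
    best = None
    for name, gapped in alignment.items():
        seq = gapped.replace("-", "")
        if cheap_similarity(chain, seq) < threshold:
            continue  # skipped without running the costly comparison
        score = expensive_similarity(chain, seq)
        if best is None or score > best[1]:
            best = (name, score)
    return best


# Hypothetical alignment; "model_seq" is the entry matching the model chain.
alignment = {
    "homolog_a": "MKTAYIAKQR--QISFVKSHFSRQ",
    "model_seq": "MSTNPKPQRKTKRNT--NRRPQDV",
    "homolog_b": "MKSAYLAKQRGGQLSFIKAHFARQ",
}
chain = "MSTNPKPQRKTKRNTNRRPQDV"
result = best_match(chain, alignment)
```

A filter of this kind also behaves the way Gabor describes: if all sequences in the alignment are almost identical, every one of them passes the cheap test and nothing is saved.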
On Mon, Nov 29, 2010 at 6:36 AM, Dr G. Bunkoczi
On my machine, searching a 190-sequence alignment takes about 5 mins. (time output didn't make it for some reason).
reading ahead in the email, so i just tried dev-592, still takes a very long time. i finally understand that sculptor is in fact calculating an alignment. my alignments have been pre-aligned (mafft usually). i have not tried a .def to turn off the alignment - can i do that?
what you are trying to achieve with using such a large alignment....you must be trying to extract as much information from the sequence alignment as possible,
yes
I am not sure the sequence similarity calculation as implemented in sculptor is optimal for this
i see
Could you also give some advice on how this is best calculated?
in terms of algorithms out there, IIUC mafft has a very sophisticated algorithm for it, and at any rate it can be compiled with multithreading. its license appears not to be prohibitive. -Bryan
On Mon, Nov 29, 2010 at 10:10 AM, Bryan Lepore
On Mon, Nov 29, 2010 at 6:36 AM, Dr G. Bunkoczi
wrote: On my machine, searching a 190-sequence alignment takes about 5 mins. (time output didn't make it for some reason).
reading ahead in the email, so i just tried dev-592, still takes a very long time.
That's a few days old now - try this one: https://www.phenix-online.org/download/admin/?version=dev-595 -Nat
Hi Bryan,
i finally understand that sculptor is in fact calculating an alignment. my alignments have been pre-aligned (mafft usually). i have not tried a .def to turn off the alignment - can i do that?
Sculptor has to match your protein model with your alignment. For all other calculations, Sculptor uses the alignment you provide. You can also ask Sculptor to create this alignment for you (it is off by default).
in terms of algorithms out there, IIUC mafft has a very sophisticated algorithm for it, and at any rate it can be compiled with multithreading. its license appears not to be prohibitive.
I will have a look at MAFFT, but I do hope the speed issues will disappear with the new version (you may have to wait till tomorrow, as I checked in the changes only a couple of hours ago). BW, Gabor
On Mon, Nov 29, 2010 at 1:30 PM, Dr G. Bunkoczi
I will have a look at MAFFT, but I do hope the speed issues will disappear with the new version (you may have to wait till tomorrow, as I have literally checked in the changes a couple of hours ago).
in case this helps : i ran it like before using dev-595 :
real 17m51.087s
user 16m47.843s
sys 0m6.064s
however, i don't have a comparison to show at the moment. however, i need to think about my approach based on what i learned so far. thanks!
-Bryan
participants (3)
- Bryan Lepore
- Dr G. Bunkoczi
- Nathaniel Echols