PredicedSP.pl - Calculates predicted SP score for assessing the quality of protein multiple sequence alignments Copyright (C) 2008 Virpi Ahola virpi.ahola@mtt.fi This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . A copy of this license is in the COPYING file. ------------------------------------------------------------- Predicted SP score ------------------------------------------------------------- Predicted SP score is a alignment quality score for protein multiple sequence alignments (MSA). It can be applied for any protein MSA. AQ score is based on the prediction model which uses a whole alignment conservation score ConsAA. For more information of calculation of the whole alignment conservation score ConsAA, please visit: Ahola V, Aittokallio T, Vihinen M, Uusipaikka E. A statistical score for assessing the quality of multiple sequence alignments. BMC Bioinformatics 2006, 7:484. For more information of the predicted SP score: Ahola V, Aittokallio T, Vihinen M, Uusipaikka E. Model-based prediction of the quality of protein multiple sequence alignment. Submitted to Bioinformatics. ------------------------------------------------------------- Tarball predictedSP.tar.gz includes the following files and directories: PredictedSP.pl - Perl script for calculating predicted SP. ConsAA - C binary file for calculating the ConsAA values for multiple sequence alignment and conservation scores (p-values and log p-values) for each alignment position. test - Directory containing example test alignment files for SH2 domain and Peptidase_M13. PredictedSP.txt - File containing example quality scores for SH2 domain and Peptidase_M13. README - This file. COPYING - File contains a copy of the GNU General Public License. ------------------------------------------------------------- Requirements: To run predicted.pl you should have Perl, BioPerl and C. ------------------------------------------------------------ Usage: Unpack the predictedSP.tar.gz file. ------------------------------------------------------------- PredictedSP.pl: Reads fasta files from the directory (argument 0) and transforms them to "onlyaln" format. Filename extension of the fasta files (e.g. fasta or aln) must be given as an argument (argument 1). Reads "onlyaln" formatted files from the directory and calculates numbers of conserved amino acids in the alignment (consAA values) in 15 different FDR values using Blosum62 scoring matrix and parameter values alpha=0.4 and epsilon=0.7 (for changing the parameter values see chapter of ConsAA below). The number of samples from the importance sampling (IS) distribution must be given as an argument, 1000 usually gives a good estimate for calculation of the quality score. Note that the algorithm for calculation of ConsAA is very slow. Reads *.cons* files from the directory and uses one row from each file according to the given FDR value (argument 3). Calculates the predicted SP score and saves it to the output file (argument 4). The SP score is calculated using one of the parameter sets: mafft, muscle or probcons. It is prefered that the number of sequence in the alignment is between 5 and 100 and alignment length between 51 and 1233, 1212 or 1664 depending on the parameter set (mafft, muscle, probcons). Arguments: 0=Directory including test alignments in fasta format. Sequences must be in upper case and gaps must be marked as dashes. 1=filename extension of the alignment files 2=Number of IS samples 3=FDR value (0.01-0.15) 4=Output filename 5=Parameter set: "mafft", "muscle" or "probcons" Columns of the output file (argument 4) are: number of alignment file, name of alignment file, predicted SP score, consAA, identity of the alignment, number of sequences in the alignment, alignment length. Other output files include: *.onlyaln - alignment where each sequence is in one line and name of the sequences are omitted. *.p - conservation scores (-log(p) values) for each alignment positions. Columns of the *.p* files are: alignment position, p value, -log(p) value. *.cons - ConsAA values for each alignment in 15 FDR values (FDR=0.01-0.15). Columns of the *.cons* files are: FDR, number of sequences, alignment length, number of conserved alignment positions, consAA. ------------------------------------------------------------- Example: Example for calculating the predicted SP score for SH2 and Peptidase_M13 alignments: perl PredictedSP.pl test/ fasta 1000 0.05 PredictedSP.txt muscle --------------------------------------------------------------- consAA: C binary file for calulating the ConsAA values for multiple sequence alignment and conservation scores (p-values and log p-values) for each alignment position. If you would like to change the parameter values used in the predictedSP.pl file edit the last line of the code: system("./consAA $onlyalnfile $noSeq 1 $sample B 62 0.4 0.7 $pfile $consAAfile"); by changing the arguments. Arguments: 0=Alignment filename in "onlyaln" format (see examples of input files below). 1=Number of sequences in the alignment. 2=Beginning site of the sequence in a row (see examples of input files below). 3=Number of samples from the importance sampling distribution. 1000 gives a good estimate. 4=Scoring matrix: G=Gonnet250, B=Blosum, P=PAM250, I=Identity. 5=Number of the scoring matrix: 45 or 62 for Blosum, any integer for Gonnet, PAM and Identity scoring matrices. 6=Alpha parameter value. 0.4 usually works. 7=Epsilon parameter value. 0.7 usually works for alignments less than 100 sequences. 0.8 works for alignments with more than 100 sequences. 8=Conservation score (log(p) value) output file name. 9=ConsAA output file name. Input files: Input file format "onlyaln" looks like this: Example 1: ZA70_MOUSE/10-87 FFYGSISR-----AEAEEHLKLA----GMADGLFLLRQCLR-SLGG KSYK_HUMAN/15-92 FFFGNITR-----EEAEDYLVQG----GMSDGLYLLRQSRN-YLGG ZA70_HUMAN/163-239 WYHSSLTR-----EEAERKLYSG----AQTDGKFLLRPRKE--QGT Example 2: FFYGSISR-----AEAEEHLKLA----GMADGLFLLRQCLR-SLGG FFFGNITR-----EEAEDYLVQG----GMSDGLYLLRQSRN-YLGG WYHSSLTR-----EEAERKLYSG----AQTDGKFLLRPRKE--QGT Sequences must be in upper case and gaps are marked as dasches. Argument 2 tells the beginning site of the sequences (25 in the first and 1 in the second example). Output files: -log(p) output file (argument 8): conservation scores (-log(p) values) for each alignment position. Columns of the file are: alignment position, p value, -log(p) value. ConsAA output file (argument 9): ConsAA values for each alignment in 15 FDR values (FDR=0.01-0.15). Columns of the file are: FDR, number of sequences, alignment length, number of conserved alignment positions, consAA. ---------------------------------------------------------------