CLUSTALW(1L)             UNIX System V (August 200)             CLUSTALW(1L)

NAME
     clustal, clustalw, clustalx, clustalw_mp, clustalx_mp - a general
     purpose multiple alignment program for DNA or proteins

SYNOPSIS
     clustalw [-options] [-help]

DESCRIPTION
     Clustal W is a general purpose multiple alignment program for DNA
     or proteins. It is designed to be run interactively, but options
     can also be given on the command line.

     The following implementations of Clustal W are available:

     clustalw
          provides a command-line or menu-driven interface to the
          Clustal W multiple alignment program.

     clustalx
          provides a new window-based user interface to the Clustal W
          multiple alignment program.

     clustalw_mp, clustalx_mp
          are optimized and parallelized versions of the programs which
          provide a significant speed improvement when run on SGI R10000
          and R12000 systems. The environment variable OMP_NUM_THREADS
          should be used to designate the number of processors to run
          on; the default value is 2. More information about this
          implementation can be found at:
          http://www.sgi.com/solutions/sciences/chembio/resources/clustalw/

     HT_Clustal
          is a wrapper program that starts multiple clustalw jobs across
          an Origin server (see the separate ht_clustal man page).

USING CLUSTALW
     Type 'clustalw' at the helix prompt. For a list of options, type
     'clustalw -options'. For command-line help, type 'clustalw -help'.

     To do a multiple alignment on a set of sequences, use item 1 from
     the menu to input them, then item 2 to do the multiple alignment.

     Profile alignments (menu item 3) are used to align 2 alignments.
     Use this to add a new sequence to an old alignment, or to use
     secondary structure to guide the alignment process. Gaps in the old
     alignments are indicated using the "-" character. Profiles can be
     input in any of the allowed formats; just use "-" (or "." for MSF)
     for each gap position.

     Phylogenetic trees (menu item 4) can be calculated from old
     alignments (read in with "-" characters to indicate gaps) or after
     a multiple alignment while the alignment is still in memory.

FILE FORMAT
     All sequences must be in one file, one after another. Seven formats
     are automatically recognized: NBRF/PIR, EMBL/SWISSPROT, Pearson
     (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat
     file. All non-alphabetic characters (spaces, digits, punctuation
     marks) are ignored except "-", which is used to indicate a gap
     ("." in GCG/MSF).

     The program tries to automatically recognise the different file
     formats used and to guess whether the sequences are amino acid or
     nucleotide. This is not always foolproof. If 85% or more of the
     characters in the sequence are from A, C, G, T, U or N, the
     sequence will be assumed to be nucleotide. This works in 97.3% of
     cases but watch out!

     Five output formats are offered. You can choose more than one (or
     all 5 if you wish).

     CLUSTAL format output is a self-explanatory alignment format. It
     shows the sequences aligned in blocks. It can be read in again at a
     later date to (for example) calculate a phylogenetic tree or add a
     new sequence with a profile alignment.

     GCG output can be used by any of the GCG programs that can work on
     multiple alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is
     the same as the GCG .msf format files (multiple sequence file),
     new in version 7 of GCG.

     PHYLIP format output can be used for input to the PHYLIP package of
     Joe Felsenstein. This is an extremely widely used package for doing
     every imaginable form of phylogenetic analysis (MUCH more than the
     modest introduction offered by this program).

     NBRF/PIR: this is the same as the standard PIR format with ONE
     ADDITION. Gap characters "-" are used to indicate the positions of
     gaps in the multiple alignment. These files can be re-used as input
     in any part of clustal that allows sequences (or alignments or
     profiles) to be read in.

     GDE: this format is used by the GDE package of Steven Smith.

OPTIONS
     The clustalw program has the following options:

     -help
          prints out command line options

     -check
          same as -help

     -options
          lists the available command-line parameters

     -INFILE=xxx.yyy
          input sequence file

     -PROFILE=xxx.yyy
          file containing profiles from old alignment

     -PROFILE2=xxx.yyy
          file containing profiles from old alignment

     -align
          does a full multiple alignment

     -tree
          calculates an NJ tree

     -bootstrap=n
          bootstraps an NJ tree (n = number of bootstraps, default =
          1000)

     -convert
          outputs the input sequences in a different file format

     -interactive
          reads command line, then enters normal interactive mode

     -quicktree
          uses FAST algorithm for the alignment guide tree

     -negative
          protein alignment with negative values in matrix

     -outfile=xxx.yyy
          output sequence alignment file name

     -output=xxx
          output has sequences in xxx format. Choices are GCG, GDE,
          PHYLIP or PIR.

     -outorder=xxx
          Order of sequences in output. This is used to control the
          order of the sequences in the output alignments. By default,
          it is the same as the input order. This switch can be used to
          make the order correspond to the order in which the sequences
          were aligned (from the guide tree/dendrogram), thus
          automatically grouping closely related sequences. Choices are
          INPUT (same order as input) or ALIGNED (in order of
          alignment).

     -case=xxx
          Case of output sequences; LOWER or UPPER. For GDE output only.

     -seqnos=xxx
          Sequence numbers in output; OFF or ON. For Clustal output
          only.

OPTIONS for Fast pairwise alignments
     These similarity scores are calculated from fast, approximate,
     global alignments, which are controlled by 4 parameters. Two
     techniques are used to make these alignments very fast: only
     exactly matching fragments (k-tuples) are used, and only the best
     diagonals (the ones with most k-tuple matches) are used.

     -KTUPLE=#
          Word size. This is the size of exactly matching fragment that
          is used.
          INCREASE for speed (max = 2 for proteins; 4 for DNA), DECREASE
          for sensitivity. For longer sequences (e.g. >1000 residues)
          you may need to increase the default.

     -topdiags=#
          Number of best diagonals. The number of k-tuple matches on
          each diagonal (in an imaginary dot-matrix plot) is calculated.
          Only the best ones (with most matches) are used in the
          alignment. This parameter specifies how many. Decrease for
          speed; increase for sensitivity.

     -window=#
          Window around best diagonals. This is the number of diagonals
          around each of the 'best' diagonals that will be used.
          Decrease for speed; increase for sensitivity.

     -pairgap=#
          Gap penalty. This is a penalty for each gap in the fast
          alignments. It has little effect on the speed or sensitivity
          except for extreme values.

     -score=xxx
          PERCENT or ABSOLUTE

OPTIONS for Slow pairwise alignments
     These parameters do not have any effect on the speed of the
     alignments. They are used to give initial alignments which are then
     rescored to give percent identity scores. These percent scores are
     the ones which are displayed on the screen. The scores are
     converted to distances for the trees.

     -pwmatrix=xxx
          Protein weight matrix, i.e. the scoring table which describes
          the similarity of each amino acid to the other. For DNA, an
          identity matrix is used. Choices are BLOSUM, PAM, GONNET, ID
          or filename.

     -pwdnamatrix=xxx
          DNA weight matrix. Choices are IUB, CLUSTALW, or filename.

     -pwgapopen=#
          Gap opening penalty

     -pwgapext=#
          Gap extension penalty, i.e. the penalty for extending a gap by
          one residue.

OPTIONS for Multiple alignments
     These parameters control the final multiple alignment. This is the
     core of the program and the details are complicated. Each step in
     the final multiple alignment consists of aligning two alignments or
     sequences. This is done progressively, following the branching
     order in the GUIDE TREE. The basic parameters to control this are
     two gap penalties and the scores for various identical/
     non-identical residues.
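
     The progressive procedure described above can be illustrated with
     a minimal Python sketch. It is not the Clustal W implementation:
     the nested-tuple tree representation and the function name
     merge_order are hypothetical, and a plain post-order walk is
     assumed, whereas the real program may also delay very divergent
     sequences. It only shows how a guide tree determines which groups
     are aligned to each other at each step.

```python
def merge_order(tree):
    """List the alignment steps implied by a binary guide tree.

    `tree` is a hypothetical representation: a leaf is a sequence name
    (str) and an internal node is a 2-tuple (left, right).
    Returns a list of (left_members, right_members) pairs, one per step.
    """
    steps = []

    def visit(node):
        # A leaf is a single sequence: nothing to align yet.
        if isinstance(node, str):
            return [node]
        left, right = node
        left_members = visit(left)
        right_members = visit(right)
        # Both subtrees are already aligned; now align them to each other.
        steps.append((left_members, right_members))
        return left_members + right_members

    visit(tree)
    return steps


guide_tree = (("seqA", "seqB"), ("seqC", ("seqD", "seqE")))
for left, right in merge_order(guide_tree):
    print(left, "+", right)
```

     Each printed step aligns two groups that are themselves already
     aligned, which is why every step is either a sequence-to-sequence,
     sequence-to-alignment or alignment-to-alignment comparison.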

     -newtree=xxx.yyy
          file for new guide tree

     -usetree=xxx.yyy
          file for old guide tree

     -matrix=xxxx
          Protein weight matrix. Choices are BLOSUM, PAM, GONNET, ID or
          filename. The default is the BLOSUM series of matrices by
          Jorja and Steven Henikoff. Note, a series is used! The actual
          matrix that is used depends on how similar the sequences to be
          aligned at this alignment step are. Different matrices work
          differently at each evolutionary distance. Further help is
          offered in the weight matrix menu.

     -dnamatrix=xxxx
          DNA weight matrix. Choices are IUB, CLUSTALW, or filename.

     -gapopen=#
          Gap opening penalty. Increasing the gap opening penalty will
          make gaps less frequent.

     -gapext=#
          Gap extension penalty. Increasing the gap extension penalty
          will make gaps shorter. Terminal gaps are not penalised.

     -endgaps
          No end gap separation penalty. This treats end gaps just like
          internal gaps for the purposes of avoiding gaps that are too
          close (set by 'gapdist'). If you turn this off, end gaps will
          be ignored for this purpose. This is useful when you wish to
          align fragments where the end gaps are not biologically
          meaningful.

     -gapdist=#
          Gap separation penalty range. This tries to decrease the
          chances of gaps being too close to each other. Gaps that are
          less than this distance apart are penalised more than other
          gaps. This does not prevent close gaps; it makes them less
          frequent, promoting a block-like appearance of the alignment.

     -nopgap
          Residue-specific gaps off. These are amino acid specific gap
          penalties that reduce or increase the gap opening penalties at
          each position in the alignment or sequence. See the
          documentation for details. As an example, positions that are
          rich in glycine are more likely to have an adjacent gap than
          positions that are rich in valine.

     -nohgap
          Hydrophilic gaps off.
          Hydrophilic gap penalties (protein sequences) are used to
          increase the chances of a gap within a run (5 or more
          residues) of hydrophilic amino acids; these are likely to be
          loop or random coil regions where gaps are more common.

     -hgapresidues
          List hydrophilic residues.

     -maxdiv=#
          Identity for delay.

     -type=xxx
          Specify sequence type: PROTEIN or DNA.

     -transweight=#
          Transitions weighting.

OPTIONS for Profile alignments
     By PROFILE ALIGNMENT, we mean alignment using existing alignments.
     Profile alignments allow you to store alignments of your favourite
     sequences and add new sequences to them in small bunches at a time.
     A profile is simply an alignment of one or more sequences (e.g. an
     alignment output file from CLUSTAL W). Each input can be a single
     sequence. One or both sets of input sequences may include secondary
     structure assignments or gap penalty masks to guide the alignment.

     The profiles can be in any of the allowed input formats with "-"
     characters used to specify gaps (except for GCG/MSF where "." is
     used).

     You have to specify the 2 profiles by choosing menu items 1 and 2
     and giving 2 file names. Then menu item 3 will align the 2 profiles
     to each other. Secondary structure masks in either profile can be
     used to guide the alignment.

     Menu item 4 will take the sequences in the second profile and align
     them to the first profile, 1 at a time. This is useful to add some
     new sequences to an existing alignment, or to align a set of
     sequences to a known structure. In this case, the second profile
     need not be pre-aligned.

     The alignment parameters can be set using menu items 5, 6 and 7.
     These are EXACTLY the same parameters as used by the general,
     automatic multiple alignment procedure. The general multiple
     alignment procedure is simply a series of profile alignments.
     Carrying out a series of profile alignments on larger and larger
     groups of sequences allows you to manually build up a complete
     alignment, if necessary editing intermediate alignments.

     -profile
          Merge two alignments by profile alignment

     -newtree1=xxx.yyy
          file for new guide tree for profile1

     -newtree2=xxx.yyy
          file for new guide tree for profile2

     -usetree1=xxx.yyy
          file for old guide tree for profile1

     -usetree2=xxx.yyy
          file for old guide tree for profile2

OPTIONS for Structure Alignments
     If a solved structure is available, it can be used to guide the
     alignment by raising gap penalties within secondary structure
     elements, so that gaps will preferentially be inserted into
     unstructured surface loop regions. Alternatively, a user-specified
     gap penalty mask can be supplied for a similar purpose.

     A gap penalty mask is a series of numbers between 1 and 9, one per
     position in the alignment. Each number specifies the factor by
     which the basic gap opening penalty is multiplied at that position,
     i.e. a mask figure of 1 at a position means no change in gap
     opening penalty; a figure of 4 means that the gap opening penalty
     is four times greater at that position, making gaps 4 times harder
     to open.

     The format for gap penalty masks and secondary structure masks is
     explained in the help under option 0 (secondary structure options).
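
     The arithmetic of a gap penalty mask can be shown with a short
     Python sketch. It assumes, purely for illustration, that the mask
     is held as a string of digits with one character per alignment
     column; the actual mask file format is described in the online
     help, and the function name apply_gap_mask is hypothetical.

```python
def apply_gap_mask(base_gap_open, mask):
    """Return the gap opening penalty at each alignment position.

    `mask` is assumed to be a string of digits 1-9, one per position.
    Each digit multiplies the basic gap opening penalty at that position.
    """
    penalties = []
    for pos, ch in enumerate(mask):
        if ch not in "123456789":
            raise ValueError(
                "mask position %d must be a digit 1-9, got %r" % (pos, ch)
            )
        penalties.append(base_gap_open * int(ch))
    return penalties


# A mask figure of 1 leaves the penalty unchanged; 4 makes it four
# times greater at that position.
print(apply_gap_mask(10.0, "1141"))  # [10.0, 10.0, 40.0, 10.0]
```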

     -nosecstr1
          do not use secondary structure/gap penalty mask for profile 1

     -nosecstr2
          do not use secondary structure/gap penalty mask for profile 2

     -secstrout=xxx
          STRUCTURE or MASK or BOTH or NONE output in alignment file

     -helixgap=#
          gap penalty for helix core residues

     -strandgap=#
          gap penalty for strand core residues

     -loopgap=#
          gap penalty for loop regions

     -terminalgap=#
          gap penalty for structure termini

     -helixendin=#
          number of residues inside helix to be treated as terminal

     -helixendout=#
          number of residues outside helix to be treated as terminal

     -strandendin=#
          number of residues inside strand to be treated as terminal

     -strandendout=#
          number of residues outside strand to be treated as terminal

OPTIONS for Trees
     -outputtree=xxx
          Choices are NJ or PHYLIP or DIST

     -seed=#
          Seed number for bootstraps

     -kimura
          Use Kimura's correction

     -tossgaps
          Ignore positions with gaps

More documentation
     ClustalW has online help; choose the help option in any of the
     menus. Additional documentation can be found at GCG-Lite's web
     interface to ClustalW: http://molbio.info.nih.gov/molbio/gcglite/

SEE ALSO
     njplot, GCG's Pileup, Readseq for format conversion.