CLUSTALW(1L)

CLUSTALW(1L)       UNIX System V (August 200)        CLUSTALW(1L)

NAME
     clustal, clustalw, clustalx, clustalw_mp, clustalx_mp - a
     general purpose multiple alignment program for DNA or
     proteins

SYNOPSIS
     clustalw [-options] [-help]

DESCRIPTION
     Clustal W is a general purpose multiple alignment program
     for DNA or proteins. It is designed to be run interactively.
     It is also possible to assign options on the command line.

     The following implementations of Clustal W are available:

     clustalw
          provides a command-line or menu driven interface to the
          Clustal W multiple alignment program.

     clustalx
          provides a new window-based user interface to the
          Clustal W multiple alignment program.

     clustalw_mp clustalx_mp
          are optimized and parallelized versions of the programs
          which provide a significant speed improvement when run
          on SGI R10000 and R12000 systems.  The environment
          variable OMP_NUM_THREADS should be used to designate
          the number of processors to run on; the default value
          is 2.  More info about this implementation can be found
          at:
          http://www.sgi.com/solutions/sciences/chembio/resources/clustalw/

     HT_Clustal
          is a wrapper program that starts multiple clustalw jobs
          across an Origin server (see the separate ht_clustal
          man page).
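
     As a minimal sketch (not part of the distribution), a Python
     wrapper might set OMP_NUM_THREADS before launching
     clustalw_mp. The helper names, the input file name and the
     thread count are illustrative assumptions; only
     OMP_NUM_THREADS and the -INFILE and -align options come from
     this page.

```python
import os
import subprocess


def threaded_env(num_threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


def run_clustalw_mp(infile, num_threads=2):
    """Run clustalw_mp on infile; assumes clustalw_mp is on the PATH."""
    cmd = ["clustalw_mp", "-INFILE=" + infile, "-align"]
    return subprocess.run(cmd, env=threaded_env(num_threads), check=True)
```

     Copying the environment rather than modifying os.environ
     keeps the thread setting local to the child process.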

USING CLUSTALW
     Type 'clustalw' at the helix prompt. For a list of options,
     type 'clustalw -options'. For command-line help, type
     'clustalw -help'.

     To do a multiple alignment on a set of sequences, use item 1
     from the menu to input them, then item 2 to do the multiple
     alignment.

     Profile alignments (menu item 3) are used to align 2
     alignments. Use this to add a new sequence to an old
     alignment, or to use secondary structure to guide the
     alignment process. Gaps in the old alignments are indicated
     using the "-" character. Profiles can be input in any of the
     allowed formats; just use "-" (or "." for MSF) for each gap
     position.

     Phylogenetic trees (menu item 4) can be calculated from old
     alignments (read in with "-" characters to indicate gaps) or
     after a multiple alignment while the alignment is still in
     memory.

FILE FORMAT
     All sequences must be in one file, one after another. Seven
     formats are automatically recognized: NBRF/PIR,
     EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF
     (Pileup), GCG9/RSF and GDE Flat file. All non-alphabetic
     characters (spaces, digits, punctuation marks) are ignored
     except "-" which is used to indicate a GAP ("." in GCG/MSF).

     The program tries to automatically recognise the different
     file formats used and to guess whether the sequences are
     amino acid or nucleotide. This is not always foolproof. If
     85% or more of the characters in the sequence are from
     A,C,G,T,U or N, the sequence will be assumed to be
     nucleotide. This works in 97.3% of cases but watch out!
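
     The 85% rule described above can be sketched in Python as
     follows. This is an illustration of the stated heuristic,
     not the program's own code; the function name and the
     returned labels are assumptions chosen for the example.

```python
NUCLEOTIDE_CODES = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.85):
    """Guess 'DNA' or 'PROTEIN' from the fraction of A, C, G, T, U, N.

    Non-alphabetic characters (gaps, digits, spaces) are ignored,
    as described for input files above.
    """
    letters = [c.upper() for c in sequence if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no alphabetic characters")
    hits = sum(1 for c in letters if c in NUCLEOTIDE_CODES)
    return "DNA" if hits / len(letters) >= threshold else "PROTEIN"
```

     A short peptide made up mostly of A, C, G and T residues
     would be misclassified by this rule, which is why the
     sequence type can also be set explicitly with -type.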

     Five output formats are offered. You can choose more than
     one (or all 5 if you wish).

     CLUSTAL format output
          is a self explanatory alignment format. It shows the
          sequences aligned in blocks. It can be read in again at
          a later date to (for example) calculate a phylogenetic
          tree or add a new sequence with a profile alignment.

     GCG output
          can be used by any of the GCG programs that can work on
          multiple alignments (e.g. PRETTY, PROFILEMAKE,
          PLOTALIGN). It is the same as the GCG .msf format files
          (multiple sequence file); new in version 7 of GCG.

     PHYLIP format output
          can be used for input to the PHYLIP package of Joe
          Felsenstein. This is an extremely widely used package
          for doing every imaginable form of phylogenetic
          analysis (MUCH more than the modest introduction
          offered by this program).

     NBRF/PIR format output
          is the same as the standard PIR format with ONE
          ADDITION. Gap characters "-" are used to indicate the
          positions of gaps in the multiple alignment. These
          files can be re-used as input in any part of clustal
          that allows sequences (or alignments or profiles) to be
          read in.

     GDE format output
          is used by the GDE package of Steven Smith.

OPTIONS
     The clustalw program has the following options:

     -help
          prints out command line options

     -check
          same as -help

     -options
          lists the available command-line parameters

     -INFILE=xxx.yyy
          input sequence file

     -PROFILE1=xxx.yyy
          file containing the first profile (from an old alignment)

     -PROFILE2=xxx.yyy
          file containing the second profile (from an old alignment)

     -align
          does a full multiple alignment

     -tree
          calculates an NJ tree

     -bootstrap=n
          bootstraps an NJ tree (n=number of bootstraps,
          default=1000)

     -convert
          outputs the input sequences in a different file format

     -interactive
          reads command line, then enters normal interactive mode

     -quicktree
          uses FAST algorithm for the alignment guide tree

     -negative
          protein alignment with negative values in matrix

     -outfile=xxx.yyy
          output sequence alignment file name

     -output=xxx
          output has sequences in xxx format. Choices are GCG,
          GDE, PHYLIP or PIR.

     -outorder=xxx
          Order of sequences in output. This is used to control
          the order of the sequences in the output alignments. By
          default, it is the same as the input order.  This
          switch can be used to make the order correspond to the
          order in which the sequences were aligned (from the
          guide tree/dendrogram), thus automatically grouping
          closely related sequences. Choices are INPUT (same
          order as input) or ALIGNED (in order of alignment).

     -case
          Case of output sequences; LOWER or UPPER. For GDE
          output only.

     -seqnos=xxx
          Sequence numbers in output; OFF or ON. For Clustal
          output only.
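
     The options above can be combined on one command line. The
     following Python sketch assembles such a command without
     running it. The helper name and file names are hypothetical;
     the option names and allowed values are those listed above.

```python
def build_clustalw_command(infile, outfile=None, output=None, outorder=None):
    """Assemble a clustalw argument list for a full multiple alignment.

    Hypothetical helper; it only formats options listed in OPTIONS above
    and does not run the program.
    """
    cmd = ["clustalw", "-INFILE=" + infile, "-align"]
    if outfile is not None:
        cmd.append("-outfile=" + outfile)
    if output is not None:
        if output.upper() not in {"GCG", "GDE", "PHYLIP", "PIR"}:
            raise ValueError("unsupported output format: " + output)
        cmd.append("-output=" + output.upper())
    if outorder is not None:
        if outorder.upper() not in {"INPUT", "ALIGNED"}:
            raise ValueError("unsupported output order: " + outorder)
        cmd.append("-outorder=" + outorder.upper())
    return cmd
```

     The returned list can be passed to subprocess.run() if
     clustalw is installed and on the PATH.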

OPTIONS for Fast pairwise alignments
     These similarity scores are calculated from fast,
     approximate, global alignments, which are controlled by 4
     parameters. Two techniques are used to make these alignments
     very fast: only exactly matching fragments (k-tuples) are
     used, and only the best diagonals (the ones with most
     k-tuple matches) are used.

     -KTUPLE=#
          word size. This is the size of exactly matching
          fragment that is used.  INCREASE for speed (max= 2 for
          proteins; 4 for DNA), DECREASE for sensitivity. For
          longer sequences (e.g. >1000 residues) you may need to
          increase the default.

     -topdiags=#
          number of best diagonals. The number of k-tuple matches
          on each diagonal (in an imaginary dot-matrix plot) is
          calculated. Only the best ones (with most matches) are
          used in the alignment. This parameter specifies how
          many. Decrease for speed; increase for sensitivity.

     -window=#
          window around best diagonals. This is the number of
          diagonals around each of the 'best' diagonals that will
          be used. Decrease for speed; increase for sensitivity.

     -pairgap=#
          gap penalty.  This is a penalty for each gap in the
          fast alignments. It has little effect on the speed or
          sensitivity except for extreme values.

     -score
          PERCENT or ABSOLUTE
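
     The idea behind -KTUPLE and -topdiags can be illustrated
     with a short Python sketch that counts exact k-tuple matches
     per diagonal and keeps the best ones. This is a simplified
     illustration, not ClustalW's implementation; the function
     name and return format are assumptions, and the -window and
     -pairgap steps are not modelled.

```python
from collections import Counter, defaultdict


def best_diagonals(seq_a, seq_b, ktuple=2, topdiags=3):
    """Return the topdiags diagonals with the most exact k-tuple matches.

    A diagonal is identified by the offset i - j, where i and j are the
    start positions of a matching k-tuple in seq_a and seq_b.
    """
    # Index every k-tuple of seq_b by its start positions.
    positions = defaultdict(list)
    for j in range(len(seq_b) - ktuple + 1):
        positions[seq_b[j:j + ktuple]].append(j)

    # Count matches per diagonal, as in an imaginary dot-matrix plot.
    counts = Counter()
    for i in range(len(seq_a) - ktuple + 1):
        for j in positions.get(seq_a[i:i + ktuple], ()):
            counts[i - j] += 1

    return counts.most_common(topdiags)
```

     A larger ktuple produces fewer chance matches to count,
     which is why increasing it trades sensitivity for speed.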

OPTIONS for Slow pairwise alignments
     These parameters do not have any effect on the speed of the
     alignments. They are used to give initial alignments which
     are then rescored to give percent identity scores. These
     percent scores are the ones which are displayed on the
     screen. The scores are converted to distances for the trees.

     -pwmatrix=xxx
          Protein weight matrix, i.e. the scoring table which
          describes the similarity of each amino acid to the
          other. For DNA, an identity matrix is used. Choices are
          BLOSUM, PAM, GONNET, ID or filename.

     -pwdnamatrix=xxx
          DNA weight matrix. Choices are IUB, CLUSTALW, or
          filename.

     -pwgapopen=#
          Gap opening penalty

     -pwgapext=#
          Gap extension penalty, i.e. the penalty for extending a
          gap by one residue.
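
     The rescoring step described at the start of this section
     can be sketched in Python. Counting only columns without
     gaps, and converting a percent score to a distance as
     1 - score/100, are assumptions made for this illustration;
     the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_to_distance(percent):
    """Convert a percent identity score to a distance in [0, 1]."""
    return 1.0 - percent / 100.0
```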

OPTIONS for Multiple alignments
     These parameters control the final multiple alignment. This
     is the core of the program and the details are complicated.
     Each step in the final multiple alignment consists of
     aligning two alignments or sequences. This is done
     progressively, following the branching order in the GUIDE
     TREE. The basic parameters to control this are two gap
     penalties and the scores for various
     identical/non-identical residues.

     -newtree=xxx.yyy
          file for new guide tree

     -usetree=xxx.yyy
          file for old guide tree

     -matrix=xxxx
          Protein weight matrix. Choices are BLOSUM, PAM, GONNET,
          ID or filename. The default is the BLOSUM series of
          matrices by Jorja and Steven Henikoff. Note, a series
          is used! The actual matrix that is used depends on how
          similar the sequences to be aligned at this alignment
          step are. Different matrices work differently at each
          evolutionary distance. Further help is offered in the
          weight matrix menu.

     -dnamatrix=xxxx
          DNA weight matrix. Choices are IUB, CLUSTALW, or
          filename.

     -gapopen=#
          Gap opening penalty.  Increasing the gap opening
          penalty will make gaps less frequent.

     -gapext=#
          Gap extension penalty. Increasing the gap extension
          penalty will make gaps shorter.  Terminal gaps are not
          penalised.

     -endgaps
          No end gap separation penalty. This treats end gaps
          just like internal gaps for the purposes of avoiding
          gaps that are too close (set by 'gapdist'). If you turn
          this off, end gaps will be ignored for this purpose.
          This is useful when you wish to align fragments where
          the end gaps are not biologically meaningful.

     -gapdist=#
          Gap separation penalty range. This tries to decrease
          the chances of gaps being too close to each other. Gaps
          that are less than this distance apart are penalised
          more than other gaps. This does not prevent close gaps;
          it makes them less frequent, promoting a block-like
          appearance of the alignment.

     -nopgap
          Residue-specific gaps off. These are amino acid
          specific gap penalties that reduce or increase the gap
          opening penalties at each position in the alignment or
          sequence. See the documentation for details. As an
          example, positions that are rich in glycine are more
          likely to have an adjacent gap than positions that are
          rich in valine.

     -nohgap
          Hydrophilic gaps off. Hydrophilic gap penalties
          (protein sequences) are used to increase the chances of
          a gap within a run (5 or more residues) of hydrophilic
          amino acids; these are likely to be loop or random coil
          regions where gaps are more common.

     -hgapresidues
          List hydrophilic residues.

     -maxdiv=#
          Identity for delay.

     -type=xxx
          Specify sequence type: PROTEIN or DNA.

     -transweight=#
          Transitions weighting.
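
     The effect of -gapopen and -gapext can be illustrated with
     a simple affine gap cost, in which a gap of n residues costs
     the opening penalty plus n times the extension penalty. This
     is a simplified sketch under that assumption: the
     position-specific adjustments described above (gap
     separation, residue-specific and hydrophilic penalties) are
     not modelled, and the function names and example values are
     illustrative.

```python
from itertools import groupby


def affine_gap_cost(length, gapopen, gapext):
    """Cost of one gap of `length` residues under an affine model."""
    if length < 0:
        raise ValueError("length must be non-negative")
    if length == 0:
        return 0.0
    return gapopen + gapext * length


def total_gap_cost(aligned, gapopen, gapext, gap="-"):
    """Sum affine costs over every internal run of gaps in one row.

    Leading and trailing gaps are skipped, reflecting the note above
    that terminal gaps are not penalised.
    """
    core = aligned.strip(gap)
    return sum(
        affine_gap_cost(len(list(run)), gapopen, gapext)
        for is_gap, run in groupby(core, key=lambda c: c == gap)
        if is_gap
    )
```

     With a high opening penalty, one long gap is cheaper than
     several short ones; with a high extension penalty, the
     reverse becomes true.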

OPTIONS for Profile alignments
     By PROFILE ALIGNMENT, we mean alignment using existing
     alignments. Profile alignments allow you to store alignments
     of your favourite sequences and add new sequences to them in
     small bunches at a time. A profile is simply an alignment of
     one or more sequences (e.g. an alignment output file from
     CLUSTAL W). Each input can be a single sequence. One or both
     sets of input sequences may include secondary structure
     assignments or gap penalty masks to guide the alignment.

     The profiles can be in any of the allowed input formats with
     "-" characters used to specify gaps (except for GCG/MSF
     where "." is used).

     You have to specify the 2 profiles by choosing menu items 1
     and 2 and giving 2 file names. Then menu item 3 will align
     the 2 profiles to each other. Secondary structure masks in
     either profile can be used to guide the alignment.

     Menu item 4 will take the sequences in the second profile
     and align them to the first profile, 1 at a time. This is
     useful to add some new sequences to an existing alignment,
     or to align a set of sequences to a known structure. In this
     case, the second profile need not be pre-aligned.

     The alignment parameters can be set using menu items 5, 6
     and 7.  These are EXACTLY the same parameters as used by the
     general, automatic multiple alignment procedure. The general
     multiple alignment procedure is simply a series of profile
     alignments.  Carrying out a series of profile alignments on
     larger and larger groups of sequences allows you to
     manually build up a complete alignment, if necessary editing
     intermediate alignments.

     -profile
          Merge two alignments by profile alignment

     -newtree1=xxx.yyy
          file for new guide tree for profile1

     -newtree2=xxx.yyy
          file for new guide tree for profile2

     -usetree1=xxx.yyy
          file for old guide tree for profile1

     -usetree2=xxx.yyy
          file for old guide tree for profile2

OPTIONS for Structure Alignments
     If a solved structure is available, it can be used to guide
     the alignment by raising gap penalties within secondary
     structure elements, so that gaps will preferentially be
     inserted into unstructured surface loop regions.
     Alternatively, a user-specified gap penalty mask can be
     supplied for a similar purpose.

     A gap penalty mask is a series of numbers between 1 and 9,
     one per position in the alignment. Each number specifies
     the factor by which the basic gap opening penalty is
     multiplied at that position. For example, a mask figure of
     1 at a position means no change in gap opening penalty; a
     figure of 4 means that the gap opening penalty is four
     times greater at that position, making gaps 4 times harder
     to open. The format for gap penalty masks and secondary
     structure masks is explained in the help under option 0
     (secondary structure options).
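
     The arithmetic of a gap penalty mask can be sketched in
     Python as follows. Representing the mask as a string of
     digits is an assumption made for this example (the real
     file format is described in the program help), and the
     function name and base penalty are illustrative.

```python
def apply_gap_mask(mask, base_gapopen):
    """Return the gap opening penalty at each alignment position.

    mask is a string of digits 1-9, one per position; each digit
    multiplies the basic gap opening penalty.
    """
    penalties = []
    for position, figure in enumerate(mask):
        if figure not in "123456789":
            raise ValueError(
                "invalid mask figure %r at position %d" % (figure, position))
        penalties.append(base_gapopen * int(figure))
    return penalties
```

     For a base penalty of 10.0, the mask "1141" leaves three
     positions unchanged and raises the third to 40.0.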

     -nosecstr1
          do not use secondary structure/gap penalty mask for
          profile 1

     -nosecstr2
          do not use secondary structure/gap penalty mask for
          profile 2

     -secstrout
          STRUCTURE or MASK or BOTH or NONE output in alignment
          file

     -helixgap=#
          gap penalty for helix core residues

     -strandgap=#
          gap penalty for strand core residues

     -loopgap=#
          gap penalty for loop regions

     -terminalgap=#
          gap penalty for structure termini

     -helixendin=#
          number of residues inside helix to be treated as
          terminal

     -helixendout=#
          number of residues outside helix to be treated as
          terminal

     -strandendin=#
          number of residues inside strand to be treated as
          terminal

     -strandendout=#
          number of residues outside strand to be treated as
          terminal

OPTIONS for Trees
     -outputtree=xxx
          Choices are NJ or PHYLIP or DIST

     -seed=#
          Seed number for bootstraps

     -kimura
          Use Kimura's correction

     -tossgaps
          Ignore positions with gaps
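
     What -tossgaps and -kimura do to a distance calculation
     can be sketched in Python. The sketch assumes protein
     sequences and uses Kimura's empirical formula
     -ln(1 - p - 0.2*p*p), where p is the fraction of differing
     residues; DNA corrections and the handling of very large
     distances are not shown. The function names are
     hypothetical and this is not the program's own code.

```python
import math


def toss_gaps(alignment, gap="-"):
    """Drop every column in which any sequence has a gap."""
    kept = [col for col in zip(*alignment) if gap not in col]
    if not kept:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*kept)]


def p_distance(seq_a, seq_b):
    """Fraction of positions at which two ungapped rows differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("rows must be non-empty and of equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b) / len(seq_a)


def kimura_protein_distance(p):
    """Apply Kimura's empirical correction to an observed distance p."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("distance too large to correct")
    return -math.log(inner)
```

     The correction always increases the observed distance,
     because it estimates multiple substitutions at the same
     site that a simple count cannot see.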

MORE DOCUMENTATION
     ClustalW has online help; choose the help option in any of
     the menus. Additional documentation can be found at
     GCG-Lite's web interface to ClustalW:
     http://molbio.info.nih.gov/molbio/gcglite/

SEE ALSO
     njplot, GCG's Pileup, Readseq for format conversion.
