CLUSTALW(1L) UNIX System V (August 200) CLUSTALW(1L)
NAME
clustal, clustalw, clustalx, clustalw_mp, clustalx_mp - a
general purpose multiple alignment program for DNA or
proteins
SYNOPSIS
clustalw [-options] [-help]
DESCRIPTION
Clustal W is a general purpose multiple alignment program
for DNA or proteins. It is designed to be run interactively.
It is also possible to assign options on the command line.
The following implementations of Clustal W are available:
clustalw
provides a command-line or menu driven interface to the
Clustal W multiple alignment program.
clustalX
provides a new window-based user interface to the
Clustal W multiple alignment program.
clustalw_mp clustalx_mp
are optimized and parallelized versions of the programs
which provide a significant speed improvement when run
on SGI R10000 and R12000 systems. The environment
variable OMP_NUM_THREADS should be used to designate
the number of processors to run on; the default value
is 2. More info about this implementation can be found
at:
http://www.sgi.com/solutions/sciences/chembio/resources/clustalw/
HT_Clustal
is a wrapper program that starts multiple clustalw jobs
across an Origin server (see the separate ht_clustal man
page).
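As a minimal sketch of the OMP_NUM_THREADS setting described
above, the Python snippet below prepares an argument list and
environment for clustalw_mp without launching it. The helper
name mp_launch_spec and the file name are illustrative
assumptions; only OMP_NUM_THREADS, clustalw_mp, -INFILE and
-align are taken from this page.

```python
import os

def mp_launch_spec(infile, threads, base_env=None):
    """Return (argv, env) for a clustalw_mp run on `threads` processors.

    Hypothetical helper: it only prepares the launch, it does not run it.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Copy the environment so the caller's process settings are untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    argv = ["clustalw_mp", "-INFILE=" + infile, "-align"]
    return argv, env

argv, env = mp_launch_spec("seqs.fasta", 4, base_env={})
print(argv, env["OMP_NUM_THREADS"])
```

On a system where clustalw_mp is installed, the pair could be
passed to subprocess.run(argv, env=env).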
USING CLUSTALW
Type 'clustalw' at the helix prompt. For a list of options,
type 'clustalw -options'. For command-line help, type
'clustalw -help'.
To do a multiple alignment on a set of sequences, use item 1
from the menu to input them, then item 2 to do the multiple
alignment.
Profile alignments (menu item 3) are used to align 2
alignments. Use this to add a new sequence to an old
alignment, or to use secondary structure to guide the
alignment process. Gaps in the old alignments are indicated
using the "-" character. Profiles can be input in any of the
allowed formats, just use "-" (or "." for MSF) for each gap
position.
Phylogenetic trees (menu item 4) can be calculated from old
alignments (read in with "-" characters to indicate gaps) or
after a multiple alignment while the alignment is still in
memory.
FILE FORMAT
All sequences must be in one file, one after another. Seven
formats are automatically recognized: NBRF/PIR,
EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF
(Pileup), GCG9/RSF and GDE Flat file. All non-alphabetic
characters (spaces, digits, punctuation marks) are ignored
except "-" which is used to indicate a GAP ("." in GCG/MSF).
The program tries to automatically recognise the different
file formats used and to guess whether the sequences are
amino acid or nucleotide. This is not always foolproof. If
85% or more of the characters in the sequence are from
A,C,G,T,U or N, the sequence will be assumed to be
nucleotide. This works in 97.3% of cases but watch out!
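The 85% rule above can be sketched in a few lines of Python.
This is an illustration only, not the program's own code: the
function name guess_sequence_type is hypothetical, and
counting the fraction over alphabetic characters only is an
assumption based on the statement that non-alphabetic
characters are ignored.

```python
NUCLEOTIDE_CODES = set("ACGTUN")

def guess_sequence_type(sequence, threshold=0.85):
    """Guess "DNA" or "PROTEIN" from the fraction of A, C, G, T, U, N letters."""
    # Assumption: digits, spaces and gap symbols are skipped before counting.
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no alphabetic characters")
    fraction = sum(ch in NUCLEOTIDE_CODES for ch in letters) / len(letters)
    return "DNA" if fraction >= threshold else "PROTEIN"

print(guess_sequence_type("ACGT-ACGTNN"))
print(guess_sequence_type("MKVLAAGIVGLLLAQ"))
```

A short peptide that happens to be rich in A, C, G and T would
be misclassified by such a rule; the -type option listed under
OPTIONS for Multiple alignments states the type explicitly.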
Five output formats are offered. You can choose more than
one (or all 5 if you wish).
CLUSTAL format output
is a self explanatory alignment format. It shows the
sequences aligned in blocks. It can be read in again at
a later date to (for example) calculate a phylogenetic
tree or add a new sequence with a profile alignment.
GCG output
can be used by any of the GCG programs that can work on
multiple alignments (e.g. PRETTY, PROFILEMAKE,
PLOTALIGN). It is the same as the GCG .msf format files
(multiple sequence file); new in version 7 of GCG.
PHYLIP format output
can be used for input to the PHYLIP package of Joe
Felsenstein. This is an extremely widely used package
for doing every imaginable form of phylogenetic
analysis (MUCH more than the modest introduction
offered by this program).
NBRF/PIR:
this is the same as the standard PIR format with ONE
ADDITION. Gap characters "-" are used to indicate the
positions of gaps in the multiple alignment. These
files can be re-used as input in any part of clustal
that allows sequences (or alignments or profiles) to be
read in.
GDE: this format is used by the GDE package of Steven Smith.
OPTIONS
The clustalw program has the following options:
-help
prints out command line options
-check
same as -help
-options
lists the available command-line parameters
-INFILE=xxx.yyy
input sequence file
-PROFILE1=xxx.yyy
file containing the first profile (old alignment)
-PROFILE2=xxx.yyy
file containing the second profile (old alignment)
-align
does a full multiple alignment
-tree
calculates an NJ tree
-bootstrap=n
bootstraps an NJ tree (n=number of bootstraps,
default=1000)
-convert
outputs the input sequences in a different file format
-interactive
reads command line, then enters normal interactive mode
-quicktree
uses FAST algorithm for the alignment guide tree
-negative
protein alignment with negative values in matrix
-outfile=xxx.yyy
output sequence alignment file name
-output=xxx
output has sequences in xxx format. Choices are GCG,
GDE, PHYLIP or PIR.
-outorder=xxx
Order of sequences in output. This is used to control
the order of the sequences in the output alignments. By
default, it is the same as the input order. This
switch can be used to make the order correspond to the
order in which the sequences were aligned (from the
guide tree/dendrogram), thus automatically grouping
closely related sequences. Choices are INPUT (same
order as input) or ALIGNED (in order of alignment)
-case
Case of output sequences; LOWER or UPPER. For GDE
output only.
-seqnos=xxx
Sequence numbers in output; OFF or ON. For Clustal
output only.
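To show how the options above combine into one
non-interactive run, here is a minimal Python sketch that
assembles an argument list. The function name alignment_argv
and the file names are hypothetical; the flags and their
allowed values are the ones listed above, spelled as they
appear on this page.

```python
def alignment_argv(infile, outfile=None, output=None, outorder=None):
    """Build an argument list for a full multiple alignment."""
    argv = ["clustalw", "-INFILE=" + infile, "-align"]
    if outfile is not None:
        argv.append("-outfile=" + outfile)
    if output is not None:
        # Formats named under -output above.
        if output not in ("GCG", "GDE", "PHYLIP", "PIR"):
            raise ValueError("unsupported output format: " + output)
        argv.append("-output=" + output)
    if outorder is not None:
        if outorder not in ("INPUT", "ALIGNED"):
            raise ValueError("outorder must be INPUT or ALIGNED")
        argv.append("-outorder=" + outorder)
    return argv

print(" ".join(alignment_argv("seqs.pir", outfile="seqs.phy",
                              output="PHYLIP", outorder="ALIGNED")))
```

Building the list first and validating the values keeps typing
mistakes out of long batch runs; the list can then be handed
to subprocess.run where clustalw is installed.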
OPTIONS for Fast pairwise alignments
These similarity scores are calculated from fast,
approximate, global alignments, which are controlled by 4
parameters. Two techniques are used to make these alignments
very fast: only exactly matching fragments (k-tuples) are
used, and only the best diagonals (the ones with most k-
tuple matches) are used.
-KTUPLE=#
word size. This is the size of exactly matching
fragment that is used. INCREASE for speed (max= 2 for
proteins; 4 for DNA), DECREASE for sensitivity. For
longer sequences (e.g. >1000 residues) you may need to
increase the default.
-topdiags=#
number of best diagonals. The number of k-tuple matches
on each diagonal (in an imaginary dot-matrix plot) is
calculated. Only the best ones (with most matches) are
used in the alignment. This parameter specifies how
many. Decrease for speed; increase for sensitivity.
-window=#
window around best diagonals. This is the number of
diagonals around each of the 'best' diagonals that will
be used. Decrease for speed; increase for sensitivity.
-pairgap=#
gap penalty. This is a penalty for each gap in the
fast alignments. It has little effect on the speed or
sensitivity except for extreme values.
-score
PERCENT or ABSOLUTE
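The idea behind the word size and best-diagonals parameters
can be illustrated with a short Python sketch that counts
exact k-tuple matches on each diagonal of an imaginary
dot-matrix plot and keeps the diagonals with the most matches.
It is not the program's own implementation: the function
name, the default values and the use of the offset i - j to
label a diagonal are assumptions made for illustration, and
the window and gap penalty steps are left out.

```python
from collections import Counter, defaultdict

def best_diagonals(seq_a, seq_b, ktuple=1, topdiags=5):
    """Return the `topdiags` diagonals with the most exact k-tuple matches.

    Each result is (offset, matches), where offset = i - j for a k-tuple
    starting at position i in seq_a and position j in seq_b.
    """
    # Index every k-tuple of seq_b by its start positions.
    positions = defaultdict(list)
    for j in range(len(seq_b) - ktuple + 1):
        positions[seq_b[j:j + ktuple]].append(j)
    # Count matches per diagonal while scanning seq_a.
    counts = Counter()
    for i in range(len(seq_a) - ktuple + 1):
        for j in positions.get(seq_a[i:i + ktuple], ()):
            counts[i - j] += 1
    return counts.most_common(topdiags)

print(best_diagonals("XXACGT", "ACGT", ktuple=2, topdiags=1))
```

A larger ktuple produces fewer candidate matches to count,
which is why increasing it trades sensitivity for speed.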
OPTIONS for Slow pairwise alignments
These parameters do not have any effect on the speed of the
alignments. They are used to give initial alignments which
are then rescored to give percent identity scores. These
percent scores are the ones which are displayed on the
screen. The scores are converted
to distances for the trees.
-pwmatrix=xxx
Protein weight matrix, i.e. the scoring table which
describes the similarity of each amino acid to the
other. For DNA, an identity matrix is used. Choices are
BLOSUM, PAM, GONNET, ID or filename.
-pwdnamatrix=xxx
DNA weight matrix. Choices are IUB, CLUSTALW, or
filename.
-pwgapopen=#
Gap opening penalty
-pwgapext=#
Gap extension penalty, i.e. the penalty for extending a
gap by one residue.
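The rescoring step described at the start of this section
(percent identity scores that are then converted to
distances) can be sketched as follows. Both details are
assumptions made for illustration, not statements about the
program's code: identity is counted over columns where
neither sequence has a gap, and the distance is taken as
1 - score/100. The function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are not scored
        compared += 1
        identical += a == b
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared

def score_to_distance(percent_score):
    """Map a percent identity score to a distance between 0 and 1."""
    return 1.0 - percent_score / 100.0

score = percent_identity("ACGT", "ACGA")
print(score, score_to_distance(score))
```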
OPTIONS for Multiple alignments
These parameters control the final multiple alignment. This
is the core of the program and the details are complicated.
Each step in the final multiple alignment consists of
aligning two alignments or sequences. This is done
progressively, following the branching order in the GUIDE
TREE. The basic parameters to control this are two gap
penalties and the scores for various identical/non-identical
residues.
-newtree=xxx.yyy
file for new guide tree
-usetree=xxx.yyy
file for old guide tree
-matrix=xxxx
Protein weight matrix. Choices are BLOSUM, PAM, GONNET,
ID or filename. The default is the BLOSUM series of
matrices by Jorja and Steven Henikoff. Note, a series is
used! The actual matrix that is used depends on how
similar the sequences to be aligned at this alignment
step are. Different matrices work differently at each
evolutionary distance. Further help is offered in the
weight matrix menu.
-dnamatrix=xxxx
DNA weight matrix. Choices are IUB, CLUSTALW, or
filename.
-gapopen=#
Gap opening penalty. Increasing the gap opening penalty
will make gaps less frequent.
-gapext=#
Gap extension penalty. Increasing the gap extension
penalty will make gaps shorter. Terminal gaps are not
penalised.
-endgaps
No end gap separation penalty. This treats end gaps
just like internal gaps for the purposes of avoiding
gaps that are too close (set by 'gapdist'). If you turn
this off, end gaps will be ignored for this purpose.
This is useful when you wish to align fragments where
the end gaps are not biologically meaningful.
-gapdist=#
Gap separation penalty range. This tries to decrease
the chances of gaps being too close to each other. Gaps
that are less than this distance apart are penalised
more than other gaps. This does not prevent close gaps;
it makes them less frequent, promoting a block-like
appearance of the alignment.
-nopgap
Residue-specific gaps off. These are amino acid
specific gap penalties that reduce or increase the gap
opening penalties at each position in the alignment or
sequence. See the documentation for details. As an
example, positions that are rich in glycine are more
likely to have an adjacent gap than positions that are
rich in valine.
-nohgap
Hydrophilic gaps off. Hydrophilic gap penalties
(protein sequences) are used to increase the chances of
a gap within a run (5 or more residues) of hydrophilic
amino acids; these are likely to be loop or random coil
regions where gaps are more common.
-hgapresidues
List hydrophilic residues.
-maxdiv=#
Identity for delay.
-type=xxx
Specify sequence type: PROTEIN or DNA.
-transweight=#
Transitions weighting.
OPTIONS for Profile alignments
By PROFILE ALIGNMENT, we mean alignment using existing
alignments. Profile alignments allow you to store alignments
of your favourite sequences and add new sequences to them in
small bunches at a time. A profile is simply an alignment of
one or more sequences (e.g. an alignment output file from
CLUSTAL W). Each input can be a single sequence. One or both
sets of input sequences may include secondary structure
assignments or gap penalty masks to guide the alignment.
The profiles can be in any of the allowed input formats with
"-" characters used to specify gaps (except for GCG/MSF
where "." is used).
You have to specify the 2 profiles by choosing menu items 1
and 2 and giving 2 file names. Then Menu item 3 will align
the 2 profiles to each other. Secondary structure masks in
either profile can be used to guide the alignment.
Menu item 4 will take the sequences in the second profile
and align them to the first profile, 1 at a time. This is
useful to add some new sequences to an existing alignment,
or to align a set of sequences to a known structure. In this
case, the second profile need not be pre-aligned.
The alignment parameters can be set using menu items 5, 6
and 7. These are EXACTLY the same parameters as used by the
general, automatic multiple alignment procedure. The general
multiple alignment procedure is simply a series of profile
alignments. Carrying out a series of profile alignments on
larger and larger groups of sequences, allows you to
manually build up a complete alignment, if necessary editing
intermediate alignments.
-profile
Merge two alignments by profile alignment
-newtree1=xxx.yyy
file for new guide tree for profile1
-newtree2=xxx.yyy
file for new guide tree for profile2
-usetree1=xxx.yyy
file for old guide tree for profile1
-usetree2=xxx.yyy
file for old guide tree for profile2
OPTIONS for Structure Alignments
If a solved structure is available, it can be used to guide
the alignment by raising gap penalties within secondary
structure elements, so that gaps will preferentially be
inserted into unstructured surface loop regions.
Alternatively, a user-specified gap penalty mask can be
supplied for a similar purpose.
A gap penalty mask is a series of numbers between 1 and 9,
one per position in the alignment. Each number specifies how
much the gap opening penalty is to be raised by at that
position (raised by multiplying the basic gap opening
penalty by the number) i.e. a mask figure of 1 at a
position means no change in gap opening penalty; a figure
of 4 means that the gap opening penalty is four times
greater at that position, making gaps 4 times harder to
open. The format for gap penalty masks and secondary
structure masks is explained in the help under option 0
(secondary structure options).
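The multiplication described above can be written out as a
short Python sketch. Representing the mask as a string of
digits and the name masked_gap_open_penalties are assumptions
made for illustration; the actual file format for masks is
the one explained in the program's help.

```python
VALID_MASK_VALUES = set("123456789")

def masked_gap_open_penalties(mask, base_gap_open):
    """Return the gap opening penalty at each position of a masked alignment."""
    penalties = []
    for position, digit in enumerate(mask, start=1):
        if digit not in VALID_MASK_VALUES:
            raise ValueError(
                f"mask value {digit!r} at position {position} is not 1-9")
        # A mask value of 1 leaves the penalty unchanged; 4 makes it 4x larger.
        penalties.append(base_gap_open * int(digit))
    return penalties

print(masked_gap_open_penalties("1141", 10.0))
```

With a basic gap opening penalty of 10, only the third
position in this example becomes harder to open a gap in.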
-nosecstr1
do not use secondary structure/gap penalty mask for
profile 1
-nosecstr2
do not use secondary structure/gap penalty mask for
profile 2
-secstrout
STRUCTURE or MASK or BOTH or NONE output in alignment
file
-helixgap=#
gap penalty for helix core residues
-strandgap=#
gap penalty for strand core residues
-loopgap=#
gap penalty for loop regions
-terminalgap=#
gap penalty for structure termini
-helixendin=#
number of residues inside helix to be treated as
terminal
-helixendout=#
number of residues outside helix to be treated as
terminal
-strandendin=#
number of residues inside strand to be treated as
terminal
-strandendout=#
number of residues outside strand to be treated as
terminal
OPTIONS for Trees
-outputtree=xxx
Choices are NJ or PHYLIP or DIST
-seed=#
Seed number for bootstraps
-kimura
Use Kimura's correction
-tossgaps
Ignore positions with gaps
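The two tree options above can be illustrated with a small
Python sketch. The first function drops every alignment
column that contains a gap, which is one reading of "ignore
positions with gaps". The second applies a commonly cited
form of Kimura's empirical correction for protein distances,
D = -ln(1 - p - p*p/5), where p is the observed fraction of
differing positions. Whether clustalw uses exactly this
formula, and how DNA is treated, are assumptions not stated
on this page; the function names are hypothetical.

```python
import math

def toss_gaps(alignment, gap="-"):
    """Drop every column in which at least one sequence has a gap."""
    kept = [col for col in zip(*alignment) if gap not in col]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into sequences.
    return ["".join(residues) for residues in zip(*kept)]

def kimura_protein_distance(p):
    """Corrected distance for an observed difference fraction p (assumed formula)."""
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("observed difference too large for this correction")
    return -math.log(inner)

print(toss_gaps(["AC-GT", "ACTG-"]))
print(kimura_protein_distance(0.1))
```

The correction grows faster than the observed difference,
reflecting repeated substitutions at the same site that a
simple count of differences cannot see.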
More documentation
ClustalW has online help; choose the help option in any of
the menus. Additional documentation can be found at GCG-
Lite's web interface to ClustalW:
http://molbio.info.nih.gov/molbio/gcglite/
SEE ALSO
njplot, GCG's Pileup, Readseq for format conversion.