RATT Documentation

1 Overview
2 Installation at Sanger
3 Installation
4 How to call the program
5 Example Files
6 Output files
7 Post visualization
8 For the biological interpretation
9 Functionality of main.ratt.pl

Overview

RATT is software to transfer annotation from a reference (annotated) genome to an unannotated query genome.

It was first developed to transfer annotations between different genome assembly versions. However, it can also transfer annotations between strains and even between different species, such as Plasmodium chabaudi onto P. berghei or Salmonella enterica onto Salmonella virchow. RATT is able to transfer any entry present on a reference sequence, such as the systematic id or an annotator's notes; such information would be lost in a de novo annotation. Furthermore, RATT checks whether gene models have changed between the two sequences and can correct changed start and stop codons, or frameshifts.

Please visit the http://ratt.sourceforge.net page for examples.

Installation at Sanger

At Sanger, the program is currently installed in ~tdo/Bin/ratt. Just be sure to set the variable RATT_HOME:

   RATT_HOME=/nfs/users/nfs_t/tdo/Bin/ratt; export RATT_HOME  (for bash)

RATT needs the MUMmer package (http://mummer.sourceforge.net/) to generate the sequence comparison. The following MUMmer programs must therefore be in your PATH: nucmer, delta-filter, show-snps and show-coords. RATT will not run without them. (These programs should be in the standard path at Sanger.)
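The PATH requirement above can be checked before the first run. The following is a minimal sketch; check_mummer_tools is a hypothetical helper name and not part of RATT or MUMmer.

```shell
#!/usr/bin/env bash
# check_mummer_tools is a hypothetical helper, not part of RATT.
# It reports which of the four MUMmer programs are missing from PATH
# and returns the number of missing programs (0 means all were found).
check_mummer_tools() {
    local missing=0 tool
    for tool in nucmer delta-filter show-snps show-coords; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

if check_mummer_tools; then
    echo "All MUMmer programs found in PATH"
else
    echo "Add the MUMmer directory to PATH before running RATT" >&2
fi
```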

Installation

RATT was tested on Linux/Unix. It should also run on OS X 10, provided the third-party tools are installed. All installation instructions are written for Linux/Unix only.

1. Install the MUMmer package. Make sure the programs are in your path, e.g. PATH=$PATH:/path/to/Mummer/; export PATH. For visualization it is useful, though not mandatory for RATT, to have NCBI BLAST installed (to compare genomes in ACT).

2. Download the RATT tarball to a directory of your choice. Unpack it with tar xvzf ratt.v0.95.tgz.

3. Set the variable RATT_HOME to the directory where you unpacked the program. For example, if you downloaded RATT to ~/programs/, it will be unpacked into ~/programs/ratt/. Set the variable as follows (for bash):

    RATT_HOME=~/programs/ratt/; export RATT_HOME

These lines should be written into the ~/.bashrc (or equivalent system file).

4. As start codons and splice sites might vary between organisms, you may need to adapt the $RATT_HOME/RATT.config file to your specific needs. Example configuration files for bacteria and eukaryotes, called RATT.config_bac and RATT.config_euk, are in the $RATT_HOME directory. If you write your own, please do not change the tag lines (those starting with #). Example of a config file:

  #START
  ATG 
  #STOP
  TGA
  TAA
  TAG
  #SPLICE
  GT..AG
  #CORRECTSPLICE
  1

5. You are ready to go.
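Since the tag lines of the configuration file in step 4 must stay unchanged, a hand-edited config can be checked for them. This is a minimal sketch; check_ratt_config is a hypothetical helper, and the assumption that each tag sits alone on its line is taken from the example config above, not from the RATT source.

```shell
#!/usr/bin/env bash
# check_ratt_config is a hypothetical helper, not part of RATT.
# It verifies that the four tag lines from the example config are present.
# Assumption: each tag stands alone on a line (trailing blanks allowed).
check_ratt_config() {
    local config=$1 tag status=0
    for tag in '#START' '#STOP' '#SPLICE' '#CORRECTSPLICE'; do
        if ! grep -qE "^${tag}[[:space:]]*\$" "$config"; then
            echo "$config: tag $tag not found" >&2
            status=1
        fi
    done
    return "$status"
}

# Check the active config only if it exists where RATT_HOME points.
config="${RATT_HOME:-.}/RATT.config"
if [ -f "$config" ] && check_ratt_config "$config"; then
    echo "$config contains all four tags"
fi
```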

How to call the program

RATT should be easy to call. The most difficult settings to get right are the nucmer parameters for determining synteny. To aid the user we have predefined several parameter sets which should be suitable for most transfers. However, advanced users can alter the nucmer parameters if they wish.

You will need embl files of the reference (parent) sequence, and these should be copied to a subdirectory within your working directory, e.g. embl. For the query you will need a (multi-)fasta file containing each contig/chromosome to be annotated.

Once you have the above files you can use RATT to transfer your annotations. For example, if you wished to transfer annotations between two strains of the same species, you would use:

$RATT_HOME/start.ratt.sh embl query.fasta Transfer1 Strain

More specifically, you can start RATT using our example dataset with:

start.ratt.sh ./embl Tb_F11.fasta F11 Strain

Here is the explanation of the parameters:

  $RATT_HOME/start.ratt.sh <Directory with embl-files> <Query-fasta sequence> <Resultname> <Transfer type> <optional: reference (multi) Fasta>


  
  Directory name with 
  embl-annotation files  - This directory contains all the embl files that should be transferred to the query.
  Query.fasta            - A multifasta file to which the annotation will be mapped.
  ResultName             - The prefix you wish to give to each result file.
  Transfer type          - One of the following parameter sets (see below for the settings each set uses)
       (i)   Assembly:             Transfer between different assemblies.
       (ii)  Assembly.Repetitive:  As before, but the genome is extremely repetitive.
                   Use this only if the parameter Assembly doesn't return good results (misses too many annotation tags).
       (iii) Strain:               Transfer between strains. Similarity is between 95-99%.
       (iv)  Strain.Repetitive:    As before, but the genome is extremely repetitive.
                   Use this only if the parameter Strain doesn't return good results (misses too many annotation tags).
       (v)   Species:              Transfer between species. Similarity is between 50-94%.
       (vi)  Species.Repetitive:   As before, but the genome is extremely repetitive.
                   Use this only if the parameter Species doesn't return good results (misses too many annotation tags).
       (vii) Multiple:             When many annotated strains are used as references and you assume the newly sequenced genome has many insertions
                   compared to the reference strains. This parameter will use the best regions of each reference strain to transfer tags.
       (viii)Free:                 The user sets all parameters individually.


  reference fasta        - Name of the reference multi-fasta. VERY IMPORTANT: the name of each sequence in the fasta description
                   MUST be the same as the name of its corresponding embl file. So if your embl file is called Tuberculosis.embl,
                   the description in your reference.fasta file has to be
                           >Tuberculosis
                           ATTGCGTACG
                           ...
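The naming rule for the optional reference fasta can be checked before starting a transfer. This is a minimal sketch; check_reference_names is a hypothetical helper, and comparing only the first word after ">" against the embl file name without its .embl extension is an assumption based on the Tuberculosis example above.

```shell
#!/usr/bin/env bash
# check_reference_names is a hypothetical helper, not part of RATT.
# For every <name>.embl in the directory it looks for a ">name" header
# in the reference multi-fasta.
check_reference_names() {
    local embl_dir=$1 ref_fasta=$2 embl name status=0
    for embl in "$embl_dir"/*.embl; do
        [ -e "$embl" ] || continue      # directory holds no .embl files
        name=$(basename "$embl" .embl)
        # Assumption: the sequence name is the first word after ">".
        if ! awk -v n="$name" '
                /^>/ { header = $1; sub(/^>/, "", header); if (header == n) found = 1 }
                END  { exit !found }
             ' "$ref_fasta"; then
            echo "no header >$name in $ref_fasta" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call, run only when both inputs exist in the working directory.
if [ -d embl ] && [ -f reference.fasta ]; then
    if check_reference_names embl reference.fasta; then
        echo "every embl file has a matching fasta header"
    fi
fi
```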

Here is the explanation of the parameters used for the synteny with MUMmer:

Parameter sets for RATT

parameter name       word size  identity cutoff  cluster size  max extend cluster  anchor choice  rearrange       example use
Assembly             25         99               400           1000                               -r              Plasmodium falciparum onto itself
Assembly.Repetitive  25         99               400           1000                --maxmatch     -r              Plasmodium berghei onto itself
Strain               25         85               300           500                                -r              Mycobacterium tuberculosis H37Rv onto M. tuberculosis F11
Strain.Repetitive    25         85               300           5000                --maxmatch     -r
Species              10         40               400           500                                -r              Salmonella typhimurium onto S. virchow
Species.Repetitive   10         40               400           500                 --maxmatch     -r              Plasmodium chabaudi onto P. berghei
Multiple             25         98               400           1000                --maxmatch     -q              Different Salmonella onto S. virchow
Free*                RATT_l     RATT_ind         RATT_c        RATT_g              RATT_anchor    RATT_rearrange

(*) - must be set as bash variables. Alternatively the user might just update the start.ratt.sh file.
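As an illustration of the Free transfer type, the six variables named in the last table row could be set as follows. This is only a sketch: the values are copied from the Strain.Repetitive row, exporting the variables (so that start.ratt.sh can see them as a child process) and leaving RATT_anchor empty for the non-repetitive sets are assumptions, and the file names are the placeholders used earlier in this section.

```shell
#!/usr/bin/env bash
# Values copied from the Strain.Repetitive row of the table; adjust as needed.
export RATT_l=25                # word size
export RATT_ind=85              # identity cutoff
export RATT_c=300               # cluster size
export RATT_g=5000              # max extend cluster
export RATT_anchor=--maxmatch   # anchor choice (assumed: empty for non-repetitive sets)
export RATT_rearrange=-r        # rearrange

# Start the transfer only if RATT is installed where RATT_HOME points.
if [ -x "${RATT_HOME:-}/start.ratt.sh" ]; then
    "$RATT_HOME/start.ratt.sh" embl query.fasta Transfer1 Free
fi
```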

Example Files

We included an example, see http://ratt.sourceforge.net/example.html. It describes the transfer of the Mycobacterium tuberculosis H37Rv annotation onto M. tuberculosis F11.

Output files

There are several types of output file: statistics that report differences, files that refer to the query and files that refer to the reference. All file names start with the ResultName prefix specified by the user when starting RATT. Report files end with .csv and can be imported into spreadsheet programs. Annotation files end with .gff or .embl and can be loaded into Artemis or ACT, see below. All files that carry the name of a reference replicon are relative to the reference. Files that carry the name of a query replicon are relative to the query sequence.

Reports:
The first report is printed while the program is running. It tells the user how many regions of the reference are syntenic with the query and vice versa. It also reports how many tags were transferred and how many were not. Tags include features like ncRNA, UTR, gap-tags, repetitive regions or CDS.

The file ResultName-prefix.replicon.report.csv reports how many gene models were wrong after the transfer, and how they could be corrected.

Files for the reference:

ResultName-prefix.replicon.NOTTransfered.embl - These are annotations that couldn't be transferred. This can include whole genes, or just exons.
Reference/ResultName-prefix.replicon.Mutations.gff - This file contains all the differences of the query compared to the reference. It also shows the regions that are not syntenic between both genomes. This can be due to insertions/deletions, low similarity, or 100% similar repeats. Important: the annotation of those regions cannot be transferred!

Files for the query:

ResultName-prefix.replicon.embl - These are the uncorrected annotations transferred from the reference onto the query.
ResultName-prefix.replicon.Final.embl - These are the corrected annotations for the query.
ResultName-prefix.replicon.report.gff - An important file, as it shows where RATT has corrected CDS models, or where errors remain. This includes corrections/errors in start/stop codons, splice sites, frameshifts and joined exons.
Query/ResultName-prefix.replicon.Mutations.gff - This file contains all the differences between the reference and query. In addition, it shows regions that are not syntenic between both genomes. This can be due to insertions/deletions, low similarity, or 100% similar repeats. Important: the annotation of these regions will not be transferred! In these regions of the query, the annotation must be determined by other tools.
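For a quick overview of a .gff output file outside Artemis, the feature types can be tallied on the command line. This is a minimal sketch; gff_feature_counts is a hypothetical helper, and it assumes the usual tab-separated GFF layout with the feature type in column 3, which has not been verified against RATT's output.

```shell
#!/usr/bin/env bash
# gff_feature_counts is a hypothetical helper, not part of RATT.
# It counts how often each feature type (GFF column 3) occurs in a file.
# Assumption: tab-separated GFF; comment lines start with "#".
gff_feature_counts() {
    awk -F '\t' '!/^#/ && NF >= 5 { n[$3]++ }
                 END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Summarise every Mutations.gff in Query/, if that directory has any.
for gff in Query/*.Mutations.gff; do
    [ -e "$gff" ] || continue
    echo "== $gff"
    gff_feature_counts "$gff"
done
```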

Post visualization

The best way to visualize RATT results is to use Artemis and ACT. The page http://ratt.sourceforge.net/example.html gives examples using these tools, but we include a brief tutorial here as well.

First, if your target genome has more than one replicon, the Query.fasta must be split into single contigs:

   mkdir Seq;
   cd Seq;
   $RATT_HOME/main.ratt.pl Split F11.fasta
   cd ..

This assumes that your ResultName was F11 and that the query is called F11.fasta.

To view the annotation:

   art Seq/F11.fasta + F11.embl + F11.final.embl + Query/F11.Mutations.gff + F11.Report.gff

To see a comparative view with the transferred and untransferred gene models, you must first generate a comparison file (-m8) using BLAST. To perform this with the example set, make sure blastall is installed.

  formatdb -p F -i embl/Tb_H37Rv.fasta
  blastall -p blastn -m 8 -e 1e-10 -d embl/Tb_H37Rv.fasta -i Seq/F11.fasta -o comp.tb.blast

Now it can be opened in ACT:

   act embl/Tb_H37Rv.embl comp.tb.blast F11.fasta

Then open the annotation files by clicking on File -> F11.fasta -> open entries and selecting the files F11.final.embl and F11.report.gff.

One can see that the first gene models have transferred perfectly.

To see regions where the annotation couldn't be transferred, load the file F11.H37Rv.NOTTransfered.embl onto the Tb_H37Rv.embl file (Menu: File -> Tb_H37Rv.embl -> New Entry). For comparative purposes load the entries F11.orignal.embl and F11.embl onto the F11.fasta file (Menu: File -> F11.fasta -> New Entry). Next, right-click on the F11 genome sequence and select "one line per entry" in the pop-up menu. Repeat this for the H37Rv genome.

For the biological interpretation

Here we describe how we would propose to analyse the output of RATT. Generally, there is not much interest in genes that are similar between two genomes. Deleted genes, new genes, genes that are or were pseudogenes, or changes in genes are usually more informative. We don't want to postulate that a few SNPs could change the transcription of a whole promoter, as this kind of analysis is one step after the annotation. Having said that, RATT does show SNP differences between genomes, so this kind of analysis could be run.

First the results should be loaded into Artemis:

   art Seq/F11.fasta + F11.final.embl + Query/F11.Mutations.gff + F11.Report.gff

Seq/F11.fasta is the sequence file.
F11.final.embl contains the final annotation.
Query/F11.Mutations.gff contains the differences between the two genomes (SNPs/indels) as well as the regions of the genomes that are not in synteny.
F11.Report.gff reports the changes made by RATT and therefore also indicates where genes are different.

Obviously, these files can also be loaded into an ACT view, as described in Post visualization. Here we describe the use in Artemis, which is nearly identical to ACT.

First, one should have a look at the regions that have no synteny with the reference:

  Menu: Select -> Feature Selector: As key, replace "CDS" with "Synteny". 
  Check the Key box and uncheck the Qualifier box. 
  Then press view.

A new window will open. Each line in this window records a region with no synteny. If the region is small (less than 200 base pairs), it is probably due to a deletion. RATT should be able to fix the gene models where these deletions occur, but the resulting genes are likely to be quite different. If the region is bigger, it might be a gap or a real insertion in the query. Such regions might therefore contain genes that:

  (i) Have lower similarity than specified in the comparison
  (ii) Are deleted in the reference
  (iii) Are a possible horizontal transfer
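The same regions can also be listed outside Artemis, split at the 200 base pair mark mentioned above. This is a minimal sketch; list_nonsyntenic is a hypothetical helper, and it assumes that the "Synteny" key selected in Artemis is stored in column 3 of a tab-separated Mutations.gff with start and end in columns 4 and 5.

```shell
#!/usr/bin/env bash
# list_nonsyntenic is a hypothetical helper, not part of RATT.
# It prints sequence, start, end, length and a size class for every
# feature whose key (GFF column 3) is "Synteny".
# "small" (< 200 bp) marks probable deletions, "large" possible gaps or insertions.
list_nonsyntenic() {
    awk -F '\t' -v OFS='\t' '
        $3 == "Synteny" {
            len = $5 - $4 + 1
            print $1, $4, $5, len, (len < 200 ? "small" : "large")
        }' "$1"
}

# Example call, run only if the file from the walkthrough exists.
if [ -f Query/F11.Mutations.gff ]; then
    list_nonsyntenic Query/F11.Mutations.gff
fi
```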

Next we propose to look for changes in the genes. First tick only the entry F11.Report.gff in the Artemis window (disable the entries F11.final.embl and F11.Mutations.gff). The features you now see in Artemis are: Error, Frameshift, CorrectStart, CorrectStop. By systematically going through this list and checking the new annotation (enable the entry F11.final.embl again), you can find:

  Extended genes 
  Shorter genes - important domain deleted?
  Genes that are now pseudogenes
  Genes that were pseudogenes

This is very useful for getting a feeling for what kind of genes have changed. A biologist working with the species can easily determine whether important genes have changed.

The last step is to open the NOTTransfered genes. These are the genes that couldn't be transferred due to deletions or too low similarity. The file can be viewed directly using ACT, or in Artemis:

  art Tb_H37Rv.embl + F11.H37Rv.NOTTransfered.embl

Just unselect the Tb_H37Rv.embl entry and you will see the unmapped annotation features.

For more information about Artemis and Act, please find user manuals here: http://www.sanger.ac.uk/Software/Artemis/manual/ and http://www.sanger.ac.uk/Software/ACT/v7/manual/.

Functionality of main.ratt.pl

The main program is main.ratt.pl. Normally a user won't need to call this program directly. Nevertheless, we describe here its different functions and how to call them:

$RATT_HOME/main.ratt.pl Transfer <embl Directory> <mummer SNP file> <mummer coord file> <ResultName>

This functionality uses the mummer output to map the annotation from embl files, which are in the <embl Directory>, to the query. It generates all the new annotation files (ResultName.replicon.embl), as well as files describing which annotations remain untransferred (Replicon_reference.NOTtransfered.embl).

$RATT_HOME/main.ratt.pl Correct <EMBL file> <fasta file> <ResultName>

Corrects a given annotation, as described previously. The corrections are reported and the new file is saved as <ResultName>.embl.

$RATT_HOME/main.ratt.pl Check <EMBL file> <fasta file> <ResultName>

Similar to the correct option, but it will only report errors in an EMBL file.

$RATT_HOME/main.ratt.pl EMBLFormatCheck <EMBL file> <ResultName postfix>

Some EMBL files have feature positions spanning several lines. This function consolidates these features so that each appears on one line. The result name is <EMBL File>.<ResultName postfix>.

$RATT_HOME/main.ratt.pl Mutate <(multi-)fasta-file>

Every 250 base pairs a base is changed (mutated). The result is saved as <fastafile>.mutated. This is necessary to recalibrate RATT for similar genomes.

$RATT_HOME/main.ratt.pl Split <multifasta-file>

Splits a given multifasta file into individual files containing one sequence. This is necessary as visualization tools (e.g. Artemis) prefer single fasta files.
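To illustrate what such a split does, here is a stand-in written with awk. It is not RATT's own code: split_multifasta is a hypothetical function, and naming each output file after the first word of its header plus ".fasta" is an assumption, so the files written by main.ratt.pl Split may be named differently.

```shell
#!/usr/bin/env bash
# split_multifasta is a hypothetical stand-in, not RATT's own code.
# Each record is written to <first word of header>.fasta in the current
# directory (assumed naming; RATT's Split may name its files differently).
split_multifasta() {
    awk '/^>/ { if (out) close(out); out = substr($1, 2) ".fasta" }
         out  { print > out }' "$1"
}

# usage: (mkdir -p Seq && cd Seq && split_multifasta ../F11.fasta)
```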

$RATT_HOME/main.ratt.pl Difference <mummer SNP file> <mummer coord file> <ResultName>

Generates files that report the SNPs, indels and regions not shared by the reference and query. It also prints statistics reporting the coverage of each replicon.

$RATT_HOME/main.ratt.pl Embl2Fasta <EMBL dir> <fasta file>

Extracts the sequence from embl files in the <EMBL directory> and saves it as a <fasta file>.
