Training AUGUSTUS

This manual is intended for those who want to train AUGUSTUS for another species. Please do not rely blindly on this manual or on the scripts and programs. Check what they do on your data!

1. COMPILE A SET OF TRAINING AND TEST GENES

You will need a set of genomic sequences with bona fide gene structures (sequence coordinates of starts and ends of exons and genes). In many cases, or as a first step towards modelling complete genes, it is sufficient to have only the coding parts of the gene structure (CDS).

[+] Number and quality of gene structures...

[+] The gene set should be non-redundant...

Each sequence can contain one or more genes, and the genes can be on either strand. However, the genes must not overlap, and only one transcript per gene is allowed. Store the sequences together with their annotation in a simple GenBank format. For the exact format that the training program can read, look at one of the training GenBank files on the AUGUSTUS web server as an example: http://bioinf.uni-greifswald.de/augustus/datasets/
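
The non-overlap requirement can be checked before training. The following is a minimal sketch, assuming the start and end coordinates of each gene on one sequence have already been extracted from the annotation. The function name find_overlaps and the tuple layout are illustrative and not part of AUGUSTUS; strand is ignored here, which is an assumption.

```python
def find_overlaps(genes):
    """Return pairs of gene names whose coordinate ranges overlap.

    genes: list of (name, start, end) tuples on one sequence,
    1-based inclusive coordinates, strand ignored (assumption).
    """
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b > end_a:
                break  # sorted by start: no later gene can overlap gene a
            overlaps.append((name_a, name_b))
    return overlaps


genes = [("g1", 100, 900), ("g2", 850, 1500), ("g3", 2000, 2600)]
print(find_overlaps(genes))  # [('g1', 'g2')]
```

Any pair reported here would have to be resolved, e.g. by dropping one of the two genes, before the set is used for training.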

1.1 Options for compiling a set of gene structures

pre-existing gene structures (e.g. from GenBank)
spliced alignments of ESTs against the assembled genomic sequence (e.g. using PASA)
spliced alignments of de novo assembled transcriptome short reads (RNA-Seq)
spliced alignments of protein sequences of the same or a very closely related species against the assembled genomic sequence, e.g. using Scipio.
This approach is described in section Using Scipio to create a training set.
gene structures from a related species
Iteration of training with predicted genes, starting with an existing parameter set
BRAKER: a new pipeline that combines GeneMark-ET and AUGUSTUS. It uses a genome and RNA-Seq alignments as input. Both programs are automatically trained and genes are predicted genome-wide using the RNA-Seq data. This was tested to work very well on Drosophila, C. elegans, Arabidopsis and S. pombe, but it may fail on very complex genomes.
the CEGMA pipeline to identify the structure of core eukaryotic genes

1.2 Split gene structure set into training and test set

Randomly split the set of annotated sequences into a training set and a test set.

randomSplit.pl genes.gb 100

This generates a file genes.gb.test with 100 randomly chosen loci and a disjoint file genes.gb.train with the rest of the loci from genes.gb:

grep -c LOCUS genes.gb*
# genes.gb:492
# genes.gb.test:100
# genes.gb.train:392

In order for the test accuracy to be statistically meaningful, the test set should also be large enough (100-200 genes). Split the set of gene structures truly at random! Do not just take the first or the last part of the file, as the test set is then unlikely to be representative. The script randomSplit.pl is in the scripts directory.
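
To illustrate what such a random split does, here is a minimal sketch. It is not the code of randomSplit.pl; it only assumes that every record starts with a line beginning with LOCUS (as the grep above suggests). The function names split_records and random_split are made up for this example.

```python
import random


def split_records(text):
    """Split GenBank-style text into records, each starting at a LOCUS line."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith("LOCUS") and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records


def random_split(records, n_test, seed=None):
    """Return (test, train): n_test randomly chosen records and the rest."""
    if n_test > len(records):
        raise ValueError("test set larger than the number of records")
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(records)), n_test))
    test = [r for i, r in enumerate(records) if i in test_idx]
    train = [r for i, r in enumerate(records) if i not in test_idx]
    return test, train


# Ten toy records standing in for genes.gb.
demo = "".join(f"LOCUS       seq{i}\n//\n" for i in range(10))
test, train = random_split(split_records(demo), 3, seed=1)
print(len(test), len(train))  # 3 7
```

Sampling indices without replacement guarantees that the two sets are disjoint and together cover all loci, which is the property the grep counts above confirm.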

[+] Additional splice site training set...

In rare cases it may be useful to specify a set of splice site sequences that should be used for training in addition to the complete genes. This applies, for example, if the number of intron examples in your training set is small but you have additional intron examples, e.g. from spliced alignments of ESTs.
The format is shown in the following example.

dss gccgagaactccgctcgttctgtgcgttctcctgtcccaggtagggaagaggggctgccgggcgcgctctgcgccccgtttc
dss cgtgattgtcggggggaaagacatccagggctccttgcaggtaacacatctgtttgagataacttgggttcaaggaggacat
dss agagaatcagagacagcctttcccaagagatgttggcaaggtaagtcagacaaacagcaaatgacaaaaacatgtttttatg
dss cattgtcactgttgtgtcacctgcgctgctggaccgagaggtgagctgaaaagaataccactttctttttcacgagaataga
dss tgacaaaaatgatcactcaccaaaattcaccaagaaagaggtaaacccctgtgccaaacaccaaccaccactgtggtcacag
ass gttagtatgcttctttaattttttttctccctgaaattataggaaccagatgttaaaaaattagaagaccaacttcaaggcg
ass --------------------------ggctttgtctttgcagaatttatagagcggcagcacgcaaagaacaggtattacta
ass gattccttgtgattagcctctcttgctccttttctccaccagcaaagtcgaccaagaaattatcaacattatgcaggatcgg
ass aaccgtagtaaacagcatgaatcgtgttttgtttttgaacagaccactggccttgtgggattggctgtgtgcaatactcctc

dss: donor (=5') splice site. 40 letters + gt + 40 letters
ass: acceptor (=3') splice site. 40 letters + ag + 40 letters
use '-' for unknown characters
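
Because the format is positional, it is easy to get a window off by one. A minimal sketch of a line check follows, assuming lowercase a, c, g, t and '-' are the only characters used, as in the example above. The function name check_splice_line is hypothetical and not part of AUGUSTUS.

```python
import re

FLANK = 40  # letters on each side of the dinucleotide, per the format above
DINUCLEOTIDE = {"dss": "gt", "ass": "ag"}
ALLOWED = re.compile(r"[acgt-]+")  # assumption: lowercase bases and '-'


def check_splice_line(line):
    """Return None if the line is well formed, else a short error message."""
    parts = line.split()
    if len(parts) != 2:
        return "expected '<type> <sequence>'"
    kind, seq = parts
    if kind not in DINUCLEOTIDE:
        return f"unknown type {kind!r}"
    if len(seq) != 2 * FLANK + 2:
        return f"length {len(seq)}, expected {2 * FLANK + 2}"
    if not ALLOWED.fullmatch(seq):
        return "unexpected character"
    if seq[FLANK:FLANK + 2] != DINUCLEOTIDE[kind]:
        return f"no {DINUCLEOTIDE[kind]} at position {FLANK + 1}"
    return None


good = "dss " + "a" * 40 + "gt" + "c" * 40
bad = "ass " + "-" * 40 + "gt" + "c" * 40
print(check_splice_line(good))  # None
print(check_splice_line(bad))   # no ag at position 41
```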

2. CREATE A META PARAMETERS FILE FOR YOUR SPECIES

We call parameters like the size of the window of the splice site models and the order of the Markov model meta parameters, in contrast to parameters like the distribution of splice site patterns, the k-mer probabilities of coding and noncoding regions. There are a few dozen meta parameters but many thousands of parameters. The meta parameters determine how the parameters are calculated.
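
The distinction can be illustrated with a toy Markov model. In the sketch below the order is the meta parameter, and the conditional base probabilities estimated from the training sequences are the parameters. This is only an illustration by plain relative frequencies; it is not how etraining computes its parameters, and the name train_markov is made up.

```python
from collections import Counter, defaultdict


def train_markov(sequences, order):
    """Estimate P(next base | previous `order` bases) by relative frequency."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return {
        context: {base: n / sum(c.values()) for base, n in c.items()}
        for context, c in counts.items()
    }


training = ["atgatgatgc", "atgcatgc"]
for order in (1, 2):  # the meta parameter
    model = train_markov(training, order)  # the parameters
    print(order, len(model))
```

Changing the order changes how many parameters exist and how each is estimated, which is why the meta parameters have to be fixed before the parameters can be trained.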

Create the files for training "bug" from a template.

new_species.pl --species=bug

new_species.pl uses the environment variable AUGUSTUS_CONFIG_PATH to determine the directory in which AUGUSTUS stores the species parameters. You should see a report like this:

creating directory /home/mario/augustus/trunk/config/species/bug/ ...
creating /home/mario/augustus/trunk/config/species/bug/bug_parameters.cfg ...
creating /home/mario/augustus/trunk/config/species/bug/bug_weightmatrix.txt ...
creating /home/mario/augustus/trunk/config/species/bug/bug_metapars.cfg ...
...

Besides meta parameters, the file bug_parameters.cfg also contains parameters for augustus and etraining, such as defaults for output format settings.

[+] The *_parameters.cfg file...

contains output format options like

`protein`	turn on/off the inclusion of predicted peptide sequences in the output
`codingseq`	turn on/off the inclusion of predicted coding sequences in the output

and settings like

`alternatives-from-evidence`	turn on/off prediction of alternative transcripts based on hints
`UTR`	turn on/off the prediction of untranslated regions

All of the parameters set in this file can equally be given on the command line, e.g.

augustus --species=bug --protein=on input.fa

in which case they override the setting in this file. The parameters at the top of the file can be edited as desired or required for the species.

The optional file with the filename given by /IntronModel/splicefile may contain a list of sequence windows of known splice sites as described above.

For more info have a look at the comments in the file and at the README.TXT that is included with AUGUSTUS.

[+] Distinguishing GC content classes...

3. MAKE AN INITIAL TRAINING

This step may be skipped if step 4 below is done. However, in a semi-automatic setting (you type the commands in this document) it is not recommended to skip it. Train augustus for the species bug on the training set of gene structures.

etraining --species=bug genes.gb.train

This creates/updates parameter files for exon, intron and intergenic region in the directory $AUGUSTUS_CONFIG_PATH/species/bug.

ls -ort $AUGUSTUS_CONFIG_PATH/species/bug/

now yields

-rw------- 1 mario    810 Jun 23 16:48 bug_weightmatrix.txt
-rw------- 1 mario   2057 Jun 23 16:48 bug_metapars.cfg
-rw------- 1 mario   1356 Jun 23 16:48 bug_metapars.utr.cfg
-rw-rw-r-- 1 mario   1125 Jun 23 16:48 bug_metapars.cgp.cfg
-rw-rw-r-- 1 mario   7162 Jun 23 16:49 bug_parameters.cfg~
-rw-rw-r-- 1 mario   7163 Jun 23 16:50 bug_parameters.cfg
-rw-rw-r-- 1 mario 350278 Jun 23 16:51 bug_intron_probs.pbl
-rw-rw-r-- 1 mario 256132 Jun 23 16:51 bug_exon_probs.pbl
-rw-rw-r-- 1 mario  32545 Jun 23 16:51 bug_igenic_probs.pbl

where bug_{intron,exon,igenic}_probs.pbl are our newly created parameter files.
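
If you script the training, it can help to verify that etraining actually wrote these files. A minimal sketch, assuming the file names follow the pattern in the listing above; the helper missing_parameter_files is hypothetical and not shipped with AUGUSTUS.

```python
import os
import tempfile

REGIONS = ("intron", "exon", "igenic")  # file name parts from the listing above


def missing_parameter_files(config_path, species):
    """Return the expected *_probs.pbl files that are not present."""
    species_dir = os.path.join(config_path, "species", species)
    expected = [f"{species}_{region}_probs.pbl" for region in REGIONS]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(species_dir, name))
    ]


# Demo on a temporary directory standing in for $AUGUSTUS_CONFIG_PATH.
with tempfile.TemporaryDirectory() as config_path:
    bug_dir = os.path.join(config_path, "species", "bug")
    os.makedirs(bug_dir)
    open(os.path.join(bug_dir, "bug_exon_probs.pbl"), "w").close()
    print(missing_parameter_files(config_path, "bug"))
    # ['bug_intron_probs.pbl', 'bug_igenic_probs.pbl']
```

In real use you would pass os.environ["AUGUSTUS_CONFIG_PATH"] as the first argument.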

Now we make a first try and predict the genes in genes.gb.test ab initio.

augustus --species=bug genes.gb.test | tee firsttest.out # takes ~1m

This predicts the genes in all 100 sequences and, at the end, prints a report about the prediction accuracy that compares the structures in the input file genes.gb.test with the predicted ones. Of course, only the sequences are used for the predictions, not the input gene structures.

Look at the accuracy report at the end of firsttest.out:

grep -A 22 Evaluation firsttest.out

*******      Evaluation of gene prediction     *******

---------------------------------------------\
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.873 |       0.626 |
---------------------------------------------/

----------------------------------------------------------------------------------------------------------\
           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
           | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
----------------------------------------------------------------------------------------------------------|
           |        |        |      |                253 |                101 |             |             |
exon level |    484 |    332 |  231 | ------------------ | ------------------ |       0.696 |       0.477 |
           |    484 |    332 |      |   35 |    0 |  218 |   36 |    0 |   65 |             |             |
----------------------------------------------------------------------------------------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level |   156 |   100 |   47 |  109 |   53 |        0.47 |       0.301 |
----------------------------------------------------------------------------/

These numbers mean, for example, that

of the 100 genes 47 were predicted exactly
69.6% of the exons were predicted exactly
47.7% of the predicted exons were exactly as in the test set.
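
The numbers in the report are consistent with sensitivity = TP / #anno and specificity = TP / #pred (elsewhere this specificity is often called precision). A short sketch that recomputes the exon and gene level values from the counts above; the function name accuracy is only for illustration.

```python
def accuracy(true_positives, n_annotated, n_predicted):
    """Return (sensitivity, specificity) as used in the report above."""
    sensitivity = true_positives / n_annotated
    specificity = true_positives / n_predicted
    return sensitivity, specificity


# Numbers taken from the exon and gene rows of firsttest.out above.
exon_sn, exon_sp = accuracy(231, 332, 484)
gene_sn, gene_sp = accuracy(47, 100, 156)
print(f"exon level: {exon_sn:.3f} {exon_sp:.3f}")  # exon level: 0.696 0.477
print(f"gene level: {gene_sn:.3f} {gene_sp:.3f}")  # gene level: 0.470 0.301
```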

[+] Compare with the shipped fly parameters...

The fly parameters that come with augustus yield a somewhat better accuracy:

augustus --species=fly genes.gb.test --UTR=off | grep -A 22 Evaluation # takes ~1m

gives the following accuracy report:

*******      Evaluation of gene prediction     *******

---------------------------------------------\
                 | sensitivity | specificity |
---------------------------------------------|
nucleotide level |       0.956 |       0.633 |
---------------------------------------------/

----------------------------------------------------------------------------------------------------------\
           |  #pred |  #anno |      |    FP = false pos. |    FN = false neg. |             |             |
           | total/ | total/ |   TP |--------------------|--------------------| sensitivity | specificity |
           | unique | unique |      | part | ovlp | wrng | part | ovlp | wrng |             |             |
----------------------------------------------------------------------------------------------------------|
           |        |        |      |                280 |                 55 |             |             |
exon level |    557 |    332 |  277 | ------------------ | ------------------ |       0.834 |       0.497 |
           |    557 |    332 |      |   31 |    2 |  247 |   31 |    0 |   24 |             |             |
----------------------------------------------------------------------------------------------------------/

----------------------------------------------------------------------------\
transcript | #pred | #anno |   TP |   FP |   FN | sensitivity | specificity |
----------------------------------------------------------------------------|
gene level |   158 |   100 |   59 |   99 |   41 |        0.59 |       0.373 |
----------------------------------------------------------------------------/

4. RUN THE SCRIPT `optimize_augustus.pl`

This script optimizes the prediction accuracy by adjusting the meta parameters in the *_parameters.cfg file. The script alternately runs the programs augustus and etraining. This usually increases prediction accuracy by a few percentage points, but it runs for hours or days. It may be skipped, in which case etraining is run only once (step 3 above), which is very quick. augustus and etraining must be in the $PATH.

You need to tell optimize_augustus.pl which meta parameters it should optimize. Do this by adjusting the file config/species/generic/generic_metapars.cfg. (You may also make a copy of it and then pass the command line parameter --metapars=nameofmycopy to the script optimize_augustus.pl.)

Run

optimize_augustus.pl --species=bug genes.gb.train  # takes ~1d

[+] What optimize_augustus.pl does...

After optimize_augustus.pl has finished (or after you have interrupted it), you should (re)train AUGUSTUS with the meta parameters it has set.

etraining --species=bug genes.gb.train

If you have a test set, you can now check the prediction accuracy on this test set by running

augustus --species=bug genes.gb.test

The end of the output will then contain a summary of the accuracy of the prediction. If the gene level sensitivity is below 20%, it is likely that the training set is not large enough, that its quality is poor, or that the species is somehow 'special'.
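
If you want to apply this check automatically, the gene level row can be read from the report. A minimal sketch, assuming the report has the column layout shown in step 3; the function name gene_level_sensitivity is made up for this example.

```python
import re

GENE_ROW = re.compile(r"^gene level\s*\|(.*)\|\s*$", re.MULTILINE)


def gene_level_sensitivity(report):
    """Extract the gene level sensitivity from an accuracy report."""
    match = GENE_ROW.search(report)
    if match is None:
        raise ValueError("no 'gene level' row found")
    # Columns: #pred, #anno, TP, FP, FN, sensitivity, specificity
    cells = [cell.strip() for cell in match.group(1).split("|")]
    return float(cells[5])


report = "gene level |   156 |   100 |   47 |  109 |   53 |        0.47 |       0.301 |\n"
sensitivity = gene_level_sensitivity(report)
print(sensitivity, "low" if sensitivity < 0.20 else "ok")  # 0.47 ok
```
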
If you succeeded in creating a good AUGUSTUS version for your species I would be very interested in your results. If possible please share your results by giving me the packed config/yourspecies folder.

5. SPECIAL CASE: ORGANISM WITH DIFFERENT GENETIC CODE

AUGUSTUS can be told to use a different translation table, in particular one with a different set of stop codons. This is useful for a small number of species such as Tetrahymena thermophila, in which some codons translate to a different amino acid than usual. If you train AUGUSTUS for such a species, set the variable translation_table in the parameter file of your species. Further, adjust the stop codon probabilities in the same config file. E.g. say

translation_table 6
/Constant/amberprob 0 # Prob(stop codon = tag), if 0 tag is assumed to code for amino acid
/Constant/ochreprob 0 # Prob(stop codon = taa), if 0 taa is assumed to code for amino acid
/Constant/opalprob 1 # Prob(stop codon = tga), if 0 tga is assumed to code for amino acid

in the case of Tetrahymena, where taa and tag are coding for glutamine (Q).

Choose the translation table number according to the table below. translation_table=1 is the default value and the standard code with stop codons taa, tga, tag. If you have a species with the standard genetic code you don't have to do anything. In case your species' code is not covered by this table, send us a note with the string of 64 one-letter amino acid codes in the codon order below.

translation	a	a	a	a	a	a	a	a	a	a	a	a	a	a	a	a	c	c	c	c	c	c	c	c	c	c	c	c	c	c	c	c	g	g	g	g	g	g	g	g	g	g	g	g	g	g	g	g	t	t	t	t	t	t	t	t	t	t	t	t	t	t	t	t
table	a	a	a	a	c	c	c	c	g	g	g	g	t	t	t	t	a	a	a	a	c	c	c	c	g	g	g	g	t	t	t	t	a	a	a	a	c	c	c	c	g	g	g	g	t	t	t	t	a	a	a	a	c	c	c	c	g	g	g	g	t	t	t	t
number	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t	a	c	g	t
1	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	*	C	W	C	L	F	L	F
2	K	N	K	N	T	T	T	T	*	S	*	S	M	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	W	C	W	C	L	F	L	F
3	K	N	K	N	T	T	T	T	R	S	R	S	M	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	T	T	T	T	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	W	C	W	C	L	F	L	F
4	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	W	C	W	C	L	F	L	F
5	K	N	K	N	T	T	T	T	S	S	S	S	M	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	W	C	W	C	L	F	L	F
6	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	Q	Y	Q	Y	S	S	S	S	*	C	W	C	L	F	L	F
9	N	N	K	N	T	T	T	T	S	S	S	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	W	C	W	C	L	F	L	F
10	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	C	C	W	C	L	F	L	F
11	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	*	C	W	C	L	F	L	F
12	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	S	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	*	C	W	C	L	F	L	F
13	K	N	K	N	T	T	T	T	G	S	G	S	M	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	W	C	W	C	L	F	L	F
14	N	N	K	N	T	T	T	T	S	S	S	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	Y	Y	*	Y	S	S	S	S	W	C	W	C	L	F	L	F
15	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	Q	Y	S	S	S	S	*	C	W	C	L	F	L	F
16	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	L	Y	S	S	S	S	*	C	W	C	L	F	L	F
21	N	N	K	N	T	T	T	T	S	S	S	S	M	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	W	C	W	C	L	F	L	F
22	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	L	Y	*	S	S	S	*	C	W	C	L	F	L	F
23	K	N	K	N	T	T	T	T	R	S	R	S	I	I	M	I	Q	H	Q	H	P	P	P	P	R	R	R	R	L	L	L	L	E	D	E	D	A	A	A	A	G	G	G	G	V	V	V	V	*	Y	*	Y	S	S	S	S	*	C	W	C	*	F	L	F
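
To see how a row of this table is read, the sketch below writes rows 1 and 6 as 64-letter strings in the codon order of the header (first base varies slowest, bases in the order a, c, g, t) and looks up codons by index. The function names are illustrative; this is not how AUGUSTUS stores its tables internally.

```python
BASES = "acgt"  # codon order used by the table: first base varies slowest

# Rows 1 and 6 of the table above, written as 64-letter strings.
TABLE_1 = "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF"
TABLE_6 = "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVVQYQYSSSS*CWCLFLF"


def translate_codon(codon, table):
    """Look up one lowercase codon in a 64-letter table string."""
    first, second, third = (BASES.index(base) for base in codon)
    return table[16 * first + 4 * second + third]


def stop_codons(table):
    """List the codons that a table string marks with '*'."""
    return [
        a + b + c
        for a in BASES for b in BASES for c in BASES
        if translate_codon(a + b + c, table) == "*"
    ]


print(stop_codons(TABLE_1))  # ['taa', 'tag', 'tga']
print(stop_codons(TABLE_6))  # ['tga']
print(translate_codon("taa", TABLE_6))  # Q
```

The result for table 6 matches the Tetrahymena settings above: only tga remains a stop codon, so opalprob is 1 and the other two probabilities are 0.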