Iterative Training Set Construction

This page describes a method to construct a training set of gene structure using AUGUSTUS itself and RNA-Seq data. The training genes are selected from among AUGUSTUS genes predicted with previously existing parameters.

[+] Some comments about when this option may be appropriate...

1. Align RNA-Seq to genome

We will here use the STAR spliced aligner as it is fast. We first make a genome index, in our case just for chr2L, then run STAR using the index. Run STAR:

mkdir gindex
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir gindex --genomeFastaFiles chr2L.sm.fa # ~1m
STAR --runThreadN 4 --genomeDir gindex --readFilesIn rnaseq1.fq rnaseq2.fq --alignIntronMax 100000 \
  --outSAMtype BAM  SortedByCoordinate --outWigType wiggle --outWigStrand Unstranded # ~2m

These commands will create a file Aligned.sortedByCoord.out.bam with the alignments and a file Signal.Unique.str1.out.wig with coverage per base information, among others. The files will be used below for display and to generate hints for AUGUSTUS.

2. Find an appropriate existing AUGUSTUS parameter set

[+] Cross species-training,...

Run AUGUSTUS with a few existing parameter sets:

for s in nasonia zebrafish tomato; do
  augustus --species=$s chr2L.sm.fa --softmasking=on --predictionEnd=1000000 > aug.$s.1-1M.gff &
done

Above command will run AUGUSTUS in the background with three different parameter sets on the first megabase of chr2L of Drosophila melanogaster. The species selection is just an example. For Drosophila and many other insects there are already parameter sets (fly), which we chose to ignore for the purpose of the tutorial. The options --predictionEnd and --predictionStart can be used to cut out a window for prediction. We do this here to save time and as we will only visually inspect a small region. The files aug.nasonia.1-1M.gff, aug.tomato.1-1M.gff, aug.zebrafish.1-1M.gff now contain the coordinates and protein sequences of predicted genes.

...
chr2L   AUGUSTUS    gene        6693    9276    0.51    +    .    g1
chr2L   AUGUSTUS    transcript  6693    9276    0.51    +    .    g1.t1
chr2L   AUGUSTUS    start_codon 6693    6695    .       +    0    transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS    intron      6809    7573    0.94    +    .    transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS    intron      8117    8192    1       +    .    transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS    intron      8590    8667    1       +    .    transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS    CDS         6693    6808    0.57    +    0    transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS    CDS         7574    8116    0.96    +    1    transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS    CDS         8193    8589    1       +    1    transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS    CDS         8668    9276    1       +    0    transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS    stop_codon  9274    9276    .       +    0    transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MMKRDRFYWDKNDRSRERSGLNFTLTKMLAIQNGGGMKSNREENPRLNKYRRVDNSYPRFITNEAAEDLIYKKSMGER
...

3. Visualize alignments and gene predictions in a browser

We will here use the UCSC genome browser, which conveniently already exists for the genome assembly that we use (dm6). There will usually no browser for a new genome, but nevertheless the UCSC genome browser may be used to display the genome and other tracks using Track Hubs. Make "big" indexed files:

echo "chr2L   23513712" > chrom.sizes # size of chromosomes
mv Signal.Unique.str1.out.wig rnaseq.wig    # rename for convenience
mv Aligned.sortedByCoord.out.bam rnaseq.bam 
wigToBigWig rnaseq.wig chrom.sizes rnaseq.bw
bamtools index -in rnaseq.bam # index required by UCSC browser
scp rnaseq.bw rnaseq.bam rnaseq.bam.bai mario@hgwdev:~/public_html/tutorial2015/results/ # copy to web space

The data is now web-accessible through the URL http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/. The UCSC browser will load the parts requested through browsing only. Create a custom track file:

echo "browser hide all" > customtrack
echo "track name=\"STAR RNA-Seq alignments\" type=bam visibility=4 bigDataUrl=http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/rnaseq.bam" >> customtrack
echo "track name=\"STAR coverage\" type=bigWig visibility=2 bigDataUrl=http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/rnaseq.bw" >> customtrack

for s in nasonia zebrafish tomato; do
   echo "track name=\"AUGUSTUS $s abinitio\"  db=dm6 visibility=3" >> customtrack
   grep -P "AUGUSTUS\t(CDS|exon)\t" aug.$s.1-1M.gff >> customtrack # use only the relevant coordinate lines
done

The file customtrack will now look like this (grep -B 1 -A 2 track customtrack):

browser hide all
track name="STAR RNA-Seq alignments" type=bam visibility=4 bigDataUrl=http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/rnaseq.bam
track name="STAR coverage" type=bigWig visibility=2 bigDataUrl=http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/rnaseq.bw
track name="AUGUSTUS nasonia abinitio"  db=dm6 visibility=3
chr2L   AUGUSTUS        CDS     6693    6808    0.57    +       0       transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS        CDS     7574    8116    0.96    +       1       transcript_id "g1.t1"; gene_id "g1";
--
chr2L   AUGUSTUS        CDS     995604  995891  0.59    +       0       transcript_id "g159.t1"; gene_id "g159";
track name="AUGUSTUS zebrafish abinitio"  db=dm6 visibility=3
chr2L   AUGUSTUS        CDS     7574    8116    0.98    +       1       transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS        CDS     8193    8589    0.85    +       1       transcript_id "g1.t1"; gene_id "g1";
--
chr2L   AUGUSTUS        CDS     957279  957449  1       +       0       transcript_id "g135.t1"; gene_id "g135";
track name="AUGUSTUS tomato abinitio"  db=dm6 visibility=3
chr2L   AUGUSTUS        exon    6486    6808    .       +       .       transcript_id "g1.t1"; gene_id "g1";
chr2L   AUGUSTUS        CDS     6591    6808    0.76    +       0       transcript_id "g1.t1"; gene_id "g1";

This is a format that the UCSC Genome Browser understands. The first line tell it to hide the existing tracks (no cheating). The first two track lines refer to the RNA-Seq tracks and specify the URL where the data lies. The AUGUSTUS tracks are included in the file as they are small enough.

open the browser for Drosophila melanogaster,
click on "add custom tracks" or "manage custom tracks",
upload the file customtrack and
click on the link in the column "Pos" in the table of custom tracks.

The lazy ones can look at the results by clicking this link to a previously prepared custom track.

The UCSC Genome Browser group has an excellent help page for setting up UCSC custom tracks.

Browse the predictions and alignments and pick a parameter set for which the genes match the evidence best. Let us assume here that the nasonia parameter set is best.

4. Predict the genes again using the best parameters and RNA-Seq evidence

When predicting genes, AUGUSTUS can incorporate evidence on the location and structure of genes in the form of hints. Hints are extrinsic evidence a particular GFF format and will change the likelihood of gene structures candidates. AUGUSTUS will tend to predict gene structures that are in agreement with the hints. For more information see this tutorial and our Wiki page

Create hints from RNA-Seq:

bam2hints --intronsonly --in=rnaseq.bam --out=hints.intron.gff
cat rnaseq.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep \
  --radius=4.5 --pri=4 --strand="." > hints.ep.gff
cat hints.intron.gff hints.ep.gff > hints.gff

Above commands have the effect that spliced reads are used as evidence for introns, and coverage as evidence for exons. Both types of evidence are joined in one file hints.gff.

Predict genes using hints:

for i in {0..3}; do
    augustus --species=nasonia chr2L.sm.fa --softmasking=on --predictionStart=$((i*2000000)) --predictionEnd=$(((i+1)*2000000+50000)) \
    --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg  > aug.nasonia.hints.$i.gff &
done
# wait ~6m until done
cat aug.nasonia.hints.{0..3}.gff | join_aug_pred.pl > aug.nasonia.hints.gff

For quick results during the tutorial we restrict the prediction to the first 8 Mb of chr2L instead of predicting genome-wide (4x2Mb each + 50Kb overlap). The overlap would not strictly be necessary here, but it is common practice for genome-wide predictions in parallel. This way genes that are broken by the region boundary can often be predicted completely in the neighboring region if the overlap is larger than most genes. The script join_aug_pred.pl 'repairs' the break and also renumbers genes sequentially so that their names are unique again.

The files aug.nasonia.hints.gff also contain a summary for each transcrip how well it matches the evidence:

...
chr2L   AUGUSTUS        transcript      11215   17136   1       -       .       g2.t1
...
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 100
# CDS exons: 8/8
#      W:   8 
# CDS introns: 7/7
#      E:   7

Add the browser track of genes with evidence:

echo "track name=\"AUGUSTUS nasonia hints\"  db=dm6 visibility=3" >> customtrack
grep -P "AUGUSTUS\t(CDS|exon)\t" aug.nasonia.hints.gff  >> customtrack
gzip -c customtrack > customtrack.whints.gz
scp customtrack.whints.gz mario@hgwdev:~/public_html/tutorial2015/results/ # copy to web space

This will add the newly predicted track after the upload to the UCSC genome browser (replace the existing custom tracks). Click this link to see all the tracks including the new predictions with hints: customtrack.whints.gz. The predictions aug.nasonia.hints.gff should be our best shot so far for the genome annotation. However, they still include genes unsupported by RNA-Seq that are likely to have a larger error rate. We follow the strategy to select a subset of gene structures with higher reliability. This selection can include

a homology search of predicted proteins (has hit, coverage query, coverage target)
RNA-Seq support
manual selection.

Be aware: We are likely to introduce a bias when we restrict ourselves to this set, e.g. to the highly expressed genes. Also, we are introducing a bias towards genes that fit the previous model well. I usually condone these biases.
In this tutorial we apply a very simple filter: Select all transcripts that are fully supported by RNA-Seq, i.e. every intron has support from an exactly matching RNA-Seq gap and every exon has some coverage. Select subset of well-supported transcripts:

cat aug.nasonia.hints.gff | perl -ne 'if (/\ttranscript\t.*\t(\S+)/){$tx=$1;} if (/transcript supported.*100/) {print "$tx\n";}' | tee supported.lst | wc -l

Above one line script prints the transcript id of all 492 transcripts with 100 percent suppport (they may still be too short or otherwise wrong).

5. Create a training set

The training program etraining of the AUGUSTUS package requires input in the form of Genbank format, that includes both the genomic sequences and the sequence coordinates of the coding regions CDS and optionally of the UTR. In many cases the coordinates of the gene structures are available in GFF format or GTF format. The tool gff2gbSmallDNA.pl that comes with AUGUSTUS can be used to convert files in some GFF formats or GTF to Genbank format:

gff2gbSmallDNA.pl --good=supported.lst aug.nasonia.hints.gff chr2L.sm.fa 5000 genes.gb

This command takes each gene in aug.nasonia.hints.gff together with up to 5000 bp flanking intergenic region upstream and downstream of the gene and creates a LOCUS in Genbank format.

[+] Input format of gff2gbSmallDNA.pl...

The last column (column 9) of the GFF input file is used to group exons of the same transcript together. For an example see the scipio.gff from above. In GTF format this is based on whether they agree on the transcript_id field. If the last column contains the string Parent= (as in GFF3) then this is used for grouping. If no grouping from the standard formats is detected then the whole 9th column is used for grouping exons to genes. It is possible to convert genes just containing the coding regions (CDS in the third column) but also to convert genes including UTR annotation (keyword UTR or exon) in the third column. It is also possible to restrict output using lists of gene names to include or exclude:

Usage: gff2gbSmallDNA.pl gff-file seq-file max-size-of-gene-flanking-DNA output-file [options]
options:
--bad=badfile    Specify a file with gene names. All except these are included in the output.
--good=goodfile  Specify a file with gene names. Only these genes are included in the output.
--overlap        Overlap filtering turned off.

[+] Size of flanking regions...

The file genes.gb now contains 492 training genes in the right format for AUGUSTUS training. Continue with Training AUGUSTUS.