This page describes a method to construct a training set of gene structure using AUGUSTUS itself and RNA-Seq data. The training genes are selected from among AUGUSTUS genes predicted with previously existing parameters.
mkdir gindex STAR --runThreadN 4 --runMode genomeGenerate --genomeDir gindex --genomeFastaFiles chr2L.sm.fa # ~1m STAR --runThreadN 4 --genomeDir gindex --readFilesIn rnaseq1.fq rnaseq2.fq --alignIntronMax 100000 \ --outSAMtype BAM SortedByCoordinate --outWigType wiggle --outWigStrand Unstranded # ~2mThese commands will create a file Aligned.sortedByCoord.out.bam with the alignments and a file Signal.Unique.str1.out.wig with coverage per base information, among others. The files will be used below for display and to generate hints for AUGUSTUS.
for s in nasonia zebrafish tomato; do augustus --species=$s chr2L.sm.fa --softmasking=on --predictionEnd=1000000 > aug.$s.1-1M.gff & doneAbove command will run AUGUSTUS in the background with three different parameter sets on the first megabase of chr2L of Drosophila melanogaster. The species selection is just an example. For Drosophila and many other insects there are already parameter sets (fly), which we chose to ignore for the purpose of the tutorial. The options --predictionEnd and --predictionStart can be used to cut out a window for prediction. We do this here to save time and as we will only visually inspect a small region. The files aug.nasonia.1-1M.gff, aug.tomato.1-1M.gff, aug.zebrafish.1-1M.gff now contain the coordinates and protein sequences of predicted genes.
... chr2L AUGUSTUS gene 6693 9276 0.51 + . g1 chr2L AUGUSTUS transcript 6693 9276 0.51 + . g1.t1 chr2L AUGUSTUS start_codon 6693 6695 . + 0 transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS intron 6809 7573 0.94 + . transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS intron 8117 8192 1 + . transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS intron 8590 8667 1 + . transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS CDS 6693 6808 0.57 + 0 transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS CDS 7574 8116 0.96 + 1 transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS CDS 8193 8589 1 + 1 transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS CDS 8668 9276 1 + 0 transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS stop_codon 9274 9276 . + 0 transcript_id "g1.t1"; gene_id "g1"; # protein sequence = [MMKRDRFYWDKNDRSRERSGLNFTLTKMLAIQNGGGMKSNREENPRLNKYRRVDNSYPRFITNEAAEDLIYKKSMGER ...
echo "chr2L 23513712" > chrom.sizes # size of chromosomes mv Signal.Unique.str1.out.wig rnaseq.wig # rename for convenience mv Aligned.sortedByCoord.out.bam rnaseq.bam wigToBigWig rnaseq.wig chrom.sizes rnaseq.bw bamtools index -in rnaseq.bam # index required by UCSC browser scp rnaseq.bw rnaseq.bam rnaseq.bam.bai mario@hgwdev:~/public_html/tutorial2015/results/ # copy to web spaceThe data is now web-accessible through the URL http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/. The UCSC browser will load the parts requested through browsing only. Create a custom track file:
echo "browser hide all" > customtrack echo "track name=\"STAR RNA-Seq alignments\" type=bam visibility=4 bigDataUrl=http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/rnaseq.bam" >> customtrack echo "track name=\"STAR coverage\" type=bigWig visibility=2 bigDataUrl=http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/rnaseq.bw" >> customtrack for s in nasonia zebrafish tomato; do echo "track name=\"AUGUSTUS $s abinitio\" db=dm6 visibility=3" >> customtrack grep -P "AUGUSTUS\t(CDS|exon)\t" aug.$s.1-1M.gff >> customtrack # use only the relevant coordinate lines doneThe file customtrack will now look like this (grep -B 1 -A 2 track customtrack):
browser hide all track name="STAR RNA-Seq alignments" type=bam visibility=4 bigDataUrl=http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/rnaseq.bam track name="STAR coverage" type=bigWig visibility=2 bigDataUrl=http://hgwdev.cse.ucsc.edu/~mario/tutorial2015/results/rnaseq.bw track name="AUGUSTUS nasonia abinitio" db=dm6 visibility=3 chr2L AUGUSTUS CDS 6693 6808 0.57 + 0 transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS CDS 7574 8116 0.96 + 1 transcript_id "g1.t1"; gene_id "g1"; -- chr2L AUGUSTUS CDS 995604 995891 0.59 + 0 transcript_id "g159.t1"; gene_id "g159"; track name="AUGUSTUS zebrafish abinitio" db=dm6 visibility=3 chr2L AUGUSTUS CDS 7574 8116 0.98 + 1 transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS CDS 8193 8589 0.85 + 1 transcript_id "g1.t1"; gene_id "g1"; -- chr2L AUGUSTUS CDS 957279 957449 1 + 0 transcript_id "g135.t1"; gene_id "g135"; track name="AUGUSTUS tomato abinitio" db=dm6 visibility=3 chr2L AUGUSTUS exon 6486 6808 . + . transcript_id "g1.t1"; gene_id "g1"; chr2L AUGUSTUS CDS 6591 6808 0.76 + 0 transcript_id "g1.t1"; gene_id "g1";This is a format that the UCSC Genome Browser understands. The first line tell it to hide the existing tracks (no cheating). The first two track lines refer to the RNA-Seq tracks and specify the URL where the data lies. The AUGUSTUS tracks are included in the file as they are small enough.
The UCSC Genome Browser group has an excellent help page for setting up UCSC custom tracks.
Browse the predictions and alignments and pick a parameter set for which the genes match the evidence best. Let us assume here that the nasonia parameter set is best.When predicting genes, AUGUSTUS can incorporate evidence on the location and structure of genes in the form of hints. Hints are extrinsic evidence a particular GFF format and will change the likelihood of gene structures candidates. AUGUSTUS will tend to predict gene structures that are in agreement with the hints. For more information see this tutorial and our Wiki page
Create hints from RNA-Seq:bam2hints --intronsonly --in=rnaseq.bam --out=hints.intron.gff cat rnaseq.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 --prune=0.1 --src=W --type=ep \ --radius=4.5 --pri=4 --strand="." > hints.ep.gff cat hints.intron.gff hints.ep.gff > hints.gffAbove commands have the effect that spliced reads are used as evidence for introns, and coverage as evidence for exons. Both types of evidence are joined in one file hints.gff.
Predict genes using hints:
for i in {0..3}; do augustus --species=nasonia chr2L.sm.fa --softmasking=on --predictionStart=$((i*2000000)) --predictionEnd=$(((i+1)*2000000+50000)) \ --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg > aug.nasonia.hints.$i.gff & done # wait ~6m until done cat aug.nasonia.hints.{0..3}.gff | join_aug_pred.pl > aug.nasonia.hints.gffFor quick results during the tutorial we restrict the prediction to the first 8 Mb of chr2L instead of predicting genome-wide (4x2Mb each + 50Kb overlap). The overlap would not strictly be necessary here, but it is common practice for genome-wide predictions in parallel. This way genes that are broken by the region boundary can often be predicted completely in the neighboring region if the overlap is larger than most genes. The script join_aug_pred.pl 'repairs' the break and also renumbers genes sequentially so that their names are unique again.
The files aug.nasonia.hints.gff also contain a summary for each transcrip how well it matches the evidence:
... chr2L AUGUSTUS transcript 11215 17136 1 - . g2.t1 ... # Evidence for and against this transcript: # % of transcript supported by hints (any source): 100 # CDS exons: 8/8 # W: 8 # CDS introns: 7/7 # E: 7
Add the browser track of genes with evidence:
echo "track name=\"AUGUSTUS nasonia hints\" db=dm6 visibility=3" >> customtrack grep -P "AUGUSTUS\t(CDS|exon)\t" aug.nasonia.hints.gff >> customtrack gzip -c customtrack > customtrack.whints.gz scp customtrack.whints.gz mario@hgwdev:~/public_html/tutorial2015/results/ # copy to web spaceThis will add the newly predicted track after the upload to the UCSC genome browser (replace the existing custom tracks). Click this link to see all the tracks including the new predictions with hints: customtrack.whints.gz. The predictions aug.nasonia.hints.gff should be our best shot so far for the genome annotation. However, they still include genes unsupported by RNA-Seq that are likely to have a larger error rate. We follow the strategy to select a subset of gene structures with higher reliability. This selection can include
cat aug.nasonia.hints.gff | perl -ne 'if (/\ttranscript\t.*\t(\S+)/){$tx=$1;} if (/transcript supported.*100/) {print "$tx\n";}' | tee supported.lst | wc -lAbove one line script prints the transcript id of all 492 transcripts with 100 percent suppport (they may still be too short or otherwise wrong).
gff2gbSmallDNA.pl --good=supported.lst aug.nasonia.hints.gff chr2L.sm.fa 5000 genes.gbThis command takes each gene in aug.nasonia.hints.gff together with up to 5000 bp flanking intergenic region upstream and downstream of the gene and creates a LOCUS in Genbank format.
The file genes.gb now contains 492 training genes in the right format for AUGUSTUS training. Continue with Training AUGUSTUS.