Tutorial on Gene Prediction with AUGUSTUS
If you want to follow along in real time tomorrow, download the data and install the software beforehand.
In this lab session we practice the most common steps when predicting the protein-coding genes in
a eukaryotic genome with AUGUSTUS. We will assume the case of a "new"
genome for which AUGUSTUS has not been trained before, but will use a well-studied species as an example because
example data is readily available and visualization is easier.
Styles
Assignments are in this color. If you are in a hurry, you can go through
this tutorial quickly by just reading these assignments and cutting and pasting the commands
that follow them (more or less).
Results are in this color.
Details are hidden in collapsible boxes.
You don't have to read them. If the tutorial moves too slowly for you, you can expand and read these details boxes.
Example Input Data
All example files are in the data directory. I recommend
you work directly in this directory.
Drosophila melanogaster
- chr2L.sm.fa: softmasked chromosome 2L of assembly dm6 of the genome of the fruit fly
- rnaseq1.fq, rnaseq2.fq: paired-end RNA-Seq reads, an excerpt of SRR1732756 that maps to the first 10 Mb of chr2L, 2x100 bp, HiSeq 2500
For Cheaters: Result Files
You can use the files in the results directory to catch up if you are behind or to compare your results.
Software
To run these examples, you need to have the software below installed. As all important results are in the results folder, you can skip any step or program.
- augustus.current.tar.gz: make sure the binaries augustus, etraining and bam2hints (auxprogs) are compiled and in your path, along with the augustus/scripts directory. Put export AUGUSTUS_CONFIG_PATH=/path/to/your/installation/config/ in your ~/.bashrc
- STAR, an RNA-Seq spliced aligner
- bamtools (may be a package on your system)
- wigToBigWig
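Before starting, it can help to confirm that the programs listed above are actually found by your shell. The following is a minimal sketch, not part of the AUGUSTUS distribution: the helper name check_tools is hypothetical, and the executable names (in particular STAR and wigToBigWig) are assumed to match the way the tools are installed on your system.

```shell
# Hypothetical helper: report which of the given programs are missing from PATH.
# Prints one line per program and returns the number of missing programs.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Executable names are assumptions; adjust them to your installation.
check_tools augustus etraining bam2hints STAR bamtools wigToBigWig || true

# AUGUSTUS_CONFIG_PATH should point to the config/ directory of the installation.
if [ -d "${AUGUSTUS_CONFIG_PATH:-}" ]; then
    echo "AUGUSTUS_CONFIG_PATH=$AUGUSTUS_CONFIG_PATH"
else
    echo "AUGUSTUS_CONFIG_PATH is unset or not a directory"
fi
```

A program reported as MISSING is not necessarily a problem, since the corresponding result files can be taken from the results directory instead.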
Exercise 1: Compile a Training Set
There are several typical options for creating a training set
to estimate the parameters of gene finders. Here we go through option 6.
We assume that we have RNA-Seq data only and no substantial homology data. We will reuse an existing parameter set for AUGUSTUS.
- Follow the tutorial on "Iterative Training Set Construction"
and create a training set genes.gb.
- Partition genes.gb into a training set and a holdout test set as described in 1.2 Split gene structure set....
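The partitioning step can be sketched as follows, assuming the randomSplit.pl script from the augustus/scripts directory is in your path. The holdout size of 100 genes is an assumption made here for illustration, and count_loci is a hypothetical helper; follow the referenced section for the recommended numbers.

```shell
# Hypothetical helper: count the gene loci (GenBank records) in a file.
count_loci() {
    grep -c '^LOCUS' "$1"
}

# Assumption: hold out 100 genes for testing; adjust to the size of your set.
TEST_SIZE=100

if [ -f genes.gb ] && command -v randomSplit.pl >/dev/null 2>&1; then
    echo "genes.gb contains $(count_loci genes.gb) loci"
    # Writes genes.gb.test (TEST_SIZE random loci) and genes.gb.train (the rest).
    randomSplit.pl genes.gb "$TEST_SIZE"
    echo "test:  $(count_loci genes.gb.test) loci"
    echo "train: $(count_loci genes.gb.train) loci"
else
    echo "genes.gb or randomSplit.pl not found; skipping the split"
fi
```

Counting the LOCUS lines before and after the split is a quick way to check that no genes were lost and that the test set has the intended size.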
Exercise 2: Train the Coding Regions of AUGUSTUS
Let's name our species "bug". Pretending that there is not already a parameter set of AUGUSTUS for
Drosophila (named "fly"), we will estimate the parameters from the training set.
- Create a meta parameters file for bug as described in 2. CREATE A META PARAMETERS FILE...
- Estimate the parameters using your training set as described in 3. MAKE AN INITIAL TRAINING
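The two steps above might look like this on the command line. This is a sketch under assumptions: it uses new_species.pl from augustus/scripts and the etraining binary, and it expects the training part of the split to be named genes.gb.train. The helper species_dir is hypothetical and only serves to avoid creating the species twice.

```shell
SPECIES=bug

# Hypothetical helper: directory in which AUGUSTUS keeps the parameters of a species.
species_dir() {
    cfg="${AUGUSTUS_CONFIG_PATH:-}"
    echo "${cfg%/}/species/$1"
}

if command -v new_species.pl >/dev/null 2>&1 \
   && command -v etraining >/dev/null 2>&1 \
   && [ -f genes.gb.train ]; then
    # Create the meta parameters file only once; skip if the species exists.
    if [ ! -d "$(species_dir "$SPECIES")" ]; then
        new_species.pl --species="$SPECIES"
    fi
    # Estimate the parameters from the training genes.
    etraining --species="$SPECIES" genes.gb.train
else
    echo "AUGUSTUS scripts or genes.gb.train not found; nothing trained"
fi
```

The holdout file from Exercise 1 is deliberately not passed to etraining, so that it remains available for an unbiased accuracy check later.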
For further tutorial parts on prediction, hint preparation, and homology-based training set construction and prediction, see the lab session tutorial.