PRICE Documentation: Sample Job
PRICE Sample Overview
This document will walk you through a provided PRICE sample job.
It will assemble the genome of an isolate of parainfluenza 4. The
data is real, but it has been filtered, truncated, and supplemented for the purpose
of this demonstration.
A total of 288,456 actual Solexa paired-end reads are provided in two files.
A small number of "seeds" are provided from which the assembler will build
the entire Parainfluenza-4 genome. Read coverage across the Para-4
genome is highly uneven, which is typical of the metagenomic viral data we see
in many other samples.
For more general information on how to use PRICE, consult the README.txt file.
Sample PRICE Command
To be used in conjunction with the included read files:
./PriceTI -fpp s_2_1_sequence.txt s_2_2_sequence.txt 300 95 -icf sangerReads.fasta 1 1 5 -nc 30 -dbmax 72 -mol 30 -tol 20 -mpi 80 -target 90 2 1 1 -o practice.fasta
EXPLANATION OF FLAGS USED ABOVE:
(A more complete explanation of the commands can be found in the User Manual.)
- -fpp s_2_1_sequence.txt s_2_2_sequence.txt 300 95
- This specifies two input files of paired-end Illumina FASTQ reads. The "300" specifies the amplicon
insert size in nucleotides. This is the region at the end of contigs to which PRICE actually maps reads.
The "95" indicates that 95% identity will be required for aligning a read to a contig.
You can also use FASTA files here.
- If you are using Illumina FASTQ, the files must end in "_sequence.txt"
- If you are using non-Illumina FASTQ, the files must end in ".fq" or ".fastq"
- If you are using FASTA files, the files must end in ".fa" or ".fasta"
- -icf sangerReads.fasta 1 1 5
- This flag specifies the input file (FASTA) of seed sequences. These can be Sanger reads, 454 reads,
individual Illumina reads, etc. The first "1" tells PRICE to use all the input seeds at once, rather
than adding a portion of them at various cycles. The following "1" indicates the number of cycles to
wait before adding another portion of seeds. In this case, it doesn't matter, since all the seeds are
being added at once. If the command were "-icf [file] 2 10 5", then half the seeds would be added at the
first cycle, and the other half would be added at cycle 11. The last number, in this case "5", is the
constant multiplier for the quality score. Refer to the official documentation to learn more about the quality
score, since it affects behavior differently in different contexts. For FASTA files of input seeds, setting
this value to something greater than 1 is a good place to start.
- -nc 30
- This tells PRICE to run for 30 cycles.
- -dbmax 72
- This tells PRICE to use the de Bruijn graph sub-assembler for assemblies of small regions, with a
maximum sequence length of 72 nucleotides in this case. The word size of kmers is given by
-dbk [number].
- -mol 30
- This sets the minimum overlap of a sequence match to 30nt. This number is automatically scaled
upwards as coverage increases with a threshold given by the -tol flag. For many assembly jobs,
a length of 40nt - 50nt is sufficient. Setting this too high, when coverage is low, could cause
stalling of the contig growth, for lack of suitable reads.
- -tol 20
- This is the threshold flag for scaling the minimum length overlap. After there are 20 reads in an assembly job,
the minimum overlap will be scaled upwards.
- -mpi 80
- This is the minimum percent identity for a sequence match, here set to 80 percent. This value
is also scaled with coverage depth. The threshold to begin scaling is given by the "-tpi" flag.
Setting this lower allows looser matches, while raising the value increases the stringency of the match.
- -target 90 2 1 1
- This specifies that PRICE run in "target" mode, meaning that the intent is to limit the final
contigs to extensions of the input seeds. The first number, "90" is the percent identity
minimum to the initial seed contigs required to count it as a match. The second number "2"
tells PRICE to run 2 cycles, retaining the outboard contigs (generated by collections of
paired ends), before limiting the output to what has mapped or collapsed with the initial
seed contigs. The following two numbers, "1 1", tell PRICE to subsequently alternate between
saving the outboard contigs and limiting to the seed contigs thereafter.
- -o practice.fasta
- This tells PRICE to put the output in the file "practice.fasta". Each cycle will have its own file name with a cycle number appended.
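The flag-by-flag breakdown above can also be written out as a small shell script, which may be easier to edit when experimenting. This is only a sketch: it uses exactly the flags and values from the sample command, assumes that ./PriceTI and the sample read files sit in the current directory, and introduces the variable name price_cmd purely for illustration.

```sh
#!/bin/sh
# Sketch: rebuild the sample command one flag group at a time.
# All flags and values are copied from the sample command above.
set --                                                         # start from an empty argument list
set -- "$@" -fpp s_2_1_sequence.txt s_2_2_sequence.txt 300 95  # paired-end files, insert size 300, 95% identity
set -- "$@" -icf sangerReads.fasta 1 1 5                       # seeds: all at once, 1 cycle wait, quality multiplier 5
set -- "$@" -nc 30                                             # run for 30 cycles
set -- "$@" -dbmax 72                                          # de Bruijn sub-assembler, max sequence length 72 nt
set -- "$@" -mol 30                                            # minimum overlap of 30 nt
set -- "$@" -tol 20                                            # start scaling the minimum overlap at 20 reads
set -- "$@" -mpi 80                                            # minimum 80 percent identity
set -- "$@" -target 90 2 1 1                                   # target mode, as described above
set -- "$@" -o practice.fasta                                  # output file

price_cmd="./PriceTI $*"   # hypothetical variable name, used only for display
echo "$price_cmd"

# Run the job only if the binary is present and executable
# (assumed location: the current directory).
if [ -x ./PriceTI ]; then
    ./PriceTI "$@"
fi
```

Changing a single value, for example the cycle count after -nc, then only requires editing one line.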
Run Notes
If your compiled PRICE is only locally executable, use "./PriceTI" instead of just "PriceTI".
The target is one contig of almost 17.4kb. If you got more contigs, try supplementing the input contigs using
-picf (see below) or running the job for additional cycles.
Speed up the assembly by using the -a flag along with the number of CPU cores that your computer has. For
example, for an 8-core Mac Pro with hyperthreading (2X per core):
-a 16
Note that the small size of this sample job makes threading less effective than it would be for a real job,
as there is a trade-off between additional set-up steps to ensure thread safety versus the actual wall-clock
advantage gained by threading. The utility of threading increases with job size.
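On many Linux and Mac systems the number of logical cores can be looked up rather than typed in by hand. The sketch below assumes that "getconf _NPROCESSORS_ONLN" is available and falls back to a single thread if it is not; the variable names threads and thread_flag are illustrative only.

```sh
#!/bin/sh
# Sketch: derive a value for the -a flag from the number of logical cores.
# Assumption: getconf supports _NPROCESSORS_ONLN on this system.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Fall back to 1 if the result is empty or not a plain number.
case "$threads" in
    ''|*[!0-9]*) threads=1 ;;
esac

thread_flag="-a $threads"
echo "$thread_flag"
```

The printed flag can be appended to the sample command as it stands.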
What to Expect
The assembly will generally result in a single contig of ~17,362nt, representing the full genome of Para-4. Try BLASTing it against GenBank to check the assembly.
Due to non-deterministic aspects of PRICE (the order in which discordant sequences are collapsed, for instance), the output may vary slightly from
run to run or machine to machine.
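A quick way to compare a run against the expected single contig is to list each contig in an output file with its length. The following is a minimal sketch using awk; the function name fasta_lengths is hypothetical, and the file to inspect is passed as an argument because the exact per-cycle output file name depends on how PRICE appends the cycle number.

```sh
#!/bin/sh
# Sketch: print "name<TAB>length" for every sequence in a FASTA file.
# fasta_lengths is a hypothetical helper, not part of PRICE.
fasta_lengths() {
    awk '
        /^>/ {
            # A new header: report the previous sequence, then reset.
            if (name != "") print name "\t" len
            name = substr($0, 2)
            len = 0
            next
        }
        {
            # Sequence line: ignore whitespace and add to the running length.
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END {
            if (name != "") print name "\t" len
        }
    ' "$1"
}
```

Calling fasta_lengths on the output file of the final cycle should, for a successful run, print one line with a length close to 17,362.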
Time to run | System description |
3m 24s | Ubuntu Linux, Intel X7542, 2.26GHz, running on a single core |
5m 29s | MacOS 10.6.8, Intel Xeon 2.26GHz, running on a single core |
Variations
- Try changing the target flag to: -target 90 0
- This forces PRICE to only extend the input contigs, rather than retaining outboard contigs (generated by collections of paired-ends). Runtime will be somewhat longer as a result.
- Running it with even more "outboard" cycles: -target 90 15 2 2
- This results in PRICE assembling Para-4 in 14 cycles, with a runtime of 2m 45s.
- Sprinkling in arbitrary reads with a flag like this: -picf 400 s_2_1_sequence.txt 4 2 1
- This can make the assembly finish more quickly by seeding from more diverse loci across the genome, so that PRICE contigs need not be extended as far (through fewer cycles)
before consolidating into a coherent genome.
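To try these variations side by side, it can help to generate the three command lines from one shared set of flags. The sketch below only prints the commands. It assumes that the -picf variation is added to the otherwise unchanged sample command, and the output file labels (seeds_only, outboard15, picf) as well as the names base and variant_cmd are made up for illustration.

```sh
#!/bin/sh
# Sketch: print one command line per variation described above.
# Flags shared by all variants, copied from the sample command.
base="-fpp s_2_1_sequence.txt s_2_2_sequence.txt 300 95 -icf sangerReads.fasta 1 1 5 -nc 30 -dbmax 72 -mol 30 -tol 20 -mpi 80"

# $1 = label for the output file name (hypothetical), $2 = variant-specific flags
variant_cmd() {
    echo "./PriceTI $base $2 -o practice_$1.fasta"
}

variant_cmd seeds_only "-target 90 0"
variant_cmd outboard15 "-target 90 15 2 2"
# Assumption: -picf is used in addition to the original -target setting.
variant_cmd picf "-target 90 2 1 1 -picf 400 s_2_1_sequence.txt 4 2 1"
```

Each printed line can be pasted into a terminal, optionally prefixed with "time" to compare runtimes.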
Found a Bug?
Email: price@derisilab.ucsf.edu