PRICE Documentation: Sample Job

PRICE Sample Overview

This document walks you through the provided PRICE sample job, which assembles the genome of an isolate of parainfluenza 4. The data are real, but they have been filtered, truncated, and supplemented for the purpose of this demonstration.

A total of 288,456 actual Solexa paired-end reads are provided in two files. A small number of "seeds" are also provided, from which the assembler will build the entire Parainfluenza-4 genome. Read coverage across the Para-4 genome is highly uneven, which is typical of the metagenomic viral data we see in many other samples.

For more general information on how to use PRICE, consult the README.txt file.

Sample PRICE Command

To be used in conjunction with the included read files:
./PriceTI -fpp s_2_1_sequence.txt s_2_2_sequence.txt 300 95 -icf sangerReads.fasta 1 1 5 -nc 30 -dbmax 72 -mol 30 -tol 20 -mpi 80 -target 90 2 1 1 -o practice.fasta

EXPLANATION OF FLAGS USED ABOVE:
(more complete and extensive explanation of commands can be found in the User Manual)

-fpp s_2_1_sequence.txt s_2_2_sequence.txt 300 95

This specifies two input files of paired-end Illumina reads in FASTQ format. The "300" specifies the amplicon insert size in nucleotides. This sets the region at the ends of contigs to which PRICE actually maps reads. The "95" indicates that 95% identity will be required for aligning a read to a contig. You can also use FASTA files here.
If you are using Illumina FASTQ, the file names must end in "_sequence.txt".
If you are using non-Illumina FASTQ, the file names must end in ".fq" or ".fastq".
If you are using FASTA files, the file names must end in ".fa" or ".fasta".
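The naming rules above can be checked before launching a job. The following is a minimal Python sketch, not part of PRICE: the function name and the returned labels are hypothetical, and it assumes the suffix match is case-sensitive.

```python
def price_read_format(filename):
    """Classify a read file by the file-name suffix rules listed above.

    The labels returned here are illustrative only; PRICE itself does not
    report them.
    """
    if filename.endswith("_sequence.txt"):
        return "illumina-fastq"
    if filename.endswith((".fq", ".fastq")):
        return "fastq"
    if filename.endswith((".fa", ".fasta")):
        return "fasta"
    raise ValueError("unrecognized read file suffix: " + filename)


for name in ("s_2_1_sequence.txt", "reads.fastq", "sangerReads.fasta"):
    print(name, "->", price_read_format(name))
```

A file whose name matches none of the suffixes raises ValueError, which is a convenient place to catch a misnamed input before starting a long run.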

-icf sangerReads.fasta 1 1 5

This flag specifies the input file (FASTA) of seed sequences. These can be Sanger reads, 454 reads, individual Illumina reads, etc. The first "1" tells PRICE to use all the input seeds at once, rather than adding a portion of them at various cycles. The following "1" indicates the number of cycles to wait before adding another portion of seeds. In this case it doesn't matter, since all the seeds are being added at once. If the command were "-icf [file] 2 10 5", then half the seeds would be added at the first cycle and the other half at cycle 11. The last number, in this case "5", is the constant multiplier for the quality score. Refer to the official documentation to learn more about the quality score, since it affects behavior differently in different contexts. For FASTA files of input seeds, setting this value to something greater than 1 is a good place to start.
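To make the seed-addition timing concrete, here is a small Python sketch of the schedule implied by the first two -icf numbers. It is an illustration, not PRICE code: the function name is hypothetical, the extension to more than two portions (each a fixed number of cycles after the previous one) is an assumption extrapolated from the "2 10" example, and so is the way leftover seeds are distributed.

```python
def seed_schedule(num_seeds, num_steps, cycles_between):
    """Return {cycle: seeds_added} for seeds split into num_steps portions.

    Assumption: portion k (counting from 0) is added at cycle
    1 + k * cycles_between, and any remainder goes to the earliest portions.
    """
    if num_steps < 1 or cycles_between < 1:
        raise ValueError("num_steps and cycles_between must be at least 1")
    base, extra = divmod(num_seeds, num_steps)
    schedule = {}
    for step in range(num_steps):
        cycle = 1 + step * cycles_between
        schedule[cycle] = base + (1 if step < extra else 0)
    return schedule


print(seed_schedule(8, 1, 1))   # "-icf [file] 1 1 5": everything at cycle 1
print(seed_schedule(8, 2, 10))  # "-icf [file] 2 10 5": half at cycle 1, half at cycle 11
```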

-nc 30

This tells PRICE to run for 30 cycles.

-dbmax 72

This tells PRICE to use the de Bruijn graph sub-assembler for assemblies of small regions, with a maximum sequence length of 72 nucleotides in this case. The k-mer word size is set with -dbk [number].

-mol 30

This sets the minimum overlap of a sequence match to 30 nt. This number is automatically scaled upwards as coverage increases, with a threshold given by the -tol flag. For many assembly jobs, a length of 40-50 nt is sufficient. Setting this too high when coverage is low can stall contig growth for lack of suitable reads.

-tol 20

This is the threshold for scaling the minimum overlap length set by -mol. Once there are 20 reads in an assembly job, the minimum overlap is scaled upwards.

-mpi 80

This is the minimum percent identity for a sequence match, here set to 80 percent. This value is also scaled with coverage depth. The threshold at which scaling begins is given by the "-tpi" flag. Lowering this value allows looser matches, while raising it increases the stringency of the match.

-target 90 2 1 1

This specifies that PRICE run in "target" mode, meaning that the intent is to limit the final contigs to extensions of the input seeds. The first number, "90", is the minimum percent identity to the initial seed contigs required to count as a match. The second number, "2", tells PRICE to run 2 cycles retaining the outboard contigs (generated by collections of paired ends) before limiting the output to what has mapped or collapsed with the initial seed contigs. The following two numbers, "1 1", tell PRICE to alternate thereafter between saving the outboard contigs and limiting to the seed contigs.
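The cycle pattern described above can be sketched in Python as follows. This is an illustration rather than PRICE's implementation: the function and parameter names are hypothetical, and it assumes that, once filtering begins, a run of seed-limited cycles comes first and a run of outboard-retaining cycles follows, and that omitting the last two numbers means every later cycle is seed-limited. With "1 1" the two run lengths are equal, so only the starting phase depends on that assumption.

```python
def target_filter_schedule(total_cycles, open_cycles, filtered_run=None, open_run=None):
    """Return one boolean per cycle: True if output is limited to seed-linked contigs.

    open_cycles  : initial cycles that retain outboard contigs (2nd -target number)
    filtered_run : seed-limited cycles per alternation (assumed 3rd number)
    open_run     : outboard-retaining cycles per alternation (assumed 4th number)
    """
    alternating = filtered_run is not None and open_run is not None
    if alternating and filtered_run + open_run < 1:
        raise ValueError("filtered_run + open_run must be at least 1")
    schedule = []
    for cycle in range(total_cycles):
        if cycle < open_cycles:
            schedule.append(False)  # still in the initial outboard phase
        elif not alternating:
            schedule.append(True)   # no alternation given: always seed-limited
        else:
            position = (cycle - open_cycles) % (filtered_run + open_run)
            schedule.append(position < filtered_run)
    return schedule


# "-target 90 2 1 1" over the first six cycles
print(target_filter_schedule(6, 2, 1, 1))
```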

-o practice.fasta

This tells PRICE to put the output in the file "practice.fasta". Each cycle will have its own file name with a cycle number appended.

Run Notes

If your compiled PRICE is only locally executable, use "./PriceTI" instead of just "PriceTI".

The target is one contig of almost 17.4 kb. If you get more contigs, try supplementing the input contigs using -picf (see below) or running the job for additional cycles.

Speed up the assembly by using the -a flag followed by the number of CPU cores that your computer has. For example, for an 8-core Mac Pro with hyperthreading (2X per core): -a 16

Note that the small size of this sample job makes threading less effective than it would be for a real job, as there is a trade-off between the additional set-up steps needed to ensure thread safety and the actual wallclock advantage gained by threading. The utility of threading increases with job size.
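As a convenience, the sample command can be assembled with a thread count taken from the machine it runs on. The Python sketch below is one way to do that. The function name is hypothetical, and placing -a at the end of the command is an assumption that flag order does not matter. Note that os.cpu_count() reports logical CPUs, which matches the hyperthreaded count used in the example above.

```python
import os
import shlex


def build_sample_command(threads=None):
    """Return the sample PriceTI command as an argument list, with -a appended."""
    if threads is None:
        threads = os.cpu_count() or 1  # logical CPUs; fall back to 1 if unknown
    return [
        "./PriceTI",
        "-fpp", "s_2_1_sequence.txt", "s_2_2_sequence.txt", "300", "95",
        "-icf", "sangerReads.fasta", "1", "1", "5",
        "-nc", "30",
        "-dbmax", "72",
        "-mol", "30",
        "-tol", "20",
        "-mpi", "80",
        "-target", "90", "2", "1", "1",
        "-o", "practice.fasta",
        "-a", str(threads),
    ]


# Print the command line for a 16-thread run; pass the list to
# subprocess.run(...) to launch it from the directory holding the read files.
print(shlex.join(build_sample_command(16)))
```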

What to Expect

The assembly will generally result in a single contig of ~17,362 nt, representing the full genome of Para-4. Try BLASTing it against GenBank to check the assembly.
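Before running BLAST, a quick sanity check is to count the contigs in the output and compare their lengths with the expected ~17,362 nt. The Python sketch below does this with the standard library only. The function name is hypothetical, and the output file name is left for you to fill in, since the exact per-cycle file name is not given here.

```python
def contig_lengths(fasta_lines):
    """Return {header: sequence_length} for FASTA text given as an iterable of lines."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Small in-memory example; for a real run, open the final-cycle output file
# and pass the file handle instead of this list.
example = [">contig_1", "ACGTACGT", "ACG", ">contig_2", "GGCC"]
for header, length in contig_lengths(example).items():
    print(header, length)
```

A result with one entry of roughly 17.4 kb corresponds to the single-contig outcome described above.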

Due to non-deterministic aspects of PRICE (the order in which discordant sequences are collapsed, for instance), the output may vary slightly from run to run or machine to machine.

TIME TO RUN	SYSTEM DESCRIPTION
3m 24s	Ubuntu Linux, Intel X7542, 2.26 GHz, running on a single core
5m 29s	MacOS 10.6.8, Intel Xeon 2.26 GHz, running on a single core

Variations

Try changing the target flag to: -target 90 0

This forces PRICE to extend only the input contigs, rather than retaining outboard contigs (generated by collections of paired ends). Runtime will be somewhat longer as a result.

Try running it with even more "outboard" cycles: -target 90 15 2 2

With this setting, PRICE assembles Para-4 in 14 cycles, with a runtime of 2m 45s.

Try sprinkling in arbitrary reads with a flag like this: -picf 400 s_2_1_sequence.txt 4 2 1

This can make the assembly happen more quickly by seeding from more diverse loci across the genome, so that PRICE contigs do not need to be extended as far (through fewer cycles) before consolidating into a coherent genome.

Found a Bug?

Email: price@derisilab.ucsf.edu