PRICE Documentation: Problem Analysis

Overview

PRICE (Paired-Read Iterative Contig Extension) executes a number of assembly "cycles", each of which has three steps:

mapping of paired-end reads to the edges of existing contigs;
assembling the overhanging pairs of mapped reads for each contig edge (contig-edge assemblies);
performing a meta-assembly of all existing contigs in order to remove redundancy.

Step 1: Paired-end read mapping

Paired-end reads are provided as file pairs with the -fp and -fpp args. The two files must contain the same number of sequences, and the Nth sequence in the first file is taken to be the paired end of the Nth sequence in the second file. The reads are assumed to come from opposite strands of their template molecule and to face one another 5p->3p. This is the format of Illumina's s_X_1_sequence.txt and s_X_2_sequence.txt files. Fasta or fastq (_sequence.txt) files are accepted. An additional arg gives the amplicon size (the distance from the 5p edge of one read to the 5p edge of its paired end on the opposite strand). This value defines the acceptable distance of a mapped read from the contig edge. Use the average amplicon size, or a value one or two standard deviations above it.

	|-----size-------|
5p	---->.............
	.............<----	5p
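
The two input requirements above (equal record counts and a sensible amplicon size) can be checked before a run. The following is a minimal Python sketch, not part of PRICE: the function names, the assumption of 4-line fastq records, and the default of two standard deviations are all illustrative choices.

```python
import math
import statistics


def count_fastq_records(lines):
    """Count records, assuming the common 4-lines-per-record fastq layout."""
    # Blank lines (e.g. a trailing newline) are ignored before counting.
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; not 4-line fastq?")
    return n_lines // 4


def check_pairing(lines_1, lines_2):
    """Return the shared record count, or raise if the two files disagree."""
    n1 = count_fastq_records(lines_1)
    n2 = count_fastq_records(lines_2)
    if n1 != n2:
        raise ValueError(f"unequal record counts: {n1} vs {n2}")
    return n1


def suggest_amplicon_size(observed_sizes, n_sd=2.0):
    """Mean amplicon size plus n_sd sample standard deviations, rounded up."""
    mean = statistics.mean(observed_sizes)
    sd = statistics.stdev(observed_sizes)
    return math.ceil(mean + n_sd * sd)
```

In practice the line iterables would be open file handles for the two read files, and observed_sizes would come from whatever size estimate is available for the library (for example, a fragment-size trace).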

The % identity required for mapping to contigs can be specified separately for each pair of files. This is useful when mixing datasets from different runs whose data quality differs greatly. For now, this is done by using -fpp instead of -fp.

Step 2: Contig-edge assemblies

In this step the overhanging reads are assembled. These are reads whose paired ends mapped to a contig but which did not themselves map to any existing contig. Since they are supposed to extend the contig, the contig itself is also included in the mini-assembly. If a read instead maps to another contig in the dataset, then that contig is included in the mini-assembly in place of the read. This is how contigs from neighboring parts of a genome are brought together. The amount of sequence identity required to collapse two sequences into one contig is scaled to the number of sequences in any given mini-assembly. The default values are currently arbitrary, but they can be adjusted at the command line. If you don't like the way your assembly looks, these are the first things to tweak:

General assembly parameters

Property description	Command-line args (see usage description)
Minimum alignment overlap	-mol, -tol
Minimum alignment % identity	-mpi, -tpi
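
The rule for choosing what goes into each contig-edge mini-assembly (described at the start of Step 2) can be sketched as follows. This is a simplified Python illustration under stated assumptions, not PRICE's implementation: the function name and data layout are hypothetical, mapping results are assumed to be available as a plain dict, and the distance-from-edge criterion is ignored.

```python
from collections import defaultdict


def mini_assembly_members(pairs, mapped_to):
    """Group mini-assembly inputs by the contig being extended.

    pairs     -- iterable of (read_id, mate_id) tuples
    mapped_to -- dict of read_id -> contig_id for reads that mapped;
                 reads absent from the dict are treated as unmapped
    """
    members = defaultdict(set)
    for read_a, read_b in pairs:
        # Consider each read of the pair as the potential anchor.
        for anchor, mate in ((read_a, read_b), (read_b, read_a)):
            contig = mapped_to.get(anchor)
            if contig is None:
                continue  # anchor did not map, so it cannot recruit its mate
            # The contig being extended is always part of its own mini-assembly.
            members[contig].add(("contig", contig))
            mate_contig = mapped_to.get(mate)
            if mate_contig is None:
                # Overhanging read: its paired end mapped, it did not.
                members[contig].add(("read", mate))
            elif mate_contig != contig:
                # Mate lies on another contig: include that contig instead.
                members[contig].add(("contig", mate_contig))
    return dict(members)
```

For a pair whose reads both map to the same contig, nothing beyond the contig itself is added, since such a pair carries no extension information in this simplified model.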

Initial contigs

At the beginning of the assembly, there are no contigs to be extended, so some must be provided. Generally, I use a small subset of the input reads. If you have other, non-paired-end data (like 454 sequence), that is also good, and the longer the initial sequences are relative to the short read lengths, the faster the assembly will get going. Note that the job size can be controlled by adding initial contigs incrementally rather than all at once.
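
One way to prepare a small subset of reads for use as initial contigs is sketched below in Python. The helper names and the choice of seeded random sampling are assumptions made for illustration; any method of picking a subset would do, and the result would still need to be written to a fasta file for PRICE to read.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from fasta-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_seed_reads(lines, n, seed=0):
    """Return up to n randomly chosen records, kept in original file order."""
    records = list(read_fasta(lines))
    if n >= len(records):
        return records
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep = sorted(rng.sample(range(len(records)), n))
    return [records[i] for i in keep]
```

This sketch loads all records into memory, which is fine for modest files; for very large read sets a streaming approach such as reservoir sampling would be more appropriate.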

You can also input fasta-formatted contigs from previous assemblies. But beware: they will be treated as individual reads unless you use the last arg of -icf/-picf to specify a higher read count. At the end of each cycle, "contigs" whose max score count is only 1 are interpreted as leftover individual reads and they are removed. This is appropriate when they were just single reads, but is inappropriate when previously-assembled contigs are being used. In that case, provide a number >1 so that the contigs aren't lost. But making the value too high will make it hard for them to be modified in light of conflicting sequence data. Recommended: ~2-5.
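
The effect of the read-count value on end-of-cycle cleanup can be illustrated with a toy model in Python. This is not PRICE's code: the Contig class, both function names, and the representation of the "max score count" as a single integer are assumptions made only to show why a value of 1 causes prior contigs to be discarded.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    max_count: int  # stands in for the "max score count" (modeling assumption)


def load_prior_contigs(records, read_count=3):
    """Wrap previously assembled sequences, crediting each with read_count reads."""
    if read_count < 1:
        raise ValueError("read_count must be at least 1")
    return [Contig(name, seq, read_count) for name, seq in records]


def drop_leftover_reads(contigs):
    """Keep only contigs whose count exceeds 1; the rest look like lone reads."""
    return [c for c in contigs if c.max_count > 1]
```

With read_count left at 1, a previously assembled contig is indistinguishable from a leftover read in this model and is removed; with a value in the recommended range of about 2-5 it survives the cleanup.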

Command-line args (see usage description): -icf, -picf.

Step 3: Meta-assembly

This step is only in place to remove redundancy, not to put together contigs with small amounts of overlap. Requirements are still scaled with the number of input sequences just like in the case of the mini-assemblies, but this is for the sake of computational efficiency. Setting the requirements for collapse too high will prevent truly redundant contigs from collapsing, while setting the requirements too low could allow sequences deriving from distinct genomic regions or distinct metagenomic components to collapse into a consensus sequence.

Command-line args (see usage description): -MPI, -TPI.