PRICE Documentation: User Manual

Command-line Arguments

The various command-line arguments for PRICE are described below. Note that brief descriptions, including current default values, can be accessed at the command line by executing PRICE with the -h or --help flags.

Usage: ./PriceTI [args]

INPUT FILES:

Accepted formats are fasta (.fa or .fasta), fastq (.fq, .fastq, or _sequence.txt), or priceq (.pq or .priceq)
NOTE: The nucleotide quality-score specifications for traditional Sanger fastq files and Solexa/Illumina fastq files differ. For PRICE input, all .fastq and .fq files will be interpreted according to the Sanger specification, while all _sequence.txt files will be interpreted according to the Solexa/Illumina specification used by pipeline v.1.3 and later (see the Wikipedia article on FASTQ format for more information).
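As an illustration of the distinction drawn in the note above, the following Python sketch picks a quality-score offset from the file name and converts a quality character into the probability that the base call is correct. It assumes the standard FASTQ conventions (ASCII offset 33 for Sanger, offset 64 for Illumina pipeline v.1.3 and later, both Phred-scaled). The function names are hypothetical, and this is not PRICE's own parsing code.

```python
def quality_offset(filename):
    """Return the assumed ASCII offset for a read file, or None if the
    file name does not indicate a FASTQ-style format."""
    name = filename.lower()
    if name.endswith("_sequence.txt"):
        return 64  # assumed: Illumina pipeline >= 1.3, Phred+64
    if name.endswith((".fastq", ".fq")):
        return 33  # assumed: Sanger, Phred+33
    return None    # fasta or priceq: no FASTQ-encoded scores


def prob_correct(qual_char, offset):
    """Convert one quality character to P(base call is correct)
    using the Phred relation P(error) = 10^(-Q/10)."""
    q = ord(qual_char) - offset
    return 1.0 - 10.0 ** (-q / 10.0)


# The same quality level (Q40) is written with different characters
# under the two encodings.
print(quality_offset("reads.fq"), prob_correct("I", 33))
print(quality_offset("s_1_1_sequence.txt"), prob_correct("h", 64))
```

Reading a file with the wrong offset shifts every score by 31, which is why the file extension matters for PRICE input.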
READ FILES:

NOTE: these flags can be used multiple times in the same command to include multiple read datasets.
PAIRED-END FILES (reads are 3p of one another on opposite strands, i.e. pointing towards one another).

This orientation is typical of amplicons generated from short, single fragments of DNA read from either end.
-fp a b c [d e]

(a,b)input file pair, (c)amplicon insert size (including read)
(d,e) are optional; (d)the num. of cycles skipped before this file is used, (e)the num cycles for which it is used

-fpp a b c d [e f]

(a,b)input file pair, (c)amplicon insert size (including read), (d)required % identity for match (25-100 allowed)
(e,f) are optional; (e)the num. of cycles skipped before this file is used, (f)the num cycles for which it is used

-fs a b [c d]

(a)input paired-end file (alternating entries are paired-end reads), (b)amplicon insert size (including read)
(c,d) are optional; (c)the num. cycles to be skipped before this file is used, (d)the num cycles for which it is used

-fsp a b c [d e]

(a)input paired-end file (alternating entries are paired-end reads), (b)amplicon insert size (including read), (c)required % identity for match (25-100 allowed)
(d,e) are optional; (d)the num. cycles to be skipped before this file is used, (e)the num cycles for which it is used

MATE-PAIR FILES (reads are 5p of one another on opposite strands, i.e. pointing away from one another).

This orientation is typical of amplicons generated from the circularization of a long fragment of DNA, then further fragmented and read towards the circularization junction.
-mp a b c [d e]

like -fp above, but with reads in the opposite orientation.

-mpp a b c d [e f]

like -fpp above, but with reads in the opposite orientation.

-ms a b [c d]

like -fs above, but with reads in the opposite orientation.

-msp a b c [d e]

like -fsp above, but with reads in the opposite orientation.

FALSE PAIRED-END FILES (unpaired reads are split into paired ends, with the scores of double-use nucleotides halved)

One of two ways of using unpaired reads in PRICE assemblies (the other being as initial seed contigs). Each read is treated as an amplicon, and paired-end (inward-facing) "reads" are derived from windows on either side of the "amplicon", according to the user specifications below. If the paired-end "reads" end up overlapping, quality/confidence scores are divided by two so as to not over-count the quantity of primary data support for nucleotide identities in those overlapping regions.
-spf a b c [d e [f]]

(a)input file, (b)the length of the 'reads' that will be taken from each side of the input reads, (c)amplicon insert size (including read)
(d,e,f) are optional; (d)the num. cycles to be skipped before this file is used; if (f) is provided, then the file will alternate between being used for (e) cycles and not used for (f) cycles; otherwise, the file will be used for (e) cycles then will not be used again.
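The derivation of false paired ends described above can be sketched in Python as follows. This is a minimal illustration, not PRICE's implementation: the function names are hypothetical, scores are treated as plain numbers, and the role of the insert-size argument (c) is not modelled.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an upper-case nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def false_paired_end(seq, scores, window):
    """Split one unpaired read into two inward-facing 'reads' of length
    `window`, taken from either end of the input read.

    Scores of nucleotides that fall inside both windows are halved so
    that the shared positions are not counted twice.
    """
    n = len(seq)
    if len(scores) != n:
        raise ValueError("need one score per nucleotide")
    if not 0 < window <= n:
        raise ValueError("window must be between 1 and the read length")
    right_start = n - window  # first index of the right-hand window

    def adjusted(i):
        # Indices right_start .. window-1 are used by both windows.
        return scores[i] / 2 if right_start <= i < window else scores[i]

    left = (seq[:window], [adjusted(i) for i in range(window)])
    right_scores = [adjusted(i) for i in range(right_start, n)]
    # The right-hand read points back towards the left-hand one.
    right = (reverse_complement(seq[right_start:]), right_scores[::-1])
    return left, right


left, right = false_paired_end("AAACCC", [10, 20, 30, 40, 50, 60], 4)
print(left)
print(right)
```

In this example the two 4 nt windows share the middle two nucleotides of the 6 nt read, so only those two scores are halved.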

-spfp a b c d [e f [g]]

(a)input file, (b)the length of the 'reads' that will be taken from each side of the input reads, (c)amplicon insert size (including read), (d)required % identity for match (25-100 allowed)
(e,f,g) are optional; (e)the num. cycles to be skipped before this file is used; if (g) is provided, then the file will alternate between being used for (f) cycles and not used for (g) cycles; otherwise, the file will be used for (f) cycles then will not be used again.
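The optional cycle arguments of -spf and -spfp (and the simpler skip/use pair accepted by the other read-file flags) can be read as a schedule. The Python sketch below shows one interpretation, assuming cycles are counted from 1 and that the "used" period starts immediately after the skipped cycles; the function name is hypothetical.

```python
def file_in_use(cycle, skipped, used, unused=None):
    """Return True if a read file is used in the given 1-indexed cycle.

    skipped: cycles to skip before the file is first used
    used:    cycles for which the file is then used
    unused:  if given, the file alternates between `used` cycles on and
             `unused` cycles off; otherwise it is never used again
    """
    if cycle <= skipped or used <= 0:
        return False
    position = cycle - skipped - 1  # 0-based position after the skip
    if unused is None:
        return position < used
    return position % (used + unused) < used


# Skip 2 cycles, then use the file for 3 cycles.
print([c for c in range(1, 11) if file_in_use(c, 2, 3)])
# Same, but alternating with 2 unused cycles.
print([c for c in range(1, 11) if file_in_use(c, 2, 3, 2)])
```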

INITIAL CONTIG FILES:

NOTE: these flags can be used multiple times in the same command to include multiple initial contig datasets.
-icf a b c d

(a)initial contig file, (b)number of addition steps, (c)number of cycles per step, (d)const by which to multiply quality scores

-picf a b c d e

(a)num of initial contigs from this file, (b)initial contig file, (c)num addition steps, (d)num cycles per step (e)const by which to multiply quality scores

-icfNt / -picfNt

same as -icf/-picf, but if target mode is invoked, contigs with matches to these input sequences will not necessarily be retained
This option for initial files is intended to accelerate targeted assembly and possibly to allow "stuck" contigs that would be introduced for a targeted assembly job to be approached from the outside. If an initial contig set was already introduced, a large number of arbitrary reads can be introduced that may spawn contigs from nearby the primary set of initial contigs. Those reads will spawn contigs of their own, and if those contigs are from nearby the primary input contigs, they may combine with them. Thus, the primary initial contigs will expand more rapidly. However, most of the arbitrary reads that are introduced will probably derive from distal genetic elements that are spatially irrelevant to the primary initial contigs. These options allow those irrelevant contigs to be disposed of during targeting steps if they do not combine with the primary contigs of interest.

OUTPUT FILES:

accepted formats are fasta (.fa or .fasta) or priceq (.pq or .priceq)
-o a

(a)output file name (.fasta or .priceq)
Multiple output files can be separately specified for parallel output (e.g. -o out.fasta -o out.priceq).

-nco a

(a)num. cycles that pass in between output files being written

OTHER PARAMS:

-nc a

(a)num. of cycles
Recommendation: if too few cycles were specified, a previously run job can effectively be restarted with very little loss of information, provided the output of its final cycle was written in .priceq format. That final .priceq file can be supplied as the initial contig file using the -icf flag.
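The restart recommendation above can be sketched as a pair of command lines, assembled here in Python. All file names and numeric values are hypothetical placeholders chosen for illustration. In particular, the name of the .priceq file written by the first run is an assumption, so substitute whatever file that run actually produced last.

```python
import shlex

# Hypothetical inputs: two paired-end read files with 300 nt amplicons
# and a fasta file of seed contigs.
first_run = [
    "./PriceTI",
    "-fp", "reads_1.fq", "reads_2.fq", "300",
    "-icf", "seeds.fa", "1", "1", "5",
    "-nc", "10",
    "-o", "run1.priceq",
]

# Assumed name of the last .priceq file written by the first run.
previous_output = "run1_final.priceq"

# Restart: feed the previous .priceq output back in as initial contigs.
restart = [
    "./PriceTI",
    "-fp", "reads_1.fq", "reads_2.fq", "300",
    "-icf", previous_output, "1", "1", "1",
    "-nc", "10",
    "-o", "run2.priceq",
    "-o", "run2.fasta",
]

print(shlex.join(first_run))
print(shlex.join(restart))
```

Writing both a .priceq and a .fasta output in the second run keeps the option of a further restart open while also producing a file that other tools can read.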

-link a

(a)max. number of contigs that are allowed to replace a read in a contig-edge assembly
Edge assembly jobs can become unreasonably complex if the sequence into which the contig is being extended includes a repetitive element. The ability of even a single repeat-derived read to map to all contigs in the current assembly that contain a copy of that repeat opens the possibility of a huge number of contigs being brought into a single assembly job, despite the fact that they do not truly derive from nearby genomic loci. -link provides an opportunity to avoid that situation by placing a maximum on the number of contigs that are allowed to replace a single repeat-mapping read. The read itself will be retained in the assembly job, allowing for the assembly to extend into the local copy of the repetitive element, but irrelevant contigs with inappropriate sequence flanking the repetitive element will not be included.
It is not recommended for the user to change this parameter substantially from the default value.

-mol a

(a)minimum overlap length for mini-assembly
NOTE: -mol does not affect the parameters for de-Bruijn-graph-based assembly.
This is the global minimum alignment length for two sequences to be combined into a contig during contig-edge assembly jobs. Alignments are performed semi-globally, so this is the minimum extent of overlap that is allowed to exist between two sequences for them to be combined. This value is logarithmically increased with the number of sequences in an assembly job, starting when that number exceeds that specified by -tol. For a gapped alignment, the lesser of the overlapping nucleotide counts for the two sequences is compared to this value. As noted above, if the de Bruijn graph strategy is applied and has a lower k-mer size (-dbk, below) than -mol, short sequences with less overlap can still be combined through that strategy.

-tol a

(a)threshold seq num for scaling overlap for contig-edge assemblies
NOTE: -tol does not affect the parameters for de-Bruijn-graph-based assembly.
For contig-edge assembly jobs, this is the number of sequences above which the minimum overlap of two sequences for combining them into a contig will be logarithmically increased. At and below this number of sequences, the minimum overlap value will be equal to that specified by -mol.

-mpi a

(a)minimum % identity for contig-edge assembly
In a gapped alignment, the aligned strand with the lower percent of matching nucleotides will give rise to the % identity. Competing alignments will always be selected based on their calculated semi-global scores, but alignments with less than the minimum percent identity will be excluded. -mpi imposes a global minimum on contig-edge assemblies, but like the minimum overlap length, this requirement will be scaled as the number of reads increases over that specified by -tpi (see below). The minimum % ID value will asymptotically approach 100% from the value given by -mpi, decreasing the distance by half for every log-scale increase to the number of input sequences.

-tpi a

(a)threshold seq num for scaling % ID for contig-edge assemblies
For contig-edge assembly jobs, this is the number of sequences above which the minimum % identity of two sequences for combining them into a contig will be drawn closer to 100% (see -mpi above). At and below this number of sequences, the minimum % identity will be equal to that specified by -mpi.
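The scaling described for -mpi and -tpi can be written out as a small Python sketch. The base of the "log-scale increase" is not stated here, so base 10 is assumed below (each tenfold increase in sequence count halves the remaining distance to 100%). The function and its parameter names are hypothetical.

```python
import math


def min_percent_identity(num_seqs, mpi, tpi, log_base=10.0):
    """Minimum % identity for a contig-edge assembly job.

    At or below `tpi` sequences the minimum equals `mpi`. Above it, the
    distance to 100% is halved for every `log_base`-fold increase in
    the number of sequences (log_base=10 is an ASSUMPTION).
    """
    if num_seqs <= tpi:
        return float(mpi)
    steps = math.log10(num_seqs / tpi) / math.log10(log_base)
    return 100.0 - (100.0 - mpi) * 0.5 ** steps


# With -mpi 80 and -tpi 1000 (illustrative values):
for n in (500, 1000, 10000, 100000):
    print(n, min_percent_identity(n, mpi=80, tpi=1000))
```

Under these assumptions the requirement rises from 80% to 90% at ten times the threshold and to 95% at a hundred times the threshold, never reaching 100%.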

-MPI a, -TPI a

same as -mpi and -tpi above, but for meta-assembly
NOTE: there is no minimum overlap value for meta-assembly. Meta-assembly only collapses highly-redundant sequences that overlap entirely (or nearly entirely).

-dbmax a

(a) the maximum length sequence that will be fed into de Bruijn assembly (recommended: max paired-end read length)
The de Bruijn graph approach to genome assembly is highly efficient. However, when k-mers are repeated in a genome, or when k-mers are palindromic, the graph representation loses information about the overall structure of the genome. PRICE applies the de Bruijn graph approach only to local contig-edge assembly jobs, thereby limiting the opportunity for such errors to cases of very local repeats and, by creating strand-specific assembly jobs using paired-end topology to orient sequences, avoiding the confusion caused by palindromic sequences. In order for PRICE to not dismantle the structure of the larger contigs that are also included in contig-edge assembly jobs, an upper limit must be placed on the length of sequence that will be subjected to de Bruijn graph representation. If -dbmax is set to the length of the input reads, then they will be efficiently collapsed by this method, while the larger contigs will be retained as they are for pairwise alignment and collapse with one another and with the contig(s) generated by the de Bruijn graph method.

-dbk a

(a) the k-mer size for de Bruijn assembly (recommended: keep less than the read length)
The de Bruijn assembly strategy is only applied during the contig-edge assemblies, and then it is only applied to short sequences (see -dbmax). The single-stranded nature of the contig-edge assembly jobs allows for the k-mer size to be an even number without introducing errors from palindromic k-mers.

-dbms a

(a) the minimum number of sequences to which de Bruijn assembly will be applied
The de Bruijn graph is a graph of nucleotide k-mers. Therefore, k-mers containing non-canonical nucleotide characters (like N, which is often used in raw sequence files to represent a nucleotide whose identity was ambiguous) cannot be included in the de Bruijn graph. As a result, a single N in an input sequence will prevent up to (k-mer size) k-mers from being entered into the de Bruijn graph. In areas of redundant coverage, other sequences will be able to provide the path through the de Bruijn graph to span the error. But in places of low coverage, a single error could prevent two sequences capable of forming a very high-quality alignment from being combined. Places of low coverage can be identified as contig-edge assembly jobs with small numbers of short (read-sized) sequences. -dbms allows one to bypass de Bruijn-based assembly of short sequences when very few such sequences are present.
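The effect of a single ambiguous nucleotide on the k-mer set can be demonstrated with a short Python sketch. This only enumerates k-mers and is not PRICE's graph construction; the function name is hypothetical.

```python
def usable_kmers(seq, k):
    """Return the k-mers of `seq` that contain only A, C, G, or T."""
    canonical = set("ACGT")
    seq = seq.upper()
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= canonical
    ]


k = 3
clean = "ACGTACGTAC"
with_n = "ACGTANGTAC"  # same read with one ambiguous call

lost = len(usable_kmers(clean, k)) - len(usable_kmers(with_n, k))
print("k-mers lost to one N:", lost)
```

Every k-mer window that overlaps the N is discarded, so an N in the interior of a read removes k consecutive k-mers and breaks the path through that read.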

-r a

(a) alignment score reward for a nucleotide match; should be a positive integer (default=1)
These alignment parameters (-r, -q, -G, and -E) are applied to all contexts in which alignments are generated using the dynamic programming algorithm. Alignments are evaluated as satisfactory or unsatisfactory in terms of the specified % IDs (fraction of nucleotides that match), but optimal alignments are sought and selected based on the alignment scores.

-q a

(a) alignment score penalty for a nucleotide mismatch; should be a negative integer (default=-2)
See -r above.

-G a

(a) alignment score penalty for opening a gap; should be a negative integer (default=-5)
See -r above.

-E a

(a) alignment score penalty for extending a gap; should be a negative integer (default=-2)
See -r above.
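To make the relationship between the alignment score and the % identity concrete, the following Python sketch scores an already-gapped alignment with the default values of -r, -q, -G, and -E. It assumes that a gap of length L costs G + L*E (one common affine convention; the convention PRICE uses is not stated here) and takes the % identity from the strand with the lower fraction of matching nucleotides, as described under -mpi. The function name is hypothetical.

```python
def score_alignment(top, bottom, r=1, q=-2, G=-5, E=-2):
    """Score two equal-length alignment rows ('-' marks a gap).

    Returns (score, percent_identity).
    ASSUMPTION: a gap of length L costs G + L * E.
    """
    if not top or len(top) != len(bottom):
        raise ValueError("alignment rows must be non-empty and equal in length")
    score = 0
    matches = 0
    open_gap = None  # row that holds the gap currently being extended
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if row != open_gap:
                score += G  # opening a new gap
                open_gap = row
            score += E      # every gap column pays the extension cost
        else:
            open_gap = None
            if x == y:
                score += r
                matches += 1
            else:
                score += q
    # The strand with more aligned nucleotides has the lower % identity.
    longer = max(len(top) - top.count("-"), len(bottom) - bottom.count("-"))
    return score, 100.0 * matches / longer


print(score_alignment("ACGT-ACG", "ACTTTACG"))
```

The score is what ranks competing alignments, while the % identity is only used to accept or reject an alignment against the -mpi/-MPI thresholds.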

FILTERING READS:

-rqf a b [c d]

filters pairs of reads if either has an unacceptably high number of low-quality nucleotides, as defined by the provided quality scores (only applies to files whose formats include quality score information).
(a) the percentage of nucleotides in a read that must be high-quality; (b) the minimum allowed probability of a nucleotide being correct (must be between 0 and 1, and will usually be a decimal value close to 1); (c) and (d) optionally constrain this filter to use after (c) cycles have passed, to run for (d) cycles.
This flag may be called multiple times to generate variable behavior across a PRICE run.
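The -rqf test can be sketched in Python as follows, assuming that quality scores have already been converted to per-nucleotide probabilities of being correct and that both thresholds are inclusive. The function names are hypothetical.

```python
def read_passes_rqf(probs, min_percent, min_prob):
    """True if at least `min_percent` % of the nucleotides in a read have
    a probability of being correct of at least `min_prob`."""
    if not probs:
        return False
    high_quality = sum(p >= min_prob for p in probs)
    return 100.0 * high_quality / len(probs) >= min_percent


def pair_passes_rqf(probs_a, probs_b, min_percent, min_prob):
    """A pair is kept only if both of its reads pass the filter."""
    return (read_passes_rqf(probs_a, min_percent, min_prob)
            and read_passes_rqf(probs_b, min_percent, min_prob))


good_read = [0.99] * 10
mixed_read = [0.99] * 9 + [0.50]  # 90% of nucleotides are high-quality

# Illustrative thresholds corresponding to "-rqf 90 0.95" and "-rqf 95 0.95".
print(pair_passes_rqf(good_read, mixed_read, 90, 0.95))
print(pair_passes_rqf(good_read, mixed_read, 95, 0.95))
```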

-rnf a [b c]

filters pairs of reads if either has an unacceptably high number of uncalled nucleotides (Ns or other ambiguous IUPAC codes). Like -rqf, but will also filter fasta-format data.
(a) the percentage of nucleotides in a read that must be called; (b) and (c) optionally constrain this filter to use after (b) cycles have passed, to run for (c) cycles.
This flag may be called multiple times to generate variable behavior across a PRICE run.

-maxHp a

filters out a pair of reads if either read has a homo-polymer track >(a) nucleotides in length.
This feature was inspired by mRNA transcript assembly jobs, for which reaching the poly-A tail at the end of the message could spell disaster without it.

-maxDi a

filters out a pair of reads if either read has a repeating di-nucleotide track >(a) nucleotides in length.
Note that this will also catch mono-nucleotide repeats of the specified length (a string of A's is also a string of AA dinucleotides), so calling -maxHp in addition to -maxDi is superfluous unless -maxHp is given a smaller max value.
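The overlap between -maxHp and -maxDi noted above can be checked with a short Python sketch. It measures the longest stretch, in nucleotides, over which a read repeats with period 1 (homopolymer) or period 2 (dinucleotide). The function names are hypothetical and this is only one way such a filter could be written.

```python
def longest_repeat_tract(seq, period):
    """Length in nucleotides of the longest stretch in which every
    nucleotide equals the one `period` positions before it."""
    if len(seq) <= period:
        return len(seq)
    best = run = period
    for i in range(period, len(seq)):
        run = run + 1 if seq[i] == seq[i - period] else period
        best = max(best, run)
    return best


def pair_is_filtered(read_a, read_b, max_hp=None, max_di=None):
    """True if either read has a tract longer than the given maximum."""
    for read in (read_a, read_b):
        if max_hp is not None and longest_repeat_tract(read, 1) > max_hp:
            return True
        if max_di is not None and longest_repeat_tract(read, 2) > max_di:
            return True
    return False


print(longest_repeat_tract("ACACACGT", 2))  # AC repeat
print(longest_repeat_tract("GGAAAAAT", 1))  # run of A's as a homopolymer
print(longest_repeat_tract("GGAAAAAT", 2))  # the same run seen as AA repeats
```

Because the run of A's is reported at the same length under both periods, a -maxHp value only adds something when it is smaller than the -maxDi value.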

-badf a b

prevents reads with an ungapped match of at least (b)% identity to a sequence in file (a) from being mapped to contigs.
This feature is intended to allow the user to avoid certain specified sequences. It is especially useful for avoiding performance issues associated with repetitive elements (e.g. LINEs or SINEs).

-repmask a b c d e f [g]

uses coverage levels of constructed and/or input contigs to find repetitive elements and mask them as if they were sequences input using -badf.
(a) = cycle number (1-indexed) at which repeats will be detected.
(b) = 's' if repeats will be sought at the start of the cycle or 'f' if they will be sought at the finish.
(c) = the min. number of variance units above the median that will be counted as high-coverage.
(d) = the min. fold increase in coverage above the median that will be counted as high-coverage.
(e) = the min. size in nt for a detected repeat.
(f) = reads with a match of at least this % identity to a repeat will not be mapped to contigs.
(g) = an optional output file (.fasta or .priceq) to which the detected repeats will be written.

-reset a

re-introduces contigs that were previously not generating assembly jobs of their own. (a) is the one-indexed cycle at which the contigs will be reset. Additional cycle numbers (b, c, d, and so on) may follow in the same call; any number of args may be given.
In order to maintain efficiency, contigs whose sequence does not change from one cycle to another are not re-extended, i.e. paired-end reads are not mapped to their edges. They can still be included in assembly jobs if the reads that map to the edges of still-active contigs have paired-ends that map to them (which will bring them into those contigs' assembly jobs). But sometimes, a contig that has stopped growing will be approached by another contig that is still growing, and the same troublesome (probably low-coverage) area will stop that contig as well. In that case, the sum of the overhanging reads from both of the two stalled contigs could allow the low-coverage region to be traversed. This arg provides an opportunity for that to happen by using all of the contigs in the assembly with no regard for whether or not they had changed during the prior cycle.

FILTERING INITIAL CONTIGS:

-icbf a b [c]

prevents input sequences with a match of at least (b)% identity to a sequence in file (a) from being used. This filter is optionally not applied to sequences of length greater than (c) nucleotides.
similar to -badf (above)

-icmHp a [b]

filters out an initial contig if it has a homo-polymer track >(a) nucleotides in length. This filter is optionally not applied to sequences of length greater than (b) nucleotides.
similar to -maxHp (above)

-icmDi a [b]

filters out an initial contig if it has a repeating di-nucleotide track >(a) nucleotides in length. This filter is optionally not applied to sequences of length greater than (b) nucleotides.
NOTE: this will also catch mono-nucleotide repeats of the specified length (a string of A's is also a string of AA's), so calling -icmHp in addition to -icmDi is superfluous unless -icmHp is given a smaller max value.
similar to -maxDi (above)

-icqf a b [c]

filters out an initial contig if it has an unacceptably high number of low-quality nucleotides, as defined by the provided quality scores (only applies to files whose formats include quality score information).
(a) the percentage of nucleotides in a contig that must be high-quality; (b) the minimum allowed probability of a nucleotide being correct (must be between 0 and 1, and will usually be a decimal value close to 1).
This filter is optionally not applied to sequences of length greater than (c) nucleotides.
similar to -rqf (above)

-icnf a [b]

filters out an initial contig if it has an unacceptably high number of uncalled nucleotides (Ns or other ambiguous IUPAC codes). Like -icqf, but will also filter fasta-format data.
(a) the percentage of nucleotides in a contig that must be called. This filter is optionally not applied to sequences of length greater than (b) nucleotides.

FILTERING/PROCESSING ASSEMBLED CONTIGS:

-lenf a b

filters out contigs shorter than (a) nt at the end of every cycle, after skipping (b) cycles.
Recommended use is to set this to the read length or higher. De Bruijn-graph assembly can yield very short sequences (paths that traverse too few k-mers to cover even a read length), but those are not eliminated unless a minimum contig length is specified with -lenf.
This filter can be applied variably across assembly iterations of a PRICE run. Since the object of PRICE is to continually expand a set of existing contigs, length filters that are appropriate for removal of rogue de Bruijn k-mers, as described above, may not be appropriate for later cycles, where contigs have grown to many kilobases and shorter contigs are likely failures or off-target distractions. On the other hand, a 1kb length filter could remove all of the contigs from early cycles. The minimum contig length can be scaled up by using the -lenf flag multiple times for the same job. For instance, if you used the flag twice: "-lenf 50 0 -lenf 300 2", then a minimal length filter of 50nt would be applied immediately, and a more stringent filter of 300nt would be applied starting in cycle 3. If you expected the contigs to continue to grow, then adding "-lenf 1000 7" would reflect that expectation.
Note that the length filters do not need to always be increasing. For instance, if you had a genomic region for which there was zero read coverage, but there were paired-end reads spanning that region, you may want to build outboard contigs that could be scaffolded with (but not joined to) your existing, large contigs. Initially, those outboard contigs would be small, so you would want to decrease the length filter. In the above example, adding "-lenf 50 9" would lower the length requirement from 1000nt to 50nt in the 10th cycle (i.e. after skipping 9 cycles). That would allow scaffold-able outboard contigs to form. If you then wanted to get rid of the unsuccessful outboard contigs, you could ramp the length requirement back up with an additional call, e.g. "-lenf 500 14". So for any given cycle, the imposed length requirement is the most-recently invoked requirement, NOT necessarily the longest requirement invoked up to that point.
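The way multiple -lenf calls combine can be sketched in Python using the example values from the text above. The function name is hypothetical, cycles are counted from 1, and a call with (b) skipped cycles is assumed to take effect from cycle b+1 onward. If two calls name the same number of skipped cycles, the later one on the command line is assumed to win.

```python
def active_min_length(cycle, lenf_calls):
    """Return the minimum contig length in force for a 1-indexed cycle.

    lenf_calls: list of (min_length, cycles_skipped) pairs in
    command-line order. Returns None if no filter has started yet.
    """
    active = None
    latest_skip = -1
    for min_length, skipped in lenf_calls:
        # A call applies once its skipped cycles have passed; among the
        # applicable calls, the one that started most recently wins.
        if skipped < cycle and skipped >= latest_skip:
            active, latest_skip = min_length, skipped
    return active


# -lenf 50 0 -lenf 300 2 -lenf 1000 7 -lenf 50 9 -lenf 500 14
calls = [(50, 0), (300, 2), (1000, 7), (50, 9), (500, 14)]
for cycle in (1, 3, 8, 10, 15):
    print(cycle, active_min_length(cycle, calls))
```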

-trim a b [c]

at the end of the (a)th cycle (indexed from 1), trim off the edges of contigs until reaching the minimum coverage level (b), optionally deleting contigs shorter than (c) after trimming; this flag may be used repeatedly.
This is an attempt to clear low-coverage sequence from the edges of contigs. This strategy presumes that such sequence derives from poorly-supported mis-assemblies rather than low-coverage regions. If there is a small window of low coverage but legitimate sequence, this will also be trimmed, but it should be able to be re-constructed in the next cycle. This trim method will chew off nucleotides starting at each of the two edges of each contig and will continue to do so until a score is found meeting the minimum coverage level. If (c) is specified and the resulting contig is shorter than it allows, the whole contig will be culled.
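The edge-trimming behavior described above can be sketched in Python. This is a minimal illustration that assumes one coverage score per nucleotide; the function name is hypothetical and PRICE's own scoring is not reproduced.

```python
def trim_contig(seq, coverage, min_cov, min_len=None):
    """Trim low-coverage nucleotides from both edges of a contig.

    Trimming proceeds inward from each edge and stops at the first
    position whose coverage meets `min_cov`; interior low-coverage
    positions are left alone. Returns (sequence, coverage) after
    trimming, or None if nothing is left or the result is shorter
    than `min_len`.
    """
    if len(seq) != len(coverage):
        raise ValueError("need one coverage value per nucleotide")
    start, end = 0, len(seq)
    while start < end and coverage[start] < min_cov:
        start += 1
    while end > start and coverage[end - 1] < min_cov:
        end -= 1
    if start == end or (min_len is not None and end - start < min_len):
        return None
    return seq[start:end], coverage[start:end]


cov = [1, 2, 5, 6, 1, 7, 2, 1]
print(trim_contig("ACGTACGT", cov, min_cov=5))
print(trim_contig("ACGTACGT", cov, min_cov=5, min_len=5))
```

In the first call the low-coverage position in the middle of the contig survives because trimming stops as soon as each edge reaches adequate coverage; in the second call the trimmed contig is culled for being shorter than the requested minimum.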

-trimB a b [c]

basal trim; after skipping (a) cycles, trim off the edges of contigs until reaching the minimum coverage level (b) at the end of EVERY cycle, optionally deleting contigs shorter than (c) after trimming. -trimB may be called many times, and multiple calls will interact in the same way as multiple -lenf calls (explained above). A call to -trim will override the basal trim values for that specified cycle only.
Errors can occur at any time; this command allows a basal level of trim-back to be set up for the entire PRICE run. It is also set up to be flexible, so that the basal trimming can be modified through different phases of the assembly process (say, as the read sets being used are modified, and with them the expected level of coverage for newly-added contig sequence).

-trimI a [b]

initial trim; input initial contigs are trimmed before being used by PRICE to seed assemblies. Contigs are trimmed from their outside edges until reaching the minimum coverage level (a), optionally deleting contigs shorter than (b) after trimming. This flag is most appropriate for .priceq input, can be appropriate for .fastq input, and is inappropriate for .fasta input. It will be equally applied to ALL input contigs.
When restarting an assembly, this allows edge errors from the prior assembly to be dealt with before the new assembly gets started. This feature was intended for use with .priceq input files, which retain the coverage scores that -trim normally deals with. It can also be used with .fastq-formatted data to remove low-quality nucleotides from the end of a read (decimal values are allowed for argument (a)). It is not appropriate for use with .fasta-format data (the scores are uniform across those sequences, so this filter will either not modify the input data or, with a high value of (a), delete the entire input dataset).

-target a b [c d]

limit output contigs to those with matches to input initial contigs at the end of each cycle. (a) % identity to an input initial contig to count as a match (ungapped); (b)num cycles to skip before applying this filter. [c and d are optional, but must both be provided if either is] After target filtering has begun, target-filtered/-unfiltered cycles will alternate with (c) filtered cycles followed by (d) unfiltered cycles.
This feature was inspired by target virus assembly jobs using metagenomic datasets generated through random hexamer priming of RNA templates. The tendency of RT to switch templates generated chimeric paired-ends, which would allow jobs seeded with viral sequences to tangentially begin assembling contigs from other metagenome components. Even with more reliable library prep methods, low-frequency chimeric amplicons can spawn such off-target assemblies, as can repetitive genomic sequence or incorrect read mapping. This feature eliminates contigs at the end of each cycle that do not retain identity to the seed sequences. Cycles can be skipped both initially and/or periodically through the PRICE run using the args. This allows short contigs to spawn the assembly of larger contigs that will not be able to immediately close the gap between them and their parent contig.

-targetF a b [c d]

the same as -target, but now matches to all reads in the input set will be specified, not just the ones that have been introduced up to that point (this is FullFile mode).

COMPUTATIONAL EFFICIENCY:

-a x

(x)num threads to use
Many aspects of PRICE are threaded for multi-core CPUs. Thread use will rise and fall as data from parallel operations is re-synchronized in the main thread. Threading will generally be at its most efficient during mapping (dropping while files are being read) and during contig-edge assembly. Thread use will generally be the most uneven during meta-assembly (or at the beginning of contig-edge assembly if there are a small number of very large contig-edge assembly jobs).

-mtpf a

(a)max threads per file
This variable allows the reading from a single file to be threaded. It is not expected to improve performance when reading files stored on disk drives but may enhance performance on solid-state drives (not tested).

USER INTERFACE:

-log a

determines the type of output and can make the output verbose (lots of time stamp tags)

(a) = c: concise stdout (default)
(a) = n: no stdout
(a) = v: verbose stdout

-logf a

(a)the name of an output file for verbose log info to be written (doesn't change stdout format)
Recommendation: use option c (concise/default) output for stdout for viewing and, if desired, use -logf to create a supplemental verbose record of your run. While you may want to mine the verbose file for details about the run, it will generally not be a good interface for keeping track of the job status.

-h, or --help

user interface info.
NOTE: no job will be run if this flag is used, regardless of which other flags are also given.