PRICE Documentation: User Manual
Command-line Arguments
The various command-line arguments for PRICE are described below. Note that brief descriptions, including current
default values, can be accessed at the command line by executing PRICE with the -h or --help flags.
- INPUT FILES:
- Accepted formats are fasta (.fa or .fasta), fastq (.fq, .fastq, or _sequence.txt), or priceq (.pq or .priceq)
- NOTE: The nucleotide scoring specifications for traditional Sanger fastq files and Solexa/Illumina fastq files differ. For PRICE input,
all .fastq and .fq files will be interpreted according to the Sanger specifications, while all _sequence.txt files will be interpreted
according to the Solexa/Illumina specifications used by pipeline v.1.3 and later (see the
Wikipedia article on FASTQ format for more information).
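To make the difference between the two score encodings concrete, here is a minimal Python sketch, not PRICE's own code. It assumes the standard conventions: Sanger files store Phred scores with an ASCII offset of 33, and Solexa/Illumina pipeline v.1.3+ files store Phred scores with an offset of 64. The function names are hypothetical and chosen only for illustration.

```python
def quality_offset(filename):
    """Pick the ASCII offset implied by the file-name conventions above."""
    name = filename.lower()
    if name.endswith("_sequence.txt"):
        return 64  # Solexa/Illumina pipeline v.1.3+ encoding (Phred+64)
    if name.endswith((".fastq", ".fq")):
        return 33  # Sanger encoding (Phred+33)
    raise ValueError("not a recognized fastq file name: " + filename)


def decode_quality(quality_line, filename):
    """Convert one fastq quality line into a list of Phred scores."""
    offset = quality_offset(filename)
    return [ord(ch) - offset for ch in quality_line]


def phred_to_p_correct(q):
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - 10 ** (-q / 10.0)
```

The same quality character therefore means different things in the two file types: 'I' is Phred 40 in a .fastq file, while 'h' is Phred 40 in a _sequence.txt file.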
- READ FILES:
- NOTE: these flags can be used multiple times in the same command to include multiple read datasets.
- PAIRED-END FILES (reads are 3p of one another on opposite strands, i.e. pointing towards one another).
- This orientation is typical of amplicons generated from short, single fragments of DNA read from either end.
- -fp a b c [d e]
- (a,b)input file pair, (c)amplicon insert size (including read)
(d,e) are optional; (d)the num. of cycles skipped before this file is used, (e)the num cycles for which it is used
- -fpp a b c d [e f]
- (a,b)input file pair, (c)amplicon insert size (including read), (d)required % identity for match (25-100 allowed)
(e,f) are optional; (e)the num. of cycles skipped before this file is used, (f)the num cycles for which it is used
- -fs a b [c d]
- (a)input paired-end file (alternating entries are paired-end reads), (b)amplicon insert size (including read)
(c,d) are optional; (c)the num. cycles to be skipped before this file is used, (d)the num cycles for which it is used
- -fsp a b c [d e]
- (a)input paired-end file (alternating entries are paired-end reads), (b)amplicon insert size (including read), (c)required % identity for match (25-100 allowed)
(d,e) are optional; (d)the num. cycles to be skipped before this file is used, (e)the num cycles for which it is used
- MATE-PAIR FILES (reads are 5p of one another on opposite strands, i.e. pointing away from one another).
- This orientation is typical of amplicons generated from the circularization of a long fragment of DNA, then
further fragmented and read towards the circularization junction.
- -mp a b c [d e]
- like -fp above, but with reads in the opposite orientation.
- -mpp a b c d [e f]
- like -fpp above, but with reads in the opposite orientation.
- -ms a b [c d]
- like -fs above, but with reads in the opposite orientation.
- -msp a b c [d e]
- like -fsp above, but with reads in the opposite orientation.
- FALSE PAIRED-END FILES (unpaired reads are split into paired ends, with the scores of double-use nucleotides halved)
- One of two ways of using unpaired reads in PRICE assemblies (the other being as initial seed contigs). Each read is
treated as an amplicon, and paired-end (inward-facing) "reads" are derived from windows on either side of the "amplicon",
according to the user specifications below. If the paired-end "reads" end up overlapping, quality/confidence scores
are divided by two so as to not over-count the quantity of primary data support for nucleotide identities in those
overlapping regions.
- -spf a b c [d e [f]]
- (a)input file, (b)the length of the 'reads' that will be taken from each side of the input reads,
(c)amplicon insert size (including read)
(d,e,f) are optional; (d)the num. cycles to be skipped before this file is used;
if (f) is provided, then the file will alternate between being used for (e) cycles and not used for (f) cycles;
otherwise, the file will be used for (e) cycles then will not be used again.
- -spfp a b c d [e f [g]]
- (a)input file, (b)the length of the 'reads' that will be taken from each side of the input reads,
(c)amplicon insert size (including read), (d)required % identity for match (25-100 allowed)
(e,f,g) are optional; (e)the num. cycles to be skipped before this file is used;
if (g) is provided, then the file will alternate between being used for (f) cycles and not used for (g) cycles;
otherwise, the file will be used for (f) cycles then will not be used again.
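The false paired-end idea described above can be sketched in Python as follows. This is an illustrative approximation, not PRICE's implementation: the function name is hypothetical, the second "read" is assumed to be the reverse complement of the right-hand window (so that the two reads face inward), and the amplicon-size argument is ignored.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def split_false_pair(seq, scores, read_len):
    """Derive two inward-facing 'reads' from the ends of one unpaired read.

    Scores at positions covered by both windows are halved so that the
    underlying data are not counted twice.
    """
    n = len(seq)
    if read_len > n:
        raise ValueError("read_len exceeds the length of the input read")

    def window_scores(lo, hi):
        # Positions n - read_len .. read_len - 1 are used by both windows.
        return [scores[i] / 2 if (n - read_len) <= i < read_len else scores[i]
                for i in range(lo, hi)]

    left_seq = seq[:read_len]
    left_scores = window_scores(0, read_len)
    # Reverse-complement the right window so it points back toward the left one.
    right_seq = seq[n - read_len:].translate(_COMPLEMENT)[::-1]
    right_scores = window_scores(n - read_len, n)[::-1]
    return (left_seq, left_scores), (right_seq, right_scores)
```

For a 6-nt read split with a window length of 4, the two middle positions are shared by both windows and their scores are halved; with a window length of 2 there is no overlap and all scores are kept as they are.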
- INITIAL CONTIG FILES:
- NOTE: these flags can be used multiple times in the same command to include multiple initial contig datasets.
- -icf a b c d
- (a)initial contig file, (b)number of addition steps, (c)number of cycles per step, (d)const by which to multiply quality scores
- -picf a b c d e
- (a)num of initial contigs from this file, (b)initial contig file, (c)num addition steps, (d)num cycles per step, (e)const by which to multiply quality scores
- -icfNt / -picfNt
- same as -icf/-picf, but if target mode is invoked, contigs with matches to these input sequences will not necessarily be retained
- This option for initial files is intended to accelerate targeted assembly and possibly to allow "stuck" contigs that would be
introduced for a targeted assembly job to be approached from the outside. If an initial contig set was already introduced, a
large number of arbitrary reads can be introduced that may spawn contigs from nearby the primary set of initial contigs. Those
reads will spawn contigs of their own, and if those contigs are from nearby the primary input contigs, they may combine with them.
Thus, the primary initial contigs will expand more rapidly. However, most of the arbitrary reads that are introduced will probably
derive from distal genetic elements that are spatially irrelevant to the primary initial contigs. These options allow those
irrelevant contigs to be disposed of during targeting steps if they do not combine with the primary contigs of interest.
- OUTPUT FILES:
- accepted formats are fasta (.fa or .fasta) or priceq (.pq or .priceq)
- -o a
- (a)output file name (.fasta or .priceq)
- Multiple output files can be separately specified for parallel output (e.g. -o out.fasta -o out.priceq).
- -nco a
- (a)num. cycles that pass in between output files being written
- OTHER PARAMS:
- -nc a
- (a)num. of cycles
- Recommendation: if too few cycles were specified, a job that was previously run can be virtually re-started
with very little loss of information if the output of the previous cycle was written in .priceq format. The final .priceq
file can be added as the initial contigs file using the -icf flag.
- -link a
- (a)max. number of contigs that are allowed to replace a read in a contig-edge assembly
- Edge assembly jobs can become unreasonably complex if the sequence into which the contig is being
extended includes a repetitive element. The ability of even a single repeat-derived read to map to
all contigs in the current assembly that contain a copy of that repeat opens the possibility of a
huge number of contigs being brought into a single assembly job, despite the fact that they do not
truly derive from nearby genomic loci. -link provides an opportunity to avoid that situation by
placing a maximum on the number of contigs that are allowed to replace a single repeat-mapping read.
The read itself will be retained in the assembly job, allowing for the assembly to extend into the
local copy of the repetitive element, but irrelevant contigs with inappropriate sequence flanking
the repetitive element will not be included.
- It is not recommended for the user to change this parameter substantially from the default value.
- -mol a
- (a)minimum overlap length for mini-assembly
- NOTE: -mol does not affect the parameters for de-Bruijn-graph-based assembly.
- This is the global minimum alignment length for two sequences to be combined into a contig during
contig-edge assembly jobs. Alignments are performed semi-globally, so this is the minimum extent of overlap
that is allowed to exist between two sequences for them to be combined. This value is logarithmically increased
with the number of sequences in an assembly job, starting when that number exceeds that specified by -tol.
For a gapped alignment, the lesser of the overlapping nucleotide counts for the two sequences is compared to this value.
As noted above, if the de Bruijn graph strategy is applied and has a lower k-mer size (-dbk, below) than -mol, short
sequences with less overlap can still be combined through that strategy.
- -tol a
- (a)threshold seq num for scaling overlap for contig-edge assemblies
- NOTE: -tol does not affect the parameters for de-Bruijn-graph-based assembly.
- For contig-edge assembly jobs, this is the number of sequences above which the minimum overlap of two sequences for
combining them into a contig will be logarithmically increased. At and below this number of sequences, the minimum
overlap value will be equal to that specified by -mol.
- -mpi a
- (a)minimum % identity for contig-edge assembly
- In a gapped alignment, the aligned strand with the lower percent of matching nucleotides will give rise to the % identity.
Competing alignments will always be selected based on their calculated semi-global scores, but alignments with less than
the minimum percent identity will be excluded. -mpi imposes a global minimum on contig-edge assemblies, but like the
minimum overlap length, this requirement will be scaled as the number of reads increases over that specified by -tpi (see below).
The minimum % ID value will asymptotically approach 100% from the value given by -mpi, decreasing the distance by half for every
log-scale increase to the number of input sequences.
- -tpi a
- (a)threshold seq num for scaling % ID for contig-edge assemblies
- For contig-edge assembly jobs, this is the number of sequences above which the minimum % identity of two sequences for
combining them into a contig will be drawn closer to 100% (see -mpi above). At and below this number of sequences,
the minimum % identity will be equal to that specified by -mpi.
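The interaction of -mpi and -tpi can be illustrated with a short Python sketch. The text above says only that the distance to 100% is halved for every log-scale increase in the number of sequences. The sketch assumes a base-10 logarithm measured relative to the -tpi threshold; both are assumptions made for illustration, and PRICE's actual scaling may differ.

```python
import math


def scaled_min_identity(num_seqs, mpi, tpi, log_base=10.0):
    """Minimum % identity for an assembly job with num_seqs sequences.

    At or below tpi sequences the value is mpi. Above it, the gap between
    the requirement and 100% is halved for each log_base-fold increase in
    num_seqs relative to tpi (the base and reference point are assumptions).
    """
    if num_seqs <= tpi:
        return float(mpi)
    log_steps = math.log(num_seqs / tpi, log_base)
    return 100.0 - (100.0 - mpi) * 0.5 ** log_steps
```

Under these assumptions, with -mpi 80 and -tpi 1000, a job with 10,000 sequences would require about 90% identity and one with 100,000 sequences about 95%, approaching but never reaching 100%.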
- -MPI a, -TPI a
- same as -mpi and -tpi above, but for meta-assembly
- NOTE: there is no minimum overlap value for meta-assembly. Meta-assembly only collapses highly-redundant
sequences that overlap entirely (or nearly entirely).
- -dbmax a
- (a) the maximum length sequence that will be fed into de Bruijn assembly
(recommended: max paired-end read length)
- The de Bruijn graph approach to genome assembly is highly efficient. However, when k-mers are repeated in
a genome, or when k-mers are palindromic, the graph representation loses information about the overall
structure of the genome. PRICE applies the de Bruijn graph approach only to local contig-edge assembly jobs,
thereby limiting the opportunity for such errors to cases of very local repeats and, by creating strand-specific
assembly jobs using paired-end topology to orient sequences, avoiding the confusion caused by palindromic sequences.
In order for PRICE to not dismantle the structure of the larger contigs that are also included in contig-edge
assembly jobs, an upper limit must be placed on the length of sequence that will be subjected to de Bruijn graph
representation. If -dbmax is set to the length of the input reads, then they will be efficiently collapsed
by this method, while the larger contigs will be retained as they are for pairwise alignment and collapse with one
another and with the contig(s) generated by the de Bruijn graph method.
- -dbk a
- (a) the k-mer size for de Bruijn assembly (recommended: keep less than the read length)
- The de Bruijn assembly strategy is only applied during the contig-edge assemblies, and then it is only applied
to short sequences (see -dbmax). The single-stranded nature of the contig-edge assembly jobs allows for the k-mer size
to be an even number without introducing errors from palindromic k-mers.
- -dbms a
- (a) the minimum number of sequences to which de Bruijn assembly will be applied
- The de Bruijn graph is a graph of nucleotide k-mers. Therefore, k-mers containing non-canonical nucleotide characters
(like N, which is often used in raw sequence files to represent a nucleotide whose identity was ambiguous) cannot be included
in the de Bruijn graph. Therefore, a single N in an input sequence will remove (k-mer size) k-mers from being entered into the
de Bruijn graph. In areas of redundant coverage, other sequences will be able to provide the path through the de Bruijn graph
to span the error. But in places of low coverage, a single error could prevent two sequences capable of forming a very high-quality
alignment from being combined. Places of low coverage can be identified as contig-edge assembly jobs with small numbers of short
(read-sized) sequences. -dbms allows one to bypass de Bruijn-based assembly of short sequences when very few such sequences are
present.
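The effect of a single ambiguous nucleotide on the k-mer set can be checked with a few lines of Python. This is a generic illustration of k-mer extraction rather than PRICE's code, and the function name is hypothetical.

```python
def valid_kmers(seq, k):
    """Return the k-mers of seq that contain only canonical nucleotides."""
    canonical = set("ACGT")
    kmers = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= canonical:
            kmers.append(kmer)
    return kmers


clean = "ACGTACGTACGTACGTACGT"
with_n = clean[:10] + "N" + clean[11:]  # one ambiguous base mid-read
lost = len(valid_kmers(clean, 5)) - len(valid_kmers(with_n, 5))
```

Here `lost` equals the k-mer size (5), because every k-mer window that overlaps the N is excluded. In a low-coverage job, no other sequence may be available to supply those k-mers.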
- -r a
- (a) alignment score reward for a nucleotide match; should be a positive integer (default=1)
- These alignment parameters (-r, -q, -G, and -E) are applied to all contexts in which alignments are generated using the
dynamic programming algorithm. Alignments are evaluated as satisfactory or unsatisfactory in terms of the specified % IDs
(fraction of nucleotides that match), but optimal alignments are sought and selected based on the alignment scores.
- -q a
- (a) alignment score penalty for a nucleotide mismatch; should be a negative integer (default=-2)
- See -r above.
- -G a
- (a) alignment score penalty for opening a gap; should be a negative integer (default=-5)
- See -r above.
- -E a
- (a) alignment score penalty for extending a gap; should be a negative integer (default=-2)
- See -r above.
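To show how the four scoring parameters combine, and how the score differs from the % identity used for the -mpi cutoff, here is a hedged Python sketch that scores an alignment that has already been computed. It assumes that a gap of length n costs G + n*E (one common affine-gap convention, not confirmed for PRICE), and the function names are hypothetical.

```python
def score_alignment(row_a, row_b, r=1, q=-2, G=-5, E=-2):
    """Score two equal-length gapped rows ('-' marks a gap position)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    gap_row = None  # which row the current gap run is in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != gap_row:
                score += G  # opening a new gap
            score += E      # every gap position pays the extension penalty
            gap_row = row
        else:
            score += r if x == y else q
            gap_row = None
    return score


def percent_identity(row_a, row_b):
    """% identity taken from the strand with the lower matching fraction."""
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    longer = max(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
    if longer == 0:
        return 0.0
    return 100.0 * matches / longer
```

With the default values, aligning ACGTTACG against ACG--ACG gives six matches and one two-position gap, for a score of 6 - 5 - 2 - 2 = -3 and a % identity of 75 (six matches over the eight nucleotides of the longer strand).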
- FILTERING READS:
- -rqf a b [c d]
- filters pairs of reads if either has an unacceptably high number of low-quality nucleotides, as defined
by the provided quality scores (only applies to files whose formats include quality score information).
- (a) the percentage of nucleotides in a read that must be high-quality; (b) the minimum allowed probability
of a nucleotide being correct (must be between 0 and 1, and will usually be a decimal value close to 1);
(c) and (d) optionally constrain this filter to use after (c) cycles have passed, to run for (d) cycles.
- This flag may be called multiple times to generate variable behavior across a PRICE run.
- -rnf a [b c]
- filters pairs of reads if either has an unacceptably high number of uncalled nucleotides (Ns or other ambiguous
IUPAC codes). Like -rqf, but will also filter fasta-format data.
- (a) the percentage of nucleotides in a read that must be called; (b) and (c) optionally constrain this filter
to use after (b) cycles have passed, to run for (c) cycles.
- This flag may be called multiple times to generate variable behavior across a PRICE run.
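The logic of the -rqf and -rnf filters can be sketched in Python as below. This is an illustration under stated assumptions, not PRICE's code: the function names are hypothetical, per-nucleotide scores are assumed to be already expressed as probabilities of being correct, and the thresholds are treated as inclusive.

```python
def passes_quality_filter(p_correct, min_pct_good, min_prob):
    """-rqf-style check: enough nucleotides have P(correct) >= min_prob."""
    if not p_correct:
        return False
    good = sum(1 for p in p_correct if p >= min_prob)
    return 100.0 * good / len(p_correct) >= min_pct_good


def passes_called_filter(seq, min_pct_called):
    """-rnf-style check: enough nucleotides are unambiguous A/C/G/T calls."""
    if not seq:
        return False
    called = sum(1 for nt in seq.upper() if nt in "ACGT")
    return 100.0 * called / len(seq) >= min_pct_called


def keep_pair(seq1, seq2, min_pct_called):
    """A pair is discarded if either of its reads fails the filter."""
    return (passes_called_filter(seq1, min_pct_called)
            and passes_called_filter(seq2, min_pct_called))
```

For example, `keep_pair("ACGTACGTAC", "ACNNN", 90)` is false: the first read is fully called, but the second is only 40% called, so the whole pair is dropped.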
- -maxHp a
- filters out a pair of reads if either read has a homo-polymer track >(a) nucleotides in length.
- This feature was inspired by mRNA transcript assembly jobs, for which reaching the poly-A tail at the end of the message could
spell disaster without it.
- -maxDi a
- filters out a pair of reads if either read has a repeating di-nucleotide track >(a) nucleotides in length.
- Note that this will also catch mono-nucleotide repeats of the specified length (a string of A's is also a string of
AA dinucleotides), so calling -maxHp in addition to -maxDi is superfluous unless -maxHp is given a smaller max value.
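The relationship between the two filters can be seen in a small Python sketch that measures the longest repeat track with a given period (1 for homo-polymers, 2 for di-nucleotide repeats). The function name and the exact way of measuring track length are assumptions for illustration.

```python
def longest_track(seq, period):
    """Length in nt of the longest run where seq[i] == seq[i - period]."""
    if not seq:
        return 0
    # Any stretch of up to `period` nucleotides trivially counts as a track.
    best = run = min(len(seq), period)
    for i in range(period, len(seq)):
        run = run + 1 if seq[i] == seq[i - period] else period
        best = max(best, run)
    return best


def fails_max_hp(seq, max_len):
    """-maxHp-style test: homo-polymer track longer than max_len."""
    return longest_track(seq, 1) > max_len


def fails_max_di(seq, max_len):
    """-maxDi-style test: di-nucleotide track longer than max_len."""
    return longest_track(seq, 2) > max_len
```

A run of six A's has a period-1 track of 6 and also a period-2 track of 6, so it is caught by both tests at the same threshold. By contrast, ACACAC is caught only by the di-nucleotide test.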
- -badf a b
- prevents reads with an ungapped match of at least (b)% identity to a sequence in file (a) from being mapped to contigs.
- This feature is intended to allow the user to avoid certain specified sequences. It is especially useful for avoiding performance
issues associated with repetitive elements (e.g. LINEs or SINEs).
- -repmask a b c d e f [g]
- uses coverage levels of constructed and/or input contigs to find repetitive elements and mask them
as if they were sequences input using -badf.
- (a) = cycle number (1-indexed) at which repeats will be detected.
- (b) = 's' if repeats will be sought at the start of the cycle or 'f' if they will be sought at the finish.
- (c) = the min. number of variance units above the median that will be counted as high-coverage.
- (d) = the min. fold increase in coverage above the median that will be counted as high-coverage.
- (e) = the min. size in nt for a detected repeat.
- (f) = reads with a match of at least this % identity to a repeat will not be mapped to contigs.
- (g) = an optional output file (.fasta or .priceq) to which the detected repeats will be written.
- -reset a
- re-introduces contigs that were previously not generating assembly jobs of their own.
(a) is the one-indexed cycle at which the contigs will be reset. Further cycle numbers (b, c, d, ...) may be
appended in the same way; any number of args may be given.
- In order to maintain efficiency, contigs whose sequence does not change from one cycle to another are not re-extended,
i.e. paired-end reads are not mapped to their edges. They can still be included in assembly jobs if the reads that map
to the edges of still-active contigs have paired-ends that map to them (which will bring them into those contigs' assembly
jobs). But sometimes, a contig that has stopped growing will be approached by another contig that is still growing, and
the same troublesome (probably low-coverage) area will stop that contig as well. In that case, the sum of the overhanging
reads from both of the two stalled contigs could allow the low-coverage region to be traversed. This arg provides an
opportunity for that to happen by using all of the contigs in the assembly with no regard for whether or not they had
changed during the prior cycle.
- FILTERING INITIAL CONTIGS:
- -icbf a b [c]
- prevents input sequences with a match of at least (b)% identity to a sequence in file (a) from being used.
This filter is optionally not applied to sequences of length greater than (c) nucleotides.
- similar to -badf (above)
- -icmHp a [b]
- filters out an initial contig if it has a homo-polymer track >(a) nucleotides in length.
This filter is optionally not applied to sequences of length greater than (b) nucleotides.
- similar to -maxHp (above)
- -icmDi a [b]
- filters out an initial contig if it has a repeating di-nucleotide track >(a) nucleotides in length.
This filter is optionally not applied to sequences of length greater than (b) nucleotides.
- NOTE: this will also catch mono-nucleotide repeats of the specified length (a string of A's is also a string
of AA's), so calling -icmHp in addition to -icmDi is superfluous unless -icmHp is given a smaller max value.
- similar to -maxDi (above)
- -icqf a b [c]
- filters out an initial contig if it has an unacceptably high number of low-quality nucleotides, as defined
by the provided quality scores (only applies to files whose formats include quality score information).
- (a) the percentage of nucleotides in an initial contig that must be high-quality; (b) the minimum allowed probability
of a nucleotide being correct (must be between 0 and 1, and will usually be a decimal value close to 1);
- This filter is optionally not applied to sequences of length greater than (c) nucleotides.
- similar to -rqf (above)
- -icnf a [b]
- filters out an initial contig if it has an unacceptably high number of uncalled nucleotides (Ns or other ambiguous
IUPAC codes). Like -icqf, but will also filter fasta-format data.
- (a) the percentage of nucleotides in an initial contig that must be called. This filter is optionally not applied to
sequences of length greater than (b) nucleotides.
- FILTERING/PROCESSING ASSEMBLED CONTIGS:
- -lenf a b
- filters out contigs shorter than (a) nt at the end of every cycle, after skipping (b) cycles.
- Recommended use is to set this to the read length or higher. De Bruijn-graph assembly can yield very short sequences (paths that
traverse too few k-mers to cover even a read length), but those are not eliminated unless a minimum contig length is specified
with -lenf.
- This filter can be applied variably across assembly iterations of a PRICE run. Since the object of PRICE is to continually
expand a set of existing contigs, length filters that are appropriate for removal of rogue de Bruijn k-mers, as described above,
may not be appropriate for later cycles, where contigs have grown to many kilobases and shorter contigs are likely failures or
off-target distractions. On the other hand, a 1kb length filter could remove all of the contigs from early cycles. The minimum
contig length can be scaled up by using the -lenf flag multiple times for the same job. For instance, if you used the flag twice:
"-lenf 50 0 -lenf 300 2", then a minimal length filter of 50nt would be applied immediately, and a more stringent filter of 300nt
would be applied starting in cycle 3. If you expected the contigs to continue to grow, then adding "-lenf 1000 7" would reflect that
expectation.
- Note that the length filters do not need to always be increasing. For instance, if you had a genomic region for which there was zero
read coverage, but there were paired-end reads spanning that region, you may want to build outboard contigs that could be scaffolded
with (but not joined to) your existing, large contigs. Initially, those outboard contigs would be small, so you would want to decrease
the length filter. In the above example, adding "-lenf 50 9" would lower the length requirement from 1000nt to 50nt in the 10th cycle
(i.e. after skipping 9 cycles). That would allow scaffold-able outboard contigs to form. If you then wanted to get rid of the
unsuccessful outboard contigs, you could ramp the length requirement back up with an additional call, e.g. "-lenf 500 14". So for
any given cycle, the imposed length requirement is the most-recently invoked requirement, NOT necessarily the longest requirement
invoked up to that point.
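The rule that the most recently invoked -lenf requirement wins can be expressed as a short Python sketch. The function name and the representation of the flags as (length, cycles skipped) pairs are hypothetical. Treating "no applicable flag" as a minimum length of 0 is an assumption, as is the handling of two calls with the same skip value.

```python
def min_contig_length(cycle, lenf_calls):
    """Length filter in force at the end of a 1-indexed cycle.

    lenf_calls holds one (min_len, cycles_skipped) pair per -lenf flag.
    A call is active once more than cycles_skipped cycles have started;
    among active calls, the one with the largest skip value applies.
    """
    active = [(skipped, min_len) for min_len, skipped in lenf_calls
              if skipped < cycle]
    if not active:
        return 0  # assumption: no filter before the first call takes effect
    return max(active, key=lambda call: call[0])[1]


# The worked example from the text above:
calls = [(50, 0), (300, 2), (1000, 7), (50, 9), (500, 14)]
schedule = {c: min_contig_length(c, calls) for c in (1, 3, 8, 10, 15)}
```

Here `schedule` maps cycles 1, 3, 8, 10 and 15 to 50, 300, 1000, 50 and 500 nt respectively, matching the walk-through above.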
- -trim a b [c]
- at the end of the (a)th cycle (indexed from 1), trim off the edges of contigs until reaching the minimum coverage
level (b), optionally deleting contigs shorter than (c) after trimming; this flag may be used repeatedly.
- This is an attempt to clear low-coverage sequence from the edges of contigs. This strategy presumes that such
sequence derives from poorly-supported mis-assemblies rather than low-coverage regions. If there is a small window of low
coverage but legitimate sequence, this will also be trimmed, but it should be able to be re-constructed in the next
cycle. This trim method will chew off nucleotides starting at each of the two edges of each contig and will continue
to do so until a score is found meeting the minimum coverage level. If (c) is specified and the resulting contig is
shorter than it allows, the whole contig will be culled.
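The edge-trimming behavior described above can be sketched in Python. This is a simplified illustration, not PRICE's implementation: the function name is hypothetical, and coverage is assumed to be available as one score per nucleotide.

```python
def trim_contig(seq, coverage, min_cov, min_len=None):
    """Trim low-coverage nucleotides from both edges of a contig.

    Trimming from each edge stops at the first position whose coverage
    meets min_cov; low-coverage positions in the interior are untouched.
    Returns None if the trimmed contig is shorter than min_len.
    """
    start, end = 0, len(seq)
    while start < end and coverage[start] < min_cov:
        start += 1
    while end > start and coverage[end - 1] < min_cov:
        end -= 1
    if min_len is not None and (end - start) < min_len:
        return None  # whole contig is culled
    return seq[start:end], coverage[start:end]
```

For a contig ACGTACGT with coverage 1, 2, 5, 1, 6, 5, 2, 1 and a minimum coverage of 5, two nucleotides are removed from each edge, leaving GTAC. The interior position with coverage 1 is kept because trimming stops at the first edge position that meets the threshold.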
- -trimB a b [c]
- basal trim; after skipping (a) cycles, trim off the edges of contigs until reaching the minimum coverage
level (b) at the end of EVERY cycle, optionally deleting contigs shorter than (c) after trimming.
-trimB may be called many times, and multiple calls will interact in the same way as multiple -lenf calls
(explained above). A call to -trim will override the basal trim values for that specified cycle only.
- Errors can occur at any time; this command allows a basal level of trim-back to be set up for the entire PRICE
run. It is also set up to be flexible, so that the basal trimming can be modified through different phases
of the assembly process (say, as the read sets being used are modified, and with them the expected level of
coverage for newly-added contig sequence).
- -trimI a [b]
- initial trim; input initial contigs are trimmed before being used by PRICE to seed assemblies. Contigs are
trimmed from their outside edges until reaching the minimum coverage level (a), optionally deleting contigs
shorter than (b) after trimming. This flag is most appropriate for .priceq input, can be appropriate for
.fastq input, and is inappropriate for .fasta input. It will be equally applied to ALL input contigs.
- When restarting an assembly, this allows edge errors from the prior assembly to be dealt with before the new
assembly gets started. This feature was intended for use with .priceq input files, which retain the coverage
scores that -trim normally deals with. It can also be used with .fastq-formatted data to remove low-quality
nucleotides from the end of a read (decimal values are allowed for argument (a)). It is not appropriate for
use with .fasta-format data (the scores are uniform across those sequences, so this filter will either not
modify the input data or, with a high value of (a), delete the entire input dataset).
- -target a b [c d]
- limit output contigs to those with matches to input initial contigs at the end of each cycle.
(a) % identity to an input initial contig to count as a match (ungapped); (b)num cycles to skip
before applying this filter. [c and d are optional, but must both be provided if either is]
After target filtering has begun, target-filtered/-unfiltered cycles will alternate with (c)
filtered cycles followed by (d) unfiltered cycles.
- This feature was inspired by target virus assembly jobs using metagenomic datasets generated through random hexamer
priming of RNA templates. The tendency of RT to switch templates generated chimeric paired-ends, which would
allow jobs seeded with viral sequences to tangentially begin assembling contigs from other metagenome components.
Even with more reliable library prep methods, low-frequency chimeric amplicons can spawn such off-target assemblies,
as can repetitive genomic sequence or incorrect read mapping. This feature eliminates contigs at the end of each cycle
that do not retain identity to the seed sequences. Cycles can be skipped both initially and/or periodically through the
PRICE run using the args. This allows short contigs to spawn the assembly of larger contigs that will not be able to
immediately close the gap between them and their parent contig.
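The cycle schedule implied by the (b), (c) and (d) arguments of -target can be sketched in Python. The function name is hypothetical, cycles are assumed to be 1-indexed, and the alternation is assumed to start with a filtered block immediately after the skipped cycles.

```python
def target_filter_applied(cycle, skip, n_filtered=None, n_unfiltered=None):
    """Whether the target filter runs at the end of a 1-indexed cycle."""
    if (n_filtered is None) != (n_unfiltered is None):
        raise ValueError("(c) and (d) must be provided together")
    if cycle <= skip:
        return False  # still within the initially skipped cycles
    if n_filtered is None:
        return True   # no alternation requested: filter every cycle
    period = n_filtered + n_unfiltered
    if period <= 0:
        raise ValueError("(c) + (d) must be positive")
    return (cycle - skip - 1) % period < n_filtered


# Skip 2 cycles, then alternate 2 filtered cycles with 1 unfiltered cycle.
pattern = [target_filter_applied(c, 2, 2, 1) for c in range(1, 9)]
```

With these arguments, cycles 1 and 2 are unfiltered, cycles 3 and 4 are filtered, cycle 5 is unfiltered, and the pattern then repeats. The unfiltered cycles are what give short off-parent contigs a chance to grow before the next filtering step.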
- -targetF a b [c d]
- the same as -target, but now matches to all reads in the input set will be specified, not just
the ones that have been introduced up to that point (this is FullFile mode).
- COMPUTATIONAL EFFICIENCY:
- -a x
- (x)num threads to use
- Many aspects of PRICE are threaded for multi-core CPUs. Thread use will rise and fall as data from parallel operations
is re-synchronized in the main thread. Threading will generally be at its most efficient during mapping (dropping while files are
being read) and during contig-edge assembly. Thread use will generally be the most uneven during meta-assembly (or at the beginning
of contig-edge assembly if there are a small number of very large contig-edge assembly jobs).
- -mtpf a
- (a)max threads per file
- This variable allows the reading from a single file to be threaded. It is not expected to improve performance reading
files stored on disc drives but may enhance performance on solid-state drives (not tested).
- USER INTERFACE:
- -log a
- determines the type of output and can make the output verbose (lots of time stamp tags)
- (a) = c: concise stdout (default)
- (a) = n: no stdout
- (a) = v: verbose stdout
- -logf a
- (a)the name of an output file for verbose log info to be written (doesn't change stdout format)
- Recommendation: use option c (concise/default) output for stdout for viewing and, if desired, use -logf to create
a supplemental verbose record of your run. While you may want to mine the verbose file for details
about the run, it will generally not be a good interface for keeping track of the job status.
- -h, or --help
- user interface info.
- NOTE: no job will be run if this flag is used, regardless of which other flags are given.