PRICE Documentation: Independent Quality Filter

PRICE provides a number of features for dynamically filtering input data across the cycles of a run, based on properties such as sequence length, quality, and similarity to a reference sequence. However, if a consistent set of filters is to be applied across all cycles of a run, a considerable speed advantage can be gained by generating a set of files ahead of time with the filters already applied. The PriceSeqFilter executable provides that functionality and exposes many of the same filtering features that are accessible through the PriceTI command-line interface.

Command-line Arguments

The command-line arguments for PriceSeqFilter are described below. Brief descriptions, including current default values, can be printed at the command line by running PriceSeqFilter with the -h or --help flag.

Usage: ./PriceSeqFilter [args]

INPUT/OUTPUT FILES:

Accepted formats are fasta (.fa or .fasta), fastq (.fq, .fastq, or _sequence.txt), or priceq (.pq or .priceq)
NOTE: input and output file types must match, both in terms of file format and paired vs unpaired.
NOTE ABOUT FASTQ ENCODING: multiple encodings are currently used for fastq quality scores. The traditional encoding is Phred+33, and PRICE will interpret scores from any .fq or .fastq file according to that encoding. The Phred+64 encoding has been used extensively by Illumina, and so it is applied to files with Illumina's commonly used _sequence.txt suffix. Please make sure that your encoding matches your file suffix.
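
To make the encoding note concrete, the following Python sketch maps a file suffix to the quality offset described above and converts one quality character to a probability of a correct base call. The function names are hypothetical and this is not PRICE's own code; the conversion uses the standard Phred definition, P(correct) = 1 - 10^(-Q/10).

```python
def phred_offset(filename):
    """Return the ASCII offset implied by the file suffix, per the note above."""
    if filename.endswith("_sequence.txt"):
        return 64  # Illumina-style suffix: Phred+64
    if filename.endswith((".fq", ".fastq")):
        return 33  # traditional encoding: Phred+33
    raise ValueError("not a recognized fastq suffix: " + filename)


def prob_correct(quality_char, offset):
    """Convert one quality character to the probability that the base call is correct."""
    q = ord(quality_char) - offset        # Phred score Q
    return 1.0 - 10.0 ** (-q / 10.0)      # standard Phred definition


# The same underlying score (Q40) is written as 'I' in Phred+33 and 'h' in Phred+64.
print(prob_correct("I", phred_offset("reads.fastq")))
print(prob_correct("h", phred_offset("s_1_sequence.txt")))
```

Reading a Phred+64 file with the Phred+33 offset would inflate every score by 31, which is why the suffix has to match the encoding.
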
INPUT FILES:

-f a

(a) input file of non-paired sequences

-fp a b

(a,b) two input files of sequences; the sequences in one file are the paired-ends of those in the other.

OUTPUT FILES:

-o a

(a) output file of non-paired sequences

-op a b

(a,b) two output files of sequences; the sequences in one file are the paired-ends of those in the other.
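
As an illustration of how the file flags fit together, the Python sketch below assembles a paired-end command line and checks that all four files share a format, following the matching rule in the note above. The file names and helper names are hypothetical. Whether PRICE treats .fq and _sequence.txt files as a matching type is not stated here, so the sketch simply groups suffixes by the three formats listed.

```python
import shlex

# Suffix-to-format table taken from the accepted formats listed above.
FORMATS = {
    ".fa": "fasta", ".fasta": "fasta",
    ".fq": "fastq", ".fastq": "fastq", "_sequence.txt": "fastq",
    ".pq": "priceq", ".priceq": "priceq",
}


def file_format(filename):
    """Return the format name implied by a file's suffix."""
    for suffix, fmt in FORMATS.items():
        if filename.endswith(suffix):
            return fmt
    raise ValueError("unrecognized suffix: " + filename)


def paired_filter_command(in_a, in_b, out_a, out_b):
    """Build the argument list for a paired-end run (-fp for input, -op for output)."""
    formats = {file_format(name) for name in (in_a, in_b, out_a, out_b)}
    if len(formats) != 1:
        raise ValueError("input and output file formats must match")
    return ["./PriceSeqFilter", "-fp", in_a, in_b, "-op", out_a, out_b]


cmd = paired_filter_command("reads_1.fq", "reads_2.fq", "kept_1.fq", "kept_2.fq")
print(shlex.join(cmd))
```

Filtering flags from the sections below would be appended to the same argument list.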

OTHER PARAMS:

-r a

(a) alignment score reward for a nucleotide match; should be a positive integer (default=1)
These alignment parameters (-r, -q, -G, and -E) are applied to all contexts in which alignments are generated using the dynamic programming algorithm. Alignments are evaluated as satisfactory or unsatisfactory in terms of the specified % IDs (fraction of nucleotides that match), but optimal alignments are sought and selected based on the alignment scores.

-q a

(a) alignment score penalty for a nucleotide mismatch; should be a negative integer (default=-2)
See -r above.

-G a

(a) alignment score penalty for opening a gap; should be a negative integer (default=-5)
See -r above.

-E a

(a) alignment score penalty for extending a gap; should be a negative integer (default=-2)
See -r above.
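
The distinction between alignment score and % ID can be shown with a small Python sketch that evaluates an already-gapped alignment under the four parameters above. It rests on two assumptions that this page does not confirm: a gap of n positions is charged G once plus E for each of its n positions, and % ID is computed over alignment columns. The function is illustrative and is not PRICE's implementation.

```python
def score_alignment(a, b, r=1, q=-2, G=-5, E=-2):
    """Score two equal-length aligned strings ('-' marks a gap position).

    Returns (alignment score, percent identity over alignment columns).
    Assumption: a gap of n positions costs G + n * E.
    """
    if len(a) != len(b) or not a:
        raise ValueError("aligned strings must be non-empty and of equal length")
    score = 0
    matches = 0
    open_gap = None  # which row ('a' or 'b') currently has an open gap
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if open_gap != row:
                score += G           # opening a new gap
                open_gap = row
            score += E               # every gap position pays the extension cost
        else:
            open_gap = None
            if x == y:
                score += r
                matches += 1
            else:
                score += q
    return score, 100.0 * matches / len(a)


# 7 matches, 1 mismatch, 1 single-position gap, using the default parameters.
score, pct_id = score_alignment("ACGT-ACGT", "ACGTTACCT")
print(score, round(pct_id, 2))
```

The score is used to choose among candidate alignments, while the % ID of the chosen alignment is what gets compared with a filter threshold.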

FILTERING RULES:

-pair a

(a) is "both" or "either", describing whether both reads of a pair or either read of a pair must FAIL the filters in order for the entire pair to be removed. If paired files are used as input, the read pairs will be retained or discarded together. If you want the sequences to be individually evaluated, run twice using each file as an individual (non-paired) input file. (default=either)
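
The two settings of -pair reduce to a logical AND versus a logical OR over the per-read results. A minimal Python sketch, with a hypothetical function name:

```python
def pair_is_removed(read1_fails, read2_fails, mode="either"):
    """Decide whether a read pair is discarded, given each read's filter result."""
    if mode == "both":
        return read1_fails and read2_fails   # both reads must fail
    if mode == "either":
        return read1_fails or read2_fails    # one failing read is enough
    raise ValueError("mode must be 'both' or 'either'")


# A pair with one failing read is removed under the default, but kept under "both".
print(pair_is_removed(True, False, "either"), pair_is_removed(True, False, "both"))
```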

FILTERING SEQUENCES:

-rqf a b

filters out sequences with an unacceptably high number of low-quality nucleotides, as defined by the provided quality scores (only applies to files whose formats include quality score information).
(a) the percentage of nucleotides in a read that must be high-quality; (b) the minimum allowed probability of a nucleotide being correct (must be between 0 and 1, and will usually be a decimal value close to 1).
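
The -rqf rule can be sketched in Python as follows. The sketch assumes that (a) is expressed on a 0-100 scale and that quality characters are converted with the standard Phred formula; the function name is hypothetical and the code is an illustration, not PRICE's implementation.

```python
def passes_rqf(quality_string, offset, min_pct_good, min_prob_correct):
    """Return True if enough nucleotides meet the per-base probability cutoff.

    offset is 33 or 64, depending on the fastq encoding.
    """
    if not quality_string:
        return False
    good = 0
    for ch in quality_string:
        q = ord(ch) - offset
        if 1.0 - 10.0 ** (-q / 10.0) >= min_prob_correct:
            good += 1
    return 100.0 * good / len(quality_string) >= min_pct_good


# Four Q40 bases and one Q2 base: 80% of the read is high-quality at a 0.95 cutoff.
print(passes_rqf("IIII#", 33, 80, 0.95))
print(passes_rqf("IIII#", 33, 90, 0.95))
```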

-rnf a

filters pairs of reads if either has an unacceptably high number of uncalled nucleotides (Ns or other ambiguous IUPAC codes). Like -rqf, but will also filter fasta-format data.
(a) the percentage of nucleotides in a read that must be called
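
A comparable Python sketch for -rnf, assuming that "called" means one of A, C, G, or T and that (a) is on a 0-100 scale. The function name is hypothetical.

```python
def passes_rnf(sequence, min_pct_called):
    """Return True if the percentage of called (A/C/G/T) nucleotides is high enough."""
    if not sequence:
        return False
    called = sum(1 for base in sequence.upper() if base in "ACGT")
    return 100.0 * called / len(sequence) >= min_pct_called


# 8 of 10 nucleotides are called, so the read passes at 80 but not at 90.
print(passes_rnf("ACGTNNACGT", 80), passes_rnf("ACGTNNACGT", 90))
```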

-maxHp a

filters out a pair of reads if either read has a homopolymer tract >(a) nucleotides in length.

-maxDi a

filters out a pair of reads if either read has a repeating di-nucleotide tract >(a) nucleotides in length.
NOTE: this will also catch mono-nucleotide repeats of the specified length (a string of A's is also a string of AA's), so calling -maxHp in addition to -maxDi is superfluous unless -maxHp is given a smaller max value.
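
The relationship between -maxHp and -maxDi can be seen by measuring both kinds of repeat with a single helper. The Python sketch below is one plausible way to measure such repeats, with lengths counted in nucleotides; how PRICE counts them internally is not described here, and the function name is hypothetical.

```python
def longest_repeat_tract(sequence, period):
    """Length in nt of the longest run in which each base equals the base
    `period` positions before it (period=1: homopolymer, period=2: di-nucleotide)."""
    seq = sequence.upper()
    if len(seq) <= period:
        return len(seq)
    best = current = period
    for i in range(period, len(seq)):
        if seq[i] == seq[i - period]:
            current += 1          # the repeat continues
        else:
            current = period      # restart with the last `period` bases
        best = max(best, current)
    return best


seq = "GGAAAAAC"
# The run of five A's is a homopolymer tract and also a di-nucleotide (AA) tract.
print(longest_repeat_tract(seq, 1), longest_repeat_tract(seq, 2))
# A true di-nucleotide repeat is only detected with period 2.
print(longest_repeat_tract("ACACACGT", 1), longest_repeat_tract("ACACACGT", 2))
```

A read would fail -maxHp (a) when the period-1 length exceeds (a), and fail -maxDi (a) when the period-2 length exceeds (a).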

-badf a b

sequences fail if they match with at least (b)% identity to a sequence in file (a).

-goodf a b

sequences fail if they DON'T match with at least (b)% identity to a sequence in file (a).
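
The -badf and -goodf rules are mirror images of each other. In the Python sketch below, best_pct_id stands for the highest % identity found between a read and any sequence in file (a), or None if no alignment was found. Both the function names and that input convention are assumptions made for illustration.

```python
def fails_badf(best_pct_id, min_pct_id):
    """A read fails -badf if it matches a 'bad' sequence at or above the threshold."""
    return best_pct_id is not None and best_pct_id >= min_pct_id


def fails_goodf(best_pct_id, min_pct_id):
    """A read fails -goodf if it does NOT match any 'good' sequence at or above the threshold."""
    return best_pct_id is None or best_pct_id < min_pct_id


# A read that is 92% identical to a reference sequence, with a threshold of 90.
print(fails_badf(92.0, 90), fails_goodf(92.0, 90))
```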

-lenf a

filters out sequences shorter than (a) nt.

COMPUTATIONAL EFFICIENCY:

-a x

(x) number of threads to use

USER INTERFACE:

-log a

determines the type of standard output:

(a) = c: concise stdout (default)
(a) = n: no stdout

-h, or --help

prints user interface info.
NOTE: no job will be run if this flag is used, regardless of which other flags are used.