PRICE Documentation: File Formats

The file formats described below are used and/or generated by PRICE. In some cases, they follow specifications that have been outlined elsewhere. PRICE was built in accordance with those specifications; any additional rules governing how PRICE interprets the files are outlined below.

File Format      Uses
Fasta            Input/output sequence data
Fastq            Input sequence data
_sequence.txt    Input sequence data
Priceq           Input/output sequence data

Fasta

Fasta is a multi-line text file format. Each sequence entry begins with a line starting with the greater-than sign (">") ASCII character, followed by a sequence name (not explicitly used by PRICE), with no whitespace between the two. The sequence begins on the next line of the file and continues until another line beginning with a greater-than sign (">").

Nucleotide sequences may consist of all alphabetical IUPAC characters in upper or lower case. Whitespace characters are skipped, as are dashes ("-"). U's are converted to T's for internal representation and output. All ambiguity characters are treated as N's. "." is also treated as an N.
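The character rules above can be sketched as a small normalization function. This is only an illustration of the rules as described, not PRICE's own code: the function name is hypothetical, and two behaviors are assumptions that the text does not specify (the output is written in upper case, and a non-IUPAC character raises an error).

```python
# IUPAC ambiguity codes (everything other than A, C, G, T/U), plus ".".
AMBIGUOUS = set("RYSWKMBDHVN.")


def normalize_fasta_sequence(raw: str) -> str:
    """Apply the fasta character rules described above to raw sequence text."""
    out = []
    for ch in raw:
        # Whitespace and dashes are skipped.
        if ch.isspace() or ch == "-":
            continue
        up = ch.upper()  # assumption: case is folded to upper case
        if up == "U":
            out.append("T")  # U's are converted to T's
        elif up in AMBIGUOUS:
            out.append("N")  # ambiguity characters and "." become N
        elif up in "ACGT":
            out.append(up)
        else:
            # Assumption: the handling of non-IUPAC characters is not
            # described in the text; this sketch simply rejects them.
            raise ValueError(f"unexpected character {ch!r}")
    return "".join(out)
```

For fastq and priceq input the same rules would apply except that "-" is not supported, so the dash branch would need to reject the character instead of skipping it.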

For more information, see http://en.wikipedia.org/wiki/FASTA_format

Fastq

Fastq is a four-line text file format. Each set of four lines corresponds to a sequence read.

PRICE requires that lines 2 and 4 of each set (the nucleotide sequence and its score string) contain the same number of characters.

Nucleotide sequences may consist of all alphabetical IUPAC characters in upper or lower case. "-" is not supported. U's are converted to T's for internal representation and output. All ambiguity characters are treated as N's. "." is also treated as an N.

Score strings may consist of ASCII characters with decimal values ranging from 33 ("!") to 126 ("~"). These characters indicate error probabilities according to the Phred specification for fastq. ASCII values are converted to "Sanger" values by subtracting the ASCII value of the first allowed character (33; Phred+33 encoding). Those Sanger values are related to the probability of a nucleotide being mis-called by the formula Q = -10 log10( p ), where Q is the Sanger score and p is the probability of the base being incorrectly called. Sanger scores of 2 or less (ASCII values 33-35) are treated by PRICE as a score of 0 (100% chance of being incorrect), and the corresponding nucleotides are converted to N's. ASCII values outside of the allowed range 33-126 will cause PRICE to fail with an exception indicating the file and the offending character.
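As a worked illustration of the Phred+33 rules above, the following sketch decodes one read. The function name and return shape are hypothetical and chosen for this example only; the sketch assumes that rearranging Q = -10 log10( p ) to p = 10^(-Q/10) is all that is needed for scores above 2.

```python
from typing import List, Tuple


def decode_phred33(seq: str, scores: str) -> Tuple[str, List[float]]:
    """Return (masked sequence, per-base error probabilities) for one read."""
    if len(seq) != len(scores):
        raise ValueError("sequence and score lines differ in length")
    bases = []
    probs = []
    for base, ch in zip(seq, scores):
        ad = ord(ch)
        if ad < 33 or ad > 126:
            raise ValueError(f"illegal score character {ch!r} (ASCII {ad})")
        q = ad - 33  # Sanger value under Phred+33
        if q <= 2:
            # Scores of 2 or less count as 0: the base becomes N with p = 1.
            bases.append("N")
            probs.append(1.0)
        else:
            bases.append(base)
            probs.append(10 ** (-q / 10))  # p = 10^(-Q/10)
    return "".join(bases), probs
```

For example, the score character "5" (ASCII 53) gives Q = 20 and therefore p = 0.01, while "#" (ASCII 35) gives Q = 2 and the base is replaced by N.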

For more information, see http://en.wikipedia.org/wiki/FASTQ_format

_sequence.txt

This was a variant of fastq that was output by Illumina basecalling software (CASAVA pre-v1.8). It shares the four-line structure of fastq but encodes the scores with different ASCII characters (Phred+64 as opposed to Phred+33).

NOTE: older Solexa/Illumina basecalling (CASAVA pre-v1.3) would allow for values less than zero (ASCII characters lexicographically smaller than "@"). PRICE allows Sanger values as low as -5 / ASCII values as low as 59 (character ";") in _sequence.txt files. However, in accordance with the CASAVA 1.5+ specifications, any Sanger value of 2 or smaller / ASCII value of 66 ("B") or smaller will be treated as 0, and the nucleotide identities at positions with zero scores are changed to IUPAC N's (uncalled bases) by PRICE. Characters with Sanger values below -5 (ASCII values below 59, ";") will cause PRICE to fail with an exception indicating that an illegal character was encountered, what that character was, and the file. The upper-limit ASCII value is 126 ("~"), which has a Sanger value of 62. Higher values will also raise an exception.

Various Solexa/Illumina basecalling software versions have also used modifications to the standard Phred formula for calculating basecall accuracy. Except for the zero-exceptions described above, PRICE always uses the Phred formula Q = -10 log10( p ) where Q is the Sanger score (the ASCII score minus the offset; for _sequence.txt, the offset is 64) and p is the probability of the base being incorrectly called. This is the specification for CASAVA v1.3+.
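The _sequence.txt score handling described above might be sketched as follows. The constant and function names are hypothetical; the limits (59, 66, 126) and the offset (64) are taken from the text.

```python
OFFSET = 64      # Phred+64 offset used by _sequence.txt
MIN_ASCII = 59   # ";" corresponds to a Sanger value of -5
MAX_ASCII = 126  # "~" corresponds to a Sanger value of 62


def sequence_txt_score(ch: str) -> int:
    """Convert one _sequence.txt score character to the score PRICE would use."""
    ad = ord(ch)
    if ad < MIN_ASCII or ad > MAX_ASCII:
        raise ValueError(f"illegal score character {ch!r} (ASCII {ad})")
    q = ad - OFFSET
    # Sanger values of 2 or smaller (ASCII 66, "B", or smaller) count as 0.
    return 0 if q <= 2 else q


def error_probability(q: int) -> float:
    """Invert Q = -10 log10(p) to get the mis-call probability p."""
    return 10 ** (-q / 10)
```

Under these rules "h" (ASCII 104) decodes to a score of 40, and every character from ";" through "B" decodes to 0, which marks the corresponding base as N.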

For more information, see http://en.wikipedia.org/wiki/FASTQ_format

Priceq

Priceq is a six-line text file format specified for use by PRICE. Each set of six lines corresponds to a contig. The format is similar to that of fastq, but with ASCII scores representing fold-coverage (confidence; level of support) and with an additional pair of lines providing evidence for the adjacency of nucleotides (i.e. confidence in the DNA phosphodiester bonds between nucleotides).

PRICE requires that lines 2 and 4 of each set (the nucleotide sequence and its confidence scores) contain the same number of characters and that line 6 (the phosphodiester scores) contains one fewer character than line 2 or 4.
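The length constraints above can be checked with a short reader such as the sketch below. It assumes that line 2 is the nucleotide sequence, line 4 the nucleotide scores, and line 6 the phosphodiester scores, which is inferred from the description rather than stated outright. Lines 1, 3, and 5 are treated as opaque because their contents are not described here. The function name is hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_priceq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield six-line records after checking the priceq length constraints."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) == 6:
            # Assumed roles: index 1 = sequence, 3 = nucleotide scores,
            # 5 = phosphodiester scores. Indices 0, 2, 4 are not inspected.
            seq, nuc_scores, bond_scores = record[1], record[3], record[5]
            if len(nuc_scores) != len(seq):
                raise ValueError("lines 2 and 4 differ in length")
            if len(bond_scores) != len(seq) - 1:
                raise ValueError("line 6 must be one character shorter than line 2")
            yield record
            record = []
    if record:
        raise ValueError("file ends with an incomplete six-line record")
```

A contig of n nucleotides has n - 1 bonds between adjacent nucleotides, which is why line 6 is one character shorter.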

Nucleotide sequences may consist of all alphabetical IUPAC characters in upper or lower case. "-" is not supported. U's are converted to T's for internal representation and output. All ambiguity characters are treated as N's. "." is also treated as an N.

Nucleotide and phosphodiester confidence score strings may consist of ASCII characters with decimal values ranging from 33 ("!") to 126 ("~"). Values >126 are allowed but not used by PRICE (they are scored the same as 126; see scoring function below). These characters indicate the number of read counts supporting the shown nucleotide identity/phosphodiester bond in excess of the number of dissenting counts. This reflects the internal representation of contigs/sequences in PRICE: when two aligned sequences are combined into a contig by PRICE, the confidence scores of any agreeing nucleotide identity or phosphodiester bond are added together. For conflicting nucleotide identities, the identity with the higher confidence score is selected, and the confidence score of the losing identity is subtracted from that of the winner to yield the new contig's confidence score at that position.
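The merging rule for a single aligned position can be illustrated with the sketch below. The function name is hypothetical, and the tie-breaking behavior (equal scores on conflicting identities) is an assumption, since the text does not say which identity is kept in that case.

```python
from typing import Tuple


def merge_nucleotide(base_a: str, score_a: float,
                     base_b: str, score_b: float) -> Tuple[str, float]:
    """Combine two aligned nucleotides into one (identity, confidence) pair."""
    if base_a == base_b:
        # Agreeing identities: confidence scores are added together.
        return base_a, score_a + score_b
    # Conflicting identities: the higher score wins and the loser's score
    # is subtracted from it.
    if score_a >= score_b:  # assumption: the first sequence wins a tie
        return base_a, score_a - score_b
    return base_b, score_b - score_a
```

For example, merging A (score 3) with C (score 1) yields A with a score of 2, which matches the "supporting counts in excess of dissenting counts" interpretation of the score.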

Phosphodiester bonds conflict in the context of gapped alignments. In that case, the single phosphodiester score spanning the gap is compared to the average of the two flanking the insert. If the gap wins, then the inserted nucleotides are eliminated and the average of the two relevant phosphodiester scores is subtracted from the one spanning the gap. In the inverse case, the phosphodiester score spanning the gap is divided between the two winning phosphodiester scores and a proportional quantity is subtracted from each. If a and b are the two winning phosphodiester scores flanking an insert and c is the phosphodiester score for the losing gap, then the new phosphodiester scores flanking the insert, A and B, are calculated according to the formulae A = a - c * a / (a + b) and B = b - c * b / (a + b). This approach prevents scores from ever being less than zero.
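The two outcomes of a gapped-alignment conflict can be worked through numerically with the sketch below. The function name and return shape are hypothetical. Two details are assumptions not covered by the text: a tie is resolved in favor of the insert, and the degenerate case a + b = 0 leaves the scores unchanged to avoid dividing by zero.

```python
from typing import List, Tuple


def resolve_gap_conflict(a: float, b: float, c: float) -> Tuple[str, List[float]]:
    """a, b: bond scores flanking the insert; c: bond score spanning the gap."""
    average = (a + b) / 2
    if c > average:
        # Gap wins: the insert is dropped and the average of the flanking
        # scores is subtracted from the gap-spanning score.
        return "gap", [c - average]
    total = a + b
    if total == 0:
        # Assumption: with no support on either side there is nothing to split.
        return "insert", [a, b]
    # Insert wins: c is split between a and b in proportion to their sizes.
    new_a = a - c * a / total
    new_b = b - c * b / total
    return "insert", [new_a, new_b]
```

With a = 4, b = 6, and c = 2, the insert wins and the new scores are A = 3.2 and B = 4.8. Their sum is a + b - c = 8, and because c cannot exceed the average of a and b in this branch, neither result can be negative.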

Both nucleotide and phosphodiester confidence scores are internally represented as decimal numbers. Non-integer numbers are introduced by scored input data (fastq/_sequence.txt/priceq) or by normalization of data across multiple contexts (for instance, a read deriving from a repetitive sequence element that aligns with equal scores to multiple copies of that element in the assembly). That property is represented as accurately as possible by the priceq scoring function.

Scoring function: Confidence scores are calculated from ASCII characters according to the following scheme, where AD is the decimal value of the ASCII score character:

AD        Score
33        0.0
34        0.3
35-125    1.5 ^ (AD - 35)
>=126     1.5 ^ (126 - 35)
Note that when priceq files are being created, in the score->AD conversion, any score >= 0.9 will be upgraded to 35 and will therefore be re-interpreted as 1. In all other cases, values are rounded down to the nearest matching AD value.
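The scoring scheme and the write-time conversion can be sketched as a pair of functions. The function names are hypothetical, and the sketch assumes that "rounded down" means choosing the largest AD whose score does not exceed the given value and that output is capped at AD 126.

```python
def ad_to_score(ad: int) -> float:
    """Decode a priceq score character (by its decimal ASCII value AD)."""
    if ad < 33:
        raise ValueError(f"AD {ad} is below the allowed minimum of 33")
    if ad == 33:
        return 0.0
    if ad == 34:
        return 0.3
    # AD values of 126 and above all decode to 1.5 ^ (126 - 35).
    return 1.5 ** (min(ad, 126) - 35)


def score_to_ad(score: float) -> int:
    """Encode a confidence score as an AD value for writing a priceq file."""
    if score < 0:
        raise ValueError("confidence scores cannot be negative")
    if score >= 0.9:
        # Scores from 0.9 upward start at AD 35 (re-read as 1.0), then
        # round down to the largest AD whose score is <= the given score.
        ad = 35
        while ad < 126 and 1.5 ** (ad + 1 - 35) <= score:
            ad += 1
        return ad
    if score >= 0.3:
        return 34
    return 33
```

Because consecutive AD values differ by a factor of 1.5, a score written to a priceq file and read back can shrink by up to about a third, except in the 0.9 to 1.0 range where it is rounded up to 1.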