PRICE Documentation: Module Structure

Key Interfaces

The interfaces described below reveal the logical structure of PRICE's design and its overall approach to assembly. Because PRICE was written to accommodate a variety of assembly goals and challenges, it includes implementations of a wide variety of sometimes dissimilar assembly strategies. Understanding these interfaces provides context for understanding the implementing classes, and developers wishing to add functionality (including the original author) will be able to achieve most goals by introducing new classes for just these interfaces.

ScoredSeq

This is the representation for sequences (reads and/or contigs) in PRICE. Abstractly, a ScoredSeq has three properties: 1) a series of nucleotide identities, 2) a series of confidence scores for the indicated nucleotide identities, one per nucleotide, and 3) a series of confidence scores for the adjacency of two nucleotides (essentially a score for the existence of the indicated 3'-5' phosphodiester bond). Both kinds of score are decimal numbers that reflect the amount of data in support of the nucleotide identity or phosphodiester bond. For example, a nucleotide at a position covered by 5 sequences, with all of the sequences in agreement on that nucleotide's identity, would have a score of 5. The decimal nature of the score allows quality scores (such as those provided by .fastq files) to be taken into account: each read's contribution is discounted by the probability that its nucleotide was mis-called. So in the example above, if another read were added in support of the nucleotide identity, but that read had a 1% scored probability of being incorrect, the resulting contig's score at that position would become 5.99. Ambiguous nucleotides (N's) have a score of zero by definition. Intermediate ambiguities (e.g. Y for pyrimidines) are not allowed. All nucleotides are represented as upper-case DNA letters (A, T, C, G; U is not recognized by a ScoredSeq).
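
The three properties and the score arithmetic above can be sketched as follows. This is an illustrative Python model only, not PRICE's code: the field names, the validation rules, the helper add_support and the link-score values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

ALLOWED = set("ACGTN")  # upper-case DNA letters plus the ambiguous N


@dataclass(frozen=True)
class ScoredSeq:
    """Hypothetical stand-in for a ScoredSeq: bases plus two score series."""
    bases: str
    nuc_scores: List[float]   # one confidence score per nucleotide
    link_scores: List[float]  # one score per adjacent pair of nucleotides

    def __post_init__(self):
        if not set(self.bases) <= ALLOWED:
            raise ValueError("only A, T, C, G and N are accepted")
        if len(self.nuc_scores) != len(self.bases):
            raise ValueError("need one nucleotide score per base")
        if len(self.link_scores) != max(len(self.bases) - 1, 0):
            raise ValueError("need one link score per adjacent pair")
        for base, score in zip(self.bases, self.nuc_scores):
            if base == "N" and score != 0:
                raise ValueError("N must have a score of zero")


def add_support(current_score: float, error_probability: float) -> float:
    """Add one agreeing read, discounted by its chance of being mis-called."""
    return current_score + (1.0 - error_probability)


# Link scores below are arbitrary illustrative values.
seq = ScoredSeq("ACN", [5.0, 5.0, 0.0], [5.0, 1.0])
updated = add_support(seq.nuc_scores[0], 0.01)
print(round(updated, 2))  # 5.99
```

The last two lines reproduce the worked example: a position supported by five agreeing reads gains 0.99 from a sixth read with a 1% error probability.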

AssemblyJob

An AssemblyJob represents one assembly strategy. It is essentially a programmatic interface for an entire assembler: it takes in a set of sequences (reads and/or contigs) and outputs a set of sequences (contigs, or leftover, unused sequences that were not combined with anything or modified during the assembly process). PRICE uses a variety of different assembly strategies, each of which requires its own implementation (for example, collapse of fully-redundant sequences versus de Bruijn graph-based assembly). Each of these strategies implements the AssemblyJob interface and, once created, they can all be managed uniformly in terms of execution and gathering of results.
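
The idea of managing dissimilar strategies uniformly can be sketched as below. This is a Python illustration under stated assumptions: the method names run and results, the two toy job classes and the helper run_all are hypothetical, plain strings stand in for ScoredSeqs, and "collapse" is simplified to removing exact duplicates.

```python
from abc import ABC, abstractmethod
from typing import List


class AssemblyJob(ABC):
    """Hypothetical interface: a set of sequences in, a set of sequences out."""

    @abstractmethod
    def run(self) -> None:
        """Execute the strategy."""

    @abstractmethod
    def results(self) -> List[str]:
        """Return contigs plus any leftover, unused sequences."""


class CollapseDuplicatesJob(AssemblyJob):
    """Toy strategy: keep a single copy of each identical sequence."""

    def __init__(self, seqs: List[str]):
        self._seqs = list(seqs)
        self._out: List[str] = []

    def run(self) -> None:
        self._out = sorted(set(self._seqs))

    def results(self) -> List[str]:
        return list(self._out)


class PassThroughJob(AssemblyJob):
    """Toy strategy: every input is returned as an unused leftover."""

    def __init__(self, seqs: List[str]):
        self._seqs = list(seqs)
        self._out: List[str] = []

    def run(self) -> None:
        self._out = list(self._seqs)

    def results(self) -> List[str]:
        return list(self._out)


def run_all(jobs: List[AssemblyJob]) -> List[str]:
    """Execute each job and gather results without knowing its strategy."""
    gathered: List[str] = []
    for job in jobs:
        job.run()
        gathered.extend(job.results())
    return gathered


jobs = [CollapseDuplicatesJob(["ACGT", "ACGT", "TTGA"]), PassThroughJob(["GGCC"])]
print(run_all(jobs))  # ['ACGT', 'TTGA', 'GGCC']
```

The caller in run_all depends only on the interface, which is what lets a new strategy be added as one more class.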

ReadFile

A source of sequences. ScoredSeqs are returned, regardless of the input format. A number of input file formats are already implemented (fasta, fastq, the PRICE-specific priceq), and inclusion of novel sequence data formats simply requires the implementation of an appropriate ReadFile class to parse the file(s) and an interface method to identify the input format.
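
A minimal Python sketch of this arrangement follows. Everything named here is an assumption for illustration: the reads method, the FastaReadFile class, the open_read_file helper that picks a parser from the file name, and the choice to give each called base from a fasta file a score of 1 (fasta carries no quality data; N is scored zero as described for ScoredSeq).

```python
import io
from abc import ABC, abstractmethod
from typing import Iterator, List, TextIO, Tuple

# Stand-in for a ScoredSeq: (bases, per-nucleotide scores).
Seq = Tuple[str, List[float]]


class ReadFile(ABC):
    """Hypothetical interface: a source of scored sequences."""

    @abstractmethod
    def reads(self) -> Iterator[Seq]:
        """Yield sequences regardless of the underlying file format."""


class FastaReadFile(ReadFile):
    def __init__(self, handle: TextIO):
        self._handle = handle

    def reads(self) -> Iterator[Seq]:
        chunks: List[str] = []
        in_record = False
        for line in self._handle:
            line = line.strip()
            if line.startswith(">"):
                if in_record:
                    yield self._scored("".join(chunks))
                chunks, in_record = [], True
            elif line and in_record:
                chunks.append(line.upper())
        if in_record:
            yield self._scored("".join(chunks))

    @staticmethod
    def _scored(bases: str) -> Seq:
        # Assumed default: score 1 per called base, 0 for N.
        return bases, [0.0 if b == "N" else 1.0 for b in bases]


def open_read_file(name: str, handle: TextIO) -> ReadFile:
    """Identify the input format from the file name (illustrative rule)."""
    if name.endswith((".fa", ".fasta")):
        return FastaReadFile(handle)
    raise ValueError(f"no ReadFile implementation for {name!r}")


source = open_read_file("reads.fasta", io.StringIO(">r1\nAC\nGN\n>r2\ntt\n"))
parsed = list(source.reads())
print(parsed)
```

Supporting another format would then mean adding one class and one branch in the format-selection helper.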

OutputFile

A place to put sequences. A set of ScoredSeq objects is provided to it; depending on the output format, varying aspects of the provided information can be used. For example, for a fasta-format output file, the sequences will be written but the scores ignored. Output files can also (but do not generally) perform additional filtering or revision of the output sequences, or can simply do nothing with the sequences. The output of novel file formats, or direct piping of sequences to another program or a database, can be achieved by implementing an appropriate OutputFile class and an interface method to specify its use.
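
The fasta case and the do-nothing case can be sketched as below. This is a Python illustration only; the method name write_seqs, both class names and the contig naming scheme are assumptions, and a (bases, scores) tuple stands in for a ScoredSeq.

```python
import io
from abc import ABC, abstractmethod
from typing import Iterable, List, TextIO, Tuple

# Stand-in for a ScoredSeq: (bases, per-nucleotide scores).
Seq = Tuple[str, List[float]]


class OutputFile(ABC):
    """Hypothetical interface: a destination for scored sequences."""

    @abstractmethod
    def write_seqs(self, seqs: Iterable[Seq]) -> None:
        """Consume the sequences in whatever way the format requires."""


class FastaOutputFile(OutputFile):
    """Writes the bases only; fasta has no place for the scores."""

    def __init__(self, handle: TextIO):
        self._handle = handle

    def write_seqs(self, seqs: Iterable[Seq]) -> None:
        for index, (bases, _scores) in enumerate(seqs, start=1):
            self._handle.write(f">contig_{index}\n{bases}\n")


class NullOutputFile(OutputFile):
    """Accepts the sequences and does nothing with them."""

    def write_seqs(self, seqs: Iterable[Seq]) -> None:
        pass


buffer = io.StringIO()
contigs = [("ACGT", [5.0, 5.0, 5.99, 4.0]), ("TTGA", [2.0, 2.0, 2.0, 2.0])]
FastaOutputFile(buffer).write_seqs(contigs)
NullOutputFile().write_seqs(contigs)
print(buffer.getvalue(), end="")
```

Because the handle is any text stream, the same shape would also cover piping sequences to another program.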

AssemblerListener

PRICE is designed with a single programmatic interface (the Assembler class) that various user interfaces could use to launch assembly jobs. However, different user interfaces will handle progress reports (feedback from PRICE about job progress, contig statistics, etc.) differently, seeking out different types of information and reporting it to the user in different ways (or not reporting it at all). The AssemblerListener interface is designed and integrated into PRICE to collect progress information as frequently as possible, allowing the various implementations to report or analyze that data to varying degrees. For example, when contigs are reported at the end of a cycle, the AssemblerListener could simply count the contigs and report basic size statistics, or it could perform an in-depth analysis by reporting things like complexity, nucleotide distribution, coverage distribution, etc. Or, for a null user interface, the data could simply be ignored. New user interfaces will require one or more new AssemblerListener classes.
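
The range of listener behaviors described above might look like the following Python sketch. The callback name cycle_finished, the two listener classes and the notify_all helper are hypothetical, and plain strings stand in for contigs.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class AssemblerListener(ABC):
    """Hypothetical interface: receives progress events from the assembler."""

    @abstractmethod
    def cycle_finished(self, cycle: int, contigs: List[str]) -> None:
        """Called with the contigs reported at the end of a cycle."""


class NullListener(AssemblerListener):
    """For a null user interface: the data is simply ignored."""

    def cycle_finished(self, cycle: int, contigs: List[str]) -> None:
        pass


class SizeStatsListener(AssemblerListener):
    """Counts the contigs and records basic size statistics."""

    def __init__(self) -> None:
        self.reports: List[Dict[str, int]] = []

    def cycle_finished(self, cycle: int, contigs: List[str]) -> None:
        lengths = [len(c) for c in contigs]
        self.reports.append({
            "cycle": cycle,
            "count": len(lengths),
            "total": sum(lengths),
            "longest": max(lengths, default=0),
        })


def notify_all(listeners: List[AssemblerListener], cycle: int,
               contigs: List[str]) -> None:
    """Hand the same event to every listener, whatever it does with it."""
    for listener in listeners:
        listener.cycle_finished(cycle, contigs)


stats = SizeStatsListener()
notify_all([NullListener(), stats], 1, ["ACGTAC", "TTG"])
print(stats.reports)
# [{'cycle': 1, 'count': 2, 'total': 9, 'longest': 6}]
```

An in-depth listener would implement the same callback and compute more from the contigs it is handed.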

Secondary Interfaces

The interfaces described below are not as central to the overall design of PRICE as the Key Interfaces above, but represent useful abstractions for the introduction of additional PRICE features/assembly strategies. They are likely targets for implementation when adding functionality to PRICE.

EcoFilter

This interface represents a test for contigs (ScoredSeqs), and implementing classes take in a set of sequences and sort them into two categories: "passing" or "not passing". Generally, these tests are used to identify contigs that are not of interest and therefore should be deleted at the end of an assembly cycle. Current classes implement tests for minimum length, minimum coverage levels, matches to reference sequences, etc.
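
A minimum-length test of the kind mentioned above could be sketched like this in Python. The method names passes and split and the class MinLengthFilter are assumptions made for the example, and plain strings stand in for ScoredSeqs.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Tuple


class EcoFilter(ABC):
    """Hypothetical interface: sorts contigs into passing and not passing."""

    @abstractmethod
    def passes(self, contig: str) -> bool:
        """Return True if the contig should be kept."""

    def split(self, contigs: Iterable[str]) -> Tuple[List[str], List[str]]:
        """Partition the contigs into (passing, not passing)."""
        passing: List[str] = []
        failing: List[str] = []
        for contig in contigs:
            (passing if self.passes(contig) else failing).append(contig)
        return passing, failing


class MinLengthFilter(EcoFilter):
    """Keeps contigs of at least min_length nucleotides."""

    def __init__(self, min_length: int):
        self._min_length = min_length

    def passes(self, contig: str) -> bool:
        return len(contig) >= self._min_length


kept, dropped = MinLengthFilter(4).split(["ACGTAC", "TTG", "GGCC"])
print(kept, dropped)  # ['ACGTAC', 'GGCC'] ['TTG']
```

The contigs in the second list are the ones that would be deleted at the end of the cycle.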

ReadPairFilter

This interface provides the opportunity to filter away potentially problematic input data. It provides a test for both an individual sequence AND a pair of sequences (the latter is applied to paired-end sequences), and informs the user whether that sequence (or pair) passes or fails the test. Generally, these filters are used to eliminate reads or input contigs with very low sequence complexity, overall low quality scores, etc. The provision of a separate test for a pair of sequences allows the boolean combination of the single-sequence tests to be adjusted according to the test (e.g. does the pair pass if either sequence passes, or only if both sequences pass?) and also permits the implementation of tests that apply to the relationship of the reads to one another (for example, keeping or eliminating pairs for which the two sequences extensively overlap, indicating a short underlying amplicon).
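
The single-sequence and pair tests, and the adjustable boolean combination between them, can be sketched as follows. This Python example is illustrative only: the method names, the class DistinctBaseFilter, its require_both flag and its crude complexity measure (number of distinct called bases) are all assumptions.

```python
from abc import ABC, abstractmethod


class ReadPairFilter(ABC):
    """Hypothetical interface: tests single sequences and paired sequences."""

    @abstractmethod
    def single_passes(self, seq: str) -> bool:
        """Test one unpaired sequence."""

    @abstractmethod
    def pair_passes(self, first: str, second: str) -> bool:
        """Test a pair; may combine the single tests or inspect both reads."""


class DistinctBaseFilter(ReadPairFilter):
    """Rejects very low-complexity reads, judged by distinct called bases."""

    def __init__(self, min_distinct: int, require_both: bool = True):
        self._min_distinct = min_distinct
        self._require_both = require_both

    def single_passes(self, seq: str) -> bool:
        # N is ambiguous, so it does not count toward complexity.
        return len(set(seq) - {"N"}) >= self._min_distinct

    def pair_passes(self, first: str, second: str) -> bool:
        results = (self.single_passes(first), self.single_passes(second))
        # The combination rule is a per-filter choice: both, or either.
        return all(results) if self._require_both else any(results)


strict = DistinctBaseFilter(min_distinct=3, require_both=True)
lenient = DistinctBaseFilter(min_distinct=3, require_both=False)
pair = ("ACGTACGT", "AAAAAAAN")
print(strict.pair_passes(*pair), lenient.pair_passes(*pair))  # False True
```

A filter on the relationship between the two reads, such as an overlap check, would implement pair_passes without calling the single test at all.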
