PRICE Documentation: Module Structure
The interfaces described below reveal the logical structure of PRICE's design and its overall
strategy for assembly. Because PRICE was written to accommodate a variety of assembly
goals and challenges, it includes implementations of a wide variety of sometimes dissimilar assembly
strategies. Understanding the following interfaces provides context for understanding
the implementing classes, and developers wishing to add functionality (including the original author)
should be able to achieve most goals by introducing new classes that implement just these interfaces.
- ScoredSeq
- This is the representation for sequences (reads and/or contigs) in PRICE. Abstractly, a ScoredSeq
has three properties: 1) a series of nucleotide identities, 2) a series of confidence scores for
the indicated nucleotide identities, one per nucleotide, and 3) a series of confidence scores for the
adjacency of two nucleotides (essentially a score for the existence of the indicated 3p-5p phosphodiester
bond). Both kinds of score are decimal numbers that are meant to reflect the amount of data
in support of the nucleotide identity/phosphodiester bond. For example, a nucleotide at a position covered by
5 sequences, with all of the sequences in agreement on that nucleotide's identity, would have a score
of 5. The decimal nature of the score allows quality scores (such as those provided by .fastq files)
to be taken into account: each read's contribution is reduced by the probability that its nucleotide was mis-called.
So in the example above, if another read were added to support the nucleotide identity, but that read had
a 1% scored probability of being incorrect, the resulting contig's score at that position would become
5.99. Ambiguous nucleotides (N's) have a score of zero by definition. Intermediate ambiguities are not
allowed (e.g. Y for pyrimidines). All nucleotides are represented as upper-case DNA letters (A, T, C, G; U
is not recognized by a ScoredSeq).
- AssemblyJob
- This is an implementation of an assembly strategy. It is essentially a programmatic interface for an entire
assembler: it takes in a set of sequences (reads and/or contigs) and outputs a set of sequences (contigs, or
leftover, unused sequences that were not combined with anything or modified during the assembly process). PRICE
uses a variety of different assembly strategies, each of which requires its own implementation (for example,
collapse of fully-redundant sequences versus de Bruijn graph-based assembly). Each of these strategies
implements the AssemblyJob interface and, once created, they can all be managed uniformly in terms of execution
and gathering of results.
- ReadFile
- A source of sequences. ScoredSeqs are returned, regardless of the input format. A number of input file formats
are already implemented (fasta, fastq, the PRICE-specific priceq), and inclusion of novel sequence data formats
simply requires the implementation of an appropriate ReadFile class to parse the file(s) and an interface method
to identify the input format.
- OutputFile
- A place to put sequences. A set of ScoredSeq objects is provided to it; depending on the output format, varying
aspects of the provided information can be used. For example, for a fasta-format output file, the sequences
will be written but the scores ignored. Output files can also (but do not generally) perform additional filtering
or revision of the output sequences, or can simply do nothing with the sequences. The output of novel file formats,
or direct piping of sequences to another program or a database, can be achieved by implementation of an appropriate
OutputFile class and an interface method to specify its use.
- AssemblerListener
- PRICE is designed with a single programmatic interface (the Assembler class) that various user interfaces could use
to launch assembly jobs. However, different user interfaces will handle progress reports (feedback from PRICE about
job progress, contig statistics, etc.) differently by seeking out different types of information and reporting it to
the user differently (or not reporting it). The AssemblerListener interface is designed and integrated into PRICE to
collect progress information as frequently as possible, allowing the various implementations to report or analyze
that data to varying degrees. For example, when contigs are reported at the end of a cycle, the AssemblerListener
could simply count the contigs and report basic size statistics, or it could perform an in-depth analysis by reporting
things like complexity, nucleotide distribution, coverage distribution, etc. Or, for a null user interface, the data
could simply be ignored. New user interfaces will require a new (or several new) AssemblerListener classes.
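The scoring arithmetic described for ScoredSeq can be sketched as follows. This is a minimal illustration in Python, not PRICE's own code: the function names are hypothetical, and it assumes that each supporting read contributes one minus its error probability to a position's score, which is consistent with the 5 to 5.99 example above. The Phred conversion is the standard fastq convention; whether PRICE applies it in exactly this way is an assumption.

```python
ALLOWED = set("ATCGN")  # upper-case DNA letters plus the fully ambiguous N


def validate_nucleotides(seq):
    """Reject symbols a ScoredSeq would not accept (lower case, U, Y, ...)."""
    bad = sorted(set(seq) - ALLOWED)
    if bad:
        raise ValueError("unsupported nucleotide symbols: " + "".join(bad))
    return seq


def phred_to_error(quality):
    """Standard Phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10.0)


def nucleotide_score(nucleotide, error_probabilities):
    """Support at one position; assumption: each read adds 1 - P(error)."""
    if nucleotide == "N":
        return 0.0  # ambiguous positions score zero by definition
    return sum(1.0 - p for p in error_probabilities)


# Five reads in perfect agreement, then a sixth read with a 1% error chance
# (Phred quality 20) supporting the same nucleotide.
five = nucleotide_score("A", [0.0] * 5)
six = nucleotide_score("A", [0.0] * 5 + [phred_to_error(20)])
print(five, round(six, 2))
```

The same summation would apply to the phosphodiester-bond scores, with one value per adjacent pair of nucleotides rather than one per nucleotide.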
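To show how the key interfaces fit together, here is a hedged sketch in Python of one cycle: a ReadFile supplies sequences, one or more AssemblyJobs transform them, an AssemblerListener is told about the contigs, and an OutputFile receives the result. Every method name (`get_seqs`, `run`, `write`, `cycle_finished`), every implementing class, and the `run_cycle` function is hypothetical and chosen for illustration; plain strings stand in for ScoredSeqs, so scores are omitted.

```python
from abc import ABC, abstractmethod


class ReadFile(ABC):
    @abstractmethod
    def get_seqs(self):
        """Return every sequence from this source as a list."""


class AssemblyJob(ABC):
    @abstractmethod
    def run(self, seqs):
        """Consume input sequences and return contigs plus leftovers."""


class OutputFile(ABC):
    @abstractmethod
    def write(self, seqs):
        """Receive the final sequences; may use, filter, or ignore them."""


class AssemblerListener(ABC):
    @abstractmethod
    def cycle_finished(self, contigs):
        """Receive the contigs reported at the end of a cycle."""


class FastaText(ReadFile):
    """Parses fasta-formatted text held in memory (stand-in for a file)."""

    def __init__(self, text):
        self.text = text

    def get_seqs(self):
        seqs, current = [], None
        for line in self.text.splitlines():
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    seqs.append(current)
                current = ""
            elif line and current is not None:
                current += line.upper()
        if current is not None:
            seqs.append(current)
        return seqs


class CollapseRedundant(AssemblyJob):
    """Keeps one copy of each fully redundant (identical) sequence."""

    def run(self, seqs):
        return list(dict.fromkeys(seqs))  # preserves first-seen order


class FastaLines(OutputFile):
    """Collects fasta lines in memory; a fasta writer would drop scores."""

    def __init__(self):
        self.lines = []

    def write(self, seqs):
        for i, seq in enumerate(seqs, start=1):
            self.lines += [">contig_%d" % i, seq]


class SizeListener(AssemblerListener):
    """Records the contig count and the longest contig for each cycle."""

    def __init__(self):
        self.reports = []

    def cycle_finished(self, contigs):
        lengths = [len(c) for c in contigs]
        self.reports.append((len(lengths), max(lengths, default=0)))


def run_cycle(read_file, jobs, output_file, listener):
    seqs = read_file.get_seqs()
    for job in jobs:  # every strategy is driven through the same call
        seqs = job.run(seqs)
    listener.cycle_finished(seqs)
    output_file.write(seqs)
    return seqs


source = FastaText(">r1\nACGT\n>r2\nACGT\n>r3\nGGCCA\n")
out, listener = FastaLines(), SizeListener()
contigs = run_cycle(source, [CollapseRedundant()], out, listener)
```

Because `run_cycle` depends only on the four abstract classes, a new input format, assembly strategy, output target, or user interface could be added by writing one more implementing class, which is the extension pattern the descriptions above outline.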
The interfaces described below are not as central to the overall design of PRICE as the Key Interfaces above, but
represent useful abstractions for the introduction of additional PRICE features/assembly strategies. They are
likely targets for implementation when adding functionality to PRICE.
- EcoFilter
- This interface represents a test for contigs (ScoredSeqs), and implementing classes take in a set of sequences and
sort them into two categories: "passing" or "not passing". Generally, these tests are used to identify contigs
that are not of interest and therefore should be deleted at the end of an assembly cycle. Current classes implement
tests for minimum length, minimum coverage levels, matches to reference sequences, etc.
- ReadPairFilter
- This interface provides the opportunity to filter away potentially problematic input data. It provides a test
for both an individual sequence AND a pair of sequences (it is applied to paired-end sequences), and informs the
user whether that sequence (or pair) passes or fails the test. Generally, these tests are used to eliminate reads or input contigs
with very low sequence complexity, overall low quality scores, etc. The provision of a single test for a pair of sequences
allows for the boolean combination of the single-sequence test to be adjusted according to the test (i.e. does the pair
pass if either sequence passes, or only if both sequences pass?) and also permits the implementation of tests that apply
to the relationship of reads to one another (for example, keeping or eliminating pairs for which the two sequences extensively
overlap, indicating a short underlying amplicon).
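The two filter interfaces can be sketched in the same spirit. The Python below is an illustration under stated assumptions rather than PRICE's implementation: the method names (`passes`, `partition`, `is_ok`, `pair_is_ok`), the `MinLengthFilter` and `ComplexityFilter` classes, and the `require_both` switch are all hypothetical, and plain strings stand in for ScoredSeqs.

```python
from abc import ABC, abstractmethod


class EcoFilter(ABC):
    @abstractmethod
    def passes(self, contig):
        """Return True if the contig should be kept."""

    def partition(self, contigs):
        """Sort contigs into (passing, not passing)."""
        passing = [c for c in contigs if self.passes(c)]
        failing = [c for c in contigs if not self.passes(c)]
        return passing, failing


class MinLengthFilter(EcoFilter):
    def __init__(self, min_length):
        self.min_length = min_length

    def passes(self, contig):
        return len(contig) >= self.min_length


class ReadPairFilter(ABC):
    @abstractmethod
    def is_ok(self, seq):
        """Test a single sequence."""

    @abstractmethod
    def pair_is_ok(self, seq_a, seq_b):
        """Test a pair of paired-end sequences."""


class ComplexityFilter(ReadPairFilter):
    """Fails sequences dominated by a single nucleotide."""

    def __init__(self, max_fraction, require_both=True):
        self.max_fraction = max_fraction
        self.require_both = require_both

    def is_ok(self, seq):
        if not seq:
            return False
        most_common = max(seq.count(n) for n in "ACGT")
        return most_common / len(seq) <= self.max_fraction

    def pair_is_ok(self, seq_a, seq_b):
        # The pair test decides how the single-sequence results combine.
        results = (self.is_ok(seq_a), self.is_ok(seq_b))
        return all(results) if self.require_both else any(results)


kept, dropped = MinLengthFilter(5).partition(["ACGT", "ACGTACGT"])
strict = ComplexityFilter(0.8, require_both=True)
lenient = ComplexityFilter(0.8, require_both=False)
pair = ("AAAAAAAAAA", "ACGTACGTAC")  # low-complexity read plus a normal one
print(kept, dropped, strict.pair_is_ok(*pair), lenient.pair_is_ok(*pair))
```

Giving the pair its own method is what lets one implementation require both reads to pass while another accepts either. A test on the relationship between the two reads, such as extensive overlap, would be implemented in `pair_is_ok` without going through `is_ok` at all.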