README for SHRiMP2: SHort Read Mapping Package, version 2.0 http://compbio.cs.toronto.edu/shrimp This document describes the programs of which SHRiMP2 is comprised, including their use, output formats, and various parameters. The algorithms employed are also briefly described. SHRiMP2 builds on, and significantly extends the original SHRiMP package, henceforth referred to as SHRiMP1. SHRiMP2 is brought to you by: Matei David Michael Brudno Daniel Lister Misko (Michael) Dzamba SHRiMP1 was originally developed by: Stephen M. Rumble Michael Brudno Phil Lacroute Vladimir Yanovsky Marc Fiume Adrian Dalca Matei David For details on the original SHRiMP1 package, see the following publication: Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, et al. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput Biol 5(5): e1000386. doi:10.1371/journal.pcbi.1000386 The authors may be contacted at shrimp at cs dot toronto dot edu. -------------------------------------------------------------------------------- Table of Contents -------------------------------------------------------------------------------- 1. Overview 2. Minimum Requirements 2.1 RAM Requirement 2.2 Other Requirements 3. Sample Usage Scenarios 3.1 Installing SHRiMP2 3.2 Compiling SHRiMP2 from source 3.3 Mapping against a genome whose projection DOES fit in RAM 3.4 Mapping against a genome whose projection DOES NOT fit in RAM 3.5 Mapping cDNA reads against a miRNA database 3.6 Advanced: Mapping in OVERLY sensitive mode 4. Input and Output Specifications 5. Parameters 6. Algorithms 7. Comparison to SHRiMP1 7.1 Algorithms 7.2 SHRiMP Output Format 7.3 SHRiMP1 Tools 8. Contact 9. Acknowledgements -------------------------------------------------------------------------------- 1. Overview -------------------------------------------------------------------------------- SHRiMP2 is a software package for mapping reads from a donor genome against a target (reference) genome. SHRiMP2 was primarily developed to work with short reads produced by Next Generation Sequencing (NGS) machines. The main features of SHRiMP2 include: - Fasta input format; - SAM output format; - Support for both letter space (Illumina/Solexa) and colour space (AB SOLiD) reads; - Paired mapping mode, using mate-pair/pair-end information; - Parallel computation, fully utilizing the power of modern shared-memory multi-core architectures. SHRiMP2 uses two techniques to match reads to a genome: we first use the q-gram filtering technique, utilizing multiple spaced seeds, to rapidly identify candidate mapping locations for each read. Subsequently, these locations are thoroughly investigated by a vectored implementation of the standard dynamic programming (Smith-Waterman) string matching algorithm, allowing for the accurate identification of mismatches (SNPs), micro-indels, and crossovers (colour space sequencing errors.) SHRiMP2 primarily targets sensitivity, but it has come a long way to achieve it at a reasonable speed. To illustrate its performance: Sensitivity: SHRiMP2 achieves 94.4% precision and 78.6% recall when mapping simulated colour space reads of 50 base pairs (bp) each containing 1 SNP _and_ 1 indel of size up to 5bp _and_ 4% per-base error rate. Speed: SHRiMP2 on a single 3.0GHz core with 16GB of RAM can map 36bp colour space reads against the reference human genome (hg18) at the rate of 160,000 reads per hour, and twice as fast if using mate-pair/pair-end data. This rate is fully core-parallelizable, in the sense that a cluster of 4 machines, each with 2 quad-core CPUs and 16GB of RAM, will map 4x2x4=32 times faster. This is about 10 million paired reads per hour. So, 250 million 36bp paired reads, equivalent to a 3x coverage of hg18, would take about 30 hours. Prior to mapping, indexing the projection of hg18 into 4 pieces that would each fit in 16GB RAM can be done by in 2 hours by running 4 different machines in parallel. (These are all rough estimates.) SHRiMP2 consists of the read mapper program gmapper, along with a series of tools and scripts. In this README file, we primarily describe gmapper. Descriptions of tools remaining from SHRiMP1 are available in README.SHRiMP132. Descriptions of various helper scripts are in the script directory utils/. -------------------------------------------------------------------------------- 2. Minimum Requirements -------------------------------------------------------------------------------- 2.1 RAM Requirement ------------------- SHRiMP2 is designed to fully utilize physical random-access memory (RAM). While it is certainly possible to run it on swap (disk-mapped) memory, the performance cost of doing that is rather prohibitive. RAM Requirement: To map against a set of contigs of total length L (in bp), using K spaced seeds of weight W, SHRiMP2 needs the following minimum amount of RAM, in bytes: ( L x K x 4 ) + ( K x 4^{W or 12(**)} x (4 + sizeof(void *)) ) + 50,000,000 The third term represents working memory. The second term varies with the weight W of the seeds used. With the default 4 seeds of weight 12, on 64 bit-machines (with 8-byte pointers), this amounts to 0.75GB. (**) By hashing kmers (see -H parameter), the exponent can be brought down to 12, at the expense of some speed drop. The first term generally dominates. E.g., with the default settings (K=4 seeds), to map against the full hg18 (L=3*10^9 bp), the first term becomes 3*10^9 x 4 x 4 = 48GB of RAM. To make it accessible to machines with smaller amounts of RAM, SHRiMP2 provides a mechanism to split a genome into several chunks, each of which comes with the overhead of terms 2 and 3. E.g., in our tests, we split hg18 into 4 chunks, each using about 13GB of RAM, which can fit comfortably on a machine with 16GB of RAM. Currently, SHRiMP2 does not split individual contigs. (However, this can be achieved by hand, possibly at the cost of losing some mappings in the region of the break point.) As a result, the minimum amount of RAM required equals the amount needed to index the largest contig. E.g., in the case of hg18, chr1 has about 260,000bp, and the amount of RAM required to index it with K spaced seeds is K x 1.04 + 0.75 GB. (For a cluster of nodes with 4GB of RAM, we recommend using K=3 seeds rather than the default 4.) 2.2 Other Requirements ---------------------- SHRiMP2 uses SSE2 ("vector") machine instructions to achieve a fast implementation of the standard dynamic programming (Smith-Waterman) string matching algorithm. E.g., it will not work on a non-x86/x86_64 architecture. SHRiMP2 uses the OpenMP API to achieve multi-threaded operation. -------------------------------------------------------------------------------- 3. Sample Usage Scenarios -------------------------------------------------------------------------------- 3.1 Installing SHRiMP2 ---------------------- Assume we downloaded the linux binary package in ./SHRiMP_2_1_1.lx26.x86_64.tar.gz $ tar xvzf SHRiMP_2_1_1.lx26.x86_64.tar.gz $ cd SHRiMP_2_1_1 $ file bin/gmapper bin/gmapper: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, statically linked, for GNU/Linux 2.6.9, not stripped $ export SHRIMP_FOLDER=$PWD Done! At this point the binaries are in bin/, and the various scripts are in utils/. 3.2 Compiling SHRiMP2 from source --------------------------------- Assume we downloaded the source package in ./SHRiMP_2_1_1.src.tar.gz $ tar xvzf SHRiMP_2_1_1.src.tar.gz $ cd SHRiMP_2_1_1 $ make [...warnings, hopefully no errors...] $ ls bin/gmapper* bin/gmapper bin/gmapper-cs bin/gmapper-ls $ export SHRIMP_FOLDER=$PWD 3.3 Mapping against a genome whose projection DOES fit in RAM ------------------------------------------------------------- Assume we have: ./ciona.fa - the Ciona Savignyi genome (175*10^6 bp) ./reads.5m.50bp.cs.fa - 5 million 50bp colour space reads We look at the genome and the reads: $ ls -lh ciona.fa -rw-r--r-- 1 username 173M May 6 14:23 ciona.fa $ head -n 4 ciona.fa >reftig_0 ACCACACATCTGGCGTTGGTGCAACTACCACAACATCTGGCGTTCGTGCA ACCACCACAacatctggtcctggtgcaactaccacaacatctggtcctgg tgcaaccaccacaacatctggtgctggtgcaaccaccacaacatctggtg $ head -n 4 reads.5m.50bp.cs.fa >469_26_42_F3 T121133100312321122210031200212212233 >469_26_379_F3 T312022230033100001303023233122232120 Assume we have 2 machines, each with 4 cores and 8GB of RAM. We check memory requirements: 173M x 4 x 4 = 2.77GB. This is way below the available RAM, so there are no issues fitting the genome in memory. However, we still want to use both available machines, so we split the reads in two sets. $ grep '^>' reads.5m.50bp.cs.fa | wc -l 5000000 $ $SHRIMP_FOLDER/utils/splitreads.py 2500000 reads.5m.50bp.cs.fa $ ls *_to_* 0_to_2499999.csfasta 2500000_to_4999999.csfasta $ grep '^>' 0_to_2499999.csfasta | wc -l 2500000 This reads are now in the two files above. We assume the directory with the reads as well as the directory containing SHRiMP2 are shared across the machines. We map the reads with the following commands <machine1>$ $SHRIMP_FOLDER/bin/gmapper-cs 0_to_2499999.csfasta \ ciona.fa \ -N 4 -o 5 -h 80% >map.0_to_2499999.out 2>map.0_to_2499999.log <machine2>$ $SHRIMP_FOLDER/bin/gmapper-cs 2500000_to_4999999.csfasta \ ciona.fa \ -N 4 -o 5 -h 80% >map.2500000_to_4999999.out 2>map.2500000_to_4999999.log The options used above have the following meaning: -N 4 : Use 4 threads of control, corresponding to the number of cores available on each machine. -o 5 : Print top 5 scoring hits for each read. -h 80% : Set the threshold score for a hit to 80% of the maximum possible score. For reads of 50bp and the default match score of 10, the maximum score is 500, and the threshold is set to 400. Eventually, the two mapping jobs complete. We merge the mappings into one file with: $ $SHRIMP_FOLDER/utils/merge-hits-diff-qr-same-db.py map.0_to_2499999.out \ map.2500000_to_4999999.out >map.out Above, the merging script we used corresponds to "different query set" (2 disjoint sets of reads), "same database" (the full ciona genome). 3.4 Mapping against a genome whose projection DOES NOT fit in RAM ----------------------------------------------------------------- Assume we have: ./hg18.fa - fasta file containing all contigs of the reference human genome ./reads.500kx2.36bp.ls.fa - 1 million 36bp paired letter space reads Assume we know that the reads in each pair come from opposite strands of the genome, pointing "inwards" (towards each other), with an insert of average length 200 and standard deviation 10. Assume we have a single machine with 2 quad-core CPUs and 16GB of RAM. We look at the genome and the reads: $ ls -lh hg18.fa -rw-r--r-- 1 username 3.0G May 6 14:23 hg18.fa $ head -n 8 reads.500kx2.36bp.ls.fa >071214_EAS56_0061:1:1:882:575:1 GAAATTGTACACATCACACACCTTTTTGTCTTTCTT >071214_EAS56_0061:1:1:882:575:2 TGCTAGATAATGATTTGCACATTTTTAATTGGTAGA >071214_EAS56_0061:1:1:781:662:1 GAAGTGATTACATTAAGCTGTACATATGCAGCATTC >071214_EAS56_0061:1:1:781:662:2 AAATATGTTGATTTTTCAGTGTTTTTAAAGTAAAAT We compute the memory requirements for mapping against hg18 using the default 4 seeds: 3.0G x 4 x 4 = 48GB of RAM. Our machine only has 16GB of RAM. We split the genome with the following utility: $ $SHRIMP_FOLDER/utils/split-db.py --ram-size 14 --prefix hg18 hg18.fa [...] $ ls *.fa hg18-14gb-12_12_12_12seeds-1of4.fa hg18-14gb-12_12_12_12seeds-3of4.fa hg18-14gb-12_12_12_12seeds-2of4.fa hg18-14gb-12_12_12_12seeds-4of4.fa Above, we decided to give the gmapper process only 14GB of RAM (the remaining 2GB are presumable used by the operating system.) The script split hg18 into 4 chunks, each containing several contigs, that will each fit in 14GB of RAM. At this point, we realize we might be mapping letter space reads against hg18 in the future, so we decide to create projections of the individual genome pieces, so that we avoid having to re-project them every time we start a mapping job. To achieve this, we run: $ $SHRIMP_FOLDER/utils/project-db.py --shrimp-mode ls hg18-14gb-*.fa [...] $ ls hg18-14gb-*-ls.* hg18-14gb-12_12_12_12seeds-1of4-ls.genome hg18-14gb-12_12_12_12seeds-1of4-ls.seed.0 hg18-14gb-12_12_12_12seeds-1of4-ls.seed.1 hg18-14gb-12_12_12_12seeds-1of4-ls.seed.2 hg18-14gb-12_12_12_12seeds-1of4-ls.seed.3 hg18-14gb-12_12_12_12seeds-2of4-ls.genome hg18-14gb-12_12_12_12seeds-2of4-ls.seed.0 hg18-14gb-12_12_12_12seeds-2of4-ls.seed.1 hg18-14gb-12_12_12_12seeds-2of4-ls.seed.2 hg18-14gb-12_12_12_12seeds-2of4-ls.seed.3 hg18-14gb-12_12_12_12seeds-3of4-ls.genome hg18-14gb-12_12_12_12seeds-3of4-ls.seed.0 hg18-14gb-12_12_12_12seeds-3of4-ls.seed.1 hg18-14gb-12_12_12_12seeds-3of4-ls.seed.2 hg18-14gb-12_12_12_12seeds-3of4-ls.seed.3 hg18-14gb-12_12_12_12seeds-4of4-ls.genome hg18-14gb-12_12_12_12seeds-4of4-ls.seed.0 hg18-14gb-12_12_12_12seeds-4of4-ls.seed.1 hg18-14gb-12_12_12_12seeds-4of4-ls.seed.2 hg18-14gb-12_12_12_12seeds-4of4-ls.seed.3 Since we have a single machine, we run the mapping process sequentially against each of the chunks. $ for i in 1 2 3 4; do \ $SHRIMP_FOLDER/bin/gmapper-ls -L hg18-14gb-12_12_12_12seeds-${i}of4-ls \ reads.500kx2.36bp.ls.fa \ -N 8 -p opp-in -I 50,500 -m 20 -i -25 -g -40 -e -10 -E \ >map.db${i}of4.sam 2>map.db${i}of4.log done [...] $ ls map.db*.sam map.db1of4.sam map.db2of4.sam map.db3of4.sam map.db4of4.sam The options used above have the following meaning: -L hg18-14gb-12_12_12_12seeds-${i}of4-ls : Load the projection of the i-th chunk of the genome. This is in the 5 files with this prefix, and ending in .genome and .seed[1-4] -N 8 : Use 8 threads of control, fully utilizing 2 quad-core CPUs. -p opp-in : Enabled paired mode, where the two reads in each pair are on "opp"-osite strands, pointing "in"-wards. -I 50,500 : Set minimum and maximum allowed insert size. -m 20 -i -25 -g -40 -e -10 : Set specific Smith-Waterman scores for match, mismatch, gap open and gap extend. -E : Enable SAM output. After the mapping jobs complete, we merge the results with: $ $SHRIMP_FOLDER/bin/mergesam reads.500kx2.36bp.ls.fa map.db?of4.sam > map.sam Above, the merging script we used corresponds to "same query set" (same reads) and "different databases" (disjoint chunks of hg18). NOTE: For an example in which we split both the reads and the genome, combining examples 3.3 and 3.4, see $SHRIMP_FOLDER/SPLITTING_AND_MERGING. 3.5 Mapping cDNA reads against a miRNA database ----------------------------------------------- Assume we have: ./hsa.miRNA.cDNA.fa - database of mature miRNA sequences in human (as cDNA) ./reads.1m.cDNA.50bp.csfasta - 1 million 50bp colour space cDNA reads We project the database with: $ $SHRIMP_FOLDER/utils/project-db.py --seed 00111111001111111100,\ 00111111110011111100,00111111111100111100,00111111111111001100,\ 00111111111111110000 --h-flag --shrimp-mode cs hsa.miRNA.cDNA.fa $ ls hsa.miRNA.cDNA-ls.* hsa.miRNA.cDNA-ls.genome hsa.miRNA.cDNA-ls.seed.1 hsa.miRNA.cDNA-ls.seed.3 hsa.miRNA.cDNA-ls.seed.0 hsa.miRNA.cDNA-ls.seed.2 hsa.miRNA.cDNA-ls.seed.4 Here, we used 5 seeds of weight 14 and span 20 which allow for 2 initial mismatches (the leading 00), followed by a match of length 7bp (the 6 following 1s), followed by another match of various forms. Note: The seeds are applied to the colour space translations of both the database and the reads. Hence, a string of 6 consecutive 1s corresponds to 6 consecutive matching colours, which are in particular generated by a string of 7 consecutive matching letters. We map the reads with: $ $SHRIMP_FOLDER/bin/gmapper-cs -L hsa.miRNA.cDNA-ls \ reads.1m.cDNA.50bp.csfasta \ -N 8 -n 1 -U -o 2 -F -v 50% -h 50% -P \ >map.out 2>map.log The options used above have the following meaning: -L hsa.miRNA.cDNA-ls : Load genome projection. -N 8 : Use 8 threads of control. -n 1 : Require a single space seed match to investigate region. -U : Perform ungapped alignment. -o 2 : Print 2 top hits per read. -F : Process only the forward strand. -v 50% -h 50% : Set the score thresholds for the SW filtering steps to 50% of the maximum. This effectively disables the filtering by score threshold. -P : Pretty-print mappings so that we can look at them. To clarify the settings above: - only the forward strand (-F) of database sequences is inspected; - mapping regions are selected by a single spaced seed match (-n 1); - ungapped (-U) mapping scores are still being computed (only match, mismatch and crossover scores are now relevant); - mappings that do not achieve a score of 50% of the maximum are discarded (not many of those, because the initial seed match already guarantees a decent score); - the 2 top hits (-o 2) per read are output. In SHRiMP 2.0.4, a new command-line option was introduced that can be used to load several miRNA-specific settings: "-M mirna" or "--mode mirna" is equivalent with: loading the 5 seeds mentioned above; plus "-H -n 1 -w 100% -U -a 0 -g -255 -q -255 -Z" We have a look at the mappings: $ head -n 15 map.out #FORMAT: readname contigname strand contigstart contigend readstart readend r\ eadlength score editstring >1380_607_170_F3 hsa-let-7e MIMAT0000066 + 1 20 1 \ 20 35 186 15x5 G: 1 TGAGGTAGGAGGTTGTATAGTT------------- 20 |||||||||||||||X|||| T: TGAGGTAGGAGGTTGtATAG--------------- R: 1 T01220132022010123332210233322123032 20 >1380_721_321_F3 hsa-miR-524-5p MIMAT0002849 + 1 22 \ 1 22 35 220 22 G: 1 CTACAAAGGGAAGCACTTTCTC------------- 22 |||||||||||||||||||||| T: CTACAAAGGGAAGCACTTTCTC------------- R: 1 T22311002002023112002222302010303102 22 3.6 Advanced: Mapping in OVERLY sensitive mode ---------------------------------------------- SHRiMP2 is designed to be quite sensitive, but without complete disregard to speed. The following is an ADVANCED EXAMPLE showing some settings that will make it OVERLY sensitive. CAUTION: Not all of these settings need to be applied at the same time to make gmapper sensitive enough for your needs. Please read this example in conjunction with the relevant descriptions in the Algorithms Section, and make an informed choice of parameters for your application. Assume we have a set of (not so many...) reads that we really need to find mappings for, and the appropriate genome projection. We can try: $ $SHRIMP_FOLDER/bin/gmapper-cs -L database reads.csfasta \ -V -w 150% -n 1 -r 50% -v 55% -l 40% -Z -h 60% -a -1 \ >map.out 2>map.log The significance of these options is the following: -V : Do not automatically trim genome index list that are unusually long. In our tests with hg18, this results in about 1-2% more hits, but the running time can increase dramatically, to 3x or more. -w 150% : Enlarge the length of the genome window against which each read is being mapped. -n 1 : Enable window generation mapping mode 1. This is very costly when working with a large genome. The setting means a single spaced kmer match between a read and the genome can be enough to create a candidate mapping window. With seeds of weight 12 and a uniform random genome, a random match occurs once in every 4^12 locations. For 1/4 of hg18, this means about 44 random matches, for every spaced kmer and for every seed. With 4 seeds of average span 20 and reads of length 50, thats about (50-20)x4x44 ~= 5,000 random single spaced kmer matches. Investigating a candidate window location around each of those is costly. The default window generation mode (-n 2) requires at least 2 spaced kmer matches, which reduces dramatically the number of windows created around random matches. (Even though a match of 2 spaced seeds of weight 12 in a genome window does not guarantee 2x12=24 matches, by design of the seeds it does guarantee about 17 matches. As a result, the number of windows expected to be placed around random matches drops by a factor of at least 4^5 = 1024.) For paired mapping mode, you can try the window generation mode "-n 3" first, and, at the extreme, "-n 2". -r 50% : Set the window generation threshold to 50% of the read length. Combined with -n 1, the effect of this is that every single spaced kmer match from a seed with span greater than 50% of read length will generate a candidate mapping window. -v 55% : Lower the threshold for the vector SW filter. -l 40% : Increase the allowable overlap of one mapping location with a previously inspected mapping location that passed the vector filter. This helps if for some reason the data makes it hard for gmapper to center genome mapping locations around spaced kmer matches. -Z : Disable caching of previous vector SW runs. -h 60% : Lower the threshold for the full SW filter. -a -1 : Disable the restriction of the full SW dynamic programming algorithm to areas around the matching spaced kmers. This can help detecting larger indels, but the runtime of the full SW filter will increase drastically. Still, in some cases this is ok to do, as this runtime is much smaller than, say, the time spent in the vector SW filter. See the log file from a relevant run for these timings. -------------------------------------------------------------------------------- 4. Input and Output Specifications -------------------------------------------------------------------------------- Input Files ----------- Both the reads and the genome files should be in fasta format. The files are opened using the zlib library, so it is ok for them to be gzip-ed. (But not zip-ed!) The genome fasta should _always_ be in letter space. The space (letter/colour) of the reads should match the gmapper mode, which is selected by invoking gmapper by the appropriate symlink, gmapper-ls or gmapper-cs. Output Files ------------ SHRiMP2 supports SAM output, as well as native SHRiMP output. SAM output format is documented elsewhere. SHRiMP output format is remains from SHRiMP1. For more details, see Section 7. In either output format (except for SHRiMP with pretty-print), each mapping occupies one line. The top <num_outputs> mappings for each read appear in the output file in the order of best to worst. When running gmapper in multi-threaded mode, mappings of different reads might appear in the output file intermingled, as follows (only displaying the name and the score): $ grep "^>" map.out | head -n 8 >read_1 [...] 355 >read_1 [...] 355 >read_2 [...] 376 >read_2 [...] 376 >read_1 [...] 327 >read_3 [...] 415 >read_1 [...] 327 >read_2 [...] 376 Several scripts are provided that manipulate output files (merge, sort). Paired Mapping Mode Input ------------------------- When running in paired mode, the two reads in each pair should appear consecutively in the reads file, as follows: $ head -n 8 reads.500kx2.36bp.ls.fa >071214_EAS56_0061:1:1:882:575:1 GAAATTGTACACATCACACACCTTTTTGTCTTTCTT >071214_EAS56_0061:1:1:882:575:2 TGCTAGATAATGATTTGCACATTTTTAATTGGTAGA >071214_EAS56_0061:1:1:781:662:1 GAAGTGATTACATTAAGCTGTACATATGCAGCATTC >071214_EAS56_0061:1:1:781:662:2 AAATATGTTGATTTTTCAGTGTTTTTAAAGTAAAAT In particular: 1. The reads should not be simply dumped in the same file. That is, the reads in each pair should not be just about anywhere in the file. They must be consecutive. 2. There should not be unpaired reads among the paired reads. If you have both paired and unpaired reads, use different files and different runs. Several scripts are provided to get the reads in this format. Alternatively, the reads in each file can be placed in two corresponding files, which are then loaded using options -1 and -2. CAUTION: gmapper will NOT check any of the above. If started in paired mode, it will simply fetch consecutive reads and assume they form a pair. For SAM output, gmapper infers the name of a pair of reads as the longest common prefix of the two read names in that pair. E.g., the first pair above would be called "071214_EAS56_0061:1:1:882:575:", including the trailing ":". Paired Mapping Mode Output -------------------------- In paired mapping mode, gmapper outputs pairs of mappings, which consist of consecutive lines describing the mapping of the first read followed immediately by the mapping of the second read. Even though several mappings of the same pair need not occur consecutively (because of multi-threading), the two mappings inside each pair _do_ occur consecutively. For example, >071214_EAS56_0061:1:1:882:575:1 [...] 350 >071214_EAS56_0061:1:1:882:575:2 [...] 310 >071214_EAS56_0061:1:1:781:662:1 [...] 320 >071214_EAS56_0061:1:1:781:662:2 [...] 360 >071214_EAS56_0061:1:1:781:662:1 [...] 320 >071214_EAS56_0061:1:1:781:662:2 [...] 340 >071214_EAS56_0061:1:1:882:575:1 [...] 310 >071214_EAS56_0061:1:1:882:575:2 [...] 350 Scripts are provided that sort mappings of pairs. -------------------------------------------------------------------------------- 5. Parameters -------------------------------------------------------------------------------- The following parameters govern the behaviour of gmapper. For more thorough descriptions of the algorithms involved, see the next Section. The list is organized conceptually by the "area" where parameters operate. Genome Projection ----------------- [ -s/--seeds <seed1,seed2,....> ] Each spaced seed is a contiguous string of 0s and 1s. Several spaced seeds can be specified either by separating them by commas as a single parameter to -s, or by giving -s several times. By default, gmapper uses a set of 4 spaced seeds of weight 12. Several sets of default seeds are available for certain weights. These can be loaded by, e.g. "-s w16", to load the default set of seeds of weight 16. Note: When running gmapper in colour space mode, seeds are applied to colour space representations of the genome and the reads. [ -H/--hash-spaced-kmers ] Hash spaced kmers obtained from each spaced seed into 24-bit strings before indexing them. [ -z/--cutoff <cutoff> ] Ignore lists in the genome index that are longer than <cutoff>. [ -V/--trim-off ] Disable automatic genome index trimming. By default, if no "-z" is given, gmapper trims the lists in the genome index longer than a given threshold. Currently, the automatic threshold is (about): max(1000, 100*<average_list_length>). This flag is ignored if either -z or -S is given. [ -S/--save <filename> ] With this parameter, gmapper projects and indexes the genome, and saves it in several files for future use. No read mapping is performed. The only other relevant parameters in this mode of operation are: -s, -H, and -z. The files created are: <filename>.genome - contains the genome sequence, along with some extra data <filename>.seed.0, <filename>.seed.1, ... - each such file contains the index of the genome projected by one spaced seed. Note: If using -S and -L to save and load a genome index, note that letter and colour space indexes are different. If you have colour (letter) space reads, you should build a colour (letter) space index of the genome. [ -L/--load <filename> ] Load a genome index from the given file. Also loads the set of spaced seeds used in the projection, as well as the value of the -H flag when the index was created. Any -s or -H flags are ignored. In "short form", <filename> does not contain the character ",", and it is interpreted as the prefix of the path to the index files. The actual files are then found by appending ".genome" and ".seed.*". In "long form", <filename> contains the character ",". In this case, the string is tokenized by the "," separator, obtaining, say, "A,B,C". Then, "A" is interpreted as the .genome file, and "B" and "C" are interpreted as the .seed.* files. Thus, if a genome index was created using 2 seeds and the parameter "-S db", then "-L db" is equivalent to "-L db.genome,db.seed.0,db.seed.1". The advantage of the long form of -L is that one may specify which of the seeds to load, rather than loading them all, which is what happens with the short form. E.g., we could use "-L db.genome,db.seed.1". [ --save-mmap /<mmap_name> ] [ --load-mmap /<mmap_name> ] Can be used to save and subsequently load a genome projection to and from shared memory. The projection will remain resident in shared memory until explicitly removed with $ rm /dev/shm/<mmap_name> When saving to shared memory, the genome projection must be first loaded with -L (not directly from a fasta file). The name parameter must start with a '/' character, and it must contain no other '/' characters. This functionality is useful only when many individual gmapper runs are performed against the same large genome projection, to the point where the projection loading part (with -L) is a bottleneck. For reasons we did not fully investigate, loading does not work on certain machines. Consequently, the functionality should be considered experimental. General Mapping --------------- [ -w/--match-window <window_length> ] Specifies the length of the genome window against which a read is mapped. It can be given either as an absolute value, e.g., "-w 62", or as a percentage of read length, e.g., "-w 133%". It defaults to "-w 140%". [ -U/--ungapped ] Perform ungapped alignment. [ -F/--positive ] Only process the forward (positive) strand of the genome. [ -C/--negative ] Only process the reverse complement (negative) strand of the genome. Smith-Waterman Scores --------------------- [ -m/--match <match_score> ] The score for matches during the Smith-Waterman score calculation. This should be positive. [ -i/--mismatch <mismatch_score> ] The score for mismatches. This should be negative. [ -g/--open-r <gap_open_reference_score> ] The score to open a gap along the genome sequence. Should be negative. Note: In the current implementation, the gap_open score does not include any extension. That is, a gap of length 1 is scored as: <gap_open_score>+<gap_extend_score> [ -q/--open-q <gap_open_query_score> ] The score to open a gap along the read sequence. Should be negative. Note: If -g is set and -q is not set, the gap open penalty for the query will be set to the same value as specified for the reference. [ -e/--ext-r <gap_extend_reference_score> ] The score to extend a gap along the genome sequence. Should be negative. [ -f/--ext-q <gap_extend_query_score> ] The score to extend a gap along the read sequence. Should be negative. Note: If -e is set and -f is not set, the gap extend penalty for the query will be set to the same value as specified for the reference. [ -x/--crossover <crossover_score> ] The score for calling a colour space sequencing error. Should be negative. Available only in colour space mode. Disable with a prohibitive value: "-x -255". [ --pr-xover <pr_xover> ] Sets the default crossover probability. This is used to compute other probabilities from scores. During alignment, the probability a colour is wrong is derived from its quality values (if available) rather than using this default. (Default: 0.03) Filter 1: Window Generation --------------------------- [ -n/--cmw-mode <window_generation_mode> ] This parameter controls how candidate mapping windows (CMW) are created for every read or readpair. In unpaired mode, possible values are 1 or 2 (=default). In paired more, possible values are 2, 3, or 4 (=default). Their significance is as follows. In unpaired mode, create a CMW for a read based on: 1: as little as 1 spaced seed match 2: at least 2 spaced seed matches In paired mode, create 2 CMWs, one for each foot (read in a pair), based on: 2: as little as 1 spaced seed match in either foot 3: at least 2 spaced seed matches in one foot, and as little as 1 in the other 4: at least 2 spaced seed matches in either foot [ -r/--cmw-threshold <window_generation_threshold> ] Generate a CMW as long as the "optimistic" estimation of the match score is greater than <window_generation_threshold>. The estimation is computed by assuming that all positions inside a spaced kmer match are, in fact, matching. The threshold is either absolute, e.g., "-r 252", or a percentage of the maximum possible match score, e.g., "-r 50%". It defaults to "-r 55%". Note: This threshold is ignored when either <window_generation_mode> is 1 (in unpaired mode this happens with "-n 1"; in paired mode with "-n 2" or "-n 3"), or when performing ungapped alignment ("-U"). Note on 454 reads ----------------- Currently, gmapper constructs the estimation is this step based on at most a single indel. This is appropriate for read lengths less than 100. For longer reads, which might contain more indels, you might want to significantly lower this threshold, and use an absolute value. For example, to allow a CMW to pass this filter based on two spaced kmer matches of total length 50 and one indel (with the default scores), you should use "-r 464". (We get 464 from 50 matches: +500, 1 gap open: -33, 1 ins extend: -3.) Filter 2: Vector SW/Ungapped Alignment -------------------------------------- [ -v/--vec-threshold <vector_sw_threshold> ] For every CMW for a read, discard it if after running the vector SW algorithm, its mapping score is below <vector_sw_threshold>. The threshold is either absolute, e.g., "-v 300", or a percentage of the maximum possible match score, e.g., "-v 60%". It defaults to "-v 60%". Available only in colour space mode. (In letter space mode, this equals <full_sw_threshold>.) [ -l/--cmw-overlap <window_overlap> ] Following a CMW filter 2 hit, ignore further CMWs that overlap it by more than <window_overlap>. It can be given as either an absolute value, e.g., "-w 62", or a percentage of read length, e.g., "-w 50%". It defaults to "-l 80%". [ -Z/--cachebypass-off ] Disable cache bypass of vector SW calls. This is disabled by default in ungapped mode ("-U"). Filter 3: Scalar (Full) SW Alignment ------------------------------------ [ -h/--full-threshold <full_sw_threshold> ] For every CMW, discard it if after running the full SW algorithm, its mapping score is below <full_sw_threshold>. The threshold is either absolute, e.g., "-h 400", or a percentage of the maximum possible match score, e.g., "-h 80%". It defaults to "-h 68%". [ -o/--report <num_outputs> ] Output the top <num_outputs> filter 3 hits for each read/pair. [ --max-alignments <n> ] If more than <n> mappings for a read (pair) pass all filters, drop them all. Also see Note on Post-alignment Option Ordering. [ --strata ] Only output the highest scoring mappings for any given read, up to <num_outputs> mappings per read. Also see Note on Post-alignment Option Ordering. [ -a/--anchor-width <anchor_width> ] Limit the full SW algorithm to a band of <anchor_width> additional width around matching kmers (anchors). This defaults to "-a 8" in gapped mode, and "-a 0" in ungapped mode. To disable anchor usage altogether, use "-a -1". [ -T/--rev-tiebreak ] Reverse the order in which tie-breaks are resolved on the negative strand. [ -t/--tiebreak-off ] Disable reverse tiebreak (-T/--rev-tiebreak , on by default). [ --global ] Perform a full global alignment in the last phase of alignments. NOTE: This is on by default as of v2.2.0. [ --local ] Perform local alignment instead of global. In this mode, the ends of a read do not necessarily need to match the reference. Mapping qualities are not available in this mode. NOTE: This used to be the default setting prior to v2.2.0. [ --bfast ] Color-space only. Try to align like BFAST. This enables global alignments and if in FASTQ mode, outputs base qualities in the QUAL field, just like BFAST. [ --indel-taboo-len <value> ] Prevent indels from starting or ending in the tail <value> positions of a read. Note: for deletions, start and end positions are the same with respect to the read. Insertions might still start before the taboo zone, but then they have to go all the way to the end of the read (this will only ever happen with --global). Only available in colour space. Sets of settings ---------------- [ -M/--mode <mode> ] Currently only "mirna" is a valid mode. In this case, the option is equivalent with: loading the 5 seeds mentioned above (in the miRNA section 3.5); plus "-H -n 1 -w 100% -U -a 0 -g -255 -q -255 -Z" Paired Mode ----------- [ -p/--pair-mode <paired_mode> ] By default, gmapper operates in unpaired mapping mode. This parameter enables the paired mapping mode. The possible values are: "opp-in", "opp-out", "col-fw" and "col-bw". These correspond to: opposing strands, inwards; opposing strands, outwards; colinear, second is forward; colinear, second is backward. To illustrate the settings, consider the following genome and perfect reads: --> --> GGA GTA + TGGATCCGTAA - ACCTAGGCATT CCT CAT <-- <-- (R1:GGA, R2:TAC) or (R1:TAC, R2:GGA) are "opp-in", insert size = 8 (R1:GTA, R2:TCC) or (R1:TCC, R2:GTA) are "opp-out", insert size = 4 (R1:GGA, R2:GTA) or (R1:TAC, R2:TCC) are "col-fw", insert size = 6 (R1:GTA, R2:GGA) or (R1:TCC, R2:TAC) are "col-bw", insert size = 6 [ -I/--isize <min,max> ] Limit the search space for paired mappings to those where the insert size between the two feet falls in the given range. This parameter is only available in paired mapping mode. It defaults to "-I 0,1000". In all paired mapping modes, the insert is defined as the distance between the 5' ends of the reads (see example above.) Neither the min nor the max values can be negative. If you are unsure about the orientation of the input reads, try each pairing mode along with -X on a statistically non-negligible set of reads (say, 10,000). The correct pairing mode will likely be the one where the insert size histogram is neither flat, nor abundant in negative values. NOTE: This limit is NOT strictly enforced. By nature of the underlying algorithms, some paired mappings with inserts just outside the given range will also be considered. These are not discarded by gmapper itself. To strictly enforce these bounds, use a script to filter the SAM output. [ --half-paired ] In paired mode, if a pair does not map, try to map each read individually. (This is on by default as of v2.2.0). [ --no-half-paired ] Do NOT try to map individual reads. If a pair does not map, do NOT try to align each read independently. (The default is to try.) [ --sam-r2 ] Report the SAM r2 field for letter space alignments. Report a similar x2 fi- eld for colour space alignments. [ --insert-size-dist <mean,stddev> ] Specifies the mean and standard deviation of the insert sizes. They are assumed to come from a normal distribution. These values are used in mapping quality computation. The defaults are: mean=200, stddev=100. Thread Control -------------- [ -N/--threads <num_threads> ] Use <num_threads> threads of control. It defaults to "-N 1". [ -K/--thread-chunk <chunk_size> ] Threads take turns "checking out" a chunk of reads, mapping them, printing the results, and "checking out" the next chunk. This parameter specifies how many reads should be in each such chunk. Defaults to "-K 1000". [ -D/--thread-stats ] Print individual thread statistics in the log file. Input Control -------------- [ --longest-read ] The longest input read is of this length. [ -Q/--fastq ] The input is FASTQ format. [ -1/--upstream <filename> ] Use the given file as input for the upstream reads in paired mode. [ -2/--downstream <filename> ] Use the given file as input for the downstream reads in paired mode. [ --min-avg-qv <value> ] The minimum average quality value of a read for it to even be considered for mapping. The default is 10. [ --qv-offset <value> ] Interpret qvs in fastq input as PHRED+<value>. The default is 33 for colour space and 64 for letter space. [ --no-qv-check ] By default, gmapper crashes if it detects a qv less than -10 or greater than 50. This helps prevent mix-up between various input formats. This option diables this check. [ --ignore-qvs ] When input is fastq, completely ignore base/colour qvs from the analysis, as if the input were fasta. [ --trim-front <value> ] Trim the front of read (or both reads of pair), by this amount. [ --trim-end <value> ] Trim the end of read (or both reads of pair), by this amount. [ --trim-first ] Trim only the first read in read pair. [ --trim-second ] Trim only the second read in the read pair. [ --trim-illumina ] Trim trailing sequence that has 'B' (2) quality values. Output Control -------------- [ --shrimp-format ] Select old SHRiMP output format. [ -P/--pretty ] Select SHRiMP output format, and pretty-print alignments. [ -R/--print-reads ] Select SHRiMP output format, and include reads in the output. [ -E/--sam ] Select SAM output format. (This is on by default as of v2.2.0.) [ --sam-unaligned ] If SAM output format is also selected dump unaligned reads to the SAM output. [ --sam-header <filename> ] Output the filename as header instead of regular SHRiMP / SAM header. Includes additional lines defining a read group if need be. [ --sam-header-XX <filename> ] Where XX is one of "hd", "sq", "rg", "pg". Replace the normal gmapper output for that line with the contents of that file. Note: If --sam-header is given, all headers are taken from that file and no additional ones are printed. If some --sam-header-XX are given and some aren't, the given ones are taken from the associated files, and the not given ones are filled in by gmapper. [ --read-group read_group,pool_name ] Include in the SAM output the corresponding read_group and pool_name. This is ignored if either --sam-header or --sam-header-rg are given. [ --un <filename> ] [ --al <filename> ] Print unaligned/aligned reads to target file in the same format as the input (fasta or fastq). NOTE: Currently, the reads in these files are not guaranteed to be sorted in the same order as in the input when multiple threads are used! [ --progress <value> ] Display a progress line each <value> reads. the default is 100,000. Mapping Qualities ----------------- [ --no-mapping-qualities ] Do not compute mapping qualities. By default, they are computed. [ --all-contigs ] This flag specifies that all contigs of interest are included in this run. When this is not the case (which is the default), gmapper includes extra fields in its SAM output in order to make possible the recalculation of mapping qualities when merging mappings from different files. Also see Note on Post-alignment Option Ordering. [ --single-best-mapping ] If mapping qualities are computed, this flag tells gmapper to only output the "best" mapping for each read (or pair, in paired mode). Note: in paired mode, if --single-best-mapping is given but not --all-contigs, gmapper outputs the best mapping from each class: best paired, best unpaired for the first read, and best unpaired for the second read. These extra mappings are used to recompute mapping qualities when merging. Also see Note on Post-alignment Option Ordering. [ --no-improper-mappings ] Only relevant in paired mode, when --all-contigs and --single-best-mapping are both given. By default, if SHRiMP decides the best mapping of a pair is a singleton, it will attempt to pair it up with the best mapping of the other read, if the latter is strong enough. This enables mapping in the presence of large structural variations. The flag disables this behaviour. Diagnostics ----------- [ -X/--isize-histogram ] When running in paired mapping mode, dump histogram of insert sizes. (Use it with a statistically significant small set of reads.) This can be used to check the correct paired mapping mode was selected. The distribution of insert sizes should be bell-shaped. If it looks uniform, the paired mode was probably wrong. [ -Y/--proj-histogram ] Dump histogram of the lengths of the genomic index lists. The distribution of these list lengths has huge tails (e.g. average 50, maximum 500,000), and trimming some of the outliers often results in a dramatic speed boost at very little, if any, cost to sensitivity. Note on Post-alignment Option Ordering -------------------------------------- The set of mappings which pass the Scalar/Full SW threshold are then processed in this order: (1) If --strata is given, it is applied first, based on (sum of) scores. (2) If --max-alignments is gien, it is applied second. (3) Mapping qualities are then computed based on the remaining reads. (4) If --single-best-mapping is given for paired reads: (4a) If --all-contigs is not given, at most one top mapping in each class is reported, where the possible classes are: singleton mapping for first read, singleton mapping for second read, paired mapping. Here, sorting is by mapping quality. (4b) If --all-contigs is given, at most one mapping is output for each read pair. Selection between classes is done based on mapping quality. Note: For unpaired reads, there is only one class of mapping, so (4a) and (4b) are equivalent in terms of what mappings are output. However, they are different in that --all-contigs also prevents additional SAM fields from being included in the output which would have been used by merging. -------------------------------------------------------------------------------- 6. Algorithms -------------------------------------------------------------------------------- The mapping algorithm (gmapper) works as follows. First, it projects and indexes the genome using every spaced seed. These indexes are all kept in RAM. Next, several threads of control are created. Each thread "checks out" a chunk of reads to be mapped, it goes on running independently of the other threads, mapping the reads in its chunk, it prints the mappings as they are found, and when it is done, it returns to "check out" the next chunk. To map every read Q, the set of all possible mapping locations is conceptually passed through 3 filters: Window Generation, Vector SW/Ungapped Alignment, and Scalar (Full) SW Alignment. At the end, the top <num_outputs> remaining hits for every read are output. When running in paired mode, gmapper only looks for locations where to map each pair of reads in such a way that the size of the insert between the reads falls within a specified range. In most cases, this dramatically reduces the size of the search space. More detailed descriptions follow. Genome Projection ----------------- The first step in a gmapper run is to project and index the genome using every spaced seed. For a spaced seed S of weight (=number of 1s) W, the number of possible spaced kmers we might obtain by applying S to a genome or a read is 4^W. (We drop kmers containing wildcard bases such as "N".) Thus, for every seed S we build an array of size 4^W, where the location indexed by R contains the list of genome locations X, such that applying S at X yields the spaced kmer R. The size of the spaced kmer indexes (4^W) quickly becomes impractical. For this reason, we offer the option (using the -H flag) to first hash spaced kmers into 24-bit strings, then store genome locations indexed by those hash values. This is equivalent to topping the spaced kmer index size at 2^24 = 4^12. The disadvantage of this scheme is that, during the matching, some of the genome locations that are investigated do not in fact contain matches to the read being processed. Trimming the Genome Index ------------------------- In our experience, the distribution of the lengths of the lists in a genome index has huge tails. The super-long lists negatively impact running time, without providing any significant sensitivity gains. E.g., for pieces of 1/4 of hg18, the average list length is about 50, and the longest list has length about 500,000. Trimming (discarding) genome lists longer than about 5,000 lowers the running time by a factor of about 3x, and decreases the number of hits output by merely 1-2%. Because of this dramatic benefit, gmapper currently implements a basic automatic trimming step, selecting the cutoff value of: max( 1000, 100*4^total_genome_length / 4^max_seed_weight ) Informally, this is either 1000 or 100x the average list length, whichever is higher. The choice described above is very basic, and unfortunately, we do not have a reliable method to do a better automatic trimming. (This is high on our to-do list for the next release.) In our experience, finding a good list_cutoff value can be done as follows: - Split and project the genome untrimmed. - Run gmapper with the -Y -V parameters to see a list length histogram with: $ gmapper-ls -L genome -Y -V /dev/null >/dev/null - Based on histogram above, find a candidate value for list_cutoff, e.g., 8000; - Run gmapper with several list_cutoff values on a small (say, 10k) but statistically significant subset of reads; discard the mappings, but watch the running times and the percentage of reads mapped in the log file. E.g., $ for Z in 1000 2000 4000 8000 16000 32000 64000; do gmapper-ls -L genome reads.lsfasta -N 8 -z $Z >/dev/null 2>map.z${Z}.log done; $ grep -E "(Cutoff)|(Run-time:)|(Matched:)|(Matches:)" map.z*.log - After deciding on a good value, you can either give it on the command line for every mapping job with the -z parameter, or you can construct a trimmed genome index with, e.g.: $ gmapper-ls -L genome -S genome-trimmed -z 4000 Filter 1: Window Generation --------------------------- In the first filter, a read is projected using all spaced seeds, obtaining a set of several spaced kmers. For every spaced seed S and every spaced kmer R obtained from Q using S, the corresponding genome index is looked up, retrieving the list of genome locations where we also obtain R by applying S. Visualizing in 2D the genome G along the X axis and the read Q along the Y axis, we are able to compute a set of (spaced) "anchors", which are diagonal matching regions between G and Q. In the "Window Generation" filter, we find candidate matching windows CMW, which are regions of G of <window_length> length that contain "enough" anchors to justify further investigation. There are several window generation modes available (selected by -n). In window generation mode 2 (the default), each CMW must be supported by at least 2 anchors. In mode 1, CMW might be created even from a single anchor. Mode 1 results in higher sensitivity, but lower speed. In our experience, requiring more than 2 anchors for each CMW no longer provides justifiable speed benefits. In paired mapping mode, filter 1 is run independently for each read in a pair. In this case, in window generation mode "-n 4" (the default), both runs of filter 1 occur in (unpaired) mode 2. In window generation modes "-n 2" or "-n 3", both runs of filter 1 occur in (unpaired) mode 1. The <window_generation_threshold> parameter specifies the minimum score that has to be achieved by an optimistic estimate of the real match score in order for the CMW to be generated. The "optimistic" estimate consists of filling the holes in any anchors, and constructing a match from at most two anchors and a single gap. (Note, we say "optimistic" because the anchors come from matching spaced kmers, and G and Q might in fact differ at locations that are in those holes.) Filter 2: Vector SW/Ungapped Alignment -------------------------------------- Next, all CMWs for a given read are investigated by a fast vectorized implementation of the Smith-Waterman dynamic programming string matching algorithm. When running in colour space mode, the matching in this filter is applied to the colour space representations of the read and the CMW, and no crossovers (colour space sequencing errors) are called. When running in ungapped mapping mode, this filter is replaced by a linear time optimal alignment algorithm. In order for a CMW to pass this filter, it must achieve a score of at least <vector_sw_threshold>. In colour space, this is set by -v. In letter space, the vector SW and full SW should obtain the same score, so <full_sw_threshold> is used. Since filter 1 can potentially create many overlapping CMWs that are off by a few bp, filter 2 offers the option to discard any CMWs which overlap a "good" CMW (which already passed the filter) by more than <window_overlap>. This is set by -l. To speed up the mapping of a single read against repetitive regions of a genome, every time the SW algorithm is run, the resulting score is saved in a cache, indexed by the hash value of the genomic window. Subsequently, prior to running the SW algorithm on another genomic window (and the same read!), its hash value is computed and the cache is looked up for a previous score. Conflicts are possible, resulting in wrong SW scores, but they are extremely rare. The -Z flag disables this mechanism. This is disabled anyhow in ungapped mode, as in that case, computing the hash value of a genomic window is not much faster than the linear time alignment algorithm. Filter 3: Scalar (Full) SW Alignment ------------------------------------ Finally, the remaining <num_outputs>+10 top CMWs for each read are passed through a full SW alignment algorithm. While the vector SW applies to colour space, the final full alignment is always done in letter space. Since each letter depends on the previous letter and colour, any error on the colour space read will affect all following letters when converting to letter space. For this reason, we perform the alignment of all four possible letter space translations of the read and permit jumping between matrices, at the cost of <crossover_score>. A CMW passes this filter if its score is at least <full_sw_threshold>. This is set by -h. At the end, the top <num_outputs> remaining CMWs are output. The full SW alignment is the costliest of the filters applied, with running time proportional to the product of the lengths of the read and the CMW. A small optimization is achieved by restricting the dynamic programming to run through a series of "necks" or diagonal tubes centered along the original anchors that triggered the creation of the CMW in filter 1. By default, gmapper uses anchors with total extra width 8. In ungapped mode, the extra width is set 0. To disable this restriction and allow the SW algorithm to run through the entire matrix, use "-a -1". Paired Mapping Mode ------------------- In paired mapping mode, an additional conceptual filter is added between filters 1 and 2. This filter discards any CMWs for one read in a pair that cannot be "paired up" with a CMW for the other read. Here, a CMW for read Q1 is "paired up" to a CMW for read Q2 if the size of the insert between them falls in the allowed interval. The final mapping score for a pair equals the sum of the mapping scores of the individual reads. (We do not currently score insert sizes.) Following filter 3, the top <num_outputs> remaining paired hits for each pair are output. Mapping Quality Values ---------------------- As of v2.2.0, SHRiMP computes by default quality values for mappings, as well as for individual base calls. In letter space, base quality values are the same as the input base quality values. In colour space, base quality values are derived by running a forward-backward algorithm in order to compute the posterior probability of every base call. Mapping quality values are based on the probability a certain mapping is correct. For unpaired reads, the probability a certain mapping is correct is derived as the posterior of that mapping divided by the sum of the posteriors of all other mappings of that read. For reference, see: Li H., Ruan J., Durbin R., "Mapping short DNA sequencing reads and calling variants using mapping quality scores" doi: 10.1101/gr.078212.108. For paired reads, a more elaborate computation is performed. Mappings are placed into three classes: paired, unpaired for the first read, and unpaired for the second read. Various weights are used to carefully select between the classes, taking into account the posteriors of the mappings, the read lengths, as well as prior assumptions on the sensitivity of SHRiMP. In the end, of all classes of mappings for a certain pair of reads, at most one will have quality value greater than or equal to 4. When --single-best-mapping is used, only one best mapping is output for each pair. (Note, the quality of this best mapping might still be less than 4. If those are undesirable, they should be filtered downstream from the SAM output file.) When using --single-best-mapping, if SHRiMP decides that the best mapping for a pair of reads is a singleton mapping (only one of the reads being mapped), SHRiMP will also look at the best singleton mapping of the other read. If that one is by itself strong enough, SHRiMP will forcefully pair up those mappings, and mark them as an "improper" mapping. (Specifically, bit 0x2 in the FLAG field of the SAM output will be 0). Such mappings do not respect any assumptions about the insert size distribution, and they can even be on different chromosomes. To disable this behaviour, see --no-improper-mappings. -------------------------------------------------------------------------------- 7. Comparison to SHRiMP1 -------------------------------------------------------------------------------- 7.1 Algorithms -------------- The major design difference between SHRiMP1 and SHRiMP2 is that the former indexes reads and runs through the genome, while the latter indexes the genome and runs through the reads. Because of this, to make SHRiMP2 run a machine with limited amounts of RAM, one needs to split the genome into several pieces. However, the new design comes with a number of positives: genome indexes can be reused, massive parallelism can be achieved, and mate-pair/pair-end information can be effectively used to dramatically restrict the search space for mapping every read. 7.2 SHRiMP Output Format ------------------------ SHRiMP2 provides SAM output natively. For backward compatibility and for some of the tools included, it still provides the original SHRiMP output, described below. Lines corresponding to individual hits with tab-delimited fields. Such lines always begin with a '>' character in the first position. All utilities ignore any lines that do not begin with '>', such as alignments, comments, etc. Here's an example ('\' indicates continuation of the same logical line on the next line of this README file and does not appear in the actual output): >947_1567_1384_F3 reftig_991 + 22901 22923 3 \ 25 25 2020 18x2x3 Additionally, the beginning of each output file begins with a specification of the tab-delimited fields. For example, the following applies to the above read hit: #FORMAT: readname contigname strand contigstart contigend readstart readend\ readlength score editstring The #FORMAT: line allows new fields to be unambiguously added, or others removed or reordered without requiring alteration to the parsing routines. Descriptions of the columns are as follows: 'readname' Read tag name 'contigname' Genome (Contig/Chromosome) name 'strand' Genome strand ('+' or '-') 'contigstart' Start of alignment in genome (beginning with 1, not 0). 'contigend' End of alignment in genome (inclusive). 'readstart' Start of alignment in read (beginning with 1, not 0). 'readend' End of alignment in read (inclusive). 'readlength' Length of the read in bases/colours. 'score' Alignment score 'editstring' Edit string: read to reference summary; see below. Additionally, probcalc output adds the following columns: 'pchance' Probability that a read will align with a genome with as good a score or better by chance. 'pgenome' Probability that a hit was generated via common evolutionary events characteristic of the genome. 'normodds' Normalized pgenome/pchance. The 'editstring' column contains a short description of the alignment with reference to the database sequence. This allows at-a-glance analysis of alignments, including identification of miscalls, SNPs, indels, etc. The edit string consists of numbers, characters and the following additional symbols: '-', '(' and ')'. It is constructed as follows: <number> = size of a matching substring <letter> = mismatch, value is the tag letter (<letters>) = gap in the reference, value shows the letters in the tag - = one-base gap in the tag (i.e. insertion in the reference) x = crossover (inserted between the appropriate two bases) For example: A perfect match for 25-bp tags is: "25" A SNP at the 16th base of the tag is: "15A9" A four-base insertion in the reference: "3(TGCT)20" A four-base deletion in the reference: "5----20" Two sequencing errors: "4x15x6" (i.e. 25 matches with 2 crossovers) 7.3 SHRiMP1 Tools ----------------- Some of the tools included with the original SHRiMP1 package are still available (probcalc, prettyprint, shrimp_var), but these tools do not work with SAM output. For a description of these tools, we include the README file for the latest SHRiMP1 version, in README.SHRiMP132. -------------------------------------------------------------------------------- 8. Contact -------------------------------------------------------------------------------- The program website is http://compbio.cs.toronto.edu/shrimp The authors of this software may be contacted at the following e-mail address: shrimp at cs dot toronto dot edu -------------------------------------------------------------------------------- 9. Acknowledgements -------------------------------------------------------------------------------- Development was performed at the University of Toronto's Computational Biology lab in collaboration with the Stanford University Sidow Lab. The development of this distribution was made possible in part by National Engineering and Research Council of Canada Undergraduate Student Research Awards (NSERC USRAs). We would like to thank Dr. Lucian Ilie of the University of Western Ontario for providing us with sets of spaced seeds especially designed to achieve good sensitivity. We would like to thank Dr. Alessandro Guffanti (Bioinformatics, Genomnia srl, Milan, Italy), Christine Voellenkle and Jeroen van Rooij ( Policlinico San Donato, Milan, Italy) for their feedback on using SHRiMP for mapping micro RNA data, including the set of spaced seeds mentioned in section 3.5, which are used by default in miRNA mode.