Program: SOAPsnp (Short Oligonucleotide Analysis Package for Single Nucleotide Polymorphism)

Copyright (C) 2008, BGI Shenzhen.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Author:  BGI shenzhen
Contact: soap@genomics.org.cn

Introduction

SOAPsnp is a member of the SOAP (Short Oligonucleotide Analysis Package). Despite its name, the program is a resequencing utility that can assemble consensus sequence for the genome of a newly sequenced individual based on the alignment of the raw sequencing reads on the known reference. The SNPs can then be identified on the consensus sequence through the comparison with the reference. In the first Asian genome re-sequencing project, evalution of SOAPsnp result on Illumina HapMap 1M BeadChip Duo genotyping sites shows great accuracy. Over 99% of the genotyping sites are covered at over 99.9% consistency. Further PCR plus Sanger sequencing of the inconsistent SNP sites confirmed majority of the SOAPsnp results.

SOAPsnp uses a method based on Bayesa' theorem (the reverse probability model) to call consensus genotype by carefully considering the data quality, alignment, and recurring experimental errors. All these kinds of information was integrated into a single quality score for each base in PHRED scale to measure the accuracy of consensus calling. Currently, it supports the alignment format of SoapMap.

Download

<Download link>

System requirements
SOAPsnp is a command line driven program written in C/C++ that generally runs under 64-bit Linux system. The program has been tested on various platforms like x86-64 Xeon with Linux kernel 2.6.9 and Loongson 2E/2F with Linux kernel 2.6.22. It is in principle portable to other architectures/systems as only standard C++ libraries were used. GNU Compiler Collection (version>=3.4) is recommended to compile the codes.

The program needs ~500M or even smaller memory to run. However, its output might be very large that consumes a lot of harddisk space. In text output mode, the output file may be as large as 60 times the genome size (e.g. 180G free space is required to run a human genome). In GLF output format (which is proposed by Prof. R. Durbin in Wellcome Trust Sanger Institute), the output file approximately requires a free disk space of 12 times the genome size to store.

Installation
1.	Download the tarball of the latest SOAPsnp version from the link above. (For example, SOAPsnp.tar.gz)
2.	In the Linux console:
tar zxvf /<PATH_WHERE_YOU_PUT_THE_TARBALL>/SOAPsnp.tar.gz
cd SOAPsnp/
3.	Change the 'makefile' if necessary. For example, you may would like to modify the compiler optimization parameters.
4.	In the Linux console:
make all
Then an executable of SOAPsnp will be generated in the directory.
In the Linux console, type:
./soapsnp
or:
<Absolute path>/soapsnp
	to run the program. You may copy the executable to /usr/bin/ or other system paths defined in the environment variables so that you can simply run the program by directly typing 'soapsnp' in the console.

Quick Start:
For diploid genome resequencing:
soapsnp -i <Alignment.soap.sort.chrN> -d  <chrN.fasta> -o <chrN.consensus> -r 0.00005 e 0.0001 -t -u -L  <Maximum Read Length> -M <chrN.mat>

For monoploid genome resequencing:
soapsnp -i <Alignment.soap.sort.chrN> -d  <chrN.fasta> -o <chrN.consensus> -r 0.0001 -t -u -L  <Maximum Read Length> -M <chrN.mat> -m

Usage
Command line options:
1. Required parameters:
-i <FILE> Input SORTED SOAPaligner(soap) alignment result
Note that here we say 'sorted' means alignments of each chromosome are sorted first by chromosome name lexicographically and then by coordinates on each chromosome numerically.

-d <FILE> Reference DNA sequence in FASTA format

-o <FILE> Output consensus file

2. Optional parameters:(default in [])

-z <Char> ASCII character that stands for quality score==0 [@]
FASTQ files generated by Illumina base-calling pipeline use '@' as 0, but some institutes use '!' as 0.

-g <Double> Global error dependency coefficient, 0.0(complete dependent)~1.0(complete independent)[0.9]

-p <Double> PCR error dependency coefficient, 0.0(complete dependent)~1.0(complete independent)[0.5]
Sequencing errors are found slightly repeatable (once an error occur, additional errors also tend to occur) due to various reasons. Therefore, observations of sequencing errors are not complete independent. The main source of repeatable errors is believed to be PCR amplification in sequencing process. The proper values of the two parameters rely on wetlab process. Nonetheless, the default value generally work at most time.

-r <Double> novel altHOM prior probability [0.0005]

-e <Double> novel HET prior probability [0.0010]
The two are prior probabilities of homozygous SNPs (altHOM) and heterozygous SNPs (HET), which are used in Bayes formula calculation. Note these are prior probabilities of a new (novel) SNP. They are expected to be stringent. For different species, the two values should change if necessary.

-t set transition/transversion ratio to 2:1 in prior probability

-s <FILE> Pre-formatted known SNP information.
The file consist of a lot of lines like this one:
chr1    201979756       1       1       0       0.161   0       0       0.839   rs568
The columns from left to right are: name of chromosome, coordinate on the chromosome, whether the SNP has allele frequency information (1 is true, 0 is false), whether the SNP is validated by experiment (1 is true, 0 is false), whether the SNP is actually an indel (1 is true, 0 is false), frequency of A, frequency of C, frequency of T, frequency of G, SNP id. For known SNP sites that do not have allele frequency information, the frequency information can be arbitrarily determined as any positive values, which only imply what alleles have already been deposited in the database.

-2 specify this option will REFINE SNP calling using known SNP information [Off]

-a <Double> Validated HET prior, if no allele frequency known [0.1]

-b <Double> Validated altHOM prior, if no allele frequency known[0.05]

-j <Double> Unvalidated HET prior, if no allele frequency known [0.02]

-k <Double> Unvalidated altHOM rate, if no allele frequency known[0.01]
The parameters are related to using external SNP information to alter prior probabilities for SNP calling. SOAPsnp will try using allele frequency information as prior probability in calling genotypes for each site. If the allele frequency information is absent, it will use the above 4 parameters as prior probability.

-u Enable rank sum test (that check whether the two allele of a possible HET call have same sequencing quality) to give HET further penalty for better accuracy. [Off]

-n Enable binomial probability calculation (that check whether the two allele are observed equally)to give HET further penalty for better accuracy. [Off]

-m Enable monoploid calling mode, this will ensure all consensus as HOM and you probably should SPECIFY higher altHOM rate. [Off]

-q Only output potential SNPs. Useful in Text output mode. [Off]

-M <FILE> Output the quality calibration matrix; the matrix can be reused with -I if you rerun the program

-I <FILE> Input previous quality calibration matrix. It cannot be used simutaneously with -M

-L <short> maximum length of read [45]
Please note that once length of some reads exceeds the parameter will probably collapse the program.

-Q <short> maximum FASTQ quality score [40]

-F <int> Output format. 0: Text; 1: GLFv2; 2: GPFv2.[0]

-E <String> Extra headers EXCEPT CHROMOSOME FIELD specified in GLFv2 output. Format is "TypeName1:DataName1:TypeName2:DataName2"[]

-T <FILE> Only call consensus on regions specified in FILE. Format of this file is:
ChrName\tStart\tEnd
ChrName\tStart\tEnd
...

-h Display this help

Output format
1.	Text format
The result of SOAPsnp has 17 columns: 
1)	Chromosome ID
2)	Coordinate on chromosome, start from 1
3)	Reference genotype
4)	Consensus genotype
5)	Quality score of consensus genotype
6)	Best base, average quality score of best base
7)	Count of uniquely mapped best base
8)	Count of all mapped best base
9)	Second best bases, average quality score of second best base
10)	 Count of uniquely mapped second best base
11)	 Count of all mapped second best base
12)	 Sequencing depth of the site, rank sum test p_value
13)	 Average copy number of nearby region
14)	 Whether the site is a dbSNP. 
2.	GLFv2 and GPFv2
GLFv2 (Genome Likelihood Format v2) is a binary file format proposed by Prof. R. Durbin.