SOAPindel User Guide Shengting Li (lishengting@gmail.com) March 27, 2012 Bioinformatics Research Centre (BiRC), Aarhus University, Denmark Beijing Genomics Institute, Shenzhen (BGI-Shenzhen), China $Id: README 4 2013-02-09 05:52:54Z lishengting $ ########################################################################################## History: 2.1.4.89 2013-02-09 Fix the compile issue caused by gcc 4.7 ########################################################################################## 1) Introduction SOAPindel 2.0 is focusing on calling indels from the next-generation paired-end sequencing data. 2) Requirements Input data sources: 1. The reference sequence file used to align the reads. It must be in Fasta format. 2. The files with read-alignments. SOAPindel 2.0 accepts indexed BAM files as input. SOAPindel can guess the library insert sizes by itself, but if users could provide the correct ones, it will save some time. Sequencing platform: Theoretically, SOAPindel is designed for all paired-end sequencing data because it doesn't consider any qualities for now. But we've only tested it on Ilumina GA data, so we don't guaranty that it can work on 454 data. System environment: 1. Linux, Unix or any Linux like system. 2. perl version >= 5.8 3. samtools version >= 0.1.18 3) Workflow 1. Cluster reads around indels 2. Do local assembly and alignment to get indels 3. Filter results 4) Command & Input files perl indel_detection.ibam.pl 1. : reads mapping files list. File format: Lib_insert_size Deviation_of_insert_size Reads_length Lib_insert_size Deviation_of_insert_size Reads_length . . . : the reads alignment results with indexed BAM format (could be viewed by 'samtools view'). Lib_insert_size: insert size of library Deviation_of_insert_size: normally set to SD*2 (~20% of library insert size). Reads_length: maximal read length in the alignment file 2. : reference sequence file or folder. User should provide a whole file which contains all chromosomes. The script will split it automatically. User could also split the file by himself. If so, all reference sequence files must be put in this folder and named like 'chr*.fa' (no quotes, "chr" must be lowercase). Each file contains only one chromosome. Every file should be in fasta format and have id: ">chromosome_name " (chromosome_name must be same with the alignment files). The command could be run directly, but for the big data set (like one or more chromosomes), users are suggested to use "-p" to print out the script and run it manually (please see the details in Parameters section). If user want to use the "SOAP" or "un-indexed bam" format rather than "indexed bam" format, please use the "indel_detection.v1.pl" script and see the details in the "README.v1". 5) Parameters ----String & Number Parameters:--------------------------------------------------------------------- -chr chromosomes (ALL) This parameter could be a string of chromosomes with regions or not, split by plus "+", e.g. "chr7:26730761-27230760+chr7:89428340-90542763+chr20" means detecting all indels from 26730761 to 27230760 and 89428340 to 90542763 on chr7 and the whole chr20. If you want to detect a lot of regions, you could put them in one file (each line contains a region, the file must have a suffix '.list') and use the file name as the parameter. DEFAULT: ALL, means detect every thing in the alignment files. -l layer_num (2) The min valley depth allowed in one cluster. Any thinner layer will be cut and start to find a new cluster. DEFAULT: 2, should be no less than 2. Could be set to 1/20 of sequencing coverage to save some running time for high coverage data. -k kmer size (25) Dimensional of de Bruijn graph. DEFAULT: 25, should be no more than 1/2 reads length and no less than 10. -cpu cpu number (8) Number of local CPUs used. For qsub or other multi-task submitting methods, the total CPU number = this number x job number. -ext extension length for every cluster (max(read_len,50)) In order to find the indels near the edge of cluster, SOAPindel extends every cluster on each side. DEFAULT: max(reads length, 50), take the longer one. -x_num read_max_hits# (1) Don't consider any reads that hit the reference more than this parameter. DEFAULT: 1, means don't consider any repeats. -mm max_mismatch (int(read_len/35)+1) Max mismatches allowed in one read. DEFAULT: (int(reads length/35)+1). -il max gap between nails (100) SOAPindel will stop extending the path if the distance of two adjacent nails is too big. This parameter could set the limitation of indel size softly. DEFAULT: 100 -ol overlap length when cut cluster by max length (il) If the cluster is too long, SOAPindel will cut it into segments with certain length and certain overlap. DEFAULT: 100 -xl max cluster length (ol*3) If the cluster is longer than this parameter, SOAPindel will cut it into segments with certain length and certain overlap. DEFAULT: 300 -sdp use bigger deviation (deviation*sdp) to filter insert size of pair aligned reads (2.0) SOAPindel considers a pair of reads unmapped if the distance between them is bigger than insert_size*(1+deviation*sdp). DEFAULT: 2.0 -wd work_dir (.) SOAPindel will create this folder for all internal and final result. DEFAULT: current folder. This value could NOT contains slash '/'. -st path of samtools -mc max contigs to align (100) If the local assembly generates too many contigs, the local alignment only aligns part of them. DEFAULT: 100 -mx unique reads max reliable coverage (auto) Even if filter reads with SNPs, sometimes there are still too many coverage of reads in some regions (mostly repeats) and could affect the assembly unexpectedly. SOAPindel will drop reads randomly to reduce the coverage to a reasonable number. -bs block size of assemble_align scripts (100) Split the assemble_align jobs to spare some disk seeking operation. -qf path of quality tab file ($BIN_PATH/filter/Indel.qual.tab) -w window size for successive homogeneous indels checking (30) In order to filter out false positive sites, SOAPindel limits max number of successive homogeneous indels in a certain size window. DEFUALT: 30 -n max successive homogeneous indels #/window (2) Max number of successive homogeneous indels allowed. DEFAULT: 2, should be set to higher number if the divergence of the reference and sample is too big, e.g. 7 for Human vs Chimp. -cm path_of_cross_match, only works when -ucm is set () ----Switch Parameters:------------------------------------------------------------------------------ -nf don't filter reads with SNPs. -ng don't count unique gap-mapped reads as matched reads. -p only print script, don't run. Print out the script for user to run it manually. -nb run the assemble_align scripts one line at a time. -bfa compress internal fasta output. -qseq print sample seq for PCR. SOAPindel doesn't output sequence of sample around the indels by default because it will generate too huge result. When users need to do PCR validation or similar options, this parameter could help. -qsub use qsub to submit jobs If users want to use PBS (qsub) or other batch jobs submission system, the $QSUB_OPT1, $QSUB_OPT2 and $QSUB_OPT3 variables within the script must be modified manually in order to fit the object system. -np don't print progress in log -ucm use cross_match to do local alignment (debug) -t test time and memoery (debug) 6) Output files Log SOAPindel stores running log in the WORK_DIR/log folder. Users can use "cat WORK_DIR/log/*.log to check if there is something wrong during the process. Result SOAPindel supports VCF4.0 format in this verison. SOAPindel stores results in the WORK_DIR/result/chr*, one folder for one chromosome. There would be 8 files for each chromosome: chr*.HEAD chr*.variation.raw.gz: compressed raw results chr*.variation.sorted: sorted and unique raw results chr*.variation.list: all variations (include SNPs) with Q value chr*.indel.vcf: VCF4.0 format of chr*.variation.list without SNPs chr*.variation.noq.list: all variations (include SNPs) filtered by internal threshold chr*.indel.noq.list: all indels filtered by internal threshold chr*.indel.noq.vcf: VCF4.0 format of chr*.indel.noq.list All the ".list" files are in same format. Here is the description of each column: 1. cluster regions id 2. assembled contigs id 3. chromosome 4. type[-size] (S:SNP; D:Deletion; I:Insertion; N:Heterozygote Position) 5. start position on reference chromosome 6. local start position on the cluster. 7. indel position region (local start - local end: indels could happen on any position between the start and end, so the local end - local start + size = tandem repeat length, and tandem repeat length / size = repeat times) 8. indel allele 9. average coverage for insertion; minimum coverage for deletion 10. minimum coverage for insertion; overlapping coverage for deletion 11. "+" for homozygote; sample genotype / reference genotype ratio for heterozygote 12. L(left flank length from the assembled contig start to indel start)_R(right flank length from indel end to the contig end) 13. 10 bp left flank on reference_((indel region sequence without allele) + 10 bp right flank on reference) 14. Empty by default; if the -qseq is set, this column is (left flank of the contig)_(right flank of the contig) without allele sequence. 15. used internal 16. used internal 17. used internal And there is one statistics file: WORK_DIR/result/stat.q10.tab