1. Brief introduction ===================== This program corrects the sequencing errors base on kmer frequency spectrum(KFS). It assume that most low-frequency Kmers were generated by sequencing errors, so the key of error correction is that the distinguish rate of the low-frequency and high-frequency Kmers, the larger Kmer size, the better of this effect but need more computing resouce. In order to get an extreme high accuracy result, we balanced the trimmed length and delete ratio with the accuracy level. The practical Kmer size should be chosen based on the genome characteristic. 2. Two sets program under this package ====================================== (1) KmerFreq_AR & Corrector_AR When kmer size less than(<=) 17bp, this set is preferred because program running speed is faster than *_HA, and memory usage is less than 16GB(15mer,1G; 16mer,4G; 17mer,16G;) in KFS construction. Also, KmerFreq_AR support space-kmer in KFS construction and Corrector_AR support Duo-kmer(consective and space kmer) in correction. (2) KmerFreq_HA & Corrector_HA When kmer size larger than(>) 17bp, this set is preferred because memory usage is less in KFS construction. 3. Instruction of the input/output files and parameters ======================================================= (1) KmerFreq_AR [OPTION] -k Set the kmer size, default=17. Note: when k=17, kmer theoretic number is 4^17 = 16G, store 1 byte for a kmer, the kmer frequency spectrum will use 16G memory(k=16, 4G; k=18, 64G). -s Set the size of space-seed in kmer, default=0. Note: for consecutive kmer, use default value: 0; for space-kmer, set this option and make sure it is set to be one even number when kmer size is odd number(for reverse and complement kmer). When kmer size is one even number, the space-kmer format will be like this(for k=10, s=5): 'ATTCG-----GTACG'. When kmer size is odd number, the space-kmer format will be like this(for k=11, s=4): 'ATTCG--A--GTACG'. -c Set min precision rate for kmer, default=-1. Note: program caculate the precision of kmer base on the quality score, set between 0~0.99, or -1 for unrestrained. -t Set the thread number, default=1. Note: the more thread number, the high speed, however, this should be less than the number of CPU cores. -q Set the ASCII shift of the quality value(Q_SHIFT), default=64. Note: usually 64 or 33 for Illumina data. -m Set whether to output kmers depth file, 1:yes, 0:no, default=1. Note: if you just want to get the distribution curve of kmer frequency, set '-m 0'. -b Set total bases number used to get kmers, default: all the bases of read files. Note: when sequencing depth is too much, you can set this option, about 30X data bases is preferred. -p Set the output prefix, default=output. Note: often use species name as prefix. -h Show help information. Result file: (a) Kmer frequency file, in *.freq.cz compressed format, as the input of 'Corrector_AR' program. (b) Compressed block length file, in *.freq.cz.len format, used to record each blocks length of compressed Kmer frequency file, as the input of 'Corrector_AR' program. (c) Kmer frequency statistic file, in *.freq.stat, used to decide the cutoff between low and high frequency. (2) Corrector_AR [OPTION] [prefix.space.freq.cz] [prefix.space.freq.cz.len] Three mandatory input files: 1.: frequency file of consecutive kmer, in *.freq.cz compressed format, as the ouput of 'KmerFreq_AR' program. 2.: compressed block length file of consecutive kmer, in *.freq.cz.len format, as the ouput of 'KmerFreq_AR' program. 3.: the address list of reads file, each reads files take a line, reads file from the same pair-end lane should be neighbored. Two optional input files: 1.[prefix.space.freq.cz]: frequency file of space kmer, in *.space.freq.cz compressed format, as the ouput of 'KmerFreq_AR' program. 2.[prefix.space.freq.cz.len]: compressed block length file of space kmer, in *.space.freq.cz.len format, as the ouput of 'KmerFreq_AR' program. Program will not use space kmer in error correction if these two file are not set. -k Set size of consecutive kmer, default=17. Note: this value must be set to be same as that used in Kmer_freq program. -l Set the low frequency cutoff of consecutive kmer, default=3. Note: the consecutive Kmers with frequency lower than(<=) this cutoff will be taken as false Kmers, while the others as authentic Kmers. -K Set size of space kmer, default=0. Note: this value must be set to be same as that used in Kmer_freq program. If it is default value K=0, program will not use space kmer in error correction. -s Set the size of space seed, default=0. Note: this vaule must be set to be same as that used in Kmer_freq program. -L Set the low frequency cutoff of space kmer, default=3. Note: the space Kmers with frequency lower than(<=) this cutoff will be taken as false Kmers, while the others as authentic Kmers. -m Set the minimum length of a continuous high-freq-kmer region, default=10. Note: use the Branch and bounding tree method, needs a continuous high frequency region as the starting point. Here we set the minimum Kmer number in the continuous high-freq-kmer region. -c Set the maximum change allowed in one read, default=2. Note: set the allowed base number to be changed, when achieve this cutoff, the program will turn to trim data instead of correction. -n Set the maximum node number allowed in the branch and bound tree, default=15000000. Note: this setting affects the maximum memory that will be used. -a Set whether remain all the data(1) or not(0), default=0. Note: 1:remain all the data, 0:trim some low quality and suspicious data to get more accurate data; if set '1', no need to set -x and -r -Q Set the ASCII shift of the quality value(Quality_shift), default=64. Note: usually 64 or 33 for Illumina data. -e Set whether trim(1) suspicious region at the end of reads according to Q_value<=2 or not(0), default=0. Note: 0:no, trim low quality and suspicious data directly; 1:yes, just trim Q_value<=2 continues region. -w Set whether trim(1) error bases with Q_value<=2 instead of correct(0) it , default=0. Note: when base in Q_value<=2 continues region is detected error base, 0:correct this base; 1:trim this end of reads. -q Set Quality threshold of error bases, default=40. Note: bases with quality value higher than this threshold will not be corrected when using BBT method. -x Set the trimmed length at low-quality ends instead of correct them, default=KmerSize/2. Note: on the trimmed ends or real ends of reads, sequencing errors tend to be concentrate, and is hard to correct all of them. In order to get higher accuracy in the final result, we further trimmed some length at these low-quality regions. -r Set the minimum length of trimmed read, default=50. Note: This value set the minimum length of final reads that will be output. -t Set the thread number, default=1. Note: the more thread number, the high speed, however, this should be less than the number of CPU cores. -j Set whether convert read1 and read2 corrected file into Pair-end file: 1, yes, 0, no; default=1. Note: program may discard some low quality reads in corrected file, it should be convert into Pair-end file for using in some other porgram(for example: SOAPdenovo), pair and single file as the result file when it set to "-j 1". If you have only single-end reads, please do not use the function, that is, set option "-j 0". The default value is to deal with paired-end reads. -o Set output file format, default=1. Note: 0, fasta compressed file(*.fa.gz); 1, fastq compressed file(*.fq.gz); 2, fasta text file(*.fa); 3, fastq text file(*.fq). If input read files are in fasta format and ouput in fastq format, program will output all quality char=40+Quality_shift. -h Show detailed help Result files: (a) Each lane will generate two *.pair_*.fq file (contain Pair-End reads), one *.single.fq file(contain Single-End reads, if no Single-End reads, this file will be removed), and one *.pair.single.stat file (contain statistic information). (b) For each reads file, there is one *.cor.stat file, contain statistics information for each file. (c) For reads list, there is one *.QC.xls file, contain quality control information. (3) KmerFreq_HA [OPTION] -k Set kmer size(13~27), default=17. Note: kmer size should be set between 13 and 27. -r Set read length used to get kmers, default=read's real length. Note: the max read length to get kmer frequency spectrum. -a Set ignored length of the beginning of a read, default=0. Note: set this option when sequencing quality is low in the beginning of reads. -d Set ignored length of the end of a read, default=0 Note: set this option when sequencing quality is low in the end of reads. -g Set total bases number used to get kmers, default=all input bases. Note: when sequencing depth is too much, you can set this option, about 30X data bases is preferred. -l Set input read file list. Note: no default value, the address list of reads file, each reads files take a line. -p Set the output prefix, default=output. Note: often use species name as prefix. -i Set initial size of hash table, default=50000000. Note: this is the initial size, and program will enlarge the HashSet memory space (*2 each time) when memeory is insufficient. -t Set thread number, default=1. Note: the more thread number, the high speed, however, this should be less than the number of CPU cores. -L Set maximum read length, default=100. Note: program will alloc memeory buffer base on read length, so you should set the maximun read length in the read files. -f Set whether use Bloom filter to reduce memory usage, 0:no; 1:yes; default=0. Note: reduce memeory usage by using Bloom filter to reduce kmer which frequency lower than(<=) 2. -s Set the Bloom array size, default=1000000000. Note: you can set by this formula:(genome_size+genome_size*coverage*error_rate*kmer_size)*5/4. -b Set the kmer-freq analysis divide into how many steps(=1, 2, 4, 8), default =1. Note: the more step, the less memeory usage, but need more time. -h Show detailed help Result file: (a) Kmer frequence file, in *.freq.gz compressed format('kmerbit freq'),as the input of 'Corrector_HA' program. (b) Kmer frequence statistic file, in *.freq.stat used to decide the cutoff between low and high frequence. (4) Corrector_HA [OPTION] Two input files: : kmer frequency file, in *.gz compressed format, as the ouput of 'KmerFreq_HA' program. : the address list of reads file, each reads files take a line, reads file from the same pair-end lane should be neighbored. -k Set kmer size, default=17. Note: this value must be set to be same as that used in Kmer_freq program. -l Set the low frequency cutoff, default=3. Note: the Kmers with frequency lower than(<=) this cutoff will be taken as suspicious Kmers, while the others as credible Kmers. -m Set the minimum length of a continuous high-freq-kmer region, default=10. Note: use the Branch and bounding tree method, needs a continuous high frequency region as the starting point. Here we set the minimum Kmer number in the continuous high-freq-kmer region. -c Set the maximum change allowed in one read, default=2. Note: set the allowed base number to be changed, when achieve this cutoff, the program will turn to trim data instead of correction. -n Set the maximum node number allowed in the branch and bound tree, default=15000000. Note: this setting affects the maximum memory that will be used. -a Set whether remain all the data(1) or not(0), default=0. Note: 1:remain all the data, 0:trim some low quality and suspicious data to get more accurate data; if set '1', no need to set -x and -r -Q Set the ASCII shift of the quality value(Quality_shift), default=64. Note: usually 64 or 33 for Illumina data. -e Set whether trim(1) suspicious region at the end of reads according to Q_value<=2 or not(0), default=0. Note: 0:no, trim low quality and suspicious data directly; 1:yes, just trim Q_value<=2 continues region. -w Set whether trim(1) error bases with Q_value<=2 instead of correct(0) it , default=0. Note: when base in Q_value<=2 continues region is detected error base, 0:correct this base; 1:trim this end of reads. -q Set Quality threshold of error bases, default=40. Note: bases with quality value higher than this threshold will not be corrected when using BBT method. -x Set the trimmed length at low-quality ends instead of correct them, default=KmerSize/2. Note: on the trimmed ends or real ends of reads, sequencing errors tend to be concentrate, and is hard to correct all of them. In order to get higher accuracy in the final result, we further trimmed some length at these low-quality regions. -r Set the minimum length of trimmed read, default=50. Note: This value set the minimum length of final reads that will be output. -t Set the thread number, default=1. Note: the more thread number, the high speed, however, this should be less than the number of CPU cores. -j Set whether convert read1 and read2 corrected file into Pair-end file: 1, yes, 0, no; default=1. Note: program may discard some low quality reads in corrected file, it should be convert into Pair-end file for using in some other porgram(for example: SOAPdenovo), pair and single file as the result file when it set to "-j 1". If you have only single-end reads, please do not use the function, that is, set option "-j 0". The default value is to deal with paired-end reads. -o Set output file format, default=1. Note: 0, fasta compressed file(*.fa.gz); 1, fastq compressed file(*.fq.gz); 2, fasta text file(*.fa); 3, fastq text file(*.fq). If input read files are in fasta format and ouput in fastq format, program will output all quality char=40+Quality_shift. -h Show detailed help Result files: (a) Each lane will generate two *.pair_*.fq file (contain Pair-End reads), one *.single.fq file(contain Single-End reads, if no Single-End reads, this file will be removed), and one *.pair.single.stat file (contain statistic information). (b) For each reads file, there is one *.cor.stat file, contain statistics information for each file. (c) For reads list, there is one *.QC.xls file, contain quality control information. 4. Memory usage =============== (1) KmerFreq_AR: Memory usage is related to the kmer size. The peak value of memory usage can be estimated roughly by this formula: 4^KmerSize(Byte). If KmerSize=17, then the memory usage is 16G(KmerSize=16 Memory=4G; KmerSize=18 Memory=64G; KmerSize=19 Memory=256G). (2) Corrector_AR: Memory usage is also related to the kmer size. The peak value of memory usage can be estimated roughly by this formula: 4^KmerSize/8(Byte). If KmerSize=17, then the memory usage is 2G(KmerSize=18 Memory=8G; KmerSize=19 Memory=32G). For Duo-kmer mode, 17bp consecutive and 17bp space-kmer , then the memory usage is 2G+2G=4G. (3) KmerFreq_HA: Memory usage is related to the total number of kmer species in sequencing data, including the kmers which were generated by the sequencing error. The peak value of memory usage can be estimated roughly by this formula: KmerSpeciesNumber * 8Byte / 0.75. (4) Corrector_HA: Memory usage is related to the number of high frequence(greater than low frequence cutoff) kmer species. The peak value of memory usage can be estimated roughly by this formula: HighFreqKmerSpeciesNumber * 8Byte. 5. Filter the low-quality reads before ====================================== The purpose of this program is to generate very-high accuracy reads data (error rate < 0.1%), which can be directly used for SOAPdenovo assembly. Before using this program, if you have filtered some extrem-low quality reads or trim some low-quality 3'-ends, that will enhance the performance. This program will take the "N" bases as base "A", and will try to correct them with normal bases. 6. Alert and advice for users ============================= (1) When calculate the KFS, about 30X data is preferred. (2) Remember the ASCII shift of quality default value(Quality_shift -Q) is 64, user should check the file and make sure this option set suitably. (3) The low-frequency kmers regions will be taken as sequencing errors, and will be corrected or removed in the final result. However, this is not always the fact. The whole genome shotgun sequencing will generate random reads accross the genome, and some regions will be really in low coverage. These regions will be unluckily removed. So you should remember what the program have done to your data, and be clear what effect will cause in the following assembly. 7. A quick start example: (the frequently-used parameters are showing) ====================================================================== (1) For small size and repeat-less genome(Most frequently used): ./KmerFreq_AR -k 17 -t 10 -p Yeast read.lst >kmerfreq.log 2>kmerfreq.err ./Corrector_AR -k 17 -l 3 -r 50 -t 10 Yeast.freq.cz Yeast.freq.cz.len read.lst >corr.log 2>corr.err Note: for kmer=17bp, about 16G memory usage in KFS construction and 2G in correction. (2) For large size or repeat-rich genome, two strategies for selection: (a) Use Duo-kmer in correction(Faster and 16G memory usage): ./KmerFreq_AR -k 17 -t 10 -p Human read.lst >conkmerfreq.log 2>conkmerfreq.err ./KmerFreq_AR -k 17 -s 6 -t 10 -p Human read.lst >spacekmerfreq.log 2>spacekmerfreq.err ./Corrector_AR -k 17 -l 3 -K 17 -s 6 -L 3 -r 50 -t 10 Human.freq.cz Human.freq.cz.len Human.space.freq.cz Human.space.freq.cz.len read.lst >corr.log 2>corr.err Note: for consecutive-kmer=17bp space-kmer=17bp, about 16G memory usage in KFS construction and 4G in correction. (b) Use larger kmer size in correction(Better result but slower): ./KmerFreq_HA -k 23 -t 10 -p Human -l read.lst -L 100 >kmerfreq.log 2>kmerfreq.err ./Corrector_HA -k 23 -l 3 -r 50 -t 10 Human.freq.gz read.lst >corr.log 2>corr.err Note: About 3G genome size for human, suppose the sequencing error generate 4G error kmers, then program will use about (3G+4G)*8Byte/0.75=74.7G memory in KFS construction, and about 3G*8Byte=24G in correction.