GCE (genomic charactor estimator) is a bayes model based method to estimate the 
genome size ,genomic repeat content and the heterozygsis rate of the sequencing
sample. The estimated result can be used to design the sequencing strategy.


INSTALLATION
Download the package and run

tar -xzvf gce.tar.gz 
make (to build the executable file "gce")

in the compiled version, you can use the gce directly.

USAGE
gce -f test.freq [-g total_kmer_num]

test.freq is a list file containing at least two linescolumns, the first linecolumn
is depth and the second linecolumn is frequency(not the ratio) of the depth, other
linecolumn is not recognized in the program. 

Options:
-g 	total kmer number counted from the reads. It is suggested to set this
	value. If not, the total kmer number will be calculated using data in
	kmer_depth_file.
-c	unqiue coverage depth. It is suggested to be set when there is no
	clear peak or there is clear un-unique peaks, especially when the
	heterozygous ratio is high.
-H	when the heterozygous caused peak is clear, it is suggested to use
	hybrid mode.
-b	when there is sequencing bias, you need to set the value.

-m	estimation mode, there are standard discrete model(default) and continuous model. You can
	set 1 to use continuous model, but its stability is not well.

-D	set the raw distance for continuous model, which decide the peak
	number.
	
-h: display help information.

OUTPUT
when you run: gce -f test.freq >gce.table 2>gce.error

Estimation result file: gce.table:
there are two tables, one is ai table and the other is frequency table.
#ai table:
showing the estimated ci and ai for kmer species and Ci and bi for kmer individuals. 
the range of i is from 1 to max peak.
#i      c[i]    a[i]    C[i]    b[i]

#frequency table:
showing the raw depth distribution of kmer species(real_P(x)), the raw depth distribution
of kmer individuals(real_F(x)), the estimated depth distribution of kmer species(est_P(x)) 
and the estimated depth distribution of kmer individuals(est_F(x)).
#depth  real_P(x)       real_F(x)       est_P(x)        est_F(x)

For more details about kmer species and kmer individuals, please read the manuscript.

Estimation log file: gce.error
we use the Final estimation table:
raw_peak        now_node        low_kmer        now_kmer        cvg genome_size     a[1]    b[1]
20      101211427       0       2442901555      20.9675 1.16509e+08 0.915339        0.799342
the genome size estimated here can be used in practise.

PERFORMANCE
gce is a extremely fast tool for estimating the genomic charactor. For the
standard discrete model, the max memory is about 1.5MB, taking less than one
second. For the continuous model, when setting -D 8, the max memeroy is about
1.5MB, taking about 5 seconds. The memory and time cost is only related to the
max kmer depth and the depth distribution, not related with K size.

COMMENTS/QUESTIONS/REQUESTS
Please send an e-mail to liubinghang@genomics.cn