Usage: seqclust_cmd.py [options]

Options:
  -h, --help            show this help message and exit
  -s SEQS, --sequences=SEQS
                        input sequences in fasta format
  -m MINCL, --mincl=MINCL
                        minimal size of cluster for detailed analysis
                        [% of total reads]
  -o MINOVL, --minovl=MINOVL
                        minimal overlap for assembly
  -d REPEATMASKER, --repeatmasker=REPEATMASKER
                        repeatmasker database, possible options are All,
                        Viridiplantae, Metazoa, Mammalia, Fungi, None
  -v OUTPUT_DIR, --output_dir=OUTPUT_DIR
                        output directory
  -p, --paired          paired reads
  -a, --sq_rename       do not rename sequences
  -l OVERLAP, --overlap=OVERLAP
                        minimal overlap (default 55, 30-500)
  -k CUSTOM_DATABASE, --custom_database=CUSTOM_DATABASE
                        file with custom repeat masker database
  -e RPS_BLAST, --rps_blast=RPS_BLAST
                        if you want to run rpsblast against CDD, specify
                        e value (1e-2 - 1e-10)
  -f PREFIX, --prefix=PREFIX
                        prefix length - for comparative analysis
  -z SEQCLUST_DIR, --seqclust_dir=SEQCLUST_DIR
                        directory which contains previous clustering results
                        with seqclust directory; this directory must be
                        different from the output directory
  -b MERGE, --merge=MERGE
                        file with lists of clusters for merging
  -r MAX_MEM, --max_mem=MAX_MEM
                        maximal amount of available RAM in kB; if not set,
                        clustering tries to use whole available RAM
  -c CPU, --cpu=CPU     number of cpu to use; by default all available
                        processors are used

EXAMPLES:

  clustering with defaults:
      seqclust_cmd.py -s sequences.fas -v output_directory

  clustering with comparative analysis when species are coded by the
  first 4 characters in sequence names:
      seqclust_cmd.py -s sequences.fas -f 4 -v output_directory

  clustering with paired Illumina reads:
      seqclust_cmd.py -s sequences.fas -p -v output_directory

  merging of clusters from previous clustering:
      seqclust_cmd.py -z output_directory -b merge.txt -v output_directory2

  file merge.txt contains space delimited lists of clusters which should
  be merged into one, e.g.:
      1 4 5 6 8 3 9 10 2 ..
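
The merge file passed with -b can be prepared by hand or generated. The following is a minimal Python sketch, not part of seqclust_cmd.py, that formats such a file. It assumes that each list of clusters goes on its own line, which the help text above does not state explicitly. The function name format_merge_lists and the cluster numbers in the usage note are hypothetical.

```python
def format_merge_lists(groups):
    """Return merge-file text with one space delimited list per line.

    Assumption: one list of clusters per line. The help text only says
    that the lists are space delimited.
    """
    seen = set()
    lines = []
    for group in groups:
        if len(group) < 2:
            raise ValueError("a merge list needs at least two clusters")
        for cluster in group:
            # A cluster listed twice would make the merge ambiguous.
            if cluster in seen:
                raise ValueError(f"cluster {cluster} is listed more than once")
            seen.add(cluster)
        lines.append(" ".join(str(cluster) for cluster in group))
    return "\n".join(lines) + "\n"
```

For example, format_merge_lists([[2, 7], [3, 11, 12]]) returns two lines, "2 7" and "3 11 12", which can be written to merge.txt before running the merge command shown above.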
Input sequences requirements:
  - To obtain optimal results, use only high-quality sequences 80 bp or
    longer. Avoid including sequences shorter than 55 nt.
  - Ideally, trim all sequences to the same length.
  - A minimum of 5,000 sequences is required for clustering.
  - Make sure that all adapters have been removed from the sequences.
    The presence of adapters will invalidate clustering results!

Use of paired reads:
  - All pairs must be complete.
  - Input sequences contain both read mates, and left mates alternate
    with their right mates.

Sequence ID renaming (-a option):
  - Sequences are renamed by default. If you want to keep the original
    sequence names, use the -a option. For paired reads it is required
    that the left and right mates are distinguished by the last
    character of the sequence name. It is also necessary that all reads
    are paired and left mates alternate with their right mates!

Prefix (-f option):
  - If you wish to keep part of the sequence name, enter the number of
    characters which should be kept (1-10). Use this setting if you are
    doing comparative analysis.

Minimum overlap length for clustering (-l option):
  - Minimal length (in nucleotides) of similarity hits to be considered
    significant. It can be used to increase the default threshold, which
    requires similarity over at least 55%.

Custom database of repeats (-k option):
  - Library of repeats as DNA sequences in fasta format. The recommended
    format for IDs in a custom library is: '>repeatname#class/subclass'

For details of usage consult the manual with case examples at
http://repeatexplore.ubmr.cas.cz
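
The input requirements above can be checked before a run is started. The following is a minimal Python sketch of such a pre-check, not something seqclust_cmd.py is known to provide. The function names read_fasta and check_reads are hypothetical, and the thresholds are taken from the text above (5,000 sequences, 55 nt). The mate check assumes that the two names in a pair are identical except for the last character.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_reads(records, paired=False, min_reads=5000, min_len=55):
    """Return a list of problems found in (name, sequence) records."""
    records = list(records)
    problems = []
    if len(records) < min_reads:
        problems.append(
            f"only {len(records)} sequences; at least {min_reads} are required")
    short = sum(1 for _, seq in records if len(seq) < min_len)
    if short:
        problems.append(f"{short} sequences are shorter than {min_len} nt")
    if len({len(seq) for _, seq in records}) > 1:
        problems.append(
            "sequences differ in length (trimming to one length is recommended)")
    if paired:
        if len(records) % 2:
            problems.append("odd number of sequences; some pairs are incomplete")
        # Assumption: mates share a name except for the last character.
        for (left, _), (right, _) in zip(records[0::2], records[1::2]):
            if left[:-1] != right[:-1] or left[-1:] == right[-1:]:
                problems.append(f"{left} and {right} do not look like mates")
                break
    return problems
```

A possible use is to open sequences.fas, pass the file object to read_fasta, and call check_reads(records, paired=True) before running with -p. An empty list means none of the checked requirements was violated. Adapter contamination is not detected by this sketch.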