change-history.org


1 Database schema

Currently we use the Ensembl database schema as a template. A full-featured Ensembl database consists of over 70 tables. For a gene prediction task using Augustus as the annotation engine, we need only 3 of them.

1.1 table 'dna'

Contains DNA sequence. This table has a 1:1 relationship with the contig records in the seq_region table. Each record in this table maps one-to-one to a row in the plain-text file 'dna.txt', in which sequences are stored in the format 'int-id\tsequence'.

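| Column        | Type     | Default value | Description                                                                     | Index       |
|---------------+----------+---------------+---------------------------------------------------------------------------------+-------------|
| seq_region_id | INT(10)  |               | Primary key, internal identifier. Foreign key referencing the seq_region table. | primary key |
| sequence      | LONGTEXT |               | DNA sequence.                                                                   |             |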
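
As an illustration of the 'int-id\tsequence' layout of 'dna.txt', the following is a minimal sketch of how one line could be split into the two columns of the 'dna' table. The names `DnaRecord` and `parseDnaLine` are hypothetical and are not part of Augustus or mysql++; the sketch also assumes that the id is a non-negative decimal integer.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical in-memory form of one 'dna.txt' row (not an Augustus type).
struct DnaRecord {
    int seq_region_id;
    std::string sequence;
};

// Splits "int-id\tsequence" at the first tab.
// Returns false if there is no tab or the id is not a plain decimal number.
bool parseDnaLine(const std::string& line, DnaRecord& out) {
    const std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos || tab == 0) {
        return false;
    }
    const std::string idText = line.substr(0, tab);
    for (char c : idText) {
        if (c < '0' || c > '9') {
            return false;  // assumption: ids are non-negative decimals
        }
    }
    out.seq_region_id = std::atoi(idText.c_str());
    out.sequence = line.substr(tab + 1);
    return true;
}
```

Calling `parseDnaLine("42\tACGTTGCA", rec)` fills `rec.seq_region_id` with 42 and `rec.sequence` with the eight bases; a line without a tab is rejected.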

1.2 table 'seq_region'

Stores information about sequence regions. The primary key is used as a pointer into the dna table so that the actual sequence can be obtained, and the coord_system_id allows sequence regions of multiple types to be stored. Contigs are stored with 'coord_system_id=2'. Chromosomes have 'coord_system_id=1'; they have no corresponding record in table 'dna'. The relationship between contigs and chromosomes is stored in the assembly table.

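| Column          | Type        | Default value | Description                                    | Index                                |
|-----------------+-------------+---------------+------------------------------------------------+--------------------------------------|
| seq_region_id   | INT(10)     |               | Primary key, internal identifier.              | primary key                          |
| name            | VARCHAR(40) |               | Sequence region name.                          | unique key: name_cs_idx              |
| coord_system_id | INT(10)     |               | Foreign key referencing the coord_system table. | unique key: name_cs_idx; key: cs_idx |
| length          | INT(10)     |               | Sequence length.                               |                                      |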

1.3 table 'assembly'

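The structure of the assembly table is shown below. Following the Ensembl convention, the 'asm_*' columns refer to the assembled region (here, a chromosome), the 'cmp_*' columns refer to the component (here, a contig), and 'ori' is the orientation of the component on the assembled region.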

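| Field             | Type             | Null | Key | Default | Extra |
|-------------------+------------------+------+-----+---------+-------|
| asm_seq_region_id | int(10) unsigned | NO   | PRI | NULL    |       |
| cmp_seq_region_id | int(10) unsigned | NO   | PRI | NULL    |       |
| asm_start         | int(10)          | NO   | PRI | NULL    |       |
| asm_end           | int(10)          | NO   | PRI | NULL    |       |
| cmp_start         | int(10)          | NO   | PRI | NULL    |       |
| cmp_end           | int(10)          | NO   | PRI | NULL    |       |
| ori               | tinyint(4)       | NO   | PRI | NULL    |       |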
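
To make the role of the assembly rows concrete, here is a minimal sketch of how a chromosome position could be translated into a contig position. It assumes the usual Ensembl semantics, which this note does not spell out: coordinates are 1-based and inclusive, and 'ori' is 1 for the forward strand and -1 for the reverse strand. `AssemblyRow`, `ContigPos` and `chromToContig` are hypothetical names, not Augustus code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical in-memory copy of one 'assembly' row.
struct AssemblyRow {
    int asm_seq_region_id;  // chromosome
    int cmp_seq_region_id;  // contig
    int asm_start;
    int asm_end;
    int cmp_start;
    int cmp_end;
    int ori;                // assumption: 1 = forward, -1 = reverse
};

struct ContigPos {
    int seq_region_id;
    int pos;
};

// Finds the row whose assembled interval contains 'pos' and converts the
// position into component (contig) coordinates. Returns false if no row
// covers the position, e.g. inside an assembly gap.
bool chromToContig(const std::vector<AssemblyRow>& rows, int chromId, int pos,
                   ContigPos& out) {
    for (const AssemblyRow& r : rows) {
        if (r.asm_seq_region_id != chromId || pos < r.asm_start || pos > r.asm_end) {
            continue;
        }
        const int offset = pos - r.asm_start;
        out.seq_region_id = r.cmp_seq_region_id;
        // On the reverse strand the contig is read from its end backwards.
        out.pos = (r.ori >= 0) ? r.cmp_start + offset : r.cmp_end - offset;
        return true;
    }
    return false;
}
```

A linear scan is enough for a sketch; with many contigs per chromosome, sorting the rows by 'asm_start' and using a binary search would be the more natural choice.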

2 mysql++ API

Currently we use a third-party MySQL API, mysql++, to retrieve sequences from the database. I chose it because it is lightweight and works well with the STL.

2.1 configuration

'mysqlpplib' is added to the trunk base directory, and 'mysql' and 'mysqlppheader' are added to the include directory. 'mysql' is the set of header files that comes with MySQL.

2.2 run-time libpath

Copy mysqlpplib to a directory that is registered in '/etc/ld.so.cache'. On most Unix-like systems these are '/usr/lib', '/lib' and '/usr/share/lib'. If you want to use your own library path but lack permission to run 'ldconfig', copy 'mysqlpplib' to '/your/run-time/lib/path' and compile Augustus with '-Wl,-rpath=/your/run-time/lib/path'.

2.3 use SSQLS

mysql++ lets users define 'Specialized SQL Structures' (SSQLS). At the most superficial level, an SSQLS has a member variable corresponding to each field in the SQL table. The SSQLS definitions for 'dna', 'seq_region' and 'assembly' are in 'trunks/include/mysqlppheader/table_structure.h'.

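#include <mysql++.h>
#include <ssqls.h>
#include <string>

// Arguments: structure name, number of fields used for comparison,
// number of fields set by the constructor, then (type, name) pairs.
sql_create_2(dna,
             1, 2,
             int, seq_region_id,
             std::string, sequence)
// coord_system_id is declared as std::string here, although the
// column is INT(10) in the schema above.
sql_create_4(seq_region,
             1, 4,
             int, seq_region_id,
             std::string, name,
             std::string, coord_system_id,
             int, length)
// Only the first six columns of 'assembly' are mapped; 'ori' is not.
sql_create_6(assembly,
             1, 6,
             int, asm_seq_region_id,
             int, cmp_seq_region_id,
             int, asm_start,
             int, asm_end,
             int, cmp_start,
             int, cmp_end)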

3 cmdline parameters

  • --dbaccess accepts a comma-separated string "database name,host name,user,passwd,table name".
  • The only parameter without a leading '--' is the query. If '--dbaccess' is given, the query corresponds to a name in the 'seq_region' table, so file type detection is skipped in this case.
  • --predictionStart and --predictionEnd work the same way as when the input file is in FASTA or GenBank format.
augustus --dbaccess="fly,localhost,henry,123456,," 3L --predictionStart=100 --predictionEnd=30000000 --species=fly 
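
The comma-separated '--dbaccess' value can be taken apart roughly as follows. This is a sketch, not the parsing code in types.cc, which is not shown in this note. `DbAccess`, `splitCommas` and `parseDbAccess` are hypothetical names, and the sketch assumes that none of the fields contains a comma and that an omitted table name is acceptable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical holder for the five documented fields.
struct DbAccess {
    std::string database;
    std::string host;
    std::string user;
    std::string passwd;
    std::string table;
};

// Splits on every comma and keeps empty fields, so "a,,b" yields 3 parts.
std::vector<std::string> splitCommas(const std::string& s) {
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type comma = s.find(',', start);
        if (comma == std::string::npos) {
            parts.push_back(s.substr(start));
            break;
        }
        parts.push_back(s.substr(start, comma - start));
        start = comma + 1;
    }
    return parts;
}

// Requires at least database, host, user and passwd; the table name is
// treated as optional (an assumption). Extra trailing fields are ignored.
bool parseDbAccess(const std::string& value, DbAccess& out) {
    const std::vector<std::string> p = splitCommas(value);
    if (p.size() < 4) {
        return false;
    }
    out.database = p[0];
    out.host = p[1];
    out.user = p[2];
    out.passwd = p[3];
    out.table = (p.size() > 4) ? p[4] : std::string();
    return true;
}
```

With the value from the example command, "fly,localhost,henry,123456,,", this yields the database 'fly', host 'localhost', user 'henry', password '123456' and an empty table name.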

4 modification

| file               | desc                                                                                                                                                                                      |
|--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Makefile           | Add 2 header paths and 2 lib paths; add -Wl,-rpath=/your/run-time/lib/path.                                                                                                               |
| types.cc           | Lines 322-324: comment out an exception throw message to allow 'dbaccess' in single mode. I did not want to modify this behavior at the system level, so I just commented it out.          |
| types.cc           | Reorder --dbaccess to "database name,host name,user,passwd,table name".                                                                                                                   |
| randaccess.{hh,cc} | Implement the AnnoSequence* DbSeqAccess::getSeq method. Give class DbSeqAccess a mysqlpp::Connection object.                                                                              |
| genbank.cc         | GBSplitter(string fname), line 526: if the input fname is a name in 'seq_region' in the database, skip file type detection.                                                               |
| table_structure.h  | In 'trunks/include/mysqlppheader', add 3 SSQLS: 'dna', 'seq_region', 'assembly'.                                                                                                          |

Author: yuqiulin <yuqiulin@genomics.cn>

Date: 2012-06-09 Sat
