change-history.org

1 Database schema
2 mysql++ API
3 cmdline parameters
4 modification

1 Database schema

Currently we use the Ensembl database schema as a template. A full-featured Ensembl database consists of over 70 tables. For a gene prediction task using Augustus as the annotation engine, we only need 3 of them.

1.1 table 'dna'

Contains the DNA sequence. This table has a 1:1 relationship with the contig table. Each record in this table maps one-to-one to a single row of the plain file 'dna.txt', in which sequences are stored in the format 'int-id\tsequence'.


Column	Type	Default value	Description	Index
seq_region_id	INT(10)		Primary key, internal identifier. Foreign key references to the seq_region table.	primary key
sequence	LONGTEXT		DNA sequence.
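
The 'int-id\tsequence' row format of 'dna.txt' can be illustrated with a small parser. This is only a sketch: the names DnaRecord and parseDnaLine are hypothetical and do not come from the Augustus sources, and it assumes the id is a plain non-negative decimal number of at most 9 digits.

```cpp
#include <string>

// One row of the plain file 'dna.txt': "int-id\tsequence".
struct DnaRecord {
    int seq_region_id;
    std::string sequence;
};

// Hypothetical helper: split one line at the first tab.
// Returns false if there is no tab, the id is empty, the id has more
// than 9 digits (to stay inside the range of int), or the id contains
// a non-digit character. 'out' is only written on success.
bool parseDnaLine(const std::string& line, DnaRecord& out) {
    const std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos || tab == 0 || tab > 9) return false;

    int id = 0;
    for (std::string::size_type i = 0; i < tab; ++i) {
        const char c = line[i];
        if (c < '0' || c > '9') return false;
        id = id * 10 + (c - '0');
    }
    out.seq_region_id = id;
    out.sequence = line.substr(tab + 1);  // everything after the tab
    return true;
}
```

Calling parseDnaLine on each line of 'dna.txt' yields the (seq_region_id, sequence) pairs that correspond to the rows of table 'dna'.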

1.2 table 'seq_region'

Stores information about sequence regions. The primary key is used as a pointer into the dna table so that the actual sequence can be obtained, and the coord_system_id allows sequence regions of multiple types to be stored. Contigs are stored with 'coord_system_id=2'. Chromosomes have 'coord_system_id=1'; they have no corresponding record in table 'dna'. The relationship between contigs and chromosomes is stored in the assembly table.


Column	Type	Description	Index
seq_region_id	INT(10)	Primary key, internal identifier.	primary key
name	VARCHAR(40)	Sequence region name.	unique key: name_cs_idx
coord_system_id	INT(10)	Foreign key references to the coord_system table.	unique key: name_cs_idx; key: cs_idx
length	INT(10)	Sequence length.

1.3 table 'assembly'

This is the assembly table structure.


Field	Type	Null	Key	Default
asm_seq_region_id	int(10) unsigned	NO	PRI	NULL
cmp_seq_region_id	int(10) unsigned	NO	PRI	NULL
asm_start	int(10)	NO	PRI	NULL
asm_end	int(10)	NO	PRI	NULL
cmp_start	int(10)	NO	PRI	NULL
cmp_end	int(10)	NO	PRI	NULL
ori	tinyint(4)	NO	PRI	NULL
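
How a row of this table relates a chromosome position to a contig position can be sketched as follows. The sketch assumes the usual Ensembl convention, which is not stated in this document: coordinates are 1-based and inclusive, 'ori' is 1 when the contig lies on the same strand as the chromosome and -1 when it is reversed. AssemblyRow and asmToCmp are hypothetical names.

```cpp
#include <cassert>

// One row of table 'assembly' (columns in the order listed above).
struct AssemblyRow {
    unsigned asm_seq_region_id;  // assembled region, e.g. a chromosome
    unsigned cmp_seq_region_id;  // component region, e.g. a contig
    int asm_start;               // first chromosome position covered
    int asm_end;                 // last chromosome position covered
    int cmp_start;               // first contig position used
    int cmp_end;                 // last contig position used
    int ori;                     // assumed: 1 = same strand, -1 = reversed
};

// Map a chromosome position that lies inside [asm_start, asm_end]
// to the corresponding position on the contig.
int asmToCmp(const AssemblyRow& row, int asm_pos) {
    assert(asm_pos >= row.asm_start && asm_pos <= row.asm_end);
    const int offset = asm_pos - row.asm_start;
    // Forward: count up from cmp_start. Reverse: count down from cmp_end.
    return row.ori >= 0 ? row.cmp_start + offset : row.cmp_end - offset;
}
```

For a row that places contig positions 1..1000 at chromosome positions 1001..2000, chromosome position 1001 maps to contig position 1 when ori is 1, and to contig position 1000 when ori is -1.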

2 mysql++ API

Currently we use a third-party MySQL API, mysql++, to fetch sequences from the database. I chose it because it is lightweight and works well with the STL.

2.1 configuration

'mysqlpplib' is added to the trunk base directory, and 'mysql' and 'mysqlppheader' are added to the include directory. 'mysql' is the set of header files that comes with MySQL.

2.2 run-time libpath

Copy mysqlpplib to a path that is registered in '/etc/ld.so.cache'. On most Unix-like systems these are '/usr/lib', '/lib' and '/usr/share/lib'. If you want to use your own library path but have no permission to run 'ldconfig', copy 'mysqlpplib' to '/your/run-time/lib/path' and compile Augustus with '-Wl,-rpath=/your/run-time/lib/path'.

2.3 use SSQLS

mysql++ allows user-defined 'Specialized SQL Structures' (SSQLS). At the most superficial level, an SSQLS has a member variable corresponding to each field in the SQL table. The structures 'dna', 'seq_region' and 'assembly' are defined in 'trunks/include/mysqlppheader/table_structure.h'.

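#include <string>
#include <mysql++.h>
#include <ssqls.h>

// sql_create_N(name, compare count, set count, type1, field1, ...):
// the first number is how many leading fields are used for comparison,
// the second is how many fields the generated constructor/set() takes.

// table 'dna'
sql_create_2(dna,
             1, 2,
             int, seq_region_id,
             std::string, sequence)

// table 'seq_region'
// coord_system_id is INT(10) in the schema; it is read as a string here.
sql_create_4(seq_region,
             1, 4,
             int, seq_region_id,
             std::string, name,
             std::string, coord_system_id,
             int, length)

// table 'assembly' (the 'ori' column is not mapped)
sql_create_6(assembly,
             1, 6,
             int, asm_seq_region_id,
             int, cmp_seq_region_id,
             int, asm_start,
             int, asm_end,
             int, cmp_start,
             int, cmp_end)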

3 cmdline parameters

--dbaccess accepts a comma-separated string "database name,host name,user,passwd,table name".
The only parameter without a leading '--' is the query. If '--dbaccess' is given, the query corresponds to a name in the 'seq_region' table, so file type detection is skipped in this case.
--predictionStart and --predictionEnd still work the same way as when the input file is in FASTA or GenBank format.

augustus --dbaccess="fly,localhost,henry,123456,," 3L --predictionStart=100 --predictionEnd=30000000 --species=fly
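
Splitting the '--dbaccess' value into its five fields can be sketched as below. This is not the code in types.cc; DbAccess, splitCommas and parseDbAccess are hypothetical names. The sketch keeps empty fields, pads missing trailing fields with empty strings and ignores anything after the fifth field, so the value from the example above is accepted as written.

```cpp
#include <string>
#include <vector>

// The five fields of --dbaccess, in the documented order.
struct DbAccess {
    std::string database;
    std::string host;
    std::string user;
    std::string passwd;
    std::string table;
};

// Split on commas and keep empty fields ("a,,b" gives "a", "", "b").
std::vector<std::string> splitCommas(const std::string& s) {
    std::vector<std::string> out;
    std::string::size_type begin = 0;
    while (true) {
        const std::string::size_type comma = s.find(',', begin);
        if (comma == std::string::npos) {
            out.push_back(s.substr(begin));  // last field, may be empty
            return out;
        }
        out.push_back(s.substr(begin, comma - begin));
        begin = comma + 1;
    }
}

// Hypothetical parser for the --dbaccess value.
DbAccess parseDbAccess(const std::string& arg) {
    std::vector<std::string> f = splitCommas(arg);
    if (f.size() < 5) f.resize(5);  // pad missing fields with ""
    DbAccess a;
    a.database = f[0];
    a.host = f[1];
    a.user = f[2];
    a.passwd = f[3];
    a.table = f[4];                 // fields beyond the fifth are ignored
    return a;
}
```

With the value "fly,localhost,henry,123456,," this gives database 'fly', host 'localhost', user 'henry', passwd '123456' and an empty table name.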

4 modification


file	desc
Makefile	add 2 header paths and 2 lib paths; add -Wl,-rpath=/your/run-time/lib/path
types.cc	lines 322-324: comment out an exception throw to allow 'dbaccess' in single mode. I don't want to modify this behavior at the system level, so I just commented it out.
types.cc	reorder --dbaccess to "database name,host name,user,passwd,table name"
randaccess.{hh,cc}	implement the AnnoSequence* DbSeqAccess::getSeq method; give class DbSeqAccess a mysqlpp::Connection object.
genbank.cc	GBSplitter(string fname), line 526: if the input fname is a name in 'seq_region' in the database, skip the file type detection.
table_structure.h	in 'trunks/include/mysqlppheader', add 3 SSQLS: 'dna', 'seq_region', 'assembly'
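
As an illustration of what a database-backed sequence fetch has to ask the server for, the sketch below builds a query that returns only the requested part of a contig. It is not the author's DbSeqAccess::getSeq; buildSubseqQuery is a hypothetical helper, and it assumes that start and end are 1-based and inclusive. MySQL's SUBSTRING(str, pos, len) is 1-based as well, so the start position can be passed through unchanged.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: SQL that selects positions start..end
// (assumed 1-based, inclusive) of one row of table 'dna'.
std::string buildSubseqQuery(int seq_region_id, int start, int end) {
    if (start < 1) start = 1;        // clamp to the first base
    if (end < start) end = start;    // keep the length positive
    std::ostringstream sql;
    sql << "SELECT SUBSTRING(sequence, " << start << ", "
        << (end - start + 1)
        << ") FROM dna WHERE seq_region_id = " << seq_region_id;
    return sql.str();
}
```

The resulting string could be passed to a mysql++ query object; fetching only the requested window avoids transferring a whole LONGTEXT value when just part of a region is predicted on.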

Author: yuqiulin <yuqiulin@genomics.cn>

Date: 2012-06-09 Sat
