Database construction

Overview of the data construction procedure

Essentially, the data construction process consists of the following steps:

  1. Prepare your genomic sequence data. If your sequences are already in the MBGD database, you can skip this step, because the procedure will automatically try to download missing data from MBGD.

  2. Edit $CGAT_HOME/etc/speclist to specify a set of species to be analyzed and a set of programs to be executed for each species set.

  3. Run $CGAT_HOME/build/BuildAll.pl. This script executes all procedures for building the database according to the $CGAT_HOME/etc/speclist file, and writes the resulting data to the $CGAT_HOME/database.work directory.

  4. Run $CGAT_HOME/build/Release.pl to release the data from the $CGAT_HOME/database.work directory to the $CGAT_HOME/database directory.

Preparing genomic sequences

Before running the programs, you must prepare the genome sequence data. There are three ways to do this.

  1. If your sequences are already in the MBGD database, the simplest way to prepare the data is to look up the abbreviated names of those genomes in MBGD (e.g. 'eco' for Escherichia coli K12) and use these names in the configuration table described in the next section. The build procedure automatically tries to download missing data from MBGD. Note that the name specified here is used throughout the system (hereafter called SPNAME).

  2. Alternatively, if you have genomic data in GenBank format, you can use the $CGAT_HOME/build/getDataFromGenBank.pl script to convert it to the CGAT database. The syntax of the command is

    getDataFromGenBank.pl GBK_FILE SPNAME

    For example:
    
    getDataFromGenBank.pl NC00913.gbk eco
    
    Here, you can use as SPNAME any name that is composed of alphanumeric characters, provided that it is unique.

  3. Otherwise, you must prepare data by yourself. The data you should prepare are as follows:

    • Choose an appropriate unique name (SPNAME) that is composed of alphanumeric characters.

    • $CGAT_HOME/database/genomes/SPNAME: a genomic sequence in fasta format.

    • $CGAT_HOME/database/genes/aa/SPNAME: translated sequences of genes in fasta format (optional; required only if you want to calculate attribute values associated with protein sequences).

    • $CGAT_HOME/database/genes/nt/SPNAME: nucleotide sequences of genes in fasta format (optional; required only if you want to calculate attribute values associated with nucleotide sequences).

    • $CGAT_HOME/database/genes/tab/SPNAME: a tab-delimited table of genes containing the following information: beginning position, ending position, direction (1/-1), color code (= function category code; optional), the name of the gene, and the name of the product (optional). The beginning position must be smaller than the ending position, even for a gene on the reverse strand. The file must begin with a header line containing a tab-delimited list of field names, as in the following example:

      
      #from	to	dir	color	name	product
      190	255	1	1	B0001	thr operon leader peptide
      337	2799	1	1	B0002	bifunctional aspartokinase I/homoserine dehydrogenase I
      2801	3733	1	1	B0003	homoserine kinase
      3734	5020	1	1	B0004	threonine synthase
      5234	5530	1	100	B0005	hypothetical protein
      
      By default, the color code is defined based on the MBGD function category. Specify '100' if you do not need to assign any color. You can change the color code by modifying $CGAT_HOME/etc/colorTab/colorTab.gene.
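
The rules for the gene table stated above (header line, beginning position smaller than ending position, direction of 1 or -1) can be checked before building. The following is only a minimal sketch in Python and is not part of CGAT; the function name check_gene_table is hypothetical, and it assumes that the color and product columns may be left empty but that the first five columns are always present.

```python
EXPECTED_HEADER = ["#from", "to", "dir", "color", "name", "product"]


def check_gene_table(text):
    """Return a list of problems found in a gene table given as one string."""
    problems = []
    header_seen = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Drop surrounding spaces only, so that empty tab-delimited
        # fields at the start of a line are not lost.
        line = raw.strip(" \r\n")
        if not line:
            continue
        fields = line.split("\t")
        if not header_seen:
            # The first non-blank line must be the header line.
            header_seen = True
            if fields != EXPECTED_HEADER:
                problems.append(f"line {lineno}: unexpected header {fields}")
            continue
        if len(fields) < 5:
            problems.append(
                f"line {lineno}: expected at least 5 fields, got {len(fields)}"
            )
            continue
        begin, end, direction, name = fields[0], fields[1], fields[2], fields[4]
        if not (begin.isdigit() and end.isdigit()):
            problems.append(f"line {lineno}: positions must be integers")
        elif int(begin) >= int(end):
            problems.append(
                f"line {lineno}: beginning position must be smaller than ending position"
            )
        if direction not in ("1", "-1"):
            problems.append(f"line {lineno}: direction must be 1 or -1")
        if not name:
            problems.append(f"line {lineno}: gene name is missing")
    if not header_seen:
        problems.append("no header line found")
    return problems
```

An empty list means that no problem was detected; for example, check_gene_table(open(path).read()) can be called on a $CGAT_HOME/database/genes/tab/SPNAME file before running the build.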

Configuring database building procedure

Next, you must prepare the $CGAT_HOME/etc/speclist file, which contains the information required for the database building procedure.

The speclist file consists of a macro definition section and a dataset definition section, where the macro definitions must precede the dataset definitions. The syntax of a macro variable definition is as follows:


    SET varname = value

A dollar sign followed by a variable name, e.g. $varname, causes variable substitution, as usual.

The dataset definition section is a tab-delimited table containing the following fields: SPNAME_LIST, PROGRAM_LIST, FLAG_UPDATE, and FLAG_PUBLIC. SPNAME_LIST is a comma-delimited list specifying a set of species to be compared. PROGRAM_LIST is a space-delimited list specifying the script files to be executed for each species set. The file names are relative to the $CGAT_HOME/build directory, and wildcard characters such as '*' can be used. For example, align/* specifies all (executable) files under the $CGAT_HOME/build/align directory. FLAG_UPDATE and FLAG_PUBLIC are flags specifying whether the data should be updated and whether the data should be open to the public (through the CGI script), respectively. Each value should be 1 (yes) or 0 (no); the default value is 1.

The following is an example of the speclist file. The file directs the build script to compare the genome sequences "hpy" and "hpj" by executing all programs for identifying feature segments in the $CGAT_HOME/build/segment directory, followed by all programs for calculating gene attribute values in the $CGAT_HOME/build/geneattr directory, followed by all programs for calculating alignments between the two genomes in the $CGAT_HOME/build/align directory.


####################
# macro definition
####################
SET AlignAll = align/*
SET SegmentAll = segment/* geneattr/*
####################
# dataset definition
####################
hpy,hpj	$SegmentAll $AlignAll
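
To illustrate how the macro and dataset sections fit together, the following Python sketch reads a speclist and expands the macros. It is not the parser used by BuildAll.pl; the function name parse_speclist is hypothetical, and the sketch assumes that lines beginning with '#' are comments (as in the example above) and that undefined variables are left unexpanded.

```python
import re


def _expand(value, macros):
    # Replace each $varname with its definition; unknown names are kept as-is
    # (an assumption, the real build script may behave differently).
    return re.sub(r"\$(\w+)", lambda m: macros.get(m.group(1), m.group(0)), value)


def _flag(fields, index):
    # Flags default to 1 (yes) when the column is missing or empty.
    if len(fields) > index and fields[index].strip():
        return fields[index].strip() == "1"
    return True


def parse_speclist(text):
    """Return (macros, datasets) parsed from the contents of a speclist file."""
    macros = {}
    datasets = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        m = re.match(r"^SET\s+(\w+)\s*=\s*(.*)$", line)
        if m:
            # Macro definition: SET varname = value
            macros[m.group(1)] = _expand(m.group(2).strip(), macros)
            continue
        # Dataset definition: SPNAME_LIST, PROGRAM_LIST, FLAG_UPDATE, FLAG_PUBLIC
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"dataset line needs at least two fields: {line!r}")
        datasets.append({
            "species": [s.strip() for s in fields[0].split(",") if s.strip()],
            "programs": _expand(fields[1], macros).split(),
            "update": _flag(fields, 2),
            "public": _flag(fields, 3),
        })
    return macros, datasets
```

Applied to the example above, this yields a single dataset for the species hpy and hpj with the programs segment/*, geneattr/* and align/*, in that order, and with both flags set to their default of 1.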

Start building

Run BuildAll.pl under the $CGAT_HOME/build directory to execute the building procedures. This script creates the $CGAT_HOME/work directory if it does not exist, moves to this directory, and then executes the programs that have been specified in the $CGAT_HOME/etc/speclist file as described in the previous section.


    BuildAll.pl

Release data

Before accessing the created database, you must run Release.pl under the $CGAT_HOME/build directory to release the data from the $CGAT_HOME/database.work directory to the $CGAT_HOME/database directory.


    Release.pl
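
If you rebuild often, the build and release steps can be chained so that the release happens only after a successful build. The following Python sketch shows one way to do this. The helper names build_steps and run_pipeline are hypothetical, and invoking the scripts through the perl command is an assumption; adjust it if the scripts are executed directly on your system.

```python
import os
import subprocess


def build_steps(cgat_home):
    """Return (script, working directory) pairs in the order they must run."""
    build_dir = os.path.join(cgat_home, "build")
    return [
        (os.path.join(build_dir, "BuildAll.pl"), build_dir),
        (os.path.join(build_dir, "Release.pl"), build_dir),
    ]


def run_pipeline(cgat_home):
    """Run BuildAll.pl and then Release.pl, stopping at the first failure."""
    for script, cwd in build_steps(cgat_home):
        # check=True raises CalledProcessError on a non-zero exit status,
        # so Release.pl is never run after a failed build.
        subprocess.run(["perl", script], cwd=cwd, check=True)
```

For example, run_pipeline(os.environ["CGAT_HOME"]) runs both steps for the installation that the CGAT_HOME environment variable points to.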