Clustering Parameters
In MBGD, similarity relationships identified by all-against-all BLAST searches
with a BLAST E-value <= 1e-2 are stored. For each gene pair,
an optimal local alignment score calculated by the dynamic
programming (DP) algorithm is also stored.
After the similarity data are filtered by several criteria [selection step],
a hierarchical clustering algorithm, DomClust
(Uchiyama 2006), is applied to
the filtered data to group genes [clustering step].
By default, MBGD minimally filters similarity data in the selection step
so that the clustering program can use as much information as possible.
You can control the similarity cutoffs with the following parameters,
although in general we do not recommend changing them drastically.
Note that all pairs removed in the selection step
are treated as "missing relationships" in the clustering step and
are assigned the value specified by
the "Score for missing relationships" parameter below.
Note also that you can choose at most 150 organisms
when you create your own clusters with a new parameter set.
Cutoff BLAST E-value [selection]
This value specifies a cutoff E-value of the BLAST results. The maximum possible value is 1e-2.
Note that in MBGD, the E-value is adjusted so that the size
of the search space (the database size times the query length)
is 1e9.
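The adjustment above can be sketched as follows, assuming that the E-value is proportional to the size of the search space (as in the Karlin-Altschul formula E = K*m*n*exp(-lambda*S)). The function name and its arguments are hypothetical and are not part of MBGD. BLAST itself works with effective lengths, which this sketch ignores.

```python
MBGD_SEARCH_SPACE = 1e9  # fixed search-space size stated in the text above


def adjust_evalue(evalue, db_size, query_len, target_space=MBGD_SEARCH_SPACE):
    """Rescale an E-value to a fixed search-space size.

    Assumption: E is proportional to (database size * query length),
    so rescaling is a simple multiplication by the ratio of the spaces.
    """
    actual_space = db_size * query_len
    if actual_space <= 0:
        raise ValueError("db_size and query_len must be positive")
    return evalue * target_space / actual_space


def passes_evalue_cutoff(adjusted_evalue, cutoff=1e-2):
    """Keep a hit only if its adjusted E-value does not exceed the cutoff."""
    return adjusted_evalue <= cutoff
```

For example, a hit found in a search space of 5e8 has its E-value doubled when rescaled to 1e9.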
Cutoff DP score [selection,clustering]
Cutoff score of the optimal local alignment with the JTT-PAM250 scoring matrix (Jones et al. 1992).
The same cutoff is used for both the selection and the
clustering steps when you use the score as the similarity measure.
Cutoff percent identity [selection]
Percent identity is defined as {the number of identical residue pairs} / {alignment length} * 100. Alignment
length includes internal gaps.
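As a minimal sketch of the definition above, the following function computes percent identity from a pairwise alignment given as two gapped strings of equal length. The function name and the use of "-" as the gap character are assumptions made for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the full alignment length, gaps included."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    alignment_length = len(aligned_a)
    if alignment_length == 0:
        raise ValueError("alignment is empty")
    # Count columns where both residues are identical and neither is a gap.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return identical / alignment_length * 100
```

For the alignment "ACD-F" against "ACDEF", four of the five columns are identical pairs, so the result is 80 percent.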
Cutoff PAM distance [selection,clustering]
PAM is a unit of evolutionary distance defined as the number of accepted point mutations per 100 residues
(Dayhoff et al. 1978). In practice, the PAM distance is
estimated as the PAM value of the substitution matrix that gives
the best alignment score.
The same cutoff is used for both the selection and the
clustering steps when you use PAM as the dissimilarity measure.
Alignment coverage [selection]
Alignment coverage is defined as {alignment length} / {length of the shorter sequence} * 100.
Raising this parameter removes matches that cover only short regions
*before* the clustering procedure.
MBGD does not apply this check by default.
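The coverage definition above can be written as a small filter. This is only a sketch: the function names are hypothetical, and a cutoff of 0 is used here to represent "no check".

```python
def alignment_coverage(alignment_length, len_a, len_b):
    """Coverage (%) of the alignment relative to the shorter sequence."""
    shorter = min(len_a, len_b)
    if shorter <= 0:
        raise ValueError("sequence lengths must be positive")
    return alignment_length / shorter * 100


def passes_coverage(alignment_length, len_a, len_b, cutoff=0.0):
    """Keep the match only if its coverage reaches the cutoff (0 = no check)."""
    return alignment_coverage(alignment_length, len_a, len_b) >= cutoff
```

For example, an alignment of length 80 between sequences of lengths 200 and 100 has a coverage of 80 percent, so it is removed when the cutoff is raised to 90.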
Alignment coverage for domain splitting [clustering]
In MBGD, a domain-splitting procedure is incorporated in the hierarchical clustering algorithm. When merging the two most similar
sequences (or clusters), the algorithm searches for
another sequence (S3) that matches one of the merged
sequences (S1) in the region outside the alignment between the
merged pair. The algorithm splits the sequence S1 if such a sequence
S3 is found and the alignment between S1 and S3 satisfies both the
coverage condition specified by this parameter and the score condition
specified by the next parameter. Raise this parameter to avoid
generating overly short domains from partial matches.
Score cutoff for domain splitting [clustering]
Cutoff score that the match between S1 and S3 described above must reach for the sequence to be split. This parameter has an effect similar,
and possibly complementary, to that of the previous parameter.
Similarity measure for orthology [selection,clustering]
This option specifies which similarity or dissimilarity measure (score or PAM) to use for orthology identification
and the clustering process.
Note that scores depend on the alignment length, whereas PAM distances do not.
Best hit criterion [selection]
The bidirectional best-hit criterion (i.e., gene pairs (a,b) of genomes A and B such that a is the gene in A most
similar to b and vice versa)
is a conventional approach for ortholog identification
between two genomes. The unidirectional version is also
routinely used for predicting gene functions.
MBGD does not use such a criterion in the selection step by default,
since the UPGMA algorithm itself inherently involves it,
but in some situations it can be useful for
filtering out apparent paralogs before clustering.
See the next section for details.
Cutoff ratio of the score against the best [selection]
This parameter has no effect unless you use the best-hit criterion above.
Orthology need not be a one-to-one relationship.
For the bidirectional best-hit criterion, a gene pair
(a,b) is considered orthologous when score(a,b) satisfies
score(a,b) / max( max_y( score(a,y) ), max_x( score(x,b) ) )
* 100 >= cutoff_ratio
where x and y are any genes of genomes A and B, respectively.
Using cutoff_ratio = 100 corresponds to the exact bidirectional
best-hit criterion.
Similarly, for the unidirectional best-hit criterion, a gene pair
(a,b) is considered orthologous when
score(a,b) / min( max_y( score(a,y) ), max_x( score(x,b) ) )
* 100 >= cutoff_ratio
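The two ratios above can be computed as in the following sketch. The data layout (a dictionary keyed by gene pairs) and the function names are assumptions made for illustration and do not reflect how MBGD stores its data.

```python
def best_hit_ratios(scores, a, b):
    """Return (bidirectional_ratio, unidirectional_ratio) in percent.

    scores maps (gene_in_A, gene_in_B) -> similarity score and must
    contain the pair (a, b).
    """
    best_for_a = max(s for (x, y), s in scores.items() if x == a)  # max_y score(a,y)
    best_for_b = max(s for (x, y), s in scores.items() if y == b)  # max_x score(x,b)
    pair_score = scores[(a, b)]
    bidirectional = pair_score / max(best_for_a, best_for_b) * 100
    unidirectional = pair_score / min(best_for_a, best_for_b) * 100
    return bidirectional, unidirectional


def is_ortholog_pair(scores, a, b, cutoff_ratio=100, bidirectional=True):
    """Apply the cutoff ratio to the chosen best-hit criterion."""
    bi, uni = best_hit_ratios(scores, a, b)
    return (bi if bidirectional else uni) >= cutoff_ratio


# Hypothetical scores between genes a1, a2 of genome A and b1, b2 of genome B.
example_scores = {
    ("a1", "b1"): 200, ("a1", "b2"): 180,
    ("a2", "b1"): 250, ("a2", "b2"): 90,
}
```

With these example scores, b1 is the best hit of a1, but the best hit of b1 is a2. The pair (a1,b1) therefore has a bidirectional ratio of about 80 and a unidirectional ratio of 100, so with cutoff_ratio = 100 it passes only the unidirectional criterion.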
Score for missing relationships [clustering]
Although the usual hierarchical clustering algorithm requires a complete similarity/dissimilarity matrix,
here we use only significant similarities found by the search.
This option specifies the value
assigned to relationships that were not found by the search.
The value must be smaller (larger) than the
similarity (dissimilarity) cutoff.
Specifying an extremely small (large) value results in a
classification similar to that of complete-linkage clustering,
whereas specifying a value close to the cutoff gives
results similar to those of single-linkage clustering.
The default value (used when the field is left blank) is {score_cutoff * 0.95} or
{pam_cutoff / 0.95}.
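The default above can be expressed as a short helper. The function name and the measure labels "score" and "pam" are hypothetical and chosen only for this sketch.

```python
def default_missing_value(cutoff, measure):
    """Default value assigned to missing relationships.

    For a similarity measure (score) the value lies just below the cutoff;
    for a dissimilarity measure (PAM) it lies just above the cutoff.
    """
    if measure == "score":
        return cutoff * 0.95
    if measure == "pam":
        return cutoff / 0.95
    raise ValueError("measure must be 'score' or 'pam'")
```

Both defaults are close to the cutoff, which, according to the description above, places the result nearer to single-linkage than to complete-linkage behavior.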
Clustering Mode [clustering]
This option specifies whether orthologous or homologous groups are made. Choosing homologous groups is equivalent to
omitting the tree-splitting procedure described below
by specifying phylocut > 1.
Cutoff ratio of paralogs for tree splitting [clustering]
In MBGD, orthologous groups are made by splitting trees of homologous clusters created by the hierarchical
clustering algorithm.
A node with two children A and B is split when
| Intersect(Ph(A),Ph(B)) | / min( |Ph(A)|, |Ph(B)| ) > phylocut,
where Ph(A) denotes the set of species contained
in the node A (phylogenetic pattern), |Ph| denotes the cardinality
of Ph, and Intersect(A,B) is the intersection
of sets A and B. This parameter has no effect when you
specify ClusteringMode = 'homology'.
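The splitting condition above can be sketched with Python sets. The function name is hypothetical, the species codes in the comment are only illustrative, and the phylocut values used in the note below are not MBGD defaults.

```python
def should_split(species_in_a, species_in_b, phylocut):
    """Decide whether a node with children A and B should be split.

    species_in_a and species_in_b are the species found under each child,
    e.g. ["eco", "bsu", "hin"]; duplicates are collapsed into sets,
    giving the phylogenetic patterns Ph(A) and Ph(B).
    """
    ph_a, ph_b = set(species_in_a), set(species_in_b)
    if not ph_a or not ph_b:
        raise ValueError("each child must contain at least one species")
    shared = len(ph_a & ph_b)  # |Intersect(Ph(A), Ph(B))|
    return shared / min(len(ph_a), len(ph_b)) > phylocut
```

Because the ratio can never exceed 1, any phylocut above 1 makes the condition false for every node, which is why that setting disables tree splitting. For example, children with species {eco, bsu, hin} and {eco, bsu} give a ratio of 1, so the node is split for phylocut = 0.5 but not for phylocut = 1.5.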
Phylogenetically related organisms [clustering]
When counting the number of species in the above calculation, one can incorporate taxonomic information by counting related
species only once. You can specify a taxonomic rank to determine
which set of organisms you consider to be related.
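One way to picture "counting related species only once" is to replace each species by its taxon at the chosen rank before building the phylogenetic pattern. The following is only a sketch under that assumption. The mapping, the species codes, and the function name are hypothetical, and MBGD may implement this differently.

```python
def phylogenetic_pattern(species, taxon_of=None):
    """Build the set used for counting species.

    taxon_of maps a species code to its taxon at the chosen rank.
    Species missing from the mapping are kept as they are.
    Without a mapping, every species counts separately.
    """
    if taxon_of is None:
        return set(species)
    return {taxon_of.get(sp, sp) for sp in species}


# Hypothetical mapping at the family rank.
family_of = {
    "eco": "Enterobacteriaceae",
    "sty": "Enterobacteriaceae",
    "bsu": "Bacillaceae",
}
```

With this mapping, the species eco, sty, and bsu count as two units instead of three, because eco and sty belong to the same family.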
Overlap ratio (radj1) for merging adjacent clusters [clustering]
After the tree-splitting procedure described above, two clusters of domains are joined when they are almost always
adjacent to each other. More precisely, two clusters A and B
are joined when
|adjacent(A,B)| / max(|A|,|B|) ≥ radj1
or
|adjacent(A,B)| / min(|A|,|B|) ≥ radj2,
where adjacent(A,B) is the set of domains belonging to
A and B that are adjacent to each other, and
radj1 and radj2 are parameters
satisfying
0 ≤ radj1 ≤ radj2 ≤ 1.
Coverage ratio (radj2) for absorbing adjacent small clusters [clustering]
See above. Note that this parameter has no effect if radj2 ≤ radj1.
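The two joining conditions can be sketched as follows. The function applies the inequalities exactly as written above and takes |adjacent(A,B)| as a precomputed count. How adjacent domains are counted, and how MBGD treats the case radj2 ≤ radj1, is not modeled here. The function name and the threshold values in the note below are hypothetical.

```python
def should_join(n_adjacent, size_a, size_b, radj1, radj2):
    """Decide whether two adjacent domain clusters A and B are joined.

    n_adjacent is |adjacent(A,B)|; size_a and size_b are |A| and |B|.
    """
    if size_a <= 0 or size_b <= 0:
        raise ValueError("cluster sizes must be positive")
    overlap_ok = n_adjacent / max(size_a, size_b) >= radj1   # merge condition
    coverage_ok = n_adjacent / min(size_a, size_b) >= radj2  # absorb condition
    return overlap_ok or coverage_ok
```

With radj1 = 0.9 and radj2 = 1.0, two clusters of 10 domains each are joined when 9 of them are adjacent, and a cluster of 3 domains is absorbed into a cluster of 10 when all 3 of its members are adjacent to members of the larger cluster.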
Microbial Genome Database for Comparative Analysis
Questions and comments to: uchiyama@nibb.ac.jp