MBGD: Clustering Parameters

In MBGD, similarity relationships identified by all-against-all BLAST searches with a BLAST E-value <= 1e-2 are stored. For each gene pair, an optimal local alignment score calculated by the dynamic programming (DP) algorithm is also stored. After the similarity data are filtered by some criteria [selection step], a hierarchical clustering algorithm, DomClust (Uchiyama 2006), is applied to the similarity data to group genes [clustering step]. By default, MBGD filters the similarity data only minimally in the selection step so that the clustering program can use as much information as possible. You can control the similarity cutoff with the following parameters, although we generally do not recommend changing them drastically. Note that all pairs removed in the selection step are treated as "missing relationships" in the clustering step and are assigned the value specified in "Score for missing relationships" below. Note also that you can choose at most 150 organisms when you make your own clusters using a new parameter set.

Cutoff BLAST E-value [selection]

This value specifies the cutoff E-value for the BLAST results. The maximum possible value is 1e-2. Note that in MBGD the E-value is adjusted so that the size of the search space (the database size times the query length) is 1e9.
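
Because a BLAST E-value scales linearly with the size of the search space, the adjustment can be sketched as a simple rescaling. This is only an illustration: the function names and arguments are hypothetical, it is not MBGD's actual code, and real BLAST computes the search space from effective (edge-corrected) lengths.

```python
FIXED_SEARCH_SPACE = 1e9  # search-space size stated in the text


def adjust_evalue(raw_evalue: float, db_size: int, query_len: int) -> float:
    """Rescale an E-value to the fixed search space (hypothetical helper).

    Assumes the raw E-value was computed for a search space of
    db_size * query_len, ignoring BLAST's effective-length correction.
    """
    search_space = db_size * query_len
    if search_space <= 0:
        raise ValueError("db_size and query_len must be positive")
    return raw_evalue * FIXED_SEARCH_SPACE / search_space


def passes_evalue_cutoff(adjusted_evalue: float, cutoff: float = 1e-2) -> bool:
    """Keep a hit only if its adjusted E-value does not exceed the cutoff."""
    return adjusted_evalue <= cutoff
```

With this rescaling, the same alignment receives the same adjusted E-value regardless of how large the searched database happens to be.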

Cutoff DP score [selection,clustering]

Cutoff score of the optimal local alignment with the JTT-PAM250 scoring matrix (Jones et al. 1992). The same cutoff is used for both the selection and the clustering steps when you use the score as the similarity measure.

Cutoff percent identity [selection]

Percent identity is defined as {the number of identical residue pairs} / {alignment length} * 100. Alignment length includes internal gaps.
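
The definition above can be sketched as follows, assuming the alignment is given as two gapped strings of equal length with "-" as the gap character (the input format and function name are assumptions made for illustration).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Identical residue pairs / alignment length * 100.

    The alignment length is the number of columns, so internal gaps
    count toward the denominator, as in the definition above.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return identical / len(aligned_a) * 100
```

For example, aligning "ACD-FG" with "ACDEFA" gives 4 identical pairs over 6 columns, so the gap column lowers the percent identity even though it contains no mismatch.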

Cutoff PAM distance [selection,clustering]

PAM is a unit of evolutionary distance defined as the number of accepted point mutations per 100 residues (Dayhoff et al. 1978). In practice, the PAM distance of an aligned pair is estimated as the distance of the PAM substitution matrix that gives the best alignment score. The same cutoff is used for both the selection and the clustering steps when you use PAM as the dissimilarity measure.

Alignment coverage [selection]

Alignment coverage is defined as {alignment length} / {length of the shorter sequence} * 100. Raising this parameter removes matches that cover only short regions *before* the clustering procedure. MBGD does not make this check by default.
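
A minimal sketch of this check, with hypothetical function and argument names:

```python
def alignment_coverage(alignment_length: int, len_a: int, len_b: int) -> float:
    """Alignment length / length of the shorter sequence * 100."""
    shorter = min(len_a, len_b)
    if shorter <= 0:
        raise ValueError("sequence lengths must be positive")
    return alignment_length / shorter * 100


def passes_coverage(alignment_length: int, len_a: int, len_b: int,
                    min_coverage: float) -> bool:
    """Keep a match only if it covers enough of the shorter sequence."""
    return alignment_coverage(alignment_length, len_a, len_b) >= min_coverage
```

Because the denominator is the shorter sequence, a short protein fully contained in a long one still has high coverage; only matches that are partial with respect to both sequences are removed.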

Alignment coverage for domain splitting [clustering]

In MBGD, a domain-splitting procedure is incorporated in the hierarchical clustering algorithm. When merging the two most similar sequences (or clusters), the algorithm searches for another sequence (S3) that matches one of the merged sequences (S1) in the region outside the alignment between the merged pair. The algorithm splits the sequence S1 if such a sequence S3 is found and the alignment between S1 and S3 satisfies both the coverage condition specified by this parameter and the score condition specified by the next parameter. Raise this parameter to avoid generating overly short domains from partial matches.

Score cutoff for domain splitting [clustering]

Cutoff score that the match between S1 and S3 described above must reach for the sequence to be split. This parameter has an effect similar to, but possibly complementary to, that of the previous parameter.

Similarity measure for orthology [selection,clustering]

This option specifies which similarity or dissimilarity measure (score or PAM) to use for the orthology identification or the clustering process. Note that scores depend on the alignment lengths while PAMs do not.

Best hit criterion [selection]

The bidirectional best-hit criterion (i.e. gene pairs (a,b) of genomes A and B such that a is the most similar gene of b in A and vice versa) is a conventional approach for ortholog identification between two genomes. The unidirectional version is also routinely used for predicting gene functions. MBGD does not use such a criterion in the selection step by default, since the UPGMA algorithm itself implicitly involves it, but in some situations it may be useful for filtering out some apparent paralogs before clustering. See the next section for details.

Cutoff ratio of the score against the best [selection]

This parameter has no effect unless you use the best-hit criterion above.
Orthology need not be a one-to-one relationship. For the bidirectional best-hit criterion, a gene pair (a,b) is considered orthologous when score(a,b) satisfies

score(a,b) / max(score(a,y), score(x,b)) * 100 >= cutoff_ratio

where x and y are any genes of genomes A and B, respectively. Using cutoff_ratio = 100 corresponds to the exact bidirectional best-hit criterion.
Similarly, for the unidirectional best-hit criterion, a gene pair (a,b) is considered orthologous when

[formula not preserved; only the symbol "cutoff_ratio" survives]
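
One reading of the bidirectional criterion, consistent with cutoff_ratio = 100 giving exact bidirectional best hits, is that score(a,b) must reach cutoff_ratio percent of the best score that either a or b has against the other genome. The sketch below assumes that reading, positive scores, and a hypothetical input format (a dict keyed by gene pairs); it is not MBGD's implementation.

```python
from collections import defaultdict


def bidirectional_best_hits(scores, cutoff_ratio=100.0):
    """Select pairs (a, b) under a relaxed bidirectional best-hit rule.

    scores: dict mapping (a, b) -> positive score, with a a gene of
            genome A and b a gene of genome B (hypothetical format).
    ASSUMPTION: a pair is kept when
        score(a,b) * 100 >= cutoff_ratio * max(best(a), best(b)),
    where best(a) is a's top score in B and best(b) is b's top score in A.
    """
    best_for_a = defaultdict(float)
    best_for_b = defaultdict(float)
    for (a, b), s in scores.items():
        best_for_a[a] = max(best_for_a[a], s)
        best_for_b[b] = max(best_for_b[b], s)

    kept = []
    for (a, b), s in scores.items():
        best = max(best_for_a[a], best_for_b[b])
        if s * 100.0 >= cutoff_ratio * best:
            kept.append((a, b))
    return sorted(kept)
```

Lowering cutoff_ratio below 100 lets a gene keep more than one partner in the other genome, which is how the selection can retain orthology relationships that are not one-to-one.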

Score for missing relationships [clustering]

Although the usual hierarchical clustering algorithm requires a complete similarity/dissimilarity matrix, here we use only the significant similarities found by the search. This option specifies the value to be assigned to relationships missed by the search. The value must be smaller than the similarity cutoff (or larger than the dissimilarity cutoff). Specifying an extremely small (large) value results in a classification similar to that of complete linkage clustering, whereas specifying a value close to the cutoff gives results similar to those of single linkage clustering. The default value (=blank) is {score_cutoff * 0.95} or {pam_cutoff / 0.95}.
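
The effect of this value can be illustrated with an average-linkage (UPGMA-style) similarity between two clusters in which unobserved pairs are filled with the missing value. This is a simplified sketch with hypothetical names and data structures, not the DomClust implementation.

```python
def default_missing_value(cutoff: float, measure: str = "score") -> float:
    """Default stated in the text: score_cutoff * 0.95 or pam_cutoff / 0.95."""
    if measure == "score":
        return cutoff * 0.95
    if measure == "pam":
        return cutoff / 0.95
    raise ValueError("measure must be 'score' or 'pam'")


def average_similarity(cluster_x, cluster_y, scores, missing):
    """Mean pairwise score between two clusters.

    scores: dict mapping frozenset({x, y}) -> score for observed pairs
            (hypothetical format); unobserved pairs get `missing`.
    """
    total = 0.0
    for x in cluster_x:
        for y in cluster_y:
            total += scores.get(frozenset((x, y)), missing)
    return total / (len(cluster_x) * len(cluster_y))
```

With a score cutoff of 60, a single observed pair scoring 100 plus one missing pair averages 78.5 under the default missing value of 57, so the clusters can still merge. With a missing value of, say, -1000, the same average drops far below the cutoff, which is why extreme values behave like complete linkage.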

Clustering Mode [clustering]

This option specifies whether orthologous or homologous groups are made. Choosing homologous groups is equivalent to omitting the tree-splitting procedure described below, which is the same as specifying phylocut > 1.

Cutoff ratio of paralogs for tree splitting [clustering]

In MBGD, orthologous groups are made by splitting the trees of homologous clusters created by the hierarchical clustering algorithm. A node with two children A and B is split when

[formula not preserved; only the symbol "phylocut" survives]

where Ph(A) denotes the set of species contained in node A (its phylogenetic pattern), |Ph| denotes the cardinality of Ph, and Intersect(A,B) is the intersection of sets A and B. This parameter has no effect when you specify ClusteringMode = 'homology'.
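
The splitting test can be sketched as below. The exact formula is not given in this text, so the sketch assumes that the number of shared species, |Intersect(Ph(A),Ph(B))|, is normalized by the smaller of |Ph(A)| and |Ph(B)| and compared with phylocut using a strict inequality. Both the normalization and the function name are assumptions.

```python
def should_split(species_a, species_b, phylocut: float) -> bool:
    """Decide whether a tree node with children A and B is split.

    species_a, species_b: iterables of species labels found under each child.
    ASSUMPTION: the overlap of the two phylogenetic patterns is divided
    by the size of the smaller pattern; the source text does not state
    the normalization.
    """
    ph_a, ph_b = set(species_a), set(species_b)
    if not ph_a or not ph_b:
        raise ValueError("each child must contain at least one species")
    shared = len(ph_a & ph_b)
    ratio = shared / min(len(ph_a), len(ph_b))
    return ratio > phylocut
```

Under this assumption the ratio never exceeds 1, so any phylocut > 1 disables splitting, which matches the statement that the homology mode corresponds to phylocut > 1.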

Phylogenetically related organisms [clustering]

When counting the number of species in the above calculation, one can incorporate taxonomic information by counting related species only once. You can specify a taxonomic rank to determine which set of organisms you consider to be related.
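
Counting related species once amounts to replacing each species by its taxon at the chosen rank before the set is built. A minimal sketch, assuming a hypothetical lookup table from organism code to taxon name:

```python
def phylogenetic_pattern(species, taxon_of):
    """Collapse species into taxa at the chosen rank.

    taxon_of: dict mapping a species label to its taxon at that rank
              (hypothetical lookup; species without an entry are kept as-is).
    """
    return {taxon_of.get(sp, sp) for sp in species}


# Example with family-level grouping (labels are illustrative only).
family = {
    "eco": "Enterobacteriaceae",
    "sty": "Enterobacteriaceae",
    "bsu": "Bacillaceae",
}
pattern = phylogenetic_pattern(["eco", "sty", "bsu"], family)
```

Here the two enterobacterial genomes contribute a single element, so the pattern has size 2 rather than 3, and genes duplicated only within that family weigh less in the species count.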

Overlap ratio (r_adj1) for merging adjacent clusters [clustering]

After the tree-splitting procedure described above, two clusters of domains are joined when they are almost always adjacent to each other. More precisely, two clusters A and B are joined when

[formulas not preserved; only the symbols "r_adj1" and "r_adj2" survive]

where adjacent(A,B) is the set of domains belonging to A and B that are adjacent to each other, and r_adj1 and r_adj2 are parameters satisfying 0 ≤ r_adj1 ≤ r_adj2 ≤ 1.

Coverage ratio (r_adj2) for absorbing adjacent small clusters [clustering]

See above. Note that this parameter has no effect if r_adj2 ≤ r_adj1.

Microbial Genome Database for Comparative Analysis
Questions and comments to: uchiyama@nibb.ac.jp