Hierarchical Clustering Program for Orthologous Protein Domain Classification

DomClust is an effective tool for orthologous grouping across multiple genomes, a crucial first step in large-scale comparative genomics. The method takes all-against-all similarity data as input and classifies genes using the traditional hierarchical clustering algorithm UPGMA. During clustering, it detects domain fusion or fission events and splits clusters into domains where required. A subsequent procedure splits the resulting trees so that intra-species paralogous genes fall into different groups, which yields plausible orthologous groups. As a result, genes are split only into the domains minimally required for ortholog grouping.

DomClust outputs a set of hierarchical clustering trees, which may overlap with each other. These overlapping trees, represented in the logo above, result from domain fusion/fission events and are the salient feature of the DomClust program.

Compared with several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, DomClust generally showed better agreement with the COG classification. By comparing clustering results generated from datasets of different releases, we also found that DomClust is relatively stable in comparison to the BBH-based methods.
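The clustering and tree-splitting steps described above can be illustrated with a small sketch. This is not the DomClust source code: it is a minimal Python illustration that assumes similarity scores are held in a dictionary keyed by gene pairs, omits domain fusion/fission detection entirely, and replaces DomClust's tree-splitting criterion with a simplified rule (cut a subtree whenever it contains two genes from the same species). The function names, gene names, and scores are all hypothetical.

```python
from itertools import combinations


def upgma(genes, sim, missing=0.0):
    """Average-linkage (UPGMA) clustering on pairwise similarity scores.

    genes: list of gene names (strings).
    sim:   dict mapping frozenset({a, b}) -> similarity score.
    Returns a nested tuple tree whose leaves are gene names.
    """
    # Map each active subtree to the list of genes (leaves) it contains.
    clusters = {g: [g] for g in genes}

    def average_similarity(a, b):
        # Unweighted mean over all gene pairs drawn from the two clusters.
        total = sum(sim.get(frozenset((x, y)), missing)
                    for x in clusters[a] for y in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    while len(clusters) > 1:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(*pair))
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = merged
    return next(iter(clusters))


def leaves(tree):
    """Return the gene names at the leaves of a nested tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def split_paralogs(tree, species_of):
    """Simplified splitting rule (an assumption, not DomClust's criterion):
    keep a subtree as one group only if no species occurs twice in it."""
    members = leaves(tree)
    species = [species_of[g] for g in members]
    if len(set(species)) == len(species):
        return [members]
    left, right = tree
    return split_paralogs(left, species_of) + split_paralogs(right, species_of)


# Hypothetical example: two species, each with two paralogous genes.
species_of = {"ecoA1": "eco", "ecoA2": "eco", "bsuA1": "bsu", "bsuA2": "bsu"}
scores = {
    ("ecoA1", "bsuA1"): 900, ("ecoA2", "bsuA2"): 850,
    ("ecoA1", "ecoA2"): 400, ("bsuA1", "bsuA2"): 350,
    ("ecoA1", "bsuA2"): 300, ("ecoA2", "bsuA1"): 300,
}
sim = {frozenset(pair): score for pair, score in scores.items()}

tree = upgma(list(species_of), sim)
groups = split_paralogs(tree, species_of)
print(groups)
```

In this toy input the two cross-species pairs with the highest scores are merged first, and the root of the tree is then cut because it would otherwise place paralogs from the same species in one group. The actual program additionally splits genes into domains during clustering, which this sketch does not attempt.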

DomClust has been used to classify hundreds of microbial genomes in MBGD (Microbial Genome Database for Comparative Analysis), which currently provides the most user-friendly interface for DomClust.

Download program

README The readme file for the program
domclust.tgz The program source code

Download data

README The readme file for the dataset
cog02.tgz The COG02 dataset used in the DomClust paper (including all-all similarities, 65MB).
cog03.tgz The COG03 dataset used in the DomClust paper (including all-all similarities, 190MB).
You can also download homology data from MBGD.


Uchiyama, I.
Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes.
Nucleic Acids Res. 34, 647-658 (2006).
Please send questions and comments to: Ikuo Uchiyama (