On the Effectiveness of Constraints Sets in Clustering Genes


Erliang Zeng1, Chengyong Yang1, Tao Li1, Giri Narasimhan1,#

1Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University
#To whom correspondence should be addressed: School of Computer Science, Florida International University, Miami, FL 33199, giri@cs.fiu.edu, Phone: (305) 348-3748, Fax: (305) 348-3549


Abstract

A major disadvantage of traditional clustering algorithms is that they require every data source to be complete (i.e., data must be available for all genes to be analyzed). In this paper, we modified a constrained clustering algorithm (based on K-means) to perform exploratory analysis on gene expression data using prior knowledge in the form of constraints, and we studied the effectiveness of constraints sets. We also showed how these constraints can be generated automatically from existing biological knowledgebases or from biomedical text literature. To generate constraints automatically from biomedical text literature, we considered two methods (cluster-based and similarity-based), both capable of generating positive as well as negative constraints. The results were evaluated by measuring the enrichment of Gene Ontology (GO) terms. We compared the constrained clustering algorithm with a previous algorithm called Multi-Source Clustering (MSC), which performs clustering using data from multiple, but complete, sources. We concluded that incomplete information in the form of constraints sets must be generated carefully if the constrained algorithm is to outperform the standard clustering algorithm, which works on a single data source without any constraints. For sufficiently large constraints sets, the constrained clustering algorithm outperformed the MSC algorithm. This research is the first to study the effectiveness of constraints sets and the robustness of the constrained clustering algorithm using multiple sources of biological data, and the first to incorporate biomedical text literature into a constrained clustering algorithm in the form of constraints sets.
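
As an illustration of the similarity-based idea mentioned above, the following is a minimal sketch of how positive (must-link) and negative (cannot-link) constraints might be derived from pairwise gene similarity scores. It is not the method used in the paper: the function name, the two thresholds, the gene names, and the scores are all hypothetical and chosen only for the example. Gene pairs with no score, or with a score between the two thresholds, simply yield no constraint, which is one way incomplete information can still be used.

```python
def generate_constraints(similarity, must_threshold=0.8, cannot_threshold=0.2):
    """Split scored gene pairs into must-link and cannot-link constraints.

    similarity: dict mapping (gene_a, gene_b) tuples to a score in [0, 1].
    The thresholds are illustrative assumptions, not values from the paper.
    """
    must_link, cannot_link = [], []
    # Sort the pairs so the output order is deterministic.
    for (gene_a, gene_b), score in sorted(similarity.items()):
        if score >= must_threshold:
            # Highly similar genes: positive constraint (same cluster).
            must_link.append((gene_a, gene_b))
        elif score <= cannot_threshold:
            # Highly dissimilar genes: negative constraint (different clusters).
            cannot_link.append((gene_a, gene_b))
        # Scores in between are treated as uninformative and skipped.
    return must_link, cannot_link


# Hypothetical toy scores, e.g. derived from literature profiles of each gene.
toy_similarity = {
    ("geneA", "geneB"): 0.91,
    ("geneA", "geneC"): 0.05,
    ("geneB", "geneC"): 0.50,
}

must_link, cannot_link = generate_constraints(toy_similarity)
print("must-link:", must_link)
print("cannot-link:", cannot_link)
```

A constrained K-means variant would then consult these two lists when assigning genes to clusters, while genes that appear in no pair are clustered on expression data alone.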

Manuscript (pdf) [Paper will be available soon]

Supplemental Material


