On the Selection of Appropriate Proximity Measurement for Gene Expression Data
International Journal of Biomedical Materials Research
Volume 5, Issue 5, October 2017, Pages: 59-63
Received: Jan. 28, 2017; Accepted: Feb. 17, 2017; Published: Jun. 30, 2017
Md. Bipul Hossen, Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
Arefin Mowla, Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
Md. Harun or Rashid, Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
Md. Binyamin, Department of Statistics, Mawlana Bhashani Science and Technology University, Santosh, Tangail, Bangladesh
Gene expression profiling has become a useful biological resource in recent years and plays an important role across a broad range of biology. However, the large number of genes and the complexity of biological networks greatly increase the difficulty of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. In the computational analysis of gene expression data, a central task is finding co-expressed genes, and the outcome depends on the proximity (similarity or dissimilarity) measure used by the clustering method. Many studies have applied proximity measures to gene expression data, but most of them emphasize the biological results and give no critical assessment of how suitable the proximity measures are for this kind of data. This paper therefore investigates which proximity measure is appropriate for gene expression data. As a case study, we considered six real datasets. On these datasets, we provide a comparative study of five proximity measures: Euclidean distance, Manhattan distance, Pearson correlation, Spearman correlation, and Cosine distance. We use the Adjusted Rand Index and the Silhouette Index to assess the quality and reliability of the clustering results. Our results reveal that the Cosine distance with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Adjusted Rand Index, whereas the Spearman correlation measure with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Silhouette Index.
Proximity Measures, Agglomerative Hierarchical Clustering, Adjusted Rand Index, Silhouette Index, Gene Expression Data
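The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, a hypothetical six-profile toy dataset, and the assumption that correlation and cosine similarities are converted to dissimilarities as 1 minus the similarity (the abstract does not state the conversion used). The function names (`complete_linkage`, `adjusted_rand_index`, `silhouette_index`) are illustrative only.

```python
import math
from collections import Counter
from itertools import combinations

def ranks(v):
    """Average ranks (tied values share their mean rank), needed for Spearman."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    out, i = [0.0] * len(v), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant profile."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))

# Assumption: similarities become dissimilarities via 1 - similarity.
MEASURES = {
    "euclidean": lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))),
    "manhattan": lambda x, y: sum(abs(a - b) for a, b in zip(x, y)),
    "pearson": lambda x, y: 1 - pearson(x, y),
    "spearman": lambda x, y: 1 - pearson(ranks(x), ranks(y)),
    "cosine": lambda x, y: 1 - cosine_similarity(x, y),
}

def complete_linkage(data, dist, k):
    """Naive agglomerative clustering: repeatedly merge the two clusters whose
    farthest members are closest, until k clusters remain."""
    n = len(data)
    d = [[dist(data[i], data[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: max(d[i][j] for i in clusters[p[0]] for j in clusters[p[1]]),
        )
        clusters[a] += clusters.pop(b)  # a < b, so index a stays valid
    labels = [0] * n
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def adjusted_rand_index(truth, pred):
    """Chance-corrected agreement between two partitions (external index)."""
    sum_ij = sum(math.comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / math.comb(len(truth), 2)
    return (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)

def silhouette_index(data, labels, dist):
    """Mean silhouette width (internal index); singletons are scored as 0."""
    scores = []
    for i in range(len(data)):
        by_cluster = {}
        for j in range(len(data)):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(data[i], data[j]))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # mean distance within own cluster
        b = min(sum(v) / len(v) for v in by_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: three rising and three falling expression profiles.
profiles = [
    [1.0, 2.0, 3.0, 4.0], [1.2, 2.1, 2.9, 4.2], [0.9, 1.8, 3.2, 3.9],
    [4.0, 3.0, 2.0, 1.0], [4.1, 2.8, 2.2, 0.9], [3.8, 3.1, 1.9, 1.2],
]
truth = [0, 0, 0, 1, 1, 1]

results = {}
for name, dist in MEASURES.items():
    labels = complete_linkage(profiles, dist, k=2)
    results[name] = (adjusted_rand_index(truth, labels),
                     silhouette_index(profiles, labels, dist))
    print(f"{name:10s} ARI={results[name][0]:.3f} silhouette={results[name][1]:.3f}")
```

The Adjusted Rand Index needs known class labels, while the Silhouette Index uses only the distances and the cluster assignment, which is one reason the two indices can favor different proximity measures, as the abstract reports. On real data a library implementation would normally be preferable to this naive clustering loop.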
To cite this article
Md. Bipul Hossen, Arefin Mowla, Md. Harun or Rashid, Md. Binyamin, On the Selection of Appropriate Proximity Measurement for Gene Expression Data, International Journal of Biomedical Materials Research. Vol. 5, No. 5, 2017, pp. 59-63. doi: 10.11648/j.ijbmr.20170505.11
Copyright © 2017 Authors retain the copyright of this article.
This article is an open access article distributed under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.