On the Selection of Appropriate Proximity Measurement for Gene Expression Data

Md. Bipul Hossen; Arefin Mowla; Md. Harun or Rashid; Md. Binyamin

doi:10.11648/j.ijbmr.20170505.11

Peer-Reviewed

Published in International Journal of Biomedical Materials Research (Volume 5, Issue 5)

Received: 28 January 2017 Accepted: 17 February 2017 Published: 30 June 2017

Abstract

Gene expression profiles have become a useful biological resource in recent years and play an important role in a broad range of biological research. However, the large number of genes and the complexity of biological networks greatly increase the difficulty of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. In the computational analysis of gene expression data, a central task is finding co-expressed genes, and the outcome depends on the proximity (similarity or dissimilarity) measure used by the clustering method. Several proximity measures have been applied to gene expression data, but most of these studies emphasize the biological results and offer no critical assessment of how suitable the proximity measures are for analyzing such data. This paper therefore investigates which proximity measure is appropriate for gene expression data. As a case study, we considered six real datasets, on which we compare five proximity measures: Euclidean distance, Manhattan distance, Pearson correlation, Spearman correlation, and Cosine distance. We use the Adjusted Rand Index and the Silhouette Index to assess the quality and reliability of the clustering results. Our results show that Cosine distance with complete linkage performed best on both Affymetrix and cDNA datasets according to the Adjusted Rand Index, whereas Spearman correlation with complete linkage performed best on both Affymetrix and cDNA datasets according to the Silhouette Index.
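The comparison described in the abstract can be illustrated with a small sketch. It is not the authors' code: the function names, the toy expression matrix, the naive complete-linkage routine, and the choice to turn each correlation into a dissimilarity as one minus the correlation are all assumptions made for illustration, using only the Python standard library.

```python
import math
from collections import Counter
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 - cosine similarity; undefined (division by zero) for an all-zero profile.
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm

def pearson_distance(x, y):
    # 1 - Pearson r, computed as the cosine distance of mean-centred profiles.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine_distance([a - mx for a in x], [b - my for b in y])

def ranks(v):
    # Ranks starting at 1; tied values share their average rank.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman_distance(x, y):
    # 1 - Spearman rho: Pearson correlation applied to the ranks.
    return pearson_distance(ranks(x), ranks(y))

def complete_linkage(points, dist, k):
    # Naive agglomerative clustering: repeatedly merge the two clusters whose
    # farthest pair of members is closest, until k clusters remain.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        _, a, b = min(
            (max(dist(points[i], points[j]) for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        clusters[a] += clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def adjusted_rand_index(truth, pred):
    # Pair-counting form; undefined when both partitions are a single cluster.
    pairs = lambda counts: sum(math.comb(c, 2) for c in counts)
    index = pairs(Counter(zip(truth, pred)).values())
    a, b = pairs(Counter(truth).values()), pairs(Counter(pred).values())
    expected = a * b / math.comb(len(truth), 2)
    return (index - expected) / ((a + b) / 2 - expected)

def silhouette_index(points, labels, dist):
    # Mean silhouette width; assumes at least two clusters.
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: six samples (rows) measured on four genes (columns).
samples = [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2], [0.9, 1.8, 3.2, 3.9],
           [4.0, 3.0, 2.0, 1.0], [4.2, 2.9, 2.1, 0.8], [3.9, 3.1, 1.8, 1.1]]
truth = [0, 0, 0, 1, 1, 1]  # assumed known class labels, needed for the ARI
MEASURES = {"euclidean": euclidean, "manhattan": manhattan, "pearson": pearson_distance,
            "spearman": spearman_distance, "cosine": cosine_distance}

for name, dist in MEASURES.items():
    labels = complete_linkage(samples, dist, k=2)
    print(name, adjusted_rand_index(truth, labels), silhouette_index(samples, labels, dist))
```

The Adjusted Rand Index is an external criterion that needs reference labels, while the Silhouette Index is internal and depends only on the chosen dissimilarity, which is one reason the two indices can favor different proximity measures. The cubic-time clustering loop is meant for small examples only.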

DOI	10.11648/j.ijbmr.20170505.11
Page(s)	59-63
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2017. Published by Science Publishing Group

Keywords

Proximity Measures, Agglomerative Hierarchical Clustering, Adjusted Rand Index, Silhouette Index, Gene Expression Data

Cite This Article

APA Style

Md. Bipul Hossen, Arefin Mowla, Md. Harun or Rashid, Md. Binyamin. (2017). On the Selection of Appropriate Proximity Measurement for Gene Expression Data. International Journal of Biomedical Materials Research, 5(5), 59-63. https://doi.org/10.11648/j.ijbmr.20170505.11

ACS Style

Md. Bipul Hossen; Arefin Mowla; Md. Harun or Rashid; Md. Binyamin. On the Selection of Appropriate Proximity Measurement for Gene Expression Data. Int. J. Biomed. Mater. Res. 2017, 5(5), 59-63. doi: 10.11648/j.ijbmr.20170505.11

AMA Style

Md. Bipul Hossen, Arefin Mowla, Md. Harun or Rashid, Md. Binyamin. On the Selection of Appropriate Proximity Measurement for Gene Expression Data. Int J Biomed Mater Res. 2017;5(5):59-63. doi: 10.11648/j.ijbmr.20170505.11

BibTeX

@article{10.11648/j.ijbmr.20170505.11,
  author = {Md. Bipul Hossen and Arefin Mowla and Md. Harun or Rashid and Md. Binyamin},
  title = {On the Selection of Appropriate Proximity Measurement for Gene Expression Data},
  journal = {International Journal of Biomedical Materials Research},
  volume = {5},
  number = {5},
  pages = {59-63},
  doi = {10.11648/j.ijbmr.20170505.11},
  url = {https://doi.org/10.11648/j.ijbmr.20170505.11},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijbmr.20170505.11},
  year = {2017}
}

RIS

TY - JOUR
T1 - On the Selection of Appropriate Proximity Measurement for Gene Expression Data
AU - Md. Bipul Hossen
AU - Arefin Mowla
AU - Md. Harun or Rashid
AU - Md. Binyamin
Y1 - 2017/06/30
PY - 2017
N1 - https://doi.org/10.11648/j.ijbmr.20170505.11
DO - 10.11648/j.ijbmr.20170505.11
T2 - International Journal of Biomedical Materials Research
JF - International Journal of Biomedical Materials Research
JO - International Journal of Biomedical Materials Research
SP - 59
EP - 63
PB - Science Publishing Group
SN - 2330-7579
UR - https://doi.org/10.11648/j.ijbmr.20170505.11
VL - 5
IS - 5
ER -

Author Information

Md. Bipul Hossen
Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh

Arefin Mowla
Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh

Md. Harun or Rashid
Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh

Md. Binyamin
Department of Statistics, Mawlana Bhashani Science and Technology University, Santosh, Tangail, Bangladesh
