International Journal of Genetics and Genomics

Statistical Data Mining for Symbol Associations in Genomic Databases

Received: 10 November 2014    Accepted: 28 November 2014    Published: 02 December 2014

Abstract

A methodology is proposed to automatically detect significant symbol associations in genomic databases. A new statistical test assesses the significance of a group of symbols found in several genesets of a given database. Each pair of symbols is assigned a p-value that depends on the frequencies of the two symbols and on the number of their joint occurrences. All pairs with p-values below a given threshold define a graph structure on the set of symbols. The cliques of that graph are significant symbol associations, each linked to the set of genesets in which it is found. The method can be applied to any database, and is illustrated on the MSigDB C2 database. Many of the symbol associations detected in C2 or in non-specific selections correspond to already known interactions. On more specific selections of C2, many previously unknown symbol associations have been detected. These associations reveal new candidates for gene or protein interactions, which need further investigation for biological evidence.
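
The pipeline outlined in the abstract (pairwise p-values, thresholding into a graph, clique extraction) can be illustrated with a minimal Python sketch. The abstract does not give the formula of the proposed test, so the sketch swaps in a standard hypergeometric upper-tail test as a stand-in; this is an assumption, not the authors' statistic. The function names, the toy data layout (a dict mapping geneset names to sets of symbols), and the threshold are likewise hypothetical.

```python
from itertools import combinations
from math import comb


def cooccurrence_pvalue(n_sets, n_a, n_b, n_ab):
    """Hypergeometric upper tail P(X >= n_ab).

    Stand-in test (assumption): symbol A occurs in n_a of n_sets genesets,
    symbol B in n_b, and X counts joint occurrences if B's genesets were
    drawn at random without replacement.
    """
    upper = min(n_a, n_b)
    tail = sum(comb(n_a, k) * comb(n_sets - n_a, n_b - k)
               for k in range(n_ab, upper + 1))
    return tail / comb(n_sets, n_b)


def significant_graph(genesets, threshold):
    """Link two symbols when their co-occurrence p-value is below threshold."""
    occ = {}  # symbol -> names of the genesets containing it
    for name, symbols in genesets.items():
        for s in symbols:
            occ.setdefault(s, set()).add(name)
    adj = {s: set() for s in occ}
    n = len(genesets)
    for a, b in combinations(sorted(occ), 2):
        joint = len(occ[a] & occ[b])
        if joint and cooccurrence_pvalue(n, len(occ[a]), len(occ[b]), joint) < threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj, occ


def maximal_cliques(adj):
    """Bron-Kerbosch enumeration (no pivoting) of maximal cliques of size >= 2."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) > 1:
                cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def supporting_genesets(clique, occ):
    """Genesets in which every symbol of the clique appears."""
    return set.intersection(*(occ[s] for s in clique))


if __name__ == "__main__":
    # Toy database: A, B, C always appear together; D appears elsewhere.
    toy = {f"s{i}": {"A", "B", "C"} for i in range(1, 4)}
    toy.update({f"s{i}": {"D"} for i in range(4, 11)})
    graph, occurrences = significant_graph(toy, threshold=0.01)
    for clique in maximal_cliques(graph):
        print(sorted(clique), sorted(supporting_genesets(clique, occurrences)))
```

On a database the size of MSigDB C2, the quadratic loop over symbol pairs and the unpivoted clique search would likely need to be replaced by a dedicated graph library; the sketch only aims to make the three steps concrete.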

DOI 10.11648/j.ijgg.20140206.11
Published in International Journal of Genetics and Genomics (Volume 2, Issue 6, December 2014)
Page(s) 97-104
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Genomic Databases, Protein-Protein Interaction, Frequent Itemset Searching, P-Value Graph

Author Information
  • Bernard Ycart: Université Grenoble-Alpes, Grenoble, France; Laboratoire Jean Kuntzmann, CNRS UMR5224, Grenoble, France; Laboratoire d'Excellence TOUCAN, Toulouse, France

  • Frederic Pont: Laboratoire d'Excellence TOUCAN, Toulouse, France; INSERM UMR1037-Cancer Research Center of Toulouse, Toulouse, France; Université Toulouse III Paul Sabatier, Toulouse, France; ERL 5294 CNRS, Toulouse, France

  • Jean-Jacques Fournie: Laboratoire d'Excellence TOUCAN, Toulouse, France; INSERM UMR1037-Cancer Research Center of Toulouse, Toulouse, France; Université Toulouse III Paul Sabatier, Toulouse, France; ERL 5294 CNRS, Toulouse, France

Cite This Article
  • APA Style

    Bernard Ycart, Frederic Pont, Jean-Jacques Fournie. (2014). Statistical Data Mining for Symbol Associations in Genomic Databases. International Journal of Genetics and Genomics, 2(6), 97-104. https://doi.org/10.11648/j.ijgg.20140206.11

    ACS Style

    Bernard Ycart; Frederic Pont; Jean-Jacques Fournie. Statistical Data Mining for Symbol Associations in Genomic Databases. Int. J. Genet. Genomics 2014, 2(6), 97-104. doi: 10.11648/j.ijgg.20140206.11

    AMA Style

    Bernard Ycart, Frederic Pont, Jean-Jacques Fournie. Statistical Data Mining for Symbol Associations in Genomic Databases. Int J Genet Genomics. 2014;2(6):97-104. doi: 10.11648/j.ijgg.20140206.11

  • @article{10.11648/j.ijgg.20140206.11,
      author = {Bernard Ycart and Frederic Pont and Jean-Jacques Fournie},
      title = {Statistical Data Mining for Symbol Associations in Genomic Databases},
      journal = {International Journal of Genetics and Genomics},
      volume = {2},
      number = {6},
      pages = {97-104},
      doi = {10.11648/j.ijgg.20140206.11},
      url = {https://doi.org/10.11648/j.ijgg.20140206.11},
      eprint = {https://download.sciencepg.com/pdf/10.11648.j.ijgg.20140206.11},
      year = {2014}
    }
    

  • TY  - JOUR
    T1  - Statistical Data Mining for Symbol Associations in Genomic Databases
    AU  - Bernard Ycart
    AU  - Frederic Pont
    AU  - Jean-Jacques Fournie
    Y1  - 2014/12/02
    PY  - 2014
    N1  - https://doi.org/10.11648/j.ijgg.20140206.11
    DO  - 10.11648/j.ijgg.20140206.11
    T2  - International Journal of Genetics and Genomics
    JF  - International Journal of Genetics and Genomics
    JO  - International Journal of Genetics and Genomics
    SP  - 97
    EP  - 104
    PB  - Science Publishing Group
    SN  - 2376-7359
    UR  - https://doi.org/10.11648/j.ijgg.20140206.11
    VL  - 2
    IS  - 6
    ER  - 
