Statistical Data Mining for Symbol Associations in Genomic Databases
International Journal of Genetics and Genomics
Volume 2, Issue 6, December 2014, Pages: 97-104
Received: Nov. 10, 2014; Accepted: Nov. 28, 2014; Published: Dec. 2, 2014
Authors
Bernard Ycart, Université Grenoble-Alpes, Grenoble, France; Laboratoire Jean Kuntzmann, CNRS UMR5224, Grenoble, France; Laboratoire d'Excellence TOUCAN, Toulouse, France
Frederic Pont, Laboratoire d'Excellence TOUCAN, Toulouse, France; INSERM UMR1037-Cancer Research Center of Toulouse, Toulouse, France; Université Toulouse III Paul Sabatier, Toulouse, France; ERL 5294 CNRS, Toulouse, France
Jean-Jacques Fournie, Laboratoire d'Excellence TOUCAN, Toulouse, France; INSERM UMR1037-Cancer Research Center of Toulouse, Toulouse, France; Université Toulouse III Paul Sabatier, Toulouse, France; ERL 5294 CNRS, Toulouse, France
Abstract
A methodology is proposed to automatically detect significant symbol associations in genomic databases. A new statistical test assesses the significance of a group of symbols found together in several genesets of a given database. Each pair of symbols is assigned a p-value that depends on the frequencies of the two symbols and on the number of their joint occurrences. All pairs with p-values below a chosen threshold define a graph structure on the set of symbols. The cliques of that graph are significant symbol associations, each linked to the set of genesets in which it is found. The method can be applied to any database and is illustrated on the MSigDB C2 database. Many of the symbol associations detected in C2 or in non-specific selections correspond to already known interactions. In more specific selections of C2, many previously unknown symbol associations were detected. These associations reveal new candidates for gene or protein interactions, which require further investigation for biological evidence.
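The pipeline summarized in the abstract (pairwise p-values, thresholded graph, cliques) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the statistical test, so a one-sided hypergeometric tail is used here as a stand-in assumption. The function names, the toy genesets, and the threshold value are all hypothetical.

```python
from itertools import combinations
from math import comb


def hypergeom_tail(n_sets, freq_a, freq_b, joint):
    """P(X >= joint) for X hypergeometric: freq_b genesets drawn out of
    n_sets, of which freq_a contain the first symbol. This is a stand-in
    for the paper's test, which is not specified in the abstract."""
    upper = min(freq_a, freq_b)
    favorable = sum(
        comb(freq_a, k) * comb(n_sets - freq_a, freq_b - k)
        for k in range(joint, upper + 1)
    )
    return favorable / comb(n_sets, freq_b)


def significant_pairs(genesets, threshold):
    """Return the symbol pairs whose co-occurrence p-value is below threshold."""
    n_sets = len(genesets)
    freq, joint = {}, {}
    for geneset in genesets:
        for symbol in geneset:
            freq[symbol] = freq.get(symbol, 0) + 1
        # Sorting gives each unordered pair a single canonical key.
        for pair in combinations(sorted(geneset), 2):
            joint[pair] = joint.get(pair, 0) + 1
    return {
        (a, b)
        for (a, b), count in joint.items()
        if hypergeom_tail(n_sets, freq[a], freq[b], count) < threshold
    }


def maximal_cliques(edges):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for vertex in list(candidates):
            neighbors = adjacency[vertex]
            expand(current | {vertex}, candidates & neighbors, excluded & neighbors)
            candidates = candidates - {vertex}
            excluded = excluded | {vertex}

    expand(set(), set(adjacency), set())
    return cliques


# Hypothetical toy database: A, B and C always occur together.
toy_genesets = [
    {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
    {"D", "E"}, {"E", "F"}, {"D", "F"}, {"E"}, {"F"}, {"D"},
]
toy_pairs = significant_pairs(toy_genesets, threshold=0.01)
toy_cliques = maximal_cliques(toy_pairs)
```

On this toy input, only the pairs among A, B and C fall below the threshold, so the graph contains a single clique. For a database the size of MSigDB C2, a dedicated graph library would be a more practical choice than this plain recursion.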
Keywords
Genomic Databases, Protein-Protein Interaction, Frequent Itemset Searching, P-Value Graph
To cite this article
Bernard Ycart, Frederic Pont, Jean-Jacques Fournie, Statistical Data Mining for Symbol Associations in Genomic Databases, International Journal of Genetics and Genomics. Vol. 2, No. 6, 2014, pp. 97-104. doi: 10.11648/j.ijgg.20140206.11