Computational Biology and Bioinformatics
Volume 8, Issue 1, June 2020, Pages: 15-19
Received: May 24, 2020;
Accepted: Jun. 8, 2020;
Published: Jun. 20, 2020
Mengmeng Zhang, College of Life Sciences, Capital Normal University, Beijing, China
Lu Wang, College of Life Sciences, Capital Normal University, Beijing, China
Ping Wan, College of Life Sciences, Capital Normal University, Beijing, China
The mechanism of prokaryotic gene expression remains incompletely understood. Promoters are genomic regions located upstream of genes that regulate gene expression. Although a growing number of E. coli K-12 promoter sequences have been obtained experimentally, and some regions such as the -10 region and the -35 region have been described, the features of promoter sequences are far from fully characterized. Here, we address this challenge using an approach based on a deep convolutional neural network (CNN). We collected six classes of E. coli K-12 promoter sequences from the RegulonDB database, all of which are annotated with strong evidence and belong to only one promoter class. We then applied the CNN model to recognize the six classes of promoters. The CNN model achieved an accuracy above 97% for all six classes. Next, we extracted the weight matrix of the last convolutional layer of the CNN with the Grad-CAM algorithm and converted it to an information content matrix. Finally, we visualized the information content matrix as promoter logos using the logomaker tool and identified the promoter features in each of the six classes. Our approach not only recovered previously described promoter feature regions but also discovered promoter features with better sensitivity and accuracy. We provide a novel computational approach for discovering features in biological sequences.
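The step that turns a weight matrix into an information content matrix can be illustrated with a minimal sketch. The abstract does not state the exact conversion, so the code below assumes the standard sequence-logo formula: each position's non-negative weights are normalized to base probabilities, and each base is scaled by the position's information, 2 bits minus the Shannon entropy. The function name `information_content`, the pseudocount, and the toy weights are hypothetical and are not taken from the paper.

```python
import numpy as np

BASES = "ACGT"  # assumed column order of the weight matrix


def information_content(weights, pseudocount=1e-6):
    """Convert a (positions x 4) weight matrix to bits of information.

    Assumption: negative weights are clipped to zero and each row is
    normalized to probabilities before the logo formula is applied.
    """
    weights = np.clip(np.asarray(weights, dtype=float), 0.0, None) + pseudocount
    probs = weights / weights.sum(axis=1, keepdims=True)
    # Shannon entropy of each position, in bits
    entropy = -(probs * np.log2(probs)).sum(axis=1, keepdims=True)
    # Height of each base = probability * (maximum entropy - observed entropy)
    return probs * (np.log2(len(BASES)) - entropy)


# Toy weights standing in for a class activation map (illustrative only)
toy_weights = np.array([
    [1.0, 1.0, 1.0, 1.0],  # uninformative position
    [0.0, 0.0, 0.0, 5.0],  # position dominated by T
    [3.0, 0.0, 0.0, 1.0],  # position favouring A
])
ic = information_content(toy_weights)
print(np.round(ic, 3))
```

A matrix of this shape, wrapped in a pandas DataFrame with one column per base, is the kind of input that `logomaker.Logo` draws as a sequence logo.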
Discovering Escherichia coli K-12 Promoter Features Using Convolutional Neural Network. Computational Biology and Bioinformatics. Vol. 8, No. 1, 2020, pp. 15-19.