The Method of a Gene Sequence Alignment BWT Index Based on Hadoop
International Journal of Genetics and Genomics
Volume 4, Issue 3, June 2016, Pages: 24-30
Received: Jul. 1, 2016; Published: Jul. 12, 2016
Views 4251      Downloads 170
Authors
Nan Li, College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
Jing Gao, College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
Bailong Feng, East Inner Mongolia Electric Power Company Limited, Hohhot, Inne
Article Tools
Follow on us
Abstract
Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.
Keywords
Gene Sequence, BWT Index, Suffix Array, Hadoop
To cite this article
Nan Li, Jing Gao, Bailong Feng, The Method of a Gene Sequence Alignment BWT Index Based on Hadoop, International Journal of Genetics and Genomics. Vol. 4, No. 3, 2016, pp. 24-30. doi: 10.11648/j.ijgg.20160403.13
References
[1]
ZHAO Ya-Nan Research on BWT index and algorithms in biological sequence alignment [D], University of Science and Technology of China.)2015.
[2]
MCKENNA A, HANNA M, BANKS E, et al. The Genome Analysis Toolkit: a Map Reduce framework for analyzing next-generation DNA sequencing data. [J]. Genome research, 2010, 20(9):1297-1303.
[3]
Niek B, Inna D, Lior P. AVID: a global alignment Program. Genome Research [J]. 2003, 13(1): 97–102.
[4]
Delcher A L, Phillippy A, Carlton J, et al. Fast algorithms for large-scale genome alignment and comparison [J]. Nucleic Acids es, 2002, 30(11): 2478–2483.
[5]
Kurtz S, Phillippy A L, Smoot M, et al. ersatile and open software for comparing large genomes[J]. Genome Biol, 2004, 5(2): R12.
[6]
FERRAGINA P, MANZINI G., MAKINEN V, et al. AN Alphabet-Friendly FM-index[C]//APOSTOLICO A, MELUCCI M.11th International
[7]
LANGMEAD B, TRAPNELL C, POP M, et al. Ultrafast and memory-efficient alignment of short DNA sequence to the human genome [J]. Genome Biology, 2009, 10(3): R25.
[8]
Li H, Durbin R. Fast and accurate long-read alignment with” Burrow Wheeler” transform [J]. Bioinformatics, 2010, 26(5):589-595.
[9]
LAMT W, SUNG WK, TAM SL, et al. Compressed indexing and local alignment of DNA [J]. Bioinformatics, 2008, 24(6): 791-797.
[10]
Burrows M. A block-sorting lossless data compression algorithm [J]. Technical Report Digital Src Research Report, 1994, 57(4): 425.
[11]
GAO Jing, JIAO Ya ZHANG Wen-Guang Overview of Sequence Alignment for HIGH-throughput Sequencing Data[J] LIFE SCIENCE RESEARCH 2014(05).
[12]
Li Bing, Long Bing-jie, Liu Yong. A Fast Algorithm for Burrows-Wheeler Transform Using Suffix Sorting [J]. Journal of Electronics & Information Technology, 2015(02): 504-508.
[13]
YANG Xiao-Liang, The Application Case Study of Mapreduce Parallel Computation and the Optimization of Its Runtime Framework [D] Nanjing University, 2012.
[14]
FOSTER I, YONG ZHAO, RAICU I, et al. Cloud computing and grid computing 360-degree compared [C] Proceedings of the 2008Grid Computing Environments
[15]
DEAN J, GHEMAWAT S. Map Reduce: simplified data processing on large clusters[C] //Proceedings of the 6th Symposium on Opera-ting System Design and Implementation. New York: ACM, 2004: 137-150
[16]
DONG Xicheng. Hadoop Technology Insider [M]. Beijing: China Machine Press), 2013: 29-32
[17]
Schatz M C. Cloud Burst: highly sensitive read mapping with Map Reduce. [J]. Bioinformatics, 2009, 25(11): 1363–1369.
[18]
“ftp://ftp.ncbi.nih.gov/”
[19]
White T. Hadoop: The Definitive Guide [J]. O’reilly Media Inc Gravenstein Highway North, 2010, 215(11):1-4.
[20]
MIAO Su-Chao, An Anchor-based Algorithm for Multiple Genome Alignment [D], Xidian University), 2010.
ADDRESS
Science Publishing Group
1 Rockefeller Plaza,
10th and 11th Floors,
New York, NY 10020
U.S.A.
Tel: (001)347-983-5186