The Method of a Gene Sequence Alignment BWT Index Based on Hadoop

Nan Li; Jing Gao; Bailong Feng

doi:doi:10.11648/j.ijgg.20160403.13

| Peer-Reviewed |

The Method of a Gene Sequence Alignment BWT Index Based on Hadoop

Nan Li, Jing Gao, Bailong Feng

Received: 1 July 2016 Accepted: Published: 12 July 2016

Views: Downloads:

Download PDF

Share This Article

Twitter
Linked In
Facebook

Abstract

Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.

DOI	10.11648/j.ijgg.20160403.13
Published in	International Journal of Genetics and Genomics (Volume 4, Issue 3, June 2016)
Page(s)	24-30
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Gene Sequence, BWT Index, Suffix Array, Hadoop

References

[1]	ZHAO Ya-Nan Research on BWT index and algorithms in biological sequence alignment [D], University of Science and Technology of China.)2015.
[2]	MCKENNA A, HANNA M, BANKS E, et al. The Genome Analysis Toolkit: a Map Reduce framework for analyzing next-generation DNA sequencing data. [J]. Genome research, 2010, 20(9):1297-1303.
[3]	Niek B, Inna D, Lior P. AVID: a global alignment Program. Genome Research [J]. 2003, 13(1): 97–102.
[4]	Delcher A L, Phillippy A, Carlton J, et al. Fast algorithms for large-scale genome alignment and comparison [J]. Nucleic Acids es, 2002, 30(11): 2478–2483.
[5]	Kurtz S, Phillippy A L, Smoot M, et al. ersatile and open software for comparing large genomes[J]. Genome Biol, 2004, 5(2): R12.
[6]	FERRAGINA P, MANZINI G., MAKINEN V, et al. AN Alphabet-Friendly FM-index[C]//APOSTOLICO A, MELUCCI M.11th International
[7]	LANGMEAD B, TRAPNELL C, POP M, et al. Ultrafast and memory-efficient alignment of short DNA sequence to the human genome [J]. Genome Biology, 2009, 10(3): R25.
[8]	Li H, Durbin R. Fast and accurate long-read alignment with” Burrow Wheeler” transform [J]. Bioinformatics, 2010, 26(5):589-595.
[9]	LAMT W, SUNG WK, TAM SL, et al. Compressed indexing and local alignment of DNA [J]. Bioinformatics, 2008, 24(6): 791-797.
[10]	Burrows M. A block-sorting lossless data compression algorithm [J]. Technical Report Digital Src Research Report, 1994, 57(4): 425.
[11]	GAO Jing, JIAO Ya ZHANG Wen-Guang Overview of Sequence Alignment for HIGH-throughput Sequencing Data[J] LIFE SCIENCE RESEARCH 2014(05).
[12]	Li Bing, Long Bing-jie, Liu Yong. A Fast Algorithm for Burrows-Wheeler Transform Using Suffix Sorting [J]. Journal of Electronics & Information Technology, 2015(02): 504-508.
[13]	YANG Xiao-Liang, The Application Case Study of Mapreduce Parallel Computation and the Optimization of Its Runtime Framework [D] Nanjing University, 2012.
[14]	FOSTER I, YONG ZHAO, RAICU I, et al. Cloud computing and grid computing 360-degree compared [C] Proceedings of the 2008Grid Computing Environments
[15]	DEAN J, GHEMAWAT S. Map Reduce: simplified data processing on large clusters[C] //Proceedings of the 6th Symposium on Opera-ting System Design and Implementation. New York: ACM, 2004: 137-150
[16]	DONG Xicheng. Hadoop Technology Insider [M]. Beijing: China Machine Press), 2013: 29-32
[17]	Schatz M C. Cloud Burst: highly sensitive read mapping with Map Reduce. [J]. Bioinformatics, 2009, 25(11): 1363–1369.
[18]	“ftp://ftp.ncbi.nih.gov/”
[19]	White T. Hadoop: The Definitive Guide [J]. O’reilly Media Inc Gravenstein Highway North, 2010, 215(11):1-4.
[20]	MIAO Su-Chao, An Anchor-based Algorithm for Multiple Genome Alignment [D], Xidian University), 2010.

Cite This Article

Plain Text BibTeX RIS

APA Style

Nan Li, Jing Gao, Bailong Feng. (2016). The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. International Journal of Genetics and Genomics, 4(3), 24-30. https://doi.org/10.11648/j.ijgg.20160403.13

Copy | Download

ACS Style

Nan Li; Jing Gao; Bailong Feng. The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. Int. J. Genet. Genomics 2016, 4(3), 24-30. doi: 10.11648/j.ijgg.20160403.13

Copy | Download

AMA Style

Nan Li, Jing Gao, Bailong Feng. The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. Int J Genet Genomics. 2016;4(3):24-30. doi: 10.11648/j.ijgg.20160403.13

Copy | Download

@article{10.11648/j.ijgg.20160403.13,
  author = {Nan Li and Jing Gao and Bailong Feng},
  title = {The Method of a Gene Sequence Alignment BWT Index Based on Hadoop},
  journal = {International Journal of Genetics and Genomics},
  volume = {4},
  number = {3},
  pages = {24-30},
  doi = {10.11648/j.ijgg.20160403.13},
  url = {https://doi.org/10.11648/j.ijgg.20160403.13},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijgg.20160403.13},
  abstract = {Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.},
 year = {2016}
}

Copy | Download

TY - JOUR
T1 - The Method of a Gene Sequence Alignment BWT Index Based on Hadoop
AU - Nan Li
AU - Jing Gao
AU - Bailong Feng
Y1 - 2016/07/12
PY - 2016
N1 - https://doi.org/10.11648/j.ijgg.20160403.13
DO - 10.11648/j.ijgg.20160403.13
T2 - International Journal of Genetics and Genomics
JF - International Journal of Genetics and Genomics
JO - International Journal of Genetics and Genomics
SP - 24
EP - 30
PB - Science Publishing Group
SN - 2376-7359
UR - https://doi.org/10.11648/j.ijgg.20160403.13
AB - Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.
VL - 4
IS - 3
ER -

Copy | Download

Author Information

Nan Li

College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
Jing Gao

College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China
Bailong Feng

East Inner Mongolia Electric Power Company Limited, Hohhot, Inne

Download PDF

Sections

Plain Text BibTeX RIS

APA Style

Nan Li, Jing Gao, Bailong Feng. (2016). The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. International Journal of Genetics and Genomics, 4(3), 24-30. https://doi.org/10.11648/j.ijgg.20160403.13

Copy | Download

ACS Style

Nan Li; Jing Gao; Bailong Feng. The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. Int. J. Genet. Genomics 2016, 4(3), 24-30. doi: 10.11648/j.ijgg.20160403.13

Copy | Download

AMA Style

Nan Li, Jing Gao, Bailong Feng. The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. Int J Genet Genomics. 2016;4(3):24-30. doi: 10.11648/j.ijgg.20160403.13

Copy | Download

@article{10.11648/j.ijgg.20160403.13,
  author = {Nan Li and Jing Gao and Bailong Feng},
  title = {The Method of a Gene Sequence Alignment BWT Index Based on Hadoop},
  journal = {International Journal of Genetics and Genomics},
  volume = {4},
  number = {3},
  pages = {24-30},
  doi = {10.11648/j.ijgg.20160403.13},
  url = {https://doi.org/10.11648/j.ijgg.20160403.13},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijgg.20160403.13},
  abstract = {Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.},
 year = {2016}
}

Copy | Download