International Journal of Genetics and Genomics

| Peer-Reviewed |

The Method of a Gene Sequence Alignment BWT Index Based on Hadoop

Received: 1 July 2016    Accepted:     Published: 12 July 2016
Views:       Downloads:

Share This Article

Abstract

Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.

DOI 10.11648/j.ijgg.20160403.13
Published in International Journal of Genetics and Genomics (Volume 4, Issue 3, June 2016)
Page(s) 24-30
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Gene Sequence, BWT Index, Suffix Array, Hadoop

References
[1] ZHAO Ya-Nan Research on BWT index and algorithms in biological sequence alignment [D], University of Science and Technology of China.)2015.
[2] MCKENNA A, HANNA M, BANKS E, et al. The Genome Analysis Toolkit: a Map Reduce framework for analyzing next-generation DNA sequencing data. [J]. Genome research, 2010, 20(9):1297-1303.
[3] Niek B, Inna D, Lior P. AVID: a global alignment Program. Genome Research [J]. 2003, 13(1): 97–102.
[4] Delcher A L, Phillippy A, Carlton J, et al. Fast algorithms for large-scale genome alignment and comparison [J]. Nucleic Acids es, 2002, 30(11): 2478–2483.
[5] Kurtz S, Phillippy A L, Smoot M, et al. ersatile and open software for comparing large genomes[J]. Genome Biol, 2004, 5(2): R12.
[6] FERRAGINA P, MANZINI G., MAKINEN V, et al. AN Alphabet-Friendly FM-index[C]//APOSTOLICO A, MELUCCI M.11th International
[7] LANGMEAD B, TRAPNELL C, POP M, et al. Ultrafast and memory-efficient alignment of short DNA sequence to the human genome [J]. Genome Biology, 2009, 10(3): R25.
[8] Li H, Durbin R. Fast and accurate long-read alignment with” Burrow Wheeler” transform [J]. Bioinformatics, 2010, 26(5):589-595.
[9] LAMT W, SUNG WK, TAM SL, et al. Compressed indexing and local alignment of DNA [J]. Bioinformatics, 2008, 24(6): 791-797.
[10] Burrows M. A block-sorting lossless data compression algorithm [J]. Technical Report Digital Src Research Report, 1994, 57(4): 425.
[11] GAO Jing, JIAO Ya ZHANG Wen-Guang Overview of Sequence Alignment for HIGH-throughput Sequencing Data[J] LIFE SCIENCE RESEARCH 2014(05).
[12] Li Bing, Long Bing-jie, Liu Yong. A Fast Algorithm for Burrows-Wheeler Transform Using Suffix Sorting [J]. Journal of Electronics & Information Technology, 2015(02): 504-508.
[13] YANG Xiao-Liang, The Application Case Study of Mapreduce Parallel Computation and the Optimization of Its Runtime Framework [D] Nanjing University, 2012.
[14] FOSTER I, YONG ZHAO, RAICU I, et al. Cloud computing and grid computing 360-degree compared [C] Proceedings of the 2008Grid Computing Environments
[15] DEAN J, GHEMAWAT S. Map Reduce: simplified data processing on large clusters[C] //Proceedings of the 6th Symposium on Opera-ting System Design and Implementation. New York: ACM, 2004: 137-150
[16] DONG Xicheng. Hadoop Technology Insider [M]. Beijing: China Machine Press), 2013: 29-32
[17] Schatz M C. Cloud Burst: highly sensitive read mapping with Map Reduce. [J]. Bioinformatics, 2009, 25(11): 1363–1369.
[18] “ftp://ftp.ncbi.nih.gov/”
[19] White T. Hadoop: The Definitive Guide [J]. O’reilly Media Inc Gravenstein Highway North, 2010, 215(11):1-4.
[20] MIAO Su-Chao, An Anchor-based Algorithm for Multiple Genome Alignment [D], Xidian University), 2010.
Cite This Article
  • APA Style

    Nan Li, Jing Gao, Bailong Feng. (2016). The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. International Journal of Genetics and Genomics, 4(3), 24-30. https://doi.org/10.11648/j.ijgg.20160403.13

    Copy | Download

    ACS Style

    Nan Li; Jing Gao; Bailong Feng. The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. Int. J. Genet. Genomics 2016, 4(3), 24-30. doi: 10.11648/j.ijgg.20160403.13

    Copy | Download

    AMA Style

    Nan Li, Jing Gao, Bailong Feng. The Method of a Gene Sequence Alignment BWT Index Based on Hadoop. Int J Genet Genomics. 2016;4(3):24-30. doi: 10.11648/j.ijgg.20160403.13

    Copy | Download

  • @article{10.11648/j.ijgg.20160403.13,
      author = {Nan Li and Jing Gao and Bailong Feng},
      title = {The Method of a Gene Sequence Alignment BWT Index Based on Hadoop},
      journal = {International Journal of Genetics and Genomics},
      volume = {4},
      number = {3},
      pages = {24-30},
      doi = {10.11648/j.ijgg.20160403.13},
      url = {https://doi.org/10.11648/j.ijgg.20160403.13},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijgg.20160403.13},
      abstract = {Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.},
     year = {2016}
    }
    

    Copy | Download

  • TY  - JOUR
    T1  - The Method of a Gene Sequence Alignment BWT Index Based on Hadoop
    AU  - Nan Li
    AU  - Jing Gao
    AU  - Bailong Feng
    Y1  - 2016/07/12
    PY  - 2016
    N1  - https://doi.org/10.11648/j.ijgg.20160403.13
    DO  - 10.11648/j.ijgg.20160403.13
    T2  - International Journal of Genetics and Genomics
    JF  - International Journal of Genetics and Genomics
    JO  - International Journal of Genetics and Genomics
    SP  - 24
    EP  - 30
    PB  - Science Publishing Group
    SN  - 2376-7359
    UR  - https://doi.org/10.11648/j.ijgg.20160403.13
    AB  - Gene sequence alignment, used to recognize the homology and variability in different species, is an important part of Bioinformatics. Creating indexes is a crucial step of gene sequence alignment algorithm. Usual algorithms of creating indexes are divided into two types. The first is algorithm based on hash table, while another is based on suffix tree or suffix array, among which BWT (Burrows-Wheeler Transform) index is a significant index structure. Currently, BWT index needs several hours’ serial computing in building a large genome sequence (such as human genome sequence). A parallel computing method based on Hadoop is presented is this paper to build suffix array and BWT index. Map Reduce is adopted as a type of data processing function, cutting suffix array into block, which will be handled separately. Ultimately, totally ordered suffix array and BWT index are output, reducing the time in building index. Meanwhile, verifying the effectiveness of the algorithm by experiments.
    VL  - 4
    IS  - 3
    ER  - 

    Copy | Download

Author Information
  • College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China

  • College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia, China

  • East Inner Mongolia Electric Power Company Limited, Hohhot, Inne

  • Sections