Science Discovery

Peer-Reviewed

Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion

Received: 09 August 2018    Published: 10 August 2018

Abstract

In current neural-network-based voice conversion algorithms, the fundamental frequency (F0) is usually converted with a mean-variance linear transformation, which can easily introduce "mechanical" tones and unnatural intonation into the converted speech and leaves its similarity to the target speech low. This paper proposes a nonlinear mapping of F0 using a BLSTM (Bi-directional Long Short-Term Memory) neural network, combined with the structural information of the source fundamental frequency. First, a stable BLSTM network is trained on paired F0 data. The final F0 is then obtained by fusing the converted F0′ with the structural information of the original F0, and speech synthesis is performed on the result, which improves the similarity between the converted speech and the target speech. It is also verified that the proposed method reduces the mechanical sound of the converted speech to a certain extent and improves its similarity.
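The baseline that the abstract criticizes can be illustrated with a short sketch. The abstract only names a "mean-variance linear transformation" of F0; the version below assumes the commonly used form that operates on log-F0 over voiced frames, and the function names and toy contours are hypothetical. Every voiced frame is mapped by the same affine function regardless of context, which is the behavior that a learned nonlinear mapping is meant to improve on.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0_track):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    log_f0 = [math.log(f) for f in f0_track if f > 0]
    return mean(log_f0), pstdev(log_f0)


def convert_f0_mean_variance(f0_source, src_stats, tgt_stats):
    """Baseline linear conversion in the log-F0 domain (assumed form).

    Unvoiced frames (F0 <= 0) are passed through as 0.0.
    """
    mu_src, sigma_src = src_stats
    mu_tgt, sigma_tgt = tgt_stats
    converted = []
    for f in f0_source:
        if f <= 0:
            converted.append(0.0)
        else:
            # Normalize with source statistics, rescale with target statistics.
            z = (math.log(f) - mu_src) / sigma_src
            converted.append(math.exp(z * sigma_tgt + mu_tgt))
    return converted


# Hypothetical toy contours in Hz (0.0 marks an unvoiced frame).
source_f0 = [0.0, 110.0, 120.0, 130.0, 0.0, 115.0]
target_f0 = [0.0, 210.0, 230.0, 250.0, 240.0, 0.0]

src_stats = log_f0_stats(source_f0)
tgt_stats = log_f0_stats(target_f0)
converted_f0 = convert_f0_mean_variance(source_f0, src_stats, tgt_stats)
```

After conversion, the voiced frames of `converted_f0` have the target speaker's log-F0 mean and standard deviation, while the shape of the contour is only shifted and scaled.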

DOI 10.11648/j.sd.20180604.21
Published in Science Discovery (Volume 6, Issue 4, August 2018)
Page(s) 298-305
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Voice Conversion, BLSTM Neural Network, Fundamental Frequency Conversion, Nonlinear Mapping

Author Information
  • Miao Xiaokong, Zhang Xiongwei, Sun Meng: Command & Control Engineering College, Army Engineering University, Nanjing, China

Cite This Article
  • APA Style

    Miao Xiaokong, Zhang Xiongwei, Sun Meng. (2018). Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion. Science Discovery, 6(4), 298-305. https://doi.org/10.11648/j.sd.20180604.21

  • ACS Style

    Miao Xiaokong; Zhang Xiongwei; Sun Meng. Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion. Sci. Discov. 2018, 6(4), 298-305. doi: 10.11648/j.sd.20180604.21

  • AMA Style

    Miao Xiaokong, Zhang Xiongwei, Sun Meng. Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion. Sci Discov. 2018;6(4):298-305. doi: 10.11648/j.sd.20180604.21

  • BibTeX

    @article{10.11648/j.sd.20180604.21,
      author = {Miao Xiaokong and Zhang Xiongwei and Sun Meng},
      title = {Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion},
      journal = {Science Discovery},
      volume = {6},
      number = {4},
      pages = {298-305},
      doi = {10.11648/j.sd.20180604.21},
      url = {https://doi.org/10.11648/j.sd.20180604.21},
      eprint = {https://download.sciencepg.com/pdf/10.11648.j.sd.20180604.21},
      year = {2018}
    }

  • RIS

    TY  - JOUR
    T1  - Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion
    AU  - Miao Xiaokong
    AU  - Zhang Xiongwei
    AU  - Sun Meng
    Y1  - 2018/08/10
    PY  - 2018
    N1  - https://doi.org/10.11648/j.sd.20180604.21
    DO  - 10.11648/j.sd.20180604.21
    T2  - Science Discovery
    JF  - Science Discovery
    JO  - Science Discovery
    SP  - 298
    EP  - 305
    PB  - Science Publishing Group
    SN  - 2331-0650
    UR  - https://doi.org/10.11648/j.sd.20180604.21
    VL  - 6
    IS  - 4
    ER  - 
