Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion

Miao Xiaokong; Zhang Xiongwei; Sun Meng

doi:10.11648/j.sd.20180604.21

Peer-Reviewed

Published in Science Discovery (Volume 6, Issue 4)

Received: 9 August 2018; Published: 10 August 2018

Abstract

In current neural-network-based voice conversion algorithms, the fundamental frequency (F0) is usually converted with a mean-variance linear transformation, which tends to introduce a “mechanical” tone and unnatural intonation into the converted speech and leaves its similarity to the target speech low. This paper proposes a nonlinear mapping of F0 using a BLSTM (Bi-directional Long Short-Term Memory) neural network, combined with the structural information of the source fundamental frequency. First, a stable BLSTM network is trained on paired F0 sequences. The final F0 is then obtained by fusing the converted F0′ with the structural information of the original F0, and speech is synthesized from the result, which improves the similarity between the converted speech and the target speech. It is also verified that the proposed method reduces the mechanical sound of converted speech to a certain extent and improves the similarity of the converted speech.
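The baseline criticized in the abstract, and the idea of fusing a mapped contour with the source contour, can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the abstract: the mean-variance transformation is applied to log-F0 (a common convention), unvoiced frames are encoded as 0, and the fusion is a simple log-domain weighted blend. The function names, the `alpha` weight, and the blend rule are hypothetical and are not the authors' fusion method; the BLSTM itself is not implemented here, so its output is represented by an arbitrary input list.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0_frames):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_frames if f > 0]
    return mean(logs), pstdev(logs)


def mean_variance_convert(f0_src, src_stats, tgt_stats):
    """Baseline: linear transform of log-F0 that matches the target mean and
    standard deviation. Unvoiced frames (0) are passed through as 0.
    Assumes the source standard deviation is non-zero."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    converted = []
    for f in f0_src:
        if f > 0:
            converted.append(math.exp(mu_t + (sd_t / sd_s) * (math.log(f) - mu_s)))
        else:
            converted.append(0.0)
    return converted


def fuse(f0_mapped, f0_src, alpha=0.5):
    """HYPOTHETICAL fusion rule (an assumption, not taken from the paper):
    blend the mapped contour with the source contour in the log domain and
    keep the voiced/unvoiced pattern of the source. A frame is set to 0 when
    the source frame is unvoiced or the mapped value is not positive."""
    fused = []
    for fm, fs in zip(f0_mapped, f0_src):
        if fs > 0 and fm > 0:
            fused.append(math.exp(alpha * math.log(fm) + (1 - alpha) * math.log(fs)))
        else:
            fused.append(0.0)
    return fused
```

In this sketch, `mean_variance_convert` would be called with statistics gathered from source and target training utterances, and `fuse` would receive the output of a trained mapping network as `f0_mapped`. With `alpha=1.0` the blend returns the mapped contour on voiced frames, and with `alpha=0.0` it returns the source contour.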

Published in	Science Discovery (Volume 6, Issue 4)
DOI	10.11648/j.sd.20180604.21
Page(s)	298-305
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2018. Published by Science Publishing Group

Keywords

Voice Conversion, BLSTM Neural Network, Fundamental Frequency Conversion, Nonlinear Mapping

Cite This Article

APA Style

Miao Xiaokong, Zhang Xiongwei, Sun Meng. (2018). Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion. Science Discovery, 6(4), 298-305. https://doi.org/10.11648/j.sd.20180604.21

ACS Style

Miao Xiaokong; Zhang Xiongwei; Sun Meng. Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion. Sci. Discov. 2018, 6(4), 298-305. doi: 10.11648/j.sd.20180604.21

AMA Style

Miao Xiaokong, Zhang Xiongwei, Sun Meng. Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion. Sci Discov. 2018;6(4):298-305. doi: 10.11648/j.sd.20180604.21

@article{10.11648/j.sd.20180604.21,
  author = {Miao Xiaokong and Zhang Xiongwei and Sun Meng},
  title = {Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion},
  journal = {Science Discovery},
  volume = {6},
  number = {4},
  pages = {298-305},
  doi = {10.11648/j.sd.20180604.21},
  url = {https://doi.org/10.11648/j.sd.20180604.21},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.sd.20180604.21},
  abstract = {In current neural-network-based voice conversion algorithms, the fundamental frequency (F0) is usually converted with a mean-variance linear transformation, which tends to introduce a “mechanical” tone and unnatural intonation into the converted speech and leaves its similarity to the target speech low. This paper proposes a nonlinear mapping of F0 using a BLSTM (Bi-directional Long Short-Term Memory) neural network, combined with the structural information of the source fundamental frequency. First, a stable BLSTM network is trained on paired F0 sequences. The final F0 is then obtained by fusing the converted F0′ with the structural information of the original F0, and speech is synthesized from the result, which improves the similarity between the converted speech and the target speech. It is also verified that the proposed method reduces the mechanical sound of converted speech to a certain extent and improves the similarity of the converted speech.},
  year = {2018}
}

TY  - JOUR
T1  - Fundamental Frequency (F0) Fusion Transformation-Based on BLSTM for Voice Conversion
AU  - Miao Xiaokong
AU  - Zhang Xiongwei
AU  - Sun Meng
Y1  - 2018/08/10
PY  - 2018
N1  - https://doi.org/10.11648/j.sd.20180604.21
DO  - 10.11648/j.sd.20180604.21
T2  - Science Discovery
JF  - Science Discovery
JO  - Science Discovery
SP  - 298
EP  - 305
PB  - Science Publishing Group
SN  - 2331-0650
UR  - https://doi.org/10.11648/j.sd.20180604.21
AB  - In current neural-network-based voice conversion algorithms, the fundamental frequency (F0) is usually converted with a mean-variance linear transformation, which tends to introduce a “mechanical” tone and unnatural intonation into the converted speech and leaves its similarity to the target speech low. This paper proposes a nonlinear mapping of F0 using a BLSTM (Bi-directional Long Short-Term Memory) neural network, combined with the structural information of the source fundamental frequency. First, a stable BLSTM network is trained on paired F0 sequences. The final F0 is then obtained by fusing the converted F0′ with the structural information of the original F0, and speech is synthesized from the result, which improves the similarity between the converted speech and the target speech. It is also verified that the proposed method reduces the mechanical sound of converted speech to a certain extent and improves the similarity of the converted speech.
VL  - 6
IS  - 4
ER  -

Author Information

Miao Xiaokong
Command & Control Engineering College, Army Engineering University, Nanjing, China

Zhang Xiongwei
Command & Control Engineering College, Army Engineering University, Nanjing, China

Sun Meng
Command & Control Engineering College, Army Engineering University, Nanjing, China
