Language Independent Speech-to-Singing-Voice Conversion
DOI: https://doi.org/10.51094/jxiv.1902
Keywords: Singing voice, voiced/unvoiced classification, HuBERT
Abstract
This research addresses the challenge of converting spoken voice into singing voice in a language-independent manner. Traditional speech-to-singing systems often rely on language-specific phoneme alignment or require parallel singing datasets, which limits their applicability across languages and speakers. To overcome these constraints, the authors propose a novel framework that uses voiced/unvoiced (V/UV) classification and music state modeling to align speech with musical scores without relying on linguistic content. The approach first extracts a V/UV state sequence from the input speech using a convolutional layer built on top of a pretrained HuBERT model. In parallel, a music state sequence is generated from a monophonic musical score using a decay function that models note intensity over time. The two sequences are then aligned with Dynamic Time Warping (DTW), allowing the system to synchronize speech features with musical timing and pitch. After alignment, the World vocoder is employed to analyze and synthesize the singing voice: the spectral and aperiodic components of the speech are mapped onto the music sequence, while the pitch is replaced with the musical pitch to produce the final singing output. Experimental results on the ATR speech database demonstrate the effectiveness of the proposed V/UV classification. The system can generate singing voices from spoken input without requiring phoneme-level annotations or parallel singing data, although there remains room for improvement in synthesis quality.
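To make the pipeline concrete, the abstract's V/UV front end can be sketched as a small convolutional head on frame-level HuBERT features. The paper does not publish its architecture, so the layer sizes, the use of torchaudio's pretrained HUBERT_BASE bundle, and the two-class output below are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a V/UV classifier on HuBERT features. The conv head
# (sizes, depth) is a hypothetical stand-in for the paper's classifier.
import torch
import torch.nn as nn
import torchaudio

class VUVClassifier(nn.Module):
    """Conv head mapping HuBERT frames to voiced/unvoiced logits."""

    def __init__(self, feat_dim: int = 768, hidden: int = 128):
        super().__init__()
        bundle = torchaudio.pipelines.HUBERT_BASE  # pretrained, kept frozen
        self.hubert = bundle.get_model()
        for p in self.hubert.parameters():
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=1),  # 2 classes: unvoiced, voiced
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) at 16 kHz -> features: (batch, frames, 768)
        feats, _ = self.hubert.extract_features(wav)
        return self.head(feats[-1].transpose(1, 2))  # (batch, 2, frames)

wav, sr = torchaudio.load("speech.wav")              # placeholder file name
wav = torchaudio.functional.resample(wav, sr, 16000)
with torch.no_grad():
    vuv = VUVClassifier()(wav).argmax(dim=1)         # per-frame 0/1 sequence
```

Only the conv head would need training; the frozen HuBERT encoder supplies language-agnostic frame features, which is what makes the front end language-independent.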
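The music state sequence is described as a decay function modeling note intensity after each onset. The exact function is not given in the abstract, so the exponential decay, the 5 ms frame period, and the score format below are assumptions chosen for illustration.

```python
# Sketch of a music state sequence from a monophonic score: each note's
# intensity decays exponentially from its onset. Decay rate, frame
# period, and the (onset, duration, midi) tuple format are assumptions.
import numpy as np

FRAME_PERIOD = 0.005  # seconds per frame

def music_state_sequence(notes, total_sec, decay=3.0):
    """notes: iterable of (onset_sec, duration_sec, midi_pitch)."""
    t = np.arange(int(total_sec / FRAME_PERIOD)) * FRAME_PERIOD
    intensity = np.zeros_like(t)
    pitch = np.zeros_like(t)                              # 0 Hz marks a rest
    for onset, dur, midi in notes:
        active = (t >= onset) & (t < onset + dur)
        intensity[active] = np.exp(-decay * (t[active] - onset))
        pitch[active] = 440.0 * 2.0 ** ((midi - 69) / 12.0)  # MIDI -> Hz
    return intensity, pitch

# Three quarter notes (C4, D4, E4) at 120 bpm.
notes = [(0.0, 0.5, 60), (0.5, 0.5, 62), (1.0, 0.5, 64)]
intensity, f0_music = music_state_sequence(notes, total_sec=1.5)
```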
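The V/UV sequence and the music state sequence are then aligned with DTW. The abstract does not state the local cost function, so the sketch below assumes plain absolute difference with the standard three-step recursion.

```python
# Textbook DTW between two 1-D state sequences; returns the warping path
# as (speech_frame, music_frame) index pairs. Absolute difference is an
# assumed local cost, not necessarily the paper's.
import numpy as np

def dtw_path(x, y):
    n, m = len(x), len(y)
    cost = np.abs(np.subtract.outer(x, y))          # local cost matrix
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    path, i, j = [], n, m                           # backtrack from the end
    while i > 1 or j > 1:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return path[::-1]

speech_vuv = np.array([0., 0., 1., 1., 1., 0., 1.])  # toy V/UV sequence
music_int = np.array([0., 1., 1., 0., 1.])           # toy intensity sequence
path = dtw_path(speech_vuv, music_int)
```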
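After alignment, resynthesis follows a standard World analysis/synthesis pipeline (Harvest for F0, CheapTrick for the spectral envelope, D4C for aperiodicity, here via the pyworld bindings): envelope and aperiodicity frames are mapped onto the music timeline and the speech F0 is replaced by the score pitch. The frame-mapping rule (last matched speech frame per music frame) is an assumption, and `path` and `f0_music` are taken from the previous sketches.

```python
# Sketch of World-based resynthesis: analyze the speech, warp spectral
# and aperiodic frames onto the music timeline via the DTW path, and
# replace the speech F0 with the musical pitch before synthesis.
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("speech.wav")                  # mono float64 expected
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.harvest(x, fs, frame_period=5.0)    # speech F0 (discarded below)
sp = pw.cheaptrick(x, f0, t, fs)               # spectral envelope
ap = pw.d4c(x, f0, t, fs)                      # aperiodicity

def resynthesize(sp, ap, path, f0_music, fs):
    """path: (speech_frame, music_frame) pairs; f0_music: Hz per frame."""
    idx = np.zeros(len(f0_music), dtype=int)
    for i, j in path:                          # keep the last speech frame
        idx[j] = i                             # per music frame (assumed rule)
    f0_music = np.ascontiguousarray(f0_music, dtype=np.float64)
    return pw.synthesize(f0_music,
                         np.ascontiguousarray(sp[idx]),
                         np.ascontiguousarray(ap[idx]),
                         fs, frame_period=5.0)

# sf.write("singing.wav", resynthesize(sp, ap, path, f0_music, fs), fs)
```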
Conflict of Interest Disclosure
I declare I have no conflict of interest.
References
Campbell, L. and Belew, A., [Cataloguing the world’s endangered languages], vol. 711, Routledge, London and New York (2018).
Hinton, L., “Language revitalization: An overview,” in [The green book of language revitalization in practice] 1, 18 (2001).
Guerrettaz, A. M. and Engman, M., “Indigenous language revitalization,” in [Oxford encyclopedia of race and education], Oxford University Press (2023).
Hinton, L., “Language revitalization and language pedagogy: New teaching and learning strategies,” in [Applied linguists needed], 41–52, Routledge (2014).
Nee, J., “Creating books for use in language revitalization classrooms: considerations and outcomes,” L2 Journal: An Open Access Refereed Journal for World Language Educators 12(1) (2020).
Hara, K. and Heinrich, P., “Linguistic and cultural revitalization,” in [Handbook of the Ryukyuan Languages], (2015).
Vallejo, J. M., “Revitalising language through music: a case study of music and culturally grounded pedagogy in two Kanien’kéha (Mohawk) language immersion programmes,” in [Ethnomusicology Forum], 28(1), 89–117, Taylor & Francis (2019).
Ansah, M. A., Agyeman, N. A., and Adjei, G., “Revitalizing minority languages using music: Three South-Guan languages of Ghana in focus,” Research Journal in Advanced Humanities 3(1), 19–34 (2022).
Dembling, J., “Instrumental music and Gaelic revitalization in Scotland and Nova Scotia,” International Journal of the Sociology of Language 2010(206), 245–254 (2010).
Huang, K., “‘We are indigenous people, not primitive people’: the role of popular music in indigenous language revitalization in Taiwan,” Current Issues in Language Planning 24(4), 440–459 (2023).
Sleeper, M., “Singing synthesizers: Musical language revitalization through UTAUloid,” Canadian Journal of Applied Linguistics 27(2), 52–84 (2024).
Saitou, T., Goto, M., Unoki, M., and Akagi, M., “Speech-to-singing synthesis: Converting speaking voices to singing voices by controlling acoustic features unique to singing voices,” in [2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics], 215–218, IEEE (2007).
Saitou, T., Goto, M., Unoki, M., and Akagi, M., “Speech-to-singing synthesis system: Vocal conversion from speaking voices to singing voices by controlling acoustic features unique to singing voices,” in [National Conference on Man-Machine Speech Communication (NCMMSC2009)], (2009).
Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A., “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication 27(3-4), 187–207 (1999).
Vijayan, K., Dong, M., and Li, H., “A dual alignment scheme for improved speech-to-singing voice conversion,” in [2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)], 1547–1555, IEEE (2017).
Parekh, J., Rao, P., and Yang, Y.-H., “Speech-to-singing conversion in an encoder-decoder framework,” in [ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)], 261–265, IEEE (2020).
Li, R., Huang, R., Wang, Y., Hong, Z., and Zhao, Z., “Self-supervised singing voice pre-training towards speech-to-singing conversion,” arXiv preprint arXiv:2406.02429 (2024).
Morise, M., Yokomori, F., and Ozawa, K., “World: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems 99(7), 1877–1884 (2016).
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3451–3460 (2021).
Koshikawa, T., Ito, A., and Nose, T., “Fast and speaker-independent utterance selection for ASR-free CALL systems of minority languages,” in [APSIPA Annual Summit and Conference], (2025).
Sredojev, B., Samardzija, D., and Posarac, D., “WebRTC technology overview and signaling solution design and implementation,” in [2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)], 1006–1009, IEEE (2015).
Mauch, M. and Dixon, S., “pYIN: A fundamental frequency estimator using probabilistic threshold distributions,” in [2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)], 659–663, IEEE (2014).
Status: Published
Submitted: 2025-11-05 01:34:12 UTC
Published: 2025-11-10 01:07:20 UTC
License
Copyright (c) 2025 Ito, Akinori
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
