Preprint / Version 1

A Design Framework for Character-Oriented Speech Synthesis Based on Body Structure Estimation

Authors

  • 鷹宮 佳奈 (Independent Researcher)

DOI:

https://doi.org/10.51094/jxiv.4033

Keywords:

Speech synthesis, character-oriented speech synthesis, virtual speaker design

Abstract

The deep-learning-based speaker-imitation methods that have become mainstream in recent years enable high-quality speech generation, but they absorb the correspondence between voice quality and the physical factors that determine it into their latent representations, without treating it explicitly. This tendency surfaces as a design problem particularly for character voices, where the appearance and physical profile of a character are specified before the voice is.

This paper takes findings from psychology and cognitive science on the integrated perception of faces and voices as its theoretical motivation and, from the perspectives of anatomy and phonetics, positions the physical constraints inferred from a character's appearance as initial conditions for speech generation. Furthermore, by mapping the body size, vocal tract structure, vocal fold characteristics, respiratory capacity, and other attributes estimated from visual information onto physical and acoustic parameters, it presents a design foundation for character-oriented voice design that does not depend on speaker imitation or statistical similarity.
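The kind of mapping the abstract describes can be illustrated with the textbook uniform-tube model of the vocal tract, in which the neutral ("schwa-like") formant frequencies follow the quarter-wavelength resonances F_n = (2n − 1)·c / 4L. The sketch below is not taken from the paper; the function name and the example vocal tract lengths are illustrative assumptions, showing only how a body-scale estimate could set acoustic initial conditions.

```python
# Map an estimated vocal tract length (hypothetical visual-estimation output)
# to neutral formant frequencies via the uniform-tube quarter-wave model:
# a tube closed at the glottis and open at the lips resonates at odd
# quarter-wavelength multiples, F_n = (2n - 1) * c / (4 * L).

SPEED_OF_SOUND_CM_S = 35000.0  # speed of sound in warm, humid air (~350 m/s)

def neutral_formants(vocal_tract_length_cm: float, n_formants: int = 3) -> list[float]:
    """Formant frequencies (Hz) of a uniform tube of length L (cm)."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * vocal_tract_length_cm)
        for n in range(1, n_formants + 1)
    ]

# A typical adult male vocal tract (~17.5 cm) gives the textbook neutral
# formants of roughly 500, 1500, 2500 Hz; a shorter tract inferred from a
# small-bodied character design shifts all formants upward proportionally.
print(neutral_formants(17.5))  # [500.0, 1500.0, 2500.0]
print(neutral_formants(14.0))  # [625.0, 1875.0, 3125.0]
```

In such a framework this physically grounded estimate would serve only as the starting point; articulation-dependent deviations from the uniform tube would then be layered on top.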

Conflict of Interest Disclosure

The author has no conflicts of interest to disclose regarding this manuscript.


References

M. Kamachi, H. Hill, K. Lander, and E. Vatikiotis-Bateson, “‘Putting the face to the voice’: Matching identity across modality,” Current Biology, vol. 13, no. 19, pp. 1709–1714, 2003.

H. M. Smith, A. K. Dunn, T. Baguley, and P. C. Stacey, “Matching novel face and voice identity using static and dynamic facial images,” Attention, Perception, & Psychophysics, vol. 78, no. 3, pp. 868–879, 2016.

伊東 裕司, 高山 宏, 日比谷 潤, and 渡辺 茂, “顔と声の関連性の判断: 人物の同一性について” [Judging the correspondence of faces and voices: On personal identity], 哲學, vol. 98, pp. 123–139, 1995 (in Japanese).

A. Nagrani, S. Albanie, and A. Zisserman, “Seeing voices and hearing faces: Cross-modal biometric matching,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8427–8436.

C. Kim, H. V. Shin, T.-H. Oh, A. Kaspar, M. Elgharib, and W. Matusik, “On learning associations of faces and voices,” in Asian Conference on Computer Vision. Springer, 2018, pp. 276–292.

W. J. Mitchell, K. A. Szerszen Sr, A. S. Lu, P. W. Schermerhorn, M. Scheutz, and K. F. MacDorman, “A mismatch in the human realism of face and voice produces an uncanny valley,” i-Perception, vol. 2, no. 1, pp. 10–12, 2011.

Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.

Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in International Conference on Machine Learning. PMLR, 2018, pp. 5180–5189.

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.

Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. Lopez Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” Advances in neural information processing systems, vol. 31, 2018.

S. Um, J. Kim, J. Lee, and H.-G. Kang, “Facetron: A multi-speaker face-to-speech model based on cross-modal latent representations,” in 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 2023, pp. 281–285.

S. Goto, K. Onishi, Y. Saito, K. Tachibana, and K. Mori, “Face2Speech: Towards multi-speaker text-to-speech synthesis using an embedding vector predicted from a face image,” in Interspeech, 2020, pp. 1321–1325.

T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 8717–8727, 2022.

W. T. Fitch, “Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques,” The Journal of the Acoustical Society of America, vol. 102, no. 2, pp. 1213–1222, 1997.

W. T. Fitch, “The evolution of speech: a comparative review,” Trends in Cognitive Sciences, vol. 4, no. 7, pp. 258–267, 2000.

J. Dang and K. Honda, “Acoustic characteristics of the human paranasal sinuses derived from transmission characteristic measurement and morphological observation,” The Journal of the Acoustical Society of America, vol. 100, no. 5, pp. 3374–3383, 1996.

D. Dediu, E. M. Jennings, D. van’t Ent, S. R. Moisik, G. Di Pisa, J. Schulze, E. J. C. de Geus, A. den Braber, C. V. Dolan, and D. I. Boomsma, “The heritability of vocal tract structures estimated from structural MRI in a large cohort of Dutch twins,” Human Genetics, vol. 141, no. 12, pp. 1905–1923, 2022.

Z. Yang, Z. Wu, Y. Shan, and J. Jia, “What does your face sound like? 3D face shape towards voice,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13905–13913.

L. P. Pawelec, K. Slowik, and A. Lipowicz, “Dimensions of the face, head, and neck affect acoustic parameters in polish males and females,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7659–7663.

W. T. Fitch and J. Giedd, “Morphology and development of the human vocal tract: A study using magnetic resonance imaging,” The Journal of the Acoustical Society of America, vol. 106, no. 3, pp. 1511–1522, 1999.

K. Pisanski, P. J. Fraccaro, C. C. Tigue, J. J. O’Connor, S. Röder, P. W. Andrews, B. Fink, L. M. DeBruine, B. C. Jones, and D. R. Feinberg, “Vocal indicators of body size in men and women: a meta-analysis,” Animal Behaviour, vol. 95, pp. 89–99, 2014.

K. Pisanski, B. C. Jones, B. Fink, J. J. O’Connor, L. M. DeBruine, S. Röder, and D. R. Feinberg, “Voice parameters predict sex-specific body morphology in men and women,” Animal Behaviour, vol. 112, pp. 13–22, 2016.

S. Rohatgi, V. Gupta, B. Yadav, and B. Yadav, “Forensic phonetics: A linguistic approach,” Journal of Punjab Academy of Forensic Medicine and Toxicology, vol. 18, no. 2, pp. 36–41, 2018.

K. Ishizaka and J. L. Flanagan, “Synthesis of voiced sounds from a two-mass model of the vocal cords,” Bell System Technical Journal, vol. 51, no. 6, pp. 1233–1268, 1972.

P. Birkholz, “A survey of self-oscillating lumped-element models of the vocal folds,” in Konferenz Elektronische Sprachsignalverarbeitung. TUDpress, Dresden, 2011, pp. 47–58.

H. R. Weerathunge, G. A. Alzamendi, G. J. Cler, F. H. Guenther, C. E. Stepp, and M. Zañartu, “LaDIVA: A neurocomputational model providing laryngeal motor control for speech acquisition and production,” PLoS Computational Biology, vol. 18, no. 6, p. e1010159, 2022.

U. Bernardet, S.-H. Kang, A. Feng, S. DiPaola, and A. Shapiro, “Speech breathing in virtual humans: An interactive model and empirical study,” in 2019 IEEE Virtual Humans and Crowds for Immersive Environments (VHCIE). IEEE, 2019, pp. 1–9.

É. Székely, G. E. Henter, J. Beskow, and J. Gustafson, “Breathing and speech planning in spontaneous speech synthesis,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7649–7653.

菊池 遥斗, 能勢 隆, and 伊藤 彰則, “キャラクター画像からの音声合成に向けた動向分析” [A trend analysis toward speech synthesis from character images], 情報処理学会研究報告, vol. 2024-MUS-140, no. 11, pp. 1–10, 2024 (in Japanese).

I. R. Titze, “The physics of small-amplitude oscillation of the vocal folds,” The Journal of the Acoustical Society of America, vol. 83, no. 4, pp. 1536–1552, 1988.

I. R. Titze, E. S. Luschei, and M. Hirano, “Role of the thyroarytenoid muscle in regulation of fundamental frequency,” Journal of Voice, vol. 3, no. 3, pp. 213–224, 1989.

J. Sundberg, The Science of the Singing Voice. DeKalb, IL: Northern Illinois University Press, 1987.

B. H. Story, “A parametric model of the vocal tract area function for vowel and consonant simulation,” The Journal of the Acoustical Society of America, vol. 117, no. 5, pp. 3231–3254, 2005.

R. Angulu, J. R. Tapamo, and A. O. Adewumi, “Age estimation via face images: a survey,” EURASIP Journal on Image and Video Processing, vol. 2018, no. 1, p. 42, 2018.

L. Wen and G. Guo, “A computational approach to body mass index prediction from face images,” Image and Vision Computing, vol. 31, no. 5, pp. 392–400, 2013.

R. Baudouin, A. Amelot, S. Nicolleau, I. Huynh-Charlier, L. Crevier-Buchman, S. Hans, and P. Charlier, “Voice of mummified King Henri IV recreated via 3D functional vocal tract model,” Journal of Voice, 2026.



Submitted: 2026-04-17 10:08:52 UTC

Published: 2026-05-01 04:07:55 UTC

Research field: Interdisciplinary Sciences