Full-Scratch Development and Release of a Text-to-Video Generation System with Native Japanese Language Support
DOI: https://doi.org/10.51094/jxiv.1248
Keywords: Generative AI, Video Generation, Artificial Intelligence
Abstract
This technical report presents the full-scratch development and public release of a text-to-video generation system that natively supports Japanese-language input. Japan's content industry rivals the semiconductor sector in export value, making technological support in this domain a pressing need. Drawing on research insights from the United States and China, and building on existing video generation frameworks, we propose an approach tailored to the Japanese language. Our model outperforms existing systems in Fréchet Video Distance (FVD) and text-video alignment accuracy on Japanese prompts. The report also highlights the need to scale computational resources to further improve video quality.
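The abstract reports gains in Fréchet Video Distance (FVD) without restating the metric. For reference, FVD is the Fréchet (2-Wasserstein) distance between Gaussians fitted to feature embeddings of real and generated videos, typically extracted with an I3D network. Below is a minimal sketch of that computation, assuming pre-extracted (N, D) feature arrays; the function name and array shapes are illustrative and not taken from the authors' implementation.

import numpy as np
from scipy.linalg import sqrtm

def frechet_video_distance(feats_real, feats_gen):
    # Fit a Gaussian (mean, covariance) to each feature set.
    # feats_real, feats_gen: (N, D) arrays of video embeddings,
    # e.g. pooled I3D features (pre-extraction assumed, not shown).
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; numerical error
    # can introduce a small imaginary component, which is discarded.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    # ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 sqrt(Sigma_r Sigma_g))
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

Lower values indicate that the feature statistics of the generated videos are closer to those of real videos.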
Conflicts of Interest Disclosure
This study was conducted as part of the GENIAC program, supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO). The authors declare no competing financial interests related to this research.
Posted
Submitted: 2025-05-08 12:04:18 UTC
Published: 2025-05-13 23:55:00 UTC
License
Copyright (c) 2025 Yasunori Ozaki, Masabumi Ishihara, Toshiki Tomihira

This work is licensed under a Creative Commons Attribution 4.0 International License.