Preprint / Version 1

Full-Scratch Development and Release of a Text-to-Video Generation System with Native Japanese Language Support


DOI:

https://doi.org/10.51094/jxiv.1248

Keywords:

Generative AI, Video Generation, Artificial Intelligence

Abstract

This technical report presents the full-scratch development and public release of a text-to-video generation system that natively supports Japanese language input. Japan’s content industry rivals the semiconductor sector in export value, making technological support in this domain a pressing need. Leveraging insights from both the United States and China while utilizing existing video generation frameworks, we propose a novel approach tailored for the Japanese language. Our model outperforms existing systems in terms of Fréchet Video Distance (FVD) and alignment accuracy when processing Japanese text. The study also highlights the need for scaling computational resources to further improve video quality.
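The abstract reports results in terms of Fréchet Video Distance (FVD), which compares Gaussian fits of feature embeddings of real and generated videos. As a minimal sketch (not the authors' implementation), the distance can be computed from two sets of video-level embeddings — in the standard FVD setup these come from a pretrained I3D network, which is assumed here and not shown:

```python
import numpy as np

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two feature sets.

    feats_*: (N, D) arrays of video embeddings (e.g. I3D features in
    the usual FVD setup). Returns
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((S_r S_g)^{1/2}) via the eigenvalues of the product; they are
    # real and non-negative in exact arithmetic, so clip numerical noise.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Identical distributions yield a distance near zero, and the score grows as the generated-feature statistics drift from the real ones; lower FVD indicates higher video quality.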

Conflicts of Interest Disclosure

This study was conducted as part of the GENIAC program, supported by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO). The author declares no competing financial interests related to this research.


References

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R. and Ramesh, A.: Video generation models as world simulators, 2024

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z. and Rombach, R.: Scaling rectified flow transformers for high-resolution image synthesis, Proceedings of the 41st International Conference on Machine Learning, 2024

Farré, M., Marafioti, A., Tunstall, L., Von Werra, L. and Wolf, T.: FineVideo, 2024

Liu, X., Gong, C. and Liu, Q.: Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, The Eleventh International Conference on Learning Representations, 2023

Peebles, W. and Xie, S.: Scalable Diffusion Models with Transformers, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J. and Lin, J.: Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution, arXiv, 2024

Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., Yin, D., Zhang, Y., Wang, W., Cheng, Y., Xu, B., Gu, X., Dong, Y. and Tang, J.: CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer, The Thirteenth International Conference on Learning Representations, 2025

Posted

Submitted: 2025-05-08 12:04:18 UTC

Published: 2025-05-13 23:55:00 UTC
Section

Information Sciences