Preprint / Version 1

CommonArt β: Diffusion Transformer for Text-to-Image Generation by Japanese Large Language Model

Authors

  • Yasunori Ozaki, Headquarters, AIdeaLab, Inc.
  • Ryuji Mishima, Headquarters, AI Picasso, Inc.
  • Toshiki Tomihira, Headquarters, AIdeaLab, Inc.

DOI:

https://doi.org/10.51094/jxiv.936

Keywords:

Text-to-Image Generation, Large Language Model, Diffusion Model, GenAI

Abstract

In this paper, we propose CommonArt $\beta$, a transparent text-to-image generation model that respects copyright. Its training dataset consists of approximately 25 million images released under licenses that permit modification, such as CC-0 and CC-BY. The model is a diffusion transformer conditioned on text embeddings from a domestically developed Japanese LLM. After 30,000 L4 GPU hours of training, quantitative evaluation on both Japanese and English metrics for image quality and instruction following showed that our method outperforms conventional approaches. Future work may include applying the approach to video generation models.
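To make the architectural idea concrete, the sketch below shows one way a diffusion-transformer block can be conditioned on LLM text embeddings: the LLM's hidden states are projected into the image-token width and attended to via cross-attention, while the diffusion timestep is injected additively. This is a minimal illustrative sketch only; all names, dimensions, and design choices (`DiTBlock`, `llm_dim`, the conditioning scheme) are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of a diffusion-transformer block conditioned on
# LLM text embeddings. Illustrative assumptions, not the CommonArt beta code.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=768, llm_dim=2048, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Project the LLM's hidden states into the image-token width,
        # then let image tokens attend to them (text conditioning).
        self.text_proj = nn.Linear(llm_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, text_hidden, t_emb):
        # x:           (B, N, dim)     noisy latent-image tokens
        # text_hidden: (B, L, llm_dim) hidden states from a (frozen) Japanese LLM
        # t_emb:       (B, dim)        diffusion-timestep embedding
        x = x + t_emb.unsqueeze(1)  # inject the timestep into every token
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        ctx = self.text_proj(text_hidden)
        x = x + self.cross_attn(self.norm2(x), ctx, ctx)[0]
        x = x + self.mlp(self.norm3(x))
        return x

# Smoke test with random tensors standing in for latents and LLM outputs.
x = torch.randn(2, 256, 768)      # 16x16 grid of latent patches
text = torch.randn(2, 77, 2048)   # prompt embeddings from the LLM
t = torch.randn(2, 768)
print(DiTBlock()(x, text, t).shape)  # torch.Size([2, 256, 768])
```

In a full model, a stack of such blocks would form the denoiser, trained on latents from a pretrained autoencoder; the LLM stays frozen and serves purely as a text encoder.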

Conflicts of Interest Disclosure

No potential conflicts of interest were disclosed.



Posted

Submitted: 2024-10-17 05:04:17 UTC

Published: 2024-10-21 10:52:23 UTC

Section

Information Sciences