Preprint / Version 1

CommonArt β: A Highly Transparent Diffusion Transformer for Image Generation Using a Domestically Developed Large Language Model

Authors

  • 尾崎 安範 (Headquarters, AIdeaLab Inc.)
  • 三嶋 隆史 (Headquarters, AI Picasso Inc.)
  • 冨平 準喜 (Headquarters, AIdeaLab Inc.)

DOI:

https://doi.org/10.51094/jxiv.936

Keywords:

Image generation, large language models, diffusion models, generative AI

Abstract

In this study, we propose CommonArt β, a copyright-conscious, highly transparent image generation model. The training data consist of approximately 25 million images under licenses that permit modification, such as CC-0 and CC-BY, together with 50 million synthetic captions; the architecture is a diffusion transformer conditioned on a domestically developed (Japanese) large language model. After training for 30,000 L4 GPU hours, quantitative evaluation across Japanese and English in terms of image quality (FID) and prompt adherence (CLIP Score) shows that the model achieves the highest overall performance among existing methods. Extending the approach to video generation models is a direction for future work.
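As a concrete illustration of the prompt-adherence metric mentioned above, the sketch below computes CLIP Score (Hessel et al., 2021) for a single generated image and its prompt. It assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the paper's own evaluation may use a different (e.g. Japanese-capable) CLIP encoder, so this is a minimal example rather than the authors' exact pipeline.

# Minimal sketch of CLIP Score (Hessel et al., 2021): 2.5 * max(cos(image_emb, text_emb), 0).
# Assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32
# checkpoint; the paper's evaluation may use a different (e.g. Japanese-capable) CLIP encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    # Encode the image and the prompt, then take the cosine similarity of the embeddings.
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cosine = (image_emb * text_emb).sum(dim=-1).item()
    return 2.5 * max(cosine, 0.0)

# Usage (hypothetical file name and prompt):
# print(clip_score(Image.open("generated.png"), "a watercolor painting of Mount Fuji"))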

Conflict of Interest Disclosure

The authors have no conflicts of interest to disclose.


References

Yuya Yoshikawa, et al. STAIR Captions: Constructing a large-scale Japanese image caption dataset. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 417–421, Vancouver, Canada, July 2017. Association for Computational Linguistics.

Marah Abdin, et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024.

James Betker, et al. Improving image generation with better captions. preprint, 2023.

Ollin Boer Bohan. Megalith-10m. https://huggingface.co/datasets/madebyollin/megalith-10m, June 2024. Accessed: 2024-10-07.

Junsong Chen, et al. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In The 18th European Conference on Computer Vision, 2024.

Junsong Chen, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2024.

Mehdi Cherti, et al. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.

Jacob Devlin, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019.

Aaron Gokaslan, et al. CommonCanvas: Open diffusion models trained on Creative-Commons images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8250–8260, June 2024.

Jack Hessel, et al. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.

Martin Heusel, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

Imagen-Team-Google, et al. Imagen 3, 2024.

Junnan Li, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.

Peiyuan Liao, et al. The artbench dataset: Benchmarking generative models with artworks, 2022.

Haotian Liu, et al. Visual instruction tuning, 2023.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

Cheng Lu, et al. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.

Yang Luo, et al. CAME: Confidence-guided adaptive memory efficient optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4442–4453, 2023.

Colin Raffel, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Robin Rombach, et al. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.

Kei Sawada, et al. Release of pre-trained models for the Japanese language. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 13898–13905, May 2024. https://arxiv.org/abs/2404.01657.

Christoph Schuhmann, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, volume 35, pages 25278–25294. Curran Associates, Inc., 2022.

Makoto Shing and Kei Sawada. rinna/japanese-stable-diffusion.

R. Shokri, et al. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18, Los Alamitos, CA, USA, May 2017. IEEE Computer Society.

Shuhei Yokoo, et al. CLIP Japanese Base.

Gowthami Somepalli, et al. Diffusion art or digital forgery? investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6048–6058, 2023.

Gowthami Somepalli, et al. Understanding and mitigating copying in diffusion models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

Bart Thomee, et al. YFCC100M: The new data in multimedia research. Commun. ACM, 59(2):64–73, January 2016.

Hugo Touvron, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Peng Wang, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

B. Xiao, et al. Florence-2: Advancing a unified representation for a variety of vision tasks. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4818–4829, Los Alamitos, CA, USA, June 2024. IEEE Computer Society.

AI Strategy Council (AI 戦略会議). Tentative summary of issues regarding AI (AI に関する暫定的な論点整理). May 2024.


Published


Submitted: 2024-10-17 05:04:17 UTC

Published: 2024-10-21 10:52:23 UTC

Research field
Information science