Preprint / Version 1

llm-japanese-dataset v0: Construction of a Japanese Chat Dataset for Large Language Models


DOI:

https://doi.org/10.51094/jxiv.383

Keywords:

Large language models, dataset, Japanese, chat

Abstract

In this study, we constructed a Japanese chat dataset for large language models.
The dataset contains approximately 8.4 million records and covers a variety of tasks in chat format, including translation tasks and knowledge tasks.
To confirm the effectiveness of the constructed dataset, we tuned an existing large language model, qualitatively verified the resulting performance improvement, and identified challenges in building large language models and language resources for Japanese.
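As one concrete illustration of how a chat-format dataset like this could be used to tune an existing model, the sketch below applies LoRA fine-tuning with the Hugging Face peft library. The base model name, the dataset identifier on the Hugging Face Hub, and the record fields (instruction / input / output) are assumptions made for illustration; they are not specified in the abstract above.

# Minimal sketch: LoRA tuning of an existing causal LM on a chat-style dataset.
# The model name, dataset identifier, and field names below are assumptions.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "rinna/japanese-gpt-neox-3.6b"    # assumed base model for illustration
dataset_id = "izumi-lab/llm-japanese-dataset"  # assumed Hub identifier for illustration

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach low-rank adapters so only a small fraction of the parameters is trained.
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Each record is assumed to follow an instruction/input/output chat format.
dataset = load_dataset(dataset_id, split="train")

def to_prompt(example):
    # Concatenate the fields into a single training text.
    prompt = f"指示: {example['instruction']}\n"
    if example.get("input"):
        prompt += f"入力: {example['input']}\n"
    prompt += f"応答: {example['output']}"
    return {"text": prompt}

dataset = dataset.map(to_prompt)

The formatted texts would then be tokenized and passed to a standard causal-language-modeling training loop; this is a sketch of the general recipe, not the exact procedure used in the paper.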

Conflict of Interest Disclosure

The authors have no conflicts of interest to disclose with respect to this paper.




Status: Published


Submitted: 2023-05-21 14:14:47 UTC

Published: 2023-05-24 00:41:45 UTC

Research Field: General and Integrated Engineering