Preprint / Version 1

llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models

DOI: https://doi.org/10.51094/jxiv.383

Keywords: Large Language Model, Dataset, Japanese, Chat

Abstract

In this study, we constructed a Japanese chat dataset for large language models.
The dataset contains approximately 8.4 million records and covers a variety of tasks in chat format, such as translation and knowledge tasks.
To confirm the benefit of the constructed dataset, we tuned an existing large language model on it and evaluated its performance qualitatively. These results revealed challenges in building large language models for Japanese and the language resources they require.
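
For illustration, the sketch below shows one way such tuning could be set up: low-rank adaptation (LoRA) of an existing causal language model with Hugging Face transformers, peft, and datasets. This is a minimal sketch under stated assumptions; the base model name, the Hub identifier izumi-lab/llm-japanese-dataset, the column names, the prompt template, and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch only: LoRA tuning of an existing causal LM on the
# constructed chat dataset. Model name, dataset id, column names, prompt
# format, and hyperparameters are assumptions, not the paper's exact setup.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "rinna/japanese-gpt-neox-3.6b"    # assumed base model
DATASET_ID = "izumi-lab/llm-japanese-dataset"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Attach low-rank adapters so only a small number of parameters are trained.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"),
)

# Assume Alpaca-style records (instruction / input / output); join them into
# a single chat-style prompt and tokenize.
def tokenize_record(example):
    prompt = (
        f"指示: {example['instruction']}\n"
        f"入力: {example['input']}\n"
        f"応答: {example['output']}"
    )
    return tokenizer(prompt, truncation=True, max_length=512)

dataset = load_dataset(DATASET_ID, split="train")
tokenized = dataset.map(tokenize_record, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora-japanese-chat",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=1e-4,
        fp16=True,             # assumes a GPU; drop on CPU-only machines
        logging_steps=100,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora-japanese-chat")  # saves only the adapter weights
```

The qualitative check described in the abstract would then amount to generating responses with the adapted model and inspecting them by hand.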

Conflicts of Interest Disclosure

The authors declare no conflict of interest.

Posted

Submitted: 2023-05-21 14:14:47 UTC

Published: 2023-05-24 00:41:45 UTC

Section: Engineering in General