pfgen-bench: A Benchmark for Evaluating the Text-Generation Performance of Japanese Pre-trained Models
DOI: https://doi.org/10.51094/jxiv.1008
Keywords: large language models, Japanese, benchmark, evaluation

Abstract
In this study, we propose pfgen-bench, a benchmark for evaluating the text-generation performance of Japanese pre-trained models.
Conventional evaluations of Japanese generation by large language models (LLMs) emphasize answer correctness, and even benchmarks that judge the quality of generation, such as LLM-as-a-judge, tend to score answers written in English highly, so fluency in Japanese is not adequately evaluated.
We therefore propose a benchmark built on three evaluation axes: Fluency, Truthfulness, and Helpfulness.
First, drawing on the Japanese national curriculum guidelines for elementary, junior-high, and high schools, we created a collection of 50 common-knowledge questions across 13 subjects that are specific to the Japanese-speaking world.
Next, we constructed a high-quality set of reference answers using multiple LLMs together with rule-based filtering.
On top of this, we designed three evaluation axes that measure how close a model's answers are to the reference answer set, yielding a benchmark that can score generated text (illustrative sketches of the filtering and scoring steps follow this abstract).
Evaluations with this benchmark clearly reveal performance differences between pre-trained models and agree with conventional LLM-based evaluations.
The benchmark is publicly released and available for general use.
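The abstract does not specify which rule-based filters were used to screen candidate reference answers. Purely as an illustration, the sketch below applies hypothetical rules (length bounds and a minimum ratio of Japanese-script characters); the thresholds and helper names are assumptions, not the paper's actual pipeline.

```python
import re
import unicodedata

# Hiragana, katakana, and common kanji ranges.
JA_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters written in Japanese script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if JA_CHAR.match(c)) / len(chars)

def passes_filters(candidate: str,
                   min_len: int = 20,
                   max_len: int = 400,
                   min_ja_ratio: float = 0.7) -> bool:
    """Keep a candidate only if it looks like a plausible Japanese reference answer.
    All thresholds here are illustrative assumptions."""
    text = unicodedata.normalize("NFKC", candidate).strip()
    if not (min_len <= len(text) <= max_len):
        return False  # too short to be helpful, or rambling
    if japanese_ratio(text) < min_ja_ratio:
        return False  # likely an English or heavily code-mixed answer
    return True

candidates = ["Mt. Fuji is the highest mountain in Japan.",
              "富士山は日本で最も高い山で、標高はおよそ3776メートルです。"]
references = [c for c in candidates if passes_filters(c)]  # keeps only the Japanese answer
```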
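Likewise, the abstract names three axes (Fluency, Truthfulness, Helpfulness) that measure closeness to the reference answers but does not give their formulas here. The following is a minimal sketch, assuming a character n-gram F1 averaged over the reference set; the metric and all identifiers are hypothetical stand-ins for the paper's actual scoring.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams; character level sidesteps Japanese word segmentation."""
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))

def overlap_f1(answer: str, reference: str, n: int = 3) -> float:
    """F1 of character n-gram overlap between an answer and one reference."""
    a, r = char_ngrams(answer, n), char_ngrams(reference, n)
    common = sum((a & r).values())  # multiset intersection of n-gram counts
    if common == 0:
        return 0.0
    precision = common / sum(a.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def score_against_references(answer: str, references: list[str]) -> float:
    """Score a model answer by its average closeness to the reference answer set."""
    return sum(overlap_f1(answer, ref) for ref in references) / len(references)

# Hypothetical usage: one question, two reference answers, one model answer.
refs = ["富士山は日本で最も高い山で、標高は3776メートルです。",
        "日本最高峰は富士山であり、高さは約3776mです。"]
print(score_against_references("富士山は日本一高い山で、高さは3776mです。", refs))
```

Character-level n-grams are one plausible design choice for a fluency-sensitive Japanese metric, since disfluent or English-dominated output shares few character sequences with fluent Japanese references.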
Disclosure of Conflicts of Interest
The authors are affiliated with Preferred Networks/Elements, Inc., the developer of pfnet/plamo-100b; nevertheless, model selection and other decisions in this study were made so as to ensure a fair evaluation grounded in objective evidence. To ensure transparency, the benchmark measurement code is publicly available.
Status: Published
Submitted: 2024-12-24 09:54:00 UTC
Published: 2024-12-25 23:52:22 UTC
License
Copyright (c) 2024
Imajo, Kentaro
Hirano, Masanori
Suzuki, Shuji
Mikami, Hiroaki
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.