pfgen-bench: Benchmark for Evaluating Text Generation Performance of Japanese Pre-trained Models
DOI: https://doi.org/10.51094/jxiv.1008

Keywords: large language model, Japanese, benchmark, evaluation

Abstract
In this study, we propose pfgen-bench, a benchmark for evaluating the text generation performance of Japanese pre-trained models. Existing evaluations of Japanese text generation by large language models (LLMs) have focused primarily on aspects such as response accuracy. Even benchmarks designed to assess the quality of generated content, such as LLM-as-a-judge approaches, tend to rate English responses highly and therefore fail to adequately evaluate fluency in Japanese.
To address this issue, our proposed benchmark evaluates performance based on three axes: Fluency, Truthfulness, and Helpfulness. First, we created a set of 50 questions across 13 subjects inspired by the Japanese national curriculum guidelines for elementary, middle, and high schools, incorporating cultural and linguistic nuances specific to the Japanese context. Next, we utilized multiple LLMs and rule-based filtering methods to construct a high-quality reference answer set.
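As a rough illustration only, the sketch below shows one way such rule-based filtering over candidate answers drafted by multiple LLMs could look. The function names, thresholds, and heuristics (length bounds, ratio of Japanese characters) are hypothetical and are not the filters actually used in pfgen-bench.

```python
# Hypothetical sketch of rule-based filtering for building a reference answer set.
# Names and thresholds are illustrative assumptions, not pfgen-bench's implementation.
import re

JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # hiragana, katakana, kanji


def passes_rule_filters(answer: str,
                        min_chars: int = 50,
                        max_chars: int = 400,
                        min_ja_ratio: float = 0.5) -> bool:
    """Keep candidate reference answers that look like fluent Japanese prose."""
    if not (min_chars <= len(answer) <= max_chars):
        return False
    ja_ratio = len(JA_CHARS.findall(answer)) / max(len(answer), 1)
    return ja_ratio >= min_ja_ratio


def build_reference_set(candidates_per_question: dict[str, list[str]]) -> dict[str, list[str]]:
    """Filter answers drafted by multiple LLMs into a per-question reference set."""
    return {
        question: [a for a in answers if passes_rule_filters(a)]
        for question, answers in candidates_per_question.items()
    }
```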
We then designed evaluation metrics that measure the proximity between model-generated answers and the reference answer set along the three axes, enabling comprehensive evaluation of generated outputs. Results obtained with this benchmark show clear performance differences among pre-trained models and are consistent with conventional LLM-based evaluations.
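As a rough illustration of proximity-based scoring, the sketch below compares a model answer against a reference answer set using character n-gram overlap. The similarity measure and its aggregation are hypothetical stand-ins for exposition, not the benchmark's published formulas for Fluency, Truthfulness, and Helpfulness.

```python
# Hypothetical sketch of scoring a generated answer by proximity to reference answers.
# The character n-gram F1 and max-aggregation are illustrative assumptions only.
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_f1(candidate: str, reference: str, n: int = 3) -> float:
    """Harmonic mean of n-gram precision and recall between two strings."""
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def proximity_score(candidate: str, references: list[str]) -> float:
    """Score a model answer by its closest match in the reference set."""
    return max((ngram_f1(candidate, r) for r in references), default=0.0)
```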
The benchmark we developed is publicly available and can be freely used.
Conflicts of Interest Disclosure
The authors are affiliated with Preferred Networks/Elements, the developers of pfnet/plamo-100b. Evaluations were conducted fairly on the basis of objective evidence, and the benchmark evaluation code has been publicly released for transparency.
Posted
Submitted: 2024-12-24 09:54:00 UTC
Published: 2024-12-25 23:52:22 UTC
License
Copyright (c) 2024
Kentaro Imajo
Masanori Hirano
Shuji Suzuki
Hiroaki Mikami
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.