pfgen-bench: Benchmark for Evaluating Text Generation Performance of Japanese Pre-trained Models
DOI: https://doi.org/10.51094/jxiv.1008

Keywords: large language model, Japanese, benchmark, evaluation

Abstract
In this study, we propose pfgen-bench, a benchmark for evaluating the text generation performance of Japanese pre-trained models. Existing evaluations of Japanese text generation by large language models (LLMs) have focused primarily on aspects such as response accuracy. Even benchmarks designed to assess the quality of generated content, such as LLM-as-a-judge approaches, tend to rate English responses highly and therefore fail to adequately evaluate fluency in Japanese.
To address this issue, our proposed benchmark evaluates performance based on three axes: Fluency, Truthfulness, and Helpfulness. First, we created a set of 50 questions across 13 subjects inspired by the Japanese national curriculum guidelines for elementary, middle, and high schools, incorporating cultural and linguistic nuances specific to the Japanese context. Next, we utilized multiple LLMs and rule-based filtering methods to construct a high-quality reference answer set.
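As a rough illustration only, the sketch below shows one way such rule-based filtering over candidate answers drafted by multiple LLMs could look. The function names, thresholds, and heuristics (length bounds, ratio of Japanese characters) are hypothetical and are not the filters actually used in pfgen-bench.

```python
# Hypothetical sketch of rule-based filtering for building a reference answer set.
# Names and thresholds are illustrative assumptions, not pfgen-bench's implementation.
import re

JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # hiragana, katakana, kanji


def passes_rule_filters(answer: str,
                        min_chars: int = 50,
                        max_chars: int = 400,
                        min_ja_ratio: float = 0.5) -> bool:
    """Keep candidate reference answers that look like fluent Japanese prose."""
    if not (min_chars <= len(answer) <= max_chars):
        return False
    ja_ratio = len(JA_CHARS.findall(answer)) / max(len(answer), 1)
    return ja_ratio >= min_ja_ratio


def build_reference_set(candidates_per_question: dict[str, list[str]]) -> dict[str, list[str]]:
    """Filter answers drafted by multiple LLMs into a per-question reference set."""
    return {
        question: [a for a in answers if passes_rule_filters(a)]
        for question, answers in candidates_per_question.items()
    }
```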
We then designed evaluation metrics that measure the proximity between model-generated answers and the reference answer set along the three axes, enabling comprehensive evaluation of generated outputs. Results obtained with this benchmark show clear performance differences among pre-trained models and are consistent with conventional LLM-based evaluations.
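As a rough illustration of proximity-based scoring, the sketch below compares a model answer against a reference answer set using character n-gram overlap. The similarity measure and its aggregation are hypothetical stand-ins for exposition, not the benchmark's published formulas for Fluency, Truthfulness, and Helpfulness.

```python
# Hypothetical sketch of scoring a generated answer by proximity to reference answers.
# The character n-gram F1 and max-aggregation are illustrative assumptions only.
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_f1(candidate: str, reference: str, n: int = 3) -> float:
    """Harmonic mean of n-gram precision and recall between two strings."""
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def proximity_score(candidate: str, references: list[str]) -> float:
    """Score a model answer by its closest match in the reference set."""
    return max((ngram_f1(candidate, r) for r in references), default=0.0)
```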
The benchmark we developed is publicly available and can be freely used.
Conflicts of Interest Disclosure
The authors are affiliated with Preferred Networks/Elements, the developers of pfnet/plamo-100b. Evaluations were conducted fairly on the basis of objective evidence, and the benchmark evaluation code has been publicly released for transparency.
Posted
Submitted: 2024-12-24 09:54:00 UTC
Published: 2024-12-25 23:52:22 UTC
License
Copyright (c) 2024
Kentaro Imajo
Masanori Hirano
Shuji Suzuki
Hiroaki Mikami
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.