Construction of a Multi-turn Japanese Language Generative Benchmark for Finance
DOI: https://doi.org/10.51094/jxiv.1000

Keywords: large language model, finance, Japanese, benchmark

Abstract
With the development of large language models (LLMs), it has become necessary to evaluate their performance across a variety of fields.
In this study, we propose pfmt-bench-fin-ja, a Japanese generative benchmark for measuring the generative performance of LLMs in the financial domain.
pfmt-bench-fin-ja is a multi-turn Japanese generation benchmark specialized for finance, comprising 12 categories and 360 questions.
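For illustration, a multi-turn question in an MT-bench-style benchmark such as this one might be stored as a record like the following; this is a minimal sketch, and the field names and example category are assumptions for illustration, not the repository's actual schema:

```python
# Hypothetical two-turn benchmark record (field names and the category label
# are illustrative, not the repository's actual schema; questions are shown
# in English for readability, while the actual benchmark is in Japanese).
question = {
    "question_id": 1,
    "category": "knowledge",  # one of the 12 finance-oriented categories
    "turns": [
        "Explain the difference between operating income and net income.",
        "How would a one-time asset sale affect each of them?",  # follow-up turn
    ],
}
```

Each question carries two conversational turns, so a model is scored both on its initial answer and on how well it handles the follow-up in context.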
For the evaluation, we use GPT-4o-mini as an LLM-as-a-judge, scoring each response on a 10-point scale.
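As a minimal sketch of the LLM-as-a-judge pattern, assuming the OpenAI Python client and an illustrative judge prompt (the actual prompt wording and score-parsing convention used by the benchmark may differ), a single-answer scoring call could look like this:

```python
import re
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Illustrative judge prompt; not the benchmark's actual prompt.
JUDGE_PROMPT = """You are an impartial judge. Rate the quality of the
assistant's answer to the user's question on a scale of 1 to 10,
considering correctness, helpfulness, and depth.
Output your rating in the form [[rating]].

[Question]
{question}

[Answer]
{answer}
"""

def judge(question: str, answer: str) -> float:
    """Ask the judge model for a 1-10 score and parse it from the reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic judging for reproducibility
    )
    text = resp.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else float("nan")
```

Setting the temperature to 0 keeps the judge's scores as reproducible as possible across repeated runs.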
In our experiments, we measured benchmark scores for multiple LLMs and compared the results.
The results show that pfmt-bench-fin-ja can evaluate the performance of LLMs to a certain degree.
The benchmark is available on GitHub.
Conflicts of Interest Disclosure
The authors are affiliated with Preferred Networks, Inc. and Preferred Elements, Inc., the developers of pfnet/plamo-100b. In the experiments in this study, however, this model is evaluated fairly alongside other models, and the benchmark is released publicly for transparency.
Posted
Submitted: 2024-12-12 08:54:41 UTC
Published: 2024-12-17 00:57:37 UTC
License
Copyright (c) 2024
Masanori Hirano
Kentaro Imajo
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.