Construction of a Multi-turn Japanese Language Generative Benchmark for Finance
DOI: https://doi.org/10.51094/jxiv.1000

Keywords: large language model, finance, Japanese, benchmark

Abstract
With the development of large language models (LLMs), it has become necessary to evaluate their performance across a variety of fields.
In this study, we propose pfmt-bench-fin-ja, a Japanese generative benchmark for measuring the generative performance of LLMs in the financial domain.
pfmt-bench-fin-ja is a multi-turn Japanese generation benchmark specialized for finance, comprising 12 categories and 360 questions.
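For illustration, a multi-turn question in an MT-bench-style benchmark such as this one might be stored as a record like the following; this is a minimal sketch, and the field names and example category are assumptions for illustration, not the repository's actual schema:

```python
# Hypothetical two-turn benchmark record (field names and the category label
# are illustrative, not the repository's actual schema; questions are shown
# in English for readability, while the actual benchmark is in Japanese).
question = {
    "question_id": 1,
    "category": "knowledge",  # one of the 12 finance-oriented categories
    "turns": [
        "Explain the difference between operating income and net income.",
        "How would a one-time asset sale affect each of them?",  # follow-up turn
    ],
}
```

Each question carries two conversational turns, so a model is scored both on its initial answer and on how well it handles the follow-up in context.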
For the evaluation, we use GPT-4o-mini as an LLM-as-a-judge, scoring each response on a 10-point scale.
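As a minimal sketch of the LLM-as-a-judge pattern, assuming the OpenAI Python client and an illustrative judge prompt (the actual prompt wording and score-parsing convention used by the benchmark may differ), a single-answer scoring call could look like this:

```python
import re
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Illustrative judge prompt; not the benchmark's actual prompt.
JUDGE_PROMPT = """You are an impartial judge. Rate the quality of the
assistant's answer to the user's question on a scale of 1 to 10,
considering correctness, helpfulness, and depth.
Output your rating in the form [[rating]].

[Question]
{question}

[Answer]
{answer}
"""

def judge(question: str, answer: str) -> float:
    """Ask the judge model for a 1-10 score and parse it from the reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic judging for reproducibility
    )
    text = resp.choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else float("nan")
```

Setting the temperature to 0 keeps the judge's scores as reproducible as possible across repeated runs.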
In our experiments, we measured benchmark scores for multiple LLMs and compared the results.
The results show that pfmt-bench-fin-ja can evaluate the performance of LLMs to a certain degree.
The benchmark is available on GitHub.
Conflicts of Interest Disclosure
The authors are affiliated with Preferred Networks, Inc. and Preferred Elements, Inc., the developers of pfnet/plamo-100b. In the experiments in this study, however, this model is evaluated fairly alongside other models, and the benchmark is released publicly for transparency.
Posted
Submitted: 2024-12-12 08:54:41 UTC
Published: 2024-12-17 00:57:37 UTC
License
Copyright (c) 2024
Masanori Hirano
Kentaro Imajo
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.