Preprint / Version 1

Evaluating GPT in Japanese Bar Examination: Insights and Limitations

Authors

  • Jungmin Choi, RIKEN AIP
  • Jungo Kasai, Kotoba Technologies, Inc.
  • Keisuke Sakaguchi, Graduate School of Information Sciences, Tohoku University

DOI:

https://doi.org/10.51094/jxiv.559

Keywords:

Natural Language Processing, Large Language Models, Bar Examinations

Abstract

Large-scale language models such as ChatGPT have been reported to exceed the accuracy of human experts on a wide range of tasks. Recent research reports that ChatGPT passed the Japanese National Medical Examination, confirming its high performance in Japanese.
We evaluated the accuracy of GPT-3, GPT-4, and ChatGPT on the multiple-choice section of the Japanese Bar Examination, focusing on Constitutional Law, Civil Law, and Criminal Law over the past five years. The results revealed that the models currently answer only 30-40% of questions correctly (compared to the average pass rate of 70%), which is significantly low.
This study went beyond the overall accuracy, dissecting the reasoning and knowledge required to answer each question and examining the performance of large-scale language models from each perspective. The findings show that 1) large-scale language models possess extensive knowledge of many statutes, 2) they achieve high accuracy on questions that require understanding of legal theories but no specific knowledge of the law, and 3) they achieve low accuracy on questions requiring knowledge of case law. The primary reason for their lower performance compared to the American Bar Examination is thought to be a lack of knowledge of Japanese law, especially case law.

Conflicts of Interest Disclosure

We declare that we have no conflicts of interest.

Posted

Submitted: 2023-11-29 16:48:27 UTC

Published: 2023-12-01 07:07:58 UTC

Section

Information Sciences