Performance Evaluation of Large Language Models Using Agricultural Certification Exam (農業検定) Questions

Toda, Yosuke (戸田 陽介); Kawai, Hiroki (河合 宏紀)

doi:10.51094/jxiv.1203

Authors

Toda, Yosuke (戸田 陽介), Phytometrics Co., Ltd. (株式会社フィトメトリクス) https://orcid.org/0000-0003-2421-4743 https://researchmap.jp/yosuke_toda
Kawai, Hiroki (河合 宏紀), LPIXEL Inc. (エルピクセル株式会社)

DOI:

https://doi.org/10.51094/jxiv.1203

Keywords:

LLM, Agricultural Certification Exam (農業検定)

Abstract

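To assess the practical viability of large language models (LLMs) in agriculture, we conducted a preliminary benchmark evaluation of 27 recent LLMs on all 70 four-choice questions from the Grade 1 Agricultural Certification Exam (農業検定1級, fiscal year 2023), which is written in Japanese. The highest accuracy reached 85.7%, and several models scored well above the 70% passing threshold. The results suggest that LLMs are approaching practical utility for specialized knowledge comprehension and reasoning in agriculture. They also expose remaining challenges, including unevenly distributed knowledge and limits in syntactic interpretation. Domain-specific knowledge enhancement and the development of systematic Japanese-language benchmarks will be key to advancing LLMs for agricultural use.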
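The accuracy and pass/fail figures in the abstract can be illustrated with a short scoring sketch. This is not the authors' evaluation code: the function names, the A to D answer labels, and the toy answer key are hypothetical. Only the question count (70) and the passing threshold (70%) come from the abstract. The figure of 60 correct answers is inferred from the reported 85.7%, since 60/70 is the only count out of 70 that rounds to it.

```python
TOTAL_QUESTIONS = 70   # from the abstract
PASS_PERCENT = 70      # passing threshold in percent, from the abstract


def score(predicted, answer_key):
    """Return (number correct, accuracy in percent rounded to one decimal)."""
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    correct = sum(p == a for p, a in zip(predicted, answer_key))
    return correct, round(100 * correct / len(answer_key), 1)


def passes(correct, total, pass_percent=PASS_PERCENT):
    """Integer comparison avoids floating-point edge cases at the threshold."""
    return 100 * correct >= pass_percent * total


def min_correct_to_pass(total, pass_percent=PASS_PERCENT):
    """Smallest number of correct answers that meets the threshold."""
    return (pass_percent * total + 99) // 100  # integer ceiling


# Hypothetical answer key: 70 four-choice questions labeled A to D.
answer_key = (["A", "B", "C", "D"] * 18)[:TOTAL_QUESTIONS]

# Hypothetical model output: the first 60 answers right, the last 10 wrong.
wrong = {"A": "B", "B": "C", "C": "D", "D": "A"}
predicted = answer_key[:60] + [wrong[a] for a in answer_key[60:]]

correct, accuracy_percent = score(predicted, answer_key)
print(correct, accuracy_percent, passes(correct, TOTAL_QUESTIONS))
print(min_correct_to_pass(TOTAL_QUESTIONS))
```

Under these assumptions, a model needs at least 49 of the 70 questions correct to reach the passing threshold.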

Conflict of Interest Disclosure

Toda is employed by Phytometrics Co., Ltd. (株式会社フィトメトリクス), and Kawai is employed by LPIXEL Inc. (エルピクセル株式会社).

