Utilizing Large Language Models in Research Practice: Insights from Southeast Asia
DOI: https://doi.org/10.51094/jxiv.1644
Keywords: South East Asian Area Studies, Malaysian Politics, Large Language Model, GPT, BERT
Abstract
This paper examines the practical use of large language models (LLMs) to structure unstructured text in Southeast Asian political research. Prior studies have relied on manual labeling, which creates scalability and consistency challenges for large datasets. This paper presents and compares two approaches on a common task: classifying Malaysian political news and comments into 12 topics with four polarity labels. The first approach employs general-purpose models via the OpenAI API. Through prompt engineering, setting the temperature to 0, and enforcing Structured Outputs in JSON to control output diversity and format, this approach achieves high performance with few in-context examples and delivers a degree of reproducibility, although perfect reproducibility remains elusive. On a test set of 50 items sampled from 43,546 Facebook and Reddit posts, GPT-4o-mini attained a weighted F1 score of 0.8752. The second approach fine-tunes BERT within a Human-in-the-Loop framework. This paper iteratively applies diversity sampling using k-means and uncertainty sampling based on predicted class probabilities to prioritize annotations near the decision boundary, expanding the labeled dataset from 405 to 605, 1,205, and then 1,705 items. To mitigate class imbalance, particularly the scarcity of positive and neutral instances, this paper augments the data with 200 synthetic examples generated via the OpenAI API. Weighted F1 scores improved from 0.6898 to 0.7134, 0.8212, and 0.8476 across iterations; with hyperparameter tuning, the model reached 0.8606, approaching the performance of GPT-4o-mini. Topic-level analyses indicate persistent difficulty in domains such as political leadership and administrative performance, where single sentences often contain mixed polarity; this paper therefore suggests reconsidering label design (e.g., accommodating mixed polarity) for applied settings. In contexts where some variability in reproducibility is acceptable and rapid, high-performance results are required, the API-based approach is advantageous; where long-term stable operation, version pinning, and accuracy gains through targeted data growth matter, fine-tuned BERT is preferable. This paper further argues that human monitoring of outputs, statistical quality control of data, and collaboration between domain expertise and data science are essential, and it offers a practical roadmap for quantitative analysis in the social sciences using LLMs.
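The first approach can be illustrated with a minimal sketch. Only the model name (gpt-4o-mini), temperature 0, and JSON Structured Outputs follow the abstract; the topic and polarity names, the prompt wording, and the schema name below are hypothetical placeholders, since this page does not enumerate the paper's 12 topics or its exact four polarity labels.

```python
# Minimal sketch of the API-based approach: temperature 0 plus Structured
# Outputs (a strict JSON Schema) to control output diversity and format.
# Topic and polarity names are placeholders, not the paper's actual scheme.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOPICS = ["political leadership", "administrative performance", "economy", "other"]  # placeholder subset
POLARITIES = ["positive", "negative", "neutral", "irrelevant"]  # placeholder names

SCHEMA = {
    "type": "object",
    "properties": {
        "topic": {"type": "string", "enum": TOPICS},
        "polarity": {"type": "string", "enum": POLARITIES},
    },
    "required": ["topic", "polarity"],
    "additionalProperties": False,
}

def classify(text: str) -> dict:
    """Label one post with a topic and a polarity; the reply is guaranteed schema-valid JSON."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "Classify the Malaysian political post into one topic and one polarity."},
            {"role": "user", "content": text},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "post_label", "strict": True, "schema": SCHEMA},
        },
    )
    return json.loads(resp.choices[0].message.content)
```

Even at temperature 0, repeated calls can differ slightly across runs and model snapshots, which is consistent with the abstract's caveat that perfect reproducibility remains elusive; pinning a dated model snapshot narrows, but does not eliminate, this variability.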
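The sampling steps of the second, Human-in-the-Loop approach can be sketched as follows. This is an illustrative reading of the abstract: the value of k, the per-round budgets, and the least-confidence uncertainty score are assumptions (the page does not specify the exact criterion), and the random arrays stand in for BERT embeddings and predicted class probabilities.

```python
# Illustrative sketch of the two sampling steps: k-means for diversity, then
# least-confidence uncertainty to surface items near the decision boundary.
import numpy as np
from sklearn.cluster import KMeans

def diversity_sample(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Return the index of the unlabeled item nearest each of k cluster centroids."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    nearest = [int(np.argmin(np.linalg.norm(embeddings - c, axis=1))) for c in km.cluster_centers_]
    return np.unique(nearest)

def uncertainty_sample(probs: np.ndarray, n: int) -> np.ndarray:
    """Least-confidence sampling: items whose top predicted class probability is lowest."""
    return np.argsort(probs.max(axis=1))[:n]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))      # stand-in for BERT sentence embeddings
probs = rng.dirichlet(np.ones(4), size=1000)   # stand-in for predicted class probabilities
to_annotate = set(diversity_sample(embeddings, 100)) | set(uncertainty_sample(probs, 100))
print(len(to_annotate))  # candidate items to send to human annotators this round
```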
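The class-imbalance mitigation step can likewise be sketched. The abstract confirms only that 200 synthetic examples were generated via the OpenAI API to offset the scarcity of positive and neutral instances; the model choice, prompt wording, and 100/100 per-class split below are assumptions.

```python
# Hypothetical sketch of the augmentation step: generating synthetic examples
# for the scarce positive and neutral classes via the OpenAI API.
from openai import OpenAI

client = OpenAI()

def synthesize(polarity: str, n: int) -> list[str]:
    """Ask the model for n short synthetic posts with the requested polarity."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed; the generation model is not named on this page
        temperature=1.0,      # unlike classification, diversity is desirable here
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} short, realistic social-media comments about Malaysian "
                f"politics with a clearly {polarity} tone, one comment per line."
            ),
        }],
    )
    return resp.choices[0].message.content.splitlines()

synthetic = synthesize("positive", 100) + synthesize("neutral", 100)  # 200 examples in total
```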
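Finally, the weighted F1 scores quoted above average per-class F1 with each class weighted by its support (its number of true instances), so frequent classes dominate the score; in scikit-learn this is a single call. The labels below are toy placeholders.

```python
# Weighted F1 as reported in the abstract: per-class F1 averaged with weights
# proportional to each class's support.
from sklearn.metrics import f1_score

y_true = ["negative", "negative", "neutral", "positive", "negative"]
y_pred = ["negative", "neutral", "neutral", "positive", "negative"]
print(f1_score(y_true, y_pred, average="weighted"))
```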
Conflicts of Interest Disclosure
The authors declare no conflicts of interest associated with this manuscript.
References
Alammar, Jay and Grootendorst, Maarten (2025). 『直感LLM―ハンズオンで動かして学ぶ大規模言語モデル入門』 (Japanese translation of Hands-On Large Language Models; trans. 中山光樹; original published 2024). O'Reilly Japan.
Vajjala, Sowmya, Majumder, Bodhisattwa, Gupta, Anuj and Surana, Harshit (2022). 『実践自然言語処理―実世界NLPアプリケーション開発のベストプラクティス』 (Japanese translation of Practical Natural Language Processing; trans. 中山光樹; original published 2020). O'Reilly Japan.
岡本正明, 八木暢昭 and 久納源太 (2024). "Does the Politicization of TikTok Hollow Out Democracy?" (no. 5 in the series), IDEスクエア―世界を見る眼 [IDE Square: Eyes on the World], pp. 1-9.
OpenAI (2020). "OpenAI API" (last accessed October 7, 2025, https://openai.com/ja-JP/index/openai-api/).
——— (2024). "Introducing Structured Outputs in the API" (last accessed October 7, 2025, https://openai.com/ja-JP/index/introducing-structured-outputs-in-the-api/).
Ozdemir, Sinan (2023). 『事例で学ぶ特徴量エンジニアリング』 (Japanese translation of Feature Engineering Bookcamp; trans. 田村広平, 大野真一朗, 砂長谷健, 土井健, 大貫峻平 and 石山将成; original published 2022). O'Reilly Japan.
Géron, Aurélien (2024). 『scikit-learn、Keras、TensorFlowによる実践機械学習 第3版』 (Japanese translation of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed.; trans. 下田倫大, 牧允皓 and 長尾高弘; original published 2022). O'Reilly Japan.
鈴木貴之 (ed.) (2023). 『人工知能とどうつきあうか―哲学から考える』 [How to Engage with Artificial Intelligence: Thinking from Philosophy]. Keiso Shobo.
Tunstall, Lewis, von Werra, Leandro and Wolf, Thomas (2022). 『機械学習エンジニアのためのTransformers―最先端の自然言語処理ライブラリによるモデル開発』 (Japanese translation of Natural Language Processing with Transformers; trans. 中山光樹; original published 2022). O'Reilly Japan.
Huyen, Chip (2023). 『機械学習システムデザイン―実運用レベルのアプリケーションを実現する継続的反復プロセス』 (Japanese translation of Designing Machine Learning Systems; trans. 江川崇 and 平山順一; original published 2022). O'Reilly Japan.
Fregly, Chris, Barth, Antje and Eigenbrode, Shelbee (2024). 『AWSではじめる生成AI―RAGアプリケーション開発から、基盤モデルの微調整、マルチモーダルAI活用までを試して学ぶ』 (Japanese translation of Generative AI on AWS; trans. 久富木隆一, technical review by 本橋和貴 and 久保隆宏; original published 2023). O'Reilly Japan.
Monarch, R. M. (2023). 『Human-in-the-Loop機械学習―人間参加型AIのための能動学習とアノテーション』 (Japanese translation of Human-in-the-Loop Machine Learning; trans. 上田隼也, 角野為耶 and 伊藤寛祥; original published 2021). Kyoritsu Shuppan.
山田育矢, 鈴木正敏, 山田康輔 and 李凌寒 (2023). 『大規模言語モデル入門』 [Introduction to Large Language Models]. Gijutsu-Hyoronsha.
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... and McGrew, B. (2023). "GPT-4 technical report," arXiv preprint arXiv:2303.08774.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... and Amodei, D. (2020). "Language models are few-shot learners," Advances in neural information processing systems, 33, pp. 1877-1901.
Chinnasamy, S. and Manaf, N. A. (2018). "Social media as political hatred mode in Malaysia's 2018 General Election," SHS Web of Conferences, 53.
Devlin, J., Chang, M. W., Lee, K. and Toutanova, K. (2019). "BERT: Pre-training of deep bidirectional transformers for language understanding," Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171-4186.
Geiger, R. S., Yu, K., Yang, Y., Dai, M., Qiu, J., Tang, R. and Huang, J. (2020). "Garbage in, garbage out? Do machine learning application papers in social computing report where human-labeled training data comes from?," Proceedings of the 2020 conference on fairness, accountability, and transparency, pp. 325-336.
Ghanem, M., Ghaith, A. K., El-Hajj, V. G., Bhandarkar, A., De Giorgio, A., Elmi-Terander, A. and Bydon, M. (2023). "Limitations in evaluating machine learning models for imbalanced binary outcome classification in spine surgery: a systematic review," Brain Sciences, 13 (12), 1723.
Grimmer, J., Roberts, M. E. and Stewart, B. M. (2022). Text as data: A new framework for machine learning and the social sciences. Princeton University Press.
Hinojosa Lee, M. C., Braet, J. and Springael, J. (2024). "Performance metrics for multilabel emotion classification: comparing micro, macro, and weighted F1-scores," Applied Sciences, 14 (21), 9863.
Kasmani, M. F. (2020). "How did people Tweet in the 2018 Malaysian general election: Analysis of top Tweets in #PRU14," IIUM Journal of Human Sciences, 2 (1), pp. 39-54.
Kasmani, M. F., Sabran, R. and Ramle, N. (2014). "Can Twitter be an effective platform for political discourse in Malaysia? A study of #PRU13," Procedia-Social and Behavioral Sciences, 155, pp. 348-355.
Mosqueira-Rey, E., Hernández-Pereira, E., Alonso-Ríos, D., Bobes-Bascarán, J. and Fernández-Leal, Á. (2023). "Human-in-the-loop machine learning: a state of the art," Artificial Intelligence Review, 56 (4), pp. 3005-3054.
Müller-Hansen, F., Callaghan, M. W. and Minx, J. C. (2020). "Text as big data: Develop codes of practice for rigorous computational text analysis in energy social science," Energy Research & Social Science, 70, 101691.
Silva, M. O., Oliveira, G. P., Costa, L. G. and Pappa, G. L. (2024). "GovBERT-BR: A BERT-Based Language Model for Brazilian Portuguese Governmental Data," Brazilian Conference on Intelligent Systems, pp. 19-32.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... and Polosukhin, I. (2017). "Attention is all you need," Advances in neural information processing systems, 30.
Wu, X., Xiao, L., Sun, Y., Zhang, J., Ma, T., and He, L. (2022). "A survey of human-in-the-loop for machine learning," Future Generation Computer Systems, 135, pp. 364-381.
Zhang, J., Zhao, Y., Saleh, M. and Liu, P. (2020). "PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization," International conference on machine learning, pp. 11328-11339.
Zhao, H., Chen, H. and Yoon, H. J. (2023). "Enhancing text classification models with generative AI-aided data augmentation," 2023 IEEE International Conference on Artificial Intelligence Testing (AITest), pp. 138-145.
Posted
Submitted: 2025-10-10 07:35:08 UTC
Published: 2025-10-21 01:03:58 UTC — Updated on 2025-10-23 07:38:59 UTC
Versions
- 2025-10-23 07:38:59 UTC (2)
- 2025-10-21 01:03:58 UTC (1)
Reason(s) for revision
I am replacing the old version with the revised one because a statement was missing from the abstract: the labeled dataset was expanded from 405 to 605, 1,205, and then 1,705 items, not from 405 to 1,205 and then to 1,705 items.
License
Copyright (c) 2025
Nobuaki Yagi

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.